Stylometric Fingerprints and Privacy Behavior in Textual Data
A Thesis
Submitted to the Faculty
of
Drexel University
by
Aylin Caliskan-Islam
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy in Computer Science
May 2015
© Copyright May 2015 Aylin Caliskan-Islam.
This work is licensed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The license is available at http://creativecommons.org/licenses/by-nc-sa/4.0/.
Dedications
To my love Seha Islam, who inspired me to follow my dream.
To the geniuses in my life, my parents Mitka and Osman Caliskan, my brother
Alaattin Caliskan, and again my loving husband Seha Islam, for sharing their
creativity and wisdom while always supporting me. I am so lucky to have you.
To my dear grandmother Kuna Garev, for teaching me perseverance and opti-
mism. You are dearly missed.
To my 94-year-old grandfather Rad Garev and uncle Stoyan Garev for always
being there and crossing fingers for me.
To Dr. Ben Taskar, for introducing me to machine learning. Machine learning
lost a star. You will always be my role model.
Finally to Dr. Rachel Greenstadt and Dr. Arvind Narayanan, two of the best
things that ever happened to me. Rachel taught me the power of believing in good.
Arvind inspired me to be a good scientist. I wouldn’t be here if it weren’t for you.
Acknowledgments
I would like to extend my deepest gratitude to my advisor Dr. Rachel Greenstadt
for giving me the opportunity to be her Ph.D. student. I have learned so many positive
things from her along my journey. She helped me grow as a researcher and gain new
perspectives to understand the insights behind results. She always encouraged me
to pursue my curiosity, especially in source code authorship attribution. She showed
me the value of believing in the greater good and always giving an opportunity to
ideas, places, and people. I will forever be thankful to her for her support, optimism,
invaluable ideas, and encouragement.
It is an honor to have such a great dissertation committee. I am grateful to
Dr. Spiros Mancoridis, Dr. Damon McCoy, Dr. Arvind Narayanan, and Dr. Dario
Salvucci for helping me close this chapter in my life as I move on to the next one. I
would also like to extend my gratitude to my candidacy exam committee members Dr.
Damon McCoy, Dr. Santiago Ontanon, Dr. Dario Salvucci, and Dr. Ali Shokoufandeh,
who helped me navigate more efficiently through research with valuable advice.
I would like to thank the U.S. Army Research Laboratory, especially William
Glodek, Dr. Richard Harang, Dr. Melissa Holland, Jeffrey Micher, and Dr. Clare
Voss for making it possible for me to work at the Adelphi Laboratory Center under
the open campus program with the network security and natural language processing
divisions. I would like to thank them for their hospitality and kindness. I learned
many new useful things from the experts in different fields while enjoying a very
productive summer. I would like to express my special appreciation to Dr. Richard
Harang, whose expertise and resourcefulness in mathematics and machine learning are
fascinating to me. His precision and clarity in explaining mathematical theories and
machine learning are astounding. It is such an honor that we will keep working with
him and the network security division in the upcoming years.
My co-author Dr. Arvind Narayanan, with his genius ideas and perfect papers, has
been very motivating and inspirational during my Ph.D. His enthusiasm for science
is contagious. I cannot find words to express how thankful I am for the countless
hours he spent with me to come up with new approaches and to discuss new research
directions and results. I could not have wished for a better mentor. I am also so
lucky to have Dr. Damon McCoy as my co-author. His expertise on cybercrime
and resources on underground forums have made my experiments so much fun that
they have started becoming a hobby. Collaboration has become essential and very
productive after seeing his openness to new directions and dedication. My co-author
Fabian Yamaguchi has been very helpful with his valuable contributions and support
during our collaboration on de-anonymizing programmers. Being surrounded by so
many inspiring people makes research a lot more exciting and rewarding.
My senior colleagues Dr. Michael Brennan and Dr. Sadia Afroz have been very
helpful since the day I met them. Mike’s presentation and writing skills have been
helpful examples throughout my academic career, while submitting a paper, present-
ing at the Chaos Communication Congress, or at a conference. My co-author Sadia
has set a great example for realizing the opportunities in problems faced during research
to come up with novel approaches. My co-author Ariel Stolerman, a good programmer,
has always come up with effective solutions whenever we were stuck in
research. Understanding privacy behavior in social networks would not have been
possible without the help of my disciplined co-author Jonathan Walsh. I am lucky
to have so many great labmates at the Privacy, Security, and Automation Labora-
tory at Drexel University, especially Rebekah Overdorf, Edwin Dauber, and Andrew
McDonald. I am grateful that they will continue to carry the exciting work forward.
Finally, I would like to thank DARPA, NSF, ARO/ARL, and Amazon for provid-
ing generous funding.
Table of Contents
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1. Introduction
  1.1 Statement of Thesis
  1.2 Key Contributions
  1.3 Thesis Organization
2. Linguistic Style Feature Extraction
  2.1 Author Identification and Multiple Identity Detection in Cyber Criminal Forums
    2.1.1 Related Work
      Underground Markets
      Authorship Attribution
      Detecting Fraudulent Accounts
    2.1.2 Overview of Underground Forums
      Properties of Antichat
      Properties of BlackhatWorld
      Properties of Carders
      Properties of L33tCrew
      Member Overlap Between Forums
      Identity Protection
      Public and Private Messages
    2.1.3 Authorship Attribution
      Approach
      Feature Extraction
      Classification
      Removing Product Data
    2.1.4 Results
      Minimum Text Requirement for Authorship Attribution
      Attribution Within Forums
      Importance of Features
  2.2 Detecting Multiple Identities
    2.2.1 Approach
    2.2.2 Feature Extraction
    2.2.3 Probability Score Calculation
    2.2.4 Multiple Identities in Underground Forums
    2.2.5 Multiple Identities Across Forums
    2.2.6 Multiple Identities Within a Forum
      Methodology
      Results and Discussion
    2.2.7 Automating Forum Analysis
    2.2.8 Lessons Learned about Underground Markets
      Reasons for Creating Multiple Identities
    2.2.9 Lessons Learned about Stylometry
    2.2.10 Doppelgänger Detection by Forum Administrators
    2.2.11 Performance
    2.2.12 Hybrid Doppelgänger Finder Methods
    2.2.13 Methods to Evade Doppelgänger Finder
    2.2.14 Conclusion
  2.3 Author Identification in Translated Text
    2.3.1 Introduction
    2.3.2 Related Work
    2.3.3 Corpus Selection
    2.3.4 Experiment Design
      Translator Attribution Experiments
      Authorship Attribution Experiments
    2.3.5 Results and Evaluation
      Translator Attribution Results
      Translator Attribution Results in One-way Translations
      Authorship Attribution Results
      Authorship Attribution Results in One-way Translations
    2.3.6 Conclusion
  2.4 Author Identification in Source Code
    2.4.1 Introduction
    2.4.2 Problem Statement
    2.4.3 De-anonymizing Programmers
      Fuzzy Abstract Syntax Trees
      Feature Extraction
      Lexical and Layout Features
      Syntactic Features
    2.4.4 Classification
      Feature Selection
      Random Forest Classification
    2.4.5 Evaluation
      Corpus
      Programmer De-anonymization
      Scaling
      Training Data and Features
      Obfuscation
      Two-class Classification
      Verification/Open World Problem
      Relaxed Classification
      Generalizing the Method
      Software Engineering Insights
    2.4.6 Discussion
    2.4.7 Related Work
    2.4.8 Conclusion and Future Work
3. Modeling and Quantifying Privacy Behavior
  3.1 Introduction
  3.2 Related Work
  3.3 Problem Statement and Threat Model
  3.4 Data Collection
  3.5 Amazon Mechanical Turk Annotations
  3.6 Approach
    3.6.1 Text Preprocessing
    3.6.2 Feature Extraction
      Feature Normalization
      Topic Ratios
      Privacy Dictionary Matches
      Named Entity Recognition
      Sentiment Analysis
      Quote, URL, Handle, Retweet, Hashtag Count
  3.7 Results
    3.7.1 Twitter Database User Scores
    3.7.2 Correlation between User’s Privacy Score and User’s Friends’ Privacy Score
    3.7.3 Correlation between User’s Privacy Score and Mentioned Users’ Privacy Score
    3.7.4 Correlation between User’s Privacy Score and Number of Followers
    3.7.5 Inter-Annotator Agreement
  3.8 Limitations
  3.9 Discussion
    3.9.1 Future Work
  3.10 Conclusion
4. Future Work and Conclusion
  4.1 Future Directions
    4.1.1 Design “Nudges” and NLP-based Privacy Protection Policies
    4.1.2 De-classification and Sanitization
    4.1.3 Source Code and Binaries
  4.2 Conclusion
BIBLIOGRAPHY
APPENDIX A: COPYRIGHT INFORMATION
List of Tables
2.1 Summary of Forums
2.2 AntiChat Members Rank
2.3 Blackhat Members Rank
2.4 Carders Members Rank
2.5 L33tCrew Members Rank
2.6 Feature Set
2.7 Author Attribution Within a Forum
2.8 Features with Highest Information Gain Ratio in Different Forums
2.9 Criteria for Verifying Multiple Accounts
2.10 Manual Analysis of Users: X indicates same, – indicates different, empty means the result is inconclusive or complicated with many values.
2.11 A Product and Quantity Cluster from Public Carders Messages
2.12 Words Closest to the Word: ‘weed’
2.13 Translation Feature Set
2.14 ‘Translation Feature Set’ Translator Attribution
2.15 ‘Translation Feature Set’ Translator Attribution on One-way Translations
2.16 ‘Translation Feature Set’ Authorship Attribution
2.17 ‘Translation Feature Set’ Authorship Attribution on One-way Translations
2.18 Overview of Applications for Code Stylometry
2.19 Abstract Syntax Tree Node Types
2.20 Lexical Features
2.21 Layout Features
2.22 Syntactic Features
2.23 C++ Keywords
2.24 Validation Experiments
2.25 Effect of Obfuscation on De-anonymization
2.26 Generalizing to Other Programming Languages
2.27 Effect of Problem Difficulty on Coding Style
2.28 Comparison to Previous Results
3.1 Data Set Information
3.2 Tweet Privacy Categories
3.3 Privacy Feature Set
3.4 Some Private and Public Topics
3.5 Information Gain
3.6 Topics with High Information Gain
List of Figures
1.1 Overview of My Research
2.1 Duplicate Account Groups Within Carders as Identified by the AE Detector. (Each dot is one user. There is an edge between two users if the AE detector considered them as duplicate users.)
2.2 Effect of Number of Words Per User on Accuracy
2.3 User Attribution on 50 Randomly Chosen Authors
2.4 Doppelgänger Finder: With Common Users in Carders and L33tCrew: 179 Users with 28 Pairs. AUC is 0.82.
2.5 Combined Probability Scores of the Top 100 Pairs from Carders
2.6 Comparison of the Effectiveness of ‘Translation Feature Set’ in Translator Attribution vs. Authorship Attribution
2.7 Comparison of Authorship Attribution Using the ‘Translation Feature Set’ and Excluding Function Words
2.8 Sample Code Listing
2.9 Corresponding Abstract Syntax Tree
2.10 Large Scale De-anonymization
2.11 Training Data
2.12 A Code Sample X
2.13 Code Sample X After Obfuscation
2.14 Relaxed Classification with 250 Programmers
3.1 AMT Annotation Results
3.2 Workflow
3.3 Tweet Preprocessing
3.4 Twitter Privacy Map
3.5 User with Privacy Score-1
3.6 User with Privacy Score-2
3.7 User with Privacy Score-3
Abstract
Stylometric Fingerprints and Privacy Behavior in Textual Data
Aylin Caliskan-Islam
Advisor: Rachel Greenstadt, Ph.D.
Machine learning and natural language processing can be used to characterize
and quantify aspects of human behavior expressed in language. Linguistic features
exhibited in any kind of text can be used to study individuals’ behavior as well as to
identify an author among thousands of authors. Studying aspects of human behavior
can be automated by incorporating machine learning techniques and well-engineered
features that represent behavior of interest. Human behavior analysis can be used
to enhance security by detecting malware programmers, malicious users, or abusive
multiple account holders in online networks. At the same time, such an automated
analysis is a serious threat to privacy, especially to the privacy of persons who would
like to remain anonymous. Nevertheless, privacy enhancing technologies can be built
by first and foremost understanding privacy infringing methods in-depth to create
countermeasures.
Authorship attribution through stylometry, the study of writing style, in translated
or unconventional text yields accuracy as high as the state of the art in
authorship attribution of English prose. Applying stylometry to the more structured
domain of programming languages is also possible through a robust and principled
method introduced in this thesis. Code stylometry is able to de-anonymize thousands
of programmers with high accuracy while providing insight into software engineering.
Programmer de-anonymization can aid in forensic analysis, resolving plagiarism cases,
or copyright investigations. On the other hand, de-anonymizing programmers constitutes
a privacy threat for anonymous contributors to open source repositories. Bridging
the gap between natural language processing and machine learning is a powerful
step towards designing feature sets that represent aspects of human behavior. Fea-
tures obtained through natural language processing methods can be used to study
the privacy behavior of users in large social networks. Aggregate privacy analysis
shows that people with similar privacy behavior appear in clusters. This knowledge
can be used to design privacy nudges and effective privacy preserving technologies.
Machine learning can be applied to any kind of textual data to automate the
extraction of human behavior at large scale.
1. Introduction
Machine learning and natural language processing can be used to characterize
and quantify aspects of human behavior expressed in language. This work touches
two main realms, security and privacy. Its applications complement each other by
enhancing security and preserving privacy. My work is grounded in computer science,
draws on computational social science, human computer interaction, and behavioral
economics, and has applications to public policy. Figure 1.1 shows a bottom-up
overview of my work. My work builds upon the key element of feature extraction by
incorporating language parsing, abstract syntax trees, topic modeling, word clustering,
named entity recognition, and semantic classification. These features, along with
rigorous analysis of large textual data sets, make it possible to extract aspects of
human behavior. These techniques are tailored to investigate the semi-structured do-
main of natural languages. I have also ported these techniques to the more structured
domain of programming languages by analyzing source code.
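As a rough illustration of what feature extraction over source code can look like, the sketch below counts how often each abstract syntax tree node type occurs in a snippet. It relies on Python's standard `ast` module purely as a stand-in parser; this is an assumption made for brevity and not the tooling used in this thesis, and the function name `ast_node_type_frequencies` is hypothetical.

```python
import ast
from collections import Counter
from typing import Dict


def ast_node_type_frequencies(source: str) -> Dict[str, float]:
    """Return the relative frequency of each AST node type in `source`.

    Illustrative only: Python's own parser stands in for whatever parser
    a real code stylometry pipeline would use.
    """
    tree = ast.parse(source)
    # ast.walk yields every node in the tree, including the root Module node.
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# A toy snippet invented for this example.
sample = "def add(a, b):\n    return a + b\n"
features = ast_node_type_frequencies(sample)
```

The resulting dictionary can be treated as one slice of a feature vector; a fuller feature set would combine it with lexical and layout measurements.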
De-anonymization This thesis introduces code stylometry, a principled and robust
method for de-anonymizing programmers, in addition to advancing the state-of-the-
art in stylometry and social behavior analysis by bridging the gap between natural
language processing and machine learning. Experts in linguistics, forensics, and eco-
nomics have been interested in de-anonymizing the decentralized digital currency Bit-
coin’s anonymous founder Satoshi Nakamoto. Satoshi Nakamoto is the pseudonym
of Bitcoin’s founder(s) who prefer(s) to remain anonymous. Bitcoin’s white paper
was published in 2008 and its open source code was released in 2009. Bitcoin
has attracted the attention of researchers, financial regulators, legislators, law
enforcement, and criminals while Satoshi Nakamoto remained anonymous. As the
Figure 1.1: Overview of My Research
digital currency had a significant impact on the market and raised security questions,
forensic experts became more interested in de-anonymizing Satoshi Nakamoto. Experts
started looking for candidates who could be Bitcoin’s inventor by
first identifying researchers working at the intersection of computer science and digital
currencies, including computer scientists, cryptographers, and mathematicians.
After identifying a set of candidates by research topics and online communications,
experts analyzed the writing styles of the candidates by using stylometry. Stylometry,
the study of writing style, can reveal the candidate whose writing style is closest
to the one exhibited in Bitcoin’s 2008 white paper. Some
linguistic analysts have suggested that Satoshi Nakamoto is actually Nick Szabo, a
legal scholar and cryptographer. Numerous people have been proposed as Satoshi
Nakamoto on the basis of stylometric analysis of essays, blogs, and other forms
of available prose, in an effort to de-anonymize this famous anonymous inventor.
No one has gone one step further to analyze coding style as Satoshi’s fingerprint in
Bitcoin’s initial git repository released in 2009. If we have a set of programmers who
we think might be Satoshi, and samples of source code from each of these program-
mers, we could use the initial versions of Bitcoin’s source code to try to determine
Satoshi’s identity. Of course, this assumes that Satoshi didn’t make any attempt to
obfuscate his or her coding style.
Fingerprints Internet users leave fingerprints on the Internet as they share any
form of textual data. When users share textual data on social networks, they willingly
reveal private information. Even in cases when personal information is not
shared, users can be de-anonymized by low level properties of their text through stylometric
analysis. Stylometry is the study of writing style. Writing style is unique to each
individual which makes it possible to identify individuals in large data sets. Advanced
natural language processing and machine learning methods make it possible to profile
individuals and track them even when they are careful about not revealing any pri-
vate information in social media. This work focuses on two cases that carry personal
fingerprints, namely textual data in social networks that contain private information
and text that reveals identity through personal style.
Stylometric Fingerprints Privacy savvy people might refrain from sharing any
high level personal information on social media to minimize their Internet presence.
Nevertheless, their identity is still preserved in the low level properties of their text
through their writing style unless they effectively obfuscate their writing. Stylometric
methods for authorship attribution can successfully identify authors of anonymous
documents in large data sets [99; 6; 80]. Here, we consider the supervised authorship
attribution problem, which relies on correct ground truth authorship information:
given a document D and a set of unique authors A = {A1, ..., An}, where Ai ≠ Aj
when i ≠ j, determine who among the authors in A wrote D. The algorithm has
two steps: training and testing. During training, the algorithm trains a classifier
from the sample documents of the authors in A. In the testing step, it determines
the probability of each author in A being the author of D and assigns the author
with the highest probability as the author of D. The success of this method depends
on how well the features express the writing style of the authors in the data set. In
the presence of correct features, this method can be used to identify the authors or
translators in text that has been translated to other languages and back to English,
which rules out the possibility of obfuscating text by translating it. Linguistic features
can be used not only to infer authorship but also to detect an author’s native language
[108] or find accounts owned by this author across different domains [84]. Being able
to identify authors across domains facilitates linking identities across the Internet,
making this a key privacy concern. On the other hand, being able to find multiple
accounts of a user within a domain can help detect abusive accounts. Such stylometric
techniques can even be used in challenging data sets where text is a mixture of different
languages, slang, product information, and l33t-speak [10].
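The two-step procedure described above (train a classifier on known documents, then assign the most probable author to D) can be illustrated with a minimal sketch. It assumes character trigram counts as the only stylistic feature and a multinomial naive Bayes model with a uniform prior; neither is claimed to be the feature set or classifier used in this thesis, and the names char_ngrams, train, and attribute are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (a simple stand-in stylistic feature)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def train(samples):
    """Training step. samples maps each author in A to a list of known documents.
    Returns per-author n-gram counts and the shared vocabulary."""
    profiles = {author: sum((char_ngrams(d) for d in docs), Counter())
                for author, docs in samples.items()}
    vocab = set().union(*profiles.values())
    return profiles, vocab


def attribute(document, profiles, vocab):
    """Testing step. Returns (most probable author, probability per author)
    using multinomial naive Bayes with add-one smoothing and a uniform prior."""
    feats = char_ngrams(document)
    log_scores = {}
    for author, counts in profiles.items():
        total = sum(counts.values()) + len(vocab)
        log_scores[author] = sum(
            c * math.log((counts[g] + 1) / total)
            for g, c in feats.items() if g in vocab)
    # Turn log scores into probabilities (shift by the max for numerical stability).
    top = max(log_scores.values())
    exp = {a: math.exp(s - top) for a, s in log_scores.items()}
    z = sum(exp.values())
    probs = {a: e / z for a, e in exp.items()}
    return max(probs, key=probs.get), probs


# Toy ground truth: two authors with a few known sentences each.
samples = {
    "A1": ["I reckon we shall see rain; indeed, I reckon so.", "Indeed, we shall see."],
    "A2": ["gonna be sunny lol, gonna go out", "lol that's gonna be fun"],
}
profiles, vocab = train(samples)
author, probs = attribute("lol gonna rain", profiles, vocab)
```

With realistic data, each author would need far more text than in this toy example, and the feature set would be much richer than character trigrams.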
Code Stylometry Source code is a form of structured textual data that preserves
personal coding style to a great extent. Source code is becoming more easily acces-
sible as open source software, online version control and bug tracking repositories
are becoming widely used. Source code authorship attribution can be formulated
as the same machine learning problem as supervised authorship attribution of
anonymous documents. The general case is again a supervised authorship attribution
problem that relies on correct ground truth authorship information: given
a source code file C and a set of programmers P = {P1, ..., Pn}, where Pi ≠ Pj when
i ≠ j, determine who among the programmers in P wrote C. A classifier is trained on
features extracted from source files with known programmers. Extracting syntactic
features from source code requires parsing the code to generate abstract syntax trees.
Features are extracted for each programmer to generate a numeric representation of
their coding style. In the testing step, the probability of each programmer in P being
the programmer of C is calculated and the programmer with the highest probability
is assigned as the author of C. Code stylometry holds important implications for
protecting intellectual property as well as for identifying malware authors, resolving
copyright disputes, and aiding plagiarism investigations. Source code authorship
attribution spurs a cross-cutting area involving natural language processing and ma-
chine learning. Identifying features of coding style also reveals information about
how coding style changes under certain circumstances. Modeling the coding styles of
programmers who introduce bugs or security vulnerabilities to code repositories can
be used to automatically detect problematic code.
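As an illustration of how parsing code into an abstract syntax tree yields a numeric representation of coding style, the sketch below uses Python's standard ast module as a stand-in parser. The choice of Python source, the node-type frequency features, and the name ast_features are assumptions made for this example only and are not the feature set developed in this thesis.

```python
import ast
from collections import Counter


def ast_features(source):
    """Parse source code and return relative frequencies of AST node types,
    plus the maximum nesting depth of the tree."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    features = {f"node:{name}": c / total for name, c in counts.items()}

    def depth(node):
        # Depth of a node is 1 plus the depth of its deepest child.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(ch) for ch in children), default=0)

    features["max_depth"] = depth(tree)
    return features


sample = "def f(xs):\n    return [x * 2 for x in xs if x > 0]\n"
feats = ast_features(sample)
```

Feature dictionaries of this kind, computed per source file, could then be fed to the same supervised classifier setup used for prose.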
Privacy Behavior Social network participants consciously share private informa-
tion in public or private settings, such as location or family information. Sharing
information through private channels still comes with the risk of that information
being exposed to the public through the people it was shared with. As a result, enormous
amounts of personal data about individuals are accumulating on the Internet and
are being collected by data aggregators, which is a serious threat to privacy. Companies
sell personal information for marketing purposes and sometimes for reasons that we
are not even informed about. Personal information on social media can be regulated
in a more informed manner by end users if they have the tools to analyze what type
of identity they are forming on the Internet. Understanding the private
information revealing behavior of friends in social networks can help a user decide
when to share what type of information with whom. Quantifying private information
can provide insight into the social dynamics of private information sharing. Once private
information is quantified, it can be used to associate certain privacy behaviors
with users in social networks, for example by associating a privacy score with
each user to represent the amount of private information she shares. A machine
learning classifier can be trained on a set of textual data T = {T1, ..., Tn} with known
privacy scores to predict the privacy scores of users A = {A1, ..., An} by extracting
features that reveal private information. Quantifying privacy behavior can show how
it is constructed and influenced, and this knowledge can be used to effectively design
privacy enhancing technologies and target educational interventions. Awareness of
privacy behavior can help users avoid posts that they might later regret and that could
cost them jobs or relationships.
Advanced natural language processing and machine learning methods make profiling
and tracking users easier and faster than ever. Personal style and private information
are present in all types of textual data. End users need to be aware of how easily
identifiable they are by their low level features so that they can regulate their information
sharing behavior accordingly with the help of privacy enhancing technologies.
De-anonymization of authors and automated methods of behavioral analysis can
be used to enhance security. These security enhancing methods can also be considered
privacy infringing. Increased awareness of such security enhancing and privacy infringing
methods leads to demand for counteracting privacy enhancing methods. The
demand for privacy enhancing technologies, such as anonymization [76], sanitization,
and other censorship evasion tools, can be expected to increase as awareness of privacy
threats increases. Effective development of tools and policies to increase privacy
is possible by first and foremost identifying these threats. As technology evolves,
new privacy and security threats emerge that require more sophisticated methods for
detection and evasion.
1.1 Statement of Thesis
This thesis argues the following statement:
Style and privacy behavior expressed in language can be quantified and characterized.
Fingerprints in natural and programming languages can be numerically repre-
sented by stylometric analysis of textual data. Privacy behavior can be extracted
from contextual properties of language. Such quantified stylistic fingerprints and
behavioral representations can be used to characterize human behavior.
1.2 Key Contributions
1. Information about individuals can be extracted or inferred both from high level
content that is consciously shared and low level linguistic properties of text that
are unconsciously revealed.
• Authors of anonymous text can be predicted even from challenging data,
such as translated text, micro-text, a mixture of languages, and also source
code.
• Identifying authors in social networks can be used to link the accounts of
the same users within a network or across different networks.
• In the case of source code, software forensics cases, copyright disputes, and
plagiarism investigations can be resolved more effectively with stylometric
evidence.
• Code stylometry provides software engineering insights such as how pro-
gramming style changes while implementing sophisticated functionality or
the differences in coding styles of programmers with different skill sets.
2. High level properties of text such as topics and named entities, along with sentiment
analysis, make it possible to quantify private information and associate a
privacy score with each user.
• Privacy behavior in social networks can be analyzed through privacy scores
and preliminary results show that privacy is a collective behavior.
• All kinds of textual data contain user fingerprints and create an online
persona for each Internet user. Being able to identify these fingerprints
helps an end user answer three main questions.
– What data do I consider sensitive?
– In what contexts should I share sensitive data?
– What does my data say about me?
1.3 Thesis Organization
In chapter 2, I discuss how de-anonymizing authors is possible through creating
numeric representations of their writing style by extracting low-level linguistic and
grammatical features. The media [40; 43; 87; 86; 31; 58] have used this work to
raise awareness of online privacy. In section 2.1, I discuss how authors are de-anonymized
in cyber criminal forums, where each author may use multiple aliases
and communicate with text composed of product information, slang, and multiple
languages. Such methods help experts in forensics detect abusive account holders
or cyber criminals and provide insights about the forums. On the other
hand, these methods pose a threat to privacy and as demonstrated in section 2.2,
may expose users of multiple aliases aiming to evade censorship.
Translating text has been suggested as an anonymization method to rid a text
of stylistic fingerprints. In section 2.3, I show that machine translation does not
prevent an author from being identified [27]. I have also researched the portability of
stylometric techniques used on natural languages to programming languages. Plain
source code does not directly reflect the grammar of a program; however, parsing the
code and generating its abstract syntax tree reveals the structure and functionality of
code. In section 2.4, I discuss how programmers can be de-anonymized by extracting
syntactic information from abstract syntax trees along with linguistic information
from source code to generate numeric representations of their coding style [28]. Being
able to de-anonymize programmers from their coding style can aid in forensics and
resolving copyright disputes. Studying coding style can have applications in software
engineering as well.
In chapter 3, I discuss how features extracted from text can be used to study the
behavioral fingerprints of users [29]. Features related to privacy, such as named entities,
topic modeling, Brown clustering, and semantic classification, can categorize the
privacy behavior of a user in a social network. The combination of these natural lan-
guage processing based features gives an overall picture of users’ privacy behavior and
can be used for fine-grained analysis of causal effects behind information disclosure
and network phenomena.
In chapter 4, I discuss how influencing factors behind privacy behavior can be
used to design effective privacy nudges to encourage privacy preserving behavior in on-
line social networks. These factors could have a significant role in designing effective
privacy policies and educational initiatives. Alternatively, automated sanitization
approaches can be developed to preserve privacy. There are no known automated
approaches to sanitizing documents to prevent inferences of sensitive information.
Automated sanitization can redact classified documents, prevent data loss in companies,
and make it possible to distribute users’ communication data to linguists,
psychologists, and sociologists without revealing personal information.
2. Linguistic Style Feature Extraction
Stylometry is a field that relies on linguistic information found in a document to
perform authorship attribution. Stylometry is currently used in intelligence analysis
and forensics. The 2009 Technology Assessment for the State of the Art Biomet-
rics Excellence Roadmap (SABER) commissioned by the FBI [116] stated that, “As
non-handwritten communications become more prevalent, such as blogging, text mes-
saging and emails, there is a growing need to identify writers not by their written
script, but by analysis of the typed content.” Authorship attribution is the problem
of determining a text’s author, which can be accomplished by stylometric analysis.
Even basic stylometry systems reach high accuracy in classifying authors correctly
[6]. Stylometric analysis becomes more challenging when training and testing data
start to differ from formal English prose. A simple modification to text is translating
it to another language and then back to English. Stylometry on translated
text is discussed in section 2.3. Authorship attribution becomes even more challenging
in the underground forums discussed in section 2.1, where the input is micro-text that
includes slang, unstructured sentences, and a mixture of different languages. Another common
question about stylometry is whether it can be applied to source code, which is a form
of structured text, given that stylometry can be used on many forms of text in various
natural languages. Section 2.4 shows that code stylometry is possible through
syntactic, lexical and layout features, leading to 94% correct authorship attribution
accuracy among 1,600 programmers.
The feature extraction methods in the following subsections have been established
by engineering a feature set specific to the problem at hand. They all use stylistic
and linguistic features specific to translated text, informal cyber criminal
text, or a programming language.
2.1 Author Identification and Multiple Identity Detection in Cyber Criminal Forums
This work was completed by Sadia Afroz with support from Aylin Caliskan-Islam,
Ariel Stolerman, and Damon McCoy [10]. The text in section 2.1, excluding section
2.2.7, is from the Doppelgänger Finder paper [10]. Parts of the text from the paper [10]
were used in Sadia Afroz’s thesis [7] as well as this thesis.
Forum databases of underground cyber criminals have been leaked publicly. Analyzing
these forums revealed the struggle forum administrators had with abusive
forum members who created multiple accounts even after they were banned. These
multiple identities of one user, so-called doppelgängers, posed a challenge for authorship
attribution as well. Stylometric analysis in underground forums is already
challenging because of the nature of the messages that contain multiple languages
(German, English, Russian, Turkish), product information (stolen credit card information),
l33t-sp34k (leet-speak), slang, and micro-text. In addition, doppelgängers
pollute ground truth by introducing new and artificial classes to machine
learning, whereas all accounts of a doppelgänger should belong to one class. Besides
reaching state-of-the-art authorship attribution accuracy on this data set, I helped
develop and evaluate Doppelgänger Finder [10], which is currently used by the FBI to find
alter egos of individuals by their writing style. Removing product information
and using a language independent feature set that handles l33t-sp34k made this ap-
proach possible. Sadia Afroz came up with the Doppelgänger Finder algorithm. I
performed the manual analysis on the results of the semi-supervised tool which re-
vealed that 12 previously unknown multiple identities were discovered among 221
members. The manual analysis also revealed insights about how cybercriminals make
use of multiple identities.
Stylometric techniques require 5,000 words of text from a known author to generate
an accurate model of her writing style. This limitation eliminated many of
the users in underground forums from the stylometric analysis, since they had fewer
than 5,000 words of text. To overcome this limitation, an approach based on
representations of words as numeric vectors has been used to complement Doppelgänger
Finder and automatically extract personal information about forum users.
Having this approach work on such a challenging domain is a promising result for
all other common domains that are more natural language processing friendly, such
as blogs, online social networks, and emails. Law enforcement would be able to use
Doppelgänger Finder with the added extension to investigate cyber criminals or abu-
sive accounts in a larger population with the guidance of the automatically extracted
identifying information.
Underground forums are used as a rendezvous location for cybercriminals and
play a crucial role in increasing efficiency and promoting innovation in the cyber-
crime ecosystem. These forums are frequently used by cybercriminals around the
world to establish trade relationships and facilitate the exchange of illicit goods and
services such as the sale of stolen credit card numbers, compromised hosts, and on-
line credential theft. Linking different aliases to the same individual across sources
of data to increase knowledge of a cybercriminal’s activities is a powerful ability. An
anecdotal example of this analysis performed manually is the case of the Rustock botnet
operator, whose accounts were manually linked together from multiple leaked
data sources including underground forum posts [102]. All this information provides
valuable insights, such as how much he was earning and whom he was dealing with,
which paint a fairly rich picture of a botnet operator’s role in the underground cyber
ecosystem.
Other information gleaned from underground forums is providing security re-
searchers, law enforcement, and policy makers valuable information on how the mar-
ket is segmented and specialized, the social dynamics of the community, and potential
bottlenecks that are vulnerable to interventions. These advances have been accom-
plished primarily through analysis of limited structured metadata and painstaking
manual analysis. Because of the size of the data sets and the labor intensity of the
task, there are limitations to what can be accomplished by these techniques.
In fiction and folklore, a doppelgänger is an apparition or double of a living person.
Many underground forums use the word doppelgänger to refer to a duplicate account
of a user in the forum. The use of doppelgängers is forbidden in these forums because
it undermines the fragile trust between pseudonymous users engaged in risky, illegal
behavior and enables them to take advantage of each other. Users suspected of using
multiple accounts are commonly banned. Understanding how and why users persist
in maintaining multiple identities can help identify the dynamics of trust relationships
in these forums. Detecting doppelgängers is possible through stylometric analysis and
provides insights about the nature of underground forums.
Linguistic analysis has recently been applied successfully to many security problems,
from using stylometry to identify anonymous bloggers [80] to using topic modeling
to find job postings for web service abuse [60]. However, the underground forums
present a particular challenge for text analytic techniques. The messages are short
and tend to mix conversations with “products” such as credit card and bank account
numbers, URLs, IP addresses, etc. Furthermore, the forums are written in a multilingual
l33t-speak slang that renders most natural language processing tools such
as part-of-speech taggers inaccurate. This language is often intentionally difficult to
parse and speak even for native speakers and serves to weed out outsiders. As
such, these forums represent a stress test of sorts for these approaches.
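Because messages mix conversation with "products," a preprocessing pass that strips product-like tokens before feature extraction is a natural first step. The sketch below is an assumption-laden illustration: the three regular expressions and the function name strip_products are hypothetical and are not the preprocessing rules used in [10].

```python
import re

# Hypothetical patterns for illustration only; real forum data would need tuning.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# 12 to 19 digits, optionally separated by spaces or hyphens (card-like numbers).
LONG_DIGITS = re.compile(r"\b(?:\d[ -]?){12,19}\b")


def strip_products(message):
    """Remove URLs, IPv4 addresses, and long digit runs, then collapse whitespace."""
    for pattern in (URL, IPV4, LONG_DIGITS):
        message = pattern.sub(" ", message)
    return " ".join(message.split())


cleaned = strip_products(
    "selling cc 4111 1111 1111 1111 visit http://example.com/shop or 10.0.0.1 pm me")
# cleaned == "selling cc visit or pm me"
```

Only the conversational remainder would then be passed on to stylometric feature extraction.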
Key contributions are:
1. Adapting authorship attribution to underground forums. Authorship
attribution is useful in the scenario where an analyst has an unknown piece of text
and wishes to attribute it to one of a set of suspects. This scenario may be useful
in underground forum analysis on its own, but it is also a subroutine in the multiple account
detection algorithm.
Although some language-agnostic authorship attribution methods are available [59;
61] for this task, most of the highly accurate attribution methods [80; 6] are language
specific for standard English. By using language-specific function words and part-of-speech
taggers, the authorship attribution method provides high accuracy even with over
1,000 authors in difficult, foreign language texts. Authorship attribution in underground
forums is possible with a feature set that incorporates the informal language,
such as l33t-sp34k, used in underground forums and data preprocessing methods that
can remove non-conversational products from messages. Taken together, these improve
the accuracy by 10-15% over current state-of-the-art methods applied directly to
underground forums.
2. A general multiple author detection algorithm. Unlike standard author-
ship attribution, identifying doppelgängers is an unsupervised learning problem and
requires novel methods where all pairs of accounts are compared against each other.
Existing methods for this problem [14; 96] based on distance have been evaluated
by artificially splitting authors into multiple identities. These methods have reduced
accuracy when applied to actual separate accounts, such as multiple blogs by the
same author, so improved methods are needed. Non-textual methods used to identify
fraud or spam accounts are insufficient because they do not catch the high-value
alternate identities used in these forums. Doppelgänger Finder evaluates all pairs of
a set of authors for duplicate identities and returns a list of potential pairs, ordered
by probability. This list can be used by a forum analyst to quickly identify interest-
ing multiple identities. Doppelgänger Finder has been validated on real-world blogs
using multiple separate blogs per author and using multiple accounts of members in
different underground forums.
3. A practical manual analysis of an underground forum to identify
previously unknown multiple identities. Discovering and grouping unknown
identities in cases when ground truth data is unavailable is possible by using Doppel-
gänger Finder.
Running Doppelgänger Finder on a German underground forum called Carders
automatically revealed at least 10 new author pairs (and an additional 3 probable
pairs) which would have been hard to discover without time consuming manual anal-
ysis. These pairs are typically high value identities. One user was creating such
identities for sale to other users on the forum. Manual analysis provides insights on
how and why these identities are created by these users and the purposes they serve
in the forums.
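The all-pairs evaluation described in the second contribution can be sketched as follows. This is a deliberately simplified stand-in: it ranks account pairs by cosine similarity of character trigram profiles rather than by the classifier-derived probabilities that Doppelgänger Finder uses, and the names profile, cosine, and rank_pairs are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def profile(text, n=3):
    """Character n-gram counts for all text written by one account."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def rank_pairs(accounts):
    """accounts maps account name -> concatenated text.
    Returns (score, account_a, account_b) tuples, most similar pair first."""
    profiles = {name: profile(text) for name, text in accounts.items()}
    scored = [(cosine(profiles[a], profiles[b]), a, b)
              for a, b in combinations(sorted(profiles), 2)]
    return sorted(scored, reverse=True)


accounts = {
    "u1": "selling fresh dumps, pm me for price",
    "u2": "fresh dumps for sale, pm me for the price",
    "u3": "Wie kann ich mein Passwort ändern?",
}
ranked = rank_pairs(accounts)
```

An analyst would inspect the top of the ranked list manually, as in the third contribution; the number of pairs grows quadratically with the number of accounts.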
2.1.1 Related Work
Underground Markets
Most of the past research on the underground market has focused on either analyzing
structured metadata (i.e., social graphs and trade ratings) in underground
forums or performing a manual analysis of products and prices. One of the first
studies, by Franklin et al., performed an analysis of underground chat messages in
public IRC channels to gain insight into prices and types of products traded [45].
Another study performed an analysis of underground carders forums to understand
how they propagate credentials in large scale data breaches [91]. A separate
study explored how trust models were formed in underground forums [79]. Yip et
al. performed an analysis of structural metadata in underground forums to examine
the dynamics of social graphs in these communities [121]. Finally, another study did
an analysis of activities taking place on Chinese underground markets [123]. McCoy
et al. [75] analyzed the underground forums of three pharmaceutical affiliate pro-
grams and provided a detailed cost accounting of the overall business model. Recent
research has investigated using underground market data to disrupt fraudulent activ-
ities. Thomas et al. identified patterns in fraudulent account usernames/emails by
purchasing Twitter accounts from an underground market [112].
When one forum is disrupted, these cybercriminals often create or join another
forum using the same or different identity. Previous research tried to understand
why these cybercriminals choose forums for doing their business [122] and what prop-
erties make underground forums sustainable [11]. Doppelgänger Finder focuses on
detecting multiple accounts that are controlled by the same person based on auto-
mated analysis of the unstructured message contents. This approach can help identify
known cybercriminals by analyzing their conversation, even when they change online
identities.
Authorship Attribution
Users are unique in many ways and an extensive amount of research exploits
different aspects of behavior to de-anonymize users in anonymized data sets. For
example, a user can be identified based on how and what he types [30], his browser
setup [41], which movie he prefers [82], who he connected with in a social network [83],
when and what he writes in his blog or social network or on product reviews [52; 80;
15] and even how he fills bubbles in a paper form [26]. In the leaked underground
forums, we only have the users’ posts and their social network information. But de-anonymizing
these users using their social links from other social networks [83] is
challenging as these relationships are ephemeral business relationships. Also, often
these posts are from different time frames, so linking users using timing analysis, as
previous work did to de-anonymize Flickr and Twitter users, is not possible [52; 82].
While stylometry has been applied in the past to chat data [6], large numbers of
authors [80], and foreign language and translated texts [27], the combination of
these properties in this data set is unique. The Writeprints [6] work evaluated their
techniques on instant messaging chat logs from CyberWatch (www.cyberwatch.com).
This data set is probably the closest to the forum data sets. However, it had fewer
words per author (an average of 1,422 words) but was in English. In this work, the
accuracy is better even though there are more authors.
Some previous work has explored the question of identifying multiple identities
of an author. The Writeprints method can be used to detect similarity between two
authors by measuring distance between their “writeprints.” Qian et al.’s method,
called “learning by similarity,” learns in the similarity space by creating a training
set of similar and dissimilar documents [96] and comparing the distances between
them. This method was evaluated on users who wrote Amazon book reviews. Almishari et
al. [14] also used a similar distance-based approach using reviews from yelp.com to find
duplicate authors. Koppel et al. [63] used a feature subsampling approach to detect
whether two documents are written by the same author. But all of these methods
were evaluated by creating artificial multiple identities per author by splitting a single
author into two parts.
Detecting Fraudulent Accounts
Perito et al. [92] showed that most users use similar usernames for their accounts
in different sites, e.g., daniele.perito and d.perito. Thus different accounts of a user
can be tracked by just using usernames. This does not hold when the users are
deliberately trying to hide their identity, which is often the case in underground
forums (examples of usernames in multiple accounts are in Table 2.10). Usernames
and other account information and behavior in the social network have often been used to
identify sybil/spam accounts [49; 38; 16]. Doppelgänger Finder has a different goal, as
it tries to identify duplicate accounts of highly active users, who would be considered
as honest users in previous fraud detection papers. For example, these doppelgänger
users are highly connected with other users in the forum, unlike spam/sybil accounts.
Their account information (usernames, email addresses) is similar to that of spam accounts,
with mixed language, special characters, and disposable email accounts. However,
these properties hold for most users in underground forums, even for the ones who
are not creating multiple identities.
2.1.2 Overview of Underground Forums
SQL dumps of forum databases were available for the following underground fo-
rums: AntiChat (AC), BlackhatWorld (BW), Carders (CC), L33tCrew (LC) (summa-
rized in Table 2.1). The complete SQL dumps of the databases include user registra-
tion information, along with public and private messages. Each of these SQL forum
dumps has been publicly “leaked” and uploaded to public file downloading sites by
unknown parties. Previous research was performed on data collected by crawling or joining
the forums. As a result, only the public portions of the forums were available for
analysis. Leaked databases provide access to all the public and private messages in a
specific time duration for each of the forums.
This section gives an overview of the forums; in particular, it shows the relationship
between a member’s rank and his activities in the forum. In all forums, high-ranked
members had more posts than low-ranked members. Access to special sections of these
forums depends on a member’s rank. Having the full SQL dump gives the advantage
19
Forum Language Date covered Posts Pvt msgs Users LurkersAntichat (AC) Russian May 2002-Jun 2010 2,160,815 194,498 41,036 15,165 (36.96%)BlackHat (BW) English Oct 2005-Mar 2008 65,572 20,849 8,718 4,229 (48.5%)Carders(CC) German Feb 2009-Dec 2010 373,143 197,067 8,425 3,097(36.76%)L33tCrew (LC) German May 2007-Nov 2009 861,459 501,915 18,834 9,306 (46.41%)
Table 2.1: Summary of Forums
of seeing the whole forum, which would have been unavailable to an outsider or a
newly joined member crawling the forum. In general, the high-ranked users have
more reputation, a longer post history, and consequently more words for automated
analysis.
Properties of Antichat
Antichat started in May 2002 and was leaked in June 2010. It is a predominantly
Russian language forum with 25,871 active users (users with at least one post in the
forum). Antichat covers a broad array of underground cybercrime topics such as
password cracking, stolen online credentials, email spam, search engine optimization
(SEO), and underground affiliate programs.
Anybody with a valid email address can join the forum, though access to certain
sections of the forum is restricted based on a member’s rank. At the time of the leak,
there were 8 advanced groups and 8 user ranks in the data set1. A member of level
N can access groups at level ≤ N. Admins and moderators have access to the whole
forum and they grant access to levels 3 to 6 by invitation. At the time of the leak,
there were 4 admins and 89 moderators in Antichat.
Members earn ranks based on their reputation, which is given by other members of
the forum for any post or activity2. Initially each member is a Beginner (Новичок)3, a
member with at least 50 reputation is Knowledgeable (Знающий), and a member with 888
reputation is a Guru (Гуру) (all user reputation levels are shown in Table 2.2). A member can
also receive negative reputation points and can get banned. There were 3,033 banned
members. The top reasons for banning a member are having multiple accounts and
violating trade rules.
1 http://forum.antichat.ru/thread17259.html
2 Member rules are described at https://forum.antichat.ru/thread72984.html
3 Translated by Google translator
Rank                      Rep.    Members   Members with >= 4,500 words
Ламер (Lamer)             -50     646       22
Чайник (Newbie)           -3      340       4
Новичок (Beginner)        0       38,279    553
Знающий (Knowledgeable)   50      595       256
Специалист (Specialist)   100     658       413
Эксперт (Expert)          350     271       177
Гуру (Guru)               888     206       153
Античатовец (Antichatian) 5,555   1         1
Table 2.2: AntiChat Members Rank
Antichat has a designated “Buy, Sell, Exchange” forum for trading. Most of the
transactions are in WebMoney4. To minimize cheating, Antichat has paid “Guarantors”
to guarantee product and service quality5. Sellers pay a percentage of the value
of one unit of goods/services to the guarantor to verify his product quality. Members
are advised not to buy non-guaranteed products. In case of cheating, a buyer is
paid off from the guarantor’s collateral value.
4 http://www.wmtransfer.com/
5 https://forum.antichat.ru/thread63165.html
Properties of BlackhatWorld
BlackhatWorld is primarily an English speaking forum that focuses on blackhat
search engine optimization techniques. It started in October 2005 and is still active.
At the time of the leak (May 2008), Blackhat had 4,489 active members.
Like Antichat, anybody can join the forum and read most of the public posts.
At the time of the leak, a member needed to pay $25 to post in a public thread.6
A member can have 8 ranks depending on his posting activities, and has different rights
in the forums based on his rank. This rank can be achieved either by being active
in the forum for a long period or by paying fees. A new member with fewer than 40
posts is a Blacknoob and a member with 40-100 posts is a Peasant; neither of these ranks
has access to the “Junior VIP” section of the forum, which requires at least 100 posts7.
The “Junior VIP” section is not indexed by any search engines or visible to any non
Jr. VIP members. At the time of the leak, a member could pay $15 to the admin
to access this section. A member is considered active after at least 40 posts and
21 days after joining the forum. Member ranks are shown in Table 2.3. The forum
also maintains an “Executive VIP” section where membership is by invitation and a
“Shitlist” for members with bad reputations. There were 43 banned members in our
data set. Most of the members in the BlackhatWorld data set were Blacknoobs.
Currently, only the Junior VIP members can post in the BlackhatWorld marketplace,
the “Buy, Sell, Trade” section8. Any member with over 40 posts was allowed
to trade. Each post in the marketplace must be approved by an admin or moderator.
In the data set, there were 3 admins and 5 moderators. The major currency of this
forum is USD. PayPal and exchange of products are also accepted.
6 The posting cost is now $30.
7 http://www.blackhatworld.com/blackhat-seo/misc.php?do=vsarules
8 http://www.blackhatworld.com/blackhat-seo/bhw-marketplace-rules-how-post/387929-marketplace-rules-how-post-updated-no-sales-thread-bumping.html
Rank                             Members   Members with >= 4,500 words
Banned Users                     43        4
21 days 40 posts                 7,416     4
Registered Member                248       74
Exclusive V.I.Ps                 7         7
Premium Members (PAID/Donated)   191       19
Admins and Moderators            8         8
Table 2.3: Blackhat Members Rank
Properties of Carders
Carders was a German language forum that specialized in stolen credit cards and
other accounts. This forum was started in February 2009 and was leaked and closed
in December 2010 9.
At the time of the leak, Carders had 3 admins and 11 moderators. A regular mem-
ber can have 9 ranks, but unlike other forums the rank was not dependent only on the
number of posts (Table 2.4). Access to different sections of the forum was restricted
based on rank. Any member with a verified email can be a Newbie. A member needs
at least 50 posts to be a Full Member. A member had to be at least a Full Member
to sell tutorials. VIP Members were invited by other high-ranked members. To sell
products continuously, a member needed a Verified Vendor license, which required at
least 50 posts in the forum and a fee of €150+ per month. For certain products, for
example, drugs and weapons, the license cost at least €200. Carders maintained a
“Ripper” thread where any member could report a dishonest trader. A suspected ripper
was assigned the Ripper-Verdacht! title. Misbehaving members, for example, spammers,
rippers, or members with multiple accounts, were banned either temporarily or
permanently depending on the severity of their action. In the data set, there were 1,849
banned members. The majority of the members in the Carders data set are Newbies.
9 Details of the Carders leak are at http://www.exploit-db.com/papers/15823/
Rank                                  Members   Members with >= 4,500 words
Nicht registriert (Not registered)    1         0
Email verification                    323       1
Newbie                                4,899     23
Full Member                           1,296     431
VIP Member                            7         6
Verified Vendor                       16        6
Admins                                14        13
Ripper-Verdacht! (Ripper suspected)   14        7
Time Banned                           6         2
Perm Banned                           1,849     193
Table 2.4: Carders Members Rank
Other products traded in this forum were cardable shops (shops to monetize stolen
cards), proxy servers, anonymous phone numbers, fake shipping and delivery services,
and drugs. The major currencies of the forum were Ukash10, PaySafeCard (PSC)11,
and WebMoney.
Properties of L33tCrew
Like Carders, L33tCrew was a predominantly carding forum. The forum was
started in May 2007 and leaked and closed in Nov 2009. Many users joined Carders
after L33tCrew was closed. At the time of the leak, L33tCrew had 9,528 active users.
10 https://www.ukash.com/
11 https://www.paysafecard.com/
L33tCrew member rank also depended on a member’s activity and number of
posts. A member with 15 posts was allowed in the base account area. The forum
shoutbox, which was used to report minor problems or off-topic issues, was visible to
members with at least 40 posts. A member’s ranking was based on his activity in the
forum (Table 2.5). On top of that, a member could have 2nd and 3rd level rankings.
100–150 posts were needed to be a 2nd level member. Members could rise to 3rd level
after “proving” themselves in 2nd level and proving that they had non-public tools,
tricks, etc. A 2nd level member had to send at least three non-public tools to the
admin or moderators to prove himself.
Rank                           Min. posts   Members   Members with >= 4,500 words
Newbie                         0-30         715       93
Half-Operator                  60           158       67
Operator                       100          177       121
Higher Levels                  150          412       398
Unranked Members               –            16,482    679
Banned                         –            847       197
Admins                         –            11        11
Invited                        –            33        8
Vorzeitig in der Handelszone   –            5         2
Table 2.5: L33tCrew Members Rank
Member Overlap Between Forums
Common active users in the forums can be identified by matching their email
addresses. Here “active” means users with at least one private or public message in a
forum. Among the four forums, Carders and L33tCrew had 563 common users based
on email addresses, among which 443 were active in Carders and 439 were active in
L33tCrew. Common users in other forums are negligible.
Identity Protection
In all of the forums, multiple identities were strictly prohibited. On Carders
and Antichat, one of the main reasons for banning a member was creating multiple
identities. Some users were taking measures to hide their identities. Several users were
using disposable email addresses (562 in Carders, 364 in L33tCrew) from top well-
known disposable email services, e.g., trashmail.com, owlpic.com, and 20minutemail.
Carders used an alternative-ego detection tool (AE detector)12, which saves a
cookie with the history of account IDs that log into Carders. Whenever someone logs into more
than one account, it sends an automated warning message to forum moderators say-
ing that the forum has been accessed from multiple accounts. The AE detector also
warns the corresponding members. Users who received warning messages from the
AE detector were considered part of a multiple identity. There were 400 multiple
identity groups formed by 1,692 members, where group size varied from 2 to 466
accounts (shown in Figure 2.1).
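The cookie-based mechanism described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the vBulletin plugin itself: the class name AEDetectorSketch, the login method, and the use of a union-find structure to form groups are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class AEDetectorSketch:
    """Toy model of a cookie-based alter-ego detector (hypothetical design)."""

    def __init__(self):
        self.cookie_ids = defaultdict(set)  # browser cookie -> account ids seen
        self.parent = {}                    # union-find forest over account ids

    def _find(self, account):
        self.parent.setdefault(account, account)
        while self.parent[account] != account:
            # Path halving keeps later lookups short.
            self.parent[account] = self.parent[self.parent[account]]
            account = self.parent[account]
        return account

    def login(self, cookie, account):
        """Record a login; return True if a warning would be raised."""
        seen = self.cookie_ids[cookie]
        self._find(account)  # make sure the account is registered
        warn = bool(seen) and account not in seen
        for other in seen:
            # Every account sharing this cookie lands in the same group.
            self.parent[self._find(other)] = self._find(account)
        seen.add(account)
        return warn

    def groups(self):
        """Return groups of two or more accounts linked through cookies."""
        clusters = defaultdict(set)
        for account in self.parent:
            clusters[self._find(account)].add(account)
        return [sorted(g) for g in clusters.values() if len(g) > 1]


detector = AEDetectorSketch()
detector.login("cookie-1", "a")
detector.login("cookie-1", "b")   # second account on the same cookie: warning
detector.login("cookie-2", "b")
detector.login("cookie-2", "c")   # links c to the group of a and b
detector.login("cookie-3", "d")   # d stays on its own
print(detector.groups())
```

Because links are transitive, one shared device or proxy can merge many unrelated accounts into a single group, which is one way very large groups could arise.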
We suspect that the AE detector does not reflect multiple account holders per-
fectly. There are possible scenarios that would trigger the AE detector, e.g. when
two members use a shared device to log into Carders or use a NAT/proxy. The
corresponding users in these situations were considered as doppelgängers by the AE
detector, which does not reflect the ground truth. Likewise, the AE detector may
not catch all the alter egos, as some users may take alternate measures to log in from
different sources. These suspicions were supported by the stylometric and manual
analyses of Carders’ posts.
12 http://www.vbulletin.org/forum/showthread.php?t=107566
Figure 2.1: Duplicate Account Groups Within Carders as Identified by the AE Detector. (Each dot is one user. There is an edge between two users if the AE detector considered them as duplicate users.)
Public and Private Messages
In a forum a member can send public messages to public threads and private
messages to other members. In our data set we had both the public and private
messages of all the members. Public messages are used to advertise/request products
or services. In general, public messages are short and often have specific formats. For
example, Carders specifies a specific format for public thread titles.
Private messages are used for discussing details of the products and negotiating
prices. Sometimes members use their other email, ICQ, or Jabber addresses for
finalizing trades.
2.1.3 Authorship Attribution
The goal in this section is to see how well stylometry works in the challenging
setting of underground forums and how to adapt stylometric methods to improve
performance.
Approach
We consider a supervised authorship attribution problem that given a doc-
ument D and a set of authors A = {A1, ..., An} determines who among the authors
in A wrote D. The authorship attribution algorithm has two steps: training and
testing. During training, the algorithm trains a classifier using F features extracted
from the sample documents of the authors in A. In the testing step, it extracts fea-
tures predefined in F from D and determines the probability of each author in A
of being the author of D. It considers an author Amax to be the author of D if the
probability of Amax being the author of D, Pr(Amax wrote D), is the highest among
all Pr(Ai wrote D), i = 1, 2, ...n.
k-attribution is the relaxed version of authorship attribution that outputs k
top authors, ranked by their corresponding probabilities, Pr(Ai wrote D), where
i = 1, 2, ...k and k ≤ n.
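As a minimal sketch of the two definitions above, assume the classifier's output for a document D is available as a dictionary from author to Pr(Ai wrote D). The function name k_attribution and the example probabilities are illustrative only.

```python
def k_attribution(probabilities, k=1):
    """Return the k most probable authors, most probable first.

    probabilities maps each candidate author to Pr(author wrote D).
    With k=1 this reduces to strict authorship attribution.
    """
    if not 1 <= k <= len(probabilities):
        raise ValueError("k must satisfy 1 <= k <= n")
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [author for author, _ in ranked[:k]]


# Hypothetical classifier output for one document.
probs = {"A1": 0.10, "A2": 0.55, "A3": 0.30, "A4": 0.05}
print(k_attribution(probs, k=1))  # strict attribution
print(k_attribution(probs, k=3))  # relaxed attribution
```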
Feature Extraction
The feature set contains lexical, syntactic, and domain specific features. The lex-
ical features include frequency of n-grams, punctuation, and special characters. The
syntactic features include frequency of language-specific parts-of-speech and function
words. In our data set we used English, German, and Russian parts-of-speech tag-
gers and corresponding function words. For English and German parts-of-speech
tagging we used the Stanford log-linear parts-of-speech tagger [113]. For Russian
parts-of-speech tagging we used TreeTagger [104] with Russian parameters13. Func-
tion words or stop words are words with little lexical meaning that serve to ex-
press grammatical relationships with other words within the sentence, for exam-
ple, in English function words are prepositions (to, from, for), and conjunctions
(and, but, or). We used German and Russian stop words from Ranks.nl (http:
//www.ranks.nl/resources/stopwords.html) as function words. Similar feature
sets have been used before in authorship analysis on English text [6; 80; 76]. We
modified the feature set for the multilingual case by adding language specific fea-
tures. As the majority of the members use leetspeak in these forums, we used the
percentage of leetspeak per document as a feature. Leetspeak (also known as Internet
slang) uses combinations of ASCII characters to replace Latin letters, for example,
leet is spelled as l33t or 1337. We defined leetspeak as a word with symbols and
numbers and used regular expressions to identify such words.
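The exact regular expressions are not given in the text, so the following is only a sketch of one possible reading of the definition: a token counts as leetspeak if it mixes at least one Latin letter with at least one digit or one of a few symbols. The symbol set, the tokenization on whitespace, and the function name are assumptions.

```python
import re

# Assumption: a leetspeak token has at least one Latin letter and at least
# one digit or substitution symbol (@, $, +, |), and nothing else.
LEET_TOKEN = re.compile(r"^(?=.*[A-Za-z])(?=.*[0-9@$+|])[A-Za-z0-9@$+|]+$")


def leetspeak_percentage(text):
    """Percentage of whitespace-delimited tokens that look like leetspeak."""
    tokens = [token.strip(".,;:!?\"'()") for token in text.split()]
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    leet = sum(1 for token in tokens if LEET_TOKEN.match(token))
    return 100.0 * leet / len(tokens)


print(leetspeak_percentage("he got pwn3d by a l33t h4x0r today"))
```

Under this reading a purely numeric token such as 1337 is not counted, and non-Latin scripts would need a wider letter class. Both are limitations of the sketch, not claims about the original implementation.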
We used the JStylo [76] API for feature extraction, augmenting it with leetspeak
percentage and the multilingual features for German and Russian.
13 http://corpus.leeds.ac.uk/mocky/
Feature                                        Count
Freq. of punctuation (e.g. ‘,’ ‘.’)            Dynamic
Freq. of special characters (e.g., ‘@’, ‘%’)   Dynamic
Freq. of character ngrams, n=1-3               150
Length of words                                Dynamic
Freq. of numbers ngrams, n=1-3                 110
Freq. of parts-of-speech ngrams, n=1-3         150
Freq. of word ngrams, n=1-3                    150
Freq. of function words, e.g. for, to, the     Dynamic
Percentage of leetspeak, e.g., l33t, pwn3d     1
Table 2.6: Feature Set
Classification
We used a linear kernel Support Vector Machine (SVM) with Sequential Minimal
Optimization (SMO) [94]. We performed 10-fold cross-validation, that is, our classi-
fier was trained on 90% of the documents (at least 4,500 words per author) and tested
on the remaining 10% of the documents (at least 500 words per author). This
experiment is repeated 10 times, each time randomly taking one 500-word document per
author for testing and the rest for training. To evaluate our method’s performance
we use precision and recall. Here true positive for author A means number of times a
document written by author A was correctly attributed to author A and false positive
for author A means number of times a document written by any other author was
attributed to author A. We calculate per author precision/recall and take the average
to show overall performance.
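The per-author averaging described above can be sketched as follows. The function name and the toy labels are illustrative. Authors with no predicted documents are given precision 0 here, which is an assumption about a case the text does not discuss.

```python
from collections import Counter


def macro_precision_recall(true_authors, predicted_authors):
    """Average per-author precision and recall over all true authors."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_authors, predicted_authors):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # a false positive for the guessed author...
            fn[truth] += 1   # ...is a false negative for the true author
    authors = set(true_authors)
    precisions, recalls = [], []
    for author in authors:
        predicted = tp[author] + fp[author]
        actual = tp[author] + fn[author]
        precisions.append(tp[author] / predicted if predicted else 0.0)
        recalls.append(tp[author] / actual if actual else 0.0)
    return sum(precisions) / len(authors), sum(recalls) / len(authors)


truth = ["A", "A", "B", "B", "C", "C"]
guess = ["A", "B", "B", "B", "C", "A"]
precision, recall = macro_precision_recall(truth, guess)
print(round(precision, 3), round(recall, 3))
```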
Removing Product Data
One of the primary challenges with this data set is the mixing of conversational
discussion with product discussions, e.g., stolen credentials, account information with
passwords, and exploit code. This is particularly pronounced in the most active users
who represent the majority of the trading activities. As the classifier relies on writing
style to determine authorship, it misclassifies when two or more members share similar
kinds of product information in their messages. Removing product information from
conversation improved the classifier’s performance by 10-15%. Identifying product
information is also useful for understanding what kind of products are being traded
in the forums.
Our product detector is based on two observations: 1) product information usually
has repeated patterns, 2) conversation usually has verbs, but product information
does not have verbs. To detect products, we first tag all the words in a document with
their corresponding parts-of-speech and find sentence structures that are repeated
more than a threshold of times. We consider the repeated patterns with no verbs as
products and remove these from the documents.
To find repeated patterns, we measured Jaccard distance between each pair of
tagged sentences. Due to errors in parts-of-speech tagging, sometimes two similar
sentences are tagged with different parts-of-speech. To account for this, we considered
two tagged sentences as similar if their distance is less than a threshold. We consider
a post as a product post if any pattern is repeated more than three times. Note
that our product detector is unsupervised and not specific to any particular kind of
product, rather it depends on the structure of product information.
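The two observations can be turned into a short sketch. It assumes each sentence arrives already tagged, as a pair of text and parts-of-speech tags, and that verb tags start with "V" (as in the Penn Treebank and STTS tag sets). Computing Jaccard distance over sets of tags is one possible reading of the description, and all names below are hypothetical.

```python
def jaccard_distance(tags_a, tags_b):
    """Jaccard distance between the tag sets of two tagged sentences."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def has_verb(tags):
    # Assumption: verb tags start with "V" (VB, VBD, VVFIN, ...).
    return any(tag.startswith("V") for tag in tags)


def remove_product_lines(tagged_sentences, match_threshold=0.5, repeat_threshold=3):
    """Drop verb-free sentences whose tag pattern occurs more than repeat_threshold times."""
    kept = []
    for text, tags in tagged_sentences:
        # Count sentences (including this one) with a similar tag pattern.
        occurrences = sum(
            1 for _, other in tagged_sentences
            if jaccard_distance(tags, other) < match_threshold
        )
        if occurrences > repeat_threshold and not has_verb(tags):
            continue  # repeated, verb-free pattern: treat as product data
        kept.append(text)
    return kept


post = [
    ("hi, I can offer these today", ["UH", ",", "PRP", "MD", "VB", "DT", "NN"]),
    ("item 101 price 5", ["NN", "CD", "NN", "CD"]),
    ("item 102 price 7", ["NN", "CD", "NN", "CD"]),
    ("item 103 price 4", ["NN", "CD", "NN", "CD"]),
    ("item 104 price 9", ["NN", "CD", "NN", "CD"]),
    ("let me know", ["VB", "PRP", "VB"]),
]
print(remove_product_lines(post))
```

The default thresholds mirror the matching threshold of 0.5 and repetition threshold of 3 reported in the evaluation.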
To evaluate our product detector we randomly chose 10,000 public posts from
Carders and manually labeled them as product or conversation. 3.12% of the posts
contained products. Using a matching threshold of 0.5 and repetition threshold of 3,
we can detect 81.73% of the product posts (255 out of 312) with 2.5% false positive
rate.14
14 Note that false positives are not that damaging, since they only result in additional text being removed.
2.1.4 Results
Minimum Text Requirement for Authorship Attribution
Figure 2.2: Effect of Number of Words Per User on Accuracy
We trained our classifier with different numbers of training documents per author
to see how much text is required to identify an author with sufficient accuracy. We
performed this experiment for all the forums studied. In our experiments, accuracy
increased as we trained the classifier with more words-per-author. On average, the ac-
curacy did not improve when more than 4500 words-per-author were used in training
(Figure 2.2).
Attribution Within Forums
Many users were removed from the data set due to insufficient text, especially
after products and data dumps were removed. Table 2.7 shows the number of authors
remaining in each forum, who are mostly the high-ranked members (section 2.1.2), and
our author attribution results for public and private messages respectively.
Performance on private messages ranged from 77.2% to 84% precision. Recall results
were similar, as this is a multi-class rather than
a binary decision problem and precision for all authors was averaged (a false positive
for one author is a false negative for another author). This is comparable to results on
less challenging stylometry problems, such as English language emails and essays [6].
Performance on public messages, which were shorter and less conversational (more
like advertising copy), was worse, ranging from 60.3% to 72%, apart from AntiChat
at 44.4% (Table 2.7). The product detection
and changes to the feature set we made increased the overall accuracy by 10-15%
depending on the setting.
However, it is difficult to compare the performance across different forums due
to the differing number of authors in each forum. Figure 2.3 shows the results of k-
attribution for k = 1 to k = 10 where the k = 1 case is strict authorship attribution.
In this figure we can see that the differences between private and public messages
persist in this case and that the accuracy is not greatly affected when the number of
authors scales from 50 to the numbers in Table 2.7. Furthermore, this figure shows
that the results are best for the Carders forum. The higher accuracy for Carders
and L33tCrew may be due to the more focused set of topics on these forums or
possibly the German language. Via manual analysis, we noted that the part-of-
speech tagger we used for Russian was particularly inaccurate on the Antichat data
set. A more accurate part-of-speech tagger might lead to better results on Russian
language forums.
Figure 2.3: User Attribution on 50 Randomly Chosen Authors.
Relaxed or k-attribution is helpful in the case where stylometry is used to narrow
the set of authors in manual analysis. As we allow the algorithm to return up to 10
authors, we can increase the precision of results returned to 96% in the case of private
messages and 90% in the case of public messages.
            Public                 Private
Forum       Members   Precision    Members   Precision
AntiChat    1,459     44.4%        25        84%
Blackhat    81        72%          35        80.7%
Carders     346       60.3%        210       82.8%
L33tCrew    1,215     68.8%        479       77.2%
Table 2.7: Author Attribution Within a Forum.
Importance of Features
German forums                English forums        Russian forums
Char. trigram: mfg 15        Punctuation: (’)      Char. 1-gram: (ё)
Punctuation: Comma           Punctuation: Comma    Function word: ещё (Trans.: more)
Leetspeak                    Foreign words         Punctuation: Dot
Punctuation: Dot             Leetspeak             Char. 3-grams: ени
Char. 3-gram: (...)          Function word: i’m    Char. bigrams: (, )
Nouns                        Punctuation: Dot      Word-bigrams: что бы (that would)
Uppercase letters            POS-bigram (Noun,)
Function word: dass (that)   Char. bigram: (, )
Conjunctions 16
Char. 1-gram: ∧
Table 2.8: Features with Highest Information Gain Ratio in Different Forums
To understand which features were the most important to distinguish authors, we
calculated the Information Gain Ratio (IGR) [97] of each feature Fi over the entire
data set:
IGR(Fi) = (H(A)−H(A|Fi))/H(Fi) (2.1)
where A is a random variable corresponding to an author and H is Shannon entropy.
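Equation 2.1 can be evaluated directly for discrete feature values. The sketch below assumes each document contributes one (author, feature value) pair and that continuous frequencies have already been binned. The function names and the toy data are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(authors, feature):
    """H(A | F): entropy of the author given the feature value."""
    n = len(authors)
    by_value = {}
    for author, value in zip(authors, feature):
        by_value.setdefault(value, []).append(author)
    return sum(len(group) / n * entropy(group) for group in by_value.values())


def information_gain_ratio(authors, feature):
    """IGR(F) = (H(A) - H(A|F)) / H(F), as in equation 2.1."""
    h_feature = entropy(feature)
    if h_feature == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(authors) - conditional_entropy(authors, feature)) / h_feature


authors = ["A", "A", "B", "B"]
print(information_gain_ratio(authors, ["hi", "hi", "lo", "lo"]))  # separates authors
print(information_gain_ratio(authors, ["hi", "lo", "hi", "lo"]))  # uninformative
```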
Punctuation marks (comma, period, consecutive periods) were some of the most
important features (shown in Table 2.8) in all the German, English, and Russian
language forums. In German and English forums, leetspeak percentage was highly
ranked. Interestingly, similar features are important across different forums, even
though the predominant languages of the forums are different.
15 mfg is an abbreviation of a German greeting “Mit Freundlichen Gruessen” (English: sincerely yours).
16 German subordinating conjunctions (e.g. weil (because), daß (that), damit (so that))
2.2 Detecting Multiple Identities
In a practical scenario, an analyst may want to find any probable set of duplicate
identities within a large pool of authors. Having multiple identities per author is
not uncommon, e.g., many people on the Internet have multiple email addresses,
accounts on different sites (e.g. Facebook, Twitter, G+) and blogs. Grouping multiple
identities of an author is a powerful ability as the easiest way to change identity on
the Internet is to create a new account.
Grouping all the identities of an author is not possible using only the traditional
supervised authorship attribution. A supervised authorship attribution algorithm,
trained on a set of unique authors, can answer who, among the training set, is the
author of an unknown document. If the training set contains multiple identities of
an author, supervised authorship attribution will identify only one of the identities
as the most probable author, without saying anything about the connection among
the authors in the training set.
2.2.1 Approach
Author identities can be grouped by leveraging supervised authorship attribution.
For each pair of authors A and B we calculate the probability of A’s document being
attributed to B (Pr(A→ B)) and B’s document being attributed to A (Pr(B → A)).
A and B are considered to be the same author if the combined probability is greater
than a threshold. To calculate the pairwise probabilities, for each author Ai ∈ A,
train a model using all the other authors in A except Ai and test using Ai. This
method is called Doppelgänger Finder.
This method can be extended to larger groups. For example, for three authors
A, B and C, compute P(A==B), P(B==C), and P(C==A). If A=B and C=B, we
consider A, B, and C as the three identities of one author.
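The extension to larger groups amounts to taking connected components over the pairs whose combined score exceeds the threshold. A minimal sketch, assuming the combined scores are already available as a dictionary keyed by author pair; the function name and the example scores are hypothetical.

```python
from collections import defaultdict


def group_identities(pair_scores, threshold):
    """Group authors linked, directly or transitively, by scores above threshold."""
    adjacency = defaultdict(set)
    for (a, b), score in pair_scores.items():
        if score > threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    groups, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


scores = {("A", "B"): 0.40, ("B", "C"): 0.30, ("A", "C"): 0.001, ("D", "E"): 0.0001}
print(group_identities(scores, threshold=0.004))
```

Here A and C fall into the same group through B even though their own score is below the threshold, matching the A=B and C=B case in the text.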
2.2.2 Feature Extraction
To identify similarity between two authors we use the same features as regular
authorship attribution (Table 2.6), with two exceptions: 1) exclude the word n-grams,
and 2) instead of limiting the number of other n-grams, use all possible n-grams.
Word n-grams made the feature extraction process slower without any improvement
in the performance. We used all possible character n-grams to increase the difference
between authors, e.g., if author A uses a bi-gram “ng” but author B never uses it,
then “ng” is an important feature to distinguish A and B. If we include all possible
character n-grams instead of only the top 50, we can catch many such cases, especially
the rare author-specific n-grams.
After extracting all the features, we add weight to the feature frequencies to
increase distance among authors. This serves to increase the distance between present
and not present features and gives better results. As our features contain all possible
n-grams, the total number of features per data set becomes very large (over 100,000 for
100 authors). Not all of the features are important; many just make the classification
task slower without improving the accuracy. To reduce the number of features without
hurting performance, we use Principal Component Analysis (PCA) to weight and
select only the features with high variance.
Principal component analysis (PCA) is a widely used mathematical tool for high
dimension data analysis. It uses the dependencies between the variables to represent
the data in a more tractable, lower-dimensional form. PCA finds the variances and
coefficients of a feature matrix by finding the eigenvalues and eigenvectors. To perform
PCA, the following steps are performed:
1. Calculate the covariance matrix of the feature matrix F. The covariance matrix
measures how much the features vary from the mean with respect to each other.
The covariance of two random variables X and Y is:
\[
\operatorname{cov}(X, Y) = \sum_{i=1}^{N} \frac{(x_i - \bar{x})(y_i - \bar{y})}{N} \tag{2.2}
\]
where \bar{x} = mean(X), \bar{y} = mean(Y), and N is the total number of documents.
2. Calculate eigenvectors and eigenvalues of the covariance matrix. The eigenvector
with the highest eigenvalue is the most dominant principal component of
the data set (PC1). It expresses the most significant relationship between the
data dimensions. Principal components are calculated by multiplying each row
of the eigenvectors with the sorted eigenvalues.
3. One of the reasons for using PCA is to reduce the number of features by finding
the principal components of input data. The best low-dimensional space is
defined as having the minimal error between the input data set and the PCA
(eq. 2.3).
\[
\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > \theta \tag{2.3}
\]
where K is the selected dimension, N is the original dimension and λ is an
eigenvalue. We chose θ = 0.999 so that the error between the original data set
and the projected data set is less than 0.1%.
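Two pieces of the steps above are easy to show in isolation: the covariance of equation 2.2 and the choice of K from equation 2.3. The sketch assumes the eigenvalues have already been computed by some linear algebra routine. The function names and the example eigenvalues are illustrative.

```python
def covariance(xs, ys):
    """Population covariance of two equally long sequences (equation 2.2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def select_dimension(eigenvalues, theta=0.999):
    """Smallest K whose top-K eigenvalues exceed a theta share of the total (equation 2.3)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > theta:
            return k
    return len(ordered)


print(covariance([1, 2, 3], [2, 4, 6]))
# Made-up eigenvalues: the first three carry more than 99.9% of the variance.
print(select_dimension([6.0, 3.0, 0.9995, 0.0005]))
```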
2.2.3 Probability Score Calculation
We use logistic regression with ‘L1’ regularization with a regularization factor
of C = 1 as a classifier to calculate pairwise probabilities. We experimented with
linear kernel SVM, which was slower than logistic regression without any performance
improvement. Any machine learning method that outputs classification probability
scores can be used. After that we need to calculate P (A == B) by combining the
two probabilities: P (A → B) and P (B → A). We experimented with three ways of
combining the probabilities:
1. Average: Given two probabilities $\Pr(A \to B)$ and $\Pr(B \to A)$, combined score
is $\frac{\Pr(A \to B) + \Pr(B \to A)}{2}$.
2. Multiplication: Given two probabilities, combined score is $\Pr(A \to B) \cdot \Pr(B \to A)$.
We can consider the two probabilities as independent because when $\Pr(A \to B)$
was calculated A was not present in the training set. Similarly B was not
present when $\Pr(B \to A)$ was calculated. Also in this case if any of the one-way
probabilities are 0, the combined probability would be zero.
3. Squared average: The combined score is $\frac{\Pr(A \to B)^2 + \Pr(B \to A)^2}{2}$.
All the three approaches give similar precision/recall. We finally used the multi-
plication approach as its performance is slightly higher in the high recall region.
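The three combination rules are simple enough to compare side by side. In this sketch the function name and the method labels are illustrative, and the example probabilities are made up to show how a one-sided match is treated.

```python
def combine(p_ab, p_ba, method="multiplication"):
    """Combine Pr(A -> B) and Pr(B -> A) into one pairwise score."""
    if method == "average":
        return (p_ab + p_ba) / 2
    if method == "multiplication":
        return p_ab * p_ba
    if method == "squared_average":
        return (p_ab ** 2 + p_ba ** 2) / 2
    raise ValueError(f"unknown method: {method}")


# A one-sided match: A's text is attributed to B, but never the reverse.
for method in ("average", "multiplication", "squared_average"):
    print(method, combine(0.9, 0.0, method))
```

Only the multiplication rule drives a one-sided match to zero; the other two still give it a substantial score.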
2.2.4 Multiple Identities in Underground Forums
In this section we show how the Doppelgänger Finder method can be used to identify
duplicate accounts by performing a case study on the underground forums. In the
forums, many users create multiple identities to hide their original identity (reasons
for doing so are discussed later) and they do so by changing the obvious identity
indicators, e.g. usernames and email addresses. So we did not have any strong ground
truth information for the multiple identities in a forum. We do, however, have some
common users across two forums. We treat the common identities in multiple forums
as one data set and use that to evaluate Doppelgänger Finder in underground forums.
After that we run it on a forum and manually verify our results.
2.2.5 Multiple Identities Across Forums
We collected users with the same email address from L33tCrew and Carders. We found
563 valid common email addresses between these two forums. Among them, 443 users
were active (had at least one post) in Carders and 439 were active in L33tCrew. Out
of these 882 users, 179 had over 4,500 words of text. We performed Doppelgänger
Finder on these 179 authors, which included 28 pairs of users (the remaining 123
accounts did not have enough text in the other forum so merely served as distractor
authors for the algorithm). Our method provides 0.85 precision and 0.82 recall when
the threshold is 0.004 with exactly 4 false positive cases (Figure 2.4).
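As a consistency check on these figures, the number of correctly detected pairs can be recovered from the reported values. The true-positive count is not stated in the text; the short search below only finds which count agrees with 28 true pairs, 4 false positives, and the rounded precision and recall.

```python
TRUE_PAIRS = 28
FALSE_POSITIVES = 4

# Try every possible true-positive count and keep those that reproduce
# the reported precision (0.85) and recall (0.82) after rounding.
consistent = []
for tp in range(TRUE_PAIRS + 1):
    precision = tp / (tp + FALSE_POSITIVES)
    recall = tp / TRUE_PAIRS
    if round(precision, 2) == 0.85 and round(recall, 2) == 0.82:
        consistent.append(tp)

print(consistent)
```

Under these assumptions the only consistent value is 23 detected pairs, giving precision 23/27 and recall 23/28.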
Figure 2.4: Doppelgänger Finder: With Common Users in Carders and L33tCrew: 179 Users with 28 Pairs. AUC is 0.82.
2.2.6 Multiple Identities Within a Forum
We used Doppelgänger Finder on Carders and manually analyzed the member-
pairs with high scores to show that they are highly likely to be the same user. We
selected all the Carders users with at least 4,500 words in their private messages,
which resulted in a total of 221 users. We chose only private messages as our basic
authorship attribution method was more accurate in private messages than in public
messages. After that, we ranked the member pairs based on the scores generated by
our method. The highest combined probability score of the possible pairs is 0.806
and then it goes down to almost zero after the first 50 pairs (Figure 2.5).
Figure 2.5: Combined Probability Scores of the Top 100 Pairs from Carders. (x-axis: Number of Author Pairs, 0 to 100; y-axis: Combined Probability Scores, 0 to 1.0.)
Methodology
Criteria              Description
Username              Whether their usernames are same
ICQ                   If two users have the same ICQ numbers
Signature (Sig.)      Whether they use the same signatures
Contact Information   Phone number and other contact information shared
Acc. Info             Information in the user table, e.g., their group membership,
                      join and ban date, activity time
Topics                Their topic of discussion
AE                    At least one of the users triggers the AE detector.
Interaction (Intr.)   Do they talk with each other?
Other                 Other identity indicators, e.g., users mention their other
                      accounts or the pair is banned for having the same IP address.
Table 2.9: Criteria for Verifying Multiple Accounts
Table 2.9 shows the criteria we use to validate the possible doppelgängers. We
manually read their private and public messages in the forum and information used
in the user accounts to extract these features. The first criterion is to see if two users
have the same ICQ numbers (a.k.a. UINs), which are used by most traders to discuss
details of their transactions. ICQ numbers are generally exchanged in private messages. Our
second criterion is to match signatures. In all the forums a user can enable or disable
a default signature on their forum profiles. Signatures could be generic abbreviations
of common phrases such as ‘mfg,’ or ‘Grüße’ or pseudonyms in the forum. We also
investigate the products traded, payment methods used, topics of messages, and user
information in the user table, e.g., join date, banned date if banned, rank in the
forum and groups the user joined. We check whether or not they set off the Alter-
Ego detector on Carders. Lastly we check whether or not members in a pair sent
private messages to each other because that would indicate that they are likely not
42
the same person. We understand that there are many ways to verify identity but in
most cases these serve as good indicators.
The Doppelgänger Finder algorithm considered $\binom{221}{2} = 24{,}310$ possible pairs.
We chose all the pairs with a score greater than 0.05 for our manual analysis (21 pairs).
We restricted the analysis to these pairs because manual verification is time consuming.
We also chose three pairs with low scores (ranks 22-24 in Table 2.10) to illustrate that
higher-scoring pairs are more likely to be true matches than lower-scoring pairs. Note
that all of the top possible doppelgängers use completely different usernames. To
protect the members’ identities we only show the first three letters of their usernames
in Table 2.10.
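The pair counting and score thresholding described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name `rank_pairs`, the toy user names, and the toy scores are all hypothetical; only the user count (221) and the 0.05 threshold come from the text.

```python
from itertools import combinations
from math import comb


def rank_pairs(scores, threshold=0.05):
    """Return (pair, score) tuples whose score exceeds the threshold, best first."""
    selected = [(pair, s) for pair, s in scores.items() if s > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


# Number of unordered pairs among the 221 selected users.
n_pairs = comb(221, 2)

# Hypothetical example: four made-up users give six candidate pairs.
users = ["u1", "u2", "u3", "u4"]
all_pairs = list(combinations(users, 2))

# Made-up scores for three of those pairs (illustrative values only).
toy_scores = {("u1", "u2"): 0.806, ("u1", "u3"): 0.01, ("u3", "u4"): 0.051}
top = rank_pairs(toy_scores)
```

Only the pairs returned by `rank_pairs` would then be passed on to manual verification.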
There are five possible outcomes of our manual analysis: True, Probably True,
Unclear, Probably False, and False. True indicates that we have conclusive evidence
that the pair are doppelgängers, e.g., the pair themselves mention their other accounts
in their private or public messages, or the pair shares the same IM or payment accounts.
Probably True indicates that the members share similar uncommon attributes but there
is no conclusive evidence that they are the same person. Unclear indicates that some
criteria are similar in both accounts, some are very different, and there are no
conclusive attributes either way. Probably False means there is little to no
similarity between the members but no evidence that they are not the same person. False
indicates that we found conclusive evidence that the members in a pair are not the
same person, e.g., the members trade with each other.
Results and Discussion
We found that in Carders, the account pairs scored at the high end of the probability
range were doppelgängers. The 12 pairs with the highest probabilities were assessed
as True or Probably True. After that, there is a range where both the manual
and linguistic evidence is thinner but which nonetheless contains some true pairs
(pairs 13-17). The manual analysis suggested that pairs below this probability threshold
were likely not doppelgängers. Thus, our manual analysis overall agreed with the
linguistic analysis performed by Doppelgänger Finder.
Rank Score Usernames ICQ Sig. Contact Acc. Topics AE Other Intr. Decision
1 0.806 per**, Smi** X icq weed X X 0 True
2 0.799 Pri**, Lou** X X X 0 True
3 0.673 Kan**, deb** X X 0 True
4 0.601 Sch**, bob** – mfg – Kokain – 0 Probably True
5 0.495 Duk**, Mer** X – – 0 True
6 0.474 Dra**, Pum** X X X 0 True
7 0.372 p01**, tol** – greezz X – 0 Probably True
8 0.342 Qui**, gam** X X X 0 True
9 0.253 aim**, sty** X X 0 True
10 0.250 Un1**, Raz** X X X 0 True
11 0.196 PUN**, soc** – Jabber X – X 0 True
12 0.192 Koo**, Wic** – peace X weed X 0 Probably True
13 0.187 Ped**, roc** – X – 0 Unclear
14 0.178 Tzo**, Haw** – X X 0 Probably False
15 0.140 Xer**, kdk** – X X X 0 Unclear
16 0.105 sys**, pat** X X 0 True
17 0.095 Xer**, pat** – – X X 0 Probably False
18 0.072 Qui**, Sco** – X 0 False
19 0.066 Fru**, DaV** – – – – 0 Probably False
20 0.058 Ber**, neo** – 5 False
21 0.051 Mr.**, Fle** – X X 26 False*
22 0.01 puT**, pol** – – – – – – 0 False
23 0.001 BuE**, Fru** – – – – – – 0 False
24 0.0001 Car**, Din** – – – – – – 0 False
Table 2.10: Manual Analysis of Users: X indicates same, – indicates different, empty
means the result is inconclusive or complicated with many values.
True. True cases are seen in particular when users explicitly state their identities
and/or use the same ICQ numbers in two separate accounts. For example, the users in
each of Pairs 1-3, 5, 6, 8, 9, 10, and 16 provide an ICQ number in their private
messages that is unique to that pair. The users in Pair-11 use the same Jabber
nickname. One of the users in Pair-1 (username per**) asked the admins to
give his other account back and told other members that he is Smi**.
Other cases had just as convincing, but more subtle, evidence. The accounts in
Pair-8 both use trashmail, which provides disposable email addresses; this shows
that these users are careful about hiding their identities. However, the most
convincing evidence of their connection was a third doppelgänger account, which we
will call user-8c, who did not have enough text to be in our initial user set but was
brought to our attention by the linguistic similarity between the accounts in Pair-8.
Both users in Pair-8 share the same ICQ number with user-8c. User-8b explicitly writes
two messages from user-8c’s account, one in Turkish and one in English, revealing his
user-8b username. These users do not send private messages to each other. These
findings imply that the three user accounts belong to the same person.
Probably True. These accounts do not have a “smoking gun” like a shared ICQ
number or Jabber account; however, we are able to observe that the accounts
have similar interests or other shared properties. We consider how common these
properties are in the entire forum and assess as Probably True the accounts that share
uncommon properties.
In the case of Pair-4, user-4a does not have an ICQ number, but user-4b frequently
gives out an ICQ number. User-4a wants to buy new ICQ numbers, which suggests
that he uses ICQ and hides his own ICQ number. They both use a similar signature,
‘mfg’, but this is common. They trade similar products and talk about similar topics,
such as Kokain and D2 numbers. Since these are not common, this suggests they
might be the same user. User-4a is a newbie while user-4b is a full member. The
accounts were active during the same period.
The accounts in Pair-7 have different ICQ numbers. However, both user-7a and
user-7b deal with online banking products, PS3, Apple products, Amazon accounts,
and cards. They both use Ukash. They both use the same signatures, such as ‘grüße’
or ‘greezz’. User-7a is a full member and user-7b is permanently banned. Both
accounts were active during the same period. User-7a has a 13th level
reputation and user-7b has an 11th level reputation.
Similarly, the accounts in Pair-12 use the same rare signature, ‘peace’, and both
are interested in weed.
Unclear. The accounts in Pair-13 do not have ICQ numbers in common with each other,
even though each has ICQ numbers in common with other users (suggesting that they do
use doppelgänger accounts, namely accounts with less text and lower reputation).
User-13a is a full member with a reputation level of 8. User-13b is a full member with
a reputation level of 15. User-13a’s products are carding, ps, packstation, netbook,
and camcorder, and user-13b’s products are carding, botnets, cc dumps, xbox, viagra,
and iPod.
Probably False. The Pair-14 accounts have different ICQ numbers. User-14a’s products
are tutorials, accounts, Nike, ebay, and ps. User-14b’s products are cameras and cards.
User-14a is a full member with a reputation level of 5. User-14b is permanently banned
with a reputation level of 15.
One of the users in Pair-17, user-17b, shares two ICQ numbers with another user
but not with user-17a. User-17a’s products are iPhone, iPad, macbook, drops, and
paypal, and user-17b’s products are paypal, iPhone, D2 pins, and weed.
False. These users have specific and different signatures, and they also use different
ICQ numbers. These accounts sometimes interact, suggesting separate identities.
Pairs such as Pair-20 send each other private messages to trade and complete a
transaction, suggesting they are business partners, not doppelgängers.
The accounts in Pair-24 do not have any common UINs. They have different
signatures: user-24a uses the signatures ‘LG Carlos’ and ‘Julix’ interchangeably, while
user-24b never uses ‘Carlos’ or ‘Julix’ but sometimes uses ‘mfg’ or ‘DingDong’ at the
end of his messages. User-24a’s products are iPhone, ebay, debit, iTunes cards, drop
service, pack station, and fake money, while user-24b’s products are camera, ps3,
paypal, cards, keys, eplus, games, and perfumes. They do not talk to each other.
Pair-21 is a special case of a False label. User-21a and user-21b belong to group
accounts. User-21a tells user-21b: “You think it is good that they think we are the
same,” because they got a warning from the admins for using the same computer.
In their private messages, they state that they are meeting at each other’s houses in
person for business, which implies that they might be using the same accounts. They
send many messages to customers in which they mention each other’s names.
2.2.7 Automating Forum Analysis
Stylometric methods require a certain amount of training data for accurate analysis.
This limitation requires excluding forum users that have fewer than 5,000 words
of messages from stylometric analysis. There are ways to overcome this limitation and
perform more in-depth forum analysis, such as unsupervised language analysis methods
that generate global vectors for word representations [90] and cluster the words
according to semantic and syntactic similarity [77].
Generating word clusters with the word2vec tool [78] leads to a quick understanding
of the types of products that are being traded in the forums. Word2vec is
a neural network model that processes text without human intervention. It takes
a corpus of sentences as its input and converts the words to n-dimensional vectors
based on word co-occurrences. In this section, word vector representations have 50
dimensions. Given enough input in which related words appear in the same context,
the similarity of these vectors can be measured and the vectors can be clustered
accordingly. These clusters can form the basis of search recommendations. The vectors
can also be used as features for many natural language processing and machine learning
applications.
The word2vec implementation used throughout this section utilizes a continuous
bag-of-words model with negative sampling. Word2vec was trained on the 11,127,050
words of public Carders messages, which had 87,072 unique vocabulary words. For
example, a large cluster (Table 2.11) out of the 1,000 clusters generated from public
Carders messages gives examples of some products, namely illicit drugs, and the
quantities associated with them. The clusters give a quick understanding of the types
of things that are of interest to this community as well as the extent of the
businesses. In addition, the clusters generated from the public and private parts of
the forums are quite similar. In cases of such similarity, gaining an understanding of
a forum is possible through collecting only the public messages.
0,5          1,5g           100G          100g             100g,     10g
150g         1g             2,5g          200g             20g       25g
2g           300g           30g           4,5g             40g       50-100g
500g         50G            50g           5G               5g        5g.
6G           6g             Albi          Amnesia          Arjans    Buds
Chronic      DGC            Falschgeld    G                Gr.       Gra?
Gramm        Gras           Hasch         Hash             Haze      Heroin
Kostprobe    Kristalle      Kurs          Kurs.            MDMA      Material
Mephedron    Meth           Partypaket    Paste            Pep       Pepp
Probe        Probe,         Pulver        Quali            Rabbat    Ritalin
Salvia       Silver         Sorte         Speed            Speed,    Speedpaste
Stammkunde   Streckmittel   Test          Testbestellung   Testen    Testmenge
Testpacket   Top            Verk?ufen     Weed             Weed,     White
XTC          Zeug           faires        gestrecktes      gr        gr.
Table 2.11: A Product and Quantity Cluster from Public Carders Messages
Getting more detailed information about particular products is also possible. Instead
of generating word clusters, the closest words to a particular product can be
retrieved by calculating the cosine similarity between word vectors. Table 2.12 shows
the results of querying the word ‘weed’ and getting the most related words in
descending order of cosine similarity (labeled ‘Distance’ in the table).
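The nearest-word query described above can be sketched in plain Python. This is only an illustration under stated assumptions: the 3-dimensional vectors below are made up (the section uses 50-dimensional vectors trained with word2vec), and the helper names `cosine_similarity` and `closest_words` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(query, vectors, top_n=3):
    """Rank all other words by cosine similarity to the query word, best first."""
    q = vectors[query]
    scored = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Made-up toy vectors, not values from the trained model.
toy_vectors = {
    "weed": [0.9, 0.1, 0.0],
    "gras": [0.8, 0.2, 0.1],
    "paypal": [0.0, 0.1, 0.9],
    "hasch": [0.7, 0.3, 0.0],
}
neighbours = closest_words("weed", toy_vectors)
```

With these toy vectors the drug-related words rank above the unrelated one, which mirrors the kind of ordering shown in Table 2.12.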
Word        Distance    Word        Distance    Word      Distance
zeug        0.575246    paste       0.562277    hasch     0.560486
speed       0.555898    pep         0.538496    pepp      0.523110
g           0.520100    gras        0.517192    kilo      0.512124
weed?       0.511901    haze        0.498430    gramm     0.483961
grass       0.477622    Weed        0.475446    Speed     0.468679
weed.       0.459948    mdma        0.458200    albi      0.457879
pillen      0.456605    kleinzeug   0.455731    10g       0.454204
100g        0.453324    weed,       0.453105    juden     0.451888
xtc         0.449392    testmenge   0.449025    25g       0.438363
kurs        0.434713    50g         0.434571    koka      0.432676
500g        0.428182    quali       0.421362    holland   0.418891
20g         0.414536    speed.      0.413150    sorte     0.407879
stoff       0.404382    hash        0.399687    probe     0.392878
Table 2.12: Words Closest to the Word: ‘weed’
Word vector representations can also be used to automate the manual evaluation
of Doppelgänger Finder results. Word2vec was trained on the 5,604,188 private
Carders messages that had 38,625 vocabulary words. Querying the user pairs that
Doppelgänger Finder outputs, in order of probability, is useful while performing
manual analysis. This approach does not require the analyst to read all user messages
and find user relations with SQL queries. For example, in Pair-2’s case, the user
closest to user-2a is user-2b.
2.2.8 Lessons Learned about Underground Markets
Doppelgänger Finder helped us detect hard-to-detect doppelgänger accounts.
We performed a preliminary analysis on L33tCrew and Blackhat and found results
similar to those in Carders. Our manual analysis of these accounts improves our
understanding of why people create multiple identities in underground forums, either
within or across forums.
Reasons for Creating Multiple Identities
Banning. Getting banned in a forum is one of the main reasons for creating another
account within a forum. Rippers, spammers or multiple account holders get penalized
or banned once the admins become aware of their actions. Users with penalties get
banned once their infraction points go over a certain threshold. There are hundreds
of users within forums that have been banned and they open new accounts to keep
actively participating in the forums. Some of the new accounts get banned again
because the moderators realize that they have multiple accounts, which is a violation
of forum rules.
Sockpuppet. Some forum members create multiple accounts in order to raise de-
mand and start a competition to increase product prices.
Accounts for sale. Some users maintain multiple accounts and try to raise their rep-
utation levels and associate certain accounts with particular products and customers.
Once a certain reputation level is reached, they offer to sell these extra accounts.
Branding. Some users appear to set up multiple accounts to sell different types of
goods. One reason to do this is that if one class of goods is more risky, such as drugs,
the person can be more careful about protecting their actual identity when using that
account. Another reason might be to have each account establish a “brand”
that builds a good reputation selling a single class of goods, such as stolen credit
cards.
Cross-forum accounts. Many users have accounts in more than one forum, potentially
to grow their sales by reaching people who are not present on the same forums and
to purchase goods not offered in a single forum.
Group accounts. In some cases groups of people work together as an organization
and each member is responsible for a specific operation among a variety of products
that are traded across different accounts. How to adapt stylometry algorithms to deal
with multi-authored documents is an open problem that is left as future work.
2.2.9 Lessons Learned about Stylometry
We found that any stylometric method can be used in a particular language by
using a high quality part-of-speech (POS) tagger and the function words of that
language. We have access to one more forum, called BadhackerZ, whose primary language
is Hindi transliterated using English letters. We did not have a POS tagger that could
handle the mixture of these two languages. We were not able to get meaningful
results by applying stylometry to BadhackerZ; therefore, we excluded this forum from
stylometric analysis. Similarly, the Russian POS tagger that was used produced
poor results on our data set. POS tags generally have high information gain in
stylometric analysis and as a result play a crucial role in stylometry. Future work
might involve experimenting with other POS taggers or improving their efficacy by
producing manually annotated samples of forum text.
2.2.10 Doppelgänger Detection by Forum Administrators
One of the primary reasons for banning accounts on these underground forums is
users creating multiple accounts. This shows that forum administrators
are actively looking for these types of accounts and removing them, since they can
be used to undermine underground forums. They use a number of methods, ranging
from automated tools, such as the AE detector, to more manual methods, such as
reports from other members. As we have seen from our analysis, all of these methods
are error prone and result in many false positives and false negatives. Many of the
false positives were probably generated by users using proxies to hide their IP address
and location. In addition, when static tools with defined heuristics (IPs, browser
cookies, etc.) are used to detect doppelgänger accounts, users can take simple
precautions to avoid detection. Many of the accounts detected by Doppelgänger Finder
were not detected by these methods, potentially because the user was actively evading
known detection methods.
2.2.11 Performance
Our method needs to run N classifiers for N authors. Each classifier is independent,
thus can be run in parallel. Using only 4 threads on a quad core Apple laptop, the
underground forum Doppelgänger Finder experiment took around 35 minutes, which
can be made faster with more threads.
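Because the per-author classifiers are independent, they can be dispatched to a worker pool. The sketch below only illustrates that structure: `run_one_classifier` is a hypothetical stand-in that counts words instead of training a classifier, and the toy documents are made up. It uses threads to match the description above; for CPU-bound training in Python, `concurrent.futures.ProcessPoolExecutor` offers the same interface and sidesteps the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one_classifier(author, documents):
    """Hypothetical stand-in for training and applying one author's classifier.

    It returns the author's total word count so that the sketch runs on its own.
    """
    return author, sum(len(doc.split()) for doc in documents[author])


def run_all(documents, workers=4):
    """Run one independent job per author on a pool of worker threads."""
    authors = sorted(documents)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda a: run_one_classifier(a, documents), authors)
        return dict(results)


# Made-up toy data: three authors with a few short documents.
toy_documents = {"a": ["one two", "three"], "b": ["four five six"], "c": [""]}
results = run_all(toy_documents)
```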
2.2.12 Hybrid Doppelgänger Finder Methods
Based on what we have learned from our manual analysis of our Doppelgänger
Finder results on Carders, we could potentially build a hybrid method that integrates
both stylometry and more underground-specific features. For instance, some
of the doppelgänger accounts could be identified with simple regular expressions that
find and match contact information, such as ICQ numbers. In other cases manual
analysis revealed more subtle features: two accounts selling the same uncommon
product or talking about a similar set of topics can be a good indicator that they
are doppelgängers.
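A regular-expression matcher of the kind suggested above might look like the following sketch. The pattern is an assumption made for illustration (the word "icq", up to ten non-digit characters, then five to ten digits), and the function name and toy messages are hypothetical; real forum text would likely need a more tolerant pattern.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed pattern: "icq", up to 10 non-digit characters, then a 5-10 digit number.
ICQ_PATTERN = re.compile(r"icq\D{0,10}(\d{5,10})", re.IGNORECASE)


def shared_icq_pairs(messages_by_user):
    """Return sorted user pairs that mention at least one identical ICQ number."""
    users_by_icq = defaultdict(set)
    for user, messages in messages_by_user.items():
        for message in messages:
            for number in ICQ_PATTERN.findall(message):
                users_by_icq[number].add(user)
    pairs = set()
    for users in users_by_icq.values():
        pairs.update(combinations(sorted(users), 2))
    return sorted(pairs)


# Made-up messages; the numbers are placeholders, not real UINs.
toy_messages = {
    "user_a": ["add me on ICQ: 123456789"],
    "user_b": ["my icq 123456789, write me"],
    "user_c": ["icq 987654321"],
}
pairs = shared_icq_pairs(toy_messages)
```

Pairs found this way could be added as an extra feature alongside the stylometric score rather than replacing it.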
Custom parsers and pattern matchers could be created and combined with our
Doppelgänger Finder tool to improve its results. However, it is difficult to know
a priori what patterns to look for in different domains. Thus, using Doppelgänger
Finder and performing manual analysis would make this task of designing and adding
additional custom tools easier.
2.2.13 Methods to Evade Doppelgänger Finder
There are several limitations to using stylometry to detect doppelgängers. The
most obvious limitation is that our method requires a large number of words from
a single account. A forum member could stop using their account and create a new
one before reaching this amount of text, but as pointed out in Section 2.1.2, parts of
the forum are closed off to new members, so reduced activity is not beneficial to them.
They are often not allowed to engage in commerce until they have paid a fee and
built up a good reputation by posting.
Another way to evade our method is for the author to intentionally change their
writing style to deceive stylometry algorithms. As shown in previous research, this
is a difficult but possible task [21], and tools such as Anonymouth can give hints
as to how to alter writing style to evade stylometry [76]. We do not currently see
any evidence of this technique being used by members of underground forums, but
Anonymouth could be integrated into forums.
2.2.14 Conclusion
Doppelgänger Finder enables easy analysis of a forum for high-value multiple iden-
tities. The analysis of Carders has already produced insight into the use of multiple
identities within these forums. This technique can also be used to detect multiple
identities on non-malicious platforms.
This work also motivates the need for improved privacy enhancing technologies
such as Anonymouth [76] for authors who wish to not have their pseudonymous
writings linked.
2.3 Author Identification in Translated Text
This work was completed by Aylin Caliskan-Islam [27].
Internet scale authorship attribution is able to identify an individual 20% of the
time among 100,000 authors. As the number of authors drops, the chances of being
de-anonymized greatly increase. Someone trying to publish political opinions under
an oppressive regime would not want to be identifiable, which calls for
anonymization techniques. “Anonymouth” is an anonymization framework based on
JStylo, an authorship analysis tool, which was presented in a paper that won the
PETS best student paper award [76]. I helped users anonymize their text with
“Anonymouth” while investigating the suggestion that translating text anonymizes
it. Translations alone do not anonymize text [27]. A set of writing style features
that are independent of language or not affected by translators still captures stylistic
fingerprints. Translations from English to German to Japanese and back to English
do not remove stylometric fingerprints from text as long as the machine translator
has sufficient quality.
2.3.1 Introduction
Authorship attribution is the problem of determining a text’s author, which
can be accomplished by stylometric analysis. This is a serious privacy concern because
it threatens anonymous speech. Authorship attribution can still be achieved in
translated texts using a set of features, indicating that translation does not
obfuscate the authors.
Rao and Rohatgi [98] introduced the idea of translating text to a different
language and then back to its original language using a machine translation tool to
obfuscate a text’s author. Translated text accumulates properties from the machine
translation tool, which is called the translator effect. The translator effect introduces
an extra author to the translated text, which is the machine translation tool itself.
A classifier can be trained to consider the machine translation tools’ footprints to
attribute a translator to translated anonymous text. The translator effect does not
prevent authorship attribution even though the translator introduces new features to
the text.
Machine translators are categorized by the techniques they use to perform translations.
Bing’s (footnote 17) and Google’s (footnote 18) translators both rely on statistical
machine translation. Even when two translators use the same technique, as is the case
with Bing’s and Google’s statistical machine translation, they do not produce the same
output given the same input. Because of these differing translator effects, certain
features can be used to identify the translator that has been used.
2.3.2 Related Work
State-of-the-art stylometry methods can identify individuals in sets of 50 authors
with over 90% accuracy, as shown in Abbasi and Chen’s work [6]. There has not
been much research on identifying the translator effect, translators, and authors in
translated text. Suresh et al. [110] were able to match translated text with the
machine translation tool used to translate the original text.
Hedegaard and Simonsen [57] researched authorship attribution in translated text;
the results in this work outperform theirs. They used classifiers based on frame
semantics in order to discover whether adding semantic features to lexical and
syntactic features would improve authorship attribution. Their studies were conducted
on a corpus that had a limited number of authors from a specific time period and
cultural context, and that had only undergone a one-way translation.
17 http://www.microsofttranslator.com/
18 http://translate.google.com/
2.3.3 Corpus Selection
Data selection is an important step for authorship attribution. Luyckx and Daelemans
[71] show that the number of authors and the amount of text have a big impact
on classification performance. The Brennan-Greenstadt Adversarial Stylometry Corpus
[22] has thirteen authors and a minimum of 5,000 words per author. All writing
samples are written in English by native English speakers. The adversarial stylometry
corpus includes one obfuscation and one imitation document per author besides
the author’s original writing. This corpus is used throughout the translated text
section, after excluding the adversarial documents in order to experiment only with
the original writing styles. The parts of the corpus that were used had thirteen
authors and 126 documents, containing an average of 4,933 words per author and 500
words per document. Forsyth and Holmes [44] show that a minimum of 250 words is
required in a text for authorship attribution. While testing authorship attribution
accuracy on a range of data sets with documents varying from 400 to 600 words, 500
words per document led to the highest accuracy. Accordingly, test documents consisted
of 500-word chunks. The writing samples in the corpus have random topics and therefore
are not content-dependent. Schein and Caver [103] show that attribution markers are
influenced heavily by topics, which affects the authorship attribution rate. Including
varying topics among texts and authors avoids this effect.
In order to create the machine-translated texts, Bing’s and Google’s translators
applied three different sequences of translations to the original corpus. The first
sequence translated the original texts to German and then back to English. The second
sequence translated the original texts to Japanese and then back to English. The
third and last sequence translated the original texts to Japanese, then to German,
and then back to English. Hedegaard and Simonsen used eighteen documents in
their translator attribution experiments. Their corpus consists of English translations
of 19th century Russian romantic literature. Their texts have three
authors and three translators, whereas this work has thirteen authors, two statistical
machine translators, and two or three consecutive translations. The translated
Brennan-Greenstadt Adversarial Stylometry Corpus has a stronger translator effect
due to the two or three consecutive translations, compared to their single
translation from Russian to English. Additionally, the Brennan-Greenstadt Adversarial
Stylometry Corpus consists of modern text on diverse topics written in the 21st
century. As a result, it is more diverse and current.
After performing the two-way translation experiments, feature set validation was
performed on one-way translations. The one-way translations consisted of the work of
two French authors and four Dutch authors from the Ad-hoc Authorship Attribution
Competition (footnote 19) data set. These texts were translated to English by using
Google Translate, Language Weaver, and Systran. Both Language Weaver and Systran are
also statistical machine translation tools. Each Dutch document ranged from 400 to
600 words. The Dutch data set consisted of essays on the same six topics by the four
authors and is therefore topic-dependent. Because of these qualities in the Dutch data
set, the effect of document length and topic-dependency on authorship and translator
attribution can be observed.
French and Dutch belong to different language families, namely Romance and
Germanic, and therefore possess different grammatical structures. This distinction
between the two languages gives the opportunity to compare language family inde-
pendent features in translator and authorship attribution.
19 http://www.mathcs.duq.edu/ juola/authorship_materials2.html
2.3.4 Experiment Design
The experiments had two main categories, namely translator attribution and
authorship attribution. The accuracies obtained from a variety of feature sets were
compared to identify the features leading to the highest accuracy. The experiments
utilized two authorship attribution tools: (1) JGAAP (footnote 20), developed by Juola
et al. [59], and (2) JStylo, a novel framework for authorship attribution that was
developed by McDonald et al. [87]. JGAAP is limited to analysis using one feature at a
time. The majority of experiments used JStylo, which is capable of using a set of
multiple features.
Authorship attribution and translator attribution are supervised classification
problems that rely on correct ground truth authorship or translator information.
Given a document $D$, its translation $D^*$ produced by translator $T$ to another
language and then back to English, and a set of unique authors $A = \{A_1, \ldots, A_n\}$,
where $A_i \neq A_j$ when $i \neq j$, the task is to determine who among the authors
in $A$ wrote $D^*$ and which translator $T$ was used.
Translator Attribution Experiments
The documents in the corpus were preprocessed and normalized by stripping all
non-ASCII and non-printing characters while preserving the whitespace. Two machine
learning classifiers were trained using JGAAP, namely a Naïve Bayesian classifier
and a support vector machine with sequential minimal optimization (SMO) based
on Platt’s [94] method. The classifiers were trained on features such as character
grams, part-of-speech (POS) tags, word grams, word lengths, words, function words,
sentence length, and rare words. The features with the most frequent and the least
frequent events were also calculated. These features were extracted from documents
that were translated to German and then back to English. Translator attribution
accuracy is calculated by using a portion of these documents as training data and the
rest as testing data.
20 http://evllabs.com/jgaap/w
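The normalization step described in this subsection can be sketched as a simple character filter. This is one possible reading of that step, not the code used in the experiments: it keeps printable ASCII characters plus ASCII whitespace and drops everything else, and the name `normalize` is hypothetical.

```python
import string

# Printable ASCII: digits, letters, punctuation, and whitespace such as
# space, tab, and newline.
_KEEP = set(string.printable)


def normalize(text):
    """Drop non-ASCII and non-printing characters; keep whitespace as-is."""
    return "".join(ch for ch in text if ch in _KEEP)


# Made-up sample containing non-ASCII letters and a NUL control character.
cleaned = normalize("Gr\u00fc\u00dfe,\tmfg\x00\n")
```

Note that a filter like this removes accented letters outright rather than transliterating them, which matters little for the English corpus used here.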
There were several experiments carried out with JStylo while using various feature
combinations. The first high accuracy yielding feature set was the ‘9-Feature Set’ used
by Brennan and Greenstadt [22].
The extracted ‘9-Feature Set’ was classified with SMO using a polynomial kernel
by running 10-fold cross-validation. The experiments were performed on a combi-
nation of data sets using the Google (google) or Bing (bing) translations where the
suffixes en, de, and ja correspond to English, German, and Japanese translations,
respectively.
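The 10-fold cross-validation protocol mentioned above partitions the documents into ten folds and tests on each fold in turn while training on the other nine. The sketch below shows only the index bookkeeping, using the 126-document corpus size from Section 2.3.3; it is deliberately simplified (no shuffling or stratification, which a toolkit implementation may apply), and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds.

    Items are dealt to folds round-robin, so fold sizes differ by at most one.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 126 documents split into 10 folds; every document is tested exactly once.
splits = list(k_fold_indices(126, k=10))
```

The reported accuracy would then be the fraction of correctly attributed documents accumulated over all ten test folds.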
The results of these experiments were the attribution of a text as being translated
either by google or bing. Experiments incorporated combinations of features from the
‘9-Feature Set’ and the ‘WritePrints Feature Set’, which is a partial set of features
used by Li et al. [69].
The ‘Translation Feature Set’, shown in Table 2.13, yielded the highest accuracy
among the many feature combinations that were tried; it is therefore the main
feature set throughout the translation experiments.
The ‘function words’ feature consisted of the 512 common function words used by
Koppel et al. [62]. For feature classes with many features, such as character bigrams,
only the top 50 extracted features were used. These features were also classified with
SMO using a polynomial kernel by running 10-fold cross-validation, resulting in a
translator classification of google or bing.
One-way translation translator attribution experiments also used the ‘Translation
Feature Set’ on French and Dutch translations performed with Google Translate,
Language Weaver, and Systran. These experiments used exactly the same methods as the
two-way translation translator attribution experiments.
Translation Feature Set
Average characters per word
Character count
Function words
Letters
Punctuation
Special characters
Top letter bigrams
Top letter trigrams
Words
Word lengths
Table 2.13: Translation Feature Set
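To make a few entries of Table 2.13 concrete, the sketch below computes character count, average characters per word, word lengths, letter frequencies, and top letter bigrams for a text. The exact definitions used by JStylo may differ; whitespace tokenization, lowercasing, and counting bigrams within words are assumptions of this sketch, and `extract_features` is a hypothetical name.

```python
from collections import Counter


def extract_features(text, top_k=50):
    """Compute a handful of simple stylometric features for one document."""
    words = text.split()  # assumption: whitespace tokenization
    letters = [ch.lower() for ch in text if ch.isalpha()]
    bigrams = Counter()
    for word in words:
        # Keep only letters so punctuation does not form bigrams.
        w = "".join(ch.lower() for ch in word if ch.isalpha())
        bigrams.update(w[i:i + 2] for i in range(len(w) - 1))
    avg_chars = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "char_count": len(text),
        "avg_chars_per_word": avg_chars,
        "word_lengths": Counter(len(w) for w in words),
        "letters": Counter(letters),
        "top_letter_bigrams": bigrams.most_common(top_k),
    }


features = extract_features("the cat sat on the mat")
```

In a full pipeline each of these values would become one or more columns of the feature vector handed to the classifier.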
Authorship Attribution Experiments
Authorship attribution experiments followed the same set of experiments as in
translator attribution experiments. The highest accuracy yielding feature set was
again the ‘Translation Feature Set’.
2.3.5 Results and Evaluation
Translator Attribution Results
The results support Hedegaard and Simonsen’s [57] suggestion of combining features
to increase attribution accuracy. Using a single feature at a time produced lower
classification accuracy. JStylo experiments using the ‘Translation Feature Set’ had
on average a 7% higher correct classification rate than experiments using the
‘9-Feature Set’. The results of the JStylo experiments using the ‘Translation Feature
Set’ are shown in Table 2.14.
The translator attribution results showed higher accuracy for Japanese translations
than for German translations. This indicates that Google’s and Bing’s Japanese
translations are less similar than their German translations. Texts which had
undergone the most iterations of translation were classified with the highest accuracy,
supporting our hypothesis that the more consecutive translations are performed on a
text, the stronger the translator footprint becomes. On the combined data set,
translator attribution reached 91.13% accuracy.

Data Set                    Correct Attribution
en_de                       90.87%
en_ja                       98.80%
en_ja_de                    98.81%
en_de & en_ja               90.44%
en_de & en_ja & en_ja_de    91.13%
Table 2.14: ‘Translation Feature Set’ Translator Attribution
Translator Attribution Results in One-way Translations
Data Set              Correct Attribution
french_translators    92.75%
dutch_translators     94.44%
Table 2.15: ‘Translation Feature Set’ Translator Attribution on One-way Translations

The ‘Translation Feature Set’ led to the highest accuracy in attributing translations
to Google Translate, Language Weaver, and Systran, as shown in Table 2.15. All other
feature sets available in JStylo led to lower accuracy than the ‘Translation Feature
Set’.
Authorship Attribution Results
Using a single feature at a time resulted in a correct classification rate close to the
random chance rate of 7.69%. JStylo experiments using the ‘9-Feature Set’ had on
average a 16% lower correct classification rate than experiments using the ‘Translation
Feature Set’ of Table 2.13, whose results are shown in Table 2.16. The original writing
samples, labeled original_text in Table 2.16, were classified with 97.62% accuracy.
Data Set            Correct Attribution
en_de_bing          96.83%
en_de_google        97.62%
en_ja_bing          100.00%
en_ja_google        89.68%
en_ja_de_bing       77.78%
en_ja_de_google     87.30%
all_bing            91.54%
all_google          91.53%
all_translations    91.54%
original_text       97.62%
Table 2.16: ‘Translation Feature Set’ Authorship Attribution
Combining several features when training a classifier led to higher accuracy than
using a single feature for authorship attribution in translated text, as was the case for
translator attribution. Hedegaard and Simonsen argue that “[f]or translated texts, a
combined method of frequent words and frames can outperform methods based solely
on traditional markers, on translated texts.” The results here outperform Hedegaard
and Simonsen’s results using only the traditional markers in the ‘Translation Feature
Set’ shown in Table 2.13, without context-related features such as semantic frames.
High attribution accuracy is achieved despite an increased translator effect in the
corpus, which contains documents produced by consecutive translations through
different languages. An author can be identified with 91.54% accuracy on average,
compared to Hedegaard and Simonsen’s average authorship attribution accuracy of
75.27% using their proposed feature set.
Hedegaard and Simonsen also suggest that if semantic markers are not used,
authorship attribution may not be possible because of the translator footprint. The data
set with the most translation iterations was affected three times by the translator and
had the lowest authorship attribution accuracy, which supports Hedegaard and
Simonsen’s argument. A broader study of translator attribution and authorship
attribution in translated text that includes semantic features would be warranted if
accuracy continues to decrease as the number of consecutive translations of a single
document increases. The results of such a study will depend on the translator’s
ability to preserve semantics.
After identifying the ‘Translation Feature Set’ shown in Table 2.13 as the set that
yields the highest accuracy for both translator and authorship attribution, WEKA [54]
was used to calculate the information gain of the features. The comparison of the
effectiveness of these features in translator attribution vs. authorship attribution of
translated text is shown in Figure 2.6.
Translator-dependent vs. preserved stylometric features in translated text can be
distinguished from the results shown in Figure 2.6.
The preserved stylometric features, in descending order of effect, are mainly top
letter trigrams, words, and top letter bigrams, and to a lesser extent function words,
letters, and word lengths. Character count and characters per word had little effect,
while punctuation and special characters had no effect. Translator-dependent features,
in descending order of effect, are mainly words, top letter trigrams, function words,
and top letter bigrams, and to a lesser extent letters and word lengths. Character
count, characters per word, punctuation, and special characters had little effect.

Figure 2.6: Comparison of the Effectiveness of ‘Translation Feature Set’ in Translator Attribution vs. Authorship Attribution
The comparison shown in Figure 2.6 demonstrates that ‘function words’ are
translator-effect-heavy but less important for authorship attribution. Hedegaard and
Simonsen also argue that the impact of the translator may add sufficient noise to render
authorship attribution in translated text very difficult. If so, excluding a
translator-effect-heavy feature such as ‘function words’ should improve authorship
attribution in translated text. To test this claim, an additional experiment was run
with function words excluded from the feature set. The results of this experiment
are shown in Figure 2.7.
The results shown in Figure 2.7 demonstrate that excluding the translator-effect-heavy
feature ‘function words’ does not improve authorship attribution. In fact,
there is a noticeable decrease in the correct classification rate when ‘function words’
are excluded, suggesting that translator-effect-heavy features do not interfere with and
can actually aid accurate authorship attribution. However, a deeper study of the
effects of such features is necessary to arrive at a clearer conclusion.

Figure 2.7: Comparison of Authorship Attribution Using the ‘Translation Feature Set’ and Excluding Function Words
Authorship Attribution Results in One-way Translations
The ‘Translation Feature Set’ led to the highest accuracy in attributing French
and Dutch authors using their texts translated to English, as shown in
Table 2.17. All other feature sets available in JStylo led to lower accuracy
than the ‘Translation Feature Set’. Authorship attribution accuracy for Dutch
authors is substantially lower than for all other authors.

Data Set                 Correct Attribution
french_google            100.00%
french_languageweaver    100.00%
french_systran           100.00%
dutch_google             60.42%
dutch_languageweaver     70.83%
dutch_systran            75.00%
Table 2.17: ‘Translation Feature Set’ Authorship Attribution on One-way Translations

As described in the ‘Corpus selection’ section, the Dutch data set has different
qualities than the data sets of the other languages. Firstly, the documents vary in
size between 400 and 600 words, whereas the documents of the other data sets
are closer to 500 words, which is the optimum length of a document for authorship
attribution purposes. Additionally, the essays from each author are on the same six
topics: the TV show ‘Big Brother’, smoking, football, a children’s story, ‘Red Riding
Hood’, and a historical tale. As a result, word choices and sentences across the
authors’ essays are very similar. This topic dependency makes authorship attribution
a more difficult task.
Afroz et al. [8] show that authorship attribution is inhibited if an author is trying
to imitate another author. Since all the Dutch authors’ essays are about existing
stories or cases, they may have been influenced by a dominating style. Such an effect
may cause some degree of imitation and make authorship attribution difficult. In
addition, character count is one of the features in the ‘Translation Feature Set’, and
it depends on the length of the document. Because document sizes vary in the Dutch
data set, this feature is misleading there.
2.3.6 Conclusion
Machine translation tools leave an effect on translated text that allows the tool
used to translate the text to be identified. Authorship attribution of translated text
remains successful despite this translator effect. Translated texts preserve some of
the original texts’ stylometric features. The more translation iterations a text goes
through, the less these stylometric features are preserved. Machine translation tool
attribution and authorship attribution share similar effective features, albeit with
differing importance. These features need to be present for both translator attribution
and authorship attribution: even though a given feature may be more effective for one
task than the other, removing it decreases attribution accuracy.
2.4 Author Identification in Source Code
This work was completed by Aylin Caliskan-Islam with support from Richard Harang,
Andrew Liu, Arvind Narayanan, Clare Voss, and Fabian Yamaguchi [29].
2.4.1 Introduction
Can the authors of source code be identified automatically through features of their
programming style? Do they leave coding “footprints”? Holding important implica-
tions for protecting intellectual property as well as for identifying malware authors
and tracking how malware spreads and evolves, this question spurred a cross-cutting
project involving natural language processing and machine learning. Code stylometry
requires features unique to coding and to the programming language. Source code has
different properties than common writing, such as the lineage, keywords, comments,
the way functions and variables are created, and the grammar of the program. These
properties can be used to create a numeric representation of a programmer’s coding
style.
Source code authorship attribution has strong privacy and security implications.
Contributors to open-source projects may hide their identity whether they are Bit-
coin’s creator or just a programmer who does not want her employer to know about
her side activities. They may live in a regime that prohibits certain types of software,
such as censorship circumvention tools. For example, an Iranian programmer was
sentenced to death in 2012 for developing photo sharing software that was used on
pornographic websites [117].
On the other hand, source code authorship attribution may be helpful in a forensic
context, such as detection of ghostwriting, a form of plagiarism, and investigation of
copyright disputes. Malware authors, who leave source code in a breached system,
can also be de-anonymized.
Source code authorship attribution has been studied previously. This work repre-
sents a qualitative advance over the state-of-the-art by showing that Abstract Syntax
Trees (ASTs) carry authorial ‘fingerprints.’ The highest accuracy achieved in the lit-
erature is 97%, but this is achieved on a set of only 30 programmers and furthermore
relies on using programmer comments and larger amounts of training data [48; 46].
This work matches this accuracy on small programmer sets without this limitation.
The largest scale experiments in the literature use 46 programmers and achieve 67.2%
accuracy [39]. This work handles more than an order of magnitude more programmers
(1,600) while using less training data, with 93.61% accuracy. Furthermore, the coding
style features are not trivial to obfuscate. The accuracy remains high when commercial
obfuscators are used to anonymize source code. While abstract syntax trees can be
obfuscated to an extent, doing so incurs significant overhead and maintenance costs.
Contributions. First, syntactic features are used to represent coding style. Ex-
tracting such features requires parsing of incomplete source code using a fuzzy parser
to generate an abstract syntax tree. These features add a component to code stylom-
etry that has so far remained almost completely unexplored. These features are more
fundamental and harder to obfuscate. The complete feature set consists of a com-
prehensive set of around 120,000 layout-based, lexical, and syntactic features. With
this complete feature set, a dramatic increase is achieved in accuracy compared to
previous work. Second, the method scales to 1,600 programmers without losing much
accuracy. Third, this method is not specific to C or C++, and can be applied to any
programming language.
C++ source code from thousands of contestants was collected from the annual
international competition “Google Code Jam”. A random forest, which is a bagging
(a portmanteau of “bootstrap aggregating”) classifier, was used to attribute source
code to programmers. The classifiers reach 98% accuracy in a 250-class closed world
task, 94% accuracy in a 1,600-class closed world task, and 100% accuracy on average
in a two-class task. Finally, an analysis of various attributes of programmers, types
of programming tasks, and types of features that appear to influence the success of
attribution is presented. The most important 928 features out of 120,000 are
identified; 44% of them are syntactic, 1% are layout-based, and the rest are lexical.
Eight training files with an average of 70 lines of code are sufficient for training when
using the lexical, layout, and syntactic features. Programmers with a greater skill set
are more easily identifiable than less advanced programmers, and a programmer’s
coding style is more distinctive in implementations of difficult tasks than of
easier tasks.
2.4.2 Problem Statement
We consider an analyst interested in determining the programmer of an anonymous
fragment of source code purely based on its coding style. To do so, the analyst only
has access to source code samples with labels of their programmers from a set of
candidate programmers, as well as from zero or more unrelated programmers. The
analyst determines the programmer of an anonymous fragment of source code by
converting each labeled sample into a numerical feature vector, in order to train a
machine learning classifier, that can subsequently be used to determine the code’s
programmer. This abstract problem formulation captures the following five settings
and corresponding applications (see Table 2.18).
Programmer De-anonymization. In this scenario, the analyst is interested in
determining the identity of an anonymous programmer. For example, if she has a set
of programmers who she suspects might be Bitcoin’s creator, Satoshi, and samples of
source code from each of these programmers, she could use the initial versions of Bit-
coin’s source code to try to determine Satoshi’s identity. Of course, this assumes that
Satoshi did not make any attempts to obfuscate his or her coding style. Given a set of
probable programmers, this is considered a closed-world machine learning task with
multiple classes where anonymous source code is attributed to a programmer. This
is a threat to privacy for open source contributors who wish to remain anonymous.
Software Forensics. In software forensics, the analyst assembles a set of candidate
programmers based on previously collected malware samples or online code reposi-
tories. Unfortunately, she cannot be sure that the anonymous programmer is one of
the candidates, making this an open world classification problem as the test sample
might not belong to any known category.
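One common way to handle the open world setting is to reject an attribution when the classifier is not confident enough. The sketch below illustrates that idea only; the threshold value, the use of per-candidate vote fractions, and the function name are assumptions made for illustration and are not taken from the text above.

```python
def attribute_open_world(vote_fractions, threshold=0.5):
    """Return the top candidate, or None when no candidate is convincing.

    vote_fractions maps each candidate programmer to the share of votes
    (for example, of trees in an ensemble) received for a test sample.
    The threshold is a hypothetical tuning parameter.
    """
    best = max(vote_fractions, key=vote_fractions.get)
    if vote_fractions[best] >= threshold:
        return best
    return None  # treat the sample as written by an unknown programmer


confident = attribute_open_world({"candidate_a": 0.7, "candidate_b": 0.3})
uncertain = attribute_open_world(
    {"candidate_a": 0.40, "candidate_b": 0.35, "candidate_c": 0.25}
)
```

A higher threshold rejects more samples as unknown, trading missed attributions for fewer false ones.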
Ghostwriting Detection. Ghostwriting detection is related to but different from
traditional plagiarism detection. In plagiarism detection, the analyst has a suspicious
piece of code and one or more candidate pieces of code that the suspicious code may
have been plagiarized from. That is a well-studied problem, typically solved using
code similarity metrics, as implemented by widely used tools such as MOSS [12],
JPlag [95], and Sherlock [93].
For example, a professor may want to determine whether a student’s program-
ming assignment has been written by a student who has previously taken the class.
Unfortunately, even though submissions of the previous year are available, the as-
signments may have changed considerably, rendering code-similarity based methods
ineffective. Luckily, stylometry can be applied in this setting—the professor finds the
most stylistically similar piece of code from the previous year’s corpus and brings
both students in for gentle questioning. Given the limited set of students, this can
be considered a closed-world machine learning problem.
Copyright Investigation. Theft of code often leads to copyright disputes. Informal
arrangements of hired programming labor are very common, and in the absence
of a written contract, someone might claim a piece of code was her own after it was
developed for hire and delivered. A dispute between two parties is thus a two-class
classification problem, where labeled code from both programmers is available to the
forensic expert.
Authorship Verification. Finally, the analyst might suspect that a piece of code
was not written by the claimed programmer, but has no leads on who the actual
programmer might be. This is the authorship verification problem. This problem
can be modeled with the textbook approach as a two-class problem where positive
examples come from previous works of the claimed programmer and negative exam-
ples come from randomly selected unrelated programmers. Alternatively, anomaly
detection could be employed in this setting, e.g., using a one-class support vector
machine [109].
As an example, a recent investigation conducted by Verizon (footnote 21) into a US
company’s anomalous virtual private network traffic revealed an employee who was
outsourcing her work to programmers in China. In such cases, training a classifier on
the employee’s original code and that of random programmers, and subsequently
testing pieces of recent code, could demonstrate whether the employee was the actual
programmer.
In each of these applications, the adversary may try to actively modify the
program’s coding style. In the software forensics application, the adversary tries to
modify code she has written to hide her style. In the copyright and authorship
verification applications, the adversary modifies code written by another programmer
to match her own style. Finally, in the ghostwriting application, two of the parties may
collaborate to modify the style of code written by one to match the other’s style.

21 http://www.cnn.com/2013/01/17/business/us-outsource-job-china/
Application                Learner       Comments
De-anonymization           Multiclass    Closed world
Software forensics         Multiclass    Open world
Plagiarism detection       Multiclass    Closed world
Copyright investigation    Two-class
Authorship verification    Two-class     One-class also possible
Table 2.18: Overview of Applications for Code Stylometry
2.4.3 De-anonymizing Programmers
De-anonymizing programmers is possible through creating a machine learning clas-
sifier to automatically identify the most likely author of an anonymous source code
fragment. The success of machine learning methods in de-anonymizing programmers
depends on how accurately the feature set represents the properties of coding style.
A fuzzy AST parser, explained in section 2.4.3, parses source code to generate ASTs
that reflect programming language use. Section 2.4.3 details the types of extracted
features. Finally, the extracted features are used to train a random forest classifier
for de-anonymizing the programmers of anonymous source code files.
Fuzzy Abstract Syntax Trees
To date, methods for source code authorship attribution focus mostly on sequential
feature representations of code such as byte-level and feature level n-grams [47; 24].
While these models are well suited to capture naming conventions and preference of
keywords, they are entirely language agnostic and thus cannot model author charac-
teristics that become apparent only in the composition of language constructs. For
example, an author’s tendency to create deeply nested code, unusually long functions
or long chains of assignments cannot be modeled using n-grams alone.
 1  int foo()
 2  {
 3    if ((x < 0) || x > MAX)
 4      return -1;
 5
 6    int ret = bar(x);
 7    if (ret != 0)
 8      return -1;
 9    else
10      return 1;
11  }
Figure 2.8: Sample Code Listing

[Tree diagram rooted at a Function node (int, foo, CompoundStmt), with If, Decl,
Condition, Return, Else, Or, RelExpr, EqExpr, UnaryOp, Assign, Call, and Args nodes
above the identifier and literal leaves.]
Figure 2.9: Corresponding Abstract Syntax Tree

Addressing these limitations requires source code to be parsed. Unfortunately,
parsing C/C++ code using traditional compiler front-ends is only possible for complete
code, i.e., source code where all identifiers can be resolved. This severely limits
their applicability in the setting of authorship attribution, as it prohibits analysis of
lone functions or code fragments, which is possible with simple n-gram models.
As a compromise, source code preprocessing employs the fuzzy parser Joern that
has been designed specifically with incomplete code in mind [119]. Where possible,
the parser produces abstract syntax trees for code fragments while ignoring fragments
that cannot be parsed without further information. The produced syntax trees form
the basis of the feature extraction procedure.
Consider the function foo in Figure 2.8, and a simplified version of the function’s
corresponding abstract syntax tree in Figure 2.9. The function contains a number
of common language constructs found in many programming languages, such as
if-statements (lines 3 and 7), return-statements (lines 4, 8, and 10), and function call
expressions (line 6). For each of these constructs, the abstract syntax tree contains a
corresponding node. While the leaves of the tree make classical syntactic features such
as keywords, identifiers, and operators accessible, the inner nodes listed in Table 2.19
represent operations showing how these basic elements are combined to form
expressions and statements. In effect, the nesting of language constructs can also be
analyzed to obtain a feature set representing the code’s structure.
Feature Extraction
Analyzing coding style using machine learning approaches is not possible without
a suitable representation of source code that clearly expresses coding style. To address
this problem, the Code Stylometry Feature Set (CSFS), a novel representation of
source code, is developed specifically for code stylometry. This feature set combines
three types of features: lexical features, layout features, and syntactic features.
Lexical and layout features are obtained from source code, while the syntactic features
can only be obtained from ASTs.
Lexical and Layout Features
AdditiveExpression       AndExpression           Argument
ArgumentList             ArrayIndexing           AssignmentExpr
BitAndExpression         BlockStarter            BreakStatement
Callee                   CallExpression          CastExpression
CastTarget               CompoundStatement       Condition
ConditionalExpression    ContinueStatement       DoStatement
ElseStatement            EqualityExpression      ExclusiveOrExpression
Expression               ExpressionStatement     ForInit
ForStatement             FunctionDef             GotoStatement
Identifier               IdentifierDecl          IdentifierDeclStatement
IdentifierDeclType       IfStatement             IncDec
IncDecOp                 InclusiveOrExpression   InitializerList
Label                    MemberAccess            MultiplicativeExpression
OrExpression             Parameter               ParameterList
ParameterType            PrimaryExpression       PtrMemberAccess
RelationalExpression     ReturnStatement         ReturnType
ShiftExpression          Sizeof                  SizeofExpr
SizeofOperand            Statement               SwitchStatement
UnaryExpression          UnaryOp                 UnaryOperator
WhileStatement
Table 2.19: Abstract Syntax Tree Node Types

Numerical features that express preferences for certain identifiers and keywords, as
well as some statistics on the use of functions or the nesting depth, are extracted from
the source code. Lexical and layout features can be calculated from the source code
without access to a parser, using only basic knowledge of the programming language
in use. For example, the number of functions per source line shows the programmer’s
preference for longer or shorter functions. Table 2.20 gives an overview of lexical
features.
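A few of the lexical features can be computed with nothing more than string and regular-expression operations, as the following sketch illustrates. It is not the extraction code used in this work: the regular expressions are naive (they also match keywords inside comments and string literals), the feature naming is hypothetical, and returning None for a zero count is an assumption, since the handling of ln(0) is not specified here.

```python
import math
import re
import statistics

# Naive patterns for the seven control keywords; "if" and "else" also
# match inside "else if", so the counts overlap in this simple sketch.
KEYWORD_PATTERNS = {
    "do": r"\bdo\b",
    "else-if": r"\belse\s+if\b",
    "if": r"\bif\b",
    "else": r"\belse\b",
    "switch": r"\bswitch\b",
    "for": r"\bfor\b",
    "while": r"\bwhile\b",
}


def log_ratio(count, length):
    """ln(count / length); None for a zero count (an assumption)."""
    return math.log(count / length) if count > 0 else None


def lexical_features(source):
    """Compute a small subset of lexical features for one source file."""
    length = len(source)  # file length in characters
    line_lengths = [len(line) for line in source.splitlines()]
    features = {}
    for name, pattern in KEYWORD_PATTERNS.items():
        occurrences = len(re.findall(pattern, source))
        features[f"ln(num_{name}/length)"] = log_ratio(occurrences, length)
    # Counting "?" approximates the number of ternary operators.
    features["ln(numTernary/length)"] = log_ratio(source.count("?"), length)
    features["avgLineLength"] = statistics.mean(line_lengths)
    features["stdDevLineLength"] = statistics.pstdev(line_lengths)
    return features


sample = (
    "int foo()\n{\n  if ((x < 0) || x > MAX)\n    return -1;\n"
    "  return x > 1 ? 1 : 0;\n}\n"
)
feats = lexical_features(sample)
```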
In addition, layout features that represent code indentation are extracted. For
example, layout features determine whether the majority of indented lines begin with
spaces or tab characters, and the ratio of whitespace to non-whitespace characters.
Table 2.21 gives a detailed description of these features.
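The layout features can likewise be derived directly from the raw text of a file. The sketch below is an illustration under simplifying assumptions: only spaces, tabs, and newlines count as whitespace, a brace is treated as preceded by a newline when it is the first non-blank character on its line, and the input is assumed to be non-empty. The function and key names are hypothetical.

```python
import math


def log_ratio(count, length):
    """ln(count / length); None for a zero count (an assumption)."""
    return math.log(count / length) if count > 0 else None


def layout_features(source):
    """Compute a subset of layout features for one non-empty source file."""
    length = len(source)
    lines = source.splitlines()

    # Lines whose first character is a space or a tab count as indented.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    tab_led = sum(1 for ln in indented if ln[0] == "\t")

    whitespace = sum(1 for ch in source if ch in " \t\n")
    non_whitespace = length - whitespace

    open_braces = source.count("{")
    braces_on_new_line = sum(1 for ln in lines if ln.lstrip().startswith("{"))

    return {
        "ln(numTabs/length)": log_ratio(source.count("\t"), length),
        "ln(numSpaces/length)": log_ratio(source.count(" "), length),
        "whiteSpaceRatio": whitespace / non_whitespace,
        "newLineBeforeOpenBrace": braces_on_new_line > open_braces / 2,
        "tabsLeadLines": tab_led > len(indented) / 2,
    }


sample = "int foo()\n{\n\treturn 1;\n}\n"
layout = layout_features(sample)
```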
Feature                    Definition                                          Count
WordUnigramTF              Term frequency of word unigrams in source code      dynamic*
ln(numkeyword/length)      Log of the number of occurrences of keyword         7
                           divided by file length in characters, where
                           keyword is one of do, else-if, if, else, switch,
                           for, or while
ln(numTernary/length)      Log of the number of ternary operators divided      1
                           by file length in characters
ln(numTokens/length)       Log of the number of word tokens divided by file    1
                           length in characters
ln(numComments/length)     Log of the number of comments divided by file       1
                           length in characters
ln(numLiterals/length)     Log of the number of string, character, and         1
                           numeric literals divided by file length in
                           characters
ln(numKeywords/length)     Log of the number of unique keywords used           1
                           divided by file length in characters
ln(numFunctions/length)    Log of the number of functions divided by file      1
                           length in characters
ln(numMacros/length)       Log of the number of preprocessor directives        1
                           divided by file length in characters
nestingDepth               Highest degree to which control statements and      1
                           loops are nested within each other
branchingFactor            Branching factor of the tree formed by              1
                           converting code blocks of files into nodes
avgParams                  The average number of parameters among all          1
                           functions
stdDevNumParams            The standard deviation of the number of             1
                           parameters among all functions
avgLineLength              The average length of each line                     1
stdDevLineLength           The standard deviation of the character lengths     1
                           of each line
*About 55,000 for 250 authors with 9 files.
Table 2.20: Lexical Features
Syntactic Features
The syntactic feature set describes the properties of the language-dependent abstract
syntax tree and the keywords listed in Table 2.23. Calculating these features
requires access to an abstract syntax tree. All of these features are invariant to
changes in source code layout and comments.
Feature                      Definition                                        Count
ln(numTabs/length)           Log of the number of tab characters divided by    1
                             file length in characters
ln(numSpaces/length)         Log of the number of space characters divided     1
                             by file length in characters
ln(numEmptyLines/length)     Log of the number of empty lines divided by       1
                             file length in characters, excluding leading
                             and trailing lines between lines of text
whiteSpaceRatio              The ratio between the number of whitespace        1
                             characters (spaces, tabs, and newlines) and
                             non-whitespace characters
newLineBeforeOpenBrace       A boolean representing whether the majority of    1
                             code-block braces are preceded by a newline
                             character
tabsLeadLines                A boolean representing whether the majority of    1
                             indented lines begin with spaces or tabs
Table 2.21: Layout Features

Table 2.22 gives an overview of the syntactic features. These features are generated
by preprocessing all C++ source files in the data set to produce their abstract syntax
trees. An abstract syntax tree is created for each function in the code. There are 58
node types in the abstract syntax tree, listed in Table 2.19, produced by Joern [120].
The AST node bigrams are the most discriminating features of all. AST node
bigrams are two AST nodes that are connected to each other. In most cases, when
used alone, they provide similar classification results to using the entire feature set.
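Node bigrams can be collected with a single traversal that records each parent-child pair. The sketch below uses a hand-built tree represented as nested (node type, children) tuples; this representation and the simplified tree for an if/else statement are assumptions for illustration and do not reflect Joern's output format.

```python
from collections import Counter

# Hypothetical, simplified AST for: if (ret != 0) return -1; else return 1;
# Each node is a (node_type, children) tuple.
tree = ("IfStatement", [
    ("Condition", [("EqualityExpression", [])]),
    ("ReturnStatement", [("UnaryOp", [])]),
    ("ElseStatement", [("ReturnStatement", [])]),
])


def node_bigrams(node):
    """Yield a (parent_type, child_type) pair for every edge in the tree."""
    label, children = node
    for child in children:
        yield (label, child[0])
        yield from node_bigrams(child)


# Term frequency of each connected pair of node types.
bigram_tf = Counter(node_bigrams(tree))
```

Each distinct pair becomes one feature, which is why the number of bigram features depends on the data set.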
The term frequency (TF) is the raw frequency of a node found in the abstract
syntax tree of each source code file. The term frequency inverse document frequency
(TFIDF) of nodes is calculated by multiplying the term frequency of a node by inverse
document frequency. The goal in using the inverse document frequency is normalizing
the term frequency by the number of authors actually using that particular type of
node. The inverse document frequency is calculated by dividing the number of authors
in the data set by the number of authors that use that particular node. Consequently,
the rarity of a node is captured and the feature is weighted according to its rarity.
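The weighting described above can be sketched as follows. The inverse document frequency is computed as the plain ratio stated in the text (number of authors divided by number of authors using the node), without the logarithm that many TF-IDF variants apply. The author names and node counts are hypothetical.

```python
from collections import Counter

# Hypothetical node-type counts per source file, tagged with the author.
files = [
    ("alice", Counter({"IfStatement": 3, "ForStatement": 2})),
    ("alice", Counter({"IfStatement": 1, "GotoStatement": 1})),
    ("bob", Counter({"IfStatement": 2, "WhileStatement": 4})),
    ("carol", Counter({"IfStatement": 5, "ForStatement": 1})),
]


def inverse_document_frequency(files):
    """IDF per node type: authors in the data set / authors using the node."""
    authors = {author for author, _ in files}
    users = {}
    for author, tf in files:
        for node in tf:
            users.setdefault(node, set()).add(author)
    return {node: len(authors) / len(who) for node, who in users.items()}


idf = inverse_document_frequency(files)

# TFIDF of a node in a file is its raw term frequency times its IDF.
tfidf = [
    {node: count * idf[node] for node, count in tf.items()}
    for _, tf in files
]
```

A node type used by every author, such as IfStatement here, keeps a weight of 1, while a node type used by a single author is scaled up by the number of authors.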
Feature                  Definition                                        Count
MaxDepthASTNode          Maximum depth of an AST node                      1
ASTNodeBigramsTF         Term frequency of AST node bigrams                dynamic*
ASTNodeTypesTF           Term frequency of 58 possible AST node types      58
                         excluding leaves
ASTNodeTypesTFIDF        Term frequency inverse document frequency of      58
                         58 possible AST node types excluding leaves
ASTNodeTypeAvgDep        Average depth of 58 possible AST node types       58
                         excluding leaves
cppkeywords              Term frequency of 84 C++ keywords                 84
CodeInASTLeavesTF        Term frequency of code unigrams in AST leaves     dynamic**
CodeInASTLeavesTFIDF     Term frequency inverse document frequency of      dynamic**
                         code unigrams in AST leaves
CodeInASTLeavesAvgDep    Average depth of code unigrams in AST leaves      dynamic**
*About 45,000 for 250 authors with 9 files.
**About 7,000 for 250 authors with 9 files.
**About 4,000 for 150 authors with 6 files.
**About 2,000 for 25 authors with 9 files.
Table 2.22: Syntactic Features

The maximum depth of an abstract syntax tree reflects the deepest level at which a
programmer nests a node in the solution. The average depth of the AST nodes shows
how nested or deep a programmer tends to use particular structural pieces. Lastly,
the term frequency of each C++ keyword is calculated. Each of these features is
written to a feature vector that represents the solution file of a specific author, and
these vectors are later used for training and testing by machine learning classifiers.
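The depth-based features can be sketched with a recursive traversal that records the depth at which each node type occurs. The tree below is a hand-built, simplified example using nested (node type, children) tuples, and counting the root as depth 0 is an assumption; neither is taken from the tooling used in this work.

```python
from collections import defaultdict

# Hypothetical, simplified AST of a small function.
tree = ("FunctionDef", [
    ("CompoundStatement", [
        ("IfStatement", [
            ("Condition", []),
            ("ReturnStatement", []),
        ]),
        ("ReturnStatement", []),
    ]),
])


def depths_by_type(node, depth=0, acc=None):
    """Collect, for each node type, the depths at which it occurs."""
    acc = defaultdict(list) if acc is None else acc
    label, children = node
    acc[label].append(depth)
    for child in children:
        depths_by_type(child, depth + 1, acc)
    return acc


depths = depths_by_type(tree)

# Maximum depth of any node, and the average depth per node type.
max_depth = max(d for ds in depths.values() for d in ds)
avg_depth = {label: sum(ds) / len(ds) for label, ds in depths.items()}
```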
2.4.4 Classification
Using the feature set presented in the previous section, source code can be ex-
pressed as numerical vectors, making them accessible to machine learning algorithms.
After feature extraction, feature selection is performed to train a random forest clas-
sifier capable of identifying the most likely author of a code fragment.
alignas             alignof          and             and_eq       asm
auto                bitand           bitor           bool         break
case                catch            char            char16_t     char32_t
class               compl            const           constexpr    const_cast
continue            decltype         default         delete       do
double              dynamic_cast     else            enum         explicit
export              extern           false           float        for
friend              goto             if              inline       int
long                mutable          namespace       new          noexcept
not                 not_eq           nullptr         operator     or
or_eq               private          protected       public       register
reinterpret_cast    return           short           signed       sizeof
static              static_assert    static_cast     struct       switch
template            this             thread_local    throw        true
try                 typedef          typeid          typename     union
unsigned            using            virtual         void         volatile
wchar_t             while            xor             xor_eq
Table 2.23: C++ Keywords
Feature Selection
Due to the heavy use of unigram term frequency and TF/IDF measures, and the
diversity of individual terms in the code, the resulting feature vectors are extremely
large and sparse, consisting of tens of thousands of features for hundreds of classes.
The dynamic Code Stylometry Feature Set, for example, produced close to 120,000
features for 250 authors with 9 solution files each.
In many cases, such feature vectors can lead to overfitting (where a rare term, by
chance, uniquely identifies a particular author). Extremely sparse feature vectors can
also damage the accuracy of random forest classifiers, as the sparsity may result in
large numbers of zero-valued features being selected during the random subsampling
of the features to select a best split. This reduces the number of ‘useful’ splits that can
be obtained at any given node, leading to poorer fits and larger trees. Large, sparse
feature vectors can also lead to slowdowns in model fitting and evaluation, and are
often more difficult to interpret. By selecting a smaller number of more informative
features, the sparsity in the feature vector can be greatly reduced, thus allowing the
classifier to both produce more accurate results and fit the data faster.
The feature selection step is carried out using WEKA’s information gain [97] criterion,
which evaluates the difference between the entropy of the distribution of classes and
the entropy of the conditional distribution of classes given a particular feature:
IG(A,Mi) = H(A)−H(A|Mi) (2.4)
where A is the class corresponding to an author, H is Shannon entropy, and Mi
is the ith feature of the data set. Intuitively, the information gain can be thought of
as measuring the amount of information that the observation of the value of feature
i gives about the class label associated with the example.
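As a minimal sketch of Equation 2.4, assuming each feature takes discrete values, the information gain of a single feature can be computed as follows. The function names and the toy data are illustrative only and are not part of WEKA.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy H (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(authors, feature_values):
    """IG(A, M_i) = H(A) - H(A | M_i) for one discrete feature M_i."""
    groups = defaultdict(list)
    for author, value in zip(authors, feature_values):
        groups[value].append(author)
    total = len(authors)
    # H(A | M_i): entropy of the authors within each feature value,
    # weighted by how often that value occurs.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(authors) - conditional

# Toy data (hypothetical): four files written by two authors.
authors = ["alice", "alice", "bob", "bob"]
uses_tabs = [1, 1, 0, 0]   # separates the two authors perfectly
has_main = [1, 1, 1, 1]    # constant, so it carries no information

print(information_gain(authors, uses_tabs))
print(information_gain(authors, has_main))
```

A feature such as `has_main`, whose value never changes, leaves the class distribution untouched and therefore receives zero information gain.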
To reduce the total size and sparsity of the feature vector, only the features that
individually had non-zero information gain were retained. These features are referred to
as IG-CSFS throughout the rest of the document. Note that, as H(A|Mi) ≤ H(A),
information gain is always non-negative. While the use of information gain on a
variable-per-variable basis implicitly assumes independence between the features with
respect to their impact on the class label, this conservative approach to feature selec-
tion means that the features in use have demonstrable value in classification.
To validate this approach to feature selection, feature selection was applied to two
distinct sets of source code files. Sets of features with non-zero information gain were
nearly identical between the two sets, and the ranking of features was substantially
similar between the two. This suggests that the application of information gain to
feature selection is producing a robust and consistent set of features (see Section 3.7
for further discussion). All the results are calculated by using CSFS and IG-CSFS.
Using IG-CSFS on all experiments demonstrates how these features generalize to
different data sets that are larger in magnitude. One other advantage of IG-CSFS is
that it consists of a few hundred features that result in non-sparse feature vectors.
Such a compact representation of coding style makes de-anonymizing thousands of
programmers possible in minutes.
Random Forest Classification
A random forest ensemble classifier [20] is used as the main classifier for authorship
attribution. Random forests are inherently multi-class classifiers and they do not
assume any linear separability in data. They learn nonlinear boundaries. Random
forests are ensemble learners built from collections of decision trees, each of which is
grown by randomly sampling N training samples with replacement, where N is the
number of instances in the data set. To reduce correlation between trees, features
are also subsampled; commonly (logM) + 1 features are selected at random (without
replacement) out of M , and the best split on these (logM) + 1 features is used
to split the tree nodes. The number of selected features represents one of the few
tuning parameters in random forests: increasing the number of features increases the
correlation between trees in the forest, which can harm the accuracy of the overall
ensemble; however, increasing the number of features that can be chosen at each split
also increases the classification accuracy of each individual tree, making them stronger
classifiers with low error rates. The optimal range of number of features can be found
by using the out of bag (oob) error estimate, or the error estimate derived from those
samples not selected for training on a given tree.
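The two sources of randomness described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the text does not state the base of the logarithm, so base 2 is assumed, and all function names are hypothetical.

```python
import math
import random

def features_per_split(num_features):
    """(log M) + 1 candidate features per split; base 2 is an assumption."""
    return int(math.log2(num_features)) + 1

def bootstrap_sample(num_instances, rng):
    """Draw N indices with replacement; indices never drawn form the
    out-of-bag (oob) set used for the oob error estimate of this tree."""
    in_bag = [rng.randrange(num_instances) for _ in range(num_instances)]
    out_of_bag = sorted(set(range(num_instances)) - set(in_bag))
    return in_bag, out_of_bag

def candidate_features(num_features, rng):
    """Feature subset examined at one node, drawn without replacement."""
    return rng.sample(range(num_features), features_per_split(num_features))

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_sample(2250, rng)      # 250 authors x 9 files
split_candidates = candidate_features(120000, rng)    # about 120,000 features
print(len(out_of_bag), len(split_candidates))
```

With roughly 120,000 features, only a handful of candidates are examined at each node, which is why a very sparse feature vector can leave a node with few useful splits to choose from.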
During classification, each test example is classified via each of the trained decision
trees by following the binary decisions made at each node until a leaf is reached, and
the results are then aggregated. The most populous class can be selected as the output
of the forest for simple classification, or classifications can be ranked according to the
number of trees that ‘voted’ for a label when performing relaxed attribution (see
Section 2.4.5).
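The aggregation step can be illustrated with a short sketch. It assumes that each tree returns a single author label for a test file; the vote list below is made up for illustration.

```python
from collections import Counter

def rank_by_votes(tree_predictions):
    """Return (label, votes) pairs sorted from most to fewest votes."""
    return Counter(tree_predictions).most_common()

def classify(tree_predictions):
    """Simple classification: the most populous class is the output."""
    return rank_by_votes(tree_predictions)[0][0]

# Hypothetical predictions of six trees for one test file.
votes = ["alice", "bob", "alice", "carol", "alice", "bob"]
ranking = rank_by_votes(votes)
print(classify(votes), ranking)
```

The full ranking, rather than only its first entry, is what relaxed attribution uses.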
Random forests were trained with 300 trees, which empirically provided the best trade-off
between accuracy and processing time. Examination of numerous oob values across
multiple fits suggested that (logM) + 1 random features (where M denotes the total
number of features) at each split of the decision trees was in fact optimal in all of
the experiments (listed in Section 3.7), and was used throughout. Node splits were
selected based on the information gain criterion, and all trees were grown to the largest
extent possible, without pruning.
The data was analyzed via k -fold cross-validation, where the data was split into
training and test sets stratified by author (ensuring that the number of code samples
per author in the training and test sets was identical across authors). k varies accord-
ing to data sets and is equal to the number of instances present from each author.
The cross-validation procedure was repeated 10 times, each with a different random
seed. The results section lists average results across all iterations, ensuring that they
are not biased by improbably easy or difficult to classify subsets.
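A minimal sketch of this protocol is given below, assuming that every author contributes the same number of files and that files are assigned to folds at random within each author. The assignment rule and all names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(samples, seed):
    """samples: list of (author, file_id) pairs. Returns k folds, where k is
    the number of files per author and each fold holds one file per author."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for author, file_id in samples:
        by_author[author].append(file_id)
    k = len(next(iter(by_author.values())))
    folds = [[] for _ in range(k)]
    for author, files in by_author.items():
        assert len(files) == k, "every author must contribute k files"
        shuffled = files[:]
        rng.shuffle(shuffled)
        for fold, file_id in zip(folds, shuffled):
            fold.append((author, file_id))
    return folds

# Toy corpus: three authors with 9 files each, so k = 9.
samples = [(a, f"{a}_{i}") for a in ("alice", "bob", "carol") for i in range(9)]
# Ten repetitions, each with a different random seed.
runs = [stratified_folds(samples, seed) for seed in range(10)]
print(len(runs), len(runs[0]), len(runs[0][0]))
```

In each repetition, every fold serves once as the test set while the remaining k − 1 folds are used for training, and the reported accuracy is the average over all folds and repetitions.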
2.4.5 Evaluation
The evaluation section presents the results to the possible scenarios formulated
in the problem statement to evaluate the method. The corpus section gives an
overview of the collected data. Section 2.4.5 shows the main results to programmer
de-anonymization and how it scales to 1,600 programmers, which is an immediate pri-
vacy concern for open source contributors that prefer to remain anonymous. Then,
an analysis of training data requirements and efficacy of types of features is per-
formed. The obfuscation section discusses a possible countermeasure to programmer
de-anonymization. Possible machine learning formulations are then discussed, along with a
verification approach that extends the method to an open world problem. The
evaluation ends with generalizing the method to other programming languages and
providing software engineering insights.
Corpus
One concern in source code authorship attribution is whether we are actually iden-
tifying differences in coding style, rather than merely differences in functionality.
Consider the case where Alice and Bob collaborate on an open source project. Bob
writes user interface code whereas Alice works on the network interface and backend
analytics. If we used a data set derived from their project, we might detect
differences between frontend and backend code rather than differences in style.
In order to minimize these effects, the evaluation is performed on the source code
of solutions to programming tasks from the international programming competition
Google Code Jam (GCJ), made public in 2008 [4]. The competition consists of al-
gorithmic problems that need to be solved in a programming language of choice. In
particular, this means that all programmers solve the same problems, and hence im-
plement similar functionality, a property of the data set crucial for code stylometry
analysis. The release of these results presented an opportunity to build a source code
corpus with thousands of programmers with ground truth information on authorship.
The data set contains solutions by professional programmers, students, academics,
and hobbyists from 166 countries. The majority of the contestants were from India,
United States, China, Russia, Japan, Canada, Brazil, South Korea, France, Egypt,
and Poland. Participation statistics are similar over the years. Moreover, it contains
problems of different difficulty, as the contest takes place in several rounds. This
makes it possible to assess whether coding style is related to programmer experience
and problem difficulty.
Programmers need to pass the qualification round within a 27 hour frame to
become contestants and advance to the online rounds. 3,000 contestants from the
first round that have the highest scores advance to the second round. The top-scoring
500 contestants in the second round advance to the third round. 25 of the top-scoring
contestants in the third round advance to the onsite final round. As the round number
increases, the set of problems becomes more difficult. For example, 26,470 contestants
were able to pass the qualification round. 15,563 of these completed round-1 and 3,000
contestants with the top scores advanced to round-2. Only 2,599 contestants out of
this set of skilled 3,000 contestants were able to complete round-2. 500 contestants
with the highest scores from round-2 advanced to round-3 and only 393 of these highly
skilled programmers were able to complete round-3.
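The attrition across rounds can be made explicit with a short calculation over the figures quoted above. The dictionary layout is only an illustration; the numbers are those given in the text.

```python
# (contestants who completed the round, contestants eligible for it)
rounds = {
    "round-1": (15563, 26470),
    "round-2": (2599, 3000),
    "round-3": (393, 500),
}

completion_rate = {
    name: completed / eligible for name, (completed, eligible) in rounds.items()
}

for name, rate in completion_rate.items():
    print(f"{name}: {rate:.1%} of eligible contestants completed the round")
```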
The most commonly used programming language was C++, followed by Java,
and Python. C++ and C data was collected because of their popularity in the
competition and having a parser for C/C++ readily available [119]. Some preliminary
experimentation was conducted on Python as well.
A validation data set was created from 2012’s GCJ competition. Some problems
had two stages, where the second stage involved answering the same problem in a
limited amount of time and for a larger input. The solution to the large input is
essentially a solution for the small input but not vice versa. Therefore, collecting
both of these solutions could result in duplicate and identical source code. In order
to avoid multiple entries, only small input versions’ solutions were used in the data
set.
The programmers had up to 19 solution files in these data sets. Solution files have
an average of 70 lines of code per programmer.
To create the experimental data sets that are discussed in further detail in the
results section:
(i) The corpus was first partitioned by year of competition. The “main” data set
includes files drawn from 2014 (250 programmers). The “validation” data set files
come from 2012, and the “multi-year” data set files come from years 2008 through
2014 (1,600 programmers).
(ii) Within each year, the corpus files are ordered by the round in which they were
written, and by the problem within a round, as all competitors proceed through
the same sequence of rounds in that year. As a result, stratified cross validation is
performed on each program file by the year it was written, by the round in which
the program was written, by the problems solved in the round, and by the author’s
highest round completed in that year.
Programmer De-anonymization
This section presents the main experiment: de-anonymizing 250 programmers in
the difficult scenario where all programmers solved the same set of problems. The
largest data set formed from 2014’s Google Code Jam competition in which every
programmer had solution files for the same 9 problems contained 250 programmers.
These were the 9 easiest problems, making the classification more challenging (see
Section 2.4.5). Code Stylometry
Feature Set reached 91.78% accuracy in classifying 250 programmers. After apply-
ing information gain and using the features that had positive information gain, the
accuracy was 95.08%.
Another data set had 250 programmers from different years and randomly selected
9 solution files for each one of them. The information gain features obtained from
2014’s data set were used to see how well they generalize. IG-CSFS reached 98.04%
accuracy in classifying 250 programmers. This is about 3 percentage points higher than the controlled large
data set’s results. The accuracy might be increasing because of using a mixed set
of Google Code Jam problems, which potentially contains the possible solutions’
properties along with programmers’ coding style and makes the code more distinct.
To evaluate the approach and validate the method and important features, a data
set from 2012’s Google Code Jam Competition with 250 programmers who had the
solutions to the same set of 9 problems, was created. The feature set consisted of
only the features that had positive information gain in 2014’s data set, which was
used as the main data set to implement the approach. The classification accuracy
was 96.83%, which is higher than the 95.07% accuracy obtained in 2014’s data set.
The high accuracy of the validation results in Table 2.24 shows that the important
features of code stylometry and a stable feature set are identified. This feature set
does not necessarily represent the exact features for all possible data sets. For a
given data set that has ground truth information on authorship, following the same
approach should generate the most important features that represent coding style in
that particular data set.
A = #programmers, F = max #problems completed
N = #problems included in data set (N ≤ F)

A = 250 from 2014    A = 250 from 2012    A = 250 all years
F = 9 from 2014      F = 9 from 2014      F ≥ 9 all years
N = 9                N = 9                N = 9
Average accuracy after 10 iterations with IG-CSFS features
95.07%               96.83%               98.04%
Table 2.24: Validation Experiments
Scaling
Collection of a larger data set of 1,600 programmers from various years was cru-
cial to perform large scale experiments. Each of the programmers had 9 source code
samples. The large data set was divided into 6 subsets of differing sizes, with 250
programmers, 500 programmers, 750 programmers, 1,250 programmers, 1,500 pro-
grammers, and 1,600 programmers. These subsets are useful to understand how well
the approach scales. The specific features that had information gain in the main 250
programmer data set were extracted from this large data set. In theory, a classifier
needs to use more trees in the random forest as the number of classes increases to
decrease variance, but in this experiment, the classifier used fewer trees compared to
smaller experiments. A random forest of 300 trees was used to run the experiments in
a reasonable amount of time with a reasonable amount of memory. The accuracy did
not decrease substantially as the number of programmers increased. This result shows
that the information gain features are robust to changes in the set of classes and are
important properties of programmers’ coding styles. Figure 2.10 demonstrates
how well the method scales. The classifier is able to de-anonymize 1,600 program-
mers using 32GB memory within one hour. Alternately, using a classifier with 40
trees leads to nearly the same accuracy (within 0.5%) in a few minutes.
Training Data and Features
Different data sets were formed, each consisting of a different set of 62 programmers
who had F solution files, with F ranging from 2 to 14. Each data set has the solutions to the
same set of F problems by different sets of programmers. Each data set consists of
programmers that were able to solve exactly F problems. Such an experimental setup
makes it possible to investigate the effect of programmer skill set on coding style. The
size of each data set is limited to 62 programmers, because there are only 62
contestants with 14 files. There are a few contestants with up to 19 files, but they
cannot be included in the data sets since there were not enough programmers to
compare them.
Figure 2.10: Large Scale De-anonymization
The same set of F problems is used to ensure that the coding style of the pro-
grammer is being classified and not the properties of possible solutions of the problem
itself. The feature set is able to capture personal programming style since all the pro-
grammers are coding the same functionality in their own ways.
Stratified F -fold cross validation was used by training on everyone’s (F − 1) so-
lutions and testing on the F th problem that did not appear in the training set. As
a result, the problems in the test files were encountered for the first time by the
classifier.
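The split described above, in which every fold holds out all solutions to one problem, can be sketched as follows. The tuple layout and names are hypothetical, and the feature vectors are left as placeholders.

```python
def problem_folds(samples):
    """samples: list of (author, problem, features) tuples.
    Yields (held_out_problem, train, test), holding out every solution
    to one problem per fold so that the test problem is unseen in training."""
    problems = sorted({problem for _, problem, _ in samples})
    for held_out in problems:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

authors = [f"author{i}" for i in range(62)]
problems = [f"p{j}" for j in range(9)]                  # F = 9
samples = [(a, p, None) for a in authors for p in problems]

folds = list(problem_folds(samples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

With 62 programmers and F = 9, each fold trains on 62 × 8 files and tests on the 62 solutions to the held-out problem.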
A random forest with 300 trees and (logM)+1 features performed F -fold stratified
cross validation, first with the Code Stylometry Feature Set (CSFS) and then with
the CSFS’s features that had information gain.
Figure 2.11 shows the accuracy from 13 different sets of 62 programmers with 2
to 14 solution files, and consequently 1 to 13 training files. The CSFS reaches an
optimal training set size at 9 solution files, where the classifier trains on 8 (F − 1)
solutions.
Figure 2.11: Training Data
In the data sets we constructed, as the number of files increases and problems
from more advanced rounds are included, the average lines of code (LOC) per file also
increases. The average number of lines of code per source file in the data set is 70. An
increased number of lines of code might have a positive effect on the accuracy, but at
the same time it reveals the programmer’s choice of program length in implementing the
same functionality. On the other hand, the average LOC of the 7 easier (76 LOC)
or more difficult problems (83 LOC) taken from contestants that were able to complete 14
problems is higher than the average LOC (68) of contestants that were able
to solve only 7 problems. This shows that programmers with better skills tend to
write longer code to solve Google Code Jam problems. The mainstream idea is that
better programmers write shorter and cleaner code, which the LOC statistics in these
data sets contradict. Google Code Jam contestants are supposed to optimize
their code to process large inputs with faster performance. This implementation
strategy might be leading advanced programmers to implement longer solutions
for the sake of optimization.
On the data set with 62 programmers each with 9 solutions, the classification ac-
curacy is 97.67% with all the features and 99.28% with the information gain features.
Excluding all the syntactic features decreases the accuracy to 88.89%. Applying
information gain selection to the non-syntactic features leads to 88.35% accuracy. Excluding all
the non-syntactic features and using only the syntactic features results in 96.06%
accuracy. Applying information gain selection to the syntactic features leads to 96.96%
accuracy. Most of the classification power is preserved with the syntactic features,
and using only non-syntactic features leads to a significant decline in accuracy.
Obfuscation
An off-the-shelf C++ obfuscator called Stunnix [5] was used to obfuscate the code
of a data set with 9 solution files and 25 programmers. The accuracy with the infor-
mation gain code stylometry feature set on the obfuscated data set is 97.77%. The
accuracy on the same data set when the code is not obfuscated is 98.67%. The obfus-
cator refactored function and variable names, as well as comments, and stripped all
the spaces, preserving the functionality of code without changing the structure of the
program. Obfuscating the data produced little detectable change in the performance
of the classifier for this sample. The results are summarized in Table 2.25.
A much more sophisticated open source obfuscator called Tigress [2] was used to
obfuscate the code of another data set with 9 solution files in C and 20 programmers
(see example in Figures 2.12 and 2.13). In particular, Tigress implements function
virtualization, an obfuscation technique that turns functions into interpreters and
converts the original program into corresponding bytecode. After applying function
virtualization, de-anonymizing programmers became less accurate, so it has potential
as a countermeasure to programmer de-anonymization. However, this obfuscation
comes at a cost. First of all, the obfuscated code is neither readable nor maintainable,
and is thus unsuitable for an open source project. Second, the obfuscation adds
significant overhead (9 times slower) to the runtime of the program, which is another
disadvantage.
Figure 2.12: A Code Sample X
Figure 2.12 shows a source code sample X from the data set that is 21 lines long.
After obfuscation with Tigress, sample X became 537 lines long. Figure 2.13 shows
the first 14 lines of the obfuscated sample X.
The accuracy with the information gain feature set on the obfuscated data set
is reduced to 67.22%. When the feature set is limited to AST node bigrams, de-
anonymization accuracy drops to 18.89%, which demonstrates the need for all feature
types in certain scenarios. The accuracy on the same data set when the code is not
obfuscated is 95.91%.
Figure 2.13: Code Sample X After Obfuscation
Obfuscator  Programmers  Language  Results w/o Obfuscation  Results w/ Obfuscation
Stunnix     25           C++       96.89%                   95.56%
Stunnix     25           C++       98.67%*                  97.77%*
Tigress     20           C         93.65%                   58.33%
Tigress     20           C         95.91%*                  67.22%*
* Information gain features
Table 2.25: Effect of Obfuscation on De-anonymization
Two-class Classification
Source code author identification could automatically deal with source code copy-
right disputes without requiring manual analysis by a neutral code investigator. A
copyright dispute on code ownership can be resolved by comparing the styles of both
parties claiming to have generated the code. The style of the disputed code can be
compared to both parties’ former source code to aid in the investigation. To imitate
such a scenario, a data set was formed from 60 different pairs of programmers, each
with 9 solution files. A random forest performed 9-fold cross validation to classify
two programmers’ source code. The average classification accuracy using the CSFS is
100.00% and accuracy using information gain features is also 100.00%.
Another two-class machine learning task can be formulated for authorship verifi-
cation. We suspect Mallory of plagiarizing, so we mix in some code of hers with a
large sample of other people’s, test, and see if the disputed code gets classified as hers
or someone else’s. If it gets classified as hers, then it was with high probability really
written by her. If it is classified as someone else’s, it really was someone else’s code.
This could be an open world problem and the person that originally wrote the code
could be a previously unknown programmer.
This is a two-class problem with classes Mallory and others. A random forest
trains on Mallory’s solutions to problems a, b, c, d, e, f, g, h. The random forest also
trains on programmer A’s solution to problem a, programmer B’s solution to prob-
lem b, programmer C’s solution to problem c, programmer D’s solution to problem
d, programmer E’s solution to problem e, programmer F’s solution to problem f, pro-
grammer G’s solution to problem g, programmer H’s solution to problem h and puts
them in one class called ABCDEFGH. The random forest classifier with 300 trees
trains on classes Mallory and ABCDEFGH. There are 6 test instances from Mallory
and 6 test instances from another programmer ZZZZZZ, who is not in the training
set.
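The construction of the two training classes can be sketched as follows. Solutions are represented by (programmer, problem) pairs as placeholders for feature vectors, and the function name is hypothetical.

```python
def build_verification_set(target, target_problems, distractors):
    """Label the target's solutions "Mallory" and pool one solution from
    each distractor programmer into a single class called "ABCDEFGH"."""
    train = [((target, problem), "Mallory") for problem in target_problems]
    train += [((author, problem), "ABCDEFGH") for author, problem in distractors]
    return train

problems = list("abcdefgh")
# Programmer A contributes problem a, programmer B problem b, and so on.
distractors = list(zip("ABCDEFGH", problems))
train = build_verification_set("Mallory", problems, distractors)

for sample, label in train:
    print(sample, label)
```

Both classes cover the same eight problems, so the classifier cannot separate them by problem content alone and has to rely on stylistic differences.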
These experiments have been repeated in the exact same setting with 80 different
sets of programmers in the roles of ABCDEFGH, ZZZZZZ, and Mallory. The average classification
accuracy for Mallory using the CSFS set is 100.00%. ZZZZZZ’s test instances are
classified as programmer ABCDEFGH 82.04% of the time, and classified as Mallory
for the rest of the time while using the CSFS. Depending on the amount of acceptable
false positives, the operating point on the ROC curve can be adjusted.
These results are also promising for use in cases where a piece of code is suspected
to be plagiarized. Following the same approach, if the classification result of the
piece of code is someone other than Mallory, that piece of code was with very high
probability not written by Mallory.
Verification/Open World Problem
In a real world scenario, we do not know if source code belongs to one of the
programmers in the suspect set. In such cases, the classifier can classify the anony-
mous source code, and if the share of tree votes received by the winning class
is below a certain threshold, the classifier can reject the classification considering the
possibility that it might not belong to any one of the classes in the training data. By
doing so, the approach scales to an open world scenario, where a suspect might
not have been encountered before. As long as a confidence threshold is determined
based on training data [109], the probability that an instance belongs to one of the
programmers in the set can be calculated and accordingly the classifier can accept or
reject the classification.
270 classifications are performed in a 30-class problem using all the features to
determine the confidence threshold based on the training data. The accuracy was
96.67%. There were 9 misclassifications and all of them were classified with less than
15% confidence by the classifier. The class probability or classification confidence
(P (C)i) is calculated by taking the percentage of trees (T ) in the random forest (f)
that voted for that particular class (Vi), which can be seen in equation 2.5.
P(C)_i = \frac{\sum V_i}{\sum T_f}    (2.5)
There was one correct classification made with 13.7% confidence. This suggests
that a threshold between 13.7% and 15% confidence level can be used for verification,
and the classifications that did not pass the confidence threshold can be manually
analyzed or excluded from results.
An aggressive threshold of 15% was picked. To validate the threshold, a random
forest classifier was trained on the same set of 30 programmers with 270 code samples. The
classifier was tested on 150 different files from the programmers in the training set. There
were 6 classifications below the 15% threshold and two of them were misclassified.
Another test set consisted of 420 test files from 30 programmers that were not in
the training set. All the files from these 30 programmers were attributed to one
of the 30 programmers in the training set since this is a closed world classification
task; however, the highest confidence level in these classifications was 14.7%. The
15% threshold catches all the instances that do not belong to the programmers in the
suspect set, and it rejects 2 misclassifications at the cost of 4 correct classifications. Consequently,
if a classification is made with a confidence value less than a certain threshold, the
classification can be rejected and the test instance can be attributed to an unknown
suspect.
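A minimal sketch of this rejection rule is shown below, assuming that the confidence is the fraction of trees voting for the winning class and that classifications below the threshold are rejected. The vote counts are invented for illustration and do not come from the experiments.

```python
from collections import Counter

def classify_with_rejection(tree_predictions, threshold=0.15):
    """Return (label, confidence). The label is None when the winning class
    receives a smaller share of the tree votes than the threshold."""
    label, votes = Counter(tree_predictions).most_common(1)[0]
    confidence = votes / len(tree_predictions)
    return (label if confidence >= threshold else None), confidence

# Hypothetical votes from a 300-tree forest for two anonymous files.
confident = ["alice"] * 120 + ["bob"] * 100 + ["carol"] * 80
uncertain = ["alice"] * 44 + [f"other{i}" for i in range(256)]

print(classify_with_rejection(confident))
print(classify_with_rejection(uncertain))
```

A returned label of `None` corresponds to attributing the file to an unknown suspect, or to setting it aside for manual analysis.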
Relaxed Classification
The goal here is to determine whether it is possible to reduce the number of
suspects using code stylometry. Reducing the set of suspects in challenging cases,
such as having too many suspects, would reduce the effort required to manually find
the actual programmer of the code.
In this section, a random forest classifies the 250 programmers in the main data
set from 2014 using the information gain features. The classification was relaxed to a
set of top R suspects instead of exact classification of the programmer. The relaxed
factor R varied from 1 to 10. Instead of taking only the class with the most votes from the
decision trees in the random forest, the R classes with the most votes were taken
and the classification result was considered correct if the programmer was in the set
of top R highest voted classes. The accuracy does not improve much after the relaxed
factor is larger than 5.
Figure 2.14: Relaxed Classification with 250 Programmers
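Relaxed accuracy can be computed from per-tree votes as in the following sketch. The vote lists are hypothetical and the function name is an assumption.

```python
from collections import Counter

def top_r_accuracy(vote_lists, true_authors, r):
    """Fraction of test files whose true author is among the r classes
    that received the most votes from the trees."""
    hits = 0
    for votes, truth in zip(vote_lists, true_authors):
        top_r = [label for label, _ in Counter(votes).most_common(r)]
        hits += truth in top_r
    return hits / len(true_authors)

# Hypothetical per-tree votes for three test files.
vote_lists = [
    ["alice", "alice", "bob"],
    ["bob", "carol", "carol"],
    ["alice", "alice", "bob", "bob", "bob", "carol"],
]
true_authors = ["alice", "bob", "carol"]

for r in (1, 2, 3):
    print(r, top_r_accuracy(vote_lists, true_authors, r))
```

Accuracy can only stay the same or increase as R grows, since a larger suspect set contains every smaller one.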
Generalizing the Method
Features derived from ASTs can represent coding styles in various languages.
These features are applicable in cases when lexical and layout features may be less
discriminating due to formatting standards and reliance on whitespace and other ‘lex-
ical’ features as syntax, such as Python’s PEP8 formatting. To show that the method
generalizes, source code of 229 Python programmers was collected from GCJ’s 2014
competition. 229 programmers had exactly 9 solutions. Using only the Python
equivalents of syntactic features listed in Table 2.22 and 9-fold cross-validation,
the average accuracy is 53.91% for top-1 classification, 75.69% for top-5 relaxed attri-
bution. The largest set of programmers to all work on the same set of 9 problems was
23 programmers. The average accuracy in identifying these 23 programmers is 87.93%
for top-1 and 99.52% for top-5 relaxed attribution. The same classification tasks us-
ing the information gain features are also listed in Table 2.26. The overall accuracy
in data sets composed of Python code is lower than in C++ data sets. In Python data
sets, a parser that was not fuzzy generated the ASTs, which had an effect on the
syntactic features. The lack of quantity and specificity of features accounts for the
decreased accuracy. The Python data set’s information gain features are significantly
fewer in quantity, compared to C++ data set’s information gain features. Informa-
tion gain only keeps features that have discriminative value all on their own. If two
features only provide discriminative value when used together, then information gain
will discard them. So if a lot of the features for the Python set are only jointly dis-
criminative (and not individually discriminative), then the information gain criterion
may be removing features that in combination could effectively discriminate between
authors. This might account for the decrease when using information gain features.
Nevertheless, a CSFS equivalent feature set can be generated for other programming
languages by implementing the layout and lexical features as well as using a fuzzy
parser.
Language  Programmers  Classification  IG      Top-5   Top-5 IG
Python    23           87.93%          79.71%  99.52%  96.62%
Python    229          53.91%          39.16%  75.69%  55.46%
Table 2.26: Generalizing to Other Programming Languages
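The limitation of per-feature information gain discussed above can be illustrated with a small constructed example, in which two binary features determine the author only through their combination. The data are artificial and the helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, values):
    """H(labels) - H(labels | values) for one discrete feature."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Artificial data: the author is determined by f1 XOR f2.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
authors = ["alice" if a ^ b else "bob" for a, b in zip(f1, f2)]
joint = list(zip(f1, f2))   # the two features considered together

print(information_gain(authors, f1))
print(information_gain(authors, f2))
print(information_gain(authors, joint))
```

Each feature alone leaves the author distribution unchanged and would be discarded, even though the pair identifies the author exactly.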
Software Engineering Insights
Is programming style consistent over the years? There were some contestants
that had the same username and country information both in 2012 and 2014. In 2014,
someone else might have picked up the same username from the same country and
started using it. We ignore this ground truth problem for now and assume that
matching usernames and countries identify the same people.
A data set was created from a set of 25 programmers from 2012 that were also
contestants in 2014’s competition. A random forest classifier with 300 trees was trained
on the 8 files from their submissions in 2012 using the CSFS. The test documents
consisted of one instance from each one of the contestants from 2014. The correct
classification rate of these test instances from 2014 is 96.00%. The accuracy dropped
to 92.00% when using only information gain features, which might be due to the
aggressive elimination of pairs of features that are jointly discriminative. These 25
programmers’ 9 files from 2014 had a correct classification accuracy of 98.04%. These
results indicate that coding style is preserved to some degree over the years.
To investigate problem difficulty’s effect on coding style, two data sets were created
from 62 programmers that had exactly 14 solution files. Table 2.27 summarizes the
following results. A data set with 7 of the easier problems out of 14 resulted in
95.62% accuracy. A data set with 7 of the more difficult problems out of 14 resulted
in 99.31% accuracy. This might imply that more difficult coding tasks have a more
prevalent reflection of coding style. On the other hand, the data set that had 62
programmers with exactly 7 of the easier problems resulted in 91.24% accuracy, which
is a lot lower than the accuracy obtained from the data set whose programmers were
able to advance to solve 14 problems. This might indicate that programmers who
are advanced enough to answer 14 problems likely have more unique coding styles
compared to contestants that were only able to solve the first 7 problems.
To investigate the possibility that contestants who are able to advance further
in the rounds have more unique coding styles, a second round of experiments was
performed on comparable data sets. A data set was created from 62 programmers
that had 12 solution files. The subset of this data set with 6 of the easier problems
out of 12 resulted in 91.39% accuracy. The subset of this data set with 6 of the more
difficult problems out of 12 resulted in 94.35% accuracy. These results are higher
than the data set whose programmers were only able to solve the easier 6 problems.
The data set that had 62 programmers with exactly 6 of the easier problems resulted
in 90.05% accuracy.
A = #programmers, F = max #problems completed
N = #problems included in data set (N ≤ F)

A = 62
F = 14              F = 7     F = 12              F = 6
N = 7     N = 7     N = 7     N = 6     N = 6     N = 6
Average accuracy after 10 iterations while using CSFS
99.31%    95.62%²   91.24%¹   94.35%    91.39%²   90.05%¹
Average accuracy after 10 iterations while using IG CSFS
99.38%    98.62%²   96.77%¹   96.69%    95.43%²   94.89%¹
¹ Drop in accuracy due to programmer skill set.
² Coding style is more distinct in more difficult tasks.
Table 2.27: Effect of Problem Difficulty on Coding Style
2.4.6 Discussion
Problem Difficulty. The experiment with random problems from random authors
across seven years most closely resembles a real-world scenario. In such an
experimental setting, there is a chance that instead of only identifying authors, the
classifier is also identifying the properties of a specific problem’s solution, which
results in a boost in accuracy.
In contrast, the main experimental setting where all authors have only answered
the nine easiest problems is possibly the hardest scenario, since the classifier is
trained on the same set of eight problems that all the authors have solved and then
tries to identify the authors from test instances that are all solutions of the 9th
problem. On the upside, these test instances help precisely capture the differences
between individual coding styles that implement the same functionality. The results
also reflect that such a scenario is harder, since the randomized data set has higher
accuracy.
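The train-on-eight, test-on-the-ninth setting can be illustrated with a minimal
sketch. The tuple layout (author, problem_id, features) and the function name
leave_one_problem_out are assumptions made for this illustration and are not taken
from the experimental code.

```python
def leave_one_problem_out(samples):
    """Yield (held_out_problem, train, test) splits.

    `samples` is a list of (author, problem_id, features) tuples; this
    layout is hypothetical and chosen only for the sketch.
    """
    problems = sorted({problem for _, problem, _ in samples})
    for held_out in problems:
        # Train on every solution except those of the held-out problem.
        train = [s for s in samples if s[1] != held_out]
        # Test only on solutions of the held-out problem, so every test
        # instance implements the same functionality.
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test


# Toy data: 3 authors, each with solutions to problems 1..9.
toy_samples = [
    (author, problem, None)
    for author in ("a1", "a2", "a3")
    for problem in range(1, 10)
]
toy_splits = list(leave_one_problem_out(toy_samples))
```

With nine problems per author this yields nine folds, each holding out one problem,
so the classifier never sees the functionality of the test instances during training.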
Classifying authors who have implemented the solutions to a set of difficult
problems is easier than identifying authors with a set of easier problems. This shows
that coding style is reflected more through difficult programming tasks. This might
indicate that programmers come up with unique solutions and preserve their coding
style more when problems get harder. On the other hand, programmers with a better
skill set have a more pronounced coding style, which can be identified more easily
compared to contestants who were not able to advance as far in the competition. This
might indicate that as programmers become more advanced, they build a stronger
coding style compared to novices. Another possibility is that better programmers
start out with a more unique coding style.
Effects of Obfuscation. A malware author or plagiarizing programmer might
deliberately try to hide their source code by obfuscation. Our experiments indicate
that our method is resistant to simple off-the-shelf obfuscators such as Stunnix, which
make code look cryptic while preserving functionality. The reason for this success is
that the changes Stunnix makes to the code, e.g., removal of comments, changing of
names, and stripping of whitespace, have no effect on syntactic features.
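The invariance of syntactic features under such transformations can be illustrated
with a small sketch. It uses Python's built-in ast module as a stand-in for the
C/C++ parser used in the experiments, and counts of AST node types as a simplified
stand-in for the syntactic features in CSFS; both substitutions are assumptions made
for illustration only.

```python
import ast
from collections import Counter


def node_type_counts(source):
    """Count how often each AST node type occurs in `source`."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


original = '''
def total(values):
    # sum the positive entries
    result = 0
    for value in values:
        if value > 0:
            result += value
    return result
'''

# Same program after removing the comment, renaming every identifier,
# and stripping as much whitespace as the grammar allows.
obfuscated = "def z1(z2):\n z3=0\n for z4 in z2:\n  if z4>0:z3+=z4\n return z3\n"

same_syntax = node_type_counts(original) == node_type_counts(obfuscated)
```

Here same_syntax is True: the two versions differ in every lexical and layout
property, yet their node type counts are identical.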
In contrast, sophisticated obfuscation techniques such as function virtualization
hinder de-anonymization to some degree, although at the cost of making code
unreadable and introducing a significant performance penalty. Unfortunately,
unreadability of code is not acceptable for open-source projects, while it is no problem
for attackers interested in covering their tracks. Developing methods to automatically
remove stylometric information from source code without sacrificing readability is
therefore a promising direction for future research.
Limitations. The case where a source file might be written by a different author
than the stated contestant is a ground truth problem that we cannot control. More-
over, it is often the case that code fragments are the work of multiple authors. To
shed light on the feasibility of classifying such code, analyzing a data set formed of git
commits to open source projects is one possible direction. Such an experiment should
be possible since Joern, the parser that was utilized throughout the experiments,
works on code fragments rather than complete code.
Mimicry attacks are another fundamental problem for machine learning classifiers.
For example, the random forest classifier may be evaded by an adversary who adds
extra dummy code to a file so that it closely resembles that of another programmer,
albeit without affecting the program’s behavior. This evasion is possible, but trivial
to resolve when an analyst verifies the decision.
Finally, verifying authorship information of Google Code Jam contestants is not
possible. In this case, a classify and then verify approach as explained in Stolerman
et al.’s work [109] is helpful. Each classification could go through a verification step
to eliminate instances where the classifier’s confidence is below a threshold. After the
verification step, instances that do not belong to the set of known authors can be
separated from the data set to be excluded or to be further manually analyzed.
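The thresholding step described above can be sketched as follows. This is a
simplified stand-in and not the verification procedure of Stolerman et al. [109]:
the function name, the dictionary of per-author confidences (for example, the
fraction of trees voting for each author), and the threshold value of 0.5 are all
assumptions made for illustration.

```python
def classify_then_verify(class_confidences, threshold=0.5):
    """Return (author, confidence), or (None, confidence) if rejected.

    `class_confidences` maps each candidate author to the classifier's
    confidence for that author. The default threshold is arbitrary and
    would have to be tuned on held-out data.
    """
    author, confidence = max(class_confidences.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Below the threshold: treat the instance as possibly written by
        # someone outside the set of known authors.
        return None, confidence
    return author, confidence


accepted = classify_then_verify({"alice": 0.82, "bob": 0.10, "carol": 0.08})
rejected = classify_then_verify({"alice": 0.40, "bob": 0.35, "carol": 0.25})
```

Rejected instances can then be excluded from the data set or passed on for manual
analysis.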
2.4.7 Related Work
Programmer de-anonymization is inspired by the research done on authorship
attribution of unstructured or semi-structured text [81; 9]. This section discusses prior
work on source code authorship attribution. In general, previous work on programmer
de-anonymization (Table 2.28) looks at smaller scale problems, does not use structural
features, and achieves lower accuracies.
The highest accuracies in the related work are achieved by Frantzeskou et al.
[48; 46]. They used 1,500 7-grams to reach 97% accuracy with 30 programmers.
They investigated the high-level features that contribute to source code authorship
attribution in Java and Common Lisp. They determined the importance of each
feature by iteratively excluding one of the features from the feature set. They showed
that comments, layout features and naming patterns have a strong influence on the
author classification accuracy. They used more training data (172 lines of code on
average) than this work (70 lines of code). This work replicated their experiments on
a 30-programmer subset of the C++ data set, with eleven files containing 70 lines of
code on average and no comments. The classifier reaches 76.67% accuracy with 6-grams
and 76.06% accuracy with 7-grams. When a feature set of 6-grams and 7-grams was used
on 250 programmers with 9 files, the accuracy decreased to 63.42%. In comparison, a
random forest with 300 trees, incorporating CSFS, reaches 98% accuracy on 250
programmers.
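As an illustration of n-gram based features, the sketch below extracts the most
frequent character n-grams from a source file and measures how many two such
profiles share. It is written in the spirit of n-gram profile methods; the function
names are hypothetical, and whether the replicated experiments used character-level
n-grams with exactly this selection rule is an assumption.

```python
from collections import Counter


def top_ngrams(source, n=7, limit=1500):
    """Return the `limit` most frequent character n-grams of `source`."""
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return [gram for gram, _ in counts.most_common(limit)]


def profile_overlap(profile_a, profile_b):
    """Number of n-grams that two profiles have in common."""
    return len(set(profile_a) & set(profile_b))


profile = top_ngrams("abcabcabc", n=3, limit=2)
shared = profile_overlap(["abc", "bca"], ["bca", "xyz"])
```

Because such profiles are built from raw characters, they capture comments, layout,
and naming, which is consistent with the drop in accuracy observed on files that
contain no comments.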
The largest number of programmers studied in the related work was 46, with 67.2%
accuracy. Ding and Samadzadeh [39] use statistical methods for
authorship attribution in Java. They show that among lexical, keyword and layout
properties, layout metrics have a more important role than others, which is not the
case in our analysis.
There are also a number of smaller scale, lower accuracy approaches in the lit-
erature [25; 64; 105; 65; 72; 42; 67], shown in Table 2.28, all of which this work
significantly outperforms. These approaches use a combination of layout and lexical
features.
The only other work to explore structural features is by Pellin [88], who used
manually parsed abstract syntax trees with an SVM that has a tree based kernel to
classify functions of two programmers. He obtains an average of 73% accuracy in
a two class classification task. His approach explained in the white paper can be
extended to incorporate CSFS, so it is the closest to this work in the literature. This
work demonstrates that it is non-trivial to use ASTs effectively and is the first to use
structural features to achieve higher accuracies at larger scales and the first to study
how code obfuscation affects code stylometry.
There has also been some code stylometry work that focused on manual analysis
and case studies. Spafford and Weeber [107] suggest that use of lexical features such as
variable names, formatting and comments, as well as some syntactic features such as
usage of keywords, scoping and presence of bugs could aid in source code attribution
but they do not present results or a case study experiment with a formal approach.
Gray et al. [53] identify three categories in code stylometry: the layout of the code,
variable and function naming conventions, types of data structures being used and also
the cyclomatic complexity of the code obtained from the control flow graph. They
do not mention anything about the syntactic characteristics of code, which could
potentially be a great marker of coding style that reveals the usage of a programming
language’s grammar. Their case study is based on a manual analysis of three worms,
rather than a statistical learning approach. Hayes and Offutt [56] examine coding style
in source code by their consistent programmer hypothesis. They focused on lexical
and layout features, such as the occurrence of semicolons, operators and constants.
Their data set consisted of 20 programmers and the analysis was not automated. They
concluded that coding style exists through some of their features and professional
programmers have a stronger programming style compared to students. In the results
in section 2.4.5, this work also shows that more advanced programmers have a more
identifying coding style.
There is also a great deal of research on plagiarism detection which is carried out
by identifying the similarities between different programs. For example, there is a
widely used tool called Moss that originated from Stanford University for detecting
software plagiarism. Moss [12] is able to analyze the similarities of code written by
different programmers. Rosenblum et al. [101] present a novel program representation
and techniques that automatically detect the stylistic features of binary code.
2.4.8 Conclusion and Future Work
Source code stylometry has direct applications for privacy, security, software
forensics, plagiarism, copyright infringement disputes, and authorship verification.
Source code stylometry is an immediate concern for programmers who want to contribute
code anonymously because de-anonymization is quite possible. This work introduces
the first principled use of syntactic features along with lexical and layout features
to investigate style in source code. 1,600 programmers can be de-anonymized with
94% accuracy and 250 programmers with 98% accuracy with eight training files
per class. This shows that source code authorship attribution with the Code
Stylometry Feature Set scales even better than regular stylometric authorship
attribution, as these methods can only identify individuals in sets of 50 authors with
slightly over 90% accuracy [6]. Furthermore, this performance is achieved by training
on only 550 lines of code or eight solution files, whereas classical stylometry requires
5,000 words of training data.

Related Work                  # of Programmers   Results
Pellin [88]                   2                  73%
MacDonell et al. [72]         7                  88.00%
Frantzeskou et al. [48]       8                  100.0%
Burrows et al. [25]           10                 76.78%
Elenbogen and Seliya [42]     12                 74.70%
Kothari et al. [64]           12                 76%
Lange and Mancoridis [67]     20                 75%
Krsul and Spafford [65]       29                 73%
Frantzeskou et al. [48]       30                 96.9%
Ding and Samadzadeh [39]      46                 67.2%
This work                     8                  100.00%
This work                     35                 100.00%
This work                     250                98.04%
This work                     1,600              92.83%
Table 2.28: Comparison to Previous Results
Additionally, the results in this work raise a number of questions that motivate
future research. First, as malicious code is often only available in binary format,
it would be interesting to investigate whether syntactic features can be partially
preserved in binaries. This may require the feature set to be improved in order to
incorporate information obtained from control flow graphs.
Second, can the classification accuracy be further increased? For example, does
using features that have joint information gain alongside features that have infor-
mation gain by themselves improve performance? Moreover, designing features that
capture larger fragments of the abstract syntax tree could provide improvements.
These changes (along with adding lexical and layout features) may provide significant
improvements to the Python results and help generalize the approach further.
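The difference between individual and joint information gain can be made concrete
with a small sketch. The helper names and the toy data are assumptions for
illustration: two binary features that each carry no information about the class on
their own, but determine it completely when considered together.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    by_value = defaultdict(list)
    for value, label in zip(feature_values, labels):
        by_value[value].append(label)
    conditional = sum(len(subset) / total * entropy(subset) for subset in by_value.values())
    return entropy(labels) - conditional


# Toy data: the label is the exclusive-or of the two features.
feature_a = [0, 0, 1, 1]
feature_b = [0, 1, 0, 1]
labels = [0, 1, 1, 0]

gain_a = information_gain(feature_a, labels)
gain_b = information_gain(feature_b, labels)
gain_joint = information_gain(list(zip(feature_a, feature_b)), labels)
```

In this toy case gain_a and gain_b are both zero while gain_joint is one bit, so a
selection step that ranks features one at a time would discard both features even
though the pair is fully discriminative.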
Finally, investigating whether code can be automatically normalized to remove
stylistic information while preserving functionality and readability will be a first step
towards anonymizing code.
3. Modeling and Quantifying Privacy Behavior
This work was completed by Aylin Caliskan-Islam with support from Jonathan
Walsh [29].
Analysis of the amount of private information shared by Twitter users [29] showed
that friends who share similar privacy scores appear in clusters. People also tend to
mention friends with similar privacy scores in their tweets. This correlation presents
a starting point to investigate the causation behind revealing private information in
online social networks.
A collection of timelines along with friend lists of 500,000 Twitter users through
the Twitter API was used to analyze the users’ privacy revealing habits. Ten pri-
vacy categories, based on a general societal consensus on what is private supported
by related work, were used to label tweets as private or not. After Amazon
Mechanical Turk workers annotated the timelines of 270 users, a privacy score
calculator assigned each labeled timeline a score from 1 (mostly public) to 3 (mostly
private). A machine learning classifier used this ground truth representation
to associate 1,982 Twitter users with a privacy score. The numeric representation of
privacy score consisted of privacy related features that can be extracted through
natural language processing techniques. One such feature is the named entities
mentioned in text: the more entities named in a text, the more specific that
text becomes. Named entity recognition identifies elements in text such as persons,
organizations, and locations. Another feature is the distribution of topics a user talks
about. Some topics tend to be more personal, such as medical information, emotions,
politics, and religious views whereas some topics are public and do not contain as
much private data, such as sports, weather, and news. Detecting topics and named
entities alone is not sufficient for understanding the underlying meaning of text, which
in aggregate forms privacy behavior. This is where Brown clustering and semantic
classification become useful. Brown clustering is a distributional similarity method
that groups words appearing in the same context. Semantic classification exposes the
intention in sentences.
The privacy score classifier reached 95% correct classification accuracy on the
labeled data of 270 users in cross validation by using natural language processing
based features. Such a learning based approach covering a wide range of privacy
concerns has not been established before. Previous work focused on keyword based
detection of limited privacy categories, such as location privacy, medical privacy, or
writing under the influence. A learning based privacy score calculator could utilize
a user’s timeline to guide her during the selection of privacy settings while tailoring
the user’s needs to the privacy policy at hand. In future work, this can be extended
to automate privacy policy understanding and create a privacy setting management
tool. Currently, a Facebook study has been initiated to investigate the influencing
factors of private information disclosure. This study particularly aims to answer two
questions: do people who tend to reveal private information influence their friends to
share more private information, and what are the other causal factors behind privacy
behavior?
3.1 Introduction
Numerous organizations, from corporations to governments to criminal gangs, are
actively engaged in the collection of personal information released on the Internet.
Generally, this pervasive collection is performed without the user’s knowledge.
Internet users need an increased ability to realize how they are influenced to reveal
private information and how much sensitive information they are exposing.
Twitter users with public accounts expose user information through tweets. A
Twitter user might share her text with another party that she trusts but this user
may not know how her information will be redistributed on the Internet. The user
might also not realize how much private information she is exposing. In such cases,
understanding how risky other users are by assigning a privacy score to those users’
timelines can help a user decide how much sensitive information she is willing to
share with users of certain privacy scores. In order to study and understand privacy
behaviors in aggregate, especially as they are embedded in social networks, ‘privacy
detective’ can attribute a privacy score to a Twitter timeline using a learning based
approach.
Privacy varies from individual to individual and each user may have differing views
of privacy. Nonetheless, there is an imperfect but non-negligible societal consensus
that certain material is more private than other material.
This societal consensus can be captured by having AMT workers annotate tweets as
private or not according to Table 3.2 to calculate the privacy scores of Twitter users.
Privacy scores within a user’s network could be used to understand how social in-
teractions influence users’ privacy behaviors. A reliable method can associate users to
privacy levels to analyze how privacy behavior is influenced. Do the people a user fol-
lows or mentions in tweets influence her sensitive information-sharing behavior? Does
the number of followers a user has affect her privacy habits? The proposed method
‘privacy detective’ can classify Twitter users’ timelines according to the amount of
private information being exposed and associate each user with a privacy score.
Outliers in timelines are important since a privacy preserving user can all of a
sudden decide to reveal a very rare disease or homeland security information. ‘Privacy
detective’ is not trying to catch such extreme cases and it is not designed for self
censoring. Such outliers do not have an adverse effect on collective privacy behavior
analysis, since the focus of the study is on population level effects.
The hypothesis here may simply be stated as follows: those who follow or reply to
users who frequently divulge private information are at a higher risk of having their
private information exposed. For example, the user may release private information
directly, or the release of private information may occur by an encouragement effect in
which a user replies to a post from another user, revealing private information which
they would not have otherwise posted publicly. Intuitively, certain users will be more
likely to reveal private information. Are users more likely to reveal private information
on their own, or by the influence of their friends, or after prompting from another
user?
The benefit for a user having the ability to detect this type of effect is twofold.
First, if users have a measure of the full extent of their contacts’ release of private
information they may take steps to safeguard themselves. Second, if there is a rela-
tionship between users providing private information in replies, users of these types
of systems will be more aware of the risks in such situations.
‘Privacy detective’ can uncover new things about aggregate privacy behavior. The
loss of privacy has become prevalent as online social networks expand and privacy
behaviors seem to be socially constructed. A quantitative analysis of the extent of
the user-to-user influence in sensitive information revealing habits can demonstrate
a possible factor that is contributing to the loss of personal and online privacy. This
analysis can be used to improve privacy enhancing technologies and educational in-
terventions. For example, a user can apply this on friends’ status messages to get a
sense of their privacy scores and build friends lists accordingly.
Privacy behavior analysis has been influenced by the study on the collective dy-
namics of smoking in a large social network [34] and the spread of obesity in a large so-
cial network over 32 years [33]. Christakis and Fowler used network analytic methods
and statistical models to derive results from these studies. They examined whether
weight gain in one person was associated with weight gain in her friends, siblings,
spouse, and neighbors. They concluded that obesity appears to spread through social
ties. They also examined the extent of the person-to-person spread of smoking
behavior and the extent to which groups of widely connected people quit together. They
concluded that network phenomena are relevant to smoking behavior and smoking
cessation. These findings had implications for clinical and public health interventions
to reduce and prevent smoking and to stop the spread of obesity.
‘Privacy detective’ detects the presence and amount of private content given text
input using topic modeling, a privacy ontology, named entity recognition, and
sentiment analysis. Tweets are preprocessed to make better use of natural language
processing techniques. This preprocessing is important given the source text of tweets,
as Twitter has evolved a language which is challenging for natural language processing
tasks. The Latent Dirichlet Allocation method by Blei et al. [18] is used for topic
modeling. The privacy ontology is based on the privacy dictionary contributed by
Gill et al. [51]. Named entities consist of names, location, date, time, organization,
money, and percentage. Sentiment analysis classifies sentences as either private or not
private. Private information can fall under one or more of the following 9 categories:
location, medical, drug/alcohol, emotion, personal attacks, stereotyping, family or
other associations, personal details, and personally identifiable information. Features
are extracted from tweets with the mentioned techniques to train machine learning
classifiers on various timelines with varying degrees of privacy in order to come up
with a privacy score for a user’s timeline of unknown privacy score.
The learning based approach ‘privacy detective’ is the key contribution for three
reasons:
1. Privacy detective detects a broad range of privacy categories. Previous work
focuses on certain types of privacy such as location privacy, medical privacy, or
writing under the influence.
2. Privacy detective adopts a learning based approach whereas previous methods
focus on keyword and regular expression based detection.
3. Privacy is socially constructed and this is demonstrated by the positive corre-
lation between a user’s and her friends’ privacy scores.
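One way to quantify the correlation named in the third point is sketched below: the
Pearson correlation between each user's privacy score and the mean privacy score of
her friends. The data layout, the function names, and the choice of the Pearson
coefficient over a per-user mean are assumptions made for illustration; the sketch
does not claim to reproduce the analysis performed in this work.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def user_friend_correlation(scores, friends):
    """Correlate each user's score with the mean score of her friends.

    `scores` maps user -> privacy score (1 to 3); `friends` maps
    user -> list of friends. Both structures are hypothetical.
    """
    user_scores, friend_means = [], []
    for user, friend_list in friends.items():
        known = [scores[f] for f in friend_list if f in scores]
        if user in scores and known:
            user_scores.append(scores[user])
            friend_means.append(sum(known) / len(known))
    return pearson(user_scores, friend_means)


toy_scores = {"a": 1, "b": 1, "c": 3, "d": 3, "e": 2}
toy_friends = {"a": ["b"], "b": ["a", "e"], "c": ["d"], "d": ["c", "e"]}
toy_r = user_friend_correlation(toy_scores, toy_friends)
```

A coefficient close to 1 on real data would be consistent with friends sharing
similar privacy scores, although correlation alone does not establish influence.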
Detecting private information is a timely problem since a lot of personal information
is being exposed online. It is difficult to manage private information and friends
lists on various social media sites such as Twitter, Facebook, and Google+, which
frequently change their privacy policies; at times, sensitive information is
redistributed without the owner’s knowledge. ‘Privacy detective’ can be adapted to
assist users in privacy preferences about friends lists, sharing choices, and exposed
content. ‘Privacy detective’ also presents an invaluable research platform for privacy
researchers since it makes it possible to study how private information is revealed
over time, what affects sensitive information sharing habits, and where people expose
personal information.
Text preprocessing, topic modeling, privacy ontology, named entity recognition,
and sentiment analysis will be explained in detail in section 3.6.
3.2 Related Work
Mao et al. [73] study privacy leaks on Twitter by automatically detecting vacation
plans, tweeting under the influence of alcohol, and revealing medical conditions. Their
study focuses on analyzing these three specific privacy topics by creating filters to
analyze content and automatically categorizing tweets into the three categories. They
investigate who divulges information. Their study is followed by a cross cultural study
that detects these three types of privacy leaks in the US, UK, and Singapore. They
discuss how their classification system can be used as a defensive mechanism to alert
users of potential privacy leaks.
Sleeper et al. [106] survey 1,221 Twitter users on AMT and discover that users
mostly regret messages that are critical of others, cathartic/expressive, or reveal too
much information. They also show that regrets on Twitter reached broader audi-
ences and were repaired more slowly compared to in-person regrets. The privacy
categories, explained in Table 3.2, were partly influenced by Sleeper et al.’s Twitter
regret categories, which are: blunder, direct attack, group reference, direct criticism,
reveal/explain too much, agreement changed, expressive/catharsis, lie, implied criti-
cism, and behavioral edict.
Wang et al. [115] survey 569 American Facebook users to investigate regrets
associated with posts on Facebook. They show that regrets on Facebook revolved
around topics with strong sentiment, lies, and secrets, which all have subcategories.
Privacy categories used in our annotations were also partly influenced by Wang et
al.’s regret list. Their survey results revealed several causes of posting regrettable
content. They report how regret incidents had serious implications such as job loss
or breaking up relationships. They also discuss how regrets can be avoided in online
social networks.
Thomas et al. [111] explore multi-party privacy risks in social networks. They
specifically analyze Facebook to identify scenarios where conflicting privacy settings
between friends reveal information that at least one user intended to remain private.
This paper shows how private information can be spread unwillingly when a risky
user in the network gets access to other users’ personal information. To mitigate this
threat, they present a proof of concept application built into Facebook that
automatically ensures that mutually acceptable privacy restrictions are enforced on
group content.
Cristofaro et al. [37] present a privacy preserving service for Twitter called ‘Hum-
mingbird’. Hummingbird is a variant of Twitter that protects tweet contents, hash-
tags, and follower interests from the potentially prying eyes of the centralized server.
It provides private fine grained authorization of followers and privacy for followers.
Hummingbird preserves the central server to guarantee availability but the server
learns minimal information about users.
Hart et al. [55] classify enterprise level documents as either sensitive or non-
sensitive with automatic text classification algorithms to improve data loss preven-
tion. They introduce a novel training strategy, supplement and adjust, to create an
enterprise level classifier. They evaluate their algorithm on confidential documents
published on Wikileaks and other archives and get a very low false negative and false
discovery rate. A support vector machine with a linear kernel performs the best on
their test corpora. Their best feature space across all corpora is unigrams, i.e.,
single words, with binary weights. They eliminate stop words and limit the number of
features to 20,000.
Liu et al. [70] propose a framework for computing privacy scores for users in online
social networks based on sensitivity and visibility of private information. The privacy
score in this study indicates the user’s potential risk caused by her participation in
the network.
Chow et al. [32] design a text revision assistant that detects sensitive information
in text and gives suggestions to sanitize sentences. Their method involves querying
the Internet for detections and recommendations.
There have been numerous studies on topic modeling [68], named entity recogni-
tion [100], and sentiment analysis [19] on Twitter as well as normalizing micro-text
[118] though not focusing on tweets in particular.
3.3 Problem Statement and Threat Model
The main problem we investigate in this work is: ‘Does the given text contain
any private or sensitive information and if it does, how much of the text reveals
private content?’ We want to control the type of information we reveal in our text
that is submitted online. We also want to know the private information sharing
habits of people in our network in order to make sharing decisions based on their
privacy scores. This also helps us understand social influences for revealing private
information. Detecting private information is crucial for analyzing textual content
and privacy behavior embedded in social networks.
The assumption in the worst case is that an adversary will have access to all
content posted by a user to the social network. Any publicly posted information may
be captured by an adversary who is constantly monitoring public portions of the social
network. Twitter feeds that are being analyzed are either entirely public or private,
and thus the adversary can focus on users with knowledge that she has captured
their full set of activity. The adversary does not have supplemental information to
associate with each particular user that is not available through the Twitter system.
Users’ social behavior can impact privacy. An online social network member
Alice may be influenced by her friends to release more information than she might
otherwise and then some third party observer Bob, who might be an advertiser, a
potential employer, or a social enemy, uses this information to harm or embarrass
her.
3.4 Data Collection
Twitter users and posts in this study are randomly selected primarily due to the
open nature of the posts on that social network. Both the relationships between
users and their activity on the social network are recorded. Furthermore, on Twitter,
unlike a social network such as Facebook or LinkedIn, users do not have an array
of built in fields or requests for personal data. For example, on Facebook, users are
routinely requested to divulge further information to the social network which may
include private information such as organizational association, current location, and
specific relationship information. Twitter simply requests a username and, optionally,
a location. Thus, any private information found within the service is likely to be
shared without prompting from the service itself.
The process of data collection emphasizes collection of a continuous stream of a
conversation on Twitter. The result of this approach is that tweets of users that are
more than a single degree away from the initial user are collected and considered. In
doing so, the complete chain of a conversation is captured, which may have led to the
release of private information.
Each tweet is analyzed for metadata within the content of the message. This
metadata includes both hashtags and user references. By associating hashtags directly
to tweets, users that tweet similar content are grouped. These users might not be
connected by a following-type relationship.
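The metadata extraction and grouping step can be sketched as follows. The regular
expressions are a simplification of how hashtags and user references are delimited
on Twitter, and the function names and the (user, text) tuple layout are assumptions
made for this illustration.

```python
import re
from collections import defaultdict

# Simplified patterns: a marker character followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_metadata(text):
    """Return the hashtags (lowercased) and user references in a tweet."""
    return {
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        "mentions": MENTION.findall(text),
    }


def group_users_by_hashtag(tweets):
    """Map each hashtag to the set of users who tweeted it.

    `tweets` is an iterable of (user, text) pairs.
    """
    groups = defaultdict(set)
    for user, text in tweets:
        for tag in extract_metadata(text)["hashtags"]:
            groups[tag].add(user)
    return groups


toy_tweets = [
    ("alice", "Go #Eagles! cc @bob"),
    ("carol", "#eagles game tonight"),
]
toy_groups = group_users_by_hashtag(toy_tweets)
```

In the toy data, alice and carol end up in the same group for the hashtag eagles
even though neither follows the other.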
Experimental data collection begins with the program running on a selected seed
user. The program selects up to 1,000 followers of the seed user and downloads the
tweets for each of these followers. For any tweet which is in reply to another tweet,
the program also downloads the originating tweets. The program iterates until it
reaches the initial originating tweet. The initial originating tweet is a tweet that has
been replied to, but is not a reply to any other tweet. Due to time delays with the
Twitter API, this process is time consuming. Thus the automated process developed
was essential in data collection.
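The iteration back to the initial originating tweet can be sketched as a simple
loop. The fetch_tweet callable, the dictionary representation of a tweet, and the
field name in_reply_to are hypothetical placeholders for whatever the collection
program used; the sketch runs against an in-memory store rather than the Twitter
API.

```python
def conversation_chain(tweet_id, fetch_tweet):
    """Follow replies from `tweet_id` back to the originating tweet.

    `fetch_tweet(tweet_id)` returns a dict with an "in_reply_to" entry
    (None for a tweet that is not a reply), or None if the tweet cannot
    be retrieved. Returns the tweets from the starting tweet back to
    the earliest one that could be fetched.
    """
    chain = []
    seen = set()
    current = tweet_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against malformed, cyclic reply data
        tweet = fetch_tweet(current)
        if tweet is None:
            break  # deleted or protected tweet: stop at the last one found
        chain.append(tweet)
        current = tweet.get("in_reply_to")
    return chain


toy_store = {
    3: {"id": 3, "in_reply_to": 2},
    2: {"id": 2, "in_reply_to": 1},
    1: {"id": 1, "in_reply_to": None},
}
toy_chain = conversation_chain(3, toy_store.get)
```

Each step requires one lookup, which is why rate limits on the real service make
collecting long conversations time consuming.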
All tweet data was collected over a period of approximately three weeks in
November 2013. Twitter does not present demographic information on its users; thus it
is difficult to predict age and gender. Although Twitter permits users to enter location
information, many users do not, and such information has not been used in this study.
The initial seed user is a local news sportscaster from Philadelphia; consequently, the
majority of users live in the Philadelphia area. Up to 200 of the most recent tweets
for each user were downloaded. The data collection is designed so that it cannot
impact the results, because ground truth is provided by AMT annotations that represent
a societal consensus, as explained in detail in section 3.5.
Item                      Count
User                      95,264
Tweet                     426,464
Follower Relationships    4,620
Referenced Users          19,123 (not included in user)
Unique Hashtags           180,186
Table 3.1: Data Set Information
Data is stored in an SQL database for easier access following collection. A Java
API for accessing the data and performing queries was also developed. Table-3.1
illustrates the total number of entities captured for the data set. Due to delays caused
by the Twitter API, the program was unable to collect the complete set of tweets for
all followed users in a reasonable amount of time. Thus, one of the intermediate goals
is to determine if there is a minimum tweet count which will give a significant chance
of evaluating the likelihood of a user releasing or encouraging the release of private
information.
3.5 Amazon Mechanical Turk Annotations
The Amazon Mechanical Turk (AMT) is a crowdsourcing Internet marketplace
that enables individuals and businesses to use human intelligence for tasks that com-
puters cannot currently accurately perform. The goal of AMT annotations is to obtain
ground truth about how much private information Twitter users reveal. Turkers an-
notate the publicly available Twitter data which is used for calculating the privacy
scores of Twitter users. These scores are later used in supervised machine learning to
classify timelines based on privacy scores. AMT is used only for annotation purposes
on data that’s publicly available.
AMT masters labeled each tweet in a user’s timeline as private or not according
to Table-3.2. AMT masters labeled a total of 270 randomly selected timelines each
with 500 words of tweets. 500 words of tweets were sufficient to generate accurate
topic ratios in topic modeling.
AMT masters achieve the ‘master’ distinction by completing work requests with
a high degree of accuracy across a variety of AMT requesters. AMT masters that
have demonstrated accuracy in data categorization labeled the tweets in this data set.
There were 10 random quality check tweets in addition to users’ original tweets in the
timelines. Humans have manually labeled these additional tweets’ privacy categories
in advance. These additional tweets are used as an inter-annotator agreement check-
point, to observe the variance in privacy category interpretations of machine learning
classifiers and humans. The experiments in this study utilized the annotations of
AMT workers who correctly interpreted the privacy category of 80% of the quality
check tweets. If a worker did not satisfy the quality check requirements for a timeline,
other work requests were submitted for that particular timeline until a worker met
the quality requirements.
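The quality check can be expressed as a simple filter. The sketch below assumes that "80%" means at least 8 of the 10 quality check tweets; the function name and argument layout are hypothetical.

```python
def passes_quality_check(worker_labels, gold_labels, threshold=0.8):
    """Return True if the worker matched at least `threshold` of the gold labels."""
    if len(worker_labels) != len(gold_labels) or not gold_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(w == g for w, g in zip(worker_labels, gold_labels))
    return correct / len(gold_labels) >= threshold
```

A timeline whose annotation fails this check would be resubmitted until some worker passes it.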
10 generic privacy categories guided AMT workers throughout the annotations.
The categories in Table-3.2 were influenced by related work, primarily the
participant-reported types of regret in ‘Twitter Regrets’ [106] and the regret
categories in ‘Regrets on Facebook’ [115]. The percentage of tweets that fall under
one of the 9 privacy categories in Table-3.2 (all categories except
Neutral/Objective) determines the privacy score of a user’s timeline.
• Privacy score-1: If more than 70% of the tweets are not private, the user is
assigned a privacy score of 1.
• Privacy score-2: If 30% or more and less than 60% of the tweets are private,
the user is assigned a privacy score of 2.
• Privacy score-3: If 60% or more of the tweets are private, the user is assigned a
privacy score of 3.
Figure 3.1: AMT Annotation Results
According to this calculation, 185 users had a score of 1, 57 users had a score of
2, and 28 users had a score of 3, as shown in Figure-3.1.
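The three thresholds above amount to the following mapping from per-tweet labels to a timeline score. This is a minimal sketch; the function name and the boolean label encoding are assumptions.

```python
def privacy_score(labels):
    """Map per-tweet labels (True = private) to a privacy score of 1, 2, or 3."""
    total = len(labels)
    if total == 0:
        raise ValueError("timeline has no labeled tweets")
    private = sum(1 for is_private in labels if is_private)
    # Integer comparisons avoid floating-point issues at the 30% and 60% cut-offs.
    if 10 * private >= 6 * total:   # 60% or more private
        return 3
    if 10 * private >= 3 * total:   # 30% or more and less than 60% private
        return 2
    return 1                        # more than 70% not private
```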
CATEGORY                               DESCRIPTION
Location                               Giving out location information.
Medical                                Revealing information about someone’s medical condition.
Drug/Alcohol                           Giving information about alcohol/drug use or revealing information under the influence.
Emotion                                Highly emotional content, frustration, hot states, etc.
Personal Attacks                       Critical statements directed at a person, general statements rather than specific.
Stereotyping                           Ethnic, racial, etc. stereotypical references about a group.
Family/Association detail              Revealing information about family members, or revealing their associations, e.g., ex-partner, mother-in-law, step brother.
Personal details                       E.g., relationship status, sexual orientation, job/occupation, embarrassing or inappropriate content, reveal/explain too much.
Personally Identifiable Information    Personally identifiable information (e.g., SSN, credit card number, home address, birthdate).
Neutral/Objective                      Neutral or objective tweets that reveal no private or sensitive information.
Table 3.2: Tweet Privacy Categories
Having a tool that can detect the sensitivity of a timeline relative to the societal
consensus on private information is useful and interesting, especially for
population-level effects. The privacy score calculations do not weight individual
disclosures by severity; for example, revealing a case of the flu and revealing a
rare disease count equally. Excluding such distinctions does not have an adverse
effect on the analysis, since the population-level privacy-revealing habits of social
network users can be captured without such outliers. This approach focuses on
aggregate privacy behavior, which reflects patterns of revealing sensitive
information, as opposed to discovering important secrets.
A second set of annotations was requested to measure the variance among the
first set of annotations, supervised machine learning results, and this second set of
annotations. Master AMT workers annotated a subset of 100 timelines from the first
set of 270 work requests on AMT. The privacy scores of 100 users were calculated
in the same way as the first set of annotations. According to the calculation, 75
users had a score of 1, 15 users had a score of 2, and 10 users had a score of 3.
Inter-annotator agreement results are discussed in section 3.7.
3.6 Approach
We consider a supervised machine learning problem and train classifiers on time-
lines of users with known privacy scores of 1, 2 and 3 to predict the privacy scores of
timelines of interest. We calculated the privacy scores of the users with known privacy
scores based on ground truth obtained from AMT annotations. A timeline of a user
with unknown privacy score is preprocessed to normalize micro-text and after that,
features are extracted to be used in machine learning. Timelines are classified with
privacy scores by using AdaBoost [50] with Naive Bayes classifier as a weak learner.
Test data is limited to 500 words of randomly selected tweets from each user’s timeline
for the reasons explained in section 3.5. The process is shown in Figure-3.2.
Naive Bayes is a popular method to provide baseline text categorization results
such as ham or spam classification. Naive Bayes can outperform support vector
machines (SVM) with appropriate preprocessing. In our experiments, boosted Naive
Bayes significantly outperformed sequential minimal optimization [94], a type of SVM.
AdaBoost is a machine learning meta-algorithm that stands for ‘Adaptive Boosting’.
AdaBoost trains one base Naive Bayes classifier at a time, reweighting the training
data in favor of instances that were misclassified by the previous classifiers, and
weights each classifier according to how useful it is in the ensemble of classifiers.
As long as the base learners perform even slightly better than random chance, the
boosted ensemble converges to a strong classifier by weighted majority voting.
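The reweighting described above can be written out explicitly. The equations below follow the standard multi-class AdaBoost.M1 formulation; that this exact variant was used in the experiments is an assumption, and the symbols $w_i^{(t)}$, $h_t$, $\epsilon_t$, $\beta_t$, and $Z_t$ are introduced here for illustration only.

```latex
% Round t = 1, ..., T: the weak learner h_t (here, Naive Bayes) is trained on
% instances (x_i, y_i) weighted by w_i^{(t)}, with \sum_i w_i^{(t)} = 1.
\begin{align*}
\epsilon_t &= \sum_{i} w_i^{(t)} \, \mathbb{1}\left[ h_t(x_i) \neq y_i \right]
  && \text{(weighted error, required to satisfy } \epsilon_t < 1/2 \text{)} \\
\beta_t &= \frac{\epsilon_t}{1 - \epsilon_t}
  && \text{(small when } h_t \text{ is accurate)} \\
w_i^{(t+1)} &= \frac{w_i^{(t)} \, \beta_t^{\,\mathbb{1}[h_t(x_i) = y_i]}}{Z_t}
  && \text{(correctly classified instances are down-weighted)} \\
H(x) &= \arg\max_{y \in \{1, 2, 3\}} \sum_{t=1}^{T} \log\frac{1}{\beta_t} \, \mathbb{1}\left[ h_t(x) = y \right]
  && \text{(weighted vote over privacy scores)}
\end{align*}
```

Here $Z_t$ is the normalization constant that makes the updated weights sum to one, so the misclassified instances carry relatively more weight in round $t+1$.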
Figure 3.2: Workflow
3.6.1 Text Preprocessing
In general, informal communication on the Internet does not tend to follow proper
English conventions such as proper sentence structure. Furthermore, such communi-
cations tend to include significant amounts of abbreviations, slang, and iconography.
Since users on Twitter are restricted to 140 characters, there is an increased likeli-
hood that such shorthand will be used. This is especially true when hashtags are
considered. Since hashtags are metadata contained within the tweet itself, they are
important to consider for both grouping tweets and also for the release of private
information.
Tweets contain text that is specific to Twitter and contain micro-text of slang and
unstructured sentences. For example, they can include hashtags to tag a certain topic
and user handles to refer to another Twitter user. The average number of words per
tweet in our sample is 15 and the average number of words per sentence in our sample
is 11. These properties of tweets make them challenging for topic modeling, named
entity recognition, and many other common natural language processing tasks. In
order to create meaningful topic models and detect present entities, we need to clean
up tweets and convert the English to a more formal form.
Tweets contain slang words and hashtags that are hard to process as vocabulary
words. In order to get rid of these, we replace them with cluster keywords from
Twitter word clusters. We use the 1000 hierarchical Twitter word clusters from the
Twitter NLP project [85], which were formed by Brown clustering [23] from 56,000,000
English tweets that had over 217,000 words. We manually reviewed the clusters and
selected a keyword that describes the words in the cluster. If any of the words in the
timeline were present in the clusters, we replaced that word with the cluster keyword.
After converting the words to cluster keywords, we removed non-ASCII characters
to reduce non-English language and pictographic characters. User handles (e.g.,
@johnsmith) were replaced with the word he, URLs were replaced with the keyword
URL, and misspellings were corrected based on an English dictionary. These text
preprocessing steps are shown in Figure-3.3.
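The preprocessing steps can be sketched in a few lines. This is an illustration under stated assumptions: `CLUSTER_KEYWORDS` is a tiny hypothetical stand-in for the manually labeled Twitter word clusters, the regular expressions are simplifications, and the dictionary-based spelling correction step is omitted.

```python
import re

# Hypothetical stand-in for the word clusters: word -> manually chosen keyword.
CLUSTER_KEYWORDS = {
    "lol": "laugh",
    "lmao": "laugh",
    "#sixers": "sixers",
    "philly": "philadelphia",
}

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")


def preprocess(tweet: str) -> str:
    """Normalize micro-text in the order described in the text."""
    # 1. Replace slang words and hashtags with their cluster keyword.
    tokens = [CLUSTER_KEYWORDS.get(tok.lower(), tok) for tok in tweet.split()]
    text = " ".join(tokens)
    # 2. Drop non-ASCII characters (non-English text, pictographs).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 3. Replace URLs with the keyword URL, then user handles with the word he.
    text = URL_RE.sub("URL", text)
    text = HANDLE_RE.sub("he", text)
    # Collapse whitespace left behind by removed characters.
    return " ".join(text.split())
```

URLs are replaced before handles so that an `@` inside a link is not mistaken for a user reference.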
3.6.2 Feature Extraction
The extracted features, which reflect the presence of sensitive information, are
listed in Table-3.3. The reasons for extracting these particular features and the
methods used to obtain the feature values are explained one by one in the following sections.
Figure 3.3: Tweet Preprocessing
Feature Normalization
All features used in the experiments were calculated either on a normalized scale
or normalized during the classification process. The majority of classifiers calculate
the distance between two points by using a distance metric. If one feature’s values fall
under a broad range, then that feature will govern the distance measurements and
mislead the classifier [13]. Features are normalized to fit individual samples in the
same scale so that they have unit norm and contribute proportionately to classification
distance calculations.
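Scaling a sample to unit norm can be sketched as below. The text does not state which norm is used; the Euclidean (L2) norm is assumed here, and the function name is hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)  # an all-zero sample is left unchanged
    return [v / norm for v in vector]
```

After this step a feature with a broad raw range, such as a count, can no longer dominate distance calculations between two samples.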
Topic Ratios
Topic models are algorithms for discovering the main themes that pervade a
large and otherwise unstructured collection of documents and for organizing the
collection according to the discovered themes [17]. Latent Dirichlet Allocation
(LDA) [18] is used to discover topics. This model treats each document in a set of
documents as a collection of topics. Topic modeling assumes that when a document is
created, the topics that make up that document and their proportions are selected
according to a Dirichlet distribution. Then, the document is created by repeatedly
selecting a topic according to its proportion and a word from the vocabulary for
that topic until the document is completed. Although this is somewhat convoluted,
if we estimate the posterior probabilities of this process using Gibbs sampling, we
can determine the topics discussed in a set of documents and the proportion of those
topics present in each document.

Feature                        Count
Topic Probabilities            200
Privacy Dictionary Matches     1
Name Entity Count              1
Location Entity Count          1
Date Entity Count              1
Time Entity Count              1
Organization Entity Count      1
Money Entity Count             1
Percentage Entity Count        1
Private Sentiment Count        1
Not-Private Sentiment Count    1
Quote Count                    1
URL Count                      1
Handle Count                   1
Retweet Count                  1
Hashtag Count                  1
Table 3.3: Privacy Feature Set
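The generative process that LDA assumes can be simulated directly. The sketch below is illustrative only: the function names, the toy vocabulary, and the two hand-written topics are assumptions, and it shows document generation rather than the Gibbs sampling used for inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw topic proportions from a Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point rounding


def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Generate one document: pick proportions, then a topic and word per token."""
    theta = sample_dirichlet(alpha, rng)          # per-document topic proportions
    words = []
    for _ in range(length):
        topic = sample_categorical(theta, rng)    # topic for this token
        word = sample_categorical(topic_word_probs[topic], rng)
        words.append(vocabulary[word])
    return theta, words
```

For example, with a hypothetical vocabulary `["game", "team", "god", "peace"]` and two topics that each cover two of these words, `generate_document` returns proportions that sum to one and a word list drawn from those topics.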
We use MALLET [74] to train a topic model on 267,026 tweets that we collected
from 27,293 Twitter users through the Twitter API. The MALLET topic modeling
toolkit contains an efficient, sampling-based implementation of ‘Latent Dirichlet
Allocation’ [18] as well as routines for transforming text documents into numerical
representations and removing stop words.
Some topics of discussion are more likely to reveal private information while other
topics remain neutral privacy-wise. Following this intuition, we trained a topic model
from the tweet data set and used this model to infer the topic ratios in given user
timelines. Topic modeling and inferencing proved more effective on preprocessed text.
We used the inferred topic ratios for each topic as a feature for machine learning.
In order to find the optimum number of topics, we divided the data into two parts:
training set (90% of the data) and testing set (10% of the data). We then conducted
20 runs of LDA by changing the number of topics from 20 to 400. On each run, we
built an LDA model on the training set and calculated the perplexity (Eq. 3.1) of
the testing set. Perplexity of an LDA model is defined as,
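\[
\mathrm{Perplexity}(D_{\mathrm{Test}}) = \exp\left( - \frac{\sum_{d=1}^{D} \log p(w_d \mid \alpha, \beta)}{\sum_{d=1}^{D} N_d} \right) \tag{3.1}
\]
where $D_{\mathrm{Test}}$ = tweet data set, $\sum_{d=1}^{D} N_d$ = total number of tokens in the tweet data set,
and $p(w_d \mid \alpha, \beta)$ = probability of an entire timeline belonging to a topic.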
Lower perplexity scores represent a more robust model. We chose the number
of topics as 200 since it produced the most robust model with the lowest perplexity
measure.
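Given per-document log-likelihoods from a trained model, the perplexity in Eq. 3.1 reduces to a one-line computation. In this sketch the function name and argument layout are assumptions; the log-likelihoods $\log p(w_d \mid \alpha, \beta)$ are expected to come from the topic modeling toolkit.

```python
import math


def perplexity(doc_log_likelihoods, doc_token_counts):
    """Eq. 3.1: exp(-(sum of log p(w_d | alpha, beta)) / (sum of N_d))."""
    total_tokens = sum(doc_token_counts)
    if total_tokens == 0:
        raise ValueError("test set contains no tokens")
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)
```

As a sanity check, a model that assigns every token a uniform probability of 1/V over a vocabulary of V words has a perplexity of exactly V, so lower values indicate a model that is less surprised by the held-out text.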
Table-3.4 shows 6 topics that fall under private or neutral categories. We extracted
the top 20 terms from each topic to better assess the contents of the topics.
Privacy Dictionary Matches
One feature used in machine learning is the number of matches between the ‘pri-
vacy dictionary’ and a user’s timeline. Since the timelines are limited to 500 words,
this feature is normalized across users’ feature vectors.
‘Privacy dictionary’ [114] is a tool for performing automated content analysis of
privacy. The privacy dictionary automates the content analysis of privacy related text.
Using methods from corpus linguistics, Vasalou et al. [114] constructed and validated
eight dictionary categories on empirical material from a wide range of privacy-sensitive
contexts. They show that these dictionary categories detect privacy language patterns
within a given text.
Topic                     Top 20 terms
Private: Inappropriate    f*ck bad f*cking sh*tfemale person i’m people inappropriate sh*t ass laugh appeal funny man holy fun real hell hate talking
Private: Religious        god love jesus life bless lord give respect man world good heart christ people day job family sex hope peace
Private: Marijuana        marijuana reveals legal medical law philly sam protest call pot country american story smoke white prohibition hunkie smoking horror qld
Public: Sports            sixers game heat tonight win season team people bynum andrew year order nba games flyers play classify ers mention night
Public: News              change africa climate service food news storm year jobs geez location weather job adaptation direction duce shows calls japan tornado
Public: Entertainment     job song music great video love rank listening watching movie channel i’m make favorite show country making talking dance cool
Table 3.4: Some Private and Public Topics
The dictionary is compatible with Linguistic Inquiry and Word Count (LIWC), a
text analysis software program developed by Pennebaker et al. [89]. The privacy
dictionary is useful for measuring the usage of categories of words across
heterogeneous types of text. The eight categories for privacy-sensitive contexts are
‘Law’, ‘OpenVisible’, ‘OutcomeState’, ‘NormsRequisites’, ‘Restriction’,
‘NegativePrivacy’, ‘Intimacy’, and ‘PrivateSecret’. Each linguistic category contains
words and phrases, which can be used to gain an understanding of the types of content
contained within the text and in relation to other content.
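Counting dictionary matches in a timeline can be sketched as follows. The entries in `PRIVACY_TERMS` are hypothetical placeholders and not taken from the privacy dictionary [114]; treating a trailing `*` as a stem wildcard is also an assumption made for this sketch.

```python
import re

# Hypothetical entries; the actual privacy dictionary is not reproduced here.
PRIVACY_TERMS = ["secret", "confidential", "priva*", "intimate"]


def compile_terms(terms):
    """Build one case-insensitive pattern; a trailing * matches any word ending."""
    patterns = []
    for term in terms:
        if term.endswith("*"):
            patterns.append(re.escape(term[:-1]) + r"\w*")
        else:
            patterns.append(re.escape(term))
    return re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)


def count_matches(text, pattern):
    """Number of dictionary matches in a timeline."""
    return len(pattern.findall(text))
```

The resulting count would then be normalized in the same way as the other count features.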
Named Entity Recognition
The more specific a user’s wording is, the more entities are found in the text.
Following this intuition, the higher the specificity, the higher the chances of
revealing private information. OpenNLP’s [1] named entity recognizer extracts the
number of name, location, date, time, organization, money, and percentage entities
to be used as machine learning features. Again, since the timelines are limited to
500 words, these features are normalized across users’ feature vectors.
Sentiment Analysis
Sentiment analysis is generally used to extract subjective information in text. It
can be used to infer whether the source is subjective or objective, or whether the
tone is positive, negative, or neutral. Sentiment analysis helps differentiate private
tweets from neutral or objective tweets. Therefore, the sentiment of interest is the
state of revealing private information which can be used as a feature on a tweet by
tweet basis.
A sentiment classifier trains on 9 privacy categories: location, medical, drug/al-
cohol, emotion, personal attacks, stereotyping, family or other associations, personal
details, personally identifiable information, and a not private category that contains
objective and neutral tweets. These 9 categories are influenced by related work and
are explained in more detail in section 3.5. Each category contains at least 6,000
words of training data made up of manually labeled tweets that represent the privacy
content. Lingpipe’s n-gram based sentiment classifier [3] classifies tweets in a time-
line as private or not private. The number of private and not private tweets are two
features used in machine learning. This feature is normalized across users because of
the timeline word length limit.
Quote, URL, Handle, Retweet, Hashtag Count
Twitter users tend to place retweets or sentences written by others in quotes. The
number of quotes and retweets in timelines is a feature that represents not private
content. The number of URLs, user handles, and hashtags also have information gain
and are included as supplemental features. These features are normalized, since there
is a word limit on the timelines being analyzed.
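These surface counts can be extracted with regular expressions. The patterns below are simplified assumptions (for example, only straight double quotes and the `RT` marker are recognized), and the sketch assumes raw tweet text as input.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
RETWEET_RE = re.compile(r"(?<!\w)RT\b")
QUOTE_RE = re.compile(r'"[^"]+"')


def surface_counts(tweets):
    """Count quotes, URLs, handles, retweets, and hashtags in a timeline."""
    counts = {"quote": 0, "url": 0, "handle": 0, "retweet": 0, "hashtag": 0}
    for tweet in tweets:
        counts["url"] += len(URL_RE.findall(tweet))
        # Remove URLs first so that '#' or '@' inside a link is not counted.
        no_urls = URL_RE.sub(" ", tweet)
        counts["quote"] += len(QUOTE_RE.findall(no_urls))
        counts["handle"] += len(HANDLE_RE.findall(no_urls))
        counts["retweet"] += len(RETWEET_RE.findall(no_urls))
        counts["hashtag"] += len(HASHTAG_RE.findall(no_urls))
    return counts
```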
3.7 Results
The first set of AMT annotations show that 10.37% of Twitter users frequently
reveal personal information (privacy score-3), 21.11% reveal some private information
(privacy score-2), 68.52% tend not to reveal much private information by tweeting
(privacy score-1). Twitter users need to be aware that the number of people revealing
private information is a significant portion of all users and make conscious decisions
when thinking of posting any text with private content.
The classifier reaches 95.45% accuracy in a two class task (users with scores of
1 and 3), and 69.63% accuracy in a three class task (users with scores of 1, 2, and
3) after performing 10-fold-cross-validation by using AdaBoost with Naive Bayes and
standardizing the features on the data set obtained from AMT annotators. These
results show that the extracted features represent privacy from a general standpoint
instead of focusing on single privacy categories. This differentiates this work from
previous efforts and makes the approach applicable to a broader range of privacy
concerns. Using the Brown clusters and converting the text to a format that is more
natural language processing friendly was a key element of being able to distinguish
private and non-private tweets. Without these transformations, accuracy drops to
58.93% in a two class task (users with scores of 1 and 3), and 38.10% in a three class
task (users with scores of 1, 2, and 3) after performing 10-fold-cross-validation by
using AdaBoost with Naive Bayes, and standardizing the features on the same data
set without preprocessing the text.
3.7.1 Twitter Database User Scores
The classifier that trained on the tweets from the annotated data set reached
69.63% accuracy in a 3-class supervised experiment. The timelines in this data set
are not present in the Twitter database. This trained classifier classified the scores
of 1,982 Twitter users that had at least 500 words of tweets in their timelines. The
Twitter database experiment’s results show that 18.62% of Twitter users frequently
reveal personal information, 30.52% reveal some private information, 50.86% tend not
to reveal much private information by tweeting. Figure-3.4 shows a privacy map of
the 1,982 users, where each node represents a user, each edge represents a following
relationship, and the node colors represent privacy score where light yellow is a score
of 1, orange is a score of 2, and red is a score of 3.
3.7.2 Correlation between User’s Privacy Score and User’s Friends’ Pri-
vacy Score
The privacy scores of users and the average privacy scores of the people they
follow are positively correlated. This means that the higher a user’s privacy score,
the higher her friends’ privacy scores are, and vice versa. Spearman’s Rho was
calculated to measure the direction and strength of the relationship between users’
and their friends’ privacy scores. The calculation used the privacy scores of 45
users who had at least 30 friends with a sufficient amount of tweets, together with
these friends’ privacy scores. The resulting R value is 0.41 and the two-tailed P
value is 0.005, which shows that there is a statistically significant positive
correlation between the two variables.
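Spearman's Rho is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below illustrates the coefficient only (not the P value); the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold each user's privacy score and `y` the average privacy score of that user's friends. Because scores take only the values 1, 2, and 3, ties are frequent, which is why average ranks are used.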
Spearman’s correlation was preferred over Pearson’s correlation because
Spearman’s correlation does not make any assumptions about the distribution of
the values, and the calculations are based on ranks, not the actual values. Pearson’s
correlation assumes that both of the two variables are sampled from populations that
follow a Gaussian distribution. There has been no study showing that Twitter
privacy scores follow a Gaussian distribution, and our sample size is not large
enough to support or refute such an argument. Three random users with privacy scores
1, 2, and 3, and their friends’ scores, are illustrated in Figure-3.5, 3.6, and 3.7.
The correlation between the user’s privacy score and her friends’ privacy scores is
shown by the main node’s color of light yellow, orange, or red being more dominant
than in the data set’s average distribution.
Figure 3.4: Twitter Privacy Map
Figure 3.5: User with Privacy Score-1
Figure 3.6: User with Privacy Score-2
Figure 3.7: User with Privacy Score-3
3.7.3 Correlation between User’s Privacy Score and Mentioned Users’
Privacy Score
There is a positive correlation between a user’s privacy score and the privacy
scores of users she mentions in tweets. Spearman’s Rho calculation on 45 users that
mentioned at least 30 other users with calculated privacy scores returned an R value
of 0.37 and a two-tailed P value of 0.01, which shows that the positive correlation
between two variables is statistically significant. This correlation is weaker than the
correlation between a user’s privacy score and the privacy scores of her friends. This
indicates that users prefer to follow other users that have similar privacy revealing
habits and users tend to mention users with similar private information revealing
habits. Nevertheless, a user’s friends’ average privacy score is a stronger indicator of
a user’s own privacy score than the average privacy score of people a user mentions
in tweets.
3.7.4 Correlation between User’s Privacy Score and Number of Followers
Number of followers for each user that had a calculated privacy score was obtained.
There was no statistically significant correlation between a user’s privacy score and
her number of followers. Both Spearman’s Rho and Pearson’s correlation coefficient
were close to 0.
For example, at the time of gathering data from Twitter, rogerfederer, who is a
professional tennis player ranked world no. 4 had around 1,500,000 followers and
a privacy score of 1, whereas mark_wahlberg who is an American actor also had
around 1,500,000 followers and a privacy score of 3. There is no correlation between
how much private information a user reveals and how many followers the user has.
3.7.5 Inter-Annotator Agreement
Cohen’s Kappa coefficient [35] was calculated to measure the inter-annotator
agreement with a 95% confidence interval. Cohen’s Kappa coefficient is a statistical
measure of inter-annotator agreement for categorical items which takes into account
the agreement occurring by chance. Cohen’s Kappa is a measurement of concordance
that can be applied to data that is not normally distributed or binary data such as
true/false, but is best suited to an ordinal scale, such as the 3 point privacy score
scale. Kappa statistics is generally thought to be a more robust measure than simple
percent agreement calculation since it excludes the agreement expected from random
chance.
Cohen’s Kappa can be calculated in two ways, namely weighted kappa coeffi-
cient and unweighted kappa coefficient. Weighted Kappa coefficient [36] is recom-
mended when the score categories are more than two and not binary. Weighted
Kappa statistics takes the distance between different categories into account. Conse-
quently, weighted Kappa statistics offered the most accurate agreement measurements
for privacy score predictions and annotations, which have 3 categories.
Landis and Koch [66] characterized Kappa coefficient values less than 0 as indi-
cating no agreement, 0 to 0.20 as slight agreement, 0.21 to 0.40 as fair agreement,
0.41 to 0.60 as moderate agreement, 0.61 to 0.80 as substantial agreement, and 0.81
to 1 as almost perfect agreement.
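A weighted Kappa on the 3-point privacy score scale, together with the Landis and Koch interpretation, can be sketched as follows. The weighting scheme is not specified in the text; linear disagreement weights are assumed here, and both function names are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3)):
    """Weighted Cohen's Kappa with linear disagreement weights |i - j| / (k - 1)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Observed proportion of items for each (rater A, rater B) category pair.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    disagree_observed = 0.0
    disagree_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            disagree_observed += weight * observed[i][j]
            disagree_expected += weight * row[i] * col[j]  # chance agreement
    if disagree_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - disagree_observed / disagree_expected


def landis_koch(kappa):
    """Interpret a Kappa value using the Landis and Koch ranges."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

With linear weights, confusing a score of 1 with a score of 3 is penalized twice as heavily as confusing adjacent scores, which is the sense in which the weighted statistic takes the distance between categories into account.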
There is a fair agreement between the annotations of the first set and second set
of AMT annotators. The agreement between the first set of AMT annotators and the
classifier is fair. There is also a fair agreement between the annotations of the second
set of AMT annotators and the supervised machine learning classifications. These
three results suggest that the variance of privacy annotations between humans is in
the same range as the variance between human annotators and supervised machine
learning classifications. Determining if a given tweet is private or not is subjective to
an extent for AMT workers even though detailed annotation guidelines are available.
Seeing that privacy detective’s results fall under the same level of subjectivity makes
it more reliable in addition to the accuracy obtained from supervised experiments.
3.8 Limitations
The ground truth in the training set is provided by AMT workers and not the
original writers of the tweets. Turkers were given a detailed explanation of how
to annotate tweets and choose privacy categories, but the original author of the
tweet might have had a different intention in writing the tweet. This annotation
strategy provides a man-on-the-street view of privacy; therefore, this limitation
did not harm the approach.
The length of timelines and the number of tweets have an effect on how much
private or sensitive information is released. A personal profile can be formed by
investigating the writings of a person. The more text that is present the more accurate
the profile will be. The quantified effect of writing length on the amount of personal
information leakage is not clear. There are numerous components in text that are
representative of private information or neutral data. Each component’s effect needs
to be factored out in order to investigate the effect of text length. In order to keep the
length factor stable, this study is limited to 500 words of randomly selected tweets
from a Twitter user’s timeline.
Most tweets in a user’s timeline could be benign and a few could be very private. A
sample of 500 words might only capture the neutral tweets from a user. Not including
such exceptions in the analysis does not adversely affect the privacy score
calculations. Users’ habits rather than the outliers in their timelines are the
focus of this study.
3.9 Discussion
Entity recognition requires proper English sentences to detect sentences and the
entities within text. Tweets by nature do not resemble proper English sentences
and therefore render natural language processing tasks quite challenging. Improv-
ing named entity recognition accuracy on tweets might boost private information
detection performance.
Table-3.5 shows the information gain ranks of features. Not-private sentiment
count is the most important feature followed by 13 topics and the rest of the non-
topic related features. The information gain ratios which are close to 1% for all of
the 215 features show that all features contribute proportionately and they are all
important.
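Information gain measures how much knowing a feature value reduces the entropy of the privacy score. The sketch below assumes the feature has already been discretized into a small number of bins (continuous features such as topic ratios would need this step first); the function names and the binning assumption are not taken from the experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that perfectly separates two equally frequent privacy scores yields a gain of 1 bit, while a feature that is independent of the score yields 0.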
There are many topics that contribute to correct classification. Creating a topic
model with the correct number of topics and precise LDA parameters is crucial for
accurate analysis. Topic discovery is more effective on a larger data set, which
covers a greater range of topics and words. As the Twitter data set gets larger
after collecting more tweets through the Twitter API, the topic models are
periodically updated to include recent topics. The 13 topics that had the highest
information gain ranks among the 200 topics are shown in Table-3.6.
Feature                        Rank
Not-Private Sentiment Count    1
13 Topics                      14
Private Sentiment Count        15
122 Topics                     137
Privacy Dictionary Matches     138
Percentage Entity Count        139
Organization Entity Count      140
Name Entity Count              141
Time Entity Count              142
Quote Count                    143
Retweet Count                  144
Handle Count                   145
Hashtag Count                  146
URL Count                      147
Money Entity Count             148
Location Entity Count          149
Date Entity Count              150
65 Other Topics                215
Table 3.5: Information Gain
66.66% of wrong predictions are a miss by one in privacy score and the remaining
33.33% of wrong predictions are a miss by two. Many of the wrong classifications
lie on classifier boundaries. For example, one timeline was misclassified as a privacy
score of 1, and it actually had 30% private tweets and needed to be classified as a
privacy score of 2. Such cases can be eliminated by improving the quality of extracted
features.
3.9.1 Future Work
A data set made up of tweets is a challenging one for text analytics compared to
formal writing. Text analytics methods will be more effective on regular writings of
people. This hypothesis can be tested in the future with a data set that consists of
formal writing and ground truth on private information.
Relationship between text length and the amount of personal information leakage
can be quantified as more annotated data becomes available, possibly annotated by
the owner of the writing. Applying privacy analytics methods to other social media
to detect private content in similar but differently formatted data will be useful in
understanding privacy behavior in more detail.
The text analysis software LIWC has dictionaries relevant to privacy. In future
work, incorporating other related dictionaries might improve classifier performance.
In future work, understanding the causal factors behind private information dis-
closure could be used to effectively design privacy enhancing nudges and target edu-
cational interventions.
3.10 Conclusion
Some topics are more likely to include private information, as shown by the fact
that topic ratio features help in detecting private information. Entity recognition
by itself is not enough to show if private information is being revealed, but added
to topic features which define the context of the entity, it greatly increases the
detection rate of private information. Keyword-based private information detection
helps detect private information to some extent, since the privacy dictionary
matches feature improves the accuracy by 4%, but it is too limited to be generalized
to all privacy concerns.
Incrementally improving the approach and understanding the causal factors be-
hind private information disclosure could be used to effectively design privacy nudges.
Another possible direction is to provide an assistive tool to users that can be more
than a research platform for privacy researchers. For example, a user can use privacy
detective to get a sense of friends’ privacy scores and build friends lists accordingly.
Online privacy behavior might be socially constructed. This knowledge can be
used to effectively design privacy enhancing technologies and target educational in-
terventions.
Topic       Top 20 terms
People      url mammal person girl family bad dogs boy age front man cats location dog hot lucky loves color baby cat
Sports      order refresh year draft round eagles games pick rank game history trade fantasy number player nfl team calls top season
Fiction     letters url fiction lekker met hate pack win pur funny weer rico unit moet nar kick reaction net arv heel
Fun         url check great love free awesome site store food today photos tips party order time songs peek design weekend clothes
Emotions    people bad i'm admit hate love strange make annoy play it's makes time funny feel friends true angry matter good
Location    url i'm philadelphia park city mayor location york philly design box bank photo search ave citizens center opening reveals day
Discussion  url change follow education pregnancyloss computer propulsion cycle lbs item secret money security gas save built boxing vin personal jobs
Curse       fuck bad fucking female person i'm people inappropriate shit ass laugh appeal funny man holy fun real hell hate talking
News        url school news sports high video upper fox lines group great washington wtf today temple darby location blurred weather back
Time        hours number days application time years minutes order ago back url unit late day started top running today shows left
Personal    people life things make love time good rank i'm hard emotion don't happen stay find person feel it's forget change
Religious   god love jesus life bless lord give respect man world good heart christ people day job family sex hope peace
Family      bad people event family person inappropriate call problems man make time feel kids admit makes world making age good thing
Table 3.6: Topics with High Information Gain
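Information gain measures how much knowing a feature reduces uncertainty about the class label. The sketch below computes it for a binary feature; treating a topic as simply present or absent in a text is an assumption made to keep the example short, since topic features in this work are ratios that would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Information gain of a binary feature with respect to the labels.

    feature_present: list of booleans, one per text (assumed discretization).
    labels: list of class labels, for example "private" or "public".
    """
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(feature_present, labels) if x == value]
        if subset:
            # Subtract the entropy remaining within each feature value.
            gain -= len(subset) / total * entropy(subset)
    return gain
```

A topic whose presence perfectly separates private from public texts reaches the maximum gain (one bit for two balanced classes), while a topic that is equally common in both classes has a gain of zero.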
4. Future Work and Conclusion
4.1 Future Directions
4.1.1 Design “Nudges” and NLP-based Privacy Protection Policies
The institutional review board at Drexel University has recently approved a user
study protocol to obtain direct ground truth information on privacy from Facebook
users and introduce a privacy nudge to the Facebook architecture in the form of
friend lists based on privacy scores. Users will download all of their Facebook data
and label the information they shared according to privacy categories. The decisions
they make in the presence of the privacy nudge will be compared to their past posts
to quantify the effect of the privacy nudge. The users will also record their reasons
for sharing private information, which will provide insight into the causal factors
behind privacy behavior. All the features used in a previous Twitter study will be
adapted to the Facebook domain, and new features will be drawn from privacy-related
Linguistic Inquiry and Word Count (LIWC) dictionaries. This study will shed light on the
effect of network phenomena on privacy behavior and the effectiveness of privacy nudges.
4.1.2 De-classification and Sanitization
Developing a method to detect sensitive entities in text is one of the next steps af-
ter completing the ongoing research for this dissertation. Identifying sensitive entities
might be achieved by improving named entity recognition on specific domains through
the use of word vector representations and topic modeling. Generalizing or swapping
sensitive entities that have been detected is a potential sanitization method. Improv-
ing the state-of-the-art in natural language processing might benefit the security and
privacy community, and many researchers, through de-classification, anonymization,
and sanitization.
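Once sensitive entities have been detected, generalizing them can be as simple as replacing each span with its type. The following is a minimal sketch under that assumption: the entity spans are supplied by hand here and would in practice come from a named entity recognizer, and the label names are illustrative only.

```python
def sanitize(text, entities):
    """Replace each detected entity span with a generic type placeholder.

    entities: list of (start, end, label) character spans, assumed to be
    non-overlapping and already produced by some entity detector.
    """
    sanitized = text
    # Work from the end of the text so earlier offsets remain valid.
    for start, end, label in sorted(entities, reverse=True):
        sanitized = sanitized[:start] + f"[{label}]" + sanitized[end:]
    return sanitized
```

Swapping an entity for another of the same type, rather than for a placeholder, would follow the same pattern with a lookup table in place of the bracketed label.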
Sanitizing text that contains sensitive and private information is of interest to
the public and to agencies that try to make documents public, for several reasons. The
first reason is to enable researchers to share private research data with sociologists,
psychologists, computer scientists, or linguists without breaching subjects' privacy.
Sanitization can also be used in companies to prevent data loss. Another reason is to
help detect classified documents in over-classified archives and redact sensitive data
automatically to prepare the documents for publication. The Clinton administration had
33,000,000 emails that needed to be released to the public. The Obama administration
had 250,000,000 emails in 2010 that came into the legal custody of the National Archives.
Presidential library archivists systematically and manually redact pages one by one.
Only a portion of Reagan's library is ready because archivists have not finished manually
processing all of the documents yet. Manually processed documents can be searched through
manual means, which is not practical. This manual approach no longer scales, and an
automated redaction tool is urgently needed to minimize the manual effort.
4.1.3 Source Code and Binaries
De-anonymizing code authored by multiple programmers is an open research area.
Open source repositories encourage collaborative programming. Consequently, code
written by multiple authors is widely available. Being able to identify all the
programmers who contributed to collaboratively written source code has practical uses.
One anonymous account could also be used by multiple programmers. Identifying multiple
programmers can aid in understanding who spreads malicious code or introduces
vulnerabilities to repositories.
Another open research question in this area is analyzing authorship in binaries,
the common way that malware spreads. Binary analysis can focus on eliminating
the compiler effect to isolate coding style in the binary to perform malware family
classification, which is directly applicable to security problems.
4.2 Conclusion
Machine learning and natural language processing methods are crucial for
identifying certain privacy and security issues in the era of big data. Being able to
identify problems with these methods comes with an advantage. Usually, addressing
problems with countermeasures requires using similar machine learning and natural
language processing methods. These methods can quantify individuals’ stylistic fin-
gerprints found in any form of textual data. Fingerprints in writing style, in the form
of machine learning features, can be used to de-anonymize authors of anonymous
documents. Similarly, a numeric representation of coding style in source code can
be used to de-anonymize programmers. Stylistic fingerprints can be extracted from
source code, translated text, slang, and highly unstructured text. Stylometric analysis
can also link multiple identities of users across and within different platforms. Despite
all this, staying anonymous is possible. The stylistic features that de-anonymize
individuals are the ones that need to be modified to anonymize style. Anonymization
is the process of rendering stylistic fingerprints insignificant. Anonymization can be
achieved with machine learning and natural language processing methods similar to
the ones de-anonymization uses. De-anonymizing authors enhances security
by detecting the identities of people who perform malicious activities. In turn,
de-anonymization methods aid forensic experts and law enforcement. On the other
hand, de-anonymizing authors is a serious threat to privacy, especially for individuals
who would like to remain anonymous. Nevertheless, developing privacy enhancing
technologies first and foremost depends on identifying privacy infringing techniques.
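To make the notion of a stylistic fingerprint concrete, the sketch below turns a text into a vector of function word frequencies and attributes an unknown text to the most similar known author. It is a deliberately small illustration, not the feature set or classifier used in this work: the ten function words and the cosine similarity nearest-author rule are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Assumed toy feature set; practical stylometry uses far richer features.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "but")


def style_vector(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(unknown_text, known_texts):
    """Return the author whose known writing is stylistically closest."""
    target = style_vector(unknown_text)
    return max(
        known_texts,
        key=lambda author: cosine(target, style_vector(known_texts[author])),
    )
```

The same vector also suggests what anonymization has to change: shifting the frequencies of the features that drive the similarity is what makes the fingerprint less distinctive.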
High level information in text can be extracted with natural language processing
techniques. The locations a person visits, the people she meets, and the times of her
activities can all be automatically extracted. A machine learning classifier can use
the extracted information about topics of discussion, named entities, and semantics
to characterize the privacy behavior of people in social networks. Aspects of human
behavior expressed in language can be characterized and quantified. Such automated
methods make it possible for researchers to study privacy behavior on a large scale.
Understanding privacy behavior is a starting point for developing mechanisms that
preserve privacy.
Bibliography
[1] https://opennlp.apache.org.
[2] The tigress diversifying c virtualizer, http://tigress.cs.arizona.edu.
[3] http://alias-i.com/lingpipe.
[4] Google code jam, https://code.google.com/codejam, 2014.
[5] Stunnix, http://www.stunnix.com/prod/cxxo/, November 2014.
[6] Abbasi, A., and Chen, H. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Trans. Inf.Syst. 26, 2 (2008), 1–29.
[7] Afroz, S. Deception in Authorship Attribution. PhD thesis, Drexel University,December 2013.
[8] Afroz, S., Brennan, M., and Greenstadt, R. Detecting hoaxes, frauds,and deception in writing style online. IEEE Symposium on Security and Privacy(2012).
[9] Afroz, S., Brennan, M., and Greenstadt, R. Detecting hoaxes, frauds,and deception in writing style online. In Security and Privacy (SP), 2012 IEEESymposium on (2012), IEEE, pp. 461–475.
[10] Afroz, S., Caliskan-Islam, A., Stolerman, A., Greenstadt, R., andMcCoy, D. Doppelgänger finder: Taking stylometry to the underground. InIEEE Symposium on Security and Privacy (2014).
[11] Afroz, S., Garg, V., McCoy, D., and Greenstadt, R. Honor amongthieves: A commons analysis of underground forums. In eCrime Researcher’sSummit (2013).
[12] Aiken, A., et al. Moss: A system for detecting software plagiarism. Uni-versity of California–Berkeley. See www. cs. berkeley. edu/aiken/moss. html 9(2005).
[13] Aksoy, S., and Haralick, R. M. Feature normalization and likelihood-based similarity measures for image retrieval. Pattern Recognition Letters 22,5 (2001), 563–582.
146
[14] Almishari, M., Gasti, P., Tsudik, G., and Oguz, E. Privacy-preservingmatching of community-contributed content. In Computer Security–ESORICS2013. Springer, 2013, pp. 443–462.
[15] Almishari, M., and Tsudik, G. Exploring linkability of user reviews. InComputer Security–ESORICS 2012. Springer, 2012, pp. 307–324.
[16] Alvisi, L., Clement, A., Epasto, A., Sapienza, U., Lattanzi, S., andPanconesi, A. Sok: The evolution of sybil defense via social networks. InIEEE security and privacy (2013).
[17] Blei, D. Probabilistic topic models. Communications of the ACM 55, 4 (2012).
[18] Blei, D., Ng, A., and Jordan, M. Latent dirichlet allocation. the Journalof machine Learning research 3 (2003), 993–1022.
[19] Bollen, J., Mao, H., and Pepe, A. Modeling public mood and emotion:Twitter sentiment and socio-economic phenomena. In ICWSM (2011).
[20] Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5–32.
[21] Brennan, M., Afroz, S., and Greenstadt, R. Adversarial stylometry:Circumventing authorship recognition to preserve privacy and anonymity. ACMTransactions on Information and System Security (TISSEC) 15, 3 (2012).
[22] Brennan, M., and Greenstadt, R. Practical attacks against authorshiprecognition techniques. In Proceedings of the Twenty-First Conference on In-novative Applications of Artificial Intelligence (IAAI), Pasadena, CA (2009).
[23] Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., andLai, J. C. Class-based n-gram models of natural language. Computationallinguistics 18, 4 (1992), 467–479.
[24] Burrows, S., and Tahaghoghi, S. M. Source code authorship attributionusing n-grams. In Proc. of the Australasian Document Computing Symposium(2007).
[25] Burrows, S., Uitdenbogerd, A. L., and Turpin, A. Application of infor-mation retrieval techniques for source code authorship attribution. In DatabaseSystems for Advanced Applications (2009), Springer, pp. 699–713.
[26] Calandrino, J. A., Clarkson, W., and Felten, E. W. Bubble trouble:Off-line de-anonymization of bubble forms. In USENIX Security Symposium(2011).
[27] Caliskan, A., and Greenstadt, R. Translate once, translate twice, trans-late thrice and attribute: Identifying authors and machine translation tools intranslated text. In Semantic Computing (ICSC), 2012 IEEE Sixth InternationalConference (2012), IEEE, pp. 121–125.
147
[28] Caliskan-Islam, A., Harang, R., Liu, A., Narayanan, A., Voss, C.,Yamaguchi, F., and Greenstadt, R. De-anonymizing Programmers viaCode Stylometry. In Proceedings of the USENIX Security Symposium (2015).
[29] Caliskan-Islam, A., Walsh, J., and Greenstadt, R. Privacy Detective:Detecting Private Information and Collective Privacy Behavior in a Large SocialNetwork. In Proceedings of the 13th Workshop on Privacy in the ElectronicSociety (2014), ACM, pp. 35–46.
[30] Chairunnanda, P., Pham, N., and Hengartner, U. Privacy: Gone withthe typing! identifying web users by their typing patterns. In Privacy, security,risk and trust (passat), 2011 ieee third international conference on and 2011 ieeethird international conference on social computing (socialcom) (2011), IEEE,pp. 974–980.
[31] Chirgwin, R. Your anonymous code contributions probably aren’t: boffins.The Register, January 2015.
[32] Chow, R., Oberst, I., and Staddon, J. Sanitization’s slippery slope:the design and study of a text revision assistant. In Proceedings of the 5thSymposium on Usable Privacy and Security (2009), ACM, p. 13.
[33] Christakis, N. A., and Fowler, J. H. The spread of obesity in a largesocial network over 32 years. New England journal of medicine 357, 4 (2007),370–379.
[34] Christakis, N. A., and Fowler, J. H. The collective dynamics of smokingin a large social network. New England journal of medicine 358, 21 (2008),2249–2258.
[35] Cohen, J. A Coefficient of Agreement for Nominal Scales. Educational andPsychological Measurement 20, 1 (1960), 37.
[36] Cohen, J. Weighted kappa: Nominal scale agreement provision for scaleddisagreement or partial credit. Psychological bulletin 70, 4 (1968), 213.
[37] Cristofaro, E. D., Soriente, C., Tsudik, G., and Williams, A. Hum-mingbird: Privacy at the time of twitter. In IEEE Symposium on Security andPrivacy (2012), IEEE Computer Society, pp. 285–299.
[38] Danezis, G., and Mittal, P. Sybilinfer: Detecting sybil nodes using socialnetworks. In NDSS (2009).
[39] Ding, H., and Samadzadeh, M. H. Extraction of java program fingerprintsfor software authorship identification. Journal of Systems and Software 72, 1(2004), 49–57.
148
[40] Doctorow, C. State of Adversarial Stylometry: can you change your prose-style? Boing Boing, December 2011.
[41] Eckersley, P. How unique is your web browser? In Privacy EnhancingTechnologies (2010), Springer, pp. 1–18.
[42] Elenbogen, B. S., and Seliya, N. Detecting outsourced student program-ming assignments. Journal of Computing Sciences in Colleges 23, 3 (2008),50–57.
[43] Elsbrock, P. Wer hemingway imitiert, schreibt anonym. Der Spiegel, De-cember 2011.
[44] Forsyth, R. S., and Holmes, D. I. Feature-finding for test classification.Literary and Linguistic Computing 11, 4 (1996), 163–174.
[45] Franklin, J., Paxson, V., Perrig, A., and Savage, S. An inquiry intothe nature and causes of the wealth of internet miscreants. In ACM Conferenceon Computer and Communications Security (CCS) (2007).
[46] Frantzeskou, G., MacDonell, S., Stamatatos, E., and Gritzalis, S.Examining the significance of high-level programming features in source codeauthor classification. Journal of Systems and Software 81, 3 (2008), 447–460.
[47] Frantzeskou, G., Stamatatos, E., Gritzalis, S., Chaski, C. E., andHowald, B. S. Identifying authorship by byte-level n-grams: The source codeauthor profile (scap) method. International Journal of Digital Evidence 6, 1(2007), 1–18.
[48] Frantzeskou, G., Stamatatos, E., Gritzalis, S., and Katsikas, S. Ef-fective identification of source code authors using byte-level information. In Pro-ceedings of the 28th International Conference on Software Engineering (2006),ACM, pp. 893–896.
[49] Freeman, D. M. Using naive bayes to detect spammy names in social net-works. In Proceedings of the 2013 ACM workshop on Artificial intelligence andsecurity (2013), ACM, pp. 3–12.
[50] Freund, Y., Schapire, R. E., et al. Experiments with a new boostingalgorithm. In ICML (1996), vol. 96, pp. 148–156.
[51] Gill, A. J., Vasalou, A., Papoutsi, C., and Joinson, A. N. Privacydictionary: a linguistic taxonomy of privacy for content analysis. In Proceedingsof the 2011 annual conference on Human factors in computing systems (2011),ACM, pp. 3227–3236.
149
[52] Goga, O., Lei, H., Parthasarathi, S. H. K., Friedland, G., Sommer,R., Renata, T., et al. Exploiting innocuous activity for correlating usersacross sites. In WWW’13 Proceedings of the 22nd international conference onWorld Wide Web (2013).
[53] Gray, A., Sallis, P., and MacDonell, S. Software forensics: Extendingauthorship analysis techniques to computer programs.
[54] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P.,and Witten, I. H. The weka data mining software: an update. SIGKDDExplor. Newsl. 11 (November 2009), 10–18.
[55] Hart, M., Manadhata, P., and Johnson, R. Text classification for dataloss prevention. In Privacy Enhancing Technologies (2011), Springer, pp. 18–37.
[56] Hayes, J. H., and Offutt, J. Recognizing authors: an examination of theconsistent programmer hypothesis. Software Testing, Verification and Reliabil-ity 20, 4 (2010), 329–356.
[57] Hedegaard, S., and Simonsen, J. G. Lost in translation: Authorshipattribution using frame semantics. Proceedings of the 49th Annual Meeting ofthe Association for Computational Linguistics (2011).
[58] Johnson, P. CSI computer science: Your coding style can give you away.ITWorld, January 2015.
[59] Juola, P., Sofko, J., and Brennan, P. A prototype for authorship attri-bution studies. Literary and Linguistic Computing 21, 2 (2006), 169–178.
[60] Kim, D., Motoyama, M., Voelker, G., and Saul, L. Topic modeling offreelance job postings to monitor web service abuse. In Proceedings of the 4thACM workshop on Security and artificial intelligence (2011), ACM, pp. 11–20.
[61] Koppel, M., Schler, J., and Argamon, S. Computational methods in au-thorship attribution. Journal of the American Society for Information Scienceand Technology 60, 1 (2009), 9–26.
[62] Koppel, M., Schler, J., and Zigdon, K. Automatically determining ananonymous author’s native language. In Intelligence and Security Informat-ics, P. Kantor, G. Muresan, F. Roberts, D. Zeng, F.-Y. Wang, H. Chen, andR. Merkle, Eds., vol. 3495 of Lecture Notes in Computer Science. Springer Berlin/ Heidelberg, 2005, pp. 41–76.
[63] Koppel, M., and Winter, Y. Determining if two documents are by thesame author. Journal of the American Society for Information Science andTechnology (2013).
150
[64] Kothari, J., Shevertalov, M., Stehle, E., and Mancoridis, S. Aprobabilistic approach to source code authorship identification. In InformationTechnology, 2007. ITNG’07. Fourth International Conference on (2007), IEEE,pp. 243–248.
[65] Krsul, I., and Spafford, E. H. Authorship analysis: Identifying the authorof a program. Computers & Security 16, 3 (1997), 233–257.
[66] Landis, J. R., Koch, G. G., et al. The measurement of observer agreementfor categorical data. biometrics 33, 1 (1977), 159–174.
[67] Lange, R. C., and Mancoridis, S. Using code metric histograms andgenetic algorithms to perform author identification for software forensics. InProceedings of the 9th Annual Conference on Genetic and Evolutionary Com-putation (2007), ACM, pp. 2082–2089.
[68] Lau, J. H., Collier, N., and Baldwin, T. On-line trend analysis withtopic models: #twitter trends detection topic model online. In COLING (2012),pp. 1519–1534.
[69] Li, J., Zheng, R., and Chen, H. From fingerprint to writeprint. Commun.ACM 49 (April 2006), 76–82.
[70] Liu, K., and Terzi, E. A framework for computing the privacy scores ofusers in online social networks. ACM Transactions on Knowledge Discoveryfrom Data (TKDD) 5, 1 (2010), 6.
[71] Luyckx, K., and Daelemans, W. Authorship attribution and verificationwith many authors and limited data. In Proceedings of the 22nd InternationalConference on Computational Linguistics - Volume 1 (Stroudsburg, PA, USA,2008), COLING ’08, Association for Computational Linguistics, pp. 513–520.
[72] MacDonell, S. G., Gray, A. R., MacLennan, G., and Sallis, P. J.Software forensics for discriminating between program authors using case-basedreasoning, feedforward neural networks and multiple discriminant analysis. InNeural Information Processing, 1999. Proceedings. ICONIP’99. 6th Interna-tional Conference on (1999), vol. 1, IEEE, pp. 66–71.
[73] Mao, H., Shuai, X., and Kapadia, A. Loose tweets: an analysis of privacyleaks on twitter. In Proceedings of the 10th annual ACM workshop on Privacyin the electronic society (2011), ACM, pp. 1–12.
[74] McCallum, A. K. Mallet: A machine learning for language toolkit.
151
[75] McCoy, D., Pitsillidis, A., Jordan, G., Weaver, N., Kreibich, C.,Krebs, B., Voelker, G. M., Savage, S., and Levchenko, K. Phar-maleaks: Understanding the business of online pharmaceutical affiliate pro-grams. In Proceedings of the 21st USENIX conference on Security symposium(2012), USENIX Association, pp. 1–1.
[76] McDonald, A. W., Afroz, S., Caliskan, A., Stolerman, A., andGreenstadt, R. Use fewer instances of the letter “i”: Toward writing styleanonymization. In Privacy Enhancing Technologies. Springer Berlin Heidelberg,2012, pp. 299–318.
[77] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimationof word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[78] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J.Distributed representations of words and phrases and their compositionality. InAdvances in Neural Information Processing Systems (2013), pp. 3111–3119.
[79] Motoyama, M., McCoy, D., Levchenko, K., Savage, S., and Voelker,G. An analysis of underground forums. In Proceedings of the 2011 ACM SIG-COMM conference on Internet measurement conference (2011), ACM, pp. 71–80.
[80] Narayanan, A., Paskov, H., Gong, N., Bethencourt, J., Stefanov,E., Shin, R., and Song, D. On the feasibility of internet-scale author identi-fication. In Proceedings of the 33rd conference on IEEE Symposium on Securityand Privacy (2012), IEEE.
[81] Narayanan, A., Paskov, H., Gong, N. Z., Bethencourt, J., Stefanov,E., Shin, E. C. R., and Song, D. On the feasibility of internet-scale authoridentification. In Security and Privacy (SP), 2012 IEEE Symposium on (2012),IEEE, pp. 300–314.
[82] Narayanan, A., and Shmatikov, V. Robust de-anonymization of largesparse datasets. In IEEE Symposium on Security and Privacy (2008), IEEE,pp. 111–125.
[83] Narayanan, A., and Shmatikov, V. De-anonymizing social networks. In30th IEEE Symposium on Security and Privacy (2009), IEEE, pp. 173–187.
[84] Overdorf, R., Dutko, T., and Greenstadt, R. Blogs and twitter feeds:A stylometric environmental impact study. 2014.
[85] Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N.,and Smith, N. A. Improved part-of-speech tagging for online conversationaltext with word clusters. In Proceedings of NAACL-HLT (2013), pp. 380–390.
152
[86] Paganini, P. Stylometric analysis to track anonymous users in the under-ground. Security Affairs blog, January 2013.
[87] Pauli, D. Linguistics identifies anonymous users. SC Magazine (Australianedition), January 2013.
[88] Pellin, B. N. Using classification techniques to determine source code author-ship. White Paper: Department of Computer Science, University of Wisconsin(2000).
[89] Pennebaker, J. W., Francis, M. E., and Booth, R. J. Linguistic inquiryand word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates (2001).
[90] Pennington, J., Socher, R., and Manning, C. D. Glove: Global vec-tors for word representation. Proceedings of the Empiricial Methods in NaturalLanguage Processing (EMNLP 2014) 12 (2014).
[91] Peretti, K. Data breaches: what the underground world of carding reveals.Santa Clara Computer & High Tech. LJ 25 (2008), 375.
[92] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P. Howunique and traceable are usernames? In Privacy Enhancing Technologies(2011), Springer, pp. 1–17.
[93] Pike, R. The sherlock plagiarism detector, 2011.
[94] Platt, J. C. Sequential minimal optimization: A fast algorithm for trainingsupport vector machines. Advances in Kernel Methods Support Vector Learning208, MSR-TR-98-14 (1998), 1–21.
[95] Prechelt, L., Malpohl, G., and Philippsen, M. Finding plagiarismsamong a set of programs with jplag. J. UCS 8, 11 (2002), 1016.
[96] Qian, T., and Liu, B. Identifying multiple userids of the same author. InEMNLP 2013 (2013).
[97] Quinlan, J. Induction of decision trees. Machine learning 1, 1 (1986), 81–106.
[98] Rao, J., and Rohatgi, P. Can pseudonymity really guarantee privacy?In IN PROCEEDINGS OF THE NINTH USENIX SECURITY SYMPOSIUM(2000), pp. 85–96.
[99] Rao, J., Rohatgi, P., et al. Can pseudonymity really guarantee privacy.In Proceedings of the Ninth USENIX Security Symposium (2000), pp. 85–96.
[100] Ritter, A., Clark, S., Etzioni, O., et al. Named entity recognition intweets: an experimental study. In Proceedings of the Conference on EmpiricalMethods in Natural Language Processing (2011), Association for ComputationalLinguistics, pp. 1524–1534.
153
[101] Rosenblum, N., Zhu, X., and Miller, B. Who wrote this code? identifyingthe authors of program binaries. Computer Security–ESORICS 2011 (2011),172–189.
[102] Savage, S., Voelker, G. M., Fowler, J., et al. Beyond technical security:Developing an empirical basis for socio-economic prespectives. http://www.sysnet.ucsd.edu/frontier/proposal.pdf, 2012.
[103] Schein, A. I., Caver, J. F., Honaker, R. J., and Martell, C. H. Authorattribution evaluation with novel topic cross-validation. In Proceedings of theInternational Conference on Knowledge Discovery and Information Retrieval(October 2010), pp. 206–215.
[104] Schmid, H. Improvements in part-of-speech tagging with an application togerman. In In Proceedings of the ACL SIGDAT-Workshop (1995), Citeseer.
[105] Shevertalov, M., Kothari, J., Stehle, E., and Mancoridis, S. On theuse of discretized source code metrics for author identification. In Search BasedSoftware Engineering, 2009 1st International Symposium on (2009), IEEE,pp. 69–78.
[106] Sleeper, M., Cranshaw, J., Kelley, P. G., Ur, B., Acquisti, A., Cra-nor, L. F., and Sadeh, N. i read my twitter the next morning and was as-tonished: a conversational perspective on twitter regrets. In Proceedings of the2013 ACM annual conference on Human factors in computing systems (2013),ACM, pp. 3277–3286.
[107] Spafford, E. H., and Weeber, S. A. Software forensics: Can we track codeto its authors? Computers & Security 12, 6 (1993), 585–595.
[108] Stolerman, A., Caliskan, A., and Greenstadt, R. From language tofamily and back: Native language and language family identification from en-glish text. In HLT-NAACL (2013), pp. 32–39.
[109] Stolerman, A., Overdorf, R., Afroz, S., and Greenstadt, R. Clas-sify, but verify: Breaking the closed-world assumption in stylometric authorshipattribution. In IFIP Working Group 11.9 on Digital Forensics (2014), IFIP.
[110] Suresh, V., Krishnamurthy, A., Badrinath, R., and Veni Madhavan,C. A stylometric study and assessment of machine translators. In Advancesin Intelligent Data Analysis X, J. Gama, E. Bradley, and J. Hollmén, Eds.,vol. 7014 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg,2011, pp. 364–375.
[111] Thomas, K., Grier, C., and Nicol, D. M. unfriendly: Multi-party privacyrisks in social networks. In Privacy Enhancing Technologies (2010), M. J. Atal-lah and N. J. Hopper, Eds., vol. 6205 of Lecture Notes in Computer Science,Springer, pp. 236–252.
154
[112] Thomas, K., McCoy, D., Grier, C., Kolcz, A., and Paxson, V. Traf-ficking fraudulent accounts: the role of the underground market in twitter spamand abuse. In USENIX Security Symposium (2013).
[113] Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedingsof the 2003 Conference of the North American Chapter of the Association forComputational Linguistics on Human Language Technology-Volume 1 (2003),Association for Computational Linguistics, pp. 173–180.
[114] Vasalou, A., Gill, A. J., Mazanderani, F., Papoutsi, C., and Join-son, A. Privacy dictionary: A new resource for the automated content analysisof privacy. Journal of the American Society for Information Science and Tech-nology 62, 11 (2011), 2095–2105.
[115] Wang, Y., Norcie, G., Komanduri, S., Acquisti, A., Leon, P. G., andCranor, L. F. "i regretted the minute i pressed share": A qualitative study ofregrets on facebook. In Proceedings of the Seventh Symposium on Usable Privacyand Security (New York, NY, USA, 2011), SOUPS ’11, ACM, pp. 10:1–10:16.
[116] Wayman, J., Orlans, N., Hu, Q., Goodman, F., Ulrich, A., and Va-lencia, V. Technology assessment for the state of the art biometrics excellenceroadmap. http://www.biometriccoe.gov/SABER/index.htm, March 2009.
[117] Wikipedia. Saeed Malekpour, 2014. [Online; accessed 04-November-2014].
[118] Xue, Z., Yin, D., Davison, B. D., and Davison, B. Normalizing microtext.In Analyzing Microtext (2011).
[119] Yamaguchi, F., Golde, N., Arp, D., and Rieck, K. Modeling and discov-ering vulnerabilities with code property graphs. In Proc of IEEE Symposiumon Security and Privacy (S&P) (2014).
[120] Yamaguchi, F., Wressnegger, C., Gascon, H., and Rieck, K. Chucky:Exposing missing checks in source code for vulnerability discovery. In Proceed-ings of the 2013 ACM SIGSAC Conference on Computer & CommunicationsSecurity (2013), ACM, pp. 499–510.
[121] Yip, M., Shadbolt, N., and Webber, C. Structural analysis of onlinecriminal social networks. In Intelligence and Security Informatics (ISI), 2012IEEE International Conference on (2012), IEEE, pp. 60–65.
[122] Yip, M., Shadbolt, N., and Webber, C. Why forums? an empiricalanalysis into the facilitating factors of carding forums. In ACM Web Science2013 (2013).
155
[123] Zhuge, J., Holz, T., Song, C., Guo, J., Han, X., and Zou, W. Studyingmalicious websites and the underground economy on the chinese web. ManagingInformation Risk and the Economics of Security (2009), 225–244.