7/30/2019 Deceiving Authorship Detection
Deceiving Authorship Detection
Tools to Write Anonymously & Current Trends in Adversarial Stylometry.
Michael Brennan, Sadia Afroz and Rachel Greenstadt. Drexel University.

Privacy, Security and Automation Lab
Faculty
  Dr. Rachel Greenstadt
Graduate Students
  Sadia Afroz (Deception Detection Lead)
  Diamond Bishop
  Michael Brennan
  Aylin Caliskan
  Ariel Stolerman (JStylo Lead Developer)
Undergraduate Students
  Pavan Kantharaju
  Andrew McDonald (Anonymouth Lead Developer)

26C3 / 28C3 Diff
Review & Updated Analysis of 26C3 Material
  New Corpus (45 authors)
  New Method (Writeprints)
  Much more robust results.
The tools we discussed are now built!
  JStylo
  Anonymouth
Detecting Deception in Adversarial Writing

An Overview
What is Authorship Recognition and Adversarial Stylometry?
What is the anonymity threat?
Analyzing & Deceiving Authorship Recognition
Two Tools
  JStylo
  Anonymouth
Detecting Deception

What is Authorship Recognition?
The basic question: who wrote this document?
Stylometry: The study of attributing authorship to documents based only on the linguistic style they exhibit.
  Linguistic Style Features: sentence length, word choices, syntactic structure, etc.
  Handwriting, content-based features, and contextual features are not considered.
Individuals have unique writing styles because language is learned on an individual basis.
In this presentation, stylometry and authorship recognition are used interchangeably.

What is Adversarial Stylometry?
Adversarial Stylometry: Applying deception to writing style in order to affect the outcome of stylometric analysis.
But, is writing style modifiable? (Yes!)
Is it possible to deceive stylometry through altered writing style? (Yes!)
What are the implications of looking at stylometry in an adversarial context?

How Can Stylometry be a Threat?
Supervised Stylometry
  Given a set of documents of known authorship, classify a document of unknown authorship.
  Hypothetical Scenario: Alice the Anonymous Blogger vs. Bob the Abusive Employer.
Unsupervised Stylometry
  Given a set of documents of unknown authorship, cluster them into author groups.
  Hypothetical Scenario: Anonymous Forum vs. Oppressive Government.
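
The supervised setting can be illustrated with a minimal sketch. It is not one of the methods evaluated in this talk: the function-word list, the function names, and the nearest-profile rule are all assumptions chosen to keep the example short. Each known author gets a profile of function-word frequencies, and the unknown document is assigned to the closest profile.

```python
from collections import Counter
import math

# Hypothetical, tiny function-word list; real systems use far more features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text):
    """Relative frequency of each function word (whitespace tokens, no punctuation handling)."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def attribute(unknown_text, known_docs):
    """known_docs maps author -> list of texts; returns the author with the nearest profile."""
    target = profile(unknown_text)
    best_author, best_dist = None, float("inf")
    for author, texts in known_docs.items():
        d = distance(target, profile(" ".join(texts)))
        if d < best_dist:
            best_author, best_dist = author, d
    return best_author
```

In the unsupervised setting the same profiles would be clustered instead of compared against labeled authors.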

Purely Hypothetical?
Previous examples are purely hypothetical. What about a real example?
From Inside WikiLeaks by Daniel Domscheit-Berg:
  "I nudged Julian with my foot. We exchanged glances and started giggling. If someone had run WikiLeaks documents through such a program, he would have discovered that the same two people were behind all the various press releases, document summaries, and correspondence issued by the project. The official number of volunteers we had was also, to put it mildly, grotesquely exaggerated."

Adversarial Stylometry: A Review
Understand the threat model.
Build a corpus.
Evaluate current methods of stylometry against adversarial text.
Analyze results and develop tools.

Circumvention Methods
Challenge: conceive methods of circumventing writing style analysis.
Obfuscation
  An author attempts to write a document in such a way that their personal writing style will not be recognized.
Imitation
  An author attempts to write a document such that the writing style will be recognized as that of another specific author.
Translation*:
  Machine translation is used to translate a document to one or more languages and then back to the original language.
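
The translation method amounts to chaining translation steps and ending back in the original language. The sketch below shows only that chaining. The word-table "translators" are toy stand-ins invented for illustration; a real experiment would plug in a machine translation service in their place.

```python
def make_word_translator(table):
    """Toy stand-in for a machine translation step: word-by-word lookup (hypothetical)."""
    def translate(text):
        return " ".join(table.get(w, w) for w in text.split())
    return translate

def round_trip(text, hops):
    """Pass text through each translation step in order, e.g. en->de then de->en."""
    for translate in hops:
        text = translate(text)
    return text

# Toy tables: the round trip returns a different word choice than the author made.
en_to_de = make_word_translator({"big": "gross", "house": "haus"})
de_to_en = make_word_translator({"gross": "large", "haus": "house"})
```

Here `round_trip("a big house", [en_to_de, de_to_en])` comes back as "a large house": the content survives but a word choice, one of the style features, has changed.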

Building a Corpus
Corpus = Dataset of documents.
Datasets for adversarial stylometry do not exist. Participants are required to craft intentionally adversarial passages.
Participation had three parts:
  Submit 6500 words of pre-existing writing from a formal source.
  Write a new 500 word obfuscation passage.
    Task: Describe your neighborhood.
  Write a new 500 word imitation passage.
    Task: Imitate Cormac McCarthy, describe your day.
Authors had no formal training or knowledge in linguistics or stylometry.

Brennan-Greenstadt Corpus
12 Individual Authors.
Participants contacted through classes, colleagues, friends at Drexel University.
  Motive for proper participation.
  One-on-one interaction with participants.
Corpus is publicly available at https://psal.cs.drexel.edu
Good for preliminary results, but we need something better.
  Too small.
  Too homogeneous.

Building a Better Corpus with Amazon Mechanical Turk
Drexel AMT Corpus
  AMT = Amazon Mechanical Turk
  Same tasks as previous corpus.
  Only 45 of 101 submissions are usable!
45 Accepted Submissions.
Guidelines without spoiling dataset. Must follow directions and:
  Pre-existing writing must be formal in nature
  Remove non-content
  Minimal dialogue/quotations
  Refrain from submitting: small samples, lab reports, Q&As, etc.
Released today. Publicly available at https://psal.cs.drexel.edu
This corpus is large, diverse, and unique.

Original vs. AMT Corpus
AMT Corpus evaluated just as strongly as Drexel.
9-Features does worse, Synonym does the same, Writeprints does better.
[Figure: three panels (9-Feature, Synonym, Writeprints), each plotting a 0 to 1 score against Number of Authors (2, 3, 4, 8) for the Brennan-Greenstadt corpus, the AMT corpus, and Random.]

Evaluate Stylometry Methods Against the Corpus
Three methods of Stylometry
  9-Feature / Neural Network
  Synonym-Based Approach
  Writeprints / SVM

Method 1: 9-Feature Set Neural Network
Simple stylometric approach. Demonstrates potential effectiveness with a small number of obscure metrics.
9-Feature Set
  Unique words, Complexity, Sentence Count, Average Sentence Length, Average Syllable Count, Character Count, Letter Count, Gunning-Fog Readability Index, Flesch Reading Ease Score.
Neural Network Classifier.
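
The nine features are simple enough to compute in a few lines. The following is a rough sketch, not the code used for the experiments. It assumes that "Complexity" means the ratio of unique words to total words, that syllables can be approximated by counting vowel groups, and that the readability scores use the standard published Gunning-Fog and Flesch formulas. The function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def nine_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    syllables = [count_syllables(w) for w in words]
    unique = len({w.lower() for w in words})
    complex_words = sum(1 for s in syllables if s >= 3)  # 3+ syllables, per Gunning-Fog
    avg_sentence_len = len(words) / n_sent
    avg_syllables = sum(syllables) / n_words
    return {
        "unique_words": unique,
        "complexity": unique / n_words,  # assumed definition: unique / total words
        "sentence_count": len(sentences),
        "avg_sentence_length": avg_sentence_len,
        "avg_syllable_count": avg_syllables,
        "character_count": len(text),
        "letter_count": sum(c.isalpha() for c in text),
        "gunning_fog": 0.4 * (avg_sentence_len + 100 * complex_words / n_words),
        "flesch_reading_ease": 206.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables,
    }
```

The resulting nine numbers per document would then be fed to a classifier such as the neural network named above.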

Method 2: Synonym-Based Approach
Examines word choices when compared to available synonyms and frequency of use.
Clark & Hannon, 2007.
Good demonstration of single feature type stylometry.

Method 3: Writeprints (SVM)
Based on the Writeprints approach by Abbasi & Chen, 2008.
Writeprints Baseline Feature Set.
  Contains hundreds of features including character and word n-grams, function words, parts-of-speech tags, punctuation, and character level metrics.
Support Vector Machine Classifier
  Standard for multi-class classification in stylometry.
Implementation of the full Writeprints approach uses a more extensive feature set and unique classification approach.
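
To give a feel for what such a feature vector looks like, here is a much-reduced sketch covering three of the feature families named above: character n-grams, function words, and punctuation. It is an illustration under assumptions, not the Writeprints baseline set itself. The function-word list, the choice of bigrams, and the function names are hypothetical, and the SVM that would consume the vector is omitted.

```python
from collections import Counter
import string

# Illustrative function-word list (assumption); the real set is far larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "it", "with")
PUNCTUATION = ",.;:!?"

def char_ngrams(text, n):
    """Count every run of n consecutive characters."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def feature_vector(text, top_bigrams):
    """Concatenate bigram, function-word, and punctuation frequencies."""
    lowered = text.lower()
    words = lowered.split()
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    bigrams = char_ngrams(lowered, 2)
    total_bigrams = max(sum(bigrams.values()), 1)
    vec = [bigrams[b] / total_bigrams for b in top_bigrams]
    word_counts = Counter(w.strip(string.punctuation) for w in words)
    vec += [word_counts[w] / n_words for w in FUNCTION_WORDS]
    vec += [text.count(p) / n_chars for p in PUNCTUATION]
    return vec
```

Every document is mapped to a vector of the same length, which is what a multi-class classifier needs as input.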

Detecting Authorship
Four experiments.
  Baseline: Evaluate the non-adversarial passages to establish a baseline accuracy for each method and feature set.
  Obfuscation: Attempt classification of the obfuscation passages when classifiers are trained on the baseline data.
  Imitation: Attempt classification of the imitation passages.
  Imitation Success: Analyze the number of successful imitations.
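
The scores behind these four experiments can be sketched as follows. This is a simplified reading, not the evaluation code: the fraction of correct attributions stands in for the reported score, imitation success is taken to be the share of imitation passages attributed to the imitated author, and the "Random" line is assumed to be 1 divided by the number of candidate authors. Function names are hypothetical.

```python
def accuracy(predictions, true_authors):
    """Fraction of passages attributed to their real author."""
    correct = sum(p == t for p, t in zip(predictions, true_authors))
    return correct / len(true_authors)

def imitation_success(predictions, target_author):
    """Fraction of imitation passages attributed to the imitated author."""
    return sum(p == target_author for p in predictions) / len(predictions)

def random_chance(num_authors):
    """Expected accuracy of guessing uniformly among the candidate authors."""
    return 1 / num_authors
```

For obfuscation and imitation passages a low `accuracy` is good news for the writer, and for imitation a high `imitation_success` means the target author was successfully framed.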

[Figure: Baseline Precision. Score from 0 to 1 against Number of Authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random.]

[Figure: Obfuscation Precision. Score from 0 to 1 against Number of Authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random.]

[Figure: Imitation Precision. Score from 0 to 1 against Number of Authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random.]

[Figure: Imitation Success (Framing Cormac McCarthy). Score from 0 to 1 against Number of Authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random.]

Two Tools
JStylo: Authorship Recognition Analysis Tool.
Anonymouth: Authorship Recognition Evasion Tool.
Free, Open Source. (GNU GPL)
Alpha releases available today at https://psal.cs.drexel.edu
  Migrating to GitHub soon.

JStylo: The Problem
Stylometry-based research is difficult.
Existing tools are good but limited.
  Weka provides a suite of machine-learning classification tools.
    Not tailored for text analysis: no feature extraction ability.
    Functions better as an API for software development.
  JGAAP has a strong basic toolset for stylometry.
    Limited in running multiple feature sets.
    Strong API. Extendable. Intended to be used this way.
Nuances of stylometry are not easy to grasp.
  Many open research questions related to authorship.
We need an easy-to-use tool that both researchers and non-technical users can understand.

JStylo
JStylo is an authorship recognition analysis tool. It is built upon a framework of:
  JGAAP (Java Graphical Authorship Attribution Program)
  Weka 3 Data Mining Software
Features
  Two existing adversarial corpora, featured in this presentation, and new corpus building functionality.
  Wide selection of writing feature extractors and ability to add new extractors.
  Wide selection of machine learning based classifiers.
  Intuitive GUI.
Alpha Release Available Now: https://psal.cs.drexel.edu

JStylo Demo
(send bug reports, suggestions, [email protected])

JStylo Dev Goals
Wider selection of classification methods and features.
  Writeprints, Synonym-based, more Weka methods.
  Ensemble classifiers, weighted averaging.
  Greater pre- and post-processing options.
Easier to use and understand for non-technical users.
  Adding an online tutorial.
  GUI installs of new feature extractors and classifiers.
Logging and graphing results over multiple experiments.
Visualization of documents, authors, and classifications.

Anonymouth: The Problem
Authorship recognition can be a legitimate threat to privacy and anonymity.
Intuition in changing writing style goes a long way, but may not be enough and may not be sustainable over multiple documents.
  We already see methods that offer some resistance to adversarial passages.
Fully automated text anonymization is an intractable problem.
We need a solution that explains authorship recognition nuances as needed and assists the author in making the most useful changes towards anonymity.

Anonymouth
Anonymouth is an authorship recognition circumvention tool. It is built upon a framework of:
  JStylo (JGAAP & Weka)
  WordNet
Features
  Corpora, feature extractor, and classifier functionality from JStylo.
  Suggestion system for modifying documents to evade authorship detection. Ideal value for each feature is calculated, existence of the features is highlighted, user is assisted in changing them.
  Iterative approach to anonymizing writing style.
  Dictionary / Synonyms / Interactive Editing Console
Alpha Release Available Now: https://psal.cs.drexel.edu
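
One way to picture the suggestion system is as a comparison between a document's current feature values and target values. The sketch below is a guess at the general shape, not Anonymouth's algorithm. It assumes the "ideal value" for a feature is the mean over other authors' documents, and the function name, tolerance parameter, and output format are all hypothetical.

```python
def suggest_changes(doc_features, other_authors_features, tolerance=0.1):
    """Return (feature, current, target, direction) tuples, largest relative gap first.

    Assumption: the target for each feature is the mean over other authors.
    """
    def relative_gap(current, target):
        return abs(target - current) / (abs(target) or 1.0)

    suggestions = []
    for name, current in doc_features.items():
        values = [features[name] for features in other_authors_features]
        target = sum(values) / len(values)
        if relative_gap(current, target) > tolerance:
            direction = "increase" if target > current else "decrease"
            suggestions.append((name, current, target, direction))
    suggestions.sort(key=lambda s: relative_gap(s[1], s[2]), reverse=True)
    return suggestions
```

The author would edit the text, recompute the features, and ask for suggestions again, which matches the iterative approach described above.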

Anonymouth Demo

Anonymouth Challenges
Features are often not independent.
  Increasing the number of complex words will also increase average syllable count.
  Reducing the number of times a specific word occurs will also affect the lexical density.
How can we create an algorithm for anonymity that generates an obfuscated document with minimal effort and without circular feature modification?
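
The first dependency is easy to reproduce. In this small sketch, swapping two short words for longer ones is meant to raise only the complex-word count, yet the average syllable count moves with it. The vowel-group syllable heuristic, the three-syllable threshold for "complex", and the example sentences are assumptions made for illustration.

```python
import re

def syllables(word):
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def measure(text):
    """Return the two coupled features for a piece of text."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = [syllables(w) for w in words]
    return {
        "complex_words": sum(c >= 3 for c in counts),  # 3+ syllables counts as complex
        "avg_syllables": sum(counts) / len(words),
    }

before = measure("We use small words here")
after = measure("We utilize diminutive words here")
```

Comparing `before` and `after` shows both values rising together, so a suggestion that targets one feature can undo progress on another.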

Anonymouth Dev Goals
Streamlined suggestion system.
  Improved automation on applicable features.
  Improved clustering algorithm to provide optimal path to anonymity.
Improved editing interface.
  Increased phrase and word synonym set support.
  Edit by blocks of text, not simply feature-by-feature.
Wider set of features and classification methods.
  Multi-method and feature collection analysis.
Usability and anonymity user studies.

Opening Development
Project will continue to be developed by PSAL at Drexel, but we welcome collaboration and participation.
We are interested in
  Linguistic Experts
  Security Advisors
  UI Experts

Can we detect stylistic deception?
  Regular
  Obfuscated
  Imitated

Detecting stylistic deception is possible
[Figure: bar chart, scale 0 to 100, comparing Writeprint/SVM, Lying-detection/J48, and 9-feature set/J48 on Regular, Imitation, and Obfuscation passages. Bar labels in source order: 98, 85, 89.5, 95.7, 75.3, 59.9, 94.5, 48, 43.]

Feature Changes in
[Figure: bar chart of per-feature changes.]

Feature Changes in Imitated Passages
[Figure: bar chart of per-feature changes in imitated passages.]

Problem with the dataset: Topic Similarity
All the deceptive documents were on the same topic.
Non-content-specific features have the same effect as content-specific features.
[Figure: bar chart with three groups.]

Hemingway-Faulkner Imitation Corpus
Articles from the International Imitation Hemingway Contest (2000-2005)
Articles from the Faux Faulkner Contest (2001-2005)
Original excerpts of Ernest Hemingway and William Faulkner

Deception detection is possible even when the topic is not similar
81.2% accurate in detecting imitated documents.

Long term deception
A Gay Girl In Damascus blog:
  Original author was a 40-year-old American citizen, Thomas MacMaster.
  Pretended to be a Syrian gay woman, Amina Arraf.
  The author worked for at least 5 years to create a new style.

Long term deception is hard to detect
None of the blog posts were found to be deceptive.
But regular authorship recognition can help.
  We tried to attribute authorship of the blog posts using Thomas (as himself), Thomas (as Amina), Britta (Thomas's wife).
  54.3% of the blog posts were attributed to Thomas (as himself)

Recap
Available Now:
  Brennan-Greenstadt Adversarial Stylometry Corpus (12 Authors)
  Drexel AMT Adversarial Stylometry Corpus (45 Authors)
  JStylo Alpha Release
  Anonymouth Alpha Release
Future Work:
  Beta releases of JStylo and Anonymouth
  Academic publication of new results
  Continued analysis of deception detection and short message classification
  Continued research on improving partially automated anonymization

Addendum Slides

Research Questions, Practical Implications.
Our upcoming research questions have substantial practical implications.
How do you anonymize a document sufficiently in a reasonable period of time?
  What is sufficient? What is reasonable?
Can Anonymouth be used to successfully imitate other authors?
Can Anonymouth maintain long-term deception? Can its usage be detected?
JStylo vs. Anonymouth: who wins?
  Based on JStylo, Anonymouth will have everything it needs to help evade detection by the methods it contains.

Two Tools?
Aren't we creating a tool that enables surveillance and de-anonymization?
Anonymouth can't exist without JStylo. But it also shows that you can't necessarily depend on stylometry to assign authorship.
JStylo allows for easier use of authorship recognition tools, but is extensible and open-source. Implementing a method in JStylo will enable counter-attacks in Anonymouth.
JStylo vs. Anonymouth: who wins?
  Based on JStylo, Anonymouth will have everything it needs to help evade detection by the methods it contains.
  Note that nothing prevents others from plugging proprietary stylometric methods into their version of JStylo.