Journal of Computational Science 16 (2016) 1–7
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/jocs

Sifting robotic from organic text: A natural language approach for detecting automation on Twitter

Eric M. Clark a,b,c,d,e,∗, Jake Ryland Williams a,b,c,d, Chris A. Jones e,f,g, Richard A. Galbraith h,i, Christopher M. Danforth a,b,c,d, Peter Sheridan Dodds a,b,c,d

a Department of Mathematics & Statistics, University of Vermont, Burlington, VT 05401, United States
b Vermont Complex Systems Center, University of Vermont, Burlington, VT 05401, United States
c Vermont Advanced Computing Core, University of Vermont, Burlington, VT 05401, United States
d Computational Story Lab, University of Vermont, Burlington, VT 05401, United States
e Department of Surgery, University of Vermont, Burlington, VT 05401, United States
f Global Health Economics Unit of the Vermont Center for Clinical and Translational Science, University of Vermont, Burlington, VT 05401, United States
g Vermont Center for Behavior and Health, University of Vermont, Burlington, VT 05401, United States
h Department of Medicine, University of Vermont, Burlington, VT 05401, United States
i Vermont Center for Clinical and Translational Science, University of Vermont, Burlington, VT 05401, United States

Article history: Received 8 June 2015; Received in revised form 11 October 2015; Accepted 10 November 2015; Available online 19 November 2015

Abstract

Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. Due to the increasing popularity of Twitter, its perceived potential for exerting social influence has led to the rise of a diverse community of automatons, commonly referred to as bots. These inorganic and semi-organic Twitter entities can range from the benevolent (e.g., weather-update bots, help-wanted-alert bots) to the malevolent (e.g., spamming messages, advertisements, or radical opinions). Existing detection algorithms typically leverage metadata (time between tweets, number of followers, etc.) to identify robotic accounts. Here, we present a powerful classification scheme that exclusively uses the natural language text from organic users to provide a criterion for identifying accounts posting automated messages. Since the classifier operates on text alone, it is flexible and may be applied to any textual data beyond the Twittersphere.

© 2015 Elsevier B.V. All rights reserved.

∗ Corresponding author at: University of Vermont, Burlington, VT 05401, United States. E-mail address: [email protected] (E.M. Clark).
http://dx.doi.org/10.1016/j.jocs.2015.11.002
1877-7503/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Twitter has become a mainstream social outlet for the discussion of a myriad of topics through microblogging interactions. Members chiefly communicate via short text-based public messages restricted to 140 characters, called tweets. As Twitter has evolved from a simple microblogging social media interface into a mainstream source of communication for the discussion of current events, politics, and consumer goods/services, it has become increasingly enticing for parties to gamify the system by creating automated software to send messages to organic (human) accounts as a means for personal gain and for influence manipulation [1,2]. The results of sentiment and topical analyses can be skewed by robotic accounts that dilute legitimate public opinion by algorithmically generating vast amounts of inorganic content. Nevertheless, data from Twitter is becoming a source of interest in public health and economic research in monitoring the spread of disease [4,5] and gaining insight into public health trends [6].

In related work [7–10], researchers have built classification algorithms using metadata idiosyncratic to Twitter, including the number of followers, posting frequency, account age, number of user mentions/replies, username length, and number of retweets. However, relying on metadata can be problematic: sophisticated spam algorithms now emulate the daily cycle of human activity and author borrowed content to appear human [7]. Another problematic spam tactic is the renting of accounts of legitimate users (called sponsored accounts) to introduce short bursts of spam and hide under the user's organic metadata to mask the attack [11].

A content based classifier proposed by Chu et al. [19] measures the entropy between Twitter time intervals along with user metadata to classify Twitter accounts, and requires a number of tweets (≥60) comparable to our proposed method for adequate classification accuracy. SentiBot, another content based
classifier [20], utilizes latent Dirichlet allocation (LDA) for topical categorization combined with sentiment analysis techniques to classify individuals as either bots or humans. We note that as these automated entities evolve their strategies, combinations of our proposed methods and the studies previously mentioned may be required to achieve reasonable standards for classification accuracy. Our method classifies accounts solely based upon their linguistic attributes and hence can easily be integrated into these other proposed strategies.

We introduce a classification algorithm that operates using three linguistic attributes of a user's text. The algorithm analyzes:

1. the average URL count per tweet,
2. the average pairwise lexical dissimilarity between a user's tweets,
3. and the word introduction rate decay parameter of the user for various proportions of time-ordered tweets.

We provide detailed descriptions of each attribute in the next section. We then test and validate our algorithm on 1000 accounts which were hand coded as automated or human.

We find that for organic users, these three attributes are densely clustered, but can vary greatly for automatons. We compute the average and standard deviation of each of these dimensions for various numbers of tweets from the human coded organic users in the dataset. We classify accounts by their distance from the averages of each of these attributes. The accuracy of the classifier increases with the number of tweets collected per user. Since this algorithm operates independently from user metadata, robotic accounts do not have the ability to adaptively conceal their identities by manipulating their user attributes algorithmically. Also, since the classifier is built from time ordered tweets, it can determine if a once legitimate user begins demonstrating dubious behavior and spam tactics. This allows social media data-miners to dampen a noisy dataset by weeding out suspicious accounts and focus on purely organic tweets.

2. Data handling

2.1. Data collection

We filtered a 1% sample of Twitter's streaming API (the spritzer feed) for tweets containing geo-spatial metadata spanning the months of April through July in 2014. Since roughly 1% of tweets provided GPS located spatial coordinates, our sample represents nearly all of the tweets from users who enable geotagging. This allows for much more complete coverage of each user's account. From this sample, we collected all of the geo-tweets from the most active 1000 users for classification as human or robot and call this the Geo-Tweet dataset.
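To make this step concrete, here is a minimal sketch (not the authors' pipeline) of the filtering, assuming the stream was archived as newline-delimited tweet JSON; the file name is hypothetical, while 'coordinates', 'user', and 'text' are standard fields of classic tweet JSON.

```python
import json
from collections import Counter, defaultdict

def load_geo_tweets(path):
    """Collect tweet texts per user, keeping only tweets that carry
    GPS coordinates (the 'coordinates' field of classic tweet JSON)."""
    tweets_by_user = defaultdict(list)
    with open(path) as f:
        for line in f:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed records in the raw stream
            if tweet.get("coordinates"):  # geotagged tweets only
                tweets_by_user[tweet["user"]["id"]].append(tweet["text"])
    return tweets_by_user

# Hypothetical dump of the April-July 2014 spritzer sample.
tweets_by_user = load_geo_tweets("spritzer_2014_apr_jul.jsonl")
counts = Counter({u: len(ts) for u, ts in tweets_by_user.items()})
geo_tweet_users = [u for u, _ in counts.most_common(1000)]  # most active 1000
```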

2.2. Social HoneyPots

To place our classifier in the context of recent work, we applied our algorithm to another set of accounts collected from the Social HoneyPot Experiment [12]. This work exacted a more elaborate approach to find automated accounts on Twitter by creating a network of fake accounts (called Devils [13]) that would tweet about trending topics amongst themselves in order to tempt robotic interactions. The experiment was analyzed and compiled into a dataset containing the tweets of "legitimate users" and those classified as "content polluters". We note that the users in this dataset were not hand coded. Accounts that followed the Devil HoneyPot accounts were deemed robots. Their organic users were compiled from a random sample of Twitter, and were only deemed organic because these accounts were not suspended by Twitter at the time. Hence the full HoneyPot dataset can only serve as an estimate of the capability of this classification scheme.

2.3. Human classification of Geo-Tweets

Each of the 1000 users was hand classified separately by two evaluators. All collected tweets from each user were reviewed until the evaluator noticed the presence of automation. If no subsample of tweets appeared to be algorithmically generated, the user was classified as human. The results were merged, and conflicting entries were resolved to produce a final list of user ids and codings. See Fig. 1 for histograms and violin plots summarizing the distributions of each user class. We note that any form of perceived automation was sufficient to deem the account as automated. See Supplementary Material for samples of each of these types of tweets from each user class and a more thorough description of the annotation process.

2.4. Types of users

We consider organic content, i.e. from human accounts, as those that have not tweeted in an algorithmic fashion. We focused on three distinct classes of automated tweeting:

Robots: Tweets from these accounts draw on a strictly limited vocabulary. The messages follow a very structured pattern, many of which are in the form of automated updates. Examples include Weather Condition Update Accounts, Police Scanner Update Accounts, Help Wanted Update Accounts, etc.

Cyborgs: The most covert of the three, these automatons exhibit human-like behavior and messages through loosely structured, generic, automated messages and from borrowed content copied from other sources. Since many malicious cyborgs on Twitter try to market an idea or product, a high proportion of their tweets contain URLs, analogous to spam campaigns studied on Facebook [14]. Messages range from the backdoor advertising of goods and services [15] to those trying to influence social opinion or even censor political conversations [16]. These accounts act like puppets from a central algorithmic puppeteer to push their product on organic users while trying to appear like an organic user [17]. Since these accounts tend to borrow content, they have a much larger vocabulary in comparison to ordinary robots. Due to Twitter's 140 character-per-tweet restriction, some of the borrowed content being posted must be truncated. A notable attribute of many cyborgs is the presence of incomplete messages followed by an ellipsis and a URL. Included in this category are 'malicious promoter' accounts [12] that are radically promoting a business or an idea systematically.

Human Spammers: These are legitimate accounts that abuse an algorithm to post a burst of almost indistinguishable tweets that may differ by a character in order to fool Twitter's spam detection protocols. These messages are directed at a particular user, commonly for a follow request to attempt to increase their social reach and influence.

Although we restrict our focus to the aforementioned classes, we did notice the presence of other subclasses, which we have named "listers" and "quoters", that have both organic and automaton features. Listers are accounts that send their messages to large groups of individuals at once. Quoters are dedicated accounts that are referencing distant passages from literature or song lyrics. Most of the tweets from these accounts are encased in quotations. These accounts also separately tweet organic content. We classified these accounts as human because there was not sufficient evidence suggesting these behaviors were indeed automated.

Fig. 1. The feature distributions of the 1000 hand coded users are summarized with histograms and violin plots. These show the wide variation in automated features versus Organics. Violin plots show the kernel density estimation of each distribution. Using the Organic features, automated entities are identified by exclusion.

3. Methods

3.1. Classification algorithm

The classifier, $C$, takes ordinal samples of tweets of varying number, $s$, from each user, $\mu$, to determine if the user is a human posting strictly organic content or is algorithmically automating tweets:

$$C : \mu_s \rightarrow \{0, 1\} = \{\text{Organic}, \text{Automaton}\}.$$

Although we have classified each automaton into three distinct classes, the classifier is built more simply to detect and separate organic content from automated. To classify the tweets from a user, we measure three distinct linguistic attributes:

1. Average pairwise tweet dissimilarity,
2. Word introduction rate decay parameter,
3. Average number of URLs (hyperlinks) per tweet.

3.2. Average pairwise tweet dissimilarity

Many algorithmically generated tweets contain similar structures with minor character replacements and long chains of common substrings. Purely organic accounts have tweets that are very dissimilar on average. The length of a tweet, $t$, is defined as the number of characters in the tweet and is denoted $|t|$. Each tweet is cleaned by truncating multiple whitespace characters and the metric is performed case insensitively. A sample of $s$ tweets from a particular user is denoted $T_s^{\mu}$. Given a pair of tweets from a particular user, $t_i, t_j \in T_s^{\mu}$, the pairwise tweet dissimilarity, $D(t_i, t_j)$, is given by subtracting the length of the longest common subsequence of both tweets, $|\mathrm{LCS}(t_i, t_j)|$, and then weighting by the sum of the lengths of both tweets:

$$D(t_i, t_j) = \frac{|t_i| + |t_j| - 2\,|\mathrm{LCS}(t_i, t_j)|}{|t_i| + |t_j|}.$$

The average tweet dissimilarity of user $\mu$ for a sample size of $s$ tweets is calculated as:

$$\overline{D}_{\mathrm{lcs}}^{\mu_s} = \frac{1}{(s-1)!} \sum_{t_i, t_j \in T_s^{\mu}} D(t_i, t_j).$$

For example, given the two tweets: $(t_1, t_2)$ = ("I love Twitter", "I love to spam"). Then $|t_1| = |t_2| = 14$, $|\mathrm{LCS}(t_1, t_2)| =$ |"I love t"| $= 8$ (including whitespaces) and we calculate the pairwise tweet dissimilarity as:

$$D(t_1, t_2) = \frac{14 + 14 - 2 \cdot 8}{14 + 14} = \frac{12}{28} = \frac{3}{7}.$$
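As a concrete illustration, the following sketch (our own code, not the authors') computes the pairwise dissimilarity with a standard dynamic-programming LCS; it reproduces the 3/7 of the worked example above.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (O(|a||b|) DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def dissimilarity(t1: str, t2: str) -> float:
    """Pairwise tweet dissimilarity D(t_i, t_j) of Section 3.2:
    whitespace is collapsed and the comparison is case-insensitive."""
    t1 = " ".join(t1.split()).lower()
    t2 = " ".join(t2.split()).lower()
    n = len(t1) + len(t2)
    return (n - 2 * lcs_length(t1, t2)) / n

print(dissimilarity("I love Twitter", "I love to spam"))  # 0.4285... = 12/28 = 3/7
```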

3.3. Word introduction decay rate

Since social robots automate messages, they have a limited and crystalline vocabulary in comparison to organic accounts. Even cyborgs that mask their automations with borrowed content cannot fully mimic the rate at which organic users introduce unique words into their text over time. The word introduction rate is a measure of the number of unique word types introduced over time from a given sample of text [18]. The rate at which unique words are introduced naturally decays over time, and is observably different between automated and organic text. By testing many random word shufflings of a text, we define $\bar{m}_n$ as the average number of words between the $n$th and $(n+1)$st initial unique word type appearances. From [18], the word introduction decay rate, $\alpha(n)$, is given as

$$\alpha(n) = 1/\bar{m}_n \propto n^{-\gamma} \quad \text{for } \gamma > 0.$$

For each user, the scaling exponent of the word introduction decay rate, $\alpha$, is approximated by performing standard linear regression on the last third of the log-transformed tail of the average gap size distribution as a function of word introduction number, $n$ [18]. In Fig. 2, the log transformed rank-unique word gap distribution is given for each individual in the data set. Here the human population (green) is distinctly distributed in comparison to the automatons.

Fig. 2. The rank-unique word gap distribution is plotted on a logscale for each user class.
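A sketch of this estimate, under our reading of the procedure (the exact estimator of [18] may differ in detail): average the gaps between successive first appearances of word types over random shufflings, then regress the log of the decay rate on log n over the last third of the curve.

```python
import random
import numpy as np

def average_gaps(words, shuffles=100, seed=0):
    """m_bar_n: average number of words between the n-th and (n+1)-st
    first appearances of unique word types, over random shufflings."""
    rng = random.Random(seed)
    totals = None
    for _ in range(shuffles):
        w = list(words)
        rng.shuffle(w)
        seen, first_positions = set(), []
        for i, word in enumerate(w):
            if word not in seen:
                seen.add(word)
                first_positions.append(i)
        gaps = np.diff(first_positions).astype(float)
        totals = gaps if totals is None else totals + gaps
    return totals / shuffles

def decay_exponent(words):
    """Fit alpha(n) = 1/m_bar_n ~ n^(-gamma) by linear regression on
    the last third of the log-log curve, following Section 3.3."""
    m_bar = average_gaps(words)
    n = np.arange(1, len(m_bar) + 1)
    tail = slice(2 * len(m_bar) // 3, None)
    slope, _ = np.polyfit(np.log(n[tail]), np.log(1.0 / m_bar[tail]), 1)
    return -slope  # estimate of the decay exponent
```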

3.4. Average URLs per tweet

Hyperlinks (URLs) help automatons spread spam and malware [11,21,22]. A high fraction of tweets from spammers tend to contain some type of URL in comparison to organic individuals, making the average URLs per tweet a valuable attribute for bot classification algorithms [9,23,24]. For each user, the average URL rate is measured by the total number of occurrences of the substring 'http:' within tweets, divided by the total number of tweets authored by the user in the sample of size $s$:

$$\mathrm{URL}_s^{\mu} = \frac{\#\,\text{occurrences of 'http:'}}{\#\,\text{sampled tweets}}.$$
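In code, this feature reduces to a one-liner; a sketch:

```python
def average_url_rate(tweets):
    """Average URLs per tweet: occurrences of the substring 'http:'
    divided by the number of sampled tweets (Section 3.4)."""
    return sum(t.count("http:") for t in tweets) / len(tweets)
```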

3.5. Cross Validation experiment

We perform a standard 10-fold Cross Validation procedure on the 2014 Geo-Tweet data set to measure the accuracy of using each linguistic feature for classifying Organic accounts. We divided individuals into 10 equally sized groups. Then 10 trials are performed where 9 of the 10 groups are used to train the algorithm to classify the final group.

During the calibration phase, we measure each of the three features for every human coded account in the training set. We sequentially collect tweets from each user from a random starting position in time. We record the arithmetic mean and standard deviation of the Organic attributes to classify the remaining group. The classifier distinguishes human from automaton by using a varying threshold, $n$, from the average attribute value computed from the training set. For each attribute, we classify each user as an automaton if their feature falls further than $n$ standard deviations away from the organic mean, for varying $n$.
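A minimal sketch of this decision rule (our paraphrase, with NumPy): fit the organic mean and standard deviation per feature on the training folds, then flag any account whose feature vector strays more than n standard deviations from the organic mean. The array names are illustrative.

```python
import numpy as np

def fit_organic_stats(features, is_organic):
    """Mean and standard deviation of each feature over the
    hand-coded organic accounts in the training folds."""
    organic = features[is_organic]  # features: (accounts, 3) array
    return organic.mean(axis=0), organic.std(axis=0)

def classify(features, mean, std, n):
    """Flag an account as an automaton if any of its three features
    falls further than n standard deviations from the organic mean."""
    return (np.abs(features - mean) > n * std).any(axis=1)
```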

For each trial, the False Positives and True Positives for a varying window size, $n$, are recorded. To compare to other bot-detection strategies, we rate True Positives as the success at which the classifier identifies automatons by exclusion, and False Positives as humans that are incorrectly classified as automatons. The results of the trials for varying tweet sizes are averaged and visualized with a Receiver Operator Characteristic curve (ROC) (see Fig. 3). The accuracy of each experiment is measured as the area under the ROC, or AUC. To benchmark the classifier, a 10-fold Cross Validation was also performed on the HoneyPot tweet-set, which we describe in the following section.

Fig. 3. The receiver operator characteristic curve from the 10-fold Cross Validation Experiment performed on the Geo Tweets collected from April through July 2014. The True Positive (TP), False Positive (FP), and thresholds, N, are averaged across the 10 trials. The accuracies are approximated by the AUCs, which we compute using the trapezoid rule. The points depict the best experimental model thresholding window (N).
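Sweeping the window n then traces out the ROC curve; a sketch of the bookkeeping, using the classify helper above and the trapezoid rule for the AUC (automatons treated as the positive class):

```python
import numpy as np

def roc_auc(features, is_bot, mean, std, windows):
    """True/False Positive rates for each threshold window n, plus the
    trapezoid-rule AUC, with automatons as the positive class."""
    tpr, fpr = [], []
    for n in windows:
        predicted_bot = classify(features, mean, std, n)
        tpr.append((predicted_bot & is_bot).sum() / is_bot.sum())
        fpr.append((predicted_bot & ~is_bot).sum() / (~is_bot).sum())
    order = np.argsort(fpr)  # sort by FPR before integrating
    auc = np.trapz(np.array(tpr)[order], np.array(fpr)[order])
    return np.array(fpr), np.array(tpr), auc
```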

4. Results and discussion

4.1. Geo-Tweet Classification Validation

The ROC curves for the Geo-Tweet 10 fold Cross Validation Experiment for varying tweet bins in Fig. 3 show that the accuracy increases as a function of the number of tweets.

Although the accuracy of the classifier increases with the number of collected tweets, we see in Fig. 4 that within 50 tweets the accuracy of the average of 10 random trials is only slightly lower than a 500 tweet user sample. While this is very beneficial to our task (isolating humans), we note that larger samples see greater returns when one instead wants to isolate spammers, that tweet random bursts of automation.

Fig. 4. Accuracy, computed as the AUC, is plotted as a function of the number of tweets, ranging from 25 to 500. The average True Positive and False Positive Rates over 10 trials are given on twin axes with error bars drawn using the standard error.

4.2. HoneyPot external validation

The classifier was tested on the Social Honeypot Twitter-bot dataset provided by Lee et al. [12]. Results are visualized with a ROC curve in Fig. 5. The averaged optimal threshold for the full English user dataset (blue curve) had a high true positive rate (correctly classified automatons: 86%), but also had a large false positive rate (misclassified humans: 22%).

The Honeypot Dataset relied on Twitter's spam detection protocols to label their randomly collected "legitimate users". Some forms of automation (weather-bots, help-wanted bots) are permitted by Twitter. Other cyborgs that are posting borrowed organic content can fool Twitter's automation criterion. This ill formation of the training set greatly reduces the ability of the classifier to distinguish humans from automatons, since the classifier gets the wrong information about what constitutes a human. To see this, a random sample of 1000 English Honeypot users was hand-coded to mirror the previous experiment. On this smaller sample (black curve in Fig. 5), the averaged optimal threshold accuracy increased to 96%.

Fig. 5. HoneyPot Data Set, 10 fold Cross Validation Performance for users with 200 tweets. The black curve represents the 1000 hand coded HoneyPot users, while the blue curve is the entire English Honeypot dataset. The accuracy increases from 84% to 96%. (For interpretation of reference to color in this figure legend, the reader is referred to the web version of this article.)

4.3. Calibrated classifier performance

We created the thresholding window of the final calibrated classifier using the results from the calibration experiment. We average the optimal parameters from the 10-fold cross validation on the Geo-Tweet dataset from each of the 10 calibration trials for tweet bins ranging from 25 to 500 in increments of 25 tweets. We also average and record the optimal parameter windows, $n_{\mathrm{opt}}$, and their standard deviations, $\sigma_{\mathrm{opt}}$. The standard deviations serve as a tuning parameter to increase the sensitivity of the classifier, by increasing the feature cutoff window ($n$). The results from applying the calibrated classifier to the full set of 1000 users, using 400 tweet bags, are given in Fig. 6. The feature cutoff window (black lines) estimates if the user's content is organic or automated. Human feature sets (True Negatives: 716) are densely distributed with a 4.79% False Positive Rate (i.e., humans classified as robots). The classifier accurately classified 90.32% of the automated accounts and 95.21% of the Organic accounts. See Fig. S1 for a cross sectional comparison of each feature set. We note that future work may apply different methods in statistical classification to optimize these feature sets, and that using these simple cutoffs already leads to a high level of accuracy.

Fig. 6. Calibrated Classifier Performance on the 1000 User Geo-Tweet Dataset. Correctly classified humans (True Negatives) are coded in green, while correctly identified automatons (True Positives) are coded in red. The 400 tweet average optimal thresholds from the cross validation experiment designate the thresholding for each feature. The black lines demonstrate each feature cutoff. (For interpretation of reference to color in this figure legend, the reader is referred to the web version of this article.)

5. Conclusion

Using a flexible and transparent classification scheme, we have demonstrated the potential of using linguistic features as a means of classifying automated activity on Twitter. Since these features do not use the metadata provided by Twitter, our classification scheme may be applicable outside of the Twittersphere. Future work can extend this analysis multilingually and incorporate additional feature sets with an analogous classification scheme. URL content can also be more deeply analyzed to identify organic versus SPAM related hyperlinks.

We note the potential for future research to investigate and to distinguish between each sub-class of automaton. We formed our taxonomy according to the different modes of text production. Our efforts were primarily focused on separating any form of automation from organic, human content. In doing so we recognized three distinct classes of these types of automated accounts. However, boundary cases (e.g., cyborg-spammers, robot-spammers, robotic-cyborgs, etc.) along with other potential aforementioned subclasses (e.g., listers, quoters, etc.) can limit the prowess of our current classification scheme tailored towards these subclasses. We have shown that human content is distinctly different from these forms of automation, and that for a binary classification of automated or human, these features have a very reasonable performance with our proposed algorithm.

Our study distinguishes itself by focusing on automated behavior that is tolerated by Twitter, since both types of inorganic content can skew the results of sociolinguistic analyses. This is particularly important, since Twitter has become a possible outlet for health economics [4] research including monitoring patient satisfaction and modeling disease spread [25,3]. Monitoring excessive social media marketing of electronic nicotine delivery systems (also known as e-cigarettes), discussed in [3,26], makes classifying organic and automated activity relevant for research that can benefit policy-makers regarding public health agendas. Isolating organic content on Twitter can help dampen noisy data-sets and is pertinent for research involving social media data and other linguistic data sources where a mixture of humans and automatons exist.

In health care, a cardinal problem with the use of electronic medical records is their lack of interoperability. This is compounded by a lack of standardization and use of data dictionaries, which results in a lack of precision concerning our ability to collate signs, symptoms, and diagnoses. The use of millions or billions of tweets concerning a given symptom or diagnosis might help to improve that precision. But it would be a major setback if the insertion of data tweeted from automatons would obscure useful interpretation of such data. We hope that the approaches we have outlined in the present manuscript will help alleviate such problems.

Acknowledgments

The authors wish to acknowledge the Vermont Advanced Computing Core which provided High Performance Computing resources contributing to the research results. EMC and JRW were supported by the UVM Complex Systems Center; PSD was supported by NSF CAREER Award #0846668. CMD and PSD were also supported by a grant from the MITRE Corporation and NSF grant #1447634. CJ is supported in part by the National Institutes of Health (NIH) Research Awards R01DA014028 & R01HD075669, and by the Center of Biomedical Research Excellence Award P20GM103644 from the National Institute of General Medical Sciences.

Appendix A. Supplementary Data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jocs.2015.11.002.

References

[1] V.S. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Lerman, L. Zhu, E. Ferrara, A. Flammini, F. Menczer, arXiv:1601.05140, 2016. URL http://arxiv.org/pdf/1601.05140v1.pdf.
[2] D. Harris, Can evil data scientists fool us all with the world's best spam?, 2013. goo.gl/psEguf.
[3] E.M. Clark, C. Jones, J.R. Williams, A.N. Kurti, M.C. Norotsky, C.M. Danforth, P.S. Dodds, arXiv:1508.01843, 2015. URL http://arxiv.org/abs/1508.01843.
[4] A. Sadilek, H.A. Kautz, V. Silenzio, ICWSM, 2012.
[5] A. Wagstaff, A.J. Culyer, J. Health Econ. 31 (2012) 406.
[6] L. Mitchell, M.R. Frank, K.D. Harris, P.S. Dodds, C.M. Danforth, PLOS ONE 8 (2013) e64417.
[7] E. Ferrara, O. Varol, C. Davis, F. Menczer, A. Flammini, CoRR abs/1407.5225, 2014. URL http://arxiv.org/abs/1407.5225.
[8] F. Benevenuto, G. Magno, T. Rodrigues, V. Almeida, Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), 2010.
[9] Z. Chu, S. Gianvecchio, H. Wang, S. Jajodia, Proceedings of the 26th Annual Computer Security Applications Conference (ACSAC'10), ACM, New York, NY, USA, 2010, pp. 21–30. http://dx.doi.org/10.1145/1920261.1920265. ISBN 978-1-4503-0133-6.
[10] C.M. Zhang, V. Paxson, Proceedings of the 12th International Conference on Passive and Active Measurement (PAM'11), Springer-Verlag, Berlin, Heidelberg, 2011, pp. 102–111. http://dl.acm.org/citation.cfm?id=1987510.1987521. ISBN 978-3-642-19259-3.
[11] K. Thomas, C. Grier, D. Song, V. Paxson, Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference (IMC'11), ACM, New York, NY, USA, 2011, pp. 243–258. http://dx.doi.org/10.1145/2068816.2068840. ISBN 978-1-4503-1013-0.
[12] K. Lee, B.D. Eoff, J. Caverlee, AAAI International Conference on Weblogs and Social Media (ICWSM), 2011.
[13] K. Lee, B.D. Eoff, J. Caverlee, in: W.W. Cohen, S. Gosling (Eds.), ICWSM, The AAAI Press, 2010. URL http://dblp.uni-trier.de/db/conf/icwsm/icwsm2010.html#LeeEC10.
[14] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, B.Y. Zhao, Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC'10), ACM, New York, NY, USA, 2010, pp. 35–47. http://dx.doi.org/10.1145/1879141.1879147. ISBN 978-1-4503-0483-2.
[15] J. Huang, R. Kornfield, G. Szczypka, S.L. Emery, Tobacco Control 23 (2014) iii26.
[16] K. Thomas, C. Grier, V. Paxson, Presented as part of the 5th USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET'12), USENIX, Berkeley, CA, 2012. URL https://www.usenix.org/conference/leet12/adapting-social-spam-infrastructure-political-censorship.
[17] X. Wu, Z. Feng, W. Fan, J. Gao, Y. Yu, Machine Learning and Knowledge Discovery in Databases, in: H. Blockeel, K. Kersting, S. Nijssen, F. Železný (Eds.), Lecture Notes in Computer Science, vol. 8190, Springer, Berlin, Heidelberg, 2013, pp. 483–498. http://dx.doi.org/10.1007/978-3-642-40994-3_31. ISBN 978-3-642-40993-6.
[18] J.R. Williams, J.P. Bagrow, C.M. Danforth, P.S. Dodds, Text mixing shapes the anatomy of rank-frequency distributions: a modern Zipfian mechanics for natural language, Phys. Rev. E 91 (5) (2015) 052811.
[19] Z. Chu, S. Gianvecchio, H. Wang, S. Jajodia, Detecting automation of Twitter accounts: are you a human, bot, or cyborg?, IEEE Trans. Depend. Secure Comput. 9 (6) (2012) 811–824.
[20] J.P. Dickerson, V. Kagan, H. Wang, V.S. Subrahmanian, Using sentiment to detect bots on Twitter: are humans more opinionated than bots?, in: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2014, pp. 620–627.
[21] G. Brown, T. Howe, M. Ihbe, A. Prakash, K. Borders, Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work (CSCW'08), ACM, New York, NY, USA, 2008, pp. 403–412. http://dx.doi.org/10.1145/1460563.1460628. ISBN 978-1-60558-007-4.
[22] C. Wagner, S. Mitter, M. Strohmaier, C. Körner, When social bots attack: modeling susceptibility of users in online social networks, Making Sense of Microposts (#MSM2012) (2012) 2.
[23] K. Lee, J. Caverlee, S. Webb, Proceedings of the 19th International Conference on World Wide Web (WWW'10), ACM, New York, NY, USA, 2010, pp. 1139–1140. http://dx.doi.org/10.1145/1772690.1772843. ISBN 978-1-60558-799-8.
[24] K. Lee, J. Caverlee, S. Webb, Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'10), ACM, New York, NY, USA, 2010, pp. 435–442. http://dx.doi.org/10.1145/1835449.1835522. ISBN 978-1-4503-0153-4.
[25] D.A. Broniatowski, M.J. Paul, M. Dredze, PLOS ONE 8 (2013) e83672.
[26] J. Huang, R. Kornfield, G. Szczypka, S.L. Emery, Tobacco Control 23 (2014) iii26.

Eric M. Clark is an applied mathematician who is interested in implementing mathematical theory to solve real-world, interdisciplinary problems. His interests include (but are not limited to) Network Theory, Social Contagion, Computational Linguistics, Natural Language Processing, Machine Learning, Evolutionary Algorithms, and Complex Systems. Currently, he is working with the Department of Surgery at UVM to investigate health trends on Twitter to make socio-geographical comparisons of different treatment regimens and how sentiments surrounding such health disparities are changing over time.

Jake Ryland Williams: Upon receiving his Ph.D. in mathematical sciences at the University of Vermont (UVM) in spring 2015, he accepted a postdoctoral fellowship at the University of California Berkeley with the School of Information in the Master of Information and Data Science Program, where he will teach coursework on machine learning and continue to explore his research interests at the intersection of mathematics, physics, and linguistics. While at UVM in the Department of Mathematics and Statistics and the Vermont Complex Systems Center, his thesis and focus of research was centered on mathematical linguistics and computational social science.

Chris Jones is a Health Economist, Assistant Professor and Director of the Global Health Economics Unit of the Center for Clinical and Translational Science, University of Vermont (UVM) College of Medicine. Prior to joining UVM, Dr. Jones worked in industry, as research faculty at the Johns Hopkins Bloomberg School of Public Health and as a government health economist in the National Institute for Health and Clinical Excellence (NICE) Collaborating Centre on Mental Health in London, where he contributed to six U.K. national guidelines for the National Guideline Development Group, including the Guidelines for Preventing Physical Inactivity, gaining considerable expertise with health economic evaluation and health economic appraisal. Dr. Jones' work focuses on time-criticality, the cost-effectiveness of incentive-based treatments and patient centeredness. He has published novel statistical methodologies for predicting cost and personalizing treatment pathways. He currently serves as an elected member of the New England Comparative Effectiveness Public Advisory Council and contributes to Vermont's State Blueprint for Health Analytic Working Group.

Richard A. Galbraith is a Professor in the Department of Medicine and also serves as the Director of the University of Vermont's Center for Clinical and Translational Science. The Center is dedicated to the concept of applying interdisciplinary research to translational science both from the bench to the bedside, from the bedside to the community, and from the community to health policy. He received his MD degree from King's College University, London, England, where he completed an internship and residency in Internal Medicine. He relocated to the United States and completed a fellowship in Endocrinology, Metabolism and Nutrition and earned an interdisciplinary PhD in Molecular and Cellular Physiology from the Medical University of South Carolina. He spent 12 years at the Rockefeller University Hospital in Manhattan, New York, where he directed the Rockefeller University Hospital and its attendant translational research programs and was Attending Physician at the New York Hospital. He also served as the Medical Director and Hospital Administrator for six years.

Chris Danforth is the Flint Professor of Mathematical, Natural, and Technical Sciences at the University of Vermont. He co-directs the Computational Story Lab, a group of applied mathematicians at the undergraduate, masters, PhD, and postdoctoral level working on large-scale, system problems in many fields including sociology, nonlinear dynamics, networks, ecology, and physics. His research has been covered by the New York Times, Science Magazine, and the BBC among others. Descriptions of his projects are available at his website: http://uvm.edu/~cdanfort

Peter Sheridan Dodds is a Professor at the University of Vermont (UVM) working on system-level problems in many fields, ranging from sociology to physics. He is Director of UVM's Complex Systems Center, co-Director of UVM's Computational Story Lab, a visiting faculty fellow at the Vermont Advanced Computing Core, and is appointed to the Department of Mathematics and Statistics. He maintains general research and teaching interests in complex systems and networks with a current focus on sociotechnical and psychological phenomena including collective emotional states, contagion, language, and stories. His methods encompass large-scale data collection and analysis, large-scale sociotechnical experiments, and the formulation, analysis, and simulation of theoretical models. Dodds's training is in theoretical physics, mathematics, and electrical engineering with extensive formal postdoctoral and research experience in the social sciences. Dodds's foundational funding was an NSF CAREER grant awarded by the Social and Economic Sciences Directorate.

