Page 1

Drug term discovery through social media using natural language processing

Presenters: Dr. Claudia Brugman, Dr. Nikki Adams

Collaborators: Dr. Thomas Conners, Sean Simpson, M.A., Adam Liter, M.A.

Center for Advanced Study of Language, in collaboration with the Center for Substance Abuse Research

November 15, 2017

Page 2

Initial Challenge

• One of CESAR (NDEWS)’s many research efforts explores the question “Can social media communications contribute to the early warning system?”

• CESAR (NDEWS) funded exploratory work to use the Twitter corpus already being collected to investigate this question

• Can CASL linguists bring knowledge of language and natural language processing to bear on the language of drug use?

Page 3

What is CASL?

• The Center for Advanced Study of Language is the only social-science University Affiliated Research Center (UARC) in the country. UARCs are intended to serve the needs of the U.S. Government.

• CASL’s ambit of government service is directed at the analysis of language materials for under-resourced strategically important languages, as well as improving and assessing language and other education and training programs for the U.S. Government.

• More recently, CASL has expanded its portfolio to partner with industry and other institutions that don’t necessarily have U.S. government needs in their focus.

Page 4

Previous research on social media and public health

• Topics of investigation:

  • Epidemiology, for example:
    • Twitter Catches the Flu (Aramaki et al 2011); Predicting Flu Trends (Achrekar et al 2011)
    • Psychological Language on Twitter Predicts County-Level Heart Disease Mortality (Eichstaedt et al 2015)

  • Pharmacovigilance, particularly with respect to adverse drug reactions, for example:
    • Pharmacovigilance from social media: … (Nikfarjam et al 2015)

  • Toxicovigilance, for example:
    • Social Media Mining for Toxicovigilance (prescription medication abuse) (Sarker et al 2016)
    • Health Department Use of Social Media to Identify Foodborne Illness (Harris et al 2014)

Page 5

Previous research on social media and public health

• Corpora for investigation can be
  • General: Twitter, Instagram
  • Specific: DailyStrength (support groups), Yelp

• Corpora can be processed in a number of ways (or not)
  • Filtered by keyword
  • Divided by subforum
  • Stemmed (reducing number of forms)
  • Part-of-speech tagged (increasing available number of forms to be analyzed, if so desired)

Page 6

Previous research on social media and public health

• The methods of investigation can be

  • Entirely manual, e.g. search for keywords and have humans examine all or a portion of the texts, for example:
    • Using Social Listening Data to Monitor Misuse and Nonmedical Use of Bupropion (Anderson et al 2017)

  • Via supervised machine learning, with or without other semi-automatic methods
    • Supervised machine learning, which requires some human annotation, such as with Twitter Catches the Flu
    • Foodborne Chicago work used a trained classifier and then sent potential hits to humans for intervention work

  • Unsupervised
    • Word clouds (an output of topic modeling) can be achieved through unsupervised learning
    • The method CASL employs in this study is also unsupervised

Page 7

Social media for discovering new terminology?

• As far as we are aware, we are the only ones

Page 8

Using Twitter as an early warning indicator: Big Data for epidemiological research

• NDEWS researchers had been doing human analyses of social media

• Big Data NLP analyses (initiated in the I-School at UMD) took an important first step in taking advantage of Twitter’s enormous volume and its metadata (particularly geographic origin of the tweet).

• Filtering by keyword is an acknowledged, but mostly accepted, limitation of public health social media research

• In particular, previous analyses on social media communications for epidemiological research typically used standard or clinical terms as keywords

• Published lists may be several years old

Page 9

Using Twitter as an early warning indicator: Big Data for variationist language research

• However, previous linguistic research (including CASL’s) on social media showed conclusively that people don’t use language the same way on these media as in other written sources (e.g., Gouws et al. 2011; Verheijen, 2015; Brugman & Conners, submitted).

• General linguistic knowledge tells us that three things go together: taboo substances or practices; social groups associated with those practices; and special vocabulary used by members of the group.

• The instant broad reach of social media could only speed up the rise and fall of emergent terms.

• CASL asked the question: if we find novel and emergent terms, can we provide a stronger empirical base to do the communication analyses for early warning?

Page 10

The intersection of two problem spaces: language use and first response

• CASL researchers hoped to show that big data methods can not only analyze a huge amount of data but can reveal terms not known to drug researchers (a step beyond most big-data analysis efforts within public health research)

• This could improve accuracy of drug identification for first responders, epidemiologists, ethnographers.

Page 11

Research Questions for CASL’s Pilot Study

• Is Twitter a valid medium in which to explore this?

• What kinds of automated (computer) models can accurately discover new drug reference terms?

• Discovering non-standard terms for recreational drugs
  • New terms for known substances (neologism)
  • Existing terms for known substances (semantic shift)
  • New terms for new substances (neologism)
  • Existing terms for new substances (semantic broadening or shift)

• Both kinds of semantic broadening or shift result in lexical ambiguity

Page 12

Method: using big data to classify words according to their context

• Modeling linguistic environments: the company a word keeps tells you something about the word
  • The molly I took last night was terrible (unambiguous: MDMA)
  • Did you watch Molly dance? (unambiguous: person)
  • Did you take molly/Molly last night? (ambiguous)

• State-of-the-art unsupervised NLP methods, using vector space models (word and bigram embeddings)

• Cosine similarity is the indicator of the relevant aspect of meaning similarity
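
As a minimal sketch of how context and cosine similarity fit together, the following builds simple context-count vectors from a toy corpus. This is an illustration only: the sentences, the `context_vector` helper, and the window size are assumptions, and count vectors are a simplified stand-in for the trained word and bigram embeddings used in the study.

```python
import math
from collections import Counter

def context_vector(target, sentences, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy sentences (invented for illustration, not taken from the Twitter corpus).
toy_corpus = [
    "i took molly last night",
    "i took ecstasy last night",
    "did you watch sarah dance",
    "did you watch molly dance",
]

molly = context_vector("molly", toy_corpus)
ecstasy = context_vector("ecstasy", toy_corpus)
sarah = context_vector("sarah", toy_corpus)

print(cosine(molly, ecstasy), cosine(molly, sarah), cosine(ecstasy, sarah))
```

In this toy setting "molly" shares contexts with both "ecstasy" and "sarah", while "ecstasy" and "sarah" share none, which is the ambiguity described on this slide in miniature.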

Page 13

Page 14

Vector space models

• “Spatial” relationships encode semantic relationships!

• Allows a sort of semantic vector algebra:

V_king – V_queen = V_man – V_woman

V_x – V_queen = V_man – V_woman

V_x = V_queen – V_woman + V_man

• Analogical reasoning tests can be used as a rough approximation for model accuracy
  • woman : man :: queen : ?? should return “king”

• For our purposes, we developed a small set of drug-related analogy diagnostics
  • Alcohol : drink :: marijuana : ?? should return “smoke”
  • Alcohol : drunk :: marijuana : ?? should return “stoned” or “high”
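
The analogy test can be sketched with a few hand-picked vectors. The two-dimensional embeddings below are hypothetical values chosen so the arithmetic is easy to follow; a trained model would have hundreds of dimensions and learned values, and the `analogy` helper is an illustrative name rather than part of any library.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand for illustration only:
# the first axis loosely encodes "royalty", the second "gender".
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ?? by finding the word nearest to V_c - V_a + V_b."""
    target = tuple(
        vc - va + vb for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# woman : man :: queen : ??
answer = analogy("woman", "man", "queen", vectors)
print(answer)
```

The query words themselves are excluded from the candidates, a common convention in analogy tests, so that the nearest neighbor is not simply one of the inputs.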

Page 15

Challenges

• People talk about everything in their lives on Twitter; the data are very noisy.

• Some words with multiple meanings, though relevant, are hard to analyze as relevant (e.g. weed; Molly; candy). We called this “the candy problem”.

• Not all substances are discussed at the same level of frequency on Twitter.

• The computational models being used are focused on individual terms, not individual meanings (therefore the candy problem is huge).

• The computational models do not show a new drug term immediately; it has to reach a threshold of relevancy before it emerges.

• This method does not replace human content analysis, but rather reduces the swath of data required for human analysis.

Page 16

Method overview

Twitter data extracted/filtered

Twitter data subset into 3 test corpora

Vector-space models trained on each subset

Models queried for synonyms to target terms based on cosine similarity

Results compared with expert-generated synonym lists

Page 17

Data: cleaning and filtering

Non-English tweets removed

Tweets from outside continental USA removed

“At” mentions (e.g. “@MyName”) and URL links removed

No retweeted content included in final dataset
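
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the dictionary keys `lang`, `in_continental_us`, and `is_retweet` are hypothetical stand-ins for whatever metadata the collected tweets actually carried, and the regular expressions are simple approximations.

```python
import re

# Simple approximations of URL links and "at" mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_text(text):
    """Strip URL links and @mentions, then normalize whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return " ".join(text.split())

def keep_tweet(tweet):
    """Keep English, continental-US, non-retweet content.

    The keys used here are hypothetical field names for illustration.
    """
    return (
        tweet["lang"] == "en"
        and tweet["in_continental_us"]
        and not tweet["is_retweet"]
    )

def clean_corpus(tweets):
    return [clean_text(t["text"]) for t in tweets if keep_tweet(t)]

# Invented example tweets.
tweets = [
    {"text": "@MyName that molly was terrible https://t.co/abc123",
     "lang": "en", "in_continental_us": True, "is_retweet": False},
    {"text": "RT @Other: some retweeted content",
     "lang": "en", "in_continental_us": True, "is_retweet": True},
    {"text": "hola @amigo",
     "lang": "es", "in_continental_us": True, "is_retweet": False},
]

cleaned = clean_corpus(tweets)
print(cleaned)
```

URLs are removed before mentions so that an "@" inside a link is not treated as a mention.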

Page 18

Method overview

Twitter data extracted/filtered

Twitter data subset into 3 test corpora

Vector-space models trained on each subset

Models queried for synonyms to target terms based on cosine similarity

Results compared with expert-generated synonym lists

Page 19

Data: summary of subset corpora

1 week
• last week of Dec 2015
• 17.6 M tweets
• 178.3 M tokens

1 month
• Dec 2015
• 74.9 M tweets
• 784.1 M tokens

3 months
• Dec 2015 – Feb 2016
• 217.9 M tweets
• 2.3 B tokens

Page 20

Model Accuracy

[Figure: model accuracy for the 1 week, 1 month, and 3 months corpora]

Page 21

Method overview

Twitter data extracted/filtered

Twitter data subset into 3 test corpora

Vector-space models trained on each subset, tested for accuracy

Models queried for synonyms to target terms based on cosine similarity

Top 50 most similar terms from model compared with expert-generated synonym lists
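
The final comparison step can be sketched as a set comparison between the model's ranked neighbors and the expert list. The function name, the toy ranked list, and the toy expert list below are assumptions made for illustration; in the study the ranked neighbors would come from a trained vector-space model queried by cosine similarity.

```python
def compare_with_expert_list(ranked_neighbors, expert_terms, query_terms, top_k=50):
    """Split the top-k neighbors into expert-listed and unlisted terms.

    Returns (recovered, novel, missed):
      recovered: expert terms that the model returned
      novel: returned terms absent from the expert list (need human review)
      missed: expert terms the model did not return
    """
    top = [t.lower() for t in ranked_neighbors[:top_k]]
    experts = {t.lower() for t in expert_terms}
    queries = {t.lower() for t in query_terms}
    recovered = [t for t in top if t in experts and t not in queries]
    novel = [t for t in top if t not in experts and t not in queries]
    missed = sorted(experts - set(top) - queries)
    return recovered, novel, missed

# Toy inputs, invented for illustration (not actual model output).
ranked = ["kush", "reefer", "ganja", "pizza", "dro"]
expert_list = ["weed", "kush", "dro", "flame", "ganja"]
queries = ["weed", "ganja"]

recovered, novel, missed = compare_with_expert_list(ranked, expert_list, queries)
print(recovered, novel, missed)
```

The `novel` list is only a set of candidates: a term such as "pizza" in the toy output shows why human content analysis is still needed after the model runs.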

Page 22

Drugs of focus: MDMA and marijuana

• For pilot, we wanted drugs with high likelihood of mention on social media

• In consultation with CESAR (NDEWS), MDMA was chosen as initial drug of focus.

• List of potential current slang terms for MDMA acquired from CESAR (NDEWS) field experts

• 2 mainstream/frequent terms selected as target query terms (“ecstasy”, “molly”).

• Analysis was considered successful if:
  • A) given target terms, model is able to discover other (less frequent/mainstream) terms provided by CESAR (NDEWS)
  • B) given target terms, model is able to discover other terms NOT provided by CESAR (NDEWS)

Page 23

MDMA RESULTS

• 38 possible terms for MDMA supplied by CESAR (NDEWS)

• 19/38 terms did not occur at all in the corpus.

• 16/38 terms occurred in the corpus but were NOT selected by the model
  • Pulled tweets in which these terms appeared
  • Spot checked (up to) 200 tweets containing each term, where possible
  • 11 of these 16 revealed no references to MDMA in the 200 instances checked: Candy, Cloud-nine/9, Dominoes, Fantasia, MDM, Moonrocks, Snowball, Smarties, Speed, Stp, Superman
  • 5 of these 16 revealed a few references, but were overwhelmingly non-MDMA related (LOW incidence): Skittles (1/200) 0.5%, Thizz (5/60) 8%, Xtc (3/156) 2%, X (15/200) 7.5%, E (8/200) 4%

• 3/38 terms occurred in the corpus and were selected as candidates by the model
  • These 3 all referenced MDMA at least 35% of the time (HIGH incidence): Ecstasy (134/200) 67%, Molly (74/200) 37%, MDMA (180/200) 90%

Page 24

MDMA terms supplied by CESAR (NDEWS) which occurred within the corpus

Returned by Model as Potential Candidates

TERM            RELEVANCY (raw)   RELEVANCY (percent)
MDMA            180/200           90.0%
ecstasy         134/200           67.0%
molly           74/200            37.0%

Not Returned

TERM            RELEVANCY (raw)   RELEVANCY (percent)
thizz           5/60              8.3%
X               15/200            7.5%
E               8/200             4.0%
XTC             3/156             1.9%
skittles        1/200             0.5%
candy           0/200             0.0%
MDM             0/128             0.0%
speed           0/200             0.0%
cloud-nine/9    0/200             0.0%
moonrocks       0/154             0.0%
STP             0/200             0.0%
dominoes        0/200             0.0%
snowball        0/200             0.0%
superman        0/200             0.0%
fantasia        0/200             0.0%
smarties        0/185             0.0%

Page 25

Page 26

Results for MDMA show limitation of the model

• A frequency of ~30% or more relevant uses over all instances can be revealed by the vector space model

• Any term with a drug meaning below that will not be revealed
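
The relevancy figures behind this limitation are simple proportions, and the ~30% cut-off can be checked directly against the spot-check counts reported for MDMA. The sketch below uses those reported counts; the 0.30 threshold is the approximate value stated on this slide, and the variable names are illustrative.

```python
# Spot-check counts reported for MDMA terms: (MDMA-related tweets, tweets checked).
spot_checks = {
    "MDMA": (180, 200),
    "ecstasy": (134, 200),
    "molly": (74, 200),
    "thizz": (5, 60),
    "X": (15, 200),
    "E": (8, 200),
    "XTC": (3, 156),
    "skittles": (1, 200),
}

# Approximate cut-off stated on the slide (~30% relevant uses).
THRESHOLD = 0.30

def relevancy(hits, checked):
    """Share of spot-checked tweets in which the term had the drug meaning."""
    return hits / checked

likely_visible = [
    term for term, (hits, checked) in spot_checks.items()
    if relevancy(hits, checked) >= THRESHOLD
]

for term, (hits, checked) in spot_checks.items():
    print(f"{term:10s} {hits}/{checked}  {100 * relevancy(hits, checked):.1f}%")
print("At or above threshold:", likely_visible)
```

Only the three terms that the model actually returned clear the threshold, which matches the pattern described in the results.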

Page 27

Marijuana

• Received list of 28 potential slang terms from experts at 3 of CESAR (NDEWS)’s “sentinel” sites

• Selected 3 frequent/mainstream terms to use as query targets:
  • “marijuana”, “weed”, “ganja”

• 17/28 supplied terms occurred in the corpus but were NOT selected by the model as potential candidates

• 11/28 supplied terms occurred in corpus and WERE selected by model

Page 28

Returned by Model as Potential Candidates

TERM                RELEVANCY (raw)   RELEVANCY (percent)
kush                192/200           96.0%
loud pack           191/200           95.5%
ganja               190/200           95.0%
weed                186/200           93.0%
edibles             180/200           90.0%
sativa              176/200           88.0%
devil's lettuce     145/169           85.8%
dro                 136/200           68.0%
sour diesel         127/135           63.5%
purp                75/200            37.5%
pot                 64/200            32.0%

Not Returned

TERM                RELEVANCY (raw)   RELEVANCY (percent)
nug                 55/200            27.5%
shatter             44/200            22.0%
wax                 42/200            21.0%
chronic             35/200            17.5%
pineapple express   29/200            14.5%
mary jane           24/200            12.0%
skunk               24/200            12.0%
bud                 20/200            10.0%
haze                13/200            6.5%
exotics             2/134             1.5%
flower              1/200             0.5%
green               1/200             0.5%
mud                 1/200             0.5%
shard               2/124             0.2%
blue cheese         0/200             0.0%
fire                0/200             0.0%
flame               0/200             0.0%

Page 29

However, the model returned terms that were not supplied by CESAR (NDEWS) experts:

• SUBSTANCE TERMS
  • reefer / reef / reefa
  • piff
  • tooka / tookah
  • ganj
  • dodi
  • doja
  • thrax / thraxx
  • pacc
  • mids
  • gas
  • oregano
  • cannabis

• PARAPHERNALIA
  • blunt(s) / blizzy
  • doob / doobie(s)
  • dutch / dutchie / dutches
  • spliff(s)
  • gar(s)
  • rellos / rillos
  • swisher(s) / swisher sweets
  • backwood(s)
  • bong(s)
  • grabba
  • fronto

• SYNTHETIC MARIJUANA
  • incense
  • k2

Page 30

Take-aways

• Given target terms, is model able to predict other (less frequent/mainstream) terms provided by CESAR (NDEWS)?
  • YES (provided those less frequent terms have drug-relevancy of > 30%)

• Given target terms, is model able to predict other terms for target substances NOT provided by CESAR (NDEWS)?
  • YES (terms with high relevancy)

• Can Twitter be an effective medium for this sort of analysis?
  • YES, for certain drugs. (Heroin and methamphetamine were not successfully modeled.)

Page 31

Limitations of the method

• Twitter is too noisy a corpus for this model to work on specialized subject matter.

• Therefore, the candy problem remains largely unsolved.

• Terms for other substances (e.g. opioids) have not shown stunning success.

• Potential advantages of using Twitter’s huge volume for regional terminological trends have obstacles, in particular with data scarcity. However, there are people developing ways of addressing this.

• Because the model only accepts and returns words or phrases, the actual meaning is not provided: the terms serve as input for human analysis.

Page 32

Successes to date: proof of concept

• CASL researchers used a vector space model to discover terms for substances and paraphernalia that were previously “unknown”. *

• The research team has demonstrated that as little as one month’s worth of Twitter data can provide revealing outcomes, thus shortening the time to discovery of new terms.

• Discovered terms include pacc, a spelling variation specific to a gang, and jazz cabbage, a term that seems to have arisen very recently and which didn’t exist in Twitter one month, but appeared as a marijuana synonym the next.

• Terms for marijuana and related concepts (strains, ingestion processes, paraphernalia) have the best success. Using one slang term as a query for new terms is much more successful than using standard or clinical terms.

Page 33

Ongoing and future work

• Additional methods should be tested to improve results for
  • Low-frequency/relevancy items, by tracking their emergence over time
  • Words like weed, Molly, skittles, to improve the chance that relevant uses will be found

• Additional social media platforms can be explored, including
  • Instagram
  • Specialized subreddits
  • Discussion forums

• This general method can be used to discover any term that is being used as a “code word”, given enough linguistic material

Page 34

Broadening the corpora used

• Towards the end of this initial study, CASL began looking at Reddit

• The hypothesis was that Reddit subforums could provide more targeted, domain-specific text, without the issues brought up by filtering by keyword

• Initial indications seemed to confirm this, with less data from Reddit providing results for marijuana comparable to, and in some respects better than, those from the Twitter data

• Reddit also held more promise for investigations into opioid terminology (see, for example, NYT 7/20/17 piece ‘On Reddit, Intimate Glimpses of Addicts in Thrall to Opioids’)

Page 35

Discussion of current work

• NLP method stays primarily the same

• Not yet pre-processing further, but considering it

• Twitter vs. Reddit (possibly vs. Twitter filtered by keyword)

Page 36

Reddit, marijuana, and opioids

• Have collected marijuana, opioid, and a few other select drug-related subforums from Reddit

• Comparing 3 month slices from Reddit, unfiltered Twitter, and now Twitter filtered by keyword

• Reddit shows targeted synonyms, a reduction in noisiness, without filtering

Page 37

Unfiltered Twitter vs. Reddit

Most similar terms to “30s” by model

Twitter         Reddit
20s             10s
40s             blues
twenties        20s
thirties        roxys
50s             roxies
mid 20s         oxys
mid 30s         15s
forties         40s
60s             opanas
mid twenties    5s
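
One simple way to quantify how differently the two corpora treat the same query is to compare the two neighbor lists directly. The sketch below uses the lists shown on this slide; measuring the overlap with a Jaccard score is an illustrative choice, not something the presentation reports.

```python
# Top-10 neighbors of "30s" in each model, as listed on the slide.
twitter_neighbors = [
    "20s", "40s", "twenties", "thirties", "50s",
    "mid 20s", "mid 30s", "forties", "60s", "mid twenties",
]
reddit_neighbors = [
    "10s", "blues", "20s", "roxys", "roxies",
    "oxys", "15s", "40s", "opanas", "5s",
]

twitter_set, reddit_set = set(twitter_neighbors), set(reddit_neighbors)

# Terms returned by both models, in Twitter rank order.
shared = [t for t in twitter_neighbors if t in reddit_set]

# Terms that only the Reddit model returned: candidates for human review.
reddit_only = [t for t in reddit_neighbors if t not in twitter_set]

# Jaccard overlap: shared terms divided by all distinct terms.
jaccard = len(shared) / len(twitter_set | reddit_set)

print(shared, reddit_only, round(jaccard, 3))
```

The low overlap reflects that the two models place the same query term in largely different neighborhoods.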

Page 38

Thank you for your attention!

Contact: Nikki Adams [email protected]

University of Maryland Center for Advanced Study of Language

Page 39

References

• Achrekar, H., Gandhe, A., Lazarus, R., Yu, S. H., & Liu, B. (2011, April). Predicting flu trends using twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on (pp. 702-707). IEEE.

• Anderson, L. S., Bell, H. G., Gilbert, M., Davidson, J. E., Winter, C., Barratt, M. J., ... & Dasgupta, N. (2017). Using Social Listening Data to Monitor Misuse and Nonmedical Use of Bupropion: A Content Analysis. JMIR Public Health and Surveillance, 3(1).

• Aramaki, E., Maskawa, S., & Morita, M. (2011, July). Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the conference on empirical methods in natural language processing (pp. 1568-1576). Association for Computational Linguistics.

• Brugman, C., & Conners, T. ms. Register Properties of SMS and Twitter in Indonesian: A contrastive study. Submitted to Digital Scholarship in the Humanities.

Page 40

References continued

• Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., ... & Weeg, C. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological science, 26(2), 159-169.

• Gouws, S., Metzler, D., Cai, C., & Hovy, E. (2011). Contextual bearing on linguistic variation in social media. Proceedings of the Workshop on Language in Social Media (LSM 2011), Portland, Oregon, 23 June 2011 (pp. 20-29).

• Harris, J. K., Mansour, R., Choucair, B., Olson, J., Nissen, C., Bhatt, J., & Centers for Disease Control and Prevention. (2014). Health department use of social media to identify foodborne illness-Chicago, Illinois, 2013-2014. MMWR Morb Mortal Wkly Rep, 63(32), 681-685.

• Nikfarjam, A., Sarker, A., O’Connor, K., Ginn, R., & Gonzalez, G. (2015). Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3), 671-681.

Page 41

References continued

• Sarker, A., O’Connor, K., Ginn, R., Scotch, M., Smith, K., Malone, D., & Gonzalez, G. (2016). Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug safety, 39(3), 231-240.

• Simpson, S., Adams, N., Brugman, C., & Conners, T. In press. Natural Language Processing for up-to-date Drug Term Innovations. Journal of Medical Internet Research: Public health and surveillance.

• Verheijen, L. 2016. Linguistic Characteristics of Dutch CMC. Proceedings of the 4th conference on CMC and Social Media corpora for the Humanities, Ljubljana.

Page 42

Image Credits

Slide 5: Beta Ceti observed by the Chandra telescope, by NASA
Slide 7: 44 Bootis_System, by Tyrogthekreeper
Slide 10: Fireworks, by Mattbuck
Slide 26: Astro 4D stars proper radial obafgkm b 7mag big (edited), by Alexander Meleg
Slide 36: Arang Kel Sky at Night (edited), by Jahanzeb Ahsan
Slide 39: City lights blend into the starry sky (edited), by Evonneyu

Creative Commons licenses: (CC BY-SA 3.0), (CC BY-SA 4.0)