Drug term discovery through social media using natural language processing

Presenters: Dr. Claudia Brugman, Dr. Nikki Adams

Collaborators: Dr. Thomas Conners, Sean Simpson, M.A., Adam Liter, M.A.

Center for Advanced Study of Language

In collaboration with the Center for Substance Abuse Research

November 15, 2017

Initial Challenge

• One of CESAR’s (NDEWS) many research efforts explores the question “Can social media communications contribute to the early warning system?”

• CESAR (NDEWS) funded exploratory work to use the Twitter corpus already being collected to investigate this question

• Can CASL linguists bring knowledge of language and natural language processing to bear on the language of drug use?

What is CASL?

• The Center for Advanced Study of Language is the only social-science University Affiliated Research Center (UARC) in the country. UARCs are intended to serve the needs of the U.S. Government.

• CASL’s ambit of government service is directed at the analysis of language materials for under-resourced strategically important languages, as well as improving and assessing language and other education and training programs for the U.S. Government.

• More recently, CASL has expanded its portfolio to partner with industry and other institutions that don’t necessarily have U.S. government needs in their focus.

Previous research on social media and public health

• Topics of investigation:
  • Epidemiology, for example:
    • Twitter Catches the Flu (Aramaki et al 2011); Predicting Flu Trends (Achrekar et al 2011)
    • Psychological Language on Twitter Predicts County-Level Heart Disease Mortality (Eichstaedt et al 2015)
  • Pharmacovigilance, particularly with respect to adverse drug reactions, for example:
    • Pharmacovigilance from social media: … (Nikfarjam et al 2015)
  • Toxicovigilance, for example:
    • Social Media Mining for Toxicovigilance (prescription medication abuse) (Sarker et al 2016)
    • Health Department Use of Social Media to Identify Foodborne Illness (Harris et al 2014)

Previous research on social media and public health

• Corpora for investigation can be
  • General: Twitter, Instagram
  • Specific: DailyStrength (support groups), Yelp
• Corpora can be processed in a number of ways (or not)
  • Filtered by keyword
  • Divided by subforum
  • Stemmed (reducing number of forms)
  • Part-of-speech tagged (increasing available number of forms to be analyzed, if so desired)

Previous research on social media and public health

• The methods of investigation can be
  • Entirely manual, e.g. search for keywords and have humans examine all or a portion of the texts, for example:
    • Using Social Listening Data to Monitor Misuse and Nonmedical Use of Bupropion (Anderson et al 2017)
  • Via machine learning, with or without other semi-automatic methods
    • Supervised machine learning, which requires some human annotation, such as with Twitter Catches the Flu
    • Foodborne Chicago work used a trained classifier and then sent potential hits to humans for intervention work
  • Unsupervised
    • Word clouds (an output of topic modeling) can be achieved through unsupervised learning
    • The method CASL employs in this study is also unsupervised
      • As far as we are aware, we are the only ones doing so

Social media for discovering new terminology?

Using Twitter as an early warning indicator: Big Data for epidemiological research

• NDEWS researchers had been doing human analyses of social media

• Big Data NLP analyses (initiated in the I-School at UMD) took an important first step in taking advantage of Twitter’s enormous volume and its metadata (particularly geographic origin of the tweet).

• Filtering by keyword is an acknowledged, but mostly accepted, limitation of public health social media research

• In particular, previous analyses on social media communications for epidemiological research typically used standard or clinical terms as keywords

• Published lists may be several years old

Using Twitter as an early warning indicator: Big Data for variationist language research

• However, previous linguistic research (including CASL’s) on social media showed conclusively that people don’t use language the same way on these media as in other written sources (e.g., Gouws et al. 2011; Verheijen, 2015; Brugman & Conners, submitted).

• General linguistic knowledge tells us that three things go together: taboo substances or practices; social groups associated with those practices; and special vocabulary used by members of the group.

• The instant broad reach of social media could only speed up the rise and fall of emergent terms.

• CASL asked the question: if we find novel and emergent terms, can we provide a stronger empirical base to do the communication analyses for early warning?

• CASL researchers hoped to show that big data methods can not only analyze a huge amount of data but can reveal terms not known to drug researchers (a step beyond most big-data analysis efforts within public health research)

• This could improve the accuracy of drug identification for first responders, epidemiologists, and ethnographers.

The intersection of two problem spaces: language use and first response

Research Questions for CASL’s Pilot Study

• Is Twitter a valid medium in which to explore this?
• What kinds of automated (computer) models can accurately discover new drug reference terms?
• Discovering non-standard terms for recreational drugs
  • New terms for known substances (neologism)
  • Existing terms for known substances (semantic shift)
  • New terms for new substances (neologism)
  • Existing terms for new substances (semantic broadening or shift)
• Both kinds of semantic broadening or shift result in lexical ambiguity

Method: using big data to classify words according to their context

• Modeling linguistic environments: the company a word keeps tells you something about the word
  • The molly I took last night was terrible (unambiguous: MDMA)
  • Did you watch Molly dance? (unambiguous: person)
  • Did you take molly/Molly last night? (ambiguous)
• State-of-the-art unsupervised NLP methods, using vector space models (word and bigram embeddings)
• Cosine similarity is the indicator of the relevant aspect of meaning similarity
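
The idea that context predicts meaning can be sketched in a few lines. The example below is only an illustration: it uses raw co-occurrence counts on an invented four-sentence corpus, not the word and bigram embeddings trained in the study, and the names `context_vector` and `cosine_similarity` are assumptions made for this sketch.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration (not taken from the study's Twitter data).
corpus = [
    "i took molly last night",
    "i took ecstasy last night",
    "i watched molly dance today",
    "i watched sarah dance today",
]

def context_vector(target, sentences, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

molly = context_vector("molly", corpus)
ecstasy = context_vector("ecstasy", corpus)
sarah = context_vector("sarah", corpus)

print(round(cosine_similarity(molly, ecstasy), 2))
print(round(cosine_similarity(molly, sarah), 2))
print(round(cosine_similarity(ecstasy, sarah), 2))
```

In this toy corpus "molly" is equally close to "ecstasy" and to "sarah", while "ecstasy" and "sarah" are far apart. A single vector per word form blends the drug sense and the name sense, which is the ambiguity the molly/Molly examples point to.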

Vector space models

• “Spatial” relationships encode semantic relationships!
• Allows a sort of semantic vector algebra:
    V_king - V_queen = V_man - V_woman
    V_x - V_queen = V_man - V_woman
    V_x = V_queen - V_woman + V_man
• Analogical reasoning tests can be used as a rough approximation for model accuracy
  • woman : man :: queen : ?? should return “king”
• For our purposes, we developed a small set of drug-related analogy diagnostics
  • Alcohol : drink :: marijuana : ?? should return “smoke”
  • Alcohol : drunk :: marijuana : ?? should return “stoned” or “high”
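
An analogy test of this kind can be sketched as follows. The vectors are hand-made, three-dimensional toys (hypothetical axes: royalty, male, female), not embeddings from the study, and `analogy` is a name chosen for this illustration. It solves a : b :: c : ?? by ranking the remaining vocabulary against V_c - V_a + V_b.

```python
import math

# Hand-made toy vectors (hypothetical). Axes: royalty, male, female.
VECTORS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors=VECTORS):
    """Solve a : b :: c : ?? by ranking words against V_c - V_a + V_b."""
    target = [vc - va + vb
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("woman", "man", "queen"))  # king
```

With a trained model, the same function could be pointed at the drug-related diagnostics, for example `analogy("alcohol", "drink", "marijuana")`, and the share of diagnostics answered as expected would give a rough accuracy score.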

Challenges

• People talk about everything in their lives on Twitter; the data are very, very noisy.
• Some words with multiple meanings, though relevant, are hard to analyze as relevant (e.g. weed; Molly; candy). We called this “the candy problem”.
• Not all substances are discussed at the same level of frequency on Twitter.
• The computational models being used are focused on individual terms, not individual meanings (therefore the candy problem is huge).
• The computational models do not show a new drug term immediately; it has to reach a threshold of relevancy before it emerges.
• This method does not replace human content analysis, but rather reduces the swath of data required for human analysis.

Method overview

• Twitter data extracted/filtered
• Twitter data subset into 3 test corpora
• Vector-space models trained on each subset
• Models queried for synonyms to target terms based on cosine similarity
• Results compared with expert-generated synonym lists

Data: cleaning and filtering

• Non-English tweets removed
• Tweets from outside continental USA removed
• “at” mentions (e.g. “@MyName”) and url links removed
• No retweeted content included in final dataset
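
These filtering steps could look roughly like the sketch below. The record fields (`text`, `lang`, `in_continental_us`, `is_retweet`) and the regular expressions are assumptions made for illustration; the slides do not describe the actual data schema or tooling.

```python
import re

# Simple patterns for illustration; real tweets may need more careful handling.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Remove url links and "at" mentions, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)      # strip URLs first so "@" inside a URL is not split
    text = MENTION_RE.sub("", text)
    return " ".join(text.split())

def filter_tweets(tweets):
    """Keep English, continental-US, non-retweet records and return cleaned text.

    Each tweet is a dict with hypothetical keys:
    'text', 'lang', 'in_continental_us', 'is_retweet'.
    """
    kept = []
    for tweet in tweets:
        if tweet["lang"] != "en":
            continue
        if not tweet["in_continental_us"]:
            continue
        if tweet["is_retweet"]:
            continue
        kept.append(clean_tweet(tweet["text"]))
    return kept

sample = [
    {"text": "@MyName check this https://t.co/abc out", "lang": "en",
     "in_continental_us": True, "is_retweet": False},
    {"text": "RT @Other hello", "lang": "en",
     "in_continental_us": True, "is_retweet": True},
    {"text": "hola a todos", "lang": "es",
     "in_continental_us": True, "is_retweet": False},
]
print(filter_tweets(sample))  # ['check this out']
```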

Data: summary of subset corpora

Corpus     Period                   Tweets     Tokens
1 week     last week of Dec 2015    17.6 M     178.3 M
1 month    Dec 2015                 74.9 M     784.1 M
3 months   Dec 2015 – Feb 2016      217.9 M    2.3 B

Model Accuracy

[Figure: model accuracy for the 1 week, 1 month, and 3 months corpora]

Method overview

• Vector-space models trained on each subset, tested for accuracy
• Top 50 most similar terms from model compared with expert-generated synonym lists
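
The comparison step can be sketched as a nearest-neighbor query followed by set operations. Everything concrete here is a toy assumption: the two-dimensional vectors, the two-item expert list, and `k=2` stand in for the trained embeddings, the CESAR (NDEWS) lists, and the top 50 used in the study.

```python
import math

# Toy two-dimensional vectors (hypothetical values chosen for illustration).
TOY_VECTORS = {
    "weed":   [0.90, 0.10],
    "kush":   [0.85, 0.20],
    "reefer": [0.80, 0.15],
    "pizza":  [0.10, 0.90],
    "flame":  [0.20, 0.80],
}

# Stand-in for an expert-generated synonym list.
EXPERT_TERMS = {"kush", "flame"}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(target, vectors, k):
    """Return the k words whose vectors are closest to the target's vector."""
    scores = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    scores.sort(reverse=True)
    return [word for _, word in scores[:k]]

neighbors = most_similar("weed", TOY_VECTORS, k=2)
confirmed = set(neighbors) & EXPERT_TERMS   # expert terms the model also found
novel = set(neighbors) - EXPERT_TERMS       # model candidates not on the expert list
missed = EXPERT_TERMS - set(neighbors)      # expert terms the model did not return

print(neighbors, confirmed, novel, missed)
```

The `novel` set is the interesting output for term discovery, but it is only a candidate list; each entry would still need human review.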

Drugs of focus: MDMA and marijuana

• For pilot, we wanted drugs with high likelihood of mention on social media
• In consultation with CESAR (NDEWS), MDMA was chosen as initial drug of focus.
  • List of potential current slang terms for MDMA acquired from CESAR (NDEWS) field experts
  • 2 mainstream/frequent terms selected as target query terms (“ecstasy”, “molly”).
• Analysis was considered successful if:
  • A) given target terms, model is able to discover other (less frequent/mainstream) terms provided by CESAR (NDEWS)
  • B) given target terms, model is able to discover other terms NOT provided by CESAR (NDEWS)

MDMA RESULTS

• 38 possible terms for MDMA supplied by CESAR (NDEWS)
• 19/38 terms did not occur at all in the corpus.
• 16/38 terms occurred in the corpus but were NOT selected by the model
  • Pulled tweets in which these terms appeared
  • Spot checked (up to) 200 tweets containing each term, where possible
  • 11 of these 16 revealed no references to MDMA in the 200 instances checked: Candy, Cloud-nine/9, Dominoes, Fantasia, MDM, Moonrocks, Snowball, Smarties, Speed, Stp, Superman
  • 5 of these 16 revealed a few references, but were overwhelmingly non-MDMA related (LOW incidence): Skittles (1/200) 0.5%, Thizz (5/60) 8%, Xtc (3/156) 2%, X (15/200) 7.5%, E (8/200) 4%
• 3/38 terms occurred in the corpus and were selected as candidates by the model
  • These 3 all referenced MDMA at least 35% of the time (HIGH incidence): Ecstasy (134/200) 67%, Molly (74/200) 37%, MDMA (180/200) 90%

MDMA terms supplied by CESAR (NDEWS) which occurred within the corpus

TERM            RELEVANCY (raw)   RELEVANCY (percent)

Returned by Model as Potential Candidates
MDMA            180/200           90.0%
ecstasy         134/200           67.0%
molly           74/200            37.0%

Not Returned
thizz           5/60              8.0%
X               15/200            7.5%
E               8/200             4.0%
XTC             3/156             1.9%
skittles        1/200             0.5%
candy           0/200             0.0%
MDM             0/128             0.0%
speed           0/200             0.0%
cloud-nine/9    0/200             0.0%
moonrocks       0/154             0.0%
STP             0/200             0.0%
dominoes        0/200             0.0%
snowball        0/200             0.0%
superman        0/200             0.0%
fantasia        0/200             0.0%
smarties        0/185             0.0%

Results for MDMA show limitation of the model

• A frequency of ~30% or more relevant uses over all instances can be revealed by the vector space model

• Any term with a drug meaning below that will not be revealed
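
Each relevancy percentage is the number of drug-relevant tweets divided by the number of tweets spot-checked. The sketch below recomputes it from the counts reported for the MDMA terms with at least one relevant hit, and applies the roughly 30% level as a cutoff. Treating 30% as a hard threshold is an assumption for illustration: the slides report it as an observed limit, not as a parameter of the model.

```python
# (relevant tweets, tweets spot-checked), as reported for the MDMA terms.
SPOT_CHECKS = {
    "MDMA": (180, 200),
    "ecstasy": (134, 200),
    "molly": (74, 200),
    "thizz": (5, 60),
    "X": (15, 200),
    "E": (8, 200),
    "XTC": (3, 156),
    "skittles": (1, 200),
}

# Approximate level observed in the slides; an assumption when used as a cutoff.
THRESHOLD = 0.30

def relevancy(relevant, checked):
    """Share of spot-checked tweets in which the term referred to the drug."""
    return relevant / checked

likely_returned = [term for term, (rel, n) in SPOT_CHECKS.items()
                   if relevancy(rel, n) >= THRESHOLD]

for term, (rel, n) in SPOT_CHECKS.items():
    print(f"{term:10s} {rel}/{n}  {relevancy(rel, n):.1%}")
print(likely_returned)  # ['MDMA', 'ecstasy', 'molly']
```

The three terms above the cutoff are the same three the model returned as candidates.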

Marijuana

• Received list of 28 potential slang terms from experts at 3 of CESAR (NDEWS)’s “sentinel” sites

• Selected 3 frequent/mainstream terms to use as query targets: “marijuana”, “weed”, “ganja”

• 17/28 supplied terms occurred in the corpus but were NOT selected by the model as potential candidates

• 11/28 supplied terms occurred in corpus and WERE selected by model

TERM                RELEVANCY (raw)   RELEVANCY (percent)

Returned by Model as Potential Candidates
kush                192/200           96.0%
loud pack           191/200           95.5%
ganja               190/200           95.0%
weed                186/200           93.0%
edibles             180/200           90.0%
sativa              176/200           88.0%
devil's lettuce     145/169           85.8%
dro                 136/200           68.0%
sour diesel         127/135           63.5%
purp                75/200            37.5%
pot                 64/200            32.0%

Not Returned
nug                 55/200            27.5%
shatter             44/200            22.0%
wax                 42/200            21.0%
chronic             35/200            17.5%
pineapple express   29/200            14.5%
mary jane           24/200            12.0%
skunk               24/200            12.0%
bud                 20/200            10.0%
haze                13/200            6.5%
exotics             2/134             1.5%
flower              1/200             0.5%
green               1/200             0.5%
mud                 1/200             0.5%
shard               2/124             0.2%
blue cheese         0/200             0.0%
fire                0/200             0.0%
flame               0/200             0.0%

However, the model returned terms that were not supplied by CESAR (NDEWS) experts:

SUBSTANCE TERMS
• reefer / reef / reefa
• piff
• tooka / tookah
• ganj
• dodi
• doja
• thrax / thraxx
• pacc
• mids
• gas
• oregano
• cannabis

PARAPHERNALIA
• blunt(s) / blizzy
• doob / doobie(s)
• dutch / dutchie / dutches
• spliff(s)
• gar(s)
• rellos / rillos
• swisher(s) / swisher sweets
• backwood(s)
• bong(s)
• grabba
• fronto

SYNTHETIC MARIJUANA
• incense
• k2

Take-aways

• Given target terms, is model able to predict other (less frequent/mainstream) terms provided by CESAR (NDEWS)?

• YES (provided those less frequent terms have drug-relevancy of > 30%)

• Given target terms, is model able to predict other terms for target substances NOT provided by CESAR (NDEWS)?

• YES (terms with high relevancy)

• Can Twitter be an effective medium for this sort of analysis?

• YES, for certain drugs. (Heroin and methamphetamine were not successfully modeled.)

Limitations of the method

• Twitter is too noisy a corpus for this model to work on specialized subject matter.

• Therefore, the candy problem remains largely unsolved.

• Terms for other substances (e.g. opioids) have not shown stunning success.

• Potential advantages of using Twitter’s huge volume for regional terminological trends have obstacles, in particular with data scarcity. However, there are people developing ways of addressing this.

• Because the model only accepts and returns words or phrases, the actual meaning is not provided: the terms serve as input for human analysis.

Successes to date: proof of concept

• CASL researchers used a vector space model to discover terms for substances and paraphernalia that were previously “unknown”. *

• The research team has demonstrated that as little as one month’s worth of Twitter data can provide revealing outcomes, thus shortening the time to discovery of new terms.

• Discovered terms include pacc, a spelling variation specific to a gang, and jazz cabbage, a term that seems to have arisen very recently and which didn’t exist in Twitter one month, but appeared as a marijuana synonym the next.

• Terms for marijuana and related concepts (strains, ingestion processes, paraphernalia) have the best success. Using one slang term as a query for new terms is much more successful than using standard or clinical terms.

Ongoing and future work

• Additional methods should be tested to improve results for
  • Low-frequency/relevancy items, by tracking their emergence over time
  • Words like weed, Molly, skittles, to improve the chance that relevant uses will be found

• Additional social media platforms can be explored, including
  • Instagram
  • Specialized subreddits
  • Discussion forums

• This general method can be used to discover any term that is being used as a “code word”, given enough linguistic material

Broadening the corpora used

• Towards the end of this initial study, CASL began looking at Reddit

• The hypothesis was that Reddit subforums could provide more targeted, domain-specific text, without the issues brought up by filtering by keyword

• Initial indications seemed to confirm this, with less data from Reddit providing results for marijuana comparable and in some respects better than the Twitter data

• Reddit also held more promise for investigations into opioid terminology (see, for example, NYT 7/20/17 piece ‘On Reddit, Intimate Glimpses of Addicts in Thrall to Opioids’)

• NLP method stays primarily the same
  • Not yet pre-processing further, but considering it
  • Twitter vs. Reddit (possibly vs. Twitter filtered by keyword)

Discussion of current work

Reddit, marijuana, and opioids

• Have collected marijuana, opioid, and a few other select drug-related subforums from Reddit

• Comparing 3 month slices from Reddit, unfiltered Twitter, and now Twitter filtered by keyword

• Reddit shows targeted synonyms and a reduction in noisiness, without filtering

Unfiltered Twitter vs. Reddit

Most similar terms to “30s” by model

Twitter         Reddit
20s             10s
40s             blues
twenties        20s
thirties        roxys
50s             roxies
mid 20s         oxys
mid 30s         15s
forties         40s
60s             opanas
mid twenties    5s
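
One simple way to read the two neighbor lists side by side is to separate the terms both models return from the terms only the Reddit model returns. The sketch below uses the lists exactly as shown; the variable names are illustrative.

```python
# Top neighbors of "30s" in each model, as listed in the comparison.
twitter = ["20s", "40s", "twenties", "thirties", "50s",
           "mid 20s", "mid 30s", "forties", "60s", "mid twenties"]
reddit = ["10s", "blues", "20s", "roxys", "roxies",
          "oxys", "15s", "40s", "opanas", "5s"]

twitter_set, reddit_set = set(twitter), set(reddit)

# Preserve the original ranking order within each result.
shared = [term for term in twitter if term in reddit_set]
reddit_only = [term for term in reddit if term not in twitter_set]

print(shared)       # ['20s', '40s']
print(reddit_only)  # ['10s', 'blues', 'roxys', 'roxies', 'oxys', '15s', 'opanas', '5s']
```

The Reddit-only terms are the candidates a human analyst would review next.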

Contact: Nikki Adams, [email protected]

University of Maryland Center for Advanced Study of Language

Thank you for your attention!

References

• Achrekar, H., Gandhe, A., Lazarus, R., Yu, S. H., & Liu, B. (2011, April). Predicting flu trends using twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on (pp. 702-707). IEEE.

• Anderson, L. S., Bell, H. G., Gilbert, M., Davidson, J. E., Winter, C., Barratt, M. J., ... & Dasgupta, N. (2017). Using Social Listening Data to Monitor Misuse and Nonmedical Use of Bupropion: A Content Analysis. JMIR Public Health and Surveillance, 3(1).

• Aramaki, E., Maskawa, S., & Morita, M. (2011, July). Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the conference on empirical methods in natural language processing (pp. 1568-1576). Association for Computational Linguistics.

• Brugman, C., & Conners, T. ms. Register Properties of SMS and Twitter in Indonesian: A contrastive study. Submitted to Digital Scholarship in the Humanities.

• Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., ... & Weeg, C. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological science, 26(2), 159-169.

• Gouws, S., Metzler, D., Cai, C., & Hovy, E. (2011, June). Contextual bearing on linguistic variation in social media. In Proceedings of the Workshop on Language in Social Media (LSM 2011) (pp. 20-29). Portland, Oregon.

• Harris, J. K., Mansour, R., Choucair, B., Olson, J., Nissen, C., Bhatt, J., & Centers for Disease Control and Prevention. (2014). Health department use of social media to identify foodborne illness-Chicago, Illinois, 2013-2014. MMWR Morb Mortal Wkly Rep, 63(32), 681-685.

• Nikfarjam, A., Sarker, A., O’Connor, K., Ginn, R., & Gonzalez, G. (2015). Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3), 671-681.

• Sarker, A., O’Connor, K., Ginn, R., Scotch, M., Smith, K., Malone, D., & Gonzalez, G. (2016). Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug safety, 39(3), 231-240.

• Simpson, S., Adams, N., Brugman, C., & Conners, T. In press. Natural Language Processing for up-to-date Drug Term Innovations. Journal of Medical Internet Research: Public health and surveillance.

• Verheijen, L. 2016. Linguistic Characteristics of Dutch CMC. Proceedings of the 4th conference on CMC and Social Media corpora for the Humanities, Ljubljana.

Image Credits

• Slide 5: Beta Ceti observed by the Chandra telescope, by NASA
• Slide 7: 44 Bootis_System, by Tyrogthekreeper
• Slide 10: Fireworks, by Mattbuck
• Slide 26: Astro 4D stars proper radial obafgkm b 7mag big (edited), by Alexander Meleg
• Slide 36: Arang Kel Sky at Night (edited), by Jahanzeb Ahsan
• Slide 39: City lights blend into the starry sky (edited), by Evonneyu

Creative Commons licenses: (CC BY-SA 3.0), (CC BY-SA 4.0)

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:Bceti_xray.jpg

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:44_Bootis_System.png

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:Beeston_MMB_19_Fireworks.jpg

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:Astro_4D_stars_proper_radial_obafgkm_b_7mag_big.png

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:Arang_Kel_Sky_at_Night.jpg

https://commons.wikimedia.org/wiki/Category:Stars#/media/File:City_lights_blend_into_the_starry_sky.JPG

https://creativecommons.org/licenses/by-sa/3.0/deed.en

https://creativecommons.org/licenses/by-sa/4.0/deed.en
