Text Mining with R: A brief overview

Cornelius Puschmann

This talk

1. From NLP to text mining

2. Building corpora

3. Latent semantic analysis (LSA)

4. Topic models/Latent Dirichlet allocation (LDA)

5. Sentiment analysis

6. Misc useful packages

The most intuitive procedures are not necessarily the best ones…

Word cloud of Barack Obama’s inaugural address

From NLP to text mining

(One) origin of text mining: early efforts to digitize (for example) religious texts. Above: Index Thomisticus on punch cards.

Father Roberto Busa pioneered language processing of religious texts (the works of Thomas Aquinas, indexed in the Index Thomisticus) in collaboration with IBM’s Thomas J. Watson, starting in the late 1940s.

Textual concordance
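
A textual concordance lists every occurrence of a word together with its immediate context. The following is a minimal keyword-in-context sketch in plain Python, not the tooling covered in this talk; the function name `kwic`, the `window` parameter, and the sample sentence are illustrative assumptions.

```python
import re

def kwic(text, keyword, window=3):
    """Return one (left context, keyword, right context) tuple per hit."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text, chosen only for illustration.
sample = "In the beginning was the word, and the word was with God."
concordance = kwic(sample, "word", window=2)
for left, kw, right in concordance:
    print(f"{left:>20} | {kw} | {right}")
```

Aligning the keyword in a fixed column, as the print loop does, is what makes usage patterns visible at a glance.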

Natural Language Processing (NLP)

• part-of-speech tagging = annotate for word class (noun, verb, adjective, …)

• parsing = annotate for parts of a sentence (subject, verb, object)

Text mining: more pragmatic than NLP, interested in

a) comparing texts to each other

b) tracking changes over time

c) getting information out of texts

d) measuring properties such as sentiment/polarity

Building corpora with tm

Preprocessing a corpus

• tokenize

• remove punctuation/numbers/whitespace

• remove stopwords

• stem

• build a document-term matrix (DTM) or term-document matrix (TDM)

tm can perform these and further steps!
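
The preprocessing steps named on this slide can be sketched without any library, which may help clarify what a package such as tm does internally. This is an illustration in plain Python, not tm's API: the stopword list, the suffix-stripping `crude_stem` (a stand-in for a real stemmer), and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Toy stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}

def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation and numbers
    tokens = text.split()                    # tokenize, collapsing whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

def document_term_matrix(docs):
    """Rows are documents, columns are terms, cells are term counts."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical two-document corpus.
docs = ["The banks are closing in 2015!", "A bank closed; the markets reacted."]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```

Transposing `dtm` gives the term-document matrix (TDM); which orientation is needed depends on the downstream method.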

Latent semantic analysis (lsa)

Latent semantic analysis (Deerwester et al., 1990)

• allows texts to be compared on multiple dimensions based on co-occurring terms

• frequently used to quantify (and visualize) document difference/similarity

• difference/similarity as measured by LSA is not necessarily topical, but is also affected by style, document length, etc.
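
To make the idea of comparing documents in a reduced space concrete, here is a small sketch under stated assumptions. LSA is normally computed with a truncated singular value decomposition; this example swaps in subspace iteration on the document-by-document Gram matrix, which spans the same top-k document subspace, so that it runs with the Python standard library alone. The toy matrix, its vocabulary, and all function names are hypothetical.

```python
import math
import random

def gram(rows):
    """Document-by-document inner products of a document-term matrix."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def orthonormalize(vectors):
    """Gram-Schmidt: turn the given vectors into an orthonormal set."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis

def apply(G, v):
    """Matrix-vector product G v."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def lsa_cosine(dtm, k=2, iters=200, seed=0):
    """Cosine similarity between documents in the rank-k latent space."""
    G = gram(dtm)
    n = len(G)
    rng = random.Random(seed)
    Q = orthonormalize([[rng.random() for _ in range(n)] for _ in range(k)])
    for _ in range(iters):                       # subspace iteration
        Q = orthonormalize([apply(G, q) for q in Q])
    GQ = [apply(G, q) for q in Q]
    B = [[sum(x * y for x, y in zip(Q[a], GQ[b])) for b in range(k)]
         for a in range(k)]
    # Rank-k approximation of the Gram matrix: inner products in latent space.
    Gk = [[sum(Q[a][i] * B[a][b] * Q[b][j] for a in range(k) for b in range(k))
           for j in range(n)] for i in range(n)]
    return [[Gk[i][j] / math.sqrt(Gk[i][i] * Gk[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical counts for: crisis, bailout, election, goal, match, league
dtm = [[2, 1, 0, 0, 0, 0],   # politics
       [0, 1, 2, 0, 0, 0],   # politics
       [0, 0, 0, 2, 1, 0],   # football
       [0, 0, 0, 0, 1, 2]]   # football
raw01 = cosine(dtm[0], dtm[1])
sim = lsa_cosine(dtm, k=2)
print(round(raw01, 3), round(sim[0][1], 3), round(sim[0][2], 3))
```

In this toy case the two politics articles share only one term, so their raw cosine similarity is low, while in the two-dimensional latent space they end up pointing in almost the same direction and stay unrelated to the football articles.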

How similar are two academic fields?

LSA of scientific publications. Each dot represents an article. Color indicates the field, proximity shows how similar two texts are.

LSA for 36 news articles on Cyprus (blue) and Greece (red). Axes: dimension 1 (horizontal), dimension 2 (vertical). Labeled groups of articles: football, immigration, Greek election live coverage, Euro crisis.

Latent semantic analysis performed with library lsa; news articles collected through The Guardian API with library GuardianR.

Topic models/Latent Dirichlet allocation (topicmodels)

Topic models (Blei et al., 2003)

• inductively identifies topics across texts based on word co-occurrence

• you choose both the texts that go in and the number of topics (!)

• able to identify latent topical differences

• generates both quantitative and qualitative results
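
The points above can be illustrated with a compact collapsed Gibbs sampler for LDA. This is a teaching sketch in plain Python, not the implementation used by the topicmodels package; the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are assumptions. Note that the number of topics `k` has to be supplied by the analyst.

```python
import random

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * k for _ in docs]       # document-topic counts
    nkw = [[0] * V for _ in range(k)]   # topic-word counts
    nk = [0] * k                        # tokens assigned to each topic
    z = []                              # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(k)        # random initial assignment
            zd.append(t)
            ndk[d][t] += 1
            nkw[t][widx[w]] += 1
            nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], widx[w]
                ndk[d][t] -= 1          # remove the token's current assignment
                nkw[t][wi] -= 1
                nk[t] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][wi] + beta)
                           / (nk[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t             # resample and restore counts
                ndk[d][t] += 1
                nkw[t][wi] += 1
                nk[t] += 1
    # Quantitative output: topic proportions per document.
    theta = [[(ndk[d][j] + alpha) / (len(doc) + k * alpha) for j in range(k)]
             for d, doc in enumerate(docs)]
    # Qualitative output: most frequent words per topic.
    top_words = [[vocab[i] for i in sorted(range(V), key=lambda i: -nkw[j][i])[:3]]
                 for j in range(k)]
    return theta, top_words

# Hypothetical toy corpus, already tokenized.
docs = [["crisis", "bailout", "crisis", "election"],
        ["election", "bailout", "election"],
        ["goal", "match", "goal", "league"],
        ["league", "match", "league"]]
theta, top_words = lda_gibbs(docs, k=2)
print(theta)
print(top_words)
```

The per-document proportions in `theta` are the quantitative result, while the top words per topic are what a researcher reads and labels, which is the qualitative part.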

Word cloud for the topic female fashion (Jockers, 2013)

Topics in historical English novels and their distribution between male and female authors (Jockers & Mimno, 2013, p. 762)

Topics in two blog platforms (Puschmann & Bastos, 2015)

16 topics in two blog platforms

Sentiment analysis with sentiment and syuzhet

Sentiment analysis (Pang & Lee, 2008)

• assigns basic emotions (joy, anger, amusement) and polarity scores (+1/-1) to documents

• usually based on a bag-of-words approach

• easily misjudges connotations, so use with care
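
A bag-of-words polarity scorer fits in a few lines, and the same lines show why it misjudges connotations. The sketch below is plain Python with a tiny hand-made lexicon (an assumption for illustration, far smaller than any real sentiment dictionary); it is not the scoring method of any particular R package.

```python
import re

# Toy polarity lexicon (hypothetical; real lexicons hold thousands of entries).
LEXICON = {"love": 1, "amazing": 1, "happy": 1, "good": 1,
           "hate": -1, "bad": -1, "sad": -1, "terrible": -1}

def polarity(text):
    """Sum word scores, ignoring word order, and return +1, 0 or -1."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(t, 0) for t in tokens)
    return (score > 0) - (score < 0)

positive = polarity("I love this amazing coffee")
negative = polarity("The service was bad and I hate waiting")
misjudged = polarity("This coffee is not good")   # negation is invisible
print(positive, negative, misjudged)
```

Because word order is discarded, "not good" is scored as positive, which is exactly the kind of error that calls for manual checks of the output.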

Sentiment analysis in R (a): legacy package sentiment

Word cloud of tweets containing the keyword Starbucks, grouped by emotion category (anger, disgust, fear, joy, sadness, surprise, unknown).

Emotion and polarity in tweets containing the keyword Starbucks: https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment

Comparing plot arc (through sentiment) in two novels

http://www.matthewjockers.net/2015/02/02/syuzhet/

Sentiment analysis in R (b): experimental package syuzhet
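
The idea of tracing a plot arc through sentiment can be sketched as two steps: score each sentence, then smooth the sequence over narrative time. The example below is plain Python and does not reproduce syuzhet's API or its smoothing method; the toy lexicon, the moving-average smoother, and the five-sentence "story" are assumptions.

```python
import re

# Toy polarity lexicon (hypothetical).
LEXICON = {"happy": 1, "good": 1, "love": 1,
           "terrible": -1, "sad": -1, "bad": -1}

def sentence_scores(text):
    """Split into sentences and sum lexicon scores within each one."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [sum(LEXICON.get(t, 0) for t in re.findall(r"[a-z']+", s.lower()))
            for s in sentences]

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

story = ("They were happy. Life was good. Then came a terrible storm. "
         "Everything felt sad. Slowly things got good again.")
scores = sentence_scores(story)
arc = moving_average(scores)
print(scores)
print([round(x, 2) for x in arc])
```

Plotting `arc` against sentence position gives a rough emotional trajectory; comparing two novels then means comparing two such curves after rescaling them to a common length.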

Misc stuff

package gender

Summary: Text mining with R

• R has become much better at text processing in recent years

• Processing speeds are still somewhat lower than with Python (depending on what you want to do)

• Text mining is largely exploratory and presupposes qualitative knowledge about your data

• CRAN Task View NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Thanks!

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. doi:10.1162/jmlr.2003.3.4-5.993

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History (p. 208). Champaign, IL: University of Illinois Press.

Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750–769. doi:10.1016/j.poetic.2013.08.005

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135. doi:10.1561/1500000011

Puschmann, C., & Bastos, M. (2015). How digital are the digital humanities? An analysis of two scholarly blogging platforms. PLoS ONE, 10, e0115035. doi:10.1371/journal.pone.0115035
