Text Mining with R: A brief overview
Cornelius Puschmann
This talk
1. From NLP to text mining
2. Building corpora
3. Latent semantic analysis (LSA)
4. Topic models/Latent Dirichlet allocation (LDA)
5. Sentiment analysis
6. Misc useful packages
The most intuitive procedures are not necessarily the best ones…
Word cloud of Barack Obama’s inaugural address
From NLP to text mining
(One) origin of text mining: early efforts to digitize (for example) religious texts. Above: Index Thomisticus on punch cards.
Father Roberto Busa pioneered language processing of religious texts (the works of Thomas Aquinas) in collaboration with IBM’s Thomas J. Watson, starting in the late 1940s.
Textual concordance
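A concordance lists every occurrence of a word together with its immediate context (keyword in context, KWIC). The following is a minimal sketch in plain Python, not taken from the talk; the function name `concordance`, the `width` parameter, and the sample sentence are illustrative assumptions.

```python
import re

def concordance(text, keyword, width=2):
    """Return keyword-in-context lines: `width` tokens on each side of every hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sample text, for illustration only
sample = "The cat sat on the mat. The dog sat on the log."
for line in concordance(sample, "sat"):
    print(line)
```

Widening `width` shows more context per hit; a real concordance would usually also record where in the text each hit occurs.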
part-of-speech tagging = annotate for word class (noun, verb, adjective…)
parsing = annotate for parts of a sentence (subject, verb, object)
Natural Language Processing (NLP)
Text mining: more pragmatic than NLP, interested in
a) comparing texts to each other
b) tracking changes over time
c) getting information out of texts
d) measuring properties such as sentiment/polarity
Building corpora with tm
Preprocessing a corpus
• tokenize
• remove punctuation/numbers/whitespace
• remove stopwords
• stem
• build DTM/TDM (document-term matrix/term-document matrix)
tm can perform these and further steps!
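To make the preprocessing steps concrete, here is a minimal sketch in plain Python (standard library only; it does not use tm). The stopword list, the crude suffix-stripping stemmer, and the two sample documents are illustrative assumptions, far simpler than what a real package provides.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "on", "to"}

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation and numbers
    tokens = text.split()                          # tokenize; collapses whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

def document_term_matrix(docs):
    """Rows are documents, columns are the sorted vocabulary terms."""
    counts = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical sample documents
docs = ["The cats are chasing 2 mice!", "A cat chased the mouse."]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Transposing the matrix gives the term-document layout (TDM). Note that the crude stemmer leaves the irregular plural "mice" and the singular "mouse" as separate terms, which is typical of suffix-based stemming.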
Latent semantic analysis (lsa)
Latent semantic analysis (Deerwester et al., 1990)
• allows texts to be compared on multiple dimensions based on co-occurring terms
• frequently used to quantify (and visualize) document difference/similarity
• difference/similarity as measured by LSA is not necessarily topical, but is also affected by style, document length, etc.
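The idea behind LSA can be sketched as a truncated singular value decomposition of the term-document matrix, with each document placed at its coordinates in the leading latent dimensions. The sketch below is plain Python and is not the lsa package's implementation: it assumes a small dense matrix, finds the leading eigenvectors of the document-by-document Gram matrix by power iteration, and uses a made-up matrix of term counts.

```python
import math

def lsa_document_coords(tdm, dims=2, iterations=200):
    """Place documents in `dims` latent dimensions.

    tdm is a term-document matrix (rows = terms, columns = documents).
    Coordinates are eigenvectors of A^T A scaled by the singular values,
    i.e. the document side of a truncated SVD.
    """
    n_docs = len(tdm[0])
    gram = [[float(sum(row[i] * row[j] for row in tdm)) for j in range(n_docs)]
            for i in range(n_docs)]
    coords = [[0.0] * dims for _ in range(n_docs)]
    for d in range(dims):
        vec = [float(i + 1) for i in range(n_docs)]  # fixed start keeps runs deterministic
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):                  # power iteration
            new = [sum(g * x for g, x in zip(gram_row, vec)) for gram_row in gram]
            norm = math.sqrt(sum(x * x for x in new))
            if norm < 1e-12:
                break
            vec = [x / norm for x in new]
        image = [sum(g * x for g, x in zip(gram_row, vec)) for gram_row in gram]
        eigenvalue = sum(v * w for v, w in zip(vec, image))
        sigma = math.sqrt(max(eigenvalue, 0.0))      # singular value
        for i in range(n_docs):
            coords[i][d] = sigma * vec[i]
        for i in range(n_docs):                      # deflate: remove the found component
            for j in range(n_docs):
                gram[i][j] -= eigenvalue * vec[i] * vec[j]
    return coords

# Hypothetical counts. Columns: football A, football B, economy A, economy B
tdm = [
    [2, 1, 0, 0],  # goal
    [1, 2, 0, 0],  # match
    [1, 1, 0, 1],  # team
    [0, 0, 2, 1],  # bank
    [0, 0, 1, 2],  # debt
    [0, 1, 1, 1],  # euro
]
coords = lsa_document_coords(tdm)
for name, point in zip(["football A", "football B", "economy A", "economy B"], coords):
    print(name, [round(x, 2) for x in point])
```

In this toy example the two football documents end up close together and far from the two economy documents. Because the coordinates are built from raw counts, longer documents also drift apart from shorter ones, which is one way document length affects the result.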
How similar are two academic fields?
LSA of scientific publications. Each dot represents an article. Color indicates the field, proximity shows how similar two texts are.
LSA for 36 news articles on Cyprus (blue) and Greece (red)
[Scatter plot: axes are dimension 1 and dimension 2; labels in the plot: football, immigration, Greek election live coverage, Euro crisis]
Latent semantic analysis performed with library lsa, news articles collected through The Guardian API, library GuardianR
Topic models/Latent Dirichlet allocation (topicmodels)
Topic models (Blei et al., 2003)
• inductively identifies topics across texts based on word co-occurrence
• you set both the number of texts and the number of topics (!)
• able to identify latent topical differences
• generates both quantitative and qualitative results
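One common way to fit an LDA topic model is collapsed Gibbs sampling, sketched below in plain Python. This is a toy illustration under stated assumptions, not the topicmodels implementation: the function name `lda_gibbs`, the hyperparameter defaults `alpha` and `beta`, and the sample documents are all made up for the example. Note that the number of topics is an input you choose, not something the model discovers.

```python
import random

def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                        # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                                      # tokens per topic
    assignments = []
    for d, doc in enumerate(docs):                 # random initial topic for every token
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]              # take the token out of the counts
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                           / (topic_total[t] + v * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k              # put it back under the sampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Quantitative result: topic proportions per document
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    # Qualitative result: the most frequent words per topic
    top_words = [sorted(vocab, key=lambda w: -topic_word[t][w])[:3]
                 for t in range(n_topics)]
    return theta, top_words

# Hypothetical tokenized documents
docs = [
    "goal match team goal striker".split(),
    "team match striker goal".split(),
    "bank debt euro bank crisis".split(),
    "euro crisis debt bank".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=2)
print(top_words)
print([[round(p, 2) for p in row] for row in theta])
```

The two outputs correspond to the two kinds of results: `theta` gives each document's topic proportions, while `top_words` gives word lists that still have to be interpreted and named by a human reader.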
Word cloud for the topic female fashion (Jockers, 2013)
Topics in historical English novels and their distribution between male and female authors (Jockers & Mimno, 2013, p. 762)
Topics in two blog platforms (Puschmann & Bastos, 2015)
16 topics in two blog platforms
Sentiment analysis with sentiment and syuzhet
Sentiment analysis (Pang & Lee, 2008)
• assigns basic emotions (joy, anger, amusement) and polarity scores (+1/-1) to documents
• usually based on a bag-of-words approach
• easily misjudges connotations, so use with care
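A bag-of-words polarity score can be sketched in a few lines, which also shows why it misjudges text so easily. The lexicon below is a tiny hypothetical one made up for illustration; neither it nor the function `polarity` comes from the sentiment or syuzhet packages.

```python
import re

# Toy polarity lexicon (hypothetical; real lexicons hold thousands of entries)
LEXICON = {"love": 1, "amazing": 1, "happy": 1, "good": 1,
           "hate": -1, "bad": -1, "awful": -1, "terrible": -1}

def polarity(text):
    """Sum word scores, ignoring word order; return +1, -1, or 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    return (score > 0) - (score < 0)

print(polarity("I love this amazing coffee"))  # scored positive
print(polarity("The coffee was not good"))     # also scored positive: negation is ignored
```

Because word order is discarded, "not good" counts the same as "good". Handling negation, irony, or domain-specific connotations requires more than a lexicon lookup.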
Sentiment analysis in R (a): legacy package sentiment
[Word cloud of terms from the tweets, grouped by emotion category: anger, disgust, fear, joy, sadness, surprise, unknown]
Emotion and polarity in tweets containing the keyword Starbucks https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment
Comparing plot arc (through sentiment) in two novels
http://www.matthewjockers.net/2015/02/02/syuzhet/
Sentiment analysis in R (b): experimental package syuzhet
Misc useful packages
package gender
Summary: Text mining with R
• R has become much better at text processing in recent years
• Processing speeds are still somewhat lower than with Python (depending on what you want to do)
• Text mining is largely exploratory and presupposes qualitative knowledge about your data
• CRAN Task View NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. doi:10.1162/jmlr.2003.3.4-5.993
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History (p. 208). Champaign, IL: University of Illinois Press.
Jockers, M. L., & Mimno, D. (2013). Significant themes in 19th-century literature. Poetics, 41(6), 750–769. doi:10.1016/j.poetic.2013.08.005
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135. doi:10.1561/1500000011
Puschmann, C., & Bastos, M. (2015). How digital are the digital humanities? An analysis of two scholarly blogging platforms. PLoS ONE, 10, e0115035. doi:10.1371/journal.pone.0115035