Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics – System Demonstrations, pages 85–90, Vancouver, Canada, July 30 – August 4, 2017. ©2017 Association for Computational Linguistics
https://doi.org/10.18653/v1/P17-4015
Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ
Jason S. Kessler
CDK Global
Abstract
Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank of a term's frequency within a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.
1 Introduction

Finding words and phrases that discriminate categories of text is a common application of statistical NLP. For example, finding words that are most characteristic of a political party in congressional speeches can help political scientists identify means of partisan framing (Monroe et al., 2008; Grimmer, 2010), while identifying differences in word usage between male and female characters in films can highlight narrative archetypes (Schofield and Mehr, 2016). Language use in social media can inform understanding of personality types (Schwartz et al., 2013), and provides insights into customers' evaluations of restaurants (Jurafsky et al., 2014).
A wide range of visualizations have been used to highlight discriminating words: simple ranked lists of words, word clouds, word bubbles, and word-based scatterplots. These techniques have a number of limitations, such as the difficulty of comparing the relative frequencies of two terms in a word cloud, or of legibly displaying term labels in a scatterplot.
Scattertext¹ is an interactive, scalable tool which overcomes many of these limitations. It is built around a scatterplot which displays a high number of words and phrases used in a corpus. Points representing terms are positioned to allow a high number of unobstructed labels and to indicate category association. The coordinates of a point indicate how frequently the word is used in each category.
Figure 1 shows an example of a Scattertext plot comparing Republican and Democratic political speeches. The higher up a point is on the y-axis, the more it was used by Democrats; similarly, the further right a point appears on the x-axis, the more its corresponding word was used by Republicans. Highly associated terms fall closer to the upper left-hand and lower right-hand corners of the chart, while stop words fall in the far upper right-hand corner. Words occurring infrequently in both classes fall closer to the lower left-hand corner. When the plot is used interactively, mousing over a point shows statistics about a term's relative use in the two contrasting categories, and clicking on a term shows excerpts from the convention speeches that used it.
The point placement, intelligent word-labeling, and auxiliary term lists ensure a low-whitespace, legible plot. These are issues which have plagued other scatterplot visualizations showing discriminative language.

§2 discusses the different views of term-category association that form the basis of such visualizations. §3 covers the objectives, strengths, and weaknesses of existing visualization techniques. §4 presents the technical details behind Scattertext.
¹ github.com/JasonKessler/scattertext
Figure 1: Scattertext visualization of words and phrases used in the 2012 political conventions. 2,202 points are colored red or blue based on the association of their corresponding terms with Democrats or Republicans, 215 of which were labeled. The corpus consists of 123 speeches by Democrats (76,864 words) and 66 by Republicans (58,138 words). The most associated terms are listed under the "Top Democrat" and "Top Republican" headings. Interactive version: https://jasonkessler.github.io/st-main.html
§5 discusses how Scattertext can be used to identify category-discriminating terms that are semantically similar to a query.
2 On text visualization

The simplest visualization, a list of words ranked by their scores, is easy to produce and interpret, and is thus very common in the literature. There are numerous ways of producing word scores for ranking, which are thoroughly covered in previous work. The reader is directed to Monroe et al. (2008) (subsequently referred to as MCQ) for an overview of model-based term scoring algorithms. Also of interest, Bitvai and Cohn (2015) present a method for finding sparse word and phrase scores from a trained ANN (with bag-of-words features) and its training data.
Regardless of how complex the calculation, word scores capture a number of different measures of word association, which can be interesting when viewed independently instead of as part of a unitary score. These loosely defined measures include:
Precision A word's discriminative power regardless of its frequency. A term that appears once in the categorized corpus will have perfect precision. This (and the subsequent metrics) presupposes a balanced class distribution. Words close to the x- and y-axes in Scattertext have high precision.
Recall The frequency with which a word appears in a particular class, or P(word|class). The variance of precision tends to decrease as recall increases. Extremely high recall words tend to be stop words. High recall words occur close to the top and right sides of Scattertext plots.
Non-redundancy The level of a word's discriminative power given the other words that co-occur with it. If a word wa always co-occurs with wb, and wb has higher precision and recall, wa would have a high level of redundancy. Measuring redundancy is non-trivial, and has traditionally been approached through penalized logistic regression (Joshi et al., 2010), as well as through other feature selection techniques. In configurations of Scattertext such as the one discussed at the end of §4, terms can be colored based on regression coefficients that indicate non-redundancy.
Characteristicness How much more a word occurs in the categories examined than in background in-domain text. For example, when comparing positive and negative reviews of a single movie, a logical background corpus may be reviews of other movies. Highly associated terms tend to be characteristic because they frequently appear in one category and not the other. Some visualizations explicitly highlight these, e.g., Coppersmith and Kelly (2014).
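The first two of these measures can be computed directly from raw term counts. A minimal sketch, using hypothetical counts and assuming the balanced class distribution noted above:

```python
from collections import Counter

# Hypothetical term counts for two document categories, A and B.
counts_a = Counter({"medicare": 40, "liberty": 2, "the": 500})
counts_b = Counter({"medicare": 5, "liberty": 30, "the": 480})

def precision(term):
    """Share of a term's occurrences that fall in category A."""
    a, b = counts_a[term], counts_b[term]
    return a / (a + b)

def recall(term):
    """P(term | A): the term's relative frequency within category A."""
    return counts_a[term] / sum(counts_a.values())
```

Here "medicare" has high precision for A (40 of 45 occurrences), while "the" has high recall in both categories but precision near 0.5, placing it among the stop words in the upper right of the plot.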
3 Past work and design motivation
Text visualizations manipulate the position and appearance of words, or of points representing them, to indicate their relative scores in these measures. For example, in Schwartz et al. (2013), two word clouds are given, one for each category of text being compared. Words (and selected n-grams) are sized by their linear regression coefficients (a composite metric of precision, recall, and redundancy) and colored by frequency. Only words occurring in ≥1% of documents and having Bonferroni-corrected coefficient p-values of <0.001 were shown. Given that these words are highly correlated to their class of interest, the frequency of use is likely a good proxy for recall.
Coppersmith and Kelly (2014) also describe a word-cloud based visualization for discriminating terms, but intend it for categories which are both small subsets of a much larger corpus. They include a third, middle cloud for terms that appear characteristic.
Word clouds can be difficult to interpret. It is difficult to compare the sizes of two non-horizontally adjacent words, as well as the relative color intensities of any two words. Longer words unintentionally appear more important, since they naturally occupy more space in the cloud. The sizing of words can be a source of confusion when used to represent precision, since a larger word may naturally be seen as more frequent.
Bostock et al. (2012)² features an interactive word-bubble visualization for exploring different word usage among Republicans and Democrats at the 2012 US presidential nominating conventions. Each term displayed is represented by a bubble sized proportionally to its frequency. Each bubble is colored blue and red, s.t. the blue partition's size corresponds to the term's relative use by Democrats. Terms were manually chosen and arranged along the x-axis based on their discriminative power. When a bubble is clicked, sentences from speeches containing the word are listed below the visualization.
The dataset used in Bostock et al. (2012) isused to demonstrate the capabilities of Scattertext
² nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
in each of these figures. The dataset is availablevia the Scattertext Github page.
3.1 Scatterplot visualizations
Figure 2: A sample of existing scatterplot visualizations. MCQ's is at the top. tidytext's is below.
MCQ present a visualization to illustrate the use of their proposed word score, log-odds-ratio with an informative Dirichlet prior (top of Figure 2). This visualization plots word-representing points along two axes: log10 recall vs. the difference in word-score z-scores. Points with a z-score difference < 1.96 are grayed out, while the top and bottom 20 are labeled, both by each point and on the right-hand side. The side labeling is necessary because labels are permitted to overlap, hindering their on-plot readability. The sizes of points and labels are increased proportionally to the word score. This word score encompasses precision, recall, and characteristicness, since it penalizes the scores of terms used more frequently in the background corpus. MCQ used this type of plot to illustrate the different effects of the various scoring techniques introduced in their paper. However, the small number of points which can be labeled limits its utility for in-depth corpus analysis.
Schofield and Mehr (2016) use essentially the same visualization, but plot over 100 corresponding n-grams next to an unlabeled frequency/z-score plot. While this is appropriate for publication, displaying associated terms and the shape of the score distribution, it is impossible to align all but the highest scoring points to their labels.
The tidytext R package (Silge and Robinson, 2016) documentation includes a non-interactive ggplot2-based scatterplot that is very similar to Scattertext. The x- and y-axes both, as in Scattertext, correspond to word frequencies in the two contrasting categories, with jitter added.³ In the example in Figure 2 (bottom), the contrasting categories are tweets from two different accounts. The red diagonal line separates words based on their odds ratio. Importantly, compared to MCQ, less of this chart's area is occupied by whitespace.
While tidytext's labels do not overlap each other (in contrast to MCQ), they do overlap points. The points' semi-transparency makes labels in less-dense areas legible, but the dense interior of the chart is nearly illegible, with both points and labels obscured. Figure 3 shows an excerpt of the same
Figure 3: A small cropping from an un-jittered version of the plot at the bottom of Figure 2. The dark, opaque points indicate stacks of points.
plot, but with no jitter. Words appearing with the same frequency in both categories become stacked atop each other; however, this provides more interior space for labeling.
As a side note, many text visualizations plot words in a 2D space according to their similarity in a high-dimensional space. For example, Cho et al. (2014) use the Barnes-Hut-SNE to plot words in a 2D space s.t. those with similar representations are grouped close together. Class association does not play a role in this line of research, and global position is essentially irrelevant.
The next section presents Scattertext and howits approach to word ordering solves the problemsdiscussed above.
³ This type of visualization may have first been introduced in Rudder (2014).
4 Scattertext

Scattertext builds on tidytext and Rudder (2014). It plots a set of unigrams and bigrams (referred to in this paper as "terms") found in a corpus of documents assigned to one of two categories on a two-dimensional scatterplot.
In the following notation, user-supplied param-eters are in bold typeface.
Consider a corpus of documents C with disjoint subsets A and B s.t. A ∪ B ≡ C. Let φT(t, C) be the number of times term t occurs in C, and φT(t, A) the number of times t occurs in A. Let φD(t, A) refer to the number of documents in A containing t. Let tij be the jth word in term ti; in practice, j ∈ {1, 2}. The parameter φ may be φT or φD.⁴ Other feature representations (e.g., tf.idf) may be used for φ.

    Pr[ti] = φ(ti, C) / Σ{t ∈ C ∧ |t| ≡ |ti|} φ(t, C).   (1)
The construction of the set of terms included in the visualization, V, is a two-step process. Terms must occur ≥ m times and, if bigrams, appear to be phrases. In order to keep the approach language neutral, I follow Schwartz et al. (2013) and use a pointwise mutual information score to filter out bigrams that do not occur far more frequently than would be expected. Let

    PMI(ti) = log( Pr[ti] / Π{tij ∈ ti} Pr[tij] ).   (2)

The minimum PMI accepted is p. Now, V can be defined as

    V = {t | φ(t, C) ≥ m ∧ (|t| ≡ 1 ∨ PMI(t) > p)}.   (3)
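A minimal sketch of Formulas 1–3, using hypothetical counts; the log base in the PMI computation is an assumption, as the text leaves it unstated:

```python
import math
from collections import Counter

# Hypothetical corpus-wide term counts; bigrams are written as two
# space-separated tokens.
phi = Counter({
    "insurance": 60, "companies": 50, "insurance companies": 40,
    "the": 900, "the insurance": 2,
})

def n_tokens(term):
    return len(term.split())

def prob(term):
    """Pr[t]: t's count over the total count of same-length terms (Formula 1)."""
    denom = sum(c for t, c in phi.items() if n_tokens(t) == n_tokens(term))
    return phi[term] / denom

def pmi(bigram):
    """Formula 2; log base 2 is an assumption."""
    w1, w2 = bigram.split()
    return math.log2(prob(bigram) / (prob(w1) * prob(w2)))

def visible_terms(m, p):
    """Formula 3: terms occurring >= m times; bigrams must clear PMI > p."""
    return {t for t, c in phi.items()
            if c >= m and (n_tokens(t) == 1 or pmi(t) > p)}
```

With m = 5 and p = 8, "insurance companies" survives the filter while "the insurance", which occurs no more often than chance predicts, does not.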
Let a term t's coordinates on the scatterplot be (xAt, xBt), where A and B are the two document categories. Although xKt is proportional to φ(t, K), many terms will have identical φ(t, K) values. To break ties, the word that appears last alphabetically will have a larger xKt.

Let us define rKt, s.t. t ∈ V and K ∈ {A, B}, as the rank of φ(t, K), sorted in ascending order, where ties are broken by terms' alphabetical order. This allows us to define

    xKt = rKt / max_t rKt.   (4)
⁴ φD is useful when documents contain unique, characteristic, highly frequent terms. For example, names of movies can have high φT when finding differences between positive and negative film reviews. This may lead to them receiving higher scores than sentiment terms.
This limits x values to [0, 1], ensuring both axes are scaled identically. This keeps the chart from becoming lopsided toward the corpus that had a larger number of terms.⁵
The charts in Figures 1, 4, and 5 were made with parameters m = 5, p = 8, and φ = φT.
Breaking ties alphabetically is a simple but important alternative to jitter. While jitter (i.e., randomly perturbing xAt and xBt) breaks up the stacked points shown in Figure 3, it eliminates the empty space needed to legibly label points. Jitter can also make it seem like identically frequent points are closer to the upper left or lower right corner. Alphabetic tie-breaking makes identical adjustments to both axes, leading to the diagonal (lower-left to upper-right) alignments of identically frequent points. This angle does not cause one point to be substantially closer to either of the category-associated corners (the upper left and lower right).
These alignments provide two advantages. First, they open up point-free tracts in the center of the chart, which allow for unobstructed interior labels. Second, they arrange points in a way that makes it easy to hover a mouse over each of them, to see which term it corresponds to, and to click it to see excerpts containing that term.
In the running example, 154 points were labeled when a jitter of 10% of each axis and no tie-breaking was applied. 210 points (a 36% lift) were labeled when tie-breaking was applied with no jitter. 140 were labeled when neither was used.
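The tie-broken ranking behind Formula 4 can be sketched as follows (hypothetical frequencies; Python's tuple sort supplies the alphabetical tie-break):

```python
# Hypothetical per-category frequencies; "coal" and "fair" are tied,
# and the tie is broken alphabetically, as Formula 4 prescribes.
freq_a = {"coal": 3, "fair": 3, "medicare": 10, "liberty": 1}

def axis_coords(freq):
    """Rank terms by (frequency, alphabetical order), scale ranks to (0, 1]."""
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    rank = {t: i + 1 for i, t in enumerate(ordered)}
    max_rank = max(rank.values())
    return {t: r / max_rank for t, r in rank.items()}

coords = axis_coords(freq_a)
# "fair" outranks "coal" despite the identical frequency,
# because it sorts later alphabetically.
```

Because the same key is used on both axes, tied terms shift by the same amount in x and y, producing the diagonal alignments described above rather than random displacement.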
Rudder (2014) observed that terms closer to the lower-right corner were used frequently in A and infrequently in B, indicating that they have both high recall and precision w.r.t. category A. Symmetrically, the same relationship exists between B and the upper-left corner. This intuition can be formalized as a score relating a point's coordinates to its respective corner, via a function sK(t) (K ∈ {A, B} and t ∈ V) where

    sK(t) = ‖⟨1 − xAt, xBt⟩‖ if K = A,
            ‖⟨xAt, 1 − xBt⟩‖ if K = B.   (5)
Other term scoring methods (e.g., regression weights or a weighted log-odds-ratio with a prior) may be used in place of Formula 5.
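Formula 5 can be sketched directly; the corner assignments (A's corner at (1, 0), B's at (0, 1)) follow the description above:

```python
import math

def s(x_a, x_b, category):
    """Formula 5: the norm of the vector from a term's point to its
    category's corner -- (1, 0) for A, (0, 1) for B -- so terms nearer
    that corner (i.e., more strongly associated) score nearer zero."""
    if category == "A":
        return math.hypot(1 - x_a, x_b)
    return math.hypot(x_a, 1 - x_b)
```

A term at (0.9, 0.1) scores as strongly A-associated as a term at (0.1, 0.9) scores B-associated, reflecting the symmetry between the two corners.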
Maximal non-overlapping labeling of scatterplots is NP-hard (Been et al., 2007). Scattertext's heuristic is to label a point if space is available in one of many places around it. This is performed iteratively, beginning with the points having the highest score (regardless of category) and proceeding downward in score. An optimized data structure, automatically constructed using Cozy (Loncaric et al., 2016), holds the locations of drawn points and labels.

⁵ While both are available, ordinal ranks are preferable to log frequency, since uninteresting stop words often occupy disproportionate axis space.
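A simplified sketch of this greedy labeling heuristic; the candidate offsets and label-box size are illustrative assumptions, and a plain list stands in for the Cozy-synthesized structure:

```python
def greedy_label(points):
    """points: list of (score, x, y, term) tuples.
    Visit points in descending score order, try a few candidate label
    boxes around each, and keep a label only if its box overlaps no
    previously drawn box."""
    W, H = 0.2, 0.05          # assumed label box size, in axis units
    drawn = []                # boxes already placed: (x0, y0, x1, y1)

    def overlaps(box):
        return any(not (box[2] <= b[0] or b[2] <= box[0] or
                        box[3] <= b[1] or b[3] <= box[1]) for b in drawn)

    labeled = []
    for score, x, y, term in sorted(points, reverse=True):
        # candidate anchors: right of, left of, above, and below the point
        for dx, dy in ((0.01, 0), (-W - 0.01, 0), (0, 0.06), (0, -0.06 - H)):
            box = (x + dx, y + dy, x + dx + W, y + dy + H)
            if not overlaps(box):
                drawn.append(box)
                labeled.append(term)
                break
    return labeled
```

Points that find no free candidate position are simply left unlabeled, which is why higher-scoring (more category-associated) terms win the available space.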
The top scoring terms in classes B and A (Democrats and Republicans in Figure 1) are listed to the right of the chart. Hovering over points and terms highlights the point and displays frequency statistics.
Point colors are determined by their scores on s. Those corresponding to terms with a high sB are colored in progressively darker shades of blue, while those with a higher sA are colored in progressively darker shades of red. When both scores are about equal, the point colors become more yellow, which creates a visual divide between the two classes. The colors are provided by the "RdYlBu" diverging color scheme from Colorbrewer⁶ via d3.⁷
Other point colors (and scorings) can be used. For example, Figure 4 shows the coefficients of an ℓ1-penalized logistic regression classifier over the features in V. Scattertext, in this example, is set to color zero-valued coefficients light gray. Terms' univariate predictive power is still evident from their chart position. An interactive version is available online.⁸
Figure 4: A cropped view of points being colored using ℓ1-logreg coefficients. Interactive version: jasonkessler.github.io/st-sparse.html
5 Topical category discriminators

In 2012, how did Republicans and Democrats use language relating to "jobs", "healthcare", or "military" differently? Figure 5 shows, in the running example, words similar to "jobs" that were characteristic of the political parties.

⁶ colorbrewer2.org
⁷ github.com/d3/d3-scale-chromatic
⁸ jasonkessler.github.io/sparseviz.html
Figure 5: Words and phrases that are semantically similar to the word "jobs" are colored darker on a gray-to-purple scale, and general and category-specific related terms are listed to the right. Note that this is a cropping of the upper left-hand corner of the plot. Interactive version: jasonkessler.github.io/st-sim.html
In this configuration of Scattertext, words are colored by their cosine similarity to a query phrase. This is done using spaCy⁹-provided GloVe (Pennington et al., 2014) word vectors (trained on the Common Crawl corpus). Mean vectors are used for phrases.
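A sketch of this similarity computation, with toy three-dimensional vectors standing in for the 300-dimensional GloVe vectors; phrase vectors are means of word vectors, as described:

```python
import math

# Toy embedding table standing in for the GloVe vectors spaCy provides.
vec = {
    "jobs":    [0.9, 0.1, 0.0],
    "workers": [0.8, 0.3, 0.1],
    "liberty": [0.0, 0.2, 0.9],
}

def term_vector(term):
    """Mean of the word vectors, so multiword terms get a single vector."""
    vs = [vec[w] for w in term.split()]
    return [sum(c) / len(vs) for c in zip(*vs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Similarity of each term to the query "jobs" drives the color scale.
sim = {t: cosine(term_vector(t), vec["jobs"]) for t in ("workers", "liberty")}
```

Under these toy vectors "workers" sits much closer to "jobs" than "liberty" does, so it would receive a darker purple in the Figure 5 scheme.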
The calculation of the most similar terms associated with each category is a simple heuristic. First, sets of terms closely associated with a category are found. Second, these terms are ranked based on their similarity to the query, and the top-ranked terms are displayed to the right of the scatterplot (Figure 5).
A term is considered associated if its p-value is < 0.05. P-values are determined using MCQ's difference in the weighted log-odds-ratio with an uninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al. (2008) that does not rely on a large in-domain background corpus. Since I am scoring bigrams in addition to the unigrams scored by MCQ, the corpus would have to be larger to have high enough bigram counts for proper penalization.
This function relies on the Dirichlet distribution's parameter α ∈ R^{|V|}+. Following MCQ, αt = 0.01. Their Formulas 16, 18, and 22 are used to compute z-scores, which are then converted to p-values using the Normal CDF of ζ^{(A−B)}w, letting y^{(K)}t = φ(t, K) s.t. K ∈ {A, B} and t ∈ V.
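A sketch of the z-score computation, following the standard form of MCQ's log-odds-ratio with an uninformative Dirichlet prior; the counts and vocabulary size below are hypothetical:

```python
import math

def log_odds_z(y_a, y_b, n_a, n_b, alpha=0.01, vocab_size=1000):
    """Weighted log-odds-ratio with an uninformative Dirichlet prior,
    in the standard form of MCQ's Formulas 16, 18, and 22.
    y_a, y_b: the term's counts in each category; n_a, n_b: total
    category counts; vocab_size stands in for |V|."""
    a0 = alpha * vocab_size  # total prior mass
    delta = (math.log((y_a + alpha) / (n_a + a0 - y_a - alpha))
             - math.log((y_b + alpha) / (n_b + a0 - y_b - alpha)))
    var = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)  # approximate variance
    return delta / math.sqrt(var)
```

A two-sided p-value then follows from the Normal CDF, e.g. `2 * (1 - NormalDist().cdf(abs(z)))` with `statistics.NormalDist`; a term clears the 0.05 threshold when |z| > 1.96.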
As seen in Figure 5, the top Republican wordrelated to “jobs” is “job creators”, while “workers”is the top Democratic term.
⁹ spacy.io
6 Conclusion and future work

Scattertext, a tool to make legible, comprehensive visualizations of class-associated term frequencies, was introduced. Future work will involve rigorous human evaluation of the usefulness of the visualization strategies discussed.
Acknowledgments

Jay Powell, Kyle Lo, Ray Little-Herrick, Will Headden, Chuck Little, Nancy Kessler and Kam Woods helped proofread this work.
References

Ken Been, Eli Daiches, and Chee Yap. 2007. Dynamic map labeling. IEEE-VCG.

Zsolt Bitvai and Trevor Cohn. 2015. Non-linear text regression with a deep convolutional neural network. In ACL.

Mike Bostock, Shan Carter, and Matthew Ericson. 2012. At the national conventions, the words they used. In The New York Times.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR.

Glen Coppersmith and Erin Kelly. 2014. Dynamic wordclouds and vennclouds for exploratory data analysis. In ACL-ILLVI.

Justin Ryan Grimmer. 2010. Representational Style: The Central Role of Communication in Representation. Ph.D. thesis, Harvard University.

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In HLT-NAACL.

Dan Jurafsky, Victor Chahuneau, Bryan Routledge, and Noah Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday.

Calvin Loncaric, Emina Torlak, and Michael D. Ernst. 2016. Fast synthesis of fast collections. In PLDI.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Christian Rudder. 2014. Dataclysm: Who We Are (When We Think No One's Looking). Crown Publishing Group.

Alexandra Schofield and Leo Mehr. 2016. Gender-distinguishing features in film dialogue. NAACL-CLfL.

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLOS ONE.

Julia Silge and David Robinson. 2016. tidytext: Text mining and analysis using tidy data principles in R. JOSS.