Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics - System Demonstrations, pages 85-90, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-4015

Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ

Jason S. Kessler
CDK Global

[email protected]

Abstract

Scattertext is an open source tool for visualizing linguistic variation between document categories in a language-independent way. The tool presents a scatterplot, where each axis corresponds to the rank-frequency a term occurs in a category of documents. Through a tie-breaking strategy, the tool is able to display thousands of visible term-representing points and find space to legibly label hundreds of them. Scattertext also lends itself to a query-based visualization of how the use of terms with similar embeddings differs between document categories, as well as a visualization for comparing the importance scores of bag-of-words features to univariate metrics.

1 Introduction

Finding words and phrases that discriminate categories of text is a common application of statistical NLP. For example, finding words that are most characteristic of a political party in congressional speeches can help political scientists identify means of partisan framing (Monroe et al., 2008; Grimmer, 2010), while identifying differences in word usage between male and female characters in films can highlight narrative archetypes (Schofield and Mehr, 2016). Language use in social media can inform understanding of personality types (Schwartz et al., 2013), and provides insights into customers' evaluations of restaurants (Jurafsky et al., 2014).

A wide range of visualizations have been used to highlight discriminating words: simple ranked lists of words, word clouds, word bubbles, and word-based scatter plots. These techniques have a number of limitations. For example, it is difficult to compare the relative frequencies of two terms in a word cloud, or to legibly display term labels in scatterplots.

Scattertext¹ is an interactive, scalable tool which overcomes many of these limitations. It is built around a scatterplot which displays a high number of words and phrases used in a corpus. Points representing terms are positioned to allow a high number of unobstructed labels and to indicate category association. The coordinates of a point indicate how frequently the word is used in each category.

Figure 1 shows an example of a Scattertext plot comparing Republican and Democratic political speeches. The higher up a point is on the y-axis, the more it was used by Democrats; similarly, the further right on the x-axis a point appears, the more its corresponding word was used by Republicans. Highly associated terms fall closer to the upper left and lower right-hand corners of the chart, while stop words fall in the far upper right-hand corner. Words occurring infrequently in both classes fall closer to the lower left-hand corner. When used interactively, mousing over a point shows statistics about a term's relative use in the two contrasting categories, and clicking on a term shows excerpts from the convention speeches in which it was used.

The point placement, intelligent word-labeling, and auxiliary term-lists ensure a low-whitespace, legible plot. These are issues which have plagued other scatterplot visualizations showing discriminative language.

§2 discusses different views of term-category association that make up the basis of visualizations. §3 covers the objectives, strengths, and weaknesses of existing visualization techniques. §4 presents the technical details behind Scattertext.

¹ github.com/JasonKessler/scattertext


[Figure 1 appears here: a scatterplot with x-axis "Republican Frequency" and y-axis "Democratic Frequency" (each running Infrequent, Average, Frequent), flanked by "Top Democratic", "Top Republican", and "Characteristic" term lists and a term search box.]

Figure 1: Scattertext visualization of words and phrases used in the 2012 Political Conventions. 2,202 points are colored red or blue based on the association of their corresponding terms with Democrats or Republicans, 215 of which were labeled. The corpus consists of 123 speeches by Democrats (76,864 words) and 66 by Republicans (58,138 words). The most associated terms are listed under "Top Democrat" and "Top Republican" headings. Interactive version: https://jasonkessler.github.io/st-main.html

§5 discusses how Scattertext can be used to identify category-discriminating terms that are semantically similar to a query.

2 On text visualization

The simplest visualization, a list of words ranked by their scores, is easy to produce and interpret, and is thus very common in the literature. There are numerous ways of producing word scores for ranking, which are thoroughly covered in previous work. The reader is directed to Monroe et al. (2008) (subsequently referred to as MCQ) for an overview of model-based term scoring algorithms. Also of interest, Bitvai and Cohn (2015) present a method for finding sparse word and phrase scores from a trained ANN (with bag-of-words features) and its training data.

Regardless of how complex the calculation, word scores capture a number of different measures of word-association, which can be interesting when viewed independently instead of as part of a unitary score. These loosely defined measures include:

Precision A word's discriminative power regardless of its frequency. A term that appears once in the categorized corpus will have perfect precision. This (and subsequent metrics) presupposes a balanced class distribution. Words close to the x- and y-axes in Scattertext have high precision.

Recall The frequency with which a word appears in a particular class, or P(word|class). The variance of precision tends to decrease as recall increases. Extremely high-recall words tend to be stop-words. High-recall words occur close to the top and right sides of Scattertext plots.

Non-redundancy The level of a word's discriminative power given other words that co-occur with it. If a word w_a always co-occurs with w_b, and word w_b has a higher precision and recall, w_a would have a high level of redundancy. Measuring redundancy is non-trivial, and has traditionally been approached through penalized logistic regression (Joshi et al., 2010), as well as through other feature selection techniques. In configurations of Scattertext such as the one discussed at the end of §4, terms can be colored based on their regression coefficients, which indicate non-redundancy.

Characteristicness How much more a word occurs in the categories examined than in background in-domain text. For example, if comparing positive and negative reviews of a single movie, a logical background corpus may be reviews of other movies. Highly associated terms tend to be characteristic because they frequently appear in one category and not the other. Some visualizations explicitly highlight these, e.g., Coppersmith and Kelly (2014).
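The first two of these measures follow directly from per-category counts. The sketch below is a toy illustration under the balanced-class assumption stated above, not Scattertext's internals; the terms and the `counts_a`/`counts_b` tables are invented for the example.

```python
from collections import Counter

# Hypothetical term counts for two document categories (balanced classes assumed).
counts_a = Counter({"the": 500, "liberty": 40, "pell": 1})
counts_b = Counter({"the": 480, "liberty": 5, "grants": 30})
total_a = sum(counts_a.values())

def precision(term):
    """Share of a term's occurrences that fall in category A."""
    fa, fb = counts_a[term], counts_b[term]
    return fa / (fa + fb) if fa + fb else 0.0

def recall(term):
    """P(term | A): how frequently the term appears in category A."""
    return counts_a[term] / total_a

# A term seen once, only in A, has perfect precision but tiny recall...
assert precision("pell") == 1.0
# ...while a stop-word has middling precision and comparatively high recall.
assert 0.4 < precision("the") < 0.6
```

As the assertions show, precision alone rewards rare one-category terms, which is why the unitary scores discussed below combine it with recall.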

3 Past work and design motivation

Text visualizations manipulate the position and appearance of words or points representing them to indicate their relative scores in these measures. For example, in Schwartz et al. (2013), two word clouds are given, one for each category of text being compared. Words (and selected n-grams) are sized by their linear regression coefficients (a composite metric of precision, recall, and redundancy) and colored by frequency. Only words occurring in ≥1% of documents and having Bonferroni-corrected coefficient p-values of <0.001 were shown. Given that these words are highly correlated with their class of interest, their frequency of use is likely a good proxy for recall.

Coppersmith and Kelly (2014) also describe a word-cloud based visualization for discriminating terms, but intend it for categories which are both small subsets of a much larger corpus. They include a third, middle cloud for terms that appear characteristic.

Word clouds can be difficult to interpret. It is difficult to compare the sizes of two non-horizontally adjacent words, as well as the relative color intensities of any two words. Longer words unintentionally appear more important since they naturally occupy more space in the cloud. Sizing of words can be a source of confusion when used to represent precision, since a larger word may naturally be seen as more frequent.

Bostock et al. (2012)² features an interactive word-bubble visualization for exploring different word usage among Republicans and Democrats in the 2012 US presidential nominating conventions. Each term displayed is represented by a bubble, sized proportionately to its frequency. Each bubble is colored blue and red, s.t. the blue partition's size corresponds to the term's relative use by Democrats. Terms were manually chosen and arranged along the x-axis based on their discriminative power. When clicked, sentences from speeches containing the word are listed below the visualization.

The dataset used in Bostock et al. (2012) is used to demonstrate the capabilities of Scattertext in each of these figures. The dataset is available via the Scattertext Github page.

² nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html

3.1 Scatterplot visualizations

[Figure 2 appears here: the top panel reproduces MCQ's feature evaluation and selection plot from Monroe et al. (2008); the bottom panel reproduces tidytext's scatterplot of word frequencies in two Twitter accounts ("Julia" vs. "David"), with log-scaled percentage axes running from 0.01% to 1.00%.]

Figure 2: A sample of existing scatterplot visualizations. MCQ's is at the top. Tidytext is below.

MCQ present a visualization to illustrate the use of their proposed word score, log-odds-ratio with an informative Dirichlet prior (top of Figure 2). This visualization plots word-representing points along two axes: log10 recall vs. the z-score of the difference in word scores. Points with a z-score difference < 1.96 are grayed out, while the top and bottom 20 are labeled, both by each point and on the right-hand side. The side-labeling is necessary because labels are permitted to overlap, hindering their on-plot readability. The sizes of points and labels are increased proportionally to the word score. This word score encompasses precision, recall, and characteristicness, since it penalizes scores of terms used more frequently in the background corpus. MCQ used this type of plot to illustrate the different effects of various scoring techniques introduced in the paper. However, the small number of points which can be labeled limits its utility for in-depth corpus analysis.
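The score underlying MCQ's plot can be sketched as a z-scored log-odds-ratio with an informative Dirichlet prior. The following is one minimal reading of Monroe et al. (2008), not their reference implementation: the counts, the prior table, and the prior scale are hypothetical, and the background corpus is assumed to contain every scored term.

```python
import math

def log_odds_with_prior(y_i, y_j, prior, alpha0=500.0):
    """z-scored log-odds-ratio with an informative Dirichlet prior,
    after Monroe et al. (2008). y_i, y_j: term -> count in the two
    categories; prior: term -> count in a background corpus (assumed
    to cover every term in y_i and y_j); alpha0: prior sample size."""
    n_i, n_j = sum(y_i.values()), sum(y_j.values())
    n_prior = sum(prior.values())
    scores = {}
    for w in set(y_i) | set(y_j):
        a_w = alpha0 * prior[w] / n_prior     # prior pseudo-count for w
        delta = (math.log((y_i.get(w, 0) + a_w)
                          / (n_i + alpha0 - y_i.get(w, 0) - a_w))
                 - math.log((y_j.get(w, 0) + a_w)
                            / (n_j + alpha0 - y_j.get(w, 0) - a_w)))
        var = 1.0 / (y_i.get(w, 0) + a_w) + 1.0 / (y_j.get(w, 0) + a_w)
        scores[w] = delta / math.sqrt(var)    # |z| < 1.96 would be grayed out
    return scores

# Hypothetical counts: "medicare" skews to category i, "liberty" to j.
dem = {"medicare": 30, "the": 500}
rep = {"liberty": 25, "the": 480}
background = {"medicare": 10, "liberty": 10, "the": 1000}
scores = log_odds_with_prior(dem, rep, background)
assert scores["medicare"] > 0 > scores["liberty"]
```

The prior shrinks function words like "the" toward zero while leaving genuinely partisan terms with large |z|, which is exactly the effect MCQ's plot is designed to show.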

Schofield and Mehr (2016) use essentially the same visualization, but plot over 100 corresponding n-grams next to an unlabeled frequency/z-score plot. While this is appropriate for publication, displaying associated terms and the shape of the score distribution, it is impossible to align all but the highest scoring points to their labels.

The tidytext R-package (Silge and Robinson, 2016) documentation includes a non-interactive ggplot2-based scatter plot that is very similar to Scattertext. The x- and y-axes both, like in Scattertext, correspond to word frequencies in the two contrasting categories, with jitter added.³ In the example in Figure 2 (bottom), the contrasting categories are tweets from two different accounts. The red diagonal line separates words based on their odds-ratio. Importantly, compared to MCQ, less of this chart's area is occupied by whitespace.

While tidytext's labels do not overlap each other (in contrast to MCQ), they do overlap points. The points' semi-transparency makes labels in less-dense areas legible, but the dense interior of the chart is nearly illegible, with both points and labels obscured. Figure 3 shows an excerpt of the same plot, but with no jitter. Words appearing with the same frequency in both categories all become stacked atop each other; however, this provides more interior space for labeling.

[Figure 3 appears here: a cropped excerpt of the un-jittered tidytext plot.]

Figure 3: A small cropping from an un-jittered version of the plot at the bottom of Figure 2. The dark, opaque points indicate stacks of points.

As a side note, many text visualizations plot words in a 2D space according to their similarity in a high-dimensional space. For example, Cho et al. (2014) use the Barnes-Hut-SNE to plot words in a 2D space s.t. those with similar representations are grouped close together. Class-association does not play a role in this line of research, and global position is essentially irrelevant.

The next section presents Scattertext and how its approach to word ordering solves the problems discussed above.

³ This type of visualization may have first been introduced in Rudder (2014).

4 Scattertext

Scattertext builds on tidytext and Rudder (2014). It plots a set of unigrams and bigrams (referred to in this paper as "terms") found in a corpus of documents assigned to one of two categories on a two-dimensional scatterplot.

In the following notation, user-supplied parameters are in bold typeface.

Consider a corpus of documents C with disjoint subsets A and B s.t. A ∪ B ≡ C. Let φ_T(t, C) be the number of times term t occurs in C, and φ_T(t, A) be the number of times t occurs in A. Let φ_D(t, A) refer to the number of documents in A containing t. Let t_ij be the jth word in term t_i. In practice, j ∈ {1, 2}. The parameter φ may be φ_T or φ_D.⁴ Other feature representations (e.g., tf.idf) may be used for φ.

Pr[t_i] = φ(t_i, C) / Σ_{t ∈ C ∧ |t| ≡ |t_i|} φ(t, C).   (1)

The construction of the set of terms included in the visualization V is a two-step process. Terms must occur ≥ m times and, if bigrams, appear to be phrases. In order to keep the approach language-neutral, I follow Schwartz et al. (2013) and use a pointwise mutual information score to filter out bigrams that do not occur far more frequently than would be expected. Let

PMI(t_i) = log ( Pr[t_i] / Π_{t_ij ∈ t_i} Pr[t_ij] ).   (2)

The minimum PMI accepted is p. Now, V can be defined as

V = { t | φ(t, C) ≥ m ∧ (|t| ≡ 1 ∨ PMI(t) > p) }.   (3)
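Equations 1-3 can be sketched in a few lines. This is a toy illustration, not Scattertext's code: the count tables are hypothetical, a natural-log PMI is assumed, and a real corpus would supply the frequency tables.

```python
import math
from collections import Counter

def build_term_set(unigram_counts, bigram_counts, m=5, p=8):
    """Sketch of Equations 1-3: keep terms occurring >= m times, and keep
    a bigram only if its PMI against its component words exceeds p.
    Counts are raw frequencies over the whole corpus C."""
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())

    def pr_unigram(w):                      # Equation 1, restricted to |t| = 1
        return unigram_counts[w] / n_uni

    def pmi(bigram):                        # Equation 2 (natural log assumed)
        w1, w2 = bigram.split()
        pr_bi = bigram_counts[bigram] / n_bi
        return math.log(pr_bi / (pr_unigram(w1) * pr_unigram(w2)))

    terms = {w for w, c in unigram_counts.items() if c >= m}
    terms |= {b for b, c in bigram_counts.items()
              if c >= m and pmi(b) > p}     # Equation 3
    return terms

# Hypothetical counts: "pell grants" is a true collocation, "the other" is not.
uni = Counter({"pell": 6, "grants": 6, "the": 2000, "other": 27988})
bi = Counter({"pell grants": 5, "the other": 100, "other other": 24895})
terms = build_term_set(uni, bi, m=5, p=8)
assert "pell grants" in terms and "the other" not in terms
```

With these counts, "pell grants" occurs roughly 5,000 times more often than chance (PMI ≈ 8.5 > p), while the frequent but unsurprising "the other" is filtered out despite clearing the count threshold m.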

Let a term t's coordinates on the scatterplot be (x_t^A, x_t^B), where A and B are the two document categories. Although x_t^K is proportional to φ(t, K), many terms will have identical φ(t, K) values. To break ties, the word that appears last alphabetically will have a larger x_t^K.

Let us define r_t^K, s.t. t ∈ V and K ∈ {A, B}, as the ranks of φ(t, K), sorted in ascending order, where ties are broken by terms' alphabetical order. This allows us to define

x_t^K = r_t^K / argmax r^K.   (4)
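The rank-based coordinates of Equation 4, with the alphabetical tie-break, can be sketched as follows (a toy illustration, not Scattertext's implementation; the function name and example terms are invented):

```python
def rank_coordinates(freq_a, freq_b):
    """Sketch of Equation 4: each axis coordinate is the term's frequency
    rank in that category (ascending, ties broken alphabetically), scaled
    by the maximum rank so that all coordinates fall in [0, 1]."""
    terms = sorted(set(freq_a) | set(freq_b))
    coords = {}
    for freqs, axis in ((freq_a, 0), (freq_b, 1)):
        # Sorting by (frequency, term) gives identically frequent terms
        # distinct ranks, with the alphabetically later term ranked higher.
        ranked = sorted(terms, key=lambda t: (freqs.get(t, 0), t))
        max_rank = len(ranked)
        for rank, t in enumerate(ranked, start=1):
            coords.setdefault(t, [0.0, 0.0])[axis] = rank / max_rank
    return {t: tuple(xy) for t, xy in coords.items()}

# Two terms with identical frequencies in both categories: the tie-break
# shifts both coordinates identically, placing them on a diagonal.
coords = rank_coordinates({"apple": 2, "zebra": 2}, {"apple": 2, "zebra": 2})
assert coords["zebra"] == (1.0, 1.0) and coords["apple"] == (0.5, 0.5)
```

Because the same adjustment is applied on both axes, neither tied term ends up closer to a category-associated corner, which is the property argued for below.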

⁴ φ_D is useful when documents contain unique, characteristic, highly frequent terms. For example, names of movies can have high φ_T when finding differences in positive and negative film reviews. This may lead to them receiving higher scores than sentiment terms.

This limits x values to [0, 1], ensuring both axes are scaled identically. This keeps the chart from becoming lopsided toward the corpus that had a larger number of terms.⁵

The charts in Figures 1, 4, and 5 were made with parameters m=5, p=8, and φ=φ_T.

Breaking ties alphabetically is a simple but important alternative to jitter. While jitter (i.e., randomly perturbing x_t^A and x_t^B) breaks up the stacked points shown in Figure 3, it eliminates the empty space needed to legibly label points. Jitter can also make it seem like identically frequent points are closer to an upper-left or lower-right corner. Alphabetic tie-breaking makes identical adjustments to both axes, leading to the diagonal (lower-left to upper-right) alignments of identically frequent points. This angle does not cause one point to be substantially closer to either of the category-associated corners (the upper-left and lower-right).

These alignments provide two advantages. First, they open up point-free tracts in the center of the chart which allow for unobstructed interior labels. Second, they arrange points in a way that makes it easy to hover a mouse over each of them to see which term it corresponds to, and to click it to see excerpts containing that term.

In the running example, 154 points were labeled when a jitter of 10% of each axis was applied with no tie-breaking. 210 points (a 36% lift) were labeled with tie-breaking and no jitter. 140 were labeled with neither jitter nor tie-breaking.

Rudder (2014) observed that terms closer to the lower-right corner were used frequently in A and infrequently in B, indicating they have both high recall and precision w.r.t. category A. Symmetrically, the same relationship exists for B and the upper-left corner. I can formalize this score between a point's coordinates and its respective corner. This intuition is represented by a score function s_K(t) (K ∈ {A, B} and t ∈ V) where

s_K(t) = ‖⟨1 − x_t^A, x_t^B⟩‖ if K = A;  ‖⟨x_t^A, 1 − x_t^B⟩‖ if K = B.   (5)

Other term scoring methods (e.g., regression weights or a weighted log-odds-ratio with a prior) may be used in place of Formula 5.
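Formula 5 is the Euclidean distance from a term's point to a category's corner: lower-right (1, 0) for A, upper-left (0, 1) for B. A minimal sketch (the function name is invented):

```python
import math

def corner_score(x_a, x_b, category):
    """Sketch of Formula 5. (x_a, x_b) in [0, 1]^2 are a term's rank
    coordinates; the score is the distance from the point to the given
    category's corner: (1, 0) for A, (0, 1) for B."""
    if category == "A":
        return math.hypot(1.0 - x_a, x_b)
    return math.hypot(x_a, 1.0 - x_b)

# A point sitting exactly on a category's corner scores 0 for that
# category and sqrt(2) (the maximum) for the opposite one.
assert corner_score(1.0, 0.0, "A") == 0.0
assert corner_score(1.0, 0.0, "B") == math.hypot(1.0, 1.0)
```

Under this reading, smaller s_K(t) means the point sits closer to K's corner, i.e., the term is more associated with K.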

Maximal non-overlapping labeling of scatterplots is NP-hard (Been et al., 2007). Scattertext's heuristic is to label points if space is available in one of many places around a point. This is performed iteratively, beginning with points having the highest score (regardless of category) and proceeding downward in score. An optimized data structure automatically constructed using Cozy (Loncaric et al., 2016) holds the locations of drawn points and labels.

⁵ While both are available, ordinal ranks are preferable to log frequency since uninteresting stop-words often occupy disproportionate axis space.
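The iterative pass described above can be sketched as follows. This is a deliberately simplified toy version: it tries a single candidate label position per point and scans a plain list of drawn boxes, whereas Scattertext tries several positions around each point and uses a Cozy-generated index; the names and box dimensions are invented.

```python
def greedy_label(points, label_w=0.08, label_h=0.03):
    """Sketch of the labeling heuristic: visit points in descending score
    order and label a point only if its label box overlaps no previously
    placed label. points: iterable of (term, x, y, score)."""
    drawn = []    # (x0, y0, x1, y1) rectangles of placed labels
    labeled = []
    for term, x, y, score in sorted(points, key=lambda p: -p[3]):
        box = (x, y, x + label_w, y + label_h)
        # Two axis-aligned rectangles overlap unless one is entirely to
        # the left of, right of, above, or below the other.
        overlaps = any(not (box[2] <= b[0] or b[2] <= box[0] or
                            box[3] <= b[1] or b[3] <= box[1])
                       for b in drawn)
        if not overlaps:
            drawn.append(box)
            labeled.append(term)
    return labeled

# The mid-scoring point "b" is crowded out by the higher-scoring "a",
# while the distant "c" still receives a label.
pts = [("a", 0.1, 0.1, 5), ("b", 0.11, 0.11, 4), ("c", 0.5, 0.5, 3)]
assert greedy_label(pts) == ["a", "c"]
```

Processing in score order means that when space runs out, it is always the lower-scoring (less category-associated) terms that go unlabeled.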

The top scoring terms in classes B and A (Democrats and Republicans in Figure 1) are listed to the right of the chart. Hovering over points and terms highlights the point and displays frequency statistics.

Point colors are determined by their scores on s. Those corresponding to terms with a high s_B are colored in progressively darker shades of blue, while those with a higher s_A are colored in progressively darker shades of red. When both scores are about equal, the point colors become more yellow, which creates a visual divide between the two classes. The colors are provided by the "RdYlBu" diverging color scheme from ColorBrewer⁶ via d3.⁷

Other point colors (and scorings) can be used. For example, Figure 4 shows coefficients of an ℓ1-penalized logistic regression classifier over the features in V. Scattertext, in this example, is set to color 0-scoring coefficients light gray. Terms' univariate predictive power is still evident from their chart positions. See below⁸ for an interactive version.

[Figure 4 (interactive scatterplot residue): axes "Democratic Frequency" and "Republican Frequency", each ranging Infrequent-Average-Frequent; side lists of Top Democratic, Top Republican, and Characteristic terms; term tooltips showing example mentions and a term search box. Democratic document count: 123; word count: 76,864. Republican document count: 66; word count: 58,138.]

Figure 4: A cropped view of points being colored using ℓ1-logreg coefficients. Interactive version: jasonkessler.github.io/st-sparse.html

5 Topical category discriminators

In 2012, how did Republicans and Democrats use language relating to "jobs", "healthcare", or "military" differently? Figure 5 shows, in the running example, words similar to "jobs" that were characteristic of the political parties.

6 colorbrewer2.org
7 github.com/d3/d3-scale-chromatic
8 jasonkessler.github.io/sparseviz.html



[Figure 5 (interactive scatterplot residue): axes "Democratic Frequency" and "Republican Frequency", each ranging Infrequent-Average-Frequent; side lists of Top Democratic, Top Republican, and Most similar terms for the query "jobs".]

Figure 5: Words and phrases that are semantically similar to the word "jobs" are colored darker on a gray-to-purple scale, and general and category-specific related terms are listed to the right. Note that this is a cropping of the upper left-hand corner of the plot. Interactive version: jasonkessler.github.io/st-sim.html.

In this configuration of Scattertext, words are colored by their cosine similarity to a query phrase. This is done using spaCy9-provided GloVe (Pennington et al., 2014) word vectors (trained on the Common Crawl corpus). Mean vectors are used for phrases.
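The similarity computation above can be sketched in a few lines. This is a minimal stand-in using hypothetical toy vectors in place of spaCy's GloVe embeddings; the phrase vector is, as in the paper, the mean of its tokens' word vectors:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def phrase_vector(phrase, vectors):
    # Mean of the word vectors of the phrase's in-vocabulary tokens.
    vecs = [vectors[tok] for tok in phrase.split() if tok in vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical 2-d embeddings standing in for 300-d GloVe vectors.
toy = {"jobs": [0.9, 0.1], "job": [0.8, 0.2], "creators": [0.2, 0.8]}
score = cosine(phrase_vector("job creators", toy), toy["jobs"])
```

In Scattertext itself the vectors come from a loaded spaCy model; the scores feed the gray-to-purple color scale in Figure 5.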

The calculation of the most similar terms associated with each category is a simple heuristic. First, sets of terms closely associated with a category are found. Second, these terms are ranked based on their similarity to the query, and the top-ranked terms are displayed to the right of the scatterplot (Figure 5).

A term is considered associated if its p-value is < 0.05. P-values are determined using MCQ's difference in the weighted log-odds-ratio with an uninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al. that does not rely on a large in-domain background corpus. Since I am scoring bigrams in addition to the unigrams scored by MCQ, the corpus would have to be larger to yield bigram counts high enough for proper penalization.

This function relies on the Dirichlet distribution's parameter α ∈ ℝ^{|V|}_+. Following MCQ, α_t = 0.01. Formulas 16, 18 and 22 are used to compute z-scores, which are then converted to p-values using the Normal CDF of ζ^{(A−B)}_w, letting y_t^{(K)} = φ(t, K) s.t. K ∈ {A, B} and t ∈ V.
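The z-score computation can be sketched as follows. This is a simplified reading of Monroe et al.'s formulas 16, 18 and 22, with a hypothetical total prior mass α_0 (here a nominal 1,000-term vocabulary at α_t = 0.01 each), not Scattertext's exact implementation:

```python
import math
from statistics import NormalDist

def log_odds_z(y_a, n_a, y_b, n_b, alpha_t=0.01, alpha_0=1000 * 0.01):
    # Difference of Dirichlet-smoothed log-odds between categories A and B
    # for one term, divided by its approximate standard deviation.
    delta = (math.log((y_a + alpha_t) / (n_a + alpha_0 - y_a - alpha_t))
             - math.log((y_b + alpha_t) / (n_b + alpha_0 - y_b - alpha_t)))
    # Approximate variance of delta (Monroe et al., eq. 18).
    var = 1.0 / (y_a + alpha_t) + 1.0 / (y_b + alpha_t)
    return delta / math.sqrt(var)

def p_value(z):
    # One-sided p-value from the Normal CDF.
    return 1.0 - NormalDist().cdf(z)

# "success": 26 Republican vs. 8 Democratic mentions per 25,000 terms.
z = log_odds_z(26, 25000, 8, 25000)
```

A z-score this far from zero gives a p-value well under the 0.05 association threshold, so the term would be counted as Republican-associated.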

As seen in Figure 5, the top Republican word related to "jobs" is "job creators", while "workers" is the top Democratic term.

9 spacy.io

6 Conclusion and future work

Scattertext, a tool to make legible, comprehensive visualizations of class-associated term frequencies, was introduced. Future work will involve rigorous human evaluation of the usefulness of the visualization strategies discussed.

Acknowledgments

Jay Powell, Kyle Lo, Ray Little-Herrick, Will Headden, Chuck Little, Nancy Kessler and Kam Woods helped proofread this work.

References

Ken Been, Eli Daiches, and Chee Yap. 2007. Dynamic map labeling. IEEE-VCG.

Zsolt Bitvai and Trevor Cohn. 2015. Non-linear text regression with a deep convolutional neural network. In ACL.

Mike Bostock, Shan Carter, and Matthew Ericson. 2012. At the national conventions, the words they used. In The New York Times.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR.

Glen Coppersmith and Erin Kelly. 2014. Dynamic wordclouds and vennclouds for exploratory data analysis. In ACL-ILLVI.

Justin Ryan Grimmer. 2010. Representational Style: The Central Role of Communication in Representation. Ph.D. thesis, Harvard University.

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith. 2010. Movie reviews and revenues: An experiment in text regression. In HLT-NAACL.

Dan Jurafsky, Victor Chahuneau, Bryan Routledge, and Noah Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday.

Calvin Loncaric, Emina Torlak, and Michael D. Ernst. 2016. Fast synthesis of fast collections. In PLDI.

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Christian Rudder. 2014. Dataclysm: Who We Are (When We Think No One's Looking). Crown Publishing Group.

Alexandra Schofield and Leo Mehr. 2016. Gender-distinguishing features in film dialogue. NAACL-CLfL.

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLOS ONE.

Julia Silge and David Robinson. 2016. tidytext: Text mining and analysis using tidy data principles in R. JOSS.


