Strategy and tactics for graphic multiples in Stata

Date post: 06-Jan-2016

Nicholas J. Cox Department of Geography Durham University, UK

Comparison

Many useful graphs compare two or more sets of values, and so can be thought of as multiples.

Often there can be a fine line between richly detailed graphics and busy, unintelligible graphics that lead nowhere.

In this presentation I survey strategy and tactics for developing good graphic multiples in Stata.

Strategies: what to do

superimpose (on top) or juxtapose (alongside)?

plot different versions or reductions of the data

transform scales for easier comparison

linear reference patterns

backdrops of context

Tactics: details of what to do

over() and by() options and graph combine

kill the key or lose the legend if you can

annotations and self-explanatory markers

Datasets visited

James Short’s collation from the transit of Venus

Florence Nightingale’s data on deaths in the Crimean War

deaths from the Titanic sinking

Grunfeld panel data

admissions to Berkeley

hostility in response to insult or apology

fluctuations in Arctic sea ice

Original programs discussed

catplot (SSC)

devnplot (SSC)

qplot (Stata Journal)

sparkline (SSC)

spineplot (SJ)

stripplot (SSC)

tabplot (SSC)

Categorical comparisons

Berkeley admissions data

A classic dataset covers admissions to six graduate majors by gender at UC Berkeley.

At first sight, females were discriminated against.

But there is an underlying interaction: major by major, females generally do well, yet their acceptance rates are worse on more popular majors.

This is an example of an amalgamation paradox named for E.H. Simpson (1922–) but known to K. Pearson (1857–1936) and G.U. Yule (1871–1951).

Berkeley data references

The original reference was Bickel, P.J., E.A. Hammel and J.W. O’Connell. 1975. Sex bias in graduate admissions: Data from Berkeley. Science 187: 398–404.

The Berkeley data were discussed as an example for Stata in Cox, N.J. 2008. Spineplots and their kin. Stata Journal 8: 105–121.

A simple problem?

The structure of the data is already well known. The challenge is how best to present it.

There are three categorical variables:

major (anonymously A, B, C, D, E, F)

gender (male, female)

decision (accept, reject)

so the data are just 24 frequencies.

Bar chart

Many researchers would reach first for a bar chart.

Here is a slightly non-standard example, produced by tabplot (SSC), which is for one-way, two-way or three-way bar charts.

One feature here is showing numbers too in a hybrid of graph and table.

A cosmetic detail is toning down the use of colour. Large blocks with strong colours are unsubtle.

[Figure: bar chart of decision (admitted, rejected) by gender. Admitted: male 44.5%, female 30.4%. Rejected: male 55.5%, female 69.6%.]

Mosaic plot or spineplot

The previous bar chart omitted the frequencies. We can show them using a mosaic plot or spineplot.

The proportions of both variables are shown, giving marginal and conditional distributions.

Areas of tiles are proportional to raw frequencies. Departures from independence are easily seen.

The program here is spineplot.
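
As a minimal sketch, the gender by decision breakdown can be entered as four frequencies and passed to both tabplot and spineplot. The frequencies are the widely circulated Berkeley totals, which reproduce the 44.5% and 30.4% admission rates. The variable names and the particular options (percent(), showval, percent) are assumptions made for illustration, so check help tabplot and help spineplot after installing.

```stata
* community-contributed commands; install once if needed
* ssc install tabplot
* search spineplot, sj      // then follow the link to install

clear
input gender decision freq
1 1 1198
1 2 1493
2 1  557
2 2 1278
end
label define gender   1 "male"     2 "female"
label define decision 1 "admitted" 2 "rejected"
label values gender   gender
label values decision decision

* check: percent admitted and rejected within each gender
egen total = total(freq), by(gender)
gen pct = 100 * freq / total
list gender decision freq pct, sepby(gender) noobs

* bar chart as a hybrid of graph and table
tabplot decision gender [fw = freq], percent(gender) showval

* spineplot: tile areas proportional to the frequencies
spineplot decision gender [fw = freq], percent
```

Value labels rather than string variables are used here only so that the category order (male before female, admitted before rejected) is under explicit control.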

[Figure: spineplot of decision (admitted, rejected) by gender (male, female). Axes: percent by decision and percent by gender, each 0 to 100. Tiles labelled 44.5% and 30.4% admitted, 55.5% and 69.6% rejected.]

Drilling down

The bar chart and spineplot do a fair job of showing the gross breakdown with four percents. (Two are redundant.)

Predictably, both would be rejected as trivial by many journal reviewers, but both could be useful for presentations.

But clearly we need to drill down to see the patterns for different majors.

More detailed bar chart

Stacking bars is a standard strategy, but the result is immediately much more complicated.

Showing all the detail does not always help. Focusing more sharply on the response of interest is a way forward.

In general there is no need for alphabetical order. Here majors A to F are already ordered by admission rate.

[Figure: stacked horizontal bar chart of frequency (0 to 800) by major (A to F) and gender (female, male), with each bar divided into admitted and rejected.]

Dot chart

Dot charts as advocated by W.S. Cleveland remain under-used by comparison with bar charts.

In Stata that usually means graph dot.

By using marker position alone, rather than bar length, they are less busy and thus ease more detailed comparison.

Here it is easier to identify that female admission rates are higher for four majors and lower for the other two.
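
A dot chart of this kind might be drawn along the following lines. This is a sketch, not necessarily how the figure in the talk was produced. The frequencies are those of the widely reproduced version of the Bickel et al. data; the variable names (adm_m, app_m and so on) are invented for illustration.

```stata
clear
input str1 major adm_m app_m adm_f app_f
A 512 825  89 108
B 353 560  17  25
C 120 325 202 593
D 138 417 131 375
E  53 191  94 393
F  22 373  24 341
end

* admission rates (%) by gender
gen male   = 100 * adm_m / app_m
gen female = 100 * adm_f / app_f

* in how many majors is the female rate higher?
count if female > male

* open circle and plus tolerate overlap better than solid symbols
graph dot (asis) male female, over(major)            ///
    marker(1, msymbol(Oh)) marker(2, msymbol(+))     ///
    ytitle("admission rate (%)") legend(row(1))
```

Holding the two rates in separate variables puts both markers on the same line for each major, which is what makes the major by major comparison easy.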

[Figure: dot chart of admission rate (%), 0 to 100, by major (A to F), with separate markers for male and female.]

Details for dot charts

Open symbols (e.g. ○ not ●) tolerate overlap much better than closed symbols. ○ can even be combined with + whenever nearly equal values are possible.

Legends (keys) are at best a necessary evil. Self-explanatory or at least memorable symbolisation is to be prized wherever it is possible. Using blue for males and pink for females is a simple example.

A scatter plot?

Many statistically-minded people find the idea of bar charts trivial, but their practice not very helpful. Where is the scatter plot, they cry?

Plotting admission rate against number of applicants re-introduces a crucial aspect, size of major. This allows identification of positive correlation for males and negative correlation for females, hence the paradox.

This is currently my favourite plot for these data.
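
One way to sketch such a scatter plot follows, with the major letters themselves as marker labels so that no key is needed for majors. The data are the commonly reproduced Berkeley frequencies; the variable names and the colour choices are assumptions, not taken from the original figure.

```stata
clear
input str1 major str6 gender applicants admitted
A male   825 512
B male   560 353
C male   325 120
D male   417 138
E male   191  53
F male   373  22
A female 108  89
B female  25  17
C female 593 202
D female 375 131
E female 393  94
F female 341  24
end
gen rate = 100 * admitted / applicants

* sign of the correlation within each gender
correlate rate applicants if gender == "male"
correlate rate applicants if gender == "female"

* letters A to F replace the markers; colour carries gender
twoway (scatter rate applicants if gender == "male",                    ///
            msymbol(none) mlabel(major) mlabposition(0) mlabcolor(blue)) ///
       (scatter rate applicants if gender == "female",                  ///
            msymbol(none) mlabel(major) mlabposition(0) mlabcolor(pink)), ///
       ytitle("admission rate (%)") xtitle("number of applicants")      ///
       legend(off) note("blue = males, pink = females")
```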

[Figure: scatter plot of admission rate (%), 0 to 80, against number of applicants, 0 to 800, with points labelled by major (A to F) separately for males and females.]

Previously…

In an earlier version of this plot I had admissions versus applications, both raw frequencies.

Reference lines here are lines through the origin such as y = x and y = 0.5x for 100% and 50% admission rates.

But it is simpler to plot admission rates. Then the reference lines are horizontal.

Slogans: the banal in search of the profound

Focus as far as possible on the response or outcome, the variable you most want to explain.

Linear reference patterns are good and horizontal patterns better.

Omit what is unimportant and keep what is important.

Even for a very simple problem, it is rare that a single graph meets all needs.

Continuous comparisons

Hostility change

Results of an experiment reported by Atkinson, C. and J. Polivy. 1976. Effects of delay, attack, and retaliation on state depression and hostility. Journal of Abnormal Psychology 85: 570–576.

Male and female subjects were made to wait and then either were insulted or received an apology.

Half were given a chance to retaliate by negatively evaluating the experimenter.

Hostility was measured before and after the experiment.

Variables in hostility study

Response: change in hostility, a difference of scores and so approximately continuous

Predictors, all binary:

Treatment: insult, apology

Gender: male, female

Retaliation allowed: yes, no

ANOVA-type problems: What to plot?

Change in hostility is adequately modelled by a simple linear model, using analysis of variance.

What to plot for similar analyses is key here.

Box plots (with medians etc.) are surprisingly common even when comparison of means is the central question.

Plotting means with standard errors or confidence intervals is also common, but what about the detail omitted?

devnplot (SSC)

devnplot (SSC) is named for its emphasis on plotting deviations. Deviations are measured from any level you care to specify, but deviations from means are the default.

“devplot” was too ugly and “deviationplot” too long.

Quantile enthusiasts will see it as a way to plot ordered quantiles side by side. Compare quantile or qplot (SJ).

devnplot syntax

The syntax resembles standard modelling syntax, response named first and any predictors following.

With one variable named we get in essence a quantile plot for that variable, a plot of the ordered values versus an implicit cumulative probability scale.

The scaffolding emphasising that each value can be represented by a deviation from a level might seem redundant, but bear with me.

[Figure: devnplot of change in hostility, vertical axis -20 to 60, for all subjects together.]

Adding predictors to the syntax

You can specify either one or two predictors.

The result is a quantile plot for each subset, namely a category or combination of categories.

An undocumented upper limit arising from a limit in graph is 20 subsets, but more than 20 would likely be too busy anyway.

A third binary predictor can be shown indirectly by a separate() option.
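
The hostility data are not bundled with Stata, so as a stand-in the syntax just described can be tried on the auto dataset. This is a sketch only: heavy is a made-up indicator, and the assumption that separate() takes a variable name should be checked against help devnplot.

```stata
* ssc install devnplot   // once, if needed
sysuse auto, clear

* one variable: in essence a quantile plot
devnplot mpg

* one predictor, then two
devnplot mpg foreign
devnplot mpg foreign rep78

* a third, binary predictor shown through separate()
* (heavy is a hypothetical indicator created for illustration)
gen byte heavy = weight > 3000
devnplot mpg foreign rep78, separate(heavy)
```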

[Figure: devnplot of change in hostility, -20 to 60, by treatment (insult, apology).]

[Figure: devnplot of change in hostility, -20 to 60, by treatment (insult, apology) and gender (male, female).]

[Figure: devnplot of change in hostility, -20 to 60, by treatment and gender, with no retaliation and retaliation distinguished.]

devnplot virtues

The display serves well in showing variation within subsets as well as variation between.

Interactions can be seen.

The scaffolding (in subtle gray) helps to tie the values of a group together visually.

The separate() option is best used to highlight a few unusual or interesting cases.

Waterfall plots

Similar plots have been called waterfall plots, especially in clinical oncology.

But watch out: waterfall plots (or charts) have at least two quite different meanings elsewhere, in business and physical science contexts.

Sometimes the jungle of plot names is just a confounded nuisance.

James Short and the transit of Venus

Short (1763) collated and corrected observations made by various astronomers during the transit of Venus in 1761.

The parallax here is the angle subtended by the earth’s radius, as if viewed and measured from the surface of the sun.

The data will be published and discussed in Stata Journal 13(3).

Deviation plot

A deviation plot adjusts to the differing sample sizes.

Here deviations are relative to 25% trimmed means (otherwise known as midmeans or interquartile means). Boxplot fans can think that they average values within the box.

The context here of careful precise measurement does not rule out the occasional mild or even strong outlier.

[Figure: deviation plot of parallax (seconds), 6 to 11, by page in Short (1763): 310, 316, 325_1, 325_2. 25% trimmed means shown.]

Quantile plots

Deviation plots (waterfall plots, if you prefer) are in essence quantile plots.

qplot from SJ can superimpose through its over() option or juxtapose through its by() option.

How well does that compare?

[Figure: quantiles of parallax (seconds), 6 to 11, against fraction of the data, 0 to 1, with the groups 310, 316, 325_1 and 325_2 superimposed.]

[Figure: quantiles of parallax (seconds), 6 to 12, against fraction of the data, in four panels by page in Short (1763): 310, 316, 325_1, 325_2.]

devnplot or qplot?

I prefer devnplot here, although qplot has useful options too, including flexibility over axis scales.

For example, if we plot against standard normal quantiles, normal (Gaussian) distributions will follow straight lines.
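
Short's data are not to hand here, so the following sketch uses the auto dataset to show the three qplot displays under discussion: superimposed with over(), juxtaposed with by(), and plotted against standard normal quantiles through trscale(), in which @ stands for the plotting position. The choice of mpg and foreign is purely illustrative.

```stata
* search qplot, sj   // locate and install the Stata Journal version
sysuse auto, clear

* superimpose: groups share one panel
qplot mpg, over(foreign)

* juxtapose: one panel per group
qplot mpg, by(foreign)

* normal quantile scale: a Gaussian sample plots near a straight line
qplot mpg, by(foreign) trscale(invnormal(@)) ///
    xtitle("standard normal quantile")
```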

[Figure: quantiles of parallax (seconds), 6 to 12, against standard normal quantile, -2 to 2, in four panels by page in Short (1763): 310, 316, 325_1, 325_2.]

Strip plot

An alternative display is a strip plot or dot plot. (Many other names exist.)

Here it takes on the flavour of a histogram but with markers or point symbols for each value. Some binning allows stacking.

stripplot from SSC offers an alternative to official Stata’s dotplot.
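
For comparison, both commands can be tried on the auto dataset. This is a sketch under assumptions: the bin width of 1 and the height setting are illustrative values, not those used for the parallax figure.

```stata
* ssc install stripplot   // once, if needed
sysuse auto, clear

* official Stata
dotplot mpg, over(rep78)

* stripplot: bin to width 1 and stack markers within bins
stripplot mpg, over(rep78) stack width(1) height(0.4)
```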

[Figure: strip plot of parallax (seconds), 6 to 11, by page in Short (1763): 310, 316, 325_1, 325_2. 25% trimmed means shown.]

Histograms or box plots?

Many statistical people would start almost automatically with histograms or box plots for such data. How do they compare?

You can judge for yourself.

A specific problem with histograms is keeping the amount of scaffolding down. It is easy to lose valuable real estate in axis and title information.

[Figure: histograms of frequency, 0 to 10, against parallax (seconds), 6 to 12, in four panels by page in Short (1763): 310, 316, 325_1, 325_2.]

[Figure: the same histograms redrawn with less scaffolding.]

How did we do that?

The main trick here is moving the subtitles to the left. It only works here because they are so short, but accept good fortune, however it comes.

The incantation is subtitle(, ring(1) pos(9) nobox nobexpand)
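
In context, the option goes outside by(), so that it applies to the subtitle of each panel. A minimal sketch on the auto dataset, where the category labels 1 to 5 are also short enough for the trick to work:

```stata
sysuse auto, clear

* panels in one column; each panel's subtitle moved to the left (9 o'clock)
histogram mpg, frequency width(2) by(rep78, cols(1) note("")) ///
    subtitle(, ring(1) pos(9) nobox nobexpand)
```

The variables, bin width and note("") are choices made for this illustration only.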

Box plots

Box plots do work fairly well, but they just leave out too much detail for my taste.

If the details are accessible, you can decide for yourself whether they are trivial.

[Figure: horizontal box plots of parallax (seconds), 6 to 11, by page in Short (1763): 310, 316, 325_1, 325_2.]

Timed comparisons

Time series

Comparisons of time series are an especially rich, and especially challenging, area of statistical graphics.

The widespread term spaghetti plot hints immediately at the difficulties.

As always, we want to combine a grasp of general patterns with access to individual details.

sparkline

The Grunfeld data (webuse grunfeld) are a classic dataset in panel-based economics.

Ten companies were monitored for 1935–54.

This can be an example for sparkline (SSC).

The name sparkline was suggested by Edward Tufte for intense text-like graphics. Time series are the most obvious example.
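
A hedged sketch of basic use with the Grunfeld data follows. It assumes that sparkline takes one or more y variables followed by the x variable, and that by() is supported; check help sparkline for the full syntax.

```stata
* ssc install sparkline   // once, if needed
webuse grunfeld, clear

* three series for one company, stacked vertically
sparkline invest mvalue kstock year if company == 1

* the same display repeated for every company
sparkline invest mvalue kstock year, by(company)
```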

Vertical and horizontal

By default sparkline stacks small graphs vertically.

If several graphs are combined, it is typical to cut down on axis labels and rely on differences in shape to convey information.

Horizontal stacking is also supported, which can be useful for archaeological or environmental problems focused on variations with depth or height.

[Figure: sparklines of invest, mvalue and kstock against year, 1935 to 1955, stacked vertically, with minimum and maximum labelled: invest 257.7 to 1486.7, mvalue 2792.2 to 6241.7, kstock 2.8 to 2226.3.]

[Figure: sparklines of invest, mvalue and kstock against year, 1935 to 1955, in panels by company (1 to 10).]

[Figure: the same sparklines by company, with the minimum and maximum of each series labelled.]

Nightingale’s data

Florence Nightingale (1820–1910) is well remembered for her nursing in the Crimean War and less so as a pioneer in data analysis.

Her most celebrated dataset is often reproduced using her polar diagram, but is easier to think about as time series.

Zymotic (loosely, infectious) disease mortality dominates other kinds, so much so that a square root scale helps comparison. (A logarithmic scale over-transforms here.)
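
graph has no built-in square root scale, so one common approach is to plot the square root and label the axis in original units. The sketch below illustrates this on the Grunfeld data as a stand-in, since Nightingale's data are not bundled with Stata. The variable name root_invest and the tick positions are choices made for the example.

```stata
webuse grunfeld, clear
gen root_invest = sqrt(invest)

* build axis labels: a tick at r on the root scale is labelled r^2
local labels
forvalues r = 0(10)40 {
    local labels `labels' `r' "`=`r'^2'"
}

* connect(L) starts a new line whenever year stops increasing,
* that is, at each change of company
line root_invest year, connect(L)              ///
    ylabel(`labels', angle(horizontal))        ///
    ytitle("invest (square root scale)")
```

Ticks equally spaced on the root scale produce labels such as 0, 100, 400, 900, 1600, the same pattern as squares of equally spaced values on any square root axis.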

[Figure: Nightingale's data on mortality in the Crimea. Annualised rates per 1000, 0 to 1000, for 1854 to 1856: zymotic disease, wounds and injuries, all other causes.]

[Figure: the same series on a square root scale, with axis labels 0, 25, 100, 225, 400, 625, 900.]

Sparkline?

A sparkline display is useful to show relative shape, such as times of peaks.

Seasonality is only part of what is being seen. The harsh winter of 1854–5 coincided with some of the hardest battles of the war.

[Figure: sparklines for Nightingale's data on mortality in the Crimea, annualised rates per 1000, 1854 to 1856, with minimum and maximum labelled: zymotic disease 1.4 to 1022.8, wounds and injuries 0.4 to 115.8, all other causes 2.5 to 140.1.]

Arctic sea ice

Another time series example concerns seasonal variation in Arctic sea ice for 2002-13, just 12 annual series.

The usual spaghetti plot shows the similarity of series well, but makes comparing them difficult. Although some people try using a key or legend, that rarely works well beyond a very few series.

Separating out the series runs into the opposite problem: each series is easy to see, but comparing one with another is harder.

[Figure: spaghetti plot of Arctic sea ice extent (million km²), 0 to 15, from 1 Jan to 31 Dec, with all years 2002-13 superimposed.]

[Figure: Arctic sea ice extent (million km²), 5 to 15, from 1 Jan to 31 Dec, in twelve separate panels, one for each year 2002 to 2013.]

Combine: backdrop as context

So, use both ideas:

Plot all data as a backdrop (subdued, say using grayscale).

Plot each series within its context (with stronger colour, thicker line).

See for discussion Cox, N. J. 2010. Graphing subsets. Stata Journal 10: 670–681.
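
The two steps can be sketched with official commands only, here on the Grunfeld data because the sea ice data are not to hand. This version loops over panels and uses graph combine; it is one possible route and not necessarily the method of the paper cited. The graph names g1 to g10 are arbitrary.

```stata
webuse grunfeld, clear

forvalues i = 1/10 {
    * all companies in gray as backdrop; company `i' emphasised on top
    twoway (line invest year, connect(L) lcolor(gs12))                       ///
           (line invest year if company == `i', lcolor(blue) lwidth(thick)), ///
           legend(off) title("company `i'") name(g`i', replace) nodraw
}

* common y axis so that panels remain comparable
graph combine g1 g2 g3 g4 g5 g6 g7 g8 g9 g10, ycommon
```

connect(L) relies on the data being sorted by company and year, as they are when first loaded, so that each company's line breaks where year restarts.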

[Figure: Arctic sea ice extent (million km²), 0 to 15, from 1 Jan to 31 Dec, in twelve panels for 2002 to 2013, each year shown within the context of all the data.]

Cross-fertilisation

Titanic data

The Titanic sank in 1912. Statistically, we want to explain fraction survived in terms of age, sex and class of those on board.

A standard graph is a stacked or divided bar graph, but it lacks punch. The command used was catplot (SSC).

So, we end with something rather different, produced with devnplot.

[Figure: stacked bar chart of fraction died and survived, 0 to 1, by age (child, adult), class (first, second, third, crew) and sex (f, m).]

[Figure: devnplot of fraction survived, 0 to 1, by age (child, adult) and sex (female, male). Markers 1, 2, 3, C = first, second, third class and crew. Level is weighted mean for age and sex.]

