Design principles for data presentation: Translating analysis into visualization
Andrew Eppig, UC Berkeley
CAIR Annual Conference
8 November 2012
CAIR 2012 Andrew Eppig, 8 November 2012
Overview
• Design Principles
• Data Sample and Summary
• Why Charts Matter
• Visualization Process
• Chart Selection
• Layout
• Aesthetics
• Self-Sufficiency Check
• Institutional Examples and Applications
Guiding Design Principles
• Good visualizations start with good data and detailed analysis
• Know your data
• Good visualizations directly answer specific, focused questions
• Know what question(s) you are asking
• Good visualizations get out of the way of the data
• Let the data tell its story without excess clutter or distraction
“Too often we pay more attention to ‘pretty’ than to the most important element: information.”
-- Dona Wong, The Secrets of Graphics Presentation
Data Sample, Summary, and Metrics
Name              Weight (lb.)   Height (in.)   BMI
Batman            210            74             27.0
Michael Phelps    165            75             20.6
Wonder Woman      130            72             17.6
Hope Solo         140            69             20.7

Gender   Group      N       BMI Mean   BMI Std. Dev.
Male     Comics     1,239   26.0       4.2
Male     Athletes   403     23.8       3.5
Male     Models     493     21.6       2.3
Female   Comics     505     20.3       3.4
Female   Athletes   254     22.0       3.4
Female   Models     489     18.2       2.7

BMI = 703 x weight [lb] / height [in]²
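The BMI values in the sample table follow directly from the formula. The sketch below is illustrative only: the function name `bmi` and the one-decimal display are assumptions, not part of the original slides.

```python
def bmi(weight_lb: float, height_in: float) -> float:
    """Body Mass Index from weight in pounds and height in inches."""
    if height_in <= 0:
        raise ValueError("height must be positive")
    return 703 * weight_lb / height_in ** 2


# Sample rows from the slide: (name, weight in lb., height in in.)
sample = [
    ("Batman", 210, 74),
    ("Michael Phelps", 165, 75),
    ("Wonder Woman", 130, 72),
    ("Hope Solo", 140, 69),
]

for name, weight, height in sample:
    print(f"{name}: {bmi(weight, height):.1f}")
```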
Default Excel Charting
[Chart: Average BMI by Group and Gender. Groups: Models, Athletes, Comics; legend: Female, Male; y-axis: 0.000 to 35.000]
Excessive Chart Junk
[Chart: Average BMI by Group and Gender. Groups: Models, Athletes, Comics; legend: Female, Male; y-axis: BMI, 17.000 to 27.000 in steps of 1.000; data labels: 18.200, 21.600, 22.000, 23.800, 20.300, 26.000]
Improved Excel Charting
[Chart: Average BMI by Group and Gender. Two sets of bars, Female and Male, each labeled directly by group (Comics, Athletes, Models); data labels: 18, 22, 22, 24, 20, 26]
Visualization Checklist
• What stories does the data tell? Which story do you want to tell?
• What visualization will best aid the story?
• Who is the audience?
• What metric should you use?
• Which type of chart should you use?
• What is the layout of the visualization?
• How can details enhance the chart?
• Font, color, lines/shading, and text
Source: Kaiser Fung, Junk Charts
“What are the content-reasoning tasks that this display is supposed to help with?” -- Edward Tufte, Beautiful Evidence
Chart Selection
“Meaningful quantitative information always involves relationships. When displayed in graphs, these relationships always boil down to one or more of eight specific relationships: time series, ranking, part-to-whole, deviation,
distribution, correlation, geospatial, nominal comparison.” -- Stephen Few, Designing Effective Tables and Graphs
[Infographic: Body Mass Index Distributions. Super Humans or Super Models?]
Analysis Question: Are comic book superheroes’ bodies more like top athletes’ or top models’ bodies? Are comparisons the same for both men and women? To account for differences in height and weight, Body Mass Index (BMI) is used: BMI = 703 x weight / height². Comparing the BMI distributions will reveal similarities or differences.
Six panels, each plotting Probability against Body Mass Index (15 to 35). The blue shaded area shows the middle 50% of the superhero BMI distributions.
Panel                N       Median BMI   Share of distribution inside shaded area
Male Superheroes     1,239   25.2         50%
Male Athletes        403     23.2         39%
Male Models          493     21.6         17%
Female Superheroes   505     19.7         50%
Female Athletes      254     21.4         31%
Female Models        489     17.7         28%
Summary: Male superheroes tend to have a higher BMI than top male athletes and a much higher BMI than top male models. So male superheroes are beyond super human, but more like top male athletes than like top male models. Female superheroes tend to have a lower BMI than top female athletes and a higher BMI than top female models. So female superheroes are neither super humans nor super models but between top female athletes and top female models. Female superheroes also have less variation in their BMI than male superheroes or top athletes of either gender.
Analysis, Writing, Art, and Lettering by: Andrew Eppig
Sources: DC Comics (dc.wikia.com); Marvel Comics (Marvel.com/universe); 2008 US Olympic Team (www.2008.nbcolympics.com); Models.com (models.com)
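The "share of distribution inside the shaded area" annotations compare each group with the middle 50% of the superhero BMI distribution. One way to compute such a share is sketched below. The sample values are hypothetical, and the quartile method ("inclusive") is an assumption; the slides do not say how the quartiles were calculated.

```python
import statistics


def middle_half(values):
    """Return (q1, q3), the bounds of the middle 50% of the values."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def share_inside(values, low, high):
    """Fraction of values that fall within [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Hypothetical BMI samples, for illustration only (not the slide data).
superheroes = [22.0, 23.5, 24.0, 25.0, 25.5, 26.0, 27.0, 28.5, 30.0]
athletes = [20.5, 21.0, 22.5, 23.0, 23.5, 24.5, 25.0, 26.5]

low, high = middle_half(superheroes)
print(f"shaded area: {low:.1f} to {high:.1f}")
print(f"athletes inside shaded area: {share_inside(athletes, low, high):.0%}")
```

A low share means the comparison group's distribution sits largely outside the superhero range, which is the visual cue the shaded band is meant to give.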
Visualization Layout
“Tufte’s (1990) recommendation of ‘small multiples’ [...] uses the replication in the display to facilitate comparison to the
implicit model of no change between the displays.” -- Andrew Gelman, Exploratory Data Analysis for Complex Models
[Infographic: Body Mass Index Distributions, laid out as small multiples. Six panels (Male Superheroes, Female Superheroes, Male Athletes, Female Athletes, Male Models, Female Models), each plotting Probability against Body Mass Index (15 to 35), with a blue shaded area marking the middle 50% of the superhero BMI distributions. Analysis, Writing, Art, and Lettering by: Andrew Eppig]
Aesthetic Considerations
• Font: How do you increase legibility and decrease distraction?
• Color: Which color palette is appropriate?
• Line/Shading: Which weight, color, and style will enhance the final product?
• Text: Can adding labels and narrative provide useful context?
“Hue contrast is easy to overuse to the point of visual clutter. A better approach is to use a few high chroma colors as color contrast in a presentation consisting primarily of grays and muted colors.”
-- Maureen Stone, Choosing Colors for Data Visualization
Source: Chemical Color Plate Corp.
Final Infographic
[Infographic: Body Mass Index Distributions. Super Humans or Super Models? Final version with title, analysis question, six BMI distribution panels (male and female superheroes, athletes, and models), panel annotations (N, median BMI, and share of distribution inside the shaded area), summary, credits, and sources. Analysis, Writing, Art, and Lettering by: Andrew Eppig]
Self-Sufficiency Test
“Can the graphical elements stand on their own feet? If one removes the numbers from the graphic, can one still understand the key messages?” -- Kaiser Fung, Junk Charts
[Figure: the six infographic panels shown without numbers or annotations: Male Superheroes, Female Superheroes, Male Athletes, Female Athletes, Male Models, Female Models. x-axis: Body Mass Index; y-axis: Probability]
Chart Function and Selection
Analytical Relationship    Highlighted Feature
Time Series                Changes over time
Ranking                    Relative position
Part-to-Whole              Fraction of whole
Deviation                  Differences between sets
Distribution               Range and frequency
Correlation                Relationship between sets
Geospatial                 Location
Nominal Comparison         Group values
Source: Stephen Few, Show Me the Numbers (2nd edition)
Line Charts - Time Series
[Chart: UC Berkeley Undergraduate Fall Enrollment, 1983-2012. y-axis: Headcount (20,000 to 26,000); x-axis: year (1980 to 2010)]
✘ x-axis has too many labels and labels are slanted
✘ y-axis scale is too large
✘ y-axis gridlines are too heavy
✓ x-axis labels are horizontal and legibly spaced
✓ data range is roughly 2/3 of the y-axis scale
✓ y-axis gridlines are light in color and weight
Source: UC Berkeley, Cal Answers
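The guideline that the data range should fill roughly 2/3 of the y-axis scale can be turned into a small calculation. The sketch below is one possible reading, assuming the padding is split evenly above and below the data; the function name `y_axis_limits` and the sample headcounts are hypothetical.

```python
def y_axis_limits(values, fill=2 / 3):
    """Return (low, high) so the data range spans `fill` of the axis."""
    data_min, data_max = min(values), max(values)
    data_range = data_max - data_min
    if data_range == 0:
        raise ValueError("values must not all be equal")
    axis_range = data_range / fill
    padding = (axis_range - data_range) / 2  # split evenly above and below
    return data_min - padding, data_max + padding


# Hypothetical headcounts, for illustration only (not the Cal Answers data).
headcounts = [21_000, 22_500, 23_000, 24_000, 25_000]
low, high = y_axis_limits(headcounts)
print(f"y-axis from {low:,.0f} to {high:,.0f}")
```

In practice the computed limits would then be rounded outward to the nearest major increment so the axis labels stay readable.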
Line Charts - Distribution
[Chart: Undergraduate Years to Degree, UC Berkeley Fall 2004 Cohort. y-axis: Headcount (0 to 2,500); x-axis: Years to Graduation (1 to 7)]
✘ y-axis labels have extraneous decimal points
✘ x-axis labels use an unintuitive interval
✘ axis labels are too heavy
✓ y-axis labels are rounded to the nearest major increment
✓ axis labels use natural intervals (1, 2, 3...; 2, 4, 6...; 10, 20, 30...)
✓ x-axis labels are light in size and weight
Source: UC Berkeley, Cal Answers
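Choosing a natural interval for axis labels can be automated. The sketch below picks the smallest step of the form 1, 2, or 5 times a power of ten that keeps the number of intervals at or below a limit. Including 5 as a multiplier is a common convention and an assumption here, since the slide lists only steps of 1, 2, and 10; the function names are hypothetical.

```python
import math


def natural_step(data_max, max_intervals=6):
    """Smallest step of 1, 2, or 5 times a power of ten that covers
    0..data_max in at most `max_intervals` intervals."""
    if data_max <= 0:
        raise ValueError("data_max must be positive")
    raw = data_max / max_intervals
    magnitude = 10 ** math.floor(math.log10(raw))
    for multiplier in (1, 2, 5, 10):
        step = multiplier * magnitude
        if step >= raw:
            return step
    return 10 * magnitude  # fallback for floating-point edge cases


def tick_values(data_max, max_intervals=6):
    """Tick values from 0 up to the first tick at or above data_max."""
    step = natural_step(data_max, max_intervals)
    count = math.ceil(data_max / step)
    return [i * step for i in range(count + 1)]


# Hypothetical maximum headcount, for illustration only.
print(tick_values(2300))
```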
Bar Charts - Ranking
[Chart: Undergraduate Major Headcount by Division, UC Berkeley Fall 2012. Bars labeled with headcount]
Division                        Headcount
L&S-Social Sciences             3,757
College of Engineering          3,158
College of Natural Resources    1,774
L&S-Arts & Humanities           1,731
L&S-Undergraduate               1,464
L&S-Bio Sciences                1,151
L&S-Administered Programs       1,101
L&S-Math & Phys. Sciences       977
College of Chemistry            846
Haas School of Business         680
College of Env. Design          579
Other EVCP Programs             6
✘ x-axis labels are slanted and small
✘ units are ranked alphabetically
✓ labels are horizontal and easy to read
✓ units are usefully ranked by descending value
Source: UC Berkeley, Cal Answers
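Ranking by value rather than alphabetically is a one-line sort. The sketch below uses the headcounts as read from the slide's data labels and prints a rough text bar chart; the 40-character bar width is an arbitrary choice for illustration.

```python
# Headcounts as read from the slide's data labels, in alphabetical order.
headcount_by_division = {
    "College of Chemistry": 846,
    "College of Engineering": 3158,
    "College of Env. Design": 579,
    "College of Natural Resources": 1774,
    "Haas School of Business": 680,
    "L&S-Administered Programs": 1101,
    "L&S-Arts & Humanities": 1731,
    "L&S-Bio Sciences": 1151,
    "L&S-Math & Phys. Sciences": 977,
    "L&S-Social Sciences": 3757,
    "L&S-Undergraduate": 1464,
    "Other EVCP Programs": 6,
}

# Rank by descending headcount instead of by name.
ranked = sorted(headcount_by_division.items(), key=lambda item: item[1], reverse=True)

widest = max(len(name) for name in headcount_by_division)
largest = ranked[0][1]
for name, count in ranked:
    bar = "#" * round(40 * count / largest)
    print(f"{name:<{widest}}  {bar} {count:,}")
```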
Bar Charts - Part-to-Whole
[Chart: Undergraduate Demographic Shares by Race/Ethnicity, UC Berkeley Fall 2012. Bars labeled with share; value axis: 0% to 40%]
Race/Ethnicity                    Share
Asian                             39%
White                             29%
Chicano/Latino                    13%
International                     10%
Decline to State                  5%
African American                  3%
Native American/Alaskan Native    1%
Pacific Islander                  0%
✘ values are hard to determine
✘ gaps between bars are too large
✓ values are easy to read
✓ gaps between bars are 25-50% of the bar width
Source: UC Berkeley, Cal Answers
Bar Charts - Histogram
[Chart: UC Berkeley Undergraduate Cumulative GPA, Spring 2012. y-axis: Count (0 to 2,500); x-axis: Cumulative GPA (0.0 to 4.0)]
✘ y-axis gridlines are too dense
✘ data labels distract, add clutter
✓ y-axis gridlines are only on the major increments
✓ shape of the distribution is easily seen without distraction
Source: UC Berkeley, Cal Answers
Bar Charts - Time Series
[Chart: UC Berkeley International Undergraduate Fall Enrollment, 1990-2012. y-axis: Headcount (0 to 3,000); x-axis: year (1990 to 2010)]
✘ time variable is shown from right to left
✘ y-axis is used for time variable
✘ multiple hues are used for the same kind of data
✓ years increase from left to right
✓ years are plotted on x-axis
✓ bars are colored using shades of a single hue
Source: UC Berkeley, Cal Answers
Dot Plots - Deviation
[Chart: UC Berkeley New Freshmen 6-Year Graduation Rates by Race/Ethnicity, Fall 2004 Cohort. x-axis: 6-Year Graduation Rate (60% to 100%); legend: Men, Women; groups: Asian, Other/Decline to State, White, Pacific Islander, Native American/Alaskan Native, Chicano/Latino, International, African American]
✘ points are too small
✘ group colors are too similar
✓ dots’ size makes them easy to see
✓ group colors are complementary
Source: UC Berkeley, Cal Answers
Scatter Plots - Correlation
[Chart: Impact of Critical Mass on Respect Rates by Race/Ethnicity and by UC Campus, 2007-2008 AY. y-axis: Respect Rate* (40% to 100%); x-axis: Share of Race/Ethnicity Among New Students on Campus (0% to 50%); legend: African American, Chicano/Latino, Asian/Pacific Islander, White]
* Respect Rate = percentage of students of a given race/ethnicity who responded strongly agree, agree, or somewhat agree to the prompt “students of my race/ethnicity are respected at this campus” on UCUES.
✘ too many bright hues are used to mark groups
✘ chart length distorts the correlation of the data
✓ groups are marked using muted hues
✓ chart dimensions are close to the golden ratio (1:1.618)
Source: UC Accountability Report, 2011
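The golden-ratio guideline translates into a simple size calculation. The sketch below assumes a landscape chart whose height-to-width ratio is 1:1.618; the function name `chart_height` and the 8-unit width are hypothetical.

```python
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2  # about 1.618


def chart_height(width, ratio=GOLDEN_RATIO):
    """Height of a landscape chart whose width is `ratio` times its height."""
    if width <= 0:
        raise ValueError("width must be positive")
    return width / ratio


# Hypothetical chart width of 8 units (inches, for example).
print(f"{chart_height(8.0):.2f}")
```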
Pie Charts - Part-to-Whole
[Chart: UC Berkeley College of Letters & Sciences Enrollment Shares, Fall 2012. Pie with five slices (37%, 25%, 17%, 11%, 10%); legend: Social Sciences, Arts & Humanities, Biological Sciences, Math & Physical Sciences, Other L&S Divisions]
✘ more than five groups are shown
✘ slices are blown out of the pie
✘ smallest slices are given too prominent location
✓ only five groups are shown with the rest aggregated together
✓ center of the pie is shown making angles visible
✓ groups are ordered usefully with the largest slices at the top
Source: UC Berkeley, Cal Answers
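Limiting a pie chart to five groups usually means aggregating the remainder into a single slice. The sketch below shows one generic rule, keeping the largest groups and summing the rest. This rule is an assumption and not necessarily the grouping used on the slide; the group names and counts are hypothetical.

```python
def aggregate_slices(values, max_slices=5, other_label="Other"):
    """Keep the largest (max_slices - 1) groups and sum the rest."""
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= max_slices:
        return ranked
    kept = ranked[: max_slices - 1]
    rest = sum(value for _name, value in ranked[max_slices - 1:])
    return kept + [(other_label, rest)]


def shares(slices):
    """Convert (name, value) pairs into (name, fraction of total) pairs."""
    total = sum(value for _name, value in slices)
    return [(name, value / total) for name, value in slices]


# Hypothetical group counts, for illustration only.
counts = {"A": 400, "B": 250, "C": 150, "D": 100, "E": 60, "F": 40}

for name, share in shares(aggregate_slices(counts)):
    print(f"{name}: {share:.0%}")
```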
Visualization Layout: Attention Areas
High Visual Focus
Good for primary content
Medium Visual Focus
Good for secondary content
Medium Visual Focus
Good for secondary content
Low Visual Focus
Good for tertiary content
Visualization Aesthetics: Color
✘ avoid alternating high contrast hues
✘ avoid using more than one high chroma hue
✓ use a palette mostly of grays and muted hues
✓ choose a few high chroma colors for contrast
✓ use shades and tints to ensure that a black-and-white copy will still be coherent
[Figure: two sample palettes, “Color Palette - Bright” (high chroma) and “Color Palette - Muted”]
Source: Dona Wong, The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures
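Whether a palette survives a black-and-white copy can be checked roughly by converting each color to a gray level. The sketch below uses the Rec. 601 luma weights as an approximation. The minimum gap of 30 gray levels and the sample palettes are assumptions chosen for illustration, not values from the slides.

```python
def luma(hex_color):
    """Approximate gray level (0-255) using Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    red, green, blue = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * red + 0.587 * green + 0.114 * blue


def distinct_in_grayscale(colors, min_gap=30):
    """True if neighboring gray levels all differ by at least min_gap."""
    levels = sorted(luma(color) for color in colors)
    return all(b - a >= min_gap for a, b in zip(levels, levels[1:]))


# Hypothetical palettes, for illustration only.
shades_of_blue = ["#08306b", "#4292c6", "#c6dbef"]  # one hue, three shades/tints
red_and_green = ["#d62728", "#2ca02c"]  # two contrasting hues

print(distinct_in_grayscale(shades_of_blue))
print(distinct_in_grayscale(red_and_green))
```

Two hues that contrast strongly in color can land on nearly the same gray, which is why shades and tints of a single hue tend to print more reliably.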
Visualization Aesthetics: Font
[Chart: UC Berkeley New Freshmen Yield Rates by Race/Ethnicity, Fall 2010 Cohort. x-axis: Yield Rate (20% to 50%); legend: Women, Men; groups: Asian, Other/Decline to State, African American, Pacific Islander, Chicano/Latino, White, Native American/Alaskan Native, International. The poor version sets the title, legend, and axis title in all capitals.]
✘ bold and condensed fonts confuse the viewer
✘ multiplicity of fonts deters legibility
✓ font choice, weight, and spacing aid clarity
✓ single font used for labels -- second font only used for the title
Source: UC Berkeley, Cal Answers
Visualization Aesthetics: Lines/Shading
[Chart: UC New Fall Undergraduate Enrollment by Campus, 1995-2011. y-axis: Headcount (0 to 7,000); x-axis: year (1995 to 2010). The poor version uses a legend with nine campuses (Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, Santa Barbara, Santa Cruz, Berkeley); the improved version directly labels two groups, UC Berkeley and Other UC Campuses.]
✘ more than four groups are identified in one chart
✘ weight of lines blurs trend details
✘ label position makes identification hard
✓ only two groups are identified
✓ line weights are used for emphasis
✓ lines are directly labeled
Source: UC Accountability Report, 2011
Visualization Aesthetics: Labels/Text
[Chart: UC Berkeley Undergraduate New Enrollment Shares by Gender and Race/Ethnicity, 1983-2012. Four panels (White, Asian/Pacific Islander, Underrepresented Minority, International), each with directly labeled lines for Women and Men; y-axis: Share (0% to 30%); x-axis: year (1980 to 2010); annotation: Prop 209 enacted]
Prop 209 banned affirmative action in 1997, precipitating a sharp decline in underrepresented minority (URM) student shares, which have yet to recover.
The overall gender gap, with women outnumbering men, is driven by Asian and URM students, where the gender gaps are largest.
Source: UC Berkeley, Cal Answers
Summary
• Know what question you are asking a visualization to answer
• Choose the best metric for your analysis and your audience
• Choose your chart to fit your question rather than your question to fit your chart
• Let the data tell its story without excess clutter or distraction
• Keep the focus of the visualization on the data
• Make sure every use of font, color, shading, and text enhances rather than distracts
• Provide narrative to contextualize the highlights of the data
Contact Information
Please feel free to contact me with questions or comments
Andrew Eppig
Research Analyst
Equity & Inclusion
UC Berkeley
104 California Hall #1500
Berkeley, CA 94720-1500
Web Resources
• Junk Charts -- Kaiser Fung
• http://junkcharts.typepad.com
• Flowing Data -- Nathan Yau
• http://flowingdata.com/
• Charts ‘n’ Things -- NY Times Graphics Department
• http://chartsnthings.tumblr.com/
• Perceptual Edge -- Stephen Few
• http://www.perceptualedge.com
Print Resources
Edward Tufte
• The Visual Display of Quantitative Information, 1983, Cheshire, CT: Graphics Press
• Visual Explanations: Images and Quantities, Evidence and Narrative, 1997, Cheshire, CT: Graphics Press
William Cleveland
• The Elements of Graphing Data, 1994, revised ed., Murray Hill, NJ: AT&T Bell Laboratories
Dona Wong
• The Wall Street Journal Guide to Information Graphics: The Dos and Don’ts of Presenting Data, Facts, and Figures, 2010, New York: W.W. Norton and Co.
Stephen Few
• Information Dashboard Design: The Effective Visual Communication of Data, 2006, Oakland, CA: Analytics Press
• Now You See It: Simple Visualization Techniques for Quantitative Analysis, 2009, Oakland, CA: Analytics Press
• Show Me the Numbers: Designing Tables and Graphs to Enlighten, 2012, second ed., Oakland, CA: Analytics Press
Appendices
Classic Charts
Charles Minard's 1869 chart showing the number of men in Napoleon’s 1812 Russian campaign army, their movements, as well as the temperature they encountered on the return path. Lithograph, 62 x 30 cm
Classic Charts
Detail from John Snow's spot map of the Golden Square outbreak [1854 London cholera outbreak] showing area enclosed within the Voronoi network diagram. Snow's original dotted line to denote equidistance between the Broad Street pump and the nearest alternative pump for procuring water has been replaced by a solid line for legibility. Fold lines and tear in original (adapted from CIC, between 106 and 07).
Bad Chart Examples
[Figure caption: The 1978 dollar should be twice as big as shown. Graphic reprinted in Edward Tufte, 1983.]
The problem:
• The 1978 dollar should be roughly half as big as the 1958 dollar ($0.44 vs. $1.00), but it is drawn roughly one quarter as big
How the problem occurred:
• The chart uses 2-D graphics (i.e., representations of dollar bills with length and width), and both the length and the width were scaled by 1/2, so the area was scaled by 1/4 (1/2 x 1/2)
The fix:
• When dealing with 2-D area representations (never use 3-D), remember to scale the area rather than scaling each dimension separately
Source: Tufte, 1983
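The arithmetic behind the fix can be checked directly: to make area proportional to the value, each dimension is scaled by the square root of the value ratio. The sketch below uses the $0.44 and $1.00 values from the slide; the function name is hypothetical.

```python
import math


def linear_scale_for_area(value_ratio):
    """Factor to apply to BOTH dimensions so that area scales by value_ratio."""
    if value_ratio < 0:
        raise ValueError("value_ratio must be non-negative")
    return math.sqrt(value_ratio)


ratio = 0.44 / 1.00  # 1978 dollar relative to the 1958 dollar, from the slide

wrong_area = 0.5 * 0.5  # each dimension halved, as in the flawed chart
factor = linear_scale_for_area(ratio)
right_area = factor * factor

print(f"each dimension x 0.500 -> area ratio {wrong_area:.2f}")
print(f"each dimension x {factor:.3f} -> area ratio {right_area:.2f}")
```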
Bad Chart Examples
The problem:
• The message (growth of medical spending in emerging markets) is obfuscated and exaggerated
How the problem occurred:
• The chart uses too many bold colors, which creates visual confusion
• The chart uses pie charts for each year, which makes it hard to see trends
• The chart scales the pie charts incorrectly, scaling the radius rather than the area, which distorts the changes
The fix:
• When dealing with trend data, time series using line charts are the best choice
Source: “Expanding Circles of Error”, Junk Charts
Data Exploration via Visualization
[Figure: Historical Example]
John Arbuthnot’s 1710 analysis of London Bills of Mortality did not utilize graphical methods. Wainer’s plot of the data is a revelation.
Howard Wainer’s visualization of John Arbuthnot’s 1710 analysis of London Bills of Mortality not only depicts historical incidents, it also provides a check for data quality. The 1704 spike is not associated with any historical incident. A check of the data reveals a transcription error by Arbuthnot where the 1674 data point was mistakenly labeled as 1704.
Source: Wainer, 2009
Infographic Creation Details
Data Preparation Steps
• Source identification
• Data collection
• Data scrubbing
• Data analysis
Infographic Source Identification
• Super heroes and villains: DC and Marvel
• http://dc.wikia.com/
• http://marvel.com/universe/Main_Page
• Top athletes: 2008 US Olympic Team
• http://www.2008.nbcolympics.com/athletes/index.html
• Top models: models.com listings
• http://models.com/
Infographic Data Collection
• Create Python web scraper
• Crawl web sites
• Download web pages
• Extract height, weight, and gender data
• Save data to file
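The extraction and save steps above might look like the sketch below. Everything specific in it is an assumption: the HTML fragment is hypothetical (the real source pages are structured differently), the field names are invented, and crawling and downloading are omitted so that the example runs offline.

```python
import csv
import io
import re

# Hypothetical page fragment; real pages on the source sites differ.
SAMPLE_HTML = """
<table class="stats">
  <tr><th>Gender</th><td>Male</td></tr>
  <tr><th>Height</th><td>6' 2"</td></tr>
  <tr><th>Weight</th><td>210 lbs</td></tr>
</table>
"""

ROW = re.compile(r"<th>(?P<field>[^<]+)</th>\s*<td>(?P<value>[^<]+)</td>")


def extract_record(html):
    """Return gender, height (in.), and weight (lb.) from one page."""
    fields = {m["field"].strip().lower(): m["value"].strip() for m in ROW.finditer(html)}
    feet, inches = re.search(r"(\d+)'\s*(\d+)", fields["height"]).groups()
    pounds = re.search(r"(\d+)", fields["weight"]).group(1)
    return {
        "gender": fields["gender"].lower(),
        "height_in": int(feet) * 12 + int(inches),
        "weight_lb": int(pounds),
    }


record = extract_record(SAMPLE_HTML)

# Write to an in-memory CSV; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["gender", "height_in", "weight_lb"])
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```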
Infographic Data Scrubbing
• Check data quality
• Did extraction get correct height and weight?
• Are there duplicate entries?
• Remove super hero and super villain outliers
• Define height window based on athlete and model data
• Define weight window based on athlete and model data
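The scrubbing steps can be sketched as a duplicate check followed by a window filter. The slides do not say how the height and weight windows were defined; the sketch below assumes the minimum and maximum observed among athletes and models. All records and field names are hypothetical.

```python
def drop_duplicates(records):
    """Keep the first record seen for each name."""
    seen = set()
    unique = []
    for record in records:
        if record["name"] not in seen:
            seen.add(record["name"])
            unique.append(record)
    return unique


def window(reference, field):
    """(min, max) of a field across the reference records."""
    values = [record[field] for record in reference]
    return min(values), max(values)


def within_windows(records, reference, fields=("height_in", "weight_lb")):
    """Keep records whose fields all fall inside the reference windows."""
    bounds = {field: window(reference, field) for field in fields}
    return [
        record for record in records
        if all(bounds[f][0] <= record[f] <= bounds[f][1] for f in fields)
    ]


# Hypothetical records, for illustration only.
athletes_and_models = [
    {"name": "Athlete A", "height_in": 58, "weight_lb": 90},
    {"name": "Athlete B", "height_in": 86, "weight_lb": 340},
    {"name": "Model C", "height_in": 70, "weight_lb": 120},
]
comics = [
    {"name": "Hero X", "height_in": 74, "weight_lb": 210},
    {"name": "Hero X", "height_in": 74, "weight_lb": 210},     # duplicate entry
    {"name": "Giant Y", "height_in": 300, "weight_lb": 5000},  # outlier
]

cleaned = within_windows(drop_duplicates(comics), athletes_and_models)
print([record["name"] for record in cleaned])
```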
Infographic Data Analysis
• Combine all data in R
• Super heroes and villains, athletes, and models
• Create dummy variables
• Gender: male, female
• Source: super hero/villain, athlete, model
• Calculate BMI for each record
• Check summary statistics
• Data ranges, mean, standard deviation
• Run t-tests between groups
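A group comparison can be illustrated from the summary statistics alone. The original analysis was done in R and the slides do not say which t-test variant was used; the Python sketch below assumes Welch's unequal-variance form and takes the means, standard deviations, and group sizes from the data summary slide.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Summary statistics from the data summary slide: (BMI mean, std. dev., N)
male_comics = (26.0, 4.2, 1239)
male_athletes = (23.8, 3.5, 403)

t, df = welch_t(*male_comics, *male_athletes)
print(f"t = {t:.2f}, df = {df:.0f}")
```

Because the inputs are rounded summary values, the result only approximates what a test on the full records would give.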
Height Distributions
[Figure: Height Distributions. Six panels: Male Superheroes (N = 1,289), Female Superheroes (N = 1,289), Male Athletes (N = 403), Female Athletes (N = 254), Male Models (N = 493), Female Models (N = 479). x-axis: Height (ft.), 5’0” to 7’0”; y-axis: Probability]
Panel annotations, in extracted order (the panel each belongs to is not recoverable from the extracted text):
• 59% of distribution is inside shaded area; median height = 5’8”
• 66% of distribution is inside shaded area; median height = 6’0”
• 35% of distribution is inside shaded area; median height = 6’1”
• 50% of distribution is inside shaded area; median height = 6’0”
• 50% of distribution is inside shaded area; median height = 5’7”
• 42% of distribution is inside shaded area; median height = 5’8”
Weight Distributions
[Figure: Weight Distributions. Six panels: Male Superheroes (N = 1,289), Female Superheroes (N = 1,289), Male Athletes (N = 403), Female Athletes (N = 254), Male Models (N = 493), Female Models (N = 479). x-axis: Weight (lbs.), 100 to 250; y-axis: Probability]
Panel annotations, in extracted order (the panel each belongs to is not recoverable from the extracted text):
• 50% of distribution is inside shaded area; median weight = 185
• 50% of distribution is inside shaded area; median weight = 128
• 40% of distribution is inside shaded area; median weight = 175
• 38% of distribution is inside shaded area; median weight = 140
• 35% of distribution is inside shaded area; median weight = 160
• 39% of distribution is inside shaded area; median weight = 115
With Revised Data
[Figure: Body Mass Index Distributions with revised data. Six panels: Male Superheroes (N = 1,289), Female Superheroes (N = 1,289), Male Athletes (N = 3,631), Female Athletes (N = 2,610), Male Models (N = 493), Female Models (N = 479). x-axis: Body Mass Index, 15 to 35; y-axis: Probability]
Panel annotations, in extracted order (the panel each belongs to is not recoverable from the extracted text):
• 50% of distribution is inside shaded area; median BMI = 25.4
• 47% of distribution is inside shaded area; median BMI = 23.9
• 17% of distribution is inside shaded area; median BMI = 21.6
• 28% of distribution is inside shaded area; median BMI = 18.3
• 50% of distribution is inside shaded area; median BMI = 19.7
• 32% of distribution is inside shaded area; median BMI = 21.5