An introduction to designing and building data visualizations

Post on 25-Feb-2016


Kristen Sosulski
ksosulsk@stern.nyu.edu

About me

One of many influences…

Agenda

I. What is data visualization?
II. What types of stories can you tell with a visualization?
III. How to approach creating visualizations?
IV. Try it and apply it.

I. DEFINING VISUALIZATION

Visualization is a kind of narrative, providing a clear answer to a question without extraneous details.

-- Ben Fry, 2008, p. 4.

Visualization is a graphical representation of some data or concepts

-- Colin Ware, 2008, p. 20

Visual design is mapping data to visual form. It should convey the unique properties of the data set it represents.

Visualizations

• Help us think
• Use perception to offload cognition
• Serve as an external aid to augment working memory
• Boost our cognitive abilities

Visualizations are helpful in communication and analysis

Dual channels

Limited capacity

Active Processing

However, visualizations can hinder our message when designed poorly.

Wong, 2010, p. 15

Good Chart Design

Use natural increments for the y-axis scale

Include a zero baseline in all bar charts

Place the largest segment of a pie chart at the top, starting at 12 o’clock

Wong, 2010, p. 143

Data visualization enables us to record, analyze, and communicate

Past Present Future

Rationale

• Traditional reports using tables, rows, and columns do not paint the whole picture or, even worse, lead an analyst to a wrong conclusion.

• Firms need to use data visualization because information workers:
– Cannot see a pattern without data visualization
– Cannot fit all of the necessary data points onto a single screen
– Cannot effectively show deep and broad data sets on a single screen.

Source: Evelson, B. & Yuhanna, N. (2012). The Forrester Wave: Advanced data visualization (ADV) Platforms, Q3, 2012. Forrester Research, July 17.

Patterns: Violence in Video Games News Stories, using a filled density plot

Source: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization.html. Begin at 4:45

Data Points: U.S. Unemployment Rate using a choropleth map

Source: Forbes

Data Points: Student Loan Debt using a bar, line, and area charts

Source: The Federal Reserve Bank of New York: http://www.newyorkfed.org/studentloandebt/

Deep and Broad: Four Ways to Slice Obama’s 2013 Budget Proposal using a bubble pie chart

Source: New York Times: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html

II. WHAT STORIES CAN YOU TELL WITH DATA VISUALIZATION?

Hans Rosling on Poverty using a bubble chart with sliders

http://www.ted.com/talks/hans_rosling_reveals_new_insights_on_poverty.html

All medalists racing the 100 meter sprint

Source: http://www.nytimes.com/interactive/2012/08/05/sports/olympics/the-100-meter-dash-one-race-every-medalist-ever.html

Old vs. New Data Visualization

• Dynamic data == Dynamic visualizations
• Visual querying. Drill downs. Drop downs.
• Animated visualization.
– If a particular dimension, such as time, has hundreds or thousands of values (e.g. daily values over multiple years), manually clicking through every day is not practical.
– An animated scroll up/down is more practical.

You could tell a story like this… or

Patterns: How people spend their time using a stacked area/line graph

Source: New York Times

The growth of Target from 1962 to 2008 using an animated graduated symbol map

Source: FlowingData

How long does it take to afford a beer? Using a horizontal bar chart.

III. HOW TO APPROACH CREATING VISUALIZATIONS

A framework to get started…

Who’s the audience?

What’s the task?

What’s the data?

What’s the best visual display?


What do these charts have in common?

Scatter plot Matrix chart Network diagram

They show a relationship between points.

What do these charts have in common?

Bar Chart Block Histogram

Bubble Chart

They compare a set of values.

What do these charts have in common?

Line Graph Stacked Line/Area Graph

Track rises and falls over time

What do these charts have in common?

Pie Chart Treemaps

Seeing parts of the whole

What do these charts have in common?

Phrase Nets Word Clouds Word Trees

Edward Tufte: On exploring forms of display

http://www.youtube.com/watch?v=Th_1azZA2OY&noredirect=1

After we select our display, we need to apply effective design principles.

Let’s test our knowledge with a graph IQ test.

Graph Design IQ Test

This test will ask you 10 questions to determine how well you understand the principles of good table and graph design. Good luck!

1: Which graph makes it easier to determine whether Mid-Cap U.S. Stock or Small-Cap U.S. Stock has the greater share?

[Figure: "Investment Portfolio Breakdown" drawn as a pie chart and as a bar graph (axis 0% to 20%). Categories: International Stock, Large-Cap U.S. Stock, Bonds, Real Estate, Mid-Cap U.S. Stock, Small-Cap U.S. Stock, Commodities.]

A. Pie Chart
B. Bar Graph

2: Which of these line graphs is easier to read?

[Figure: "Company Sales" in millions of USD (y-axis 0 to 60) by month, Jan to Dec, drawn as a 2-D line graph and as a 3-D line graph.]

A. 2-D Line Graph
B. 3-D Line Graph

3: Which of these two tables is easier to read?

[Tables A and B present the same data in two layouts.]

Sales Summary by Region (USD)
1st Quarter, 2007
Regions are Sorted by Revenue

Region          Revenue        % of Total Revenue   Expenses      Profit         % of Total Profit
Europe          $75,904,604    31.06%               $40,988,486   $34,916,117    22.31%
Canada          $51,572,694    21.10%               $17,534,715   $34,037,978    21.75%
Western US      $42,660,178    17.46%               $11,944,849   $30,715,328    19.63%
Eastern US      $33,977,385    13.90%               $7,135,150    $26,842,134    17.15%
Central US      $26,139,598    10.70%               $3,920,939    $22,218,658    14.20%
Asia            $14,135,278    5.78%                $6,360,875    $7,774,402     4.97%
Total (or Avg)  $244,389,737   100.00%              $87,885,117   $156,504,619   100.00%

A. Table A
B. Table B

4: Which graph makes it easier to focus on the pattern of change through time, instead of the individual values?

[Figure: "2006 Web Traffic" in millions (y-axis 0.0 to 3.0) by month, Jan to Dec, showing Unique Visitors and Page Views as a bar graph and as a line graph.]

A. Bar Graph
B. Line Graph

5: Only one of these graphs accurately encodes the values. The other skews the values in a misleading manner. Which graph presents the data accurately?

[Figure: "Numbers of Shareholders" for Yes, No, and Undecided. Graph A's y-axis runs from 2,000 to 2,500; Graph B's y-axis runs from 0 to 2,500.]

A. Graph A
B. Graph B

6: Which map makes it easier to find all of the counties with positive growth rates?

[Figure: "2006 Growth Rate by County" drawn as Map A and Map B, each with a legend running from -3% through 0% to +3%.]

A. Map A
B. Map B

7: Which graph makes it easier to determine R&D’s travel expense?

[Figure: "2006 Expenses by Department" in USD for R&D, Sales, Management, and Accounting across Payroll, Equipment, Travel, Supplies, Software, and Misc., drawn as a 3D bar graph (axis 0 to 70) and as a 2D bar graph (axis 0 to 80).]

A. 3D Bar Graph
B. 2D Bar Graph

8: In which graph are the labels easier to read?

[Figure: "2006 Marketing Expenditures By Country" in thousands of USD (y-axis 0 to 7,000) for United States, Canada, United Kingdom, Japan, France, Germany, Mexico, and China. Graphs A and B differ in how the axis labels are set.]

A. Graph A
B. Graph B

9: Which graph is easier to look at?

[Figure: "Median Employee Salary by Department and State" in thousands of USD (y-axis 0 to 100) for Nebraska, Oklahoma, and Kansas across Human Resources, Accounting, Management, Sales, and Manufacturing, drawn as Graph A and Graph B.]

A. Graph A
B. Graph B

10: Which table allows you to see the areas of poor performance more quickly?

[Tables A and B present the same data in two layouts.]

2006 Key Metrics

Region    Overall   Revenue       Expenses     Profit       Avg. Order Size
East      Good      $4,652,462    $2,682,765   $1,969,697   $6,845
West      Fair      3,705,426     2,211,773    1,493,653    4,266
North     Fair      3,215,789     2,712,984    502,805      4,568
South     Poor      2,215,752     1,562,735    653,017      1,358
Overall   Fair      $13,789,429   $9,170,257   $4,619,172   $4,259

A. Table A
B. Table B

Above all else show the data

---Edward Tufte

Sometimes decorations can help editorialize about the substance of the graphic. But it’s wrong to distort the data measures—the ink locating values of numbers—in order to make an editorial comment or fit a decorative scheme.

--Edward Tufte

Principles
• Chartjunk
• Data-ink ratio
• Data integrity
– Lie Factor
• Data Richness
• Scales
– Pie chart. Zero point.
• Color
– Color blindness
– Using color sparingly
– Use red for negative earnings
• Attribution

Avoid chart junk

Useless, non-informative, or information-obscuring elements of quantitative information displays.

Chart Junk: Remove grid lines

[Figure: chart "Sales" by quarter, 1st Qtr to 4th Qtr, y-axis 0 to 9.]

Chart Junk: Remove the frame around the visual

[Figure: chart "2010 Sales Data (in millions)" by quarter, 1st Qtr to 4th Qtr, y-axis 0 to 9.]

Chart Junk: Consider if tick marks are necessary

[Figure: chart "2010 Sales Data (in millions)" by quarter, 1st Qtr to 4th Qtr, y-axis 0 to 9.]
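The three chart junk steps (grid lines, frame, tick marks) can be sketched in matplotlib, the plotting library the deck's Python exercise relies on. This is a minimal sketch rather than the code behind the slides: the function name plot_quarterly_sales and its parameters are hypothetical.

```python
def plot_quarterly_sales(values, labels, title):
    """Draw a bar chart with grid lines, frame, and tick marks removed."""
    # Imported inside the function so this file still loads where
    # matplotlib is not installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, values, color="0.4")  # a single neutral gray

    # Chart junk 1: no grid lines.
    ax.grid(False)

    # Chart junk 2: no frame around the visual.
    for side in ("top", "right", "bottom", "left"):
        ax.spines[side].set_visible(False)

    # Chart junk 3: keep the tick labels but hide the tick marks.
    ax.tick_params(axis="both", length=0)

    ax.set_title(title)
    return fig, ax
```

For example, plot_quarterly_sales([8.2, 3.2, 1.4, 1.2], ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"], "2010 Sales Data (in millions)") uses the quarterly values that label the pie chart later in the deck, on the assumption that they are the same sales series.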

Tables and Charts: Remove Gridlines

[The same table appears on two consecutive slides, presumably with and without gridlines.]

2010 Forecast vs. Performance (U.S. $)

         Forecast    Performance
Qtr 1    $85,000     $95,000
Qtr 2    $80,000     $75,000
Qtr 3    $75,000     $65,000
Qtr 4    $60,000     $60,000
Total    $300,000    $295,000

Data Ink Ratio

Reduce the amount of “ink” that does not represent data, so that more of the graphic’s ink is devoted to the data itself.

Data Ink Ratio: Too many bars to represent a single data point

Data Ink Ratio: Consider bin size.

Data Ink Ratio: Would an area chart work better?

Data Ink Ratio: Or a line Chart?

Data Integrity: Lie Factor

Lie Factor = (size of effect shown in graphic) / (size of effect in data)

Data Integrity: Lie Factor = 14.8
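The arithmetic behind a Lie Factor is short enough to sketch. The value 14.8 matches Tufte's fuel-economy example, where a line 0.6 inches long stands for 18 miles per gallon and a line 5.3 inches long stands for 27.5. That this is the graphic on the slide is an assumption, and the function names are hypothetical.

```python
def relative_change(first, second):
    """Size of an effect, expressed as a fraction of the starting value."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


# Assumed example: line lengths in inches versus miles per gallon.
shown = relative_change(0.6, 5.3)     # about 7.83, a 783% increase on paper
actual = relative_change(18.0, 27.5)  # about 0.53, a 53% increase in the data
print(round(lie_factor(0.6, 5.3, 18.0, 27.5), 1))  # 14.8
```

A Lie Factor of 1 means the graphic changes by the same proportion as the data. The further the value is from 1, the more the graphic overstates or understates the effect.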

Data Integrity: Decorate data without lying

Data Integrity: Does a change in perspective help tell your story?

[Figure: "2010 Sales Data (in millions)" by quarter, 1st Qtr to 4th Qtr, axis 0 to 9.]

Data Integrity: Ensure a zero point scale
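A quick calculation shows why the zero point matters for bars. The sketch below compares how tall two bars are drawn relative to each other when the axis starts at zero and when it starts higher. The counts are hypothetical, chosen to resemble the shareholder graphs in the quiz, and drawn_height_ratio is an illustrative name.

```python
def drawn_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars a and b when the axis starts at baseline."""
    if b == baseline:
        raise ValueError("bar b would have zero height")
    return (a - baseline) / (b - baseline)


yes_votes, no_votes = 2400, 2100  # hypothetical counts

# Zero baseline: the taller bar is drawn about 1.14 times as tall.
print(drawn_height_ratio(yes_votes, no_votes))

# Baseline at 2,000: the same two values are drawn four times apart.
print(drawn_height_ratio(yes_votes, no_votes, baseline=2000))
```

With a zero baseline the drawn ratio equals the ratio of the values. Any other baseline changes the picture without changing the data.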

Proportions

[Figure: pie chart "Sales" with segments for 1st Qtr, 2nd Qtr, 3rd Qtr, and 4th Qtr.]

Proportions: What else is wrong?

[Figure: the same pie chart labeled with the values 8.2, 3.2, 1.4, and 1.2.]

Doesn’t add up to 1 or 100%

Better. What’s still wrong?

[Figure: the same pie chart labeled with the proportions 0.586, 0.229, 0.100, and 0.086.]
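The labels on the last two pie charts are related by a simple normalization: each proportion is a quarter's value divided by the total of all four. A minimal sketch of that step, with to_shares as a hypothetical helper name:

```python
def to_shares(values):
    """Convert raw values into proportions that sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot compute shares of a zero total")
    return [value / total for value in values]


# Values taken from the pie chart labels on the slides.
sales = {"1st Qtr": 8.2, "2nd Qtr": 3.2, "3rd Qtr": 1.4, "4th Qtr": 1.2}
shares = dict(zip(sales, to_shares(list(sales.values()))))

for quarter, share in shares.items():
    print(f"{quarter}: {share:.1%}")
# 1st Qtr: 58.6%
# 2nd Qtr: 22.9%
# 3rd Qtr: 10.0%
# 4th Qtr: 8.6%
```

Formatting the shares as percentages with one decimal place is a presentation choice made here, not something the slides prescribe.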

[Figure: "Sales performance compared to forecasted sales 2010" in U.S. $ (y-axis $0 to $200,000), showing Performance and Forecast for Qtr 1 to Qtr 4.]

Data Richness

Rich data means quality data – accurate data from reputable sources plus effective filtering of data for the audience.

Wong, 2010, p. 28

Data Richness. Tell the whole story with an excerpt

Wong, 2010, p. 29

This Year Last Year

Data Richness. However, don’t be misleading….

Wong, 2010, p. 29

This Year Last Year

Data Quantity != Data Richness

Wong, 2010, p. 29

Inconclusive

An upward trend

Color

• Minimize the use of color
• Use shading instead
– From lightest to darkest (no zebra pattern)
• Consider using red for negative earnings.

Color. Some people are color blind

Labeling and Attribution
• Explain encodings.
• The design of every graph has a similar flow. You get the data; encode it with circles, bars, and colors; and then you let others read it.
• The readers have to decode your encodings at this point.
• Describe what the circles, bars, and colors represent.
• Label directly on the data instead of, or in addition to, using a legend.
• Cite your data source.

Source: Wong (2010); Yau (2011), p. 13

IV. TRY IT. APPLY IT.

Which MBA?

Let’s try to create something similar

Run mbarankpart1.py

Default Sorted

01 – Bigger Figure Size

02 - Removing gray background and frame

03 - Make room for others… Remove frame

04 – Iterate to remove tick marks

05 – Bar height and bar color and edge
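The numbered refinements above can be sketched as a single matplotlib function. The mbarank scripts themselves are not reproduced in this transcript, so this is an approximation under stated assumptions: plot_ranking, its parameters, and the specific sizes and colors are hypothetical choices.

```python
def plot_ranking(names, scores):
    """Horizontal bar chart of ranked items, refined step by step."""
    # Imported inside the function so this file still loads where
    # matplotlib is not installed.
    import matplotlib.pyplot as plt

    # Sorted: barh draws the first item at the bottom, so an ascending
    # sort puts the highest score at the top of the chart.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    names = [names[i] for i in order]
    scores = [scores[i] for i in order]
    positions = range(len(names))

    # 01: bigger figure size.
    fig, ax = plt.subplots(figsize=(8, 10))

    # 05: bar height, bar color, and edge.
    ax.barh(positions, scores, height=0.6, color="steelblue", edgecolor="none")
    ax.set_yticks(positions)
    ax.set_yticklabels(names)

    # 02 and 03: remove the gray background and the frame.
    fig.patch.set_facecolor("white")
    ax.set_facecolor("white")
    for spine in ax.spines.values():
        spine.set_visible(False)

    # 04: remove the tick marks but keep their labels.
    ax.tick_params(axis="both", length=0)
    return fig, ax
```

A call such as plot_ranking(["School A", "School B", "School C"], [71.2, 88.5, 64.0]) uses made-up names and scores. Substitute the ranking data that the exercise provides.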

mbarank_part3.py: Now, plot 2 others and add ranks…

Refine your visual display in Adobe Illustrator

Tips for saving your image file

• If you are going to modify the image in Illustrator, save your file as a PDF from Python
– Use savefig(filename), e.g. savefig("mbarankings.pdf")
– Or save from the interactive window that show() launches in IPython
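Saving to PDF from a script can be sketched as follows. matplotlib picks the output format from the file extension, so a name ending in .pdf produces a vector file that Illustrator can open and edit. The function save_bar_chart_pdf and its placeholder chart are hypothetical, not the deck's own script.

```python
def save_bar_chart_pdf(values, labels, filename="mbarankings.pdf"):
    """Draw a simple horizontal bar chart and write it to a PDF file."""
    # Imported inside the function so this file still loads where
    # matplotlib is not installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.barh(labels, values)

    # The ".pdf" extension selects matplotlib's PDF output.
    fig.savefig(filename)
    plt.close(fig)  # release the figure once it is on disk
    return filename
```

Calling save_bar_chart_pdf([3, 2, 1], ["School A", "School B", "School C"]) writes mbarankings.pdf to the current working directory. The school names are placeholders.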

Working in Adobe Illustrator
1. Open the PDF document in Adobe Illustrator.
2. If you don’t see the Tools window, go to the Window menu and click Tools to turn it on.
3. The black arrow is called the Selection tool. Select it, and your mouse pointer becomes a black arrow.
4. Click and drag it over the border. The border appears highlighted. This is known as a clipping mask.
5. Press delete on your keyboard to get rid of it.
6. If this deletes the graphic, undo the edit, and use the Direct Selection tool, which is represented by a white arrow, to highlight the clipping mask instead.
7. Use the Selection tool to change fonts, change colors, add text, etc.

One Mistake: Don’t Average Percentages

• You must go back to the original data source to recalculate the new percentage.
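A small worked example shows why the average of two percentages can be badly wrong when the groups differ in size. The counts below are hypothetical, and combined_percentage is an illustrative helper rather than anything from the deck.

```python
def combined_percentage(parts):
    """Recompute a percentage from the original (count, total) pairs."""
    count = sum(c for c, _ in parts)
    total = sum(t for _, t in parts)
    return 100.0 * count / total


# Hypothetical groups: 9 of 10 (90%) and 100 of 1,000 (10%).
groups = [(9, 10), (100, 1000)]

# Averaging the two percentages ignores how different the group sizes are.
naive = sum(100.0 * c / t for c, t in groups) / len(groups)
print(naive)  # 50.0

# Going back to the underlying counts gives the true combined rate.
correct = combined_percentage(groups)
print(round(correct, 2))  # 10.79
```

The naive average treats a group of 10 and a group of 1,000 as equally important, which is why the original counts are needed.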

RESOURCES

Edward Tufte

Nathan Yau and Dona Wong

Casey Reas and Ben Fry

Stephen Few

Colin Ware and Richard Mayer

Seth Godin and Andy Goodman