DSC 201: Data Analysis & Visualization. Visualization. Dr. David Koop, Fall 2018
Transcript
Page 1: DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization

Visualization Dr. David Koop

D. Koop, DSC 201, Fall 2018

Page 2: DSC 201: Data Analysis & Visualization

“Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively.”

— T. Munzner

D. Koop, DSC 201, Fall 2018

Page 3: DSC 201: Data Analysis & Visualization

Why Visual?

D. Koop, DSC 201, Fall 2018

[F. J. Anscombe]

       I            II           III           IV
  x     y      x     y      x      y      x      y
10.0   8.04  10.0   9.14  10.0   7.46   8.0   6.58
 8.0   6.95   8.0   8.14   8.0   6.77   8.0   5.76
13.0   7.58  13.0   8.74  13.0  12.74   8.0   7.71
 9.0   8.81   9.0   8.77   9.0   7.11   8.0   8.84
11.0   8.33  11.0   9.26  11.0   7.81   8.0   8.47
14.0   9.96  14.0   8.10  14.0   8.84   8.0   7.04
 6.0   7.24   6.0   6.13   6.0   6.08   8.0   5.25
 4.0   4.26   4.0   3.10   4.0   5.39  19.0  12.50
12.0  10.84  12.0   9.13  12.0   8.15   8.0   5.56
 7.0   4.82   7.0   7.26   7.0   6.42   8.0   7.91
 5.0   5.68   5.0   4.74   5.0   5.73   8.0   6.89

Page 4: DSC 201: Data Analysis & Visualization

Why Visual?

D. Koop, DSC 201, Fall 2018

[F. J. Anscombe]

[Same Anscombe's quartet table as the previous slide, now shown with summary statistics.]

Mean of x: 9, Variance of x: 11, Mean of y: 7.50, Variance of y: 4.122, Correlation: 0.816 (the same for each of the four datasets)
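These identical statistics are easy to verify with pandas. Below is a minimal sketch (not from the slides) that rebuilds the quartet from the table above and prints each dataset's mean, variance, and correlation:

import pandas as pd

# Rebuild Anscombe's quartet from the table above.
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8.0] * 7 + [19.0] + [8.0] * 3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in datasets.items():
    df = pd.DataFrame({"x": x, "y": y})
    print(name,
          round(df["x"].mean(), 2), round(df["x"].var(), 2),   # mean and variance of x
          round(df["y"].mean(), 2), round(df["y"].var(), 3),   # mean and variance of y
          round(df["x"].corr(df["y"]), 3))                     # correlation, about 0.816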

Page 5: DSC 201: Data Analysis & Visualization

[Figure: scatterplots of the four Anscombe datasets, y1 vs. x1, y2 vs. x2, y3 vs. x3, and y4 vs. x4, each drawn on the same axes (x from 4 to 18, y from 4 to 12).]

Why Visual?

D. Koop, DSC 201, Fall 2018

[F. J. Anscombe]
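A minimal matplotlib sketch (reusing the hypothetical `datasets` dict from the pandas sketch above) that reproduces the four scatterplots side by side:

import matplotlib.pyplot as plt

# Assumes the `datasets` dict defined in the earlier pandas sketch.
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes, datasets.items()):
    ax.scatter(x, y)                 # one scatterplot per dataset
    ax.set_title(f"Dataset {name}")
    ax.set_xlim(3, 19)
    ax.set_ylim(3, 13)
plt.tight_layout()
plt.show()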

Page 6: DSC 201: Data Analysis & Visualization

Designing Visualizations

D. Koop, DSC 201, Fall 2018

[M. Stefaner, 2013]

Page 7: DSC 201: Data Analysis & Visualization

Visualization Effectiveness

D. Koop, DSC 201, Fall 2018

[A. Kitaoka]

Page 8: DSC 201: Data Analysis & Visualization

Visualization Effectiveness

D. Koop, DSC 201, Fall 2018

[A. Kitaoka]

Only red, yellow, and blue are present; the purple and orange do not exist!

Page 9: DSC 201: Data Analysis & Visualization

Channels: Visual Appearance
• How should we encode this data?

D. Koop, DSC 201, Fall 2018

Name            Region                  Population    Life Expectancy   Income
China           East Asia & Pacific     1335029250    73.28              7226.07
India           South Asia              1140340245    64.01              2731
United States   America                  306509345    79.43             41256.08
Indonesia       East Asia & Pacific      228721000    71.17              3818.08
Brazil          America                  193806549    72.68              9569.78
Pakistan        South Asia               176191165    66.84              2603
Bangladesh      South Asia               156645463    66.56              1492
Nigeria         Sub-Saharan Africa       141535316    48.17              2158.98
Japan           East Asia & Pacific      127383472    82.98             29680.68
Mexico          America                  111209909    76.47             11250.37
Philippines     East Asia & Pacific       94285619    72.1               3203.97
Vietnam         East Asia & Pacific       86970762    74.7               2679.34
Germany         Europe & Central Asia     82338100    80.08             31191.15
Ethiopia        Sub-Saharan Africa        79996293    55.69               812.16
Turkey          Europe & Central Asia     72626967    72.06              8040.78

Page 10: DSC 201: Data Analysis & Visualization

[Screenshot of the interactive Gapminder tool: life expectancy (30 to 80 years) on the vertical axis, the year set to 2015 on a timeline running from 1800 to 2000, bubble size set to "Population, total", color by World Regions, with Color, Select, Size, and Zoom controls and a searchable country list.]

Potential Solution

D. Koop, DSC 201, Fall 2018

[Gapminder, Wealth & Health of Nations]
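As a rough illustration of this encoding, the following Altair sketch (using made-up column names and only a few rows from the table two slides back) maps income and life expectancy to position, population to size, and region to color hue, in the spirit of the Gapminder chart:

import altair as alt
import pandas as pd

# A few rows from the earlier table, with hypothetical column names.
countries = pd.DataFrame([
    ("China", "East Asia & Pacific", 1335029250, 73.28, 7226.07),
    ("India", "South Asia", 1140340245, 64.01, 2731.00),
    ("United States", "America", 306509345, 79.43, 41256.08),
    ("Nigeria", "Sub-Saharan Africa", 141535316, 48.17, 2158.98),
    ("Japan", "East Asia & Pacific", 127383472, 82.98, 29680.68),
], columns=["name", "region", "population", "life_expectancy", "income"])

chart = alt.Chart(countries).mark_circle(opacity=0.7).encode(
    x=alt.X("income:Q", scale=alt.Scale(type="log"), title="Income per person"),
    y=alt.Y("life_expectancy:Q", title="Life expectancy (years)"),
    size=alt.Size("population:Q", title="Population"),
    color=alt.Color("region:N", title="Region"),
    tooltip=["name", "income", "life_expectancy", "population"],
)
chart.save("wealth_health.html")   # open in a browser to view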

Page 11: DSC 201: Data Analysis & Visualization

Assignment 2
• Visualizing Hurricane Data using Tableau and Altair
• Tasks (a hedged Altair sketch for two of these follows below):
  - Statistics
  - Number of hurricanes over the years
  - Windspeed versus pressure
  - Locations of hurricanes

D. Koop, DSC 201, Fall 2018
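A sketch of two of the assignment tasks in Altair, assuming a hurricane table with hypothetical columns name, year, wind, and pressure (the real column names in the assignment data may differ):

import altair as alt
import pandas as pd

hurricanes = pd.read_csv("hurricanes.csv")   # hypothetical file and column names

# Number of hurricanes over the years: count distinct storm names per year.
per_year = alt.Chart(hurricanes).mark_bar().encode(
    x=alt.X("year:O", title="Year"),
    y=alt.Y("distinct(name):Q", title="Number of hurricanes"),
)

# Wind speed versus pressure: one point per observation.
wind_vs_pressure = alt.Chart(hurricanes).mark_point(opacity=0.4).encode(
    x=alt.X("pressure:Q", title="Pressure"),
    y=alt.Y("wind:Q", title="Wind speed"),
)

(per_year | wind_vs_pressure).save("assignment2_sketch.html")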

Page 12: DSC 201: Data Analysis & Visualization

Visual Encoding
• How do we encode data visually?
  - Marks are the basic graphical elements in a visualization
  - Channels are ways to control the appearance of the marks
• Marks classified by dimensionality: points, lines, areas (see the sketch after this slide)
• Also can have surfaces, volumes
• Think of marks as a mathematical definition, or, if familiar with tools like Adobe Illustrator or Inkscape, the path & point definitions

D. Koop, DSC 201, Fall 2018

Points Lines Areas
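A small Altair sketch (toy data, not from the slides) showing the three basic mark dimensionalities, point, line, and area, applied to the same series:

import altair as alt
import pandas as pd

toy = pd.DataFrame({"x": range(6), "y": [1, 3, 2, 5, 4, 6]})

points = alt.Chart(toy).mark_point().encode(x="x:Q", y="y:Q")   # point marks
lines  = alt.Chart(toy).mark_line().encode(x="x:Q", y="y:Q")    # line marks
areas  = alt.Chart(toy).mark_area().encode(x="x:Q", y="y:Q")    # area marks
(points | lines | areas).save("marks.html")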

Page 13: DSC 201: Data Analysis & Visualization

Visual Channels

[Figure: position (horizontal, vertical, both), color, shape, tilt, size (length, area, volume)]

D. Koop, DSC 201, Fall 2018

[Munzner (ill. Maguire), 2014]

Page 14: DSC 201: Data Analysis & Visualization

Channels
• Usually map an attribute to a single channel
  - Could use multiple channels but…
  - Limited number of channels
• Restrictions on size and shape
  - Points are nothing but location, so size and shape are ok
  - Lines have a length, so we cannot easily encode an attribute as length
  - Maps with boundaries have area, so changing size can be problematic

D. Koop, DSC 201, Fall 2018

Page 15: DSC 201: Data Analysis & Visualization

Cartograms

D. Koop, DSC 201, Fall 2018

[Election Results by Population, M. Newman, 2012]

Page 16: DSC 201: Data Analysis & Visualization

Channel Types
• Identity => what or where; Magnitude => how much

D. Koop, DSC 201, Fall 2018

[Munzner (ill. Maguire), 2014]

Channels: Expressiveness Types and Effectiveness Ranks
• Magnitude channels (ordered attributes), ranked by effectiveness: position on common scale, position on unaligned scale, length (1D size), tilt/angle, area (2D size), depth (3D position), color luminance, color saturation, curvature, volume (3D size)
• Identity channels (categorical attributes), ranked by effectiveness: spatial region, color hue, motion, shape

Page 17: DSC 201: Data Analysis & Visualization

Mark Types
• Can have marks for items and links
  - Connection => pairwise relationship
  - Containment => hierarchical relationship

D. Koop, DSC 201, Fall 2018

[Munzner (ill. Maguire), 2014]

[Figure: marks as items/nodes (points, lines, areas) and marks as links (connection, containment)]

Page 18: DSC 201: Data Analysis & Visualization

Expressiveness and Effectiveness
• Expressiveness Principle: all data from the dataset, and nothing more, should be shown
  - Do encode ordered data in an ordered fashion
  - Don't encode categorical data in a way that implies an ordering (see the sketch below)
• Effectiveness Principle: the most important attributes should be the most salient
  - Saliency: how noticeable something is
  - How do the channels we have discussed measure up?
  - How was this determined?

D. Koop, DSC 201, Fall 2018
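A minimal Altair sketch of the expressiveness principle applied to color (toy data, not from the slides): an ordered attribute gets a sequential, luminance-ordered scheme, while a categorical attribute gets distinct hues that imply no ordering:

import altair as alt
import pandas as pd

df = pd.DataFrame({
    "item": list("ABCDE"),
    "score": [1, 2, 3, 4, 5],           # an ordered attribute
    "group": ["x", "y", "x", "z", "y"], # a categorical attribute
})

# Ordered data encoded in an ordered fashion: a sequential scheme.
ordered = alt.Chart(df).mark_bar().encode(
    x="item:N", y="score:Q",
    color=alt.Color("score:Q", scale=alt.Scale(scheme="blues")),
)

# Categorical data encoded with distinct hues, which imply no ordering.
categorical = alt.Chart(df).mark_bar().encode(
    x="item:N", y="score:Q",
    color=alt.Color("group:N", scale=alt.Scale(scheme="tableau10")),
)

(ordered | categorical).save("expressiveness.html")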

Page 19: DSC 201: Data Analysis & Visualization

How do we test effectiveness?

D. Koop, DSC 201, Fall 2018

Page 20: DSC 201: Data Analysis & Visualization

esting set of perceptual tasks, we replicated Cleveland & McGill's [7] classic study (Exp. 1A) of proportionality estimates across spatial encodings (position, length, angle), and Stone & Bartram's [30] alpha contrast experiment (Exp. 2), involving transparency (luminance) adjustment of chart grid lines. Our second goal was to conduct additional experiments that demonstrate the use of Mechanical Turk for generating new insights. We studied rectangular area judgments (Exp. 1B), following the methodology of Cleveland & McGill to enable comparison, and then investigated optimal chart heights and gridline spacing (Exp. 3). Our third goal was to analyze data from across our experiments to characterize the use of Mechanical Turk as an experimental platform.

In the following four sections, we describe our experiments and focus on details specific to visualization. Results of a more general nature are visited in our performance and cost analysis; for example, we delay discussion of response time results. Our experiments were initially launched with a limited number of assignments (typically 3) to serve as a pilot. Upon completion of the trial assignments and verification of the results, the number of assignments was increased.

EXPERIMENT 1A: PROPORTIONAL JUDGMENT
We first replicated Cleveland & McGill's seminal study [7] on Mechanical Turk. Their study was among the first to rank visual variables empirically by their effectiveness for conveying quantitative values. It also has influenced the design of automated presentation techniques [21, 22] and been successfully extended by others (e.g., [36]). As such, it is a natural experiment to replicate to assess crowdsourcing.

Method
Seven judgment types, each corresponding to a visual encoding (such as position or angle), were tested. The first five correspond to Cleveland & McGill's original position-length experiment; types 1 through 3 use position encoding along a common scale (Figure 1), while 4 and 5 use length encoding. Type 6 uses angle (as a pie chart) and type 7 uses circular area (as a bubble chart, see Figure 2).

Ten charts were constructed at a resolution of 380×380 pixels, for a total of 70 trials (HITs). We mimicked the number, values and aesthetics of the original charts as closely as possible. For each chart, N=50 subjects were instructed first to identify the smaller of two marked values, and then "make a quick visual judgment" to estimate what percentage the smaller was of the larger. The first question served broadly to verify responses; only 14 out of 3,481 were incorrect (0.4%). Subjects were paid $0.05 per judgment.

To participate in the experiment, subjects first had to complete a qualification test consisting of two labeled example charts and three test charts. The test questions had the same format as the experiment trials, but with multiple choice rather than free text responses; only one choice was correct, while the others were grossly wrong. The qualification thus did not filter inaccurate subjects (which would bias the responses) but ensured that subjects understood the instructions. A pilot run of the experiment omitted this qualification and over 10% of the responses were unusable. We discuss this observation in more detail later in the paper.

[Figure 1: Stimuli for judgment tasks T1, T2 & T3. Subjects estimated percent differences between elements.]

[Figure 2: Area judgment stimuli. Top left: Bubble chart (T7), Bottom left: Center-aligned rectangles (T8), Right: Treemap (T9).]

In the original experiment, Cleveland & McGill gave each subject a packet with all fifty charts on individual sheets. Lengthy tasks are ill-suited to Mechanical Turk; they are more susceptible to "gaming" since the reward is higher, and subjects cannot save drafts, raising the possibility of lost data due to session timeout or connectivity error. We instead assigned each chart as an individual task. Since the vast majority (95%) of subjects accepted all tasks in sequence, the experiment adhered to the original within-subjects format.

Results
To analyze responses, we replicated Cleveland & McGill's data exploration, using their log absolute error measure of accuracy: log2(|judged percent - true percent| + 1/8). We first computed the midmeans of log absolute errors¹ for each chart (Figure 3). The new results are similar (though not identical) to the originals: the rough shape and ranking of judgment types by accuracy (T1-5) are preserved, supporting the validity of the crowdsourced study.

Next we computed the log absolute error means and 95% confidence intervals for each judgment type using bootstrapping (cf. [7]). The ranking of types by accuracy is consistent between the two experiments (Figure 4). Types 1 and 2 are closer in the crowdsourced study; this may be a result of a smaller display mitigating the effect of distance. Types 4 and 5 are more accurate than in the original study, but position encoding still significantly outperformed length encoding.

We also introduced two new judgment types to evaluate angle and circular area encodings. Cleveland & McGill conducted a separate position-angle experiment; however, they used a different task format, making it difficult to compare

¹ The midmean (the mean of the middle two quartiles) is a robust measure less susceptible to outliers. A log scale is used to measure relative proportional error and the 1/8 term is included to handle zero-valued differences.

Test % difference in length between elements

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]
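The accuracy measure quoted above is easy to compute directly. A small Python sketch (with made-up judgment values, not data from the study) of the log absolute error log2(|judged - true| + 1/8) and its midmean, the mean of the middle two quartiles:

import numpy as np

def log_abs_error(judged_pct, true_pct):
    # Cleveland & McGill's measure: log2(|judged percent - true percent| + 1/8)
    return np.log2(np.abs(np.asarray(judged_pct, dtype=float) - true_pct) + 1 / 8)

def midmean(values):
    # Mean of the middle two quartiles (values between the 25th and 75th percentiles).
    v = np.sort(np.asarray(values, dtype=float))
    q1, q3 = np.percentile(v, [25, 75])
    return v[(v >= q1) & (v <= q3)].mean()

judgments = [55, 60, 48, 52, 70, 50, 45]   # hypothetical percentage estimates
print(round(midmean(log_abs_error(judgments, true_pct=50)), 3))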

Page 21: DSC 201: Data Analysis & Visualization

[Same excerpt and figures from Heer & Bostock (2010) as on page 20. Task: test the % difference in length between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Answer: Left is ~5.6x longer than Right

Page 22: DSC 201: Data Analysis & Visualization

[Same excerpt and figures from Heer & Bostock (2010) as on page 20. Task: test the % difference in length between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Page 23: DSC 201: Data Analysis & Visualization

[Same excerpt and figures from Heer & Bostock (2010) as on page 20. Task: test the % difference in length between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Page 24: DSC 201: Data Analysis & Visualization

[Same excerpt from Heer & Bostock (2010) as on page 20, with a modified figure. Task: test the % difference in length between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Modified from Heer & Bostock, 2010]

Page 25: DSC 201: Data Analysis & Visualization

[Same excerpt from Heer & Bostock (2010) as on page 20, with a modified figure. Task: test the % difference in length between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Modified from Heer & Bostock, 2010]

Answer: Right is 4x larger than Left

Page 26: DSC 201: Data Analysis & Visualization

[Same excerpt and figures from Heer & Bostock (2010) as on page 20. Task: test the % difference in area between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Page 27: DSC 201: Data Analysis & Visualization

[Same excerpt and figures from Heer & Bostock (2010) as on page 20. Task: test the % difference in area between the marked elements.]

D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Answer: A is ~2.25x larger (in area) than B

Page 28: DSC 201: Data Analysis & Visualization · 2018. 10. 4. · 4 6 8 10 12 14 16 18 4 6 8 10 12 x 1 y 1 4 6 8 10 12 14 16 18 4 6 8 12 x 2 y 2 4 6 8 10 12 14 16 18 4 6 8 10 12 x 3 y 3

esting set of perceptual tasks, we replicated Cleveland &McGill’s [7] classic study (Exp. 1A) of proportionality es-timates across spatial encodings (position, length, angle),and Stone & Bartram’s [30] alpha contrast experiment (Exp.2), involving transparency (luminance) adjustment of chartgrid lines. Our second goal was to conduct additional ex-periments that demonstrate the use of Mechanical Turk forgenerating new insights. We studied rectangular area judg-ments (Exp. 1B), following the methodology of Cleveland &McGill to enable comparison, and then investigated optimalchart heights and gridline spacing (Exp. 3). Our third goalwas to analyze data from across our experiments to character-ize the use of Mechanical Turk as an experimental platform.

In the following four sections, we describe our experiments and focus on details specific to visualization. Results of a more general nature are visited in our performance and cost analysis; for example, we delay discussion of response time results. Our experiments were initially launched with a limited number of assignments (typically 3) to serve as a pilot. Upon completion of the trial assignments and verification of the results, the number of assignments was increased.

EXPERIMENT 1A: PROPORTIONAL JUDGMENT

We first replicated Cleveland & McGill's seminal study [7] on Mechanical Turk. Their study was among the first to rank visual variables empirically by their effectiveness for conveying quantitative values. It also has influenced the design of automated presentation techniques [21, 22] and been successfully extended by others (e.g., [36]). As such, it is a natural experiment to replicate to assess crowdsourcing.

Method

Seven judgment types, each corresponding to a visual encoding (such as position or angle), were tested. The first five correspond to Cleveland & McGill's original position-length experiment; types 1 through 3 use position encoding along a common scale (Figure 1), while 4 and 5 use length encoding. Type 6 uses angle (as a pie chart) and type 7 uses circular area (as a bubble chart, see Figure 2).

Ten charts were constructed at a resolution of 380×380 pixels, for a total of 70 trials (HITs). We mimicked the number, values and aesthetics of the original charts as closely as possible. For each chart, N=50 subjects were instructed first to identify the smaller of two marked values, and then "make a quick visual judgment" to estimate what percentage the smaller was of the larger. The first question served broadly to verify responses; only 14 out of 3,481 were incorrect (0.4%). Subjects were paid $0.05 per judgment.
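
For concreteness, the quantity being elicited is just the smaller marked value expressed as a percentage of the larger one; with made-up values:

```python
# Made-up marked values for one chart
smaller, larger = 30.0, 75.0
true_percent = 100 * smaller / larger   # correct response: 40.0%
print(true_percent)
```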


Test % difference in area between elements

26 D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Page 29: DSC 201: Data Analysis & Visualization


Test % difference in area between elements

27 D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Answer: B is ~6.1x larger (in area) than A

Page 30: DSC 201: Data Analysis & Visualization


Test % difference in area between elements

28 D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Page 31: DSC 201: Data Analysis & Visualization


Test % difference in area between elements

29 D. Koop, DSC 201, Fall 2018

[Heer & Bostock, 2010]

Answer: B is ~2.5x larger (in area) than A

Page 32: DSC 201: Data Analysis & Visualization

Results Summary

[Figure: midmean log error (1.0–3.0, lower is more accurate) for Cleveland & McGill's results and for the crowdsourced results, by encoding: positions; rectangular areas (aligned or in a treemap); angles; circular areas.]

30 D. Koop, DSC 201, Fall 2018

[Munzner (ill. Maguire) based on Heer & Bostock, 2014]

Page 33: DSC 201: Data Analysis & Visualization

Visualization Tools
• Analysis Apps: Tableau, Excel, SAS
• Illustration Apps: Illustrator, Inkscape
• R Libraries: base, ggplot
• Python Modules: matplotlib, seaborn, bokeh, altair
• Lower-level Frameworks: D3, Processing
• Many, many more: Google "data visualization tools"
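
As a small taste of the Python options above, a minimal matplotlib bar chart (with made-up values) looks like this:

```python
import matplotlib.pyplot as plt

# Made-up values for two marked elements
labels = ["A", "B"]
values = [30, 75]

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(labels, values, color="steelblue")
ax.set_ylabel("Value")
ax.set_title("Position/length encoding of two values")
plt.show()
```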

31 D. Koop, DSC 201, Fall 2018

Page 34: DSC 201: Data Analysis & Visualization

Tableau Overview
• Grew out of research at Stanford University on how to explore multidimensional datasets & relational databases
• Tableau Desktop: standalone (free trial, student license)
• Tableau Public: cloud-based system (free)
• Tableau Vizable: mobile app
• Tableau's Introduction Videos: https://www.youtube.com/watch?v=6py0jyZc7K4

32 D. Koop, DSC 201, Fall 2018

Page 35: DSC 201: Data Analysis & Visualization

Data In Tableau

• Categorical data = Dimensions
• Quantitative data = Measures
(A pandas analogy is sketched below, after the attribute-type diagram.)

33 D. Koop, DSC 201, Fall 2018

[Diagram of attribute types: Categorical vs. Ordered; Ordered splits into Ordinal and Quantitative; ordering direction can be Sequential, Diverging, or Cyclic.]
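
To tie this back to the earlier pandas material, Tableau's dimension/measure split roughly corresponds to categorical versus quantitative columns in a DataFrame, and dropping a measure onto a dimension amounts to a grouped aggregation. A small illustration with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "West"],  # categorical -> Dimension
    "product": ["A", "B", "A", "C"],                # categorical -> Dimension
    "sales": [120.0, 85.5, 99.9, 143.2],            # quantitative -> Measure
    "units": [12, 9, 10, 14],                       # quantitative -> Measure
})

# Tableau-style view: a measure aggregated per dimension
print(df.groupby("region")["sales"].sum())
```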

Page 36: DSC 201: Data Analysis & Visualization

Tableau Example

34 D. Koop, DSC 201, Fall 2018

