Principles of Data Visualization
Rodrigo De Luna Lara
Contents
The Importance of Data Visualization
Planar and Retinal Variables
    Retinal Variables for Qualitative Data
    Retinal Variables for Quantitative Data
The Importance of Color
    The Color Brewer Schemes
        Sequential Brewer Schemes
        Diverging Brewer Schemes
        Qualitative Brewer Schemes
Best Practices in Visualization
    Avoid 3D Visualizations
    Avoid Pie Charts
    Beware of Misleading Aspect Ratios
    Beware of Spurious Correlations
    Avoid dual-scaled axes
    Declutter your visualizations
    Emphasize what is important
Choosing the correct visualization
Storytelling with Data
    Understand the Context
    Tell a Story
Case Studies
Bibliography
The Importance of Data Visualization
Suppose we have 13 different datasets (Matejka & Fitzmaurice, 2017) and we compute some statistical measures for them, as shown in Table 1 (the column labeled R2 holds the Pearson correlation coefficient). If you were asked to describe these datasets, you could conclude that they are barely different; perhaps you would consider them measurements of the same process with different amounts of noise.
Table 1: Statistical Measures for 13 Distinct Datasets.
Dataset   Mean x     Mean y     σx         σy         R2
1         54.26327   47.83225   16.76514   26.93540   -0.0644719
2         54.26610   47.83472   16.76983   26.93974   -0.0641284
3         54.26873   47.83082   16.76924   26.93573   -0.0685864
4         54.26732   47.83772   16.76001   26.93004   -0.0683434
5         54.26030   47.83983   16.76774   26.93019   -0.0603414
6         54.26144   47.83025   16.76590   26.93988   -0.0617148
7         54.26881   47.83545   16.76670   26.94000   -0.0685042
8         54.26785   47.83590   16.76676   26.93610   -0.0689797
9         54.26588   47.83150   16.76885   26.93861   -0.0686092
10        54.26734   47.83955   16.76896   26.93027   -0.0629611
11        54.26993   47.83699   16.76996   26.93768   -0.0694456
12        54.26692   47.83160   16.77000   26.93790   -0.0665752
13        54.26015   47.83972   16.76996   26.93000   -0.0655833
Having reached your conclusion from the table, you decide out of curiosity to plot the first dataset, and you come up with Figure 1.
[Figure: scatter plot of Dataset 1; x-axis: x (0 to 100), y-axis: y (0 to 100)]
Figure 1: Plot for Dataset 1
From the raw data, or even from the statistical measures in Table 1, you likely wouldn’t have thought the dataset described a dinosaur. Now, considering the statistical measures for the rest of the datasets, you could still conclude their plots are fairly similar. However, just to confirm you are correct, you decide to plot them as well, coming up with Figure 2.
[Figure: grid of 12 scatter plots, one panel per dataset (2 to 13); x-axis: x (0 to 100), y-axis: y (0 to 100)]
Figure 2: Plots for Datasets 2-13
It is evident that these are extremely different datasets, yet each has practically the same statistical measures. This example is a modern take on a demonstration constructed by statistician Francis Anscombe (1973) to show the importance of data visualization in analysis. Table 2 shows the same descriptive statistics as the previous example for the Anscombe datasets.
Table 2: Statistical Measures for Anscombe Datasets.
Dataset   Mean x   Mean y     σx         σy         R2
1         9        7.500909   3.316625   2.031568   0.8164205
2         9        7.500909   3.316625   2.031657   0.8162365
3         9        7.500000   3.316625   2.030424   0.8162867
4         9        7.500909   3.316625   2.030578   0.8165214
We could also fit a simple linear model of the form y = β1x + β0 to the 4 datasets to see if they can be modeled by a common function. The resulting coefficients can be seen in Table 3.
Table 3: Linear Regression Coefficients for Anscombe Datasets.
Dataset   β1          β0
1         0.5000909   3.000091
2         0.5000000   3.000909
3         0.4997273   3.002454
4         0.4999091   3.001727
As in the previous example, at this point we could conclude the datasets are extremely similar and that we can use the function y = 0.50x + 3.00 for each of them without any issue. We already know this approach can be misleading, which is confirmed by the plots in Figure 3.
[Figure: 2 x 2 grid of scatter plots, one panel per Anscombe dataset (1 to 4); x-axis: x (0 to 20), y-axis: y (0 to 20)]
Figure 3: Plots for Anscombe Datasets
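The values in Table 2 and Table 3 can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library; the data values are the published Anscombe (1973) quartet, while the names `QUARTET`, `summarize` and `SUMMARY` are illustrative choices made here and are not part of the original analysis.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Datasets 1-3 share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return means, sample standard deviations, Pearson r and the
    least-squares line y = slope * x + intercept for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": sqrt(sxx / (n - 1)),
        "sd_y": sqrt(syy / (n - 1)),
        "r": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    # One line per dataset; the four lines should look almost identical.
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The four printed rows are nearly indistinguishable, which is exactly why the summary alone cannot replace the plots in Figure 3.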
Despite the fact that these 4 datasets share a common best linear model, each of them evidently has very distinct characteristics when visualized. This demonstration also shows the large effect outliers can have on model predictions and statistical properties, further justifying the need to visualize the data before committing to any conclusions or insights. In the article that originated this demonstration, Anscombe (1973) explained that:
    Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations are misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.
Producing visualizations to explore the data and better understand the interactions between variables is one of the two main purposes of data visualization, which we will refer to as exploratory analysis. The second main purpose is to communicate a message or convey an insight to a specific audience, which we will refer to as explanatory analysis. Figure 4 shows the progression from exploratory to explanatory analysis.
Figure 4: Exploratory vs Explanatory Analysis
Exploratory analysis is done to understand the data, with the purpose of detecting underlying patterns and relationships. Usually, it seeks to answer a specific question or test a hypothesis about the data. During this stage it is critical to understand the origin and characteristics of the data, so that we know how to process it to obtain clear, specific insights. The purpose of explanatory analysis, on the other hand, is to take these insights and present them to a specific audience in the most concise, simple and clear way possible.
With this in mind, data visualization is most critical for explanatory analysis. Nonetheless, following good visualization practices during exploratory analysis helps us obtain insights more easily, and is very important to avoid misleading interpretations. As the examples in this section showed, not visualizing the data correctly can lead to erroneous or biased conclusions.
Planar and Retinal Variables
To make effective visualizations we must be aware of the different types of visual encoding variables that exist; choosing the correct visual encoding for the data is an essential step in data visualization. There are two types of visual encoding variables: planar and retinal. Planar variables represent points in a coordinate system (usually Cartesian), so on their own they can show only one variable per axis. A scatter plot of a single group of observations, such as Figure 5, has only planar visual encoding.
[Figure: scatter plot; x-axis: Sepal length (cm), y-axis: Sepal Width (cm); note: only versicolor species plotted]
Figure 5: Simple Scatter Plot.
Retinal variables are visual properties we use to express the data, such as size, color, shape or texture. The need for the inclusion of these visual properties arises mainly from the necessity of presenting more than one variable in a single visualization. The appropriate use of these variables depends on several factors, mainly the type of data (qualitative or quantitative), the number of variables/factors we want to plot and the medium of the presentation (printed or digital).
Table 4: Recommended Retinal Variables by Data Feature.
                        Color   Shape   Size   Texture
Quantitative data       ✓       ×       ✓      ×
Qualitative data        ✓       ✓       ×      ✓
Data with many levels   ✓       ×       ×      ×
Data with few levels    ✓       ✓       ×      ✓
Printed media           ×       ✓       ✓      ✓
Digital media           ✓       ✓       ✓      ×
(✓ = recommended, × = not recommended)
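Table 4 can be read as a lookup: each data feature recommends a set of retinal variables, and a visualization should use the variables recommended by every feature that applies. The sketch below encodes the table that way in Python; the dictionary keys and the `recommend` helper are hypothetical names introduced for illustration only.

```python
# Table 4 encoded as: data feature -> retinal variables marked as recommended.
RECOMMENDED = {
    "quantitative": {"color", "size"},
    "qualitative": {"color", "shape", "texture"},
    "many levels": {"color"},
    "few levels": {"color", "shape", "texture"},
    "printed": {"shape", "size", "texture"},
    "digital": {"color", "shape", "size"},
}


def recommend(*features):
    """Return the retinal variables recommended by every given feature."""
    unknown = [f for f in features if f not in RECOMMENDED]
    if unknown:
        raise KeyError(f"unknown feature(s): {unknown}")
    if not features:
        return []
    # Intersect the recommendation sets of all requested features.
    common = set.intersection(*(RECOMMENDED[f] for f in features))
    return sorted(common)


# Qualitative data with few levels on a screen: color and shape both work.
print(recommend("qualitative", "few levels", "digital"))
# Quantitative data on paper: only size is recommended by both rows.
print(recommend("quantitative", "printed"))
```

An empty result would mean the table offers no variable that satisfies every constraint, which is a hint to simplify the plot rather than to force an encoding.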
Retinal Variables for Qualitative Data
Let’s start by looking at these variables for qualitative data. Figure 6 shows the sepal width versus the sepal length for all 3 species in the iris dataset. It is possible to differentiate between the 3 species by looking at the markers, but some of the points are not easy to tell apart. Varying shapes is recommended when we have few levels or factors in our data, as it becomes harder to differentiate between the markers as more of them are added.
Figure 7 shows the available markers in ggplot. Only markers 21-24 can have a different fill color; the rest of the markers act as symbols rather than geometries.
[Figure: scatter plot; x-axis: Sepal length (cm), y-axis: Sepal Width (cm); legend: Species (setosa, versicolor, virginica), encoded by marker shape]
Figure 6: Visual Encoding by Shape
[Figure: grid of the point markers numbered 0 to 24]
Figure 7: ggplot Markers
Next we’ll look at texture. In this case the geometry of the points tends to remain constant, while the texture of its fill changes to reflect a different class. Using textures is not generally recommended, as most types of visualizations make it hard to differentiate between them. However, they can be useful for printed visualizations, as they are appropriate for greyscale color schemes. Figure 8 shows the same data as in Figure 6, but using different textures instead of different shapes.
Note: textures aren’t implemented in ggplot2 by default (nor, generally, in the most popular plotting libraries). However, some different markers can be used to give the same effect.
[Figure: scatter plot; x-axis: Sepal length (cm), y-axis: Sepal Width (cm); legend: Species (setosa, versicolor, virginica), encoded by marker texture]
Figure 8: Visual Encoding by Texture
Finally, Figure 9 shows the same plot but with the same marker in different colors. Of the plots so far, this one differentiates between classes best. The human eye can distinguish about 10 million different colors (by contrast, think about how many different shapes you can differentiate as markers in a plot, or how many easily distinguishable textures can be generated), so color tends to be the natural choice for differentiating classes in visualization.
[Figure: scatter plot; x-axis: Sepal.Length, y-axis: Sepal.Width; legend: Species (setosa, versicolor, virginica), encoded by color]
Figure 9: Visual Encoding by Color
Nonetheless, there are still some drawbacks to using color in visualizations. First and foremost, we must consider that some people have color vision deficiency, and that some people are more adept than others at distinguishing between subtle variations in color. More considerations for the proper use of color are covered in the Importance of Color section.
The retinal variables can be combined to allow for an even better visualization. Figure 10 shows the result of combining shape and color, which makes it even easier to distinguish the differences between species. This is the most common combination of retinal variables, as shape/texture and texture/color are less than ideal combinations.
[Figure: scatter plot; x-axis: Sepal.Length, y-axis: Sepal.Width; legend: Species (setosa, versicolor, virginica), encoded by shape and color]
Figure 10: Visual Encoding by Shape & Color
Retinal Variables for Quantitative Data
For quantitative data we’ll look at the mtcars dataset, which comprises fuel consumption and several aspects of automobile design and performance for 32 automobiles (1973-1974 models). As discussed previously, a simple scatter plot with planar encoding is enough to represent 2-variable relationships. A simple scatter plot of the horsepower of the engine vs its displacement can be seen in Figure 11.
[Figure: scatter plot; x-axis: Displacement (cu. in.), 0 to 500; y-axis: Gross horsepower, 0 to 500]
Figure 11: Scatter Plot for mtcars Dataset
The need for using retinal variables with quantitative data usually arises from including a third variable in a 2D visualization. Let’s say we want to look at how the gas mileage varies with engine displacement and horsepower. One way would be to make a 3D scatter plot, which isn’t the best option (the reasons why are discussed in the Best Practices in Visualization section).
A better option is to use retinal variables. Figure 12 shows this variable encoded in the size of the points. We can also use color to encode the variable, as seen in Figure 13. The choice of color is essential in this case; important considerations are covered in the Importance of Color section. Finally, we can combine both size and color to further emphasize the effect, as seen in Figure 14.
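Encoding a quantitative variable by size means turning each value into a marker size. One common approach, sketched below under stated assumptions, is to map the value linearly to the marker's area rather than its radius, so that a point drawn twice as large represents roughly twice the value. The function names and the default area limits are hypothetical; the mpg range of 10 to 35 is taken from the legends of the figures.

```python
from math import pi, sqrt


def marker_area(value, vmin, vmax, area_min=10.0, area_max=200.0):
    """Map a value in [vmin, vmax] linearly onto a marker area.

    Values outside the range are clamped to the nearest limit.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    fraction = (value - vmin) / (vmax - vmin)
    fraction = min(max(fraction, 0.0), 1.0)
    return area_min + fraction * (area_max - area_min)


def marker_radius(area):
    """Radius of a circular marker with the given area."""
    return sqrt(area / pi)


# Gas mileage values spanning the legend range of the mpg encoding.
for mpg in (10, 15, 20, 25, 30, 35):
    area = marker_area(mpg, vmin=10, vmax=35)
    print(mpg, round(area, 1), round(marker_radius(area), 2))
```

Scaling the radius linearly instead would make the area grow with the square of the value, which visually exaggerates large values.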
[Figure: scatter plot; x-axis: Displacement (cu. in.), y-axis: Gross horsepower; legend: mpg (10 to 35), encoded by point size]
Figure 12: Visual Encoding by Size
[Figure: scatter plot; x-axis: Displacement (cu. in.), y-axis: Gross horsepower; legend: mpg (10 to 35), encoded by color]
Figure 13: Visual Encoding by Color
[Figure: scatter plot; x-axis: Displacement (cu. in.), y-axis: Gross horsepower; legend: mpg (10 to 35), encoded by point size and color]
Figure 14: Visual Encoding by Size & Color
The Importance of Color
In data visualization, the choice of color is not merely aesthetic: color has a function, and improperly selected colors can distort relationships between values. In general, color in visualizations should follow these guidelines:
• Color is meant to convey meaning: it must be used sparingly and with a specific purpose in mind.
• Color affects how we perceive objects: the relative position/size/shape of an object is affected by color; perceptive bias should be minimized by using appropriate color schemes.
• Color must direct the attention of the audience: color should be used to emphasize what we’re trying to tell about the data.
The excessive use of color can lead to unpleasant and ineffective visualizations. Consider the plot in Figure 15: the color scheme is eye-straining and makes it difficult to distinguish between values.
[Figure: scatter plot; x-axis: Displacement (cu. in.), y-axis: Gross horsepower; legend: mpg (10 to 35), encoded by color]
Figure 15: Example of Poor Choice of Colors
There are some considerations to keep in mind when choosing a color scheme for a visualization, including:
• Image background: most color schemes are designed to be displayed on white backgrounds. The only situation where dark backgrounds could be used is when the image will be viewed in darkness.
• Supporting elements: elements such as grid lines, text on axes, labels or legends should be color-neutral (greyscale).
• Legibility: everything in the visualization must be clearly legible at first glance.
• Color blindness: some members of the audience may have color vision deficiencies, making it harder for them to distinguish between certain colors.
• Consistent colors: when using several plots, the color schemes between them should be consistent.
The Color Brewer Schemes
There are standardized color schemes that are widely used in data visualization. Some of the most popular are the Color Brewer schemes (Brewer, 2017). These schemes were hand-picked and crafted for cartography, although they are widely used in graphics in general. There are 3 types of Color Brewer schemes, depending on the nature of the data:
1. Sequential schemes: suited for data that is ordered from low to high; light colors represent low values and dark colors represent high values.
2. Diverging schemes: these schemes put equal emphasis on the extremes of the data range, shown with dark colors, and on the class break in the middle, shown with the lightest color. The class break can represent a critical value in the data such as the mean or median. Different colors mark divergence from the class break in opposite directions.
3. Qualitative schemes: these schemes are designed for classes that don’t imply different magnitudes, and are best used for nominal or categorical data.
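The difference between sequential and diverging schemes comes down to how a value is assigned to a color class. The sketch below illustrates one possible assignment in Python: a sequential scheme counts classes upward from the low end of the range, while a diverging scheme counts outward from the class break. Both function names are hypothetical, and the returned numbers are only positions in a palette (0 being the first color), not actual Color Brewer colors.

```python
from math import ceil


def sequential_class(value, low, high, n_classes):
    """Class index from 0 (lightest, low values) to n_classes - 1 (darkest)."""
    if high <= low or n_classes < 1:
        raise ValueError("need low < high and at least one class")
    fraction = min(max((value - low) / (high - low), 0.0), 1.0)
    # The top of the range belongs to the last class, not to a new one.
    return min(int(fraction * n_classes), n_classes - 1)


def diverging_class(value, low, mid, high, n_classes):
    """Class index for a diverging scheme with an odd number of classes.

    The middle index is the class break (lightest color); indices move
    toward 0 for values below mid and toward n_classes - 1 above it.
    """
    if not (low < mid < high) or n_classes < 3 or n_classes % 2 == 0:
        raise ValueError("need low < mid < high and an odd n_classes >= 3")
    half = n_classes // 2
    if value == mid:
        return half
    if value < mid:
        fraction = min((mid - value) / (mid - low), 1.0)
        return half - ceil(fraction * half)
    fraction = min((value - mid) / (high - mid), 1.0)
    return half + ceil(fraction * half)


# Sequential: mileage from 10 to 35 mpg split into 5 classes.
print([sequential_class(v, 10, 35, 5) for v in (10, 22.5, 35)])
# Diverging: deviations around a class break of 0 using 5 classes.
print([diverging_class(v, -10, 0, 10, 5) for v in (-10, -1, 0, 1, 10)])
```

A qualitative scheme needs no such mapping, since each category simply gets its own unrelated color.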
On the Color Brewer website, these schemes can be further filtered by colorblind safe, print friendly and photocopy safe. Figure 16, Figure 17 and Figure 18 show the Color Brewer palettes available in ggplot. They can be used for both discrete and continuous color scales.
Sequential Brewer Schemes
[Figure: sequential palettes, 9 colors each: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd]
Figure 16: Sequential Brewer Palettes
Diverging Brewer Schemes
[Figure: diverging palettes, 10 colors each: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral]
Figure 17: Diverging Brewer Palettes
Qualitative Brewer Schemes
[Figure: qualitative palettes: Accent (8 colors), Dark2 (8), Set2 (8), Pastel2 (8), Pastel1 (9), Set1 (9), Set3 (12), Paired (12)]
Figure 18: Qualitative Brewer Palettes
Best Practices in Visualization
This section presents some good and bad practices in data visualization, mainly as a reference of what to avoid when creating visualizations.
Avoid 3D Visualizations
Consider the plot in Figure 19. In comparison with Figure 14, it is harder to read the data: some points are lost in the perspective and the colorbar is not very effective. 3D plots suffer from the issue of perspective, since it is easier for people to visualize things in 2D than in 3D. Most of the time 3D plots can be avoided by encoding with retinal variables.
[Figure: 3D scatter plot; axes: Displacement (cu. in.), Gross horsepower, mpg]
Figure 19: Example 3D Plot With mtcars Dataset
Even the simplest 3D plots (3D scatter plots) have issues with perception, and are not optimal for static visualizations. 3D plots are more useful in exploratory analysis with interactive visualizations. Additionally, in most plotting libraries, producing a high quality 3D plot requires extensive customization and tinkering.
Avoid Pie Charts
In general, there is always a visualization that is more effective than a pie chart. Pie charts have an extremely unfavorable reputation in the world of data visualization, mostly because it is not easy to interpret and compare data in them. Take a look at Figure 20: can you compare the number of deaths of males and females in the month of April? Can you tell which month had the most deaths across both genders?
[Figure: two pie charts, Males and Females, titled "Deaths from Lung Diseases in the UK by Month (1974)"; legend: Month (Jan to Dec)]
Figure 20: Pie Charts for mdeaths and fdeaths Datasets
Data in pie charts is noticeably hard to compare if there are more than 2-3 points. To aid the reader, the value or corresponding percentage of each “slice” could be added to the plot, but if you need to label each individual point then the visualization is inappropriate and ineffective. Walter Hickey (2013), a reporter for Business Insider, states that “pie charts are the Aquaman of data visualization” in his article “The Worst Chart in the World”.
Consider Figure 21 as an example of a better visualization for the same data.
[Figure: chart of monthly counts by gender; x-axis: Month (Jan to Dec), y-axis: Count (0 to 2200); legend: Gender (Male, Female)]
Figure 21: Alternative Visualization
Beware of Misleading Aspect Ratios
Axes can be (unprofessionally) manipulated to change the story the data is telling. Let’s look at a real case of manipulation by tampering with the axes. In 2015, National Review tweeted a plot similar to the one in Figure 22 (National Review, 2015).
[Figure: plot titled "Global Average Temperature by Year"; x-axis: Year (1880 to 2012), y-axis: Temperature (°F), 0 to 110]
Figure 22: Plot with Misleading Axes
A 1° increase in global temperature can have a huge impact on the global climate, yet the axis range in Figure 22 makes such a change nearly invisible. A more appropriate visualization can be seen in Figure 23.
[Figure: plot titled "Global Average Temperature by Year"; x-axis: Year (1880 to 2012), y-axis: Temperature (°F), 56.0 to 59.0]
Figure 23: Plot with More Appropriate Axes
Manipulating axes to try to change the audience’s interpretation of the data is not only inappropriate, but also very unprofessional. Any effect can be maximized or minimized by disproportionately zooming in or out, and if the audience is not familiar with the data, their interpretation can be biased as a result.
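The effect of the axis limits can be quantified as the share of the axis height that a given change occupies. The following Python sketch compares the y-axis limits read from Figure 22 (0 to 110) and Figure 23 (56 to 59) for a change of 1°; the helper name `share_of_axis` is a hypothetical choice for this illustration.

```python
def share_of_axis(change, axis_min, axis_max):
    """Fraction of the axis span that a change of the given size covers."""
    span = axis_max - axis_min
    if span <= 0:
        raise ValueError("axis_max must be greater than axis_min")
    return abs(change) / span


# A 1 degree change drawn with the axis limits of each figure.
WIDE = share_of_axis(1.0, 0.0, 110.0)    # limits as in Figure 22
TIGHT = share_of_axis(1.0, 56.0, 59.0)   # limits as in Figure 23

print(f"0 to 110 axis: {WIDE:.1%} of the plot height")
print(f"56 to 59 axis: {TIGHT:.1%} of the plot height")
```

The same data therefore looks flat or steep depending only on the limits, which is why the limits should be chosen from the range that is meaningful for the phenomenon rather than from the effect one wants to show.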
Beware of Spurious Correlations
Remember that correlation does not imply causation. The use of spurious correlations can range from dangerous to absurd. In engineering applications, assuming that a correlation implies causation can result in incorrect models, predictions or recommendations to customers. In everyday situations, clearly unrelated phenomena can show high correlation to comical effect.
Consider the plots in Figure 24: one shows a declining trend in US aviation accidents (U.S. General Services Administration, 2015), the other a rising trend in US consumption of ice cream (U.S. General Services Administration, 2017). Undoubtedly, the two are completely unrelated, but if we plot one against the other (Figure 25) we can see that the correlation is significant. One could (erroneously) conclude that greater consumption of ice cream is the cause of fewer aviation accidents in the US.
[Figure: two stacked panels sharing the x-axis Year (1992 to 2013); y-axes: Thousand Short Tons of Ice Cream (4000 to 7000) and Aviation Accidents (1200 to 2100)]
Figure 24: Aviation Deaths and Ice Cream Consumption in the US by year.
[Figure: scatter plot; x-axis: Thousand Short Tons of Ice Cream (4000 to 7000), y-axis: Aviation Accidents (1200 to 2100)]
Figure 25: Correlation Between Aviation Deaths and Ice Cream Consumption in the US.
See http://tylervigen.com/spurious-correlations for more examples of ridiculous correlations (Vigen, n.d.).
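Any two series that simply trend over the same period will be strongly correlated, whether or not they are related. The Python sketch below makes this concrete with two synthetic series: the numbers are made up for illustration and are not the aviation or ice cream data used in Figure 24, and the names `pearson`, `accidents` and `ice_cream` are hypothetical.

```python
from math import cos, sin, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


years = range(20)
# SYNTHETIC data: a falling trend and a rising trend, each with its own
# unrelated wiggle. Neither series is computed from the other.
accidents = [2000 - 40 * t + 10 * sin(t) for t in years]
ice_cream = [4000 + 150 * t + 50 * cos(2 * t) for t in years]

R = pearson(accidents, ice_cream)
print(f"correlation between two unrelated trends: {R:.3f}")
```

The strongly negative coefficient comes entirely from the shared dependence on time, which is the same mechanism behind the correlation in Figure 25.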
Avoid dual-scaled axes
Stephen Few (2008), a data visualization specialist, presents the following guidelines for using dual-scaled axes:
• Graphs should only include a dual-scaled axis when needed to compare datasets with different units of measure (and even then it is not encouraged).
• Magnitude comparisons between values with different units of measure and scales are not appropriate; for this reason nothing but lines should be used in graphs with dual-scaled axes.
• Given that only the slopes of the lines are meaningful in dual-scaled axes, it is inappropriate to use a dual-scaled axis in a graph that doesn’t display values along an interval scale (such as time).
• Using dual-scaled axes to show more than one quantitative scale encourages people to compare the magnitudes of the values between them, which is meaningless.
Consider Figure 26 as an example of the common issues with dual axes on graphs. Attention is drawn towards the intersections between both lines, which have no real significance. While it is possible to infer some relationship from the plot, a more suitable visualization would use separate panels, as in Figure 24.
[Figure: dual-axis line plot titled "Road Casualties in Great Britain"; x-axis: Year (1969 to 1984); y-axes: Drivers Killed (1100 to 2100) and Petrol Price (0.066 to 0.127)]
Figure 26: Example of Misleading Dual Axis Plot.
The consensus is that perhaps the only acceptable use of dual-scaled axes is to display a rescaling of a single variable, as shown in Figure 27.
[Figure: dual-axis line plot titled "Castor canadensis Body Temperature"; x-axis: Datetime (14:00 12/12/1990 to 10:00 13/12/1990); y-axes: Temperature (°C), 36.2 to 37.6, and Temperature (°F), 97.16 to 99.68]
Figure 27: Example of Acceptable Use of Dual Axes.
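What makes the second axis in Figure 27 acceptable is that it carries no independent data: every tick on it is a fixed function of a tick on the first axis. The Python sketch below derives the Fahrenheit ticks from the Celsius ticks read from the figure using the standard conversion F = C × 9/5 + 32; the names `PRIMARY_TICKS` and `SECONDARY_TICKS` are illustrative.

```python
def celsius_to_fahrenheit(celsius):
    """Standard Celsius to Fahrenheit conversion."""
    return celsius * 9.0 / 5.0 + 32.0


# Tick positions of the Celsius axis as read from Figure 27.
PRIMARY_TICKS = [36.2, 36.4, 36.6, 36.8, 37.0, 37.2, 37.4, 37.6]

# The secondary axis is fully determined by the primary one.
SECONDARY_TICKS = [round(celsius_to_fahrenheit(c), 2) for c in PRIMARY_TICKS]

for c, f in zip(PRIMARY_TICKS, SECONDARY_TICKS):
    print(f"{c:.1f} °C  ->  {f:.2f} °F")
```

Because both axes describe the same line, there are no crossings or relative magnitudes for the audience to misread.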
Declutter your visualizations
In visualizations, less is more. The more elements in a visualization, the harder it is to direct the attention of the audience to what we want to emphasize. Cluttered graphs are harder to interpret. Let’s start by looking at the price of diamonds by carat, color and cut, shown in Figure 28. It is not easy to draw an insight from this plot, given how many elements it has.
[Figure: scatter plot; x-axis: Carat (1 to 5), y-axis: Price (USD), 0 to 20000; legends: Cut (Fair, Good, Very Good, Premium, Ideal), Color (D, E, F, G, H, I, J)]
Figure 28: Diamond Price by Carat, Cut and Color
To remove clutter we can remove levels of color and cut, keeping only the extremes and the class break. The resulting plot can be seen in Figure 29.
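Keeping only the extremes and the class break of an ordered factor is easy to automate. The Python sketch below picks the first, middle and last level from the level orders shown in the legend of Figure 28 and then filters records accordingly; the helper names and the three sample records are hypothetical and are not rows from the diamonds data.

```python
# Ordered levels as listed in the legends of Figure 28.
COLOR_LEVELS = ["D", "E", "F", "G", "H", "I", "J"]
CUT_LEVELS = ["Fair", "Good", "Very Good", "Premium", "Ideal"]


def extremes_and_break(levels):
    """Return the first, middle and last level of an ordered list."""
    if len(levels) < 3:
        return list(levels)
    return [levels[0], levels[len(levels) // 2], levels[-1]]


def declutter(rows, colors, cuts):
    """Keep only the rows whose color and cut are among the kept levels."""
    return [r for r in rows if r["color"] in colors and r["cut"] in cuts]


KEEP_COLORS = extremes_and_break(COLOR_LEVELS)
KEEP_CUTS = extremes_and_break(CUT_LEVELS)

# Hypothetical records, only to show the filter in action.
sample = [
    {"carat": 0.9, "color": "D", "cut": "Ideal", "price": 5000},
    {"carat": 1.2, "color": "E", "cut": "Ideal", "price": 6500},
    {"carat": 2.0, "color": "J", "cut": "Premium", "price": 9000},
]

print(KEEP_COLORS, KEEP_CUTS)
print(declutter(sample, KEEP_COLORS, KEEP_CUTS))
```

The selected levels match the legends of Figure 29: colors D, G and J, and cuts Fair, Very Good and Ideal.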
[Figure: scatter plot; x-axis: Carat (1 to 5), y-axis: Price (USD), 0 to 20000; legends: Color (D, G, J), Cut (Fair, Very Good, Ideal)]
Figure 29: Diamond Price by Carat, Selected Color and Cut
Figure 30, Figure 31 and Figure 32 show sequential decluttering to focus on what is important from the data in a simple visualization.
[Figure: scatter plots in 8 panels, one per clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF); x-axis: Carat, y-axis: Price (USD), 0 to 20000; legends: Color (D, G, J), Cut (Fair, Very Good, Ideal)]
Figure 30: Diamond Price by Carat, Clarity, Selected Color and Cut
[Figure: grid of panels with columns by cut (Fair, Very Good, Ideal) and rows by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF); x-axis: Carat, y-axis: Price (USD), 0 to 20000; legend: Color (D, G, J)]
Figure 31: Diamond Price by Carat, Clarity, Selected Color and Cut, with Regression Lines
[Figure: grid of panels with columns by cut (Fair, Very Good, Ideal) and rows by clarity (I1, VS2, IF); x-axis: Carat, y-axis: Price (USD), 0 to 20000; legend: Color (D, G, J)]
Figure 32: Diamond Price by Carat, Selected Color, Cut and Clarity, with Regression Lines
This simpler visualization allows us to quickly deduce the following insights:
• Larger diamonds (by carat) tend to have lower quality cut and color. The diamonds with ideal cut tend to have a lower range of carats.
• The diamonds’ color changes the rate at which they become more expensive with increasing carat.
• For some clarities and cuts, the price difference between premium color diamonds (D) and good color diamonds (G) is not very significant.
• There are practically no poor cut diamonds with premium clarity and vice versa.
Emphasize what is important
Emphasizing what is important is better suited to explanatory analysis, when we want to convey specific information in a visualization. So far, all of the example plots have been exploratory. In this section, we will look at an example of explanatory analysis while focusing on the importance of emphasis in visualization.
Suppose you’re given exchange rate data versus the US Dollar for some currencies (shown in Table 5) for the year 2016 (Myfxbook Ltd, 2017), and are given the general task of analyzing interesting effects on the Mexican Peso. The dataset contains the change (in percent) in the exchange rate of each currency versus the US Dollar after each closing of the markets.
Table 5: Currencies in the Dataset
Acronym   Currency
BRL       Brazilian Real
CAD       Canadian Dollar
CNY       Chinese Yuan
EUR       Euro
GBP       British Pound
MXN       Mexican Peso
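The quantity stored in the dataset, the percent change with respect to the previous close, can be computed from a series of closing rates as sketched below in Python. The function name `pct_change` and the sample closing rates are hypothetical; they are not values from the Myfxbook data.

```python
def pct_change(closes):
    """Percent change of each closing value with respect to the previous one.

    The result has one element fewer than the input, because the first
    close has no previous value to compare against.
    """
    changes = []
    for previous, current in zip(closes, closes[1:]):
        if previous == 0:
            raise ValueError("cannot compute a change relative to zero")
        changes.append((current - previous) / previous * 100.0)
    return changes


# Hypothetical closing exchange rates versus the US Dollar.
closes = [20.0, 21.0, 18.9]
print([round(change, 2) for change in pct_change(closes)])
```

Working with percent changes instead of raw rates puts currencies with very different price levels on a common scale, which is what allows them to share one axis in the plots that follow.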
You begin by plotting the data for the whole year (Figure 33), and notice unusual behavior around November: there is a very large increase in the exchange rate of both the Mexican Peso and the Brazilian Real. You then decide to zoom in on that month, as shown in Figure 34. The Latin American currencies in the dataset (MXN and BRL) both show a sharp increase between November 8 and November 10.
[Figure: line plot; x-axis: Date (01/06/16 to 01/01/17), y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, CNY, EUR, GBP, MXN)]
Figure 33: Change in Exchange Rates for 2016.
[Figure: line plot; x-axis: Date (01/11/16 to 15/11/16), y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, CNY, EUR, GBP, MXN)]
Figure 34: Change in Exchange Rates for November 2016
The US general election was held on November 8, and some markets, especially in Latin America, reacted negatively between November 8 and 10 to the preliminary results and to Trump becoming president-elect. To restrict the analysis, the Chinese Yuan and British Pound are dropped. The Canadian Dollar is useful to see the reaction in the rest of North America to the election, the Brazilian Real reflects the reaction in South America and the Euro that in Europe. The analysis is thus restricted to the trend in Western countries. The plot of the resulting dataset can be seen in Figure 35.
[Figure: line plot; x-axis: Date (01/11/16 to 15/11/16), y-axis: Pct. change in exchange rate vs USD wrt previous close (−3 to 8); legend: Currency (BRL, CAD, EUR, MXN)]
Figure 35:
So far we’ve been narrowing down the information only to restrict the analysis, and we now have the data we need to show how Trump’s election had an immediate effect on Latin American countries from the moment the polls favored him as president-elect. We now need to modify the plot to emphasize this conclusion and focus it on the Mexican Peso.
There are several ways of emphasizing content:
Remove unnecessary elements from the plot:
Try to remove as much clutter and as many unnecessary elements as possible from the visualization. For this example, these were the actions taken:
• The x and y grid lines were set to blank.
• The top and right borders of the plot were set to blank.
• The frequency of the tick marks on the x-axis was reduced.
• The formatting of the tick labels on the x-axis was changed to avoid having them at an angle.
• The year in the tick labels was dropped as it gives unnecessary information.
• The label on the y-axis was simplified with the inclusion of a title and subtitle.
Focus on increasing the whitespace in the plot:
The color white is your friend when trying to create impactful visualizations. Removing the unnecessary elements in the previous step resulted in more whitespace in the plot. More whitespace helps the audience focus their attention on what we want them to see.
Use color to focus attention on what you want the audience to see first
The Color Brewer schemes are excellent in exploratory analysis; in explanatory analysis you should restrict yourself to 1-2 different colors and use greyscale for non-principal parts of the visualization. In this example these actions were taken:
• The dates corresponding to the elections were highlighted with a grey background.
• The line corresponding to the Mexican Peso was highlighted with a blue color.
• The series for the rest of the currencies were set to greyscale colors.
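The color assignment described in the list above can be expressed as a small mapping from series to colors, which can then be handed to whatever plotting library is in use. The Python sketch below is one way to build it; the function name and the specific hex colors are hypothetical choices, since the original does not state which blue or greys were used.

```python
HIGHLIGHT = "#1f78b4"                      # assumed blue for the focus series
GREYS = ("#4d4d4d", "#7f7f7f", "#b3b3b3")  # assumed greys for the context


def emphasis_colors(series, focus, highlight=HIGHLIGHT, greys=GREYS):
    """Assign the highlight color to the focus series and greys to the rest."""
    if focus not in series:
        raise ValueError(f"{focus!r} is not one of the series")
    colors = {}
    grey_index = 0
    for name in series:
        if name == focus:
            colors[name] = highlight
        else:
            # Cycle through the greys if there are more series than greys.
            colors[name] = greys[grey_index % len(greys)]
            grey_index += 1
    return colors


COLORS = emphasis_colors(["BRL", "CAD", "EUR", "MXN"], focus="MXN")
print(COLORS)
```

Keeping the mapping in one place also helps with the consistency guideline from the Importance of Color section, since every plot in a report can reuse the same assignment.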
Use annotations to reinforce the point of the visualization.
The inclusion of brief, concise text in the form of takeaways helps you reinforce your point and allows the visualization to stand on its own. In this example only a couple of annotations were added:
• The legend was completely replaced by labeling each individual series.
• The takeaway of the visualization is also included in the top right corner of the plot with the same font color as the highlights.
The resulting visualization can be seen in Figure 36. This plot takes into consideration all of the best practices for data visualization seen so far.
[Figure: line plot titled "Change in exchange rate vs USD (2016)", subtitle "With respect to previous close."; x-axis: Nov 01 to Nov 15, y-axis: Percent Change (−3 to 8); series labeled directly: BRL, CAD, EUR, MXN; shaded band labeled "US Elections"; annotation: "LATAM markets suffered a sharp drop as the election favored Trump, reflecting the market's uncertainty over his presidency."]
Figure 36: Correct Use of Emphasis for Visualization
Choosing the correct visualization
Choosing the best visualization for a given dataset can be complex: information can be presented in diverse ways, and there is no definitive guideline on how to choose the best one. One must become familiar with the capabilities of a plotting library to better decide what type of visualization to use. One good resource for selecting visualizations is The Dataviz Catalogue (Ribecca, 2017), which allows selecting an appropriate visualization with a simple-to-use wizard (Figure 37).
Figure 37: Dataviz Catalogue Search Interface
Storytelling with Data
Understand the Context
Tell a Story
Case Studies
Bibliography
Anscombe, F. (1973). Graphs in statistical analysis. The American Statistician, 27 (1), 17–21.
Brewer, C. (2017). Color brewer. Retrieved August 24, 2017, from http://colorbrewer2.org/
Few, S. (2008). Dual-scaled axes in graphs, are they ever the best solution. Retrieved August 25, 2017, from http://www.perceptualedge.com/articles/visual_business_intelligence/dual-scaled_axes.pdf
Hickey, W. (2013). The worst chart in the world. Retrieved August 24, 2017, from http://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Matejka, J., & Fitzmaurice, G. (2017). Datasaurus dozen. Retrieved from https://www.autodeskresearch.com/sites/default/files/The%20Datasaurus%20Dozen.zip
Myfxbook Ltd. (2017). Forex currencies. Retrieved August 26, 2017, from http://www.myfxbook.com/forex-market/currencies
National Review. (2015). The only #climatechange chart you need to see. Retrieved August 24, 2017, from https://twitter.com/nro/status/676516015078039556
Reynolds, P. (1994). Case studies in biometry. John Wiley & Sons.
Ribecca, S. (2017). The data visualization catalogue. Retrieved August 26, 2017, from http://www.datavizcatalogue.com/search.html
U.S. General Services Administration. (2015). Accidents, fatalities, and rates, 1995 through 2014, U.S. general aviation. Retrieved August 25, 2017, from https://catalog.data.gov/dataset/accidents-fatalities-and-rates-1995-through-2014-u-s-general-aviation
U.S. General Services Administration. (2017). Sweetener market data historical deliveries by use - ice cream. Retrieved August 25, 2017, from https://catalog.data.gov/dataset/sweetener-market-data-historical-deliveries-by-use-ice-cream/
Vigen, T. (n.d.). Spurious correlations. Retrieved August 25, 2017, from http://tylervigen.com/spurious-correlations