Post on 15-Dec-2015
Ethics of data representation
v2.0
Data Visualisation Process
• Steps shown in the diagram: Collect Raw Data, Process and Filter Data, Clean Dataset, Exploratory Analysis, Generate Conclusion, Generate Visualisation.
What is Ethics when it comes to data visualisation?
• The figure/graph/image should show what is actually happening and not what you want to happen.
• Different ways of being unethical:
– knowingly:
• deliberately showing the data in a misleading manner,
• choosing the ‘most representative’ image/experiment.
– unknowingly:
• not exploring/getting to know the data well enough,
• misusing your chosen graphical representation.
Cheating knowingly: Choice of graph
• Hypothesis (what you want to see): Applying a treatment will decrease the levels of a variable.
• You know what is going on.
• You choose to plot your data like that.
[Figures: two Before vs. After plots, y-axis 0 to 1400; legend: Exp1, Exp2, Exp3, Exp4]
Cheating knowingly: Choice of axis/scale
• You want to show an increase in salary in the last term.
• You know what is going on.
• You choose to plot your data like that.
[Figures: Salary by month (June to Dec), once with a y-axis from 0 to 25000 and once with a y-axis from 19200 to 20200]
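A minimal sketch of why the y-axis baseline matters for a plot like the salary one. The two salary values are made up for illustration (the actual values on the slide are not recoverable), and `apparent_ratio` is a hypothetical helper that returns how many times taller one bar is drawn than the other for a given axis baseline.

```python
def apparent_ratio(low, high, baseline):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`."""
    if baseline >= low:
        raise ValueError("baseline must be below both values")
    return (high - baseline) / (low - baseline)


# Hypothetical salaries, not taken from the slides.
salary_before = 19500
salary_after = 20000

# Axis limits as on the two salary plots: from 0 and from 19200.
honest = apparent_ratio(salary_before, salary_after, baseline=0)
truncated = apparent_ratio(salary_before, salary_after, baseline=19200)

print(f"axis from 0:     second bar drawn {honest:.2f}x as tall")
print(f"axis from 19200: second bar drawn {truncated:.2f}x as tall")
```

With these made-up numbers, a rise of about 2.6% is drawn as a bar well over twice as tall once the axis starts at 19200.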
Cheating knowingly: Choice of axis/scale
• Be careful with linear vs. logarithmic scale.
• If you want to cheat, a bar graph using a log axis is a great tool, as it lets you either exaggerate differences between groups or minimize them.
[Figure panels: Linear scale, Logarithmic scale]
• A logarithmic axis should be used for:
– lognormal data,
– logarithmically spaced values.
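A rough sketch of how a bar graph on a log axis can shrink or inflate a difference. A bar on a log axis has to start somewhere above zero, and the drawn heights depend on that lower limit. The group values and the helper `bar_height_ratio_log` are assumptions for illustration only.

```python
import math


def bar_height_ratio_log(low, high, axis_min):
    """Ratio of drawn bar heights on a log axis whose lower limit is `axis_min`."""
    if not 0 < axis_min < low <= high:
        raise ValueError("need 0 < axis_min < low <= high")
    return math.log10(high / axis_min) / math.log10(low / axis_min)


# Hypothetical group means: the true ratio between groups is 10x.
group_a, group_b = 10.0, 100.0

minimised = bar_height_ratio_log(group_a, group_b, axis_min=1.0)
exaggerated = bar_height_ratio_log(group_a, group_b, axis_min=9.0)

print(f"log axis from 1: bar B drawn {minimised:.1f}x as tall as bar A")
print(f"log axis from 9: bar B drawn {exaggerated:.1f}x as tall as bar A")
```

The same tenfold difference is drawn as a twofold one when the axis starts at 1, and as a much larger one when the axis starts just below the smaller value.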
Cheating knowingly: Manipulating images: Western blot
• Presenting bands out of context
• ‘Playing’ too much with contrast
• ‘Rebuilding’ a Western blot from several cuts
[Figure panels: Original; Brightness and Contrast Adjusted; Brightness and Contrast Adjusted Too Much: Oversaturation]
Cheating unknowingly: Not exploring/getting to know the data well enough
• Hypothesis: increase from CondA to CondB.
• You run the experiment once and you choose to plot the data as a bar chart.
[Figures: three CondA vs. CondB plots, with y-axes 0 to 70, 0 to 120 and 0 to 120]
Cheating unknowingly: Not exploring/getting to know the data well enough
[Figure: Value for Control, Treatment 1, Treatment 2 and Treatment 3, y-axis 0 to 120; comparisons of treatments vs. control: p=0.04, p=0.32, p=0.001]
[Figure: Value for Control, Treatment 1, Treatment 2 and Treatment 3, y-axis 0 to 140; legend: Exp1, Exp2, Exp3, Exp4, Exp5]
[Figure: Standardised values for Treat1, Treat2 and Treat3, y-axis -100 to 100]
Types of plot: things you can illustrate
Plot types – Distribution/Exploration: Histograms
• Very good for exploring data. Better on big datasets.
• Rules: Number of intervals ≈ √N and Interval width ≈ Range ÷ √N
• Histograms are great but be careful with the resolution (= number of bins) as it affects the shape of the distribution.
• Be careful with the resolution … and the type of data you are dealing with.
[Figures: three histograms, Number of values against bins from 0 to about 10, with Bin width = 1, Bin width = 1.25 and Bin width = 1.5]
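The √N rule and the effect of bin width can be sketched in a few lines. The data below are invented (100 evenly spaced values), and `sqrt_rule` and `bin_counts` are hypothetical helper names; only the rule itself and the three bin widths come from the slides.

```python
import math


def sqrt_rule(values):
    """Number of intervals ~ sqrt(N), interval width ~ range / sqrt(N)."""
    n_bins = max(1, round(math.sqrt(len(values))))
    width = (max(values) - min(values)) / n_bins
    return n_bins, width


def bin_counts(values, width):
    """Count values per bin of the given width, starting at min(values)."""
    if width <= 0:
        raise ValueError("width must be positive")
    start = min(values)
    n_bins = max(1, math.ceil((max(values) - start) / width))
    counts = [0] * n_bins
    for v in values:
        # A value sitting on the top edge goes into the last bin.
        index = min(int((v - start) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data: 100 evenly spaced values between 0.0 and 9.9.
data = [i / 10 for i in range(100)]

n_bins, width = sqrt_rule(data)
print("sqrt rule:", n_bins, "bins of width", round(width, 2))

# Same data, three resolutions (the bin widths shown on the slide).
for w in (1.0, 1.25, 1.5):
    print("bin width", w, "->", bin_counts(data, w))
```

Even with perfectly even data, the wider bins give counts that are no longer all equal, so the apparent shape changes with the resolution.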
Plot types – Distribution/Exploration: Histograms
• Histograms are great but be careful with discrete data.
Plot types – Distribution/Exploration: Boxplots and Bean plots
[Figure: boxplots of Length (cm) for Male and Female, y-axis 60 to 110, annotated with:]
• Maximum
• Upper Quartile (Q3): 75th percentile (3rd quartile)
• Median
• Lower Quartile (Q1): 25th percentile (1st quartile)
• Interquartile Range (IQR): 50% of the data
• Minimum
• Cutoff = Q1 – 1.5*IQR
• Outlier
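A small sketch of the quantities annotated on the boxplot. The lengths are invented, `box_summary` is a hypothetical helper, and the "inclusive" quantile method is just one of several conventions (plotting software may use another, giving slightly different quartiles). The upper cutoff Q3 + 1.5*IQR is the usual counterpart of the lower cutoff and is not stated on the slide.

```python
import statistics


def box_summary(values):
    """Quartiles, IQR, cutoffs and outliers as drawn on a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr  # cutoff given on the slide
    upper_cutoff = q3 + 1.5 * iqr  # usual upper counterpart (assumption)
    outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_cutoff": lower_cutoff,
        "upper_cutoff": upper_cutoff,
        "outliers": outliers,
    }


# Hypothetical lengths in cm, not taken from the slides.
lengths = [62, 85, 88, 90, 92, 94, 95, 97, 99, 101, 104]

summary = box_summary(lengths)
for name, value in summary.items():
    print(name, value)
```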
Plot types – Distribution/Exploration: Boxplots and Bean plots
[Figure: bean plots of Bimodal, Uniform and Normal distributions]
• A bean = a ‘batch’ of data.
• Data density is mirrored by the shape of the polygon.
• The scatterplot shows the individual data points.
• Very good for exploring data. Better on medium-sized datasets.
• Boxplots are great but be careful with the underlying distribution.
Plot types – Exploration/Comparison: Stripcharts/Scatterplots
[Figure: stripchart of Values for Control, CondA, CondB, CondC and CondD, y-axis 0.0 to 2.0]
• Very good for exploring data. Better on small/medium datasets.
• Very informative: exploration AND comparison.
• Very hard to cheat with these.
• Stripcharts are great but they don’t work so well with big samples.
Plot types – Comparisons: Barcharts
[Figures: bar charts of Control, CondA, CondB, CondC and CondD; panels labelled Standard deviation, Standard error, Confidence interval and Star wars (cool graph!)]
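The three kinds of error bar named on the panels can differ a lot in length for the same data. A minimal sketch, assuming invented values and a normal approximation for the confidence interval (for small samples a t-based interval would be wider); `error_bars` is a hypothetical helper.

```python
import math
import statistics


def error_bars(values, confidence=0.95):
    """Half-lengths of SD, SE and (normal-approximation) CI error bars."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation
    se = sd / math.sqrt(n)         # standard error of the mean
    # Normal approximation (assumption): z is about 1.96 for 95%.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "mean": statistics.mean(values),
        "sd": sd,
        "se": se,
        "ci_half_width": z * se,
    }


# Hypothetical measurements, not taken from the slides.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

bars = error_bars(sample)
for name, value in bars.items():
    print(f"{name}: {value:.3f}")
```

For this sample the standard error bar is the shortest and the standard deviation bar the longest, so a graph should always say which one is drawn.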
Plot types – Comparisons: Barcharts
• Very good for presenting results and emphasizing differences.
• Effectiveness: most important info with the most effective channel.
• Barcharts are great but only after data exploration, and the y-axis needs to be chosen wisely.
• Be careful with the scale when plotting ratios.
[Figures: ratio (y-axis 0 to 6) and log2(ratio) (y-axis -3 to 3), both against an x-axis from 0 to 100]
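Why the scale matters for ratios: on a linear axis a halving and a doubling sit at different distances from 1, while on a log2 axis they are symmetric around 0. A short sketch with invented fold changes:

```python
import math

# Hypothetical ratios (fold changes), not taken from the slides.
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]

log_ratios = [math.log2(r) for r in ratios]

for r, lr in zip(ratios, log_ratios):
    # Distance from "no change" on each scale: 1 for ratio, 0 for log2(ratio).
    print(f"ratio={r:<5} |ratio - 1|={abs(r - 1):.2f}   log2(ratio)={lr:+.1f}")
```

On the linear scale the decreases are squeezed between 0 and 1 while the increases have unlimited room; after the log2 transform both directions get equal space.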
Plot types – Relationship/Comparison: Line graphs
Except for exploration …
[Figure: Value for Control, Treatment 1, Treatment 2 and Treatment 3, y-axis 0 to 140]
[Figure: Arbitrary change over time, x-axis 0 to 100, y-axis -2 to 3]
[Figure: Percent survival against Time, x-axis 0 to 60, y-axis 0 to 100; legend: CaPO, CaPA, CaPOA, CaP; annotation: 5 experiments]
• Very good for presenting results of matched/paired/repeated data.
• Linecharts are great but be careful with the axes.
Plot types – Relationships: Scatterplot
• Very good for understanding the relationship between quantitative variables.
Plot types – Relationships: Scatterplots
• Scatterplots are great but big data can be tricky.
• Solution: smoothed densities shown with a colour representation.
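With many points a scatterplot turns into a solid blob, and a density representation recovers the structure. The sketch below uses plain 2D binning (counting points per grid cell) rather than a smoothed density estimate, so it is a simpler stand-in for the technique named on the slide. The data are simulated and `density_grid` is a hypothetical helper; the counts would then be mapped to colours.

```python
import random


def density_grid(points, n_bins, lo, hi):
    """Count points per cell of an n_bins x n_bins grid covering lo..hi on both axes.

    Points outside the range are clamped into the edge cells.
    """
    grid = [[0] * n_bins for _ in range(n_bins)]
    cell = (hi - lo) / n_bins
    for x, y in points:
        col = min(max(int((x - lo) / cell), 0), n_bins - 1)
        row = min(max(int((y - lo) / cell), 0), n_bins - 1)
        grid[row][col] += 1
    return grid


# Simulated data (assumption): 10,000 correlated points.
random.seed(0)
points = []
for _ in range(10_000):
    x = random.gauss(0, 1)
    y = 0.8 * x + random.gauss(0, 0.6)
    points.append((x, y))

grid = density_grid(points, n_bins=8, lo=-3.0, hi=3.0)

# Print with the largest y at the top, like a plot.
for row in reversed(grid):
    print(" ".join(f"{count:5d}" for count in row))
```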
Plot types – Relationships: Heatmaps
• A heatmap is basically a table that has colours in place of numbers.
• Great for big data sets: they allow you to plot a third quantitative value, using a colour scheme for grouping.
• Heatmaps are great but plot data that are changing.
[Figures: Simon’s data, from simple numbers to correlation; panels labelled Euclidean distance, Correlation and Colour scheme]
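The choice between Euclidean distance and correlation changes which rows of a heatmap end up grouped together. A minimal sketch with three invented profiles; `pearson` is a hand-written helper so the example runs on older Python versions.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical profiles across four conditions.
profile_a = [1, 2, 3, 4]      # rising, low values
profile_b = [10, 20, 30, 40]  # same shape as A, ten times larger
profile_c = [4, 3, 2, 1]      # opposite shape to A, similar values

print("Euclidean A-B:", round(math.dist(profile_a, profile_b), 2))
print("Euclidean A-C:", round(math.dist(profile_a, profile_c), 2))
print("Correlation A-B:", round(pearson(profile_a, profile_b), 2))
print("Correlation A-C:", round(pearson(profile_a, profile_c), 2))
```

Euclidean distance puts A next to C (similar magnitudes), whereas correlation puts A next to B (same pattern of change), so the metric should match the question being asked.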
Plot types – Composition: Stack charts/Pie charts
[Figures: pie charts of categories A to E (Total=62); stacked bar chart of Percentage (0 to 100) for Group A and Group B, categories A to E]
• Stack/pie charts are great but keep an eye on the sample size.
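Why the sample size matters: percentages look identical whether they come from 5 observations or 500. A tiny sketch with invented counts; `percentages` is a hypothetical helper.

```python
def percentages(counts):
    """Convert category counts to percentages of the total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Hypothetical groups: same composition, very different sample sizes.
small_group = {"A": 3, "B": 2}
large_group = {"A": 300, "B": 200}

for name, group in (("small", small_group), ("large", large_group)):
    print(name, "n =", sum(group.values()), percentages(group))
```

Both groups give the same stack or pie chart, which is why the total (as in Total=62 on the slide) belongs on the figure.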