Ethics of data representation v2.0

Post on 15-Dec-2015

Data Visualisation Process

• Collect Raw Data
• Process and Filter Data
• Clean Dataset
• Exploratory Analysis
• Generate Conclusion
• Generate Visualisation

What is Ethics when it comes to data visualisation?

• The figure/graph/image should show what is actually happening and not what you want to happen.
• Different ways of being unethical:
  – knowingly:
    • deliberately showing the data in a misleading manner,
    • choosing the ‘most representative’ image/experiment.
  – unknowingly:
    • not exploring/getting to know the data well enough,
    • misusing your chosen graphical representation.

Cheating knowingly: Choice of graph

• Hypothesis (what you want to see): Applying a treatment will decrease the levels of a variable.

[Figures: ‘You know what is going on’ and ‘You choose to plot your data like that’: two Before/After plots, y-axis 0 to 1400; one of them labels the individual experiments Exp1 to Exp4]

Cheating knowingly: Choice of axis/scale

• You want to show an increase in salary in the last term.

[Figures: ‘You know what is going on’ and ‘You choose to plot your data like that’: Salary by month (June to Dec), drawn once with the y-axis from 0 to 25000 and once with the y-axis from 19200 to 20200]
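To see how much the y-axis minimum changes the visual impression, one can compare the ratio of the drawn bar heights under each axis. This is only a minimal sketch: the two salary values and the function name `apparent_ratio` are hypothetical, chosen to fall inside the axis ranges of the salary plots.

```python
def apparent_ratio(first, last, axis_min):
    """Ratio of the drawn heights of two bars when the y-axis starts at axis_min."""
    return (last - axis_min) / (first - axis_min)

# Hypothetical salaries (assumed for illustration, not read from the slides).
june, december = 19500, 20000

full_axis = apparent_ratio(june, december, axis_min=0)
truncated_axis = apparent_ratio(june, december, axis_min=19200)

print(f"y-axis from 0:     December bar is {full_axis:.2f}x the June bar")
print(f"y-axis from 19200: December bar is {truncated_axis:.2f}x the June bar")
```

With these assumed numbers, an increase of under 3% is drawn as a bar more than two and a half times taller once the axis starts at 19200.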

Cheating knowingly: Choice of axis/scale

• Be careful with linear vs. logarithmic scales.

Cheating knowingly: Choice of axis/scale

• If you want to cheat, a bar graph using a log axis is a great tool, as it lets you either exaggerate differences between groups or minimize them.

[Figure panels: Linear scale; Logarithmic scale]
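The reason a bar graph on a log axis is so easy to abuse is that the bar length depends on where the axis starts, which is arbitrary because a log axis has no zero. The sketch below illustrates this with hypothetical group values; `log_bar_length` is an illustrative helper, not part of any plotting library.

```python
import math

def log_bar_length(value, axis_min):
    """Length of a bar (in decades) on a log10 axis whose lower limit is axis_min."""
    return math.log10(value) - math.log10(axis_min)

# Hypothetical group means (assumed for illustration).
group_a, group_b = 10.0, 1000.0

linear_ratio = group_b / group_a
ratio_axis_from_1 = log_bar_length(group_b, 1.0) / log_bar_length(group_a, 1.0)
ratio_axis_from_0_001 = log_bar_length(group_b, 0.001) / log_bar_length(group_a, 0.001)

print("linear axis: bar B is", linear_ratio, "times bar A")
print("log axis starting at 1: bar B is", ratio_axis_from_1, "times bar A")
print("log axis starting at 0.001: bar B is", ratio_axis_from_0_001, "times bar A")
```

The same 100-fold difference is drawn as a 3-fold or a 1.5-fold difference in bar length, depending only on the chosen axis minimum.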

Cheating knowingly: Choice of axis/scale

• Logarithmic axis should be used for:
  – Lognormal data
  – Logarithmically spaced values

Cheating knowingly: Manipulating images: Western blot

• Presenting bands out of context
• ‘Playing’ too much with contrast
• ‘Rebuilding’ a Western blot from several cuts

[Figure panels: Original; Brightness and Contrast Adjusted; Brightness and Contrast Adjusted Too Much: Oversaturation]
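One way to check whether a contrast adjustment has gone too far is to count how many pixels end up clipped at the maximum value, since those pixels no longer carry any information. The sketch below assumes 8-bit intensities and a simple linear gain; the pixel values and both helper functions are hypothetical.

```python
def adjust(pixels, gain, offset=0, max_value=255):
    """Apply a linear brightness/contrast change and clip to the valid range."""
    return [min(max_value, max(0, round(p * gain + offset))) for p in pixels]

def saturated_fraction(pixels, max_value=255):
    """Fraction of pixels sitting at the maximum value, where detail is lost."""
    return sum(p >= max_value for p in pixels) / len(pixels)

# Hypothetical 8-bit intensities (assumed, not from a real blot).
original = [12, 40, 90, 130, 180, 200, 220, 240]
mild = adjust(original, gain=1.05)
strong = adjust(original, gain=2.0)

print("mild adjustment, saturated fraction:", saturated_fraction(mild))
print("strong adjustment, saturated fraction:", saturated_fraction(strong))
```

In the strongly adjusted version, five originally distinct intensities collapse to the same value, which is the loss of information the oversaturated panel illustrates.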

Cheating unknowingly: Not exploring/getting to know the data well enough

• Hypothesis: increase from CondA to CondB. You run the experiment once and you choose to plot the data as a bar chart.

[Figures: three CondA vs. CondB plots, with y-axes 0 to 70, 0 to 120 and 0 to 120]
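A bar chart of means can show an increase that the underlying values do not support. As a minimal sketch with hypothetical measurements, a single extreme value is enough to raise the mean of CondB while the typical value is unchanged:

```python
import statistics

# Hypothetical measurements (assumed for illustration, not taken from the slides).
cond_a = [20, 21, 22, 23, 24]
cond_b = [20, 21, 22, 25, 112]  # one extreme value

for name, values in (("CondA", cond_a), ("CondB", cond_b)):
    print(
        name,
        "mean =", statistics.mean(values),
        "median =", statistics.median(values),
        "range =", (min(values), max(values)),
    )
```

Looking at the individual values, or at least the median and range, before choosing a bar chart would reveal that the apparent increase comes from one observation.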

Cheating unknowingly: Not exploring/getting to know the data well enough

Comparisons: Treatments vs. Control

[Figure: Value (0 to 120) for Control, Treatment 1, Treatment 2 and Treatment 3, annotated with p=0.04, p=0.32 and p=0.001]

[Figure: Value (0 to 140) for Control, Treatment 1, Treatment 2 and Treatment 3, with the individual experiments Exp1 to Exp5 labelled]

[Figure: Standardised values (-100 to 100) for Treat1, Treat2 and Treat3]
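The slides do not say how the standardised values were computed. One possibility, assumed here purely for illustration, is to subtract the control from each treatment within the same experiment, which removes the experiment-to-experiment variation. All values and the helper name below are hypothetical.

```python
# Hypothetical values per experiment (assumed, not taken from the slides).
experiments = {
    "Exp1": {"Control": 100, "Treat1": 120},
    "Exp2": {"Control": 40, "Treat1": 62},
    "Exp3": {"Control": 70, "Treat1": 88},
}

def differences_from_control(experiments, treatment):
    """Treatment minus control, computed within each experiment."""
    return {name: run[treatment] - run["Control"] for name, run in experiments.items()}

diffs = differences_from_control(experiments, "Treat1")
raw_controls = [run["Control"] for run in experiments.values()]

print("spread of raw controls:", max(raw_controls) - min(raw_controls))
print("differences from control:", diffs)
```

In this made-up example the raw values vary widely between experiments, yet the within-experiment differences are consistent, which a plot of pooled raw values would hide.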

Types of plot: things you can illustrate

Plot types – Distribution/Exploration: Histograms

• Very good for exploring data. Better on big datasets.
• Rules: Number of intervals ≈ √N and Interval width ≈ Range ÷ √N
• Histograms are great but be careful with the resolution (= number of bins) as it affects the shape of the distribution.
• Be careful with the resolution … and the type of data you are dealing with.
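The square-root rule above can be sketched in a few lines. The function name `sqrt_rule` and the example data are assumptions for illustration; the rule is only a starting point, and the bin count should still be varied while exploring.

```python
import math

def sqrt_rule(values):
    """Suggested number of intervals (about sqrt(N)) and interval width (Range / sqrt(N))."""
    n = len(values)
    n_intervals = max(1, round(math.sqrt(n)))
    width = (max(values) - min(values)) / math.sqrt(n)
    return n_intervals, width

# Hypothetical dataset: 100 values spanning 0 to 99.
values = list(range(100))
n_intervals, width = sqrt_rule(values)
print("intervals:", n_intervals, "width:", width)
```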

[Figures: three histograms (y-axis: Number of values) with Bin width = 1, Bin width = 1.25 and Bin width = 1.5]

Plot types – Distribution/Exploration: Histograms

• Histograms are great but be careful with discrete data.
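The problem with discrete data is that a bin width which does not match the spacing of the values puts two values into some bins and one into others, creating peaks that are not in the data. A minimal sketch, with a hypothetical `bin_counts` helper and one observation of each integer from 0 to 9:

```python
from collections import Counter

def bin_counts(values, width, start=0.0):
    """Count values per bin [start + i*width, start + (i+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    return [counts.get(i, 0) for i in range(max(counts) + 1)]

# Hypothetical discrete data: each integer from 0 to 9 observed exactly once.
values = list(range(10))

flat = bin_counts(values, width=1.0)
bumpy = bin_counts(values, width=1.25)

print("bin width 1:   ", flat)
print("bin width 1.25:", bumpy)
```

The data are perfectly uniform, but with a bin width of 1.25 the first and fifth bins appear twice as high as the others.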

Plot types – Distribution/Exploration: Boxplots and Bean plots

[Figure: boxplots of Length (cm), 60 to 110, for Male and Female, annotated with:]

• Maximum
• Upper Quartile (Q3): 75th percentile (3rd quartile)
• Median
• Lower Quartile (Q1): 25th percentile (1st quartile)
• Interquartile Range (IQR): 50% of the data
• Minimum
• Cutoff = Q1 – 1.5*IQR
• Outlier

Plot types – Distribution/Exploration: Boxplots and Bean plots

[Figure: bean plots of Bimodal, Uniform and Normal distributions]

• A bean = a ‘batch’ of data
• Data density mirrored by the shape of the polygon
• Scatterplot shows individual data

• Very good for exploring data. Better on medium-size datasets.
• Boxplots are great but be careful with the underlying distribution.
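The quantities annotated on the boxplot can be computed directly. The sketch below uses Python's `statistics.quantiles` with the inclusive method; plotting packages may use slightly different quartile definitions, and the upper cutoff Q3 + 1.5*IQR is the usual counterpart of the lower cutoff shown on the slide. The data and the function name are hypothetical.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, IQR, 1.5*IQR cutoffs and the values falling outside them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_cutoff or v > high_cutoff]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "low_cutoff": low_cutoff, "high_cutoff": high_cutoff,
        "outliers": outliers,
    }

# Hypothetical lengths in cm (assumed, not read from the figure).
lengths = [62, 78, 80, 82, 84, 86, 88, 90, 92]
summary = boxplot_summary(lengths)
print(summary)
```

Note that this summary is the same for many differently shaped distributions (for example a bimodal one), which is why the underlying distribution should be checked as well.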

Plot types – Exploration/Comparison: Stripcharts/Scatterplots

[Figure: stripchart of Values (0.0 to 2.0) for Control, CondA, CondB, CondC and CondD]

• Very good for exploring data. Better on small/medium datasets.
• Very informative: exploration AND comparison.
• Very hard to cheat with these.
• Stripcharts are great but they don’t work so well with big samples.

Plot types – Comparisons: Barcharts

[Figures: barcharts of Control, CondA, CondB, CondC and CondD (value axes 0.0 to 3.0 or 4.0), with panels labelled Standard deviation, Standard error, Confidence interval and ‘Star wars’ (cool graph!)]
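The three kinds of error bar named on these panels have very different lengths for the same data, so a figure should always say which one is drawn. A minimal sketch follows; the sample is hypothetical, and the confidence interval uses a normal approximation, which is an assumption made to stay within the standard library (for small samples a t-distribution critical value would give a wider interval).

```python
import math
import statistics

def error_bars(values, confidence=0.95):
    """Standard deviation, standard error and CI half-width (normal approximation)."""
    n = len(values)
    sd = statistics.stdev(values)        # sample standard deviation
    sem = sd / math.sqrt(n)              # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return {"sd": sd, "sem": sem, "ci_half_width": z * sem}

# Hypothetical sample (assumed for illustration).
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
bars = error_bars(sample)
print(bars)
```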

Plot types – Comparisons: Barcharts

• Be careful with the scale when plotting ratios.

[Figures: ratio (0 to 6) and log2(ratio) (-3 to 3), each plotted against an x-axis from 0 to 100]

• Very good for presenting results and emphasizing differences.
• Effectiveness: most important info with the most effective channel.
• Barcharts are great but only after data exploration, and the y-axis needs to be chosen wisely.
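The scale matters for ratios because, on a linear axis, decreases are squeezed between 0 and 1 while increases are unbounded. Taking log2 makes a 4-fold decrease and a 4-fold increase equally far from "no change". A short sketch with hypothetical ratios:

```python
import math

# Hypothetical fold changes (assumed for illustration).
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]

linear_distance = [r - 1.0 for r in ratios]   # distance from "no change" on a linear axis
log2_ratios = [math.log2(r) for r in ratios]  # distance from "no change" on a log2 axis

print("linear:", linear_distance)
print("log2:  ", log2_ratios)
```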

Plot types – Relationship/Comparison: Line graphs

Except for exploration …

[Figures: Value (0 to 140) for Control, Treatment 1, Treatment 2 and Treatment 3 across 5 experiments; Arbitrary change over time (-2 to 3) against an x-axis from 0 to 100; Percent survival (0 to 100) against Time (0 to 60) for CaPO, CaPA, CaPOA and CaP]

• Very good for presenting results of matched/paired/repeated data.
• Linecharts are great but be careful with the axes.

Plot types – Relationships: Scatterplot

• Very good for understanding the relationship between quantitative variables.

Plot types – Relationships: Scatterplots

• Scatterplots are great but big data can be tricky.
• Solution: a smoothed-density colour representation
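With many points, overplotting hides how dense each region is. The idea behind a density representation can be sketched by counting points per grid cell and colouring by count. Note that this is a crude 2D binning, not a true smoothed density estimate, and the points and the `grid_density` helper are hypothetical.

```python
from collections import Counter

def grid_density(points, cell_size):
    """Count points per square cell of side cell_size (cells indexed by integer pairs)."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

# Hypothetical points (assumed for illustration).
points = [
    (0.1, 0.2), (0.3, 0.4), (0.2, 0.9),
    (1.5, 1.5), (1.6, 1.7), (1.9, 1.1), (1.2, 1.8),
    (2.5, 0.5),
]
density = grid_density(points, cell_size=1.0)

for cell, count in sorted(density.items()):
    print(cell, count)
```

The counts per cell can then be mapped to a colour scale instead of drawing every point.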

Plot types – Relationships: Heatmaps

• Great for big data sets; they allow a third quantitative value to be plotted: colour scheme for grouping.

[Figure panels: Euclidean distance; Correlation; Colour scheme]

• Heatmaps are great but plot only data that are changing.

A heatmap is basically a table that has colors in place of numbers.

Simon’s data from simple numbers to correlation
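The choice between Euclidean distance and correlation changes which rows of a heatmap are grouped together: Euclidean distance groups rows of similar magnitude, while correlation groups rows of similar shape. A minimal sketch with hypothetical rows and a hand-written `pearson` helper:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical heatmap rows (assumed for illustration).
row_a = [1.0, 2.0, 3.0, 4.0]
row_b = [10.0, 20.0, 30.0, 40.0]  # same shape as row_a, ten times larger
row_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude to row_a, opposite shape

euclid_ab = math.dist(row_a, row_b)
euclid_ac = math.dist(row_a, row_c)
corr_dist_ab = 1 - pearson(row_a, row_b)
corr_dist_ac = 1 - pearson(row_a, row_c)

print("Euclidean:   a-b =", round(euclid_ab, 2), " a-c =", round(euclid_ac, 2))
print("Correlation: a-b =", round(corr_dist_ab, 2), " a-c =", round(corr_dist_ac, 2))
```

By Euclidean distance row_a is closest to row_c; by correlation distance it is closest to row_b.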

Plot types – Composition: Stack charts/Pie charts

[Figures: pie charts of categories A to E (Total=62); stacked bar chart of Percentage (0 to 100) for Group A and Group B, categories A to E]

• Stack/pie charts are great but keep an eye on the sample size.
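Percentages alone hide the sample size: the same pie or stack chart can come from four observations or four hundred. One simple safeguard, sketched here with hypothetical counts and a hypothetical helper, is to always report the counts next to the percentages.

```python
def percentages_with_n(counts):
    """Format each category as a percentage together with its count and the total."""
    total = sum(counts.values())
    return {k: f"{100 * c / total:.0f}% (n={c} of {total})" for k, c in counts.items()}

# Hypothetical category counts (assumed for illustration).
small = {"A": 2, "B": 1, "C": 1}
large = {"A": 200, "B": 100, "C": 100}

print(percentages_with_n(small))
print(percentages_with_n(large))
```

Both groups would produce identical charts, but only the second supports a confident statement about the proportions.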