
Meelis Kull - Autumn 2017 - MTAT.03.183 - Data Mining - Lecture 03

Meelis Kull

[email protected]

Autumn 2017



Demo: Data science mini-project

CRISP-DM: cross-industry standard process for data mining

Data understanding: Types of data

Data understanding: First look at attributes

Types of attributes

First look at a nominal attribute

First look at an ordinal attribute

First look at a numeric attribute

First look at an ordinal attribute




• Everything that applies to nominal attributes applies here as well

• Make sure that the order of the values is retained in histograms

• Additional ways to look at the attribute:

– Calculate the min, max and median of the attribute

– More generally, calculate quantiles
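A minimal sketch in base R of such a first look at an ordinal attribute. The education values and their order are made up for illustration; in practice the order should come from the meta-data.

```r
# Hypothetical ordinal attribute; the level order is an assumption.
education_levels <- c("11th", "HS-grad", "Bachelors", "Masters", "Doctorate")
education <- factor(
  c("HS-grad", "Bachelors", "11th", "HS-grad", "Masters", "HS-grad", "Bachelors"),
  levels = education_levels, ordered = TRUE
)

# Counts come out in level order, including levels absent from the data.
counts <- table(education)
print(counts)

# min and max are defined for ordered factors.
lowest <- min(education)
highest <- max(education)

# For ordered factors quantile() needs a type that returns an observed value.
middle <- quantile(education, probs = 0.5, type = 1)
print(middle)
```

Because the factor carries its levels, a bar chart built from `counts` keeps the intended order and shows an empty bar for the unused level.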


• There are 3 quartiles: lower quartile, median, upper quartile

• They divide a sorted data set into 4 equal parts

• Percentiles divide it into 100 equal parts

– Lower quartile is the 25th percentile

– Median is the 50th percentile

• Deciles divide it into 10 equal parts

• Quantiles generalise to any location over sorted data:

– Lower quartile = 25th percentile = quantile at 0.25

– 4th decile = quantile at 0.4
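The definitions above can be tried out with base R's `quantile()`. The attribute below (integers 0 to 100) is a toy assumption, chosen so that the quantile at p is simply 100·p.

```r
# Toy attribute with values 0, 1, ..., 100 (an assumption for illustration).
x <- 0:100

quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

# Lower quartile = 25th percentile = quantile at 0.25.
lower_quartile <- quantile(x, probs = 0.25)
# 4th decile = quantile at 0.4.
fourth_decile <- quantile(x, probs = 0.4)

print(quartiles)
print(deciles)
```

On real data the result usually falls between two observed values; `quantile()` interpolates by default, and its `type` argument selects other conventions.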


A. 1st and 2nd decile

B. 2nd and 3rd decile

C. 3rd and 4th decile

D. 4th and 5th decile

E. 5th and 6th decile

F. 6th and 7th decile

G. 7th and 8th decile

H. 8th and 9th decile



A. Nominal attributes

B. Ordinal attributes

C. Both

D. Neither

E. Not sure


• The same aspects as for nominal attributes

• Additionally:

– Is the order specified in the documentation (meta-data)?

– If all possible values are specified in the meta-data:

• Check whether all values are present in the data

• If not, make sure that the empty bars in histograms are in the right location according to the order




First look at a numeric attribute



• Is it really numeric?

– E.g. if it contains only the values 0.0 and 1.0, then perhaps nominal is more appropriate

• Is it discrete (only integers)?

– E.g. if it contains integers from 0 to 100, then the first look can be done as if it were an ordinal attribute

• Calculate quantiles as for ordinal attributes

• Calculate the mean (arithmetic average): (x_1 + x_2 + … + x_n) / n


• Is it really numeric?

• Is it discrete (only integers)?

• Calculate quantiles as for ordinal attributes

• Calculate the mean (arithmetic average)

• Plot a histogram (with default binning)

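The checklist above could be sketched in base R as follows. The `hours` vector is hypothetical data, not taken from the lecture dataset.

```r
# Hypothetical numeric attribute (hours worked per week).
hours <- c(40, 40, 38, 45, 20, 60, 40, 35)

# Is it really numeric, or just a few codes? Count the distinct values.
n_distinct <- length(unique(hours))

# Is it discrete (only integers)?
only_integers <- all(hours == round(hours))

# Quantiles as for ordinal attributes, plus the mean.
five_numbers <- quantile(hours, probs = c(0, 0.25, 0.5, 0.75, 1))
average <- mean(hours)

# Histogram with default binning; plot = FALSE returns the bins without drawing.
h <- hist(hours, plot = FALSE)
print(five_numbers)
print(h$counts)
```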


• Check if some value occurs many times

– With real numbers, often no number occurs more than once

– If a value is frequent, then:

• It may denote missingness (e.g. value 0) and should then be treated as N/A (not available, missing)

• It may denote an extremal value (all more extreme values are represented as this value)

• Check for rounding

– E.g. if all numbers are rounded to the 2nd digit and one has more digits, then it might be a typing error
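A small sketch of these two checks in base R. The data are invented; that 0 codes a missing value and 99999 a cap are assumptions made for this toy example only.

```r
# Hypothetical attribute: 0 looks like a missing-value code, 99999 like a cap.
x <- c(0, 0, 0, 12.5, 7.25, 0, 3.75, 99999, 99999, 4.123)

# Check if some value occurs many times.
counts <- sort(table(x), decreasing = TRUE)
most_frequent <- as.numeric(names(counts)[1])

# If the frequent value denotes missingness, treat it as N/A.
x_clean <- replace(x, x == 0, NA)

# Check for rounding: which values have more than 2 decimal digits?
too_precise <- x[abs(x - round(x, 2)) > 1e-9]

print(counts)
print(too_precise)
```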


A. Nominal attributes

B. Ordinal attributes

C. Numeric attributes

D. All of the above

E. None of the above

F. Not sure






• Data understanding: distribution of attributes

• Types of histograms

• How to describe probability distributions?

• Some standard probability distributions

• More ways to visualise distributions

• Visualising relations of attributes

• Are the attributes related?




Data understanding: distribution of attributes



• We have now looked at each attribute

• A key question we can now answer: What do the items in this dataset look like?

• We now have at least 3 answers to this:


• What do the items in this dataset look like?

• Let us just pick one as an example:

– Age = 22

– Workclass = Private

– Education = 11th

– Occupation = Other-service

– Capital.gain = 0

– ...



• Let us just provide the ranges of attributes

– Age: {17,18,19,…,90}

– Workclass: {Federal-gov, Local-gov, Private, …}

– Education: {1st-4th, 5th-6th, …, Doctorate}

– Occupation: {Adm-clerical, Exec-managerial, …}

– Capital.gain: [0,99999]

– ...



• Let us just provide the histograms

– Age: <histogram>

– Workclass: <histogram>

– Education: <histogram>

– Occupation: <histogram>

– Capital.gain: <histogram>

– ...

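The three kinds of answer (one example item, the ranges, the distributions) can each be produced with one or two base R calls. The data frame below is a tiny made-up stand-in for the lecture dataset; only the column names follow the slides.

```r
# Tiny made-up stand-in for the dataset used in the slides.
data <- data.frame(
  age = c(22, 37, 51, 17, 90),
  workclass = factor(c("Private", "Local-gov", "Private", "Federal-gov", "Private")),
  capital.gain = c(0, 0, 99999, 0, 5000)
)

# Answer 1: pick one item as an example.
example_item <- data[1, ]

# Answer 2: ranges of the attributes.
age_range <- range(data$age)
workclass_values <- levels(data$workclass)

# Answer 3: distribution of each attribute (the counts behind a histogram).
workclass_counts <- table(data$workclass)

print(example_item)
print(workclass_counts)
```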



Types of histograms



• Histograms on discrete data

– Nominal

– Ordinal

– Numeric with few different values

• E.g. small number of different integers

• Histograms on continuous data

– Numeric with many different values



• Frequency histogram

– Frequency = count of items with each value


ggplot(data) + geom_histogram(aes(x=age),stat="count")



• Frequency histogram

– Frequency = count of items with each value

ggplot(data) + geom_histogram(aes(x=workclass),stat="count")


• Relative frequency histogram

– Relative frequency = proportion (0..1) or percentage (0..100%) of items with each value

– Heights of bars sum up to 1

ggplot(data) + geom_bar(aes(x=workclass,..count../sum(..count..)),stat="count")


• Depends on the goal

• Frequency histogram

– Gives actual counts in the data

• Relative frequency histogram

– Gives proportions in the data

– Interpretable as the probability distribution of a randomly chosen item
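The numbers behind both kinds of histogram can be computed without plotting. A minimal sketch in base R, using a made-up `workclass` vector:

```r
# Hypothetical nominal attribute.
workclass <- c("Private", "Private", "Local-gov", "Private",
               "Federal-gov", "Local-gov", "Private", "Private")

# Frequency = count of items with each value.
frequency <- table(workclass)

# Relative frequency = proportion of items with each value; sums to 1.
relative_frequency <- prop.table(frequency)
percentage <- 100 * relative_frequency

print(frequency)
print(relative_frequency)
```

The relative frequencies can be read as the probability of each value for an item drawn uniformly at random from the data.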


• Continuous attribute:

– Usually the value is different in each item

– Need to introduce bins (a.k.a. intervals, ranges)

– Histogram is not informative without bins:

ggplot(data) + geom_histogram(aes(x=factor(salaries)),stat="count")


• Frequency histogram of binned data

– Frequency = count of items in each bin


ggplot(data) + geom_histogram(aes(x=salaries),stat="bin",boundary=0,binwidth=10000)


• Relative frequency histogram of binned data

– Relative frequency = proportion of items in bins

– Heights of bars add up to 1


ggplot(data) + geom_histogram(aes(x=salaries,..count../sum(..count..)),stat="bin",boundary=0,binwidth=10000)


• Density histogram of binned data

– Density = Y-axis scaled such that the areas of the bars in the histogram add up to 1

– Density scale is invariant to the sizes of bins

ggplot(data) + geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=10000)







+ geom_histogram(aes(x=salaries,..density..),stat="bin",boundary=0,binwidth=1000,alpha=0,colour="red")







+ geom_histogram(aes(x=salaries),stat="density",colour="blue")
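To see why the density scale does not depend on the bin width, the three scales can be computed with base R's `hist()` instead of ggplot. This is a sketch on invented salaries; note that `hist()` uses right-closed bins by default.

```r
# Hypothetical salaries.
salaries <- c(12000, 18500, 23000, 27000, 31000, 34000, 36000, 41000, 52000, 78000)

# The same data binned with two different bin widths.
wide <- hist(salaries, breaks = seq(0, 80000, by = 10000), plot = FALSE)
narrow <- hist(salaries, breaks = seq(0, 80000, by = 5000), plot = FALSE)

# Frequency and relative frequency of the wide bins.
frequency <- wide$counts
relative_frequency <- wide$counts / sum(wide$counts)

# Density: bar height times bar width, summed over the bars, gives 1.
area_wide <- sum(wide$density * diff(wide$breaks))
area_narrow <- sum(narrow$density * diff(narrow$breaks))

print(frequency)
print(c(area_wide, area_narrow))
```

Halving the bin width roughly halves the counts per bin, but the density heights stay on the same scale, which is why histograms with different bin widths can be overlaid.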


• A continuous distribution is represented by its probability density function (pdf)

• Area under the curve is equal to 1

• Areas represent probabilities

Area = P(a<X<b) = probability that X is between a and b
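As a numerical illustration, the area under a pdf can be computed with `integrate()`. The standard normal density `dnorm` is used here only as a convenient example distribution; the interval [-1, 1] is an arbitrary choice.

```r
# Total area under the standard normal pdf.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# P(a < X < b) as the area under the pdf between a and b.
a <- -1
b <- 1
area_ab <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function.
prob_ab <- pnorm(b) - pnorm(a)

print(c(total_area, area_ab, prob_ab))
```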


How to describe probability distributions?



• Statistic – a measure calculated from all values

• Mode of the distribution:

– The most probable value (or values)

• The most frequent value if discrete

• Value with highest density if continuous

• Median and other quantiles

– Defined earlier

• Mean of the distribution:

– Average value of the attribute

• Arithmetic average if discrete

• Expected value if continuous

– Centre of mass of the distribution


• Consider an attribute with values:

– 1,2,2,2,3,3,4,7

• Mode:

– 2, because it occurs three times

• Median:

– 2.5, because 2 and 3 are in the middle, (2+3)/2 = 2.5

• Mean:

– 3.0, because (1+2+2+2+3+3+4+7)/8 = 24/8 = 3.0
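The same example in base R. There is no built-in function for the statistical mode (`mode()` in R returns the storage type), so it is derived from a frequency table here.

```r
x <- c(1, 2, 2, 2, 3, 3, 4, 7)

# Mode: the most frequent value, taken from the frequency table.
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

median_value <- median(x)
mean_value <- mean(x)

print(c(mode = mode_value, median = median_value, mean = mean_value))
```

If several values are tied for the highest count, `which.max` returns only the first of them.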


A. 0

B. 0.5

C. 1

D. 1.5




A. 0

B. 0.5

C. 1

D. 1.5




A. 0

B. 0.5

C. 1

D. 1.5




• Variance of a distribution is:

– The average squared deviation from the mean

• Standard deviation is the square root of the variance

– Root-mean-square deviation from the mean

• Example:

– Values: 1,2,2,2,3,3,4,7

– Mean: 3.0

– Variance: ((1-3)^2+(2-3)^2+…+(7-3)^2) / 8 = 24/8 = 3.0

– Standard deviation: sqrt(3.0) = 1.732…
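The example can be reproduced in base R. One caveat worth knowing: R's built-in `var()` and `sd()` divide by n-1 (the sample variance), so they do not return the value computed above, which divides by n.

```r
x <- c(1, 2, 2, 2, 3, 3, 4, 7)

# Variance as defined above: average squared deviation from the mean.
variance <- mean((x - mean(x))^2)   # 24 / 8
std_dev <- sqrt(variance)

# Built-in var() divides by n - 1 instead of n.
sample_variance <- var(x)           # 24 / 7

print(c(variance = variance, std_dev = std_dev, sample_variance = sample_variance))
```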


• Symmetric

• Skewed / right-skewed / left-skewed

• Heavy-tailed

• Bimodal

• Multi-modal

• Capped / right-capped / left-capped



• Simple definitions, not fully correct:

– Unimodal – 1 mode

– Bimodal – 2 modes

– Multimodal – multiple modes

• Actually:

– Bimodal usually means that the density (pdf) has two local maxima

– Multimodal means that the pdf has multiple local maxima (2 or more)


• Symmetric distribution:

– Symmetric around a vertical axis of symmetry

• Left and right side are mirror images of each other

• Left- (or right-)skewed distribution:

– Non-symmetric and mean below (above) mode

• Usually used only for unimodal distributions

<figure: positively (right-) skewed, negatively (left-) skewed and symmetric distributions>



Some standard probability distributions



• Statisticians have names for many different families of probability distributions

• Why do we need to know some of them?

– In practice these are used to communicate the distribution without having to visualise it

• We will talk about:

– Uniform distributions

– Normal distributions

– Power law distributions


• All options are equally probable



• All values in [a,b] are equally probable:

– The pdf is constant between a and b, and 0 elsewhere

<figure: a distribution annotated with its central tendency and its data dispersion (spread)>


• N(mean,variance)

• The most common non-uniform continuous distribution

– Why?

– The sum of many independent and identically distributed (i.i.d.) random variables with finite variance is approximately normally distributed

• Standard normal distribution: N(0,1)
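The claim about sums of i.i.d. variables can be checked with a small simulation. This sketch sums 30 uniform random numbers many times; the number of terms, the number of repetitions and the seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Each sum adds up 30 i.i.d. values drawn uniformly from [0, 1].
n_terms <- 30
sums <- replicate(10000, sum(runif(n_terms)))

# A uniform [0, 1] variable has mean 1/2 and variance 1/12.
expected_mean <- n_terms * 0.5
expected_var <- n_terms / 12

# For a normal distribution about 68% of values lie within 1 standard deviation.
within_one_sd <- mean(abs(sums - expected_mean) <= sqrt(expected_var))

print(c(mean(sums), var(sums), within_one_sd))
```

Plotting `hist(sums)` shows the familiar bell shape even though each individual term is uniform.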


• Heavy-tailed (or long-tailed) distribution

– Very high or low values are likely

• More likely than in case of normal distribution

– Technical definition:

• Tails are not exponentially bounded



• Alternative names:

– scale-free / scale-independent

• Distributions on positive real values

• The probability of an M times bigger value is K times smaller (power law; M and K are parameters)

• The pdf descends linearly when drawn in log-log scale (both x and y logarithmic)

• Examples:

– Number of connections in many real-world graphs

– Frequencies of words in languages

– Sizes of craters on the moon
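Both properties can be verified on a concrete power-law density. The sketch below assumes the pdf p(x) = (alpha - 1) x^(-alpha) for x ≥ 1 with a hypothetical exponent alpha = 2.5; for such a density, K = M^alpha regardless of where on the axis the comparison is made.

```r
alpha <- 2.5  # hypothetical exponent

# Power-law pdf on x >= 1 (an assumed parametrisation).
power_law_pdf <- function(x) (alpha - 1) * x^(-alpha)

# An M times bigger value is K times less probable, wherever we start.
M <- 10
K_at_1 <- power_law_pdf(1) / power_law_pdf(M * 1)
K_at_5 <- power_law_pdf(5) / power_law_pdf(M * 5)

# In log-log scale the pdf is a straight line with slope -alpha.
x <- 10^seq(0, 3, by = 0.5)
fit <- lm(log10(power_law_pdf(x)) ~ log10(x))
slope <- unname(coef(fit)[2])

print(c(K_at_1 = K_at_1, K_at_5 = K_at_5, slope = slope))
```

The fact that K does not depend on the starting value is what the name "scale-free" refers to.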


A. Uniform

B. Unimodal symmetric

C. Unimodal right-skewed

D. Unimodal left-skewed

E. Bimodal symmetric

F. Bimodal asymmetric

G. Multi-modal symmetric

H. Multi-modal asymmetric

I. Normally distributed

J. Power-law distributed



More ways to visualise distributions



• Violin plots compactly visualise many distributions

– Density plots rotated by 90º and mirrored

– Distribution of salaries for each education level

ggplot(data) + geom_violin(aes(x=education,y=salaries))


• Box plots compactly visualise many distributions

– Marked: median, upper and lower quartile, and outliers (in R ggplot an outlier is a point more than 1.5x the inter-quartile range away from the nearest quartile)

ggplot(data) + geom_boxplot(aes(x=education,y=salaries))
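The 1.5x inter-quartile range rule can be spelled out by hand. This is a sketch in base R on invented salaries (in thousands); the quartiles are computed with `quantile()`'s default method.

```r
# Hypothetical salaries in thousands, with one very large value.
salaries <- c(20, 22, 25, 27, 30, 31, 33, 35, 38, 120)

q1 <- unname(quantile(salaries, 0.25))
q3 <- unname(quantile(salaries, 0.75))
iqr <- q3 - q1

# Points beyond these fences are drawn as individual outlier points.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- salaries[salaries < lower_fence | salaries > upper_fence]

print(c(q1 = q1, median = median(salaries), q3 = q3))
print(outliers)
```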


• Violin plots and box plots can also be shown together:

ggplot(data) +

geom_violin(aes(x=education_factor,y=salaries)) +

geom_boxplot(aes(x=education_factor,y=salaries),alpha=0.3)


• Scatter plots are often too crowded, but sometimes provide extra insights compared to box plots and violin plots

– Over-crowded with too many points

– More useful with fewer points, e.g. only 100

ggplot(data) + geom_point(aes(x=education_factor,y=salaries))



Visualising relations of attributes



• 2 categorical attributes:

– Cross-table (workclass & education)


table(data$education,data$workclass)
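What `table()` returns for two categorical attributes is easiest to see on a toy example. The two vectors below are invented; `addmargins()` and `prop.table()` are optional extras for totals and row-wise proportions.

```r
# Hypothetical categorical attributes for six items.
education <- c("HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors", "HS-grad")
workclass <- c("Private", "Private", "Local-gov", "Private", "Local-gov", "Private")

# Cross-table: one cell per combination of values, holding the count.
cross <- table(education, workclass)

# Row and column totals, and proportions within each education level.
with_totals <- addmargins(cross)
row_proportions <- prop.table(cross, margin = 1)

print(with_totals)
print(row_proportions)
```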




– Heatmap (workclass & education)


ggplot(data %>% group_by(education,workclass) %>% summarise(count=length(education))) +

geom_tile(aes(y=education,x=workclass,fill=count)) + scale_fill_gradient(low='white',high='black',trans="log",breaks=c(1,10,100,1000)) +

theme_bw()




– Cross-table and heatmap combined

+ geom_text(aes(x=workclass,y=education,label=count),color="red")


• Scatter plot, box plot, violin plot



• Scatter plot


ggplot(data) + geom_point(aes(x=age,y=salaries))


• Scatter plot


ggplot(data) + geom_point(aes(x=capital.gain,y=salaries))


• Scatter plot

• Discretise one (or both) of the attributes

– Discretise = make into categorical

– For instance, introduce bins


library(arules)

data$capital.gain.discretised =

discretize(data$capital.gain,method="fixed",categories=c(0,5,10,20,50,100)*1000)

ggplot(data) + geom_boxplot(aes(x=capital.gain.discretised,y=salaries))
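If the arules package is not available, a similar fixed-bin discretisation can be sketched with base R's `cut()`. This is an alternative to the call above, not the same function; the capital gain values are invented, and the bin edges are the ones used above.

```r
# Hypothetical capital gain values.
capital_gain <- c(0, 3000, 7500, 15000, 99999)

# Same bin edges as above: 0, 5000, 10000, 20000, 50000, 100000.
breaks <- c(0, 5, 10, 20, 50, 100) * 1000

# Left-closed bins [a, b); include.lowest = TRUE also closes the last bin on the right.
bins <- cut(capital_gain, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(table(bins))
```

The result is a factor, so it can be used directly on the x-axis of a box plot.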


• Make all pairwise visualisations and organise them in a cross-table

• Perform dimensionality reduction

– Introduced later in the course

– Projects the data into a 2-dimensional space

– Then visualise

• Use colours, shapes, etc.


• Four attributes visualised together


ggplot(data) +

geom_point(aes(x=capital.gain,y=salaries,color=workclass,shape=gender)) +

scale_colour_brewer(type="qual")


• Often hard to read when many attributes are visualised this way



Are the attributes related?





• There are many statistical tests about this, we will cover some in the next lecture

• Visual inspection can reveal some relations

– For example, university graduates have higher median salaries than others

• Correlation is a statistic on 2 numeric attributes, quantifying linear relationships


• Mean (actually sample mean, as explained in the next lecture):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• Correlation (Pearson correlation coefficient)


• Ranges from -1 to +1:

– r=-1: perfectly anti-correlated

– r=+1: perfectly correlated

– r=0: uncorrelated (no linear relationship)

<figure: scatter plots labelled positively correlated, negatively correlated, and three labelled uncorrelated>



Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
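These properties can be checked on small hand-made data. The sketch below also writes out the usual textbook formula for the Pearson coefficient as a helper function (`pearson` is a name introduced here for illustration) and compares it with base R's `cor()`.

```r
x <- -3:3

# Pearson correlation written out: covariance scaled by both spreads.
pearson <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

# Direction matters, slope does not.
r_up <- cor(x, 2 * x + 1)
r_steep <- cor(x, 10 * x + 1)
r_down <- cor(x, -x)

# A perfect but nonlinear relationship can have zero correlation.
r_parabola <- cor(x, x^2)

# A noisy linear relationship lies strictly between 0 and 1.
y_noisy <- x + c(0, 1, 0, -1, 0, 1, 0)
r_noisy <- pearson(x, y_noisy)

print(c(r_up, r_steep, r_down, r_parabola, r_noisy))
```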



A. All equal

B. All different

C. Some equal, some different







<figure: CO2, Noise and Pressure over time, 405 - Feb 2017>

<figure: the same chart with the scale of Noise changed>

<figure: CO2 vs Noise in 405, scatter plot with regression line y = 0.0465x + 19.693, R² = 0.65625>

R² is not the same as correlation!


• r – Pearson correlation coefficient

• R² – coefficient of determination

– Proportion of variance in the target attribute that is predictable from the source attribute

– Basically, measures how well the data follow the regression line

– In simple linear regression R² is equal to the square of r
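The last point can be checked with `lm()` and `cor()`. The numbers below are invented and are not the measurements behind the CO2 vs Noise figure. Applied to that figure, R² = 0.65625 with a positive slope corresponds to r ≈ 0.81.

```r
# Hypothetical measurements (not the data behind the figure).
co2 <- c(400, 500, 600, 700, 800, 900, 1000)
noise <- c(38, 44, 45, 55, 54, 64, 66)

# Pearson correlation coefficient.
r <- cor(co2, noise)

# Simple linear regression of noise on CO2 and its coefficient of determination.
fit <- lm(noise ~ co2)
r_squared <- summary(fit)$r.squared

print(c(r = r, r_squared = r_squared, r_times_r = r^2))
```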


• Source: https://www.valuewalk.com/2016/08/usd-vs-gbp-vs-eur-a-leveling-of-the-playing-field/

• Source: http://www.tylervigen.com/spurious-correlations

• Relation ≠ Correlation ≠ Causation

• Spurious correlations! Multiple testing problem (next lecture)






• “The most merciful thing in the world... is the inability of the human mind to correlate all its contents.”

– H. P. Lovecraft

• “The correlation of quality of life and cost of energy is huge”

– Sam Altman

• “Visualization is daydreaming with a purpose.”

– Bo Bennett

• Source: https://www.brainyquote.com
