
Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan.

Date post: 19-Dec-2015

Statistical Methods in Computer Science

Data 2: Central Tendency & Variability

Ido Dagan

Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

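Frequency Distributions and Scales

[Diagram: frequency-distribution tools placed against the scales of measurement. Scales: Nominal, Ordinal, Numerical, Numerical/Continuous. Tools shown: f distribution, grouped f, relative f, (accumulative f), F, percentiles and percentile ranks, distributions with apparent/real limits.]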


Characteristics of Distributions

Shape, Central Tendency, Variability

[Figures: distributions with different central tendency; distributions with different variability]


This Lesson

Examine measures of central tendency:
- Mode (Nominal)
- Median (Ordinal)
- Mean (Numerical)

Examine measures of variability (dispersion):
- Entropy (Nominal)
- Variance, Standard Deviation (Numerical)

Standard scores (z-scores)


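Centrality/Variability Measures and Scales

[Diagram: measures placed against the scales Nominal, Ordinal, Numerical, Numerical/Continuous. Central tendency: Mode, Median, Mean. Variability: Entropy, Variance, Std. Deviation. Also shown: z-Scores.]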


The Mode (Mo; Hebrew: השכיח)

The mode of a variable is its most frequent value:

$$\mathrm{Mo} = \arg\max_x f(x)$$

For a categorical variable: the category that appears most often.

For grouped data: the midpoint of the most frequent interval, under the assumption that values are evenly distributed within the interval.
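The argmax definition can be sketched in a few lines of Python. This is only an illustration: the function name `modes` and the sample data are assumptions, and the function returns every value tied for the maximal frequency rather than picking one.

```python
from collections import Counter


def modes(values):
    """Return all values whose frequency f(x) is maximal (hypothetical helper)."""
    counts = Counter(values)          # f(x) for every distinct x
    top = max(counts.values())        # the highest frequency
    return [x for x, f in counts.items() if f == top]


# Nominal data: a unique mode.
print(modes(["Round-Robin", "Round-Robin", "Prioritized", "Preemptive"]))

# Numerical data can have more than one mode.
print(modes([60, 43, 57, 82, 75, 32, 82, 60]))
```

Returning a list rather than a single value makes ties explicit, which matters because a distribution can have several modes.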


Finding the Mode: Example 1

The collection of values that a variable X took during the measurement

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

System    Algorithm Name
Windows   Round-Robin
Linux     Round-Robin
BSD       Prioritized Scheduling
MacOS     Preemptive Scheduling

Trial  Run-Time
#1     23.234
#2     15.471
#3     12.220
#4     23.100

Mode of Run-Time: ? Depends on grouping.


Finding the Mode: Example 2

The mode of a grouped frequency distribution depends on grouping

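Ungrouped scores:

Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Two groupings of the same scores, both with interval width i = 5:

Grouping A          Grouping B
Score   f           Score    f
95-99   1           96-100   1
90-94   2           91-95    1
85-89   15          86-90    14
80-84   10          81-85    10
75-79   10          76-80    11
70-74   6           71-75    4
65-69   4           66-70    7
60-64   2           61-65    2
N = 50              N = 50

Mode: 86 for the ungrouped scores, 87 for grouping A (midpoint of 85-89), 88 for grouping B (midpoint of 86-90).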
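The effect of grouping on the mode can be checked with a short Python sketch. The frequencies are transcribed from the ungrouped table (zero-frequency scores omitted); the function name `grouped_mode` and its `start` parameter, the lowest score of the lowest interval, are assumptions made for the illustration.

```python
from collections import Counter

# score -> f, transcribed from the ungrouped table (N = 50)
freq = {96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
        83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
        73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1}


def grouped_mode(freq, start, width=5):
    """Midpoint of the most frequent interval; intervals begin at `start`."""
    groups = Counter()
    for score, f in freq.items():
        low = start + ((score - start) // width) * width  # interval's lowest score
        groups[low] += f
    low, _ = max(groups.items(), key=lambda kv: kv[1])
    return low + (width - 1) / 2                          # midpoint of low..low+width-1


print(max(freq, key=freq.get))        # ungrouped mode: 86
print(grouped_mode(freq, start=60))   # intervals 60-64, 65-69, ...: 87.0
print(grouped_mode(freq, start=61))   # intervals 61-65, 66-70, ...: 88.0
```

Shifting the interval boundaries by a single point moves the grouped mode from 87 to 88, which is the instability the example illustrates.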

The Median (Mdn; Hebrew: החציון)

The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall.

Requires ordering: only ordinal and numerical scales.

Examples:
- 0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11
- 12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18)
- 5, 7, 8, 8, 8, 8 ==> Mdn = ?

For the last example there are two options.

One method: halfway between the first and second 8, Mdn = 8.

Another (for real limits): use linear interpolation as we did in intervals, Mdn = 7.75:

$$7.75 = 7.5 + (\tfrac{1}{4} \times 1.0)$$

Here 7.5 is the real limit between 7 and 8, 1/4 stands for 1 of the four 8's, and 1.0 is the width of the interval containing the 8's (real limits).
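Both methods can be sketched in Python. The function names are hypothetical, and the interpolated version assumes scores are measured in whole units (`unit=1.0`), so that a score v occupies the real limits v - 0.5 to v + 0.5.

```python
def median_midpoint(values):
    """Middle value, or the mean of the two middle values when N is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2


def median_interpolated(values, unit=1.0):
    """P50 by linear interpolation inside the real limits of a tied score."""
    xs = sorted(values)
    half = len(xs) / 2                # number of scores that must fall below Mdn
    below = 0                         # scores below the current value
    for v in sorted(set(xs)):
        f = xs.count(v)               # frequency of the current value
        if below + f >= half:
            lower_limit = v - unit / 2
            return lower_limit + (half - below) / f * unit
        below += f


print(median_midpoint([12, 14, 15, 18, 19, 20]))   # 16.5
print(median_midpoint([5, 7, 8, 8, 8, 8]))         # 8.0
print(median_interpolated([5, 7, 8, 8, 8, 8]))     # 7.75
```

The two functions can disagree when there are gaps between neighbouring scores, because the interpolation treats each score as filling only its own unit-wide interval.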


The Arithmetic Mean (Hebrew: ממוצע חשבוני)

Arithmetic mean (mean, for short):

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

"Average" is colloquial: it is not precisely defined when used, so we avoid the term.


Properties of Central Tendency Measures

Mo:
- Relatively unstable between samples
- Problematic in grouped distributions
- Can be more than one: distributions that have more than one mode are sometimes called multi-modal
- For uniform distributions, all values are possible modes
- Typically used only on nominal data


Properties of Central Tendency Measures

Mean:
- Responsive to the exact value of each score
- Only interval and ratio scales
- Takes the total of scores into account: does not ignore any value
- Sum of deviations from the mean is always zero: $\sum_i (X_i - \bar{X}) = 0$
- Because of this: sensitive to outliers (presence/absence of scores at extreme values)
- Stable between samples, and the basis for many other statistical measures
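A small Python sketch of two of these properties, using the eight grades from the first mode example earlier in the deck. The extra score of 1000 is a made-up outlier added only to show the effect.

```python
from statistics import mean, median

grades = [60, 43, 57, 82, 75, 32, 82, 60]

m = mean(grades)
deviations = [x - m for x in grades]
print(m)                  # 61.375
print(sum(deviations))    # zero, up to floating-point rounding

# One extreme (hypothetical) score drags the mean but not the median.
with_outlier = grades + [1000]
print(mean(grades), mean(with_outlier))
print(median(grades), median(with_outlier))
```

The median stays at 60 in both cases because it depends only on the ordering of the scores, while the mean uses the magnitude of every score.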


Properties of Central Tendency Measures

Median:
- Robust to extreme values
- Only cares about ordering, not magnitude of intervals
- Often used with skewed distributions

[Figure: distribution with Mo, Mdn, and Mean marked]

Properties of Central Tendency Measures

Contrasting Mode, Median, Mean

[Figures: distributions with the positions of Mo, Mdn, and Mean marked]


Dispersion and Variability

Mode, Median, Mean: only give central tendencies.

[Figure: distribution with Mo, Mdn, and Mean marked]

We need to measure the spread of the distribution.


Dispersion as Ranges

Range: max(X) - min(X)

Semi-Interquartile Range: half the range in which the middle 50% of the scores lie:

$$Q = \frac{Q_3 - Q_1}{2} = \frac{P_{75} - P_{25}}{2}$$
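Both range measures can be sketched with the Python standard library (3.8 or later). The function names and data are assumptions, and `statistics.quantiles` with `method="inclusive"` is only one of several quartile conventions; the percentile method based on real limits may give slightly different values for the same data.

```python
from statistics import quantiles


def value_range(xs):
    """Range: max(X) - min(X)."""
    return max(xs) - min(xs)


def semi_interquartile_range(xs):
    """Q = (Q3 - Q1) / 2, with quartiles from statistics.quantiles."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return (q3 - q1) / 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical scores
print(value_range(data))               # 8
print(semi_interquartile_range(data))  # 2.0

# Unlike the range, Q is not affected by a single extreme score.
print(value_range(data[:-1] + [900]), semi_interquartile_range(data[:-1] + [900]))
```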


Dispersion as Deviation

Look at dispersion as a function of the central tendency (mean).

We know the sum of deviations from the mean is zero:

$$\sum_i (X_i - \bar{X}) = 0$$

But what if we look at the sum of absolute deviations? A smaller sum indicates more clustering of the distribution around the mean:

$$\sum_i |X_i - \bar{X}|$$


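Variance

Statisticians prefer a different way to remove the signs of the deviations: squaring them instead of taking absolute values.

Sum of squares (shorthand for: sum of squared deviations from the mean):

$$SS_X = \sum_i (X_i - \bar{X})^2$$

And normalizing for the size of the sample:

$$S^2 = \sigma^2 = \frac{SS_X}{N} = \frac{\sum_i (X_i - \bar{X})^2}{N}$$

This is called the variance of the distribution.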
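The two steps, sum of squares and then division by N, can be sketched directly in Python. The function names and data are made up for the illustration; `statistics.pvariance` is the standard-library function that also divides by N.

```python
from statistics import pvariance


def sum_of_squares(xs):
    """SS_X: sum of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def variance(xs):
    """Sum of squares normalized by the number of scores N."""
    return sum_of_squares(xs) / len(xs)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores with mean 5
print(sum_of_squares(data))       # 32.0
print(variance(data))             # 4.0
print(pvariance(data))            # same value from the standard library
```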


Standard Deviation (std.)

Square root of variance:

$$\sigma = \sqrt{\frac{SS_X}{N}} = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N}}$$

Robust to sampling variation: does not change very much with new samples of the population.

Perhaps the most common measure of dispersion.

Std is defined for the population; the standard error for a sample is a bit different. We ignore this for now and return to it later.
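As a quick check of the formula, here is a minimal Python sketch with made-up data. It uses the population form (division by N), which matches `statistics.pstdev`; the sample-based `statistics.stdev` divides by N - 1 instead and is deliberately not used here.

```python
import math
from statistics import pstdev


def std(xs):
    """Population standard deviation: square root of SS_X / N."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores, variance 4
print(std(scores))                  # 2.0
print(pstdev(scores))               # 2.0
```

Unlike the variance, the result is in the same units as the original scores.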


Standard Scores

Mean, median, etc. shift predictably under constant translations: adding V to each value is the same as adding V to the central tendency measures.

We may also need to compare distributions that differ in range.

For instance, what's better:
- Score of 50, when mean is 60
- Score of 60, when mean is 80
- ....

Can compute z-scores of the raw scores.


z Scores

Key idea: express all values in units of standard deviation:

$$z = \frac{X - \bar{X}}{\sigma}$$

This allows comparison of values from different distributions, but only if the shapes of the distributions are similar.
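The comparison posed earlier (a score of 50 when the mean is 60, versus 60 when the mean is 80) can only be settled once the standard deviations are known. The sketch below assumes hypothetical standard deviations of 5 and 20; they are not given in the slides.

```python
def z_score(x, mean, std):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / std


# Standard deviations of 5 and 20 are assumed purely for illustration.
z_a = z_score(50, mean=60, std=5)    # -2.0: two std. below the mean
z_b = z_score(60, mean=80, std=20)   # -1.0: one std. below the mean
print(z_a, z_b)
print("second score is relatively better" if z_b > z_a else "first score is relatively better")
```

With these assumed spreads, the score that is 20 points below its mean is relatively better than the score that is only 10 points below its mean.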


Measuring Dispersion in Nominal Scales

Entropy:

$$H(X) = -\sum_X r_X \log_2 r_X$$

where r_X is the rel f of the value X.

- Entropy of 0 means that all values X are the same: rel f = 1.0 for some value X.
- Entropy grows positive when values become more dispersed; e.g., entropy of 1 means all scores split evenly between two values.
- Entropy is maximal when r_X = 1/N for all values X (N here being the number of distinct values), i.e., a uniform distribution.
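A minimal Python sketch of the entropy computation, assuming base-2 logarithms (which is what makes an even split between two values come out as exactly 1). The function name and the sample data are illustrative.

```python
import math
from collections import Counter


def entropy(values):
    """H(X) = -sum r_X log2 r_X, written as sum r_X log2(1 / r_X)."""
    n = len(values)
    total = 0.0
    for f in Counter(values).values():
        r = f / n                     # relative frequency of this value
        total += r * math.log2(1 / r)
    return total


print(entropy(["Linux"] * 4))                         # 0.0: all values the same
print(entropy(["Linux", "Linux", "BSD", "BSD"]))      # 1.0: even split between two values
print(entropy(["Linux", "BSD", "MacOS", "Windows"]))  # 2.0: uniform over four values
```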


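Normalizing Entropy

Can normalize by dividing by the maximal entropy for n distinct values (the N of the previous slide):

$$H_{\max} = -\sum_{n} \frac{1}{n}\log_2\frac{1}{n} = -n\cdot\frac{1}{n}\log_2\frac{1}{n} = -\log_2\frac{1}{n} = \log_2 n$$

This allows comparing the entropy of distributions of different size.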
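The normalization can be sketched as follows. The function name and the optional `n_categories` parameter are assumptions: by default the sketch takes n to be the number of distinct values actually observed, which may be smaller than the number of possible categories.

```python
import math
from collections import Counter


def normalized_entropy(values, n_categories=None):
    """Entropy divided by its maximum, log2(n); the result lies in [0, 1]."""
    counts = Counter(values)
    n = n_categories if n_categories is not None else len(counts)
    if n <= 1:
        return 0.0                    # a single category has no dispersion
    total = len(values)
    h = sum((f / total) * math.log2(total / f) for f in counts.values())
    return h / math.log2(n)


two = ["a", "a", "b", "b"]            # uniform over 2 values
four = ["a", "b", "c", "d"]           # uniform over 4 values
print(normalized_entropy(two), normalized_entropy(four))   # both 1.0
print(normalized_entropy(["a", "a", "a", "b"]))            # between 0 and 1
```

Without normalization the two uniform samples have entropies 1 and 2; after dividing by log2(n) both reach the same maximum of 1.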

