Statistical Methods in Computer Science Data 2: Central Tendency & Variability Ido Dagan.


Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

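Frequency Distributions and Scales

[Figure: measurement scales (Nominal, Ordinal, Numerical, Numerical/Continuous) against the tools available for each: f distribution, grouped f distribution, relative f, cumulative f (F), percentiles and percentile ranks, apparent/real limits.]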


Characteristics of Distributions

Shape, Central Tendency, Variability

[Figure: two panels, labeled "Different Central Tendency" and "Different Variability".]


This Lesson

Examine measures of central tendency
  Mode (Nominal)
  Median (Ordinal)
  Mean (Numerical)

Examine measures of variability (dispersion)
  Entropy (Nominal)
  Variance (Numerical), Standard Deviation

Standard scores (z-score)


Centrality/Variability Measures and Scales

[Figure: measurement scales (Nominal, Ordinal, Numerical, Numerical/Continuous) against the measures covered in this lesson: Mode, Median, Mean, Entropy, Variance, Std. Deviation, z-Scores.]


The Mode (Mo)

The mode of a variable is the value that is most frequent: $\mathrm{Mo} = \arg\max_x f(x)$

For a categorical variable: the category that appeared most often

For grouped data: the midpoint of the most frequent interval, under the assumption that values are evenly distributed in the interval
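A minimal Python sketch of Mo = argmax f(x), using the grades and algorithm names listed in Example 1. The helper name `modes` is an assumption made for illustration; it returns every value tied for the highest frequency, since a variable can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values whose frequency f(x) is maximal (there may be ties)."""
    freq = Counter(values)        # f(x) for every distinct x
    top = max(freq.values())      # highest frequency observed
    return sorted(x for x, f in freq.items() if f == top)

grades = [60, 43, 57, 82, 75, 32, 82, 60]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]

print(modes(grades))       # two grades tie for most frequent
print(modes(algorithms))
```

Returning a list rather than a single value is a design choice of this sketch: it makes multi-modal data visible instead of silently picking one mode.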


Finding the Mode: Example 1

The collection of values that a variable X took during the measurement

Student  Grade
X1       60
X2       43
X3       57
X4       82
X5       75
X6       32
X7       82
X8       60

System
Windows
Linux
BSD
MacOS

Algorithm Name
Round-Robin
Round-Robin
Prioritized Scheduling
Preemptive Scheduling

Trial  Run-Time
#1     23.234
#2     15.471
#3     12.220
#4     23.100

Mode of Run-Time: ? Depends on grouping


Finding the Mode: Example 2

The mode of a grouped frequency distribution depends on grouping

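Ungrouped scores (Mo = 86):

Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Grouped, i = 5 (Mo = 87, the midpoint of 85-89):

Score   f
95-99   1
90-94   2
85-89   15
80-84   10
75-79   10
70-74   6
65-69   4
60-64   2
N = 50

Grouped, i = 5, interval boundaries shifted by one (Mo = 88, the midpoint of 86-90):

Score    f
96-100   1
91-95    1
86-90    14
81-85    10
76-80    11
71-75    4
66-70    7
61-65    2
N = 50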
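The dependence on grouping can be checked with a short Python sketch. The frequencies are copied from the ungrouped table above (scores with f = 0 are omitted); the function name `grouped_mode` and its `start` parameter are assumptions made for illustration.

```python
from collections import Counter

# Ungrouped frequencies from the table above (scores with f = 0 omitted).
freq = {96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
        83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
        73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1}

def grouped_mode(freq, start, width):
    """Midpoint of the most frequent interval [lo, lo + width - 1].

    `start` is the lower apparent limit of one interval; it fixes where
    all interval boundaries fall.
    """
    grouped = Counter()
    for score, f in freq.items():
        lo = start + ((score - start) // width) * width
        grouped[lo] += f
    lo, _ = max(grouped.items(), key=lambda item: item[1])
    return lo + (width - 1) / 2

print(max(freq, key=freq.get))                 # ungrouped mode
print(grouped_mode(freq, start=60, width=5))   # intervals 60-64, 65-69, ...
print(grouped_mode(freq, start=61, width=5))   # intervals 61-65, 66-70, ...
```

The same 50 scores give three different modes, depending only on whether and where the interval boundaries are drawn.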


The Median (Mdn)

The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall

Requires ordering: only ordinal and numerical scales

Examples:

0,8,8,11,15,16,20 ==> Mdn = 11
12,14,15,18,19,20 ==> Mdn = 16.5 (halfway between 15 and 18)
5,7,8,8,8,8 ==> Mdn = ?

One method: halfway between the first and second 8, Mdn = 8

Another (for real limits): use linear interpolation as we did in intervals, Mdn = 7.75

7.75 = 7.5 + (¼ * 1.0)
  7.5: real lower limit of the 8's, between 7 and 8
  ¼: 1 of the four 8's
  1.0: width of the interval containing the 8's (real limits)
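Both methods can be sketched in Python. `statistics.median` is the standard-library midpoint method; `interpolated_median` is a hypothetical helper written here to mirror the real-limits interpolation, assuming every value stands for an interval of the given width centered on it.

```python
import statistics

def interpolated_median(values, width=1.0):
    """Median by linear interpolation inside the real limits of the median value."""
    data = sorted(values)
    n = len(data)
    x = data[n // 2]                        # value whose interval holds the median
    lower = x - width / 2                   # real lower limit, e.g. 7.5 for 8
    below = sum(1 for v in data if v < x)   # scores below that interval
    inside = data.count(x)                  # scores inside that interval
    return lower + width * (n / 2 - below) / inside

print(statistics.median([0, 8, 8, 11, 15, 16, 20]))
print(statistics.median([12, 14, 15, 18, 19, 20]))
print(statistics.median([5, 7, 8, 8, 8, 8]))       # midpoint method
print(interpolated_median([5, 7, 8, 8, 8, 8]))     # real-limits method
```

Python's `statistics.median_grouped` applies the same kind of interpolation, so it can serve as a cross-check on the hand-written helper.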


The Arithmetic Mean

Arithmetic mean (mean, for short): $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$

Average is colloquial: not precisely defined when used, so we avoid the term.


Properties of Central Tendency Measures

Mo:
  Relatively unstable between samples
  Problematic in grouped distributions
  Can be more than one: distributions that have more than one are sometimes called multi-modal
  For uniform distributions, all values are possible modes
  Typically used only on nominal data



Mean:
  Responsive to exact value of each score
  Only interval and ratio scales
  Takes total of scores into account: does not ignore any value
  Sum of deviations from mean is always zero: $\sum_i (X_i - \bar{X}) = 0$
  Because of this: sensitive to outliers (presence/absence of scores at extreme values)
  Stable between samples, and basis for many other statistical measures



Median:
  Robust to extreme values
  Only cares about ordering, not magnitude of intervals
  Often used with skewed distributions

[Figure: a distribution with Mo, Mdn, and Mean marked.]
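Two of these properties are easy to check numerically. The sketch below reuses the scores 12, 14, 15, 18, 19, 20 from the median examples; the value 200 is a hypothetical outlier added only for illustration.

```python
import statistics

scores = [12, 14, 15, 18, 19, 20]
mean = statistics.mean(scores)

# Deviations from the mean cancel out (up to floating-point rounding).
print(sum(x - mean for x in scores))

# Replace the largest score with an extreme value (hypothetical outlier).
with_outlier = [12, 14, 15, 18, 19, 200]
print(statistics.mean(scores), statistics.mean(with_outlier))      # mean moves
print(statistics.median(scores), statistics.median(with_outlier))  # median stays
```

The median only looks at the ordering of the scores, so changing the magnitude of the largest one leaves it untouched, while the mean is pulled toward the outlier.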



Contrasting Mode, Median, Mean

[Figure: distributions with the positions of Mo, Mdn, and Mean marked.]


Dispersion and Variability

Mode, Median, Mean: only give central tendencies

[Figure: a distribution with Mo, Mdn, and Mean marked.]

We need to measure the spread of the distribution


Dispersion as Ranges

Range: max(X) - min(X)

Semi-Interquartile Range: half the width of the range that holds the middle 50% of the scores

$Q = \frac{Q_3 - Q_1}{2} = \frac{P_{75} - P_{25}}{2}$
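A minimal sketch of both range measures in Python, using the eight grades from Example 1. Quartile conventions differ between tools; the `method="inclusive"` choice below is an assumption of this sketch, not something the slides specify.

```python
import statistics

grades = [60, 43, 57, 82, 75, 32, 82, 60]   # grades from Example 1

value_range = max(grades) - min(grades)

# statistics.quantiles with n=4 returns the three quartile cut points
# (P25, P50, P75). The "inclusive" method is just one convention.
q1, _, q3 = statistics.quantiles(grades, n=4, method="inclusive")
semi_iqr = (q3 - q1) / 2

print(value_range, q1, q3, semi_iqr)
```

The range depends only on the two most extreme scores, while the semi-interquartile range ignores the top and bottom quarters entirely.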


Dispersion as Deviation

Look at dispersion as a function of the central tendency (mean)

We know the sum of deviations from the mean is zero: $\sum_i (X_i - \bar{X}) = 0$

But what if we look at the sum of absolute deviations, $\sum_i |X_i - \bar{X}|$? A smaller sum indicates more clustering of the distribution around the mean


Variance

Statisticians prefer squared deviations to absolute values

Sum of squares (shorthand for: sum of squared deviations from the mean):

$SS_X = \sum_i (X_i - \bar{X})^2$

And normalizing for the size of the sample:

$S^2 = \sigma^2 = \frac{SS_X}{N} = \frac{\sum_i (X_i - \bar{X})^2}{N}$

This is called the variance of the distribution
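The two steps, sum of squares and then normalization by N, can be sketched as follows. The data values and the function names are illustrative assumptions, not taken from the slides.

```python
import statistics

def sum_of_squares(values):
    """SS: sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def variance(values):
    """Population variance: SS normalized by N."""
    return sum_of_squares(values) / len(values)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean is 5

print(sum_of_squares(scores))
print(variance(scores))
print(statistics.pvariance(scores))  # standard-library population variance
```

Squaring removes the sign of each deviation, as the absolute value would, but it also weights large deviations more heavily than small ones.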


Standard Deviation (std.)

Square root of variance:

$\sigma = \sqrt{\frac{SS_X}{N}} = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N}}$

Robust to sampling variation: does not change very much with new samples of the population

Perhaps the most common measure of dispersion

Std is defined for the population; the standard error for a sample is a bit different. We ignore this for now and return to it later
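A short sketch of the population form in Python, again on illustrative data that is not from the slides. The last line shows, for contrast only, the standard library's sample version `statistics.stdev`, which divides by N - 1 rather than N.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, not from the slides

mean = sum(scores) / len(scores)
ss = sum((x - mean) ** 2 for x in scores)
sigma = math.sqrt(ss / len(scores))   # population form: divide by N

print(sigma)
print(statistics.pstdev(scores))      # same population formula
print(statistics.stdev(scores))       # sample form: divides by N - 1
```

Unlike the variance, the standard deviation is expressed in the same units as the original scores, which makes it easier to read next to the mean.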


Standard Scores

Mean, median, etc. behave predictably under constant translations: adding V to each value is the same as adding V to the central tendency measures

We may also need to compare distributions that differ in range

For instance, what's better:
  Score of 50, when mean is 60
  Score of 60, when mean is 80
  ....

Can compute z-scores of the raw scores


z Scores

Key idea: express all values in units of standard deviation

$z = \frac{X - \bar{X}}{\sigma_X}$

This allows comparison of values from different distributions, but only if the shapes of the distributions are similar
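A minimal sketch of z-scores in Python. The slides give only the means for the "score of 50 vs. score of 60" question, so the standard deviations used below (5 and 20) are hypothetical values chosen for illustration; the function names are assumptions as well.

```python
import statistics

def z(score, mean, sigma):
    """One raw score expressed in units of standard deviation."""
    return (score - mean) / sigma

def z_scores(values):
    """z-score of every value, using the population standard deviation."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mean) / sigma for x in values]

# Hypothetical standard deviations: the slides give only the means.
print(z(50, mean=60, sigma=5))    # 2 standard deviations below the mean
print(z(60, mean=80, sigma=20))   # 1 standard deviation below the mean

print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```

Under these assumed spreads, the score of 60 is the better one relative to its own distribution even though it is further from its mean in raw points. After the transformation a set of scores has mean 0 and standard deviation 1.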


Measuring Dispersion in Nominal Scales

Entropy

$H(X) = -\sum_X r_X \log_2 r_X$

Where $r_X$ is the rel f of the value X (the logarithm is base 2, consistent with the entropy of 1 for an even two-way split)

Entropy of 0 means that all values X are the same: rel f = 1.0 for some value X

Entropy grows positive when values become more dispersed, e.g., entropy of 1 means all scores split evenly between two values

Entropy is maximal when $r_X = 1/N$ for all values X, i.e., uniform distribution
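The three cases described above can be checked with a small Python sketch. A base-2 logarithm is assumed, since an even split between two values is said to give entropy 1; the function name `entropy` and the sample values are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = sum over distinct X of r_X * log2(1 / r_X), r_X = relative frequency."""
    n = len(values)
    return sum((f / n) * log2(n / f) for f in Counter(values).values())

print(entropy(["Linux"] * 4))                          # all values the same
print(entropy(["Linux", "Linux", "BSD", "BSD"]))       # even split, two values
print(entropy(["Windows", "Linux", "BSD", "MacOS"]))   # uniform over four values
```

Writing each term as r_X * log2(1 / r_X) is equivalent to the usual form with the leading minus sign and works directly on nominal data, because only the counts are used, never the values themselves.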


Normalizing Entropy

Can normalize by dividing by maximal entropy given N.

This allows comparing the entropy of distributions of different size

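$H_{max} = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = -n \cdot \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n$

where n is the number of distinct values (N above)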
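A sketch of the normalization in Python, dividing the entropy by its maximum log n. Two assumptions are made here: n is taken to be the number of distinct values actually observed, and a variable with a single value is given normalized entropy 0 by convention to avoid dividing by log 1 = 0.

```python
from collections import Counter
from math import log2

def normalized_entropy(values):
    """Entropy divided by its maximum, log2(n), for n distinct observed values."""
    total = len(values)
    counts = Counter(values).values()
    h = sum((f / total) * log2(total / f) for f in counts)
    n = len(counts)
    return h / log2(n) if n > 1 else 0.0   # convention: 0 when only one value

print(normalized_entropy(["a", "b"] * 3))            # uniform over two values
print(normalized_entropy(["a", "b", "c", "d"] * 2))  # uniform over four values
print(normalized_entropy(["a", "a", "a", "b"]))      # uneven split
```

Both uniform cases land on the same normalized value even though their raw entropies differ, which is what makes distributions with different numbers of values comparable.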

Date post:	19-Dec-2015