Date post: | 19-Dec-2015 |
Statistical Methods in Computer Science
Data 2: Central Tendency & Variability
Ido Dagan
Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan 2
Frequency Distributions and Scales
[Figure: frequency tools placed along the scales Nominal, Ordinal, Numerical, Numerical/Continuous: f distribution, grouped f distribution, relative f, (accumulative f), percentiles and percentile ranks, apparent/real limits]

Characteristics of Distributions
Shape, Central Tendency, Variability
[Figure panels: Different Central Tendency; Different Variability]

This Lesson
Examine measures of central tendency:
  Mode (Nominal)
  Median (Ordinal)
  Mean (Numerical)
Examine measures of variability (dispersion):
  Entropy (Nominal)
  Variance (Numerical), Standard Deviation
Standard scores (z-score)

Centrality/Variability Measures and Scales
[Figure: measures placed along the scales Nominal, Ordinal, Numerical, Numerical/Continuous. Central tendency: Mode, Median, Mean. Variability: Entropy, Variance, Std. Deviation, z-Scores]

The Mode (Mo)
The mode of a variable is the value that is most frequent:
  $Mo = \arg\max_x f(x)$
For a categorical variable: the category that appears most often.
For grouped data: the midpoint of the most frequent interval,
  under the assumption that values are evenly distributed within the interval.

Finding the Mode: Example 1
The collection of values that a variable X took during the measurement

Student  Grade  System   Algorithm Name
X1       60     Windows  Round-Robin
X2       43     Linux    Round-Robin
X3       57     BSD      Prioritized Scheduling
X4       82     MacOS    Preemptive Scheduling
X5       75
X6       32
X7       82
X8       60

Trial  Run-Time
#1     23.234
#2     15.471
#3     12.220
#4     23.100

Mode of Run-Time: ? Depends on grouping
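
The definition Mo = argmax f(x) can be tried directly on the Example 1 data. The following is a minimal Python sketch, not part of the original slides; the helper name `modes` is our own, and it returns every value tied for the highest frequency so that multi-modal cases stay visible.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, f in counts.items() if f == top]


# Data transcribed from Example 1.
grades = [60, 43, 57, 82, 75, 32, 82, 60]
algorithms = ["Round-Robin", "Round-Robin",
              "Prioritized Scheduling", "Preemptive Scheduling"]
run_times = [23.234, 15.471, 12.220, 23.100]

print(modes(grades))      # [60, 82]
print(modes(algorithms))  # ['Round-Robin']
print(modes(run_times))   # all four values: each run-time occurs exactly once
```

For the continuous run-times every raw value is a "mode", which is why a meaningful answer depends on how the values are grouped.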

Finding the Mode: Example 2
The mode of a grouped frequency distribution depends on the grouping.

Ungrouped scores (Mo = 86):
Score  f     Score  f
96     1     78     4
95     0     77     2
94     0     76     1
93     0     75     1
92     0     74     0
91     1     73     1
90     1     72     2
89     3     71     0
88     2     70     3
87     2     69     1
86     6     68     2
85     2     67     1
84     2     66     0
83     1     65     0
82     3     64     0
81     2     63     0
80     2     62     1
79     2     61     1

Grouped, i = 5, intervals starting at 60 (Mo = 87, the midpoint of 85-89):
Score   f
95-99   1
90-94   2
85-89   15
80-84   10
75-79   10
70-74   6
65-69   4
60-64   2
N = 50

Grouped, i = 5, intervals starting at 61 (Mo = 88, the midpoint of 86-90):
Score   f
96-100  1
91-95   1
86-90   14
81-85   10
76-80   11
71-75   4
66-70   7
61-65   2
N = 50
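
The dependence of the mode on the grouping in Example 2 can be reproduced with a short Python sketch. This is an illustration only: the function name `grouped_mode` and its parameters `low` (lowest interval start) and `width` (interval size i) are our own choices, and ties between intervals are not handled.

```python
from collections import Counter

# Non-zero raw frequencies transcribed from Example 2 (score -> f).
raw = {96: 1, 91: 1, 90: 1, 89: 3, 88: 2, 87: 2, 86: 6, 85: 2, 84: 2,
       83: 1, 82: 3, 81: 2, 80: 2, 79: 2, 78: 4, 77: 2, 76: 1, 75: 1,
       73: 1, 72: 2, 70: 3, 69: 1, 68: 2, 67: 1, 62: 1, 61: 1}


def grouped_mode(freq, low, width):
    """Midpoint of the most frequent interval.

    Intervals are [low, low + width - 1], [low + width, ...], and so on.
    """
    grouped = Counter()
    for score, f in freq.items():
        lower = low + ((score - low) // width) * width
        grouped[lower] += f
    best_lower = max(grouped, key=grouped.get)
    return best_lower + (width - 1) / 2


print(max(raw, key=raw.get))      # 86   (ungrouped mode)
print(grouped_mode(raw, 60, 5))   # 87.0 (intervals 60-64, 65-69, ...)
print(grouped_mode(raw, 61, 5))   # 88.0 (intervals 61-65, 66-70, ...)
```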

The Median (Mdn)
The median of a variable is its 50th percentile, P50: the point below which 50% of all measurements fall.
Requires ordering: only ordinal and numerical scales.
Examples:
  0, 8, 8, 11, 15, 16, 20 ==> Mdn = 11
  12, 14, 15, 18, 19, 20 ==> Mdn = 16.5 (halfway between 15 and 18)
  5, 7, 8, 8, 8, 8 ==> Mdn = ?
    One method: halfway between the first and second 8, Mdn = 8.
    Another (for real limits): use linear interpolation as we did in intervals, Mdn = 7.75:
      7.75 = 7.5 + (¼ * 1.0)
      7.5: the real limit between 7 and 8
      ¼: 1 of the four 8's (two of the six scores lie below 7.5, so one more is needed to reach 50%)
      1.0: width of the interval containing the 8's (real limits)
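
Both ways of computing the median can be sketched in Python. The first uses the standard library's `statistics.median`; the second, `median_interpolated`, is our own illustrative helper that assumes each score v occupies the real limits v - width/2 to v + width/2 and interpolates linearly inside the interval that contains the 50% point.

```python
import statistics
from collections import Counter


def median_interpolated(values, width=1.0):
    """Median by linear interpolation within real limits (sketch)."""
    if not values:
        raise ValueError("median of empty data")
    counts = Counter(values)
    half = len(values) / 2
    below = 0  # number of scores below the current value's lower real limit
    for v in sorted(counts):
        f = counts[v]
        if below + f >= half:
            lower_limit = v - width / 2
            return lower_limit + (half - below) / f * width
        below += f


print(statistics.median([0, 8, 8, 11, 15, 16, 20]))   # 11
print(statistics.median([12, 14, 15, 18, 19, 20]))    # 16.5
print(statistics.median([5, 7, 8, 8, 8, 8]))          # 8.0
print(median_interpolated([5, 7, 8, 8, 8, 8]))        # 7.75
```

The two methods answer slightly different questions, so they need not agree on every data set.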

The Arithmetic Mean
Arithmetic mean (mean, for short):
  $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
"Average" is colloquial: it is not precisely defined when used, so we avoid the term.

Properties of Central Tendency Measures
Mo:
  Relatively unstable between samples
  Problematic in grouped distributions
  Can be more than one:
    Distributions that have more than one mode are sometimes called multi-modal
    For uniform distributions, all values are possible modes
  Typically used only on nominal data

Properties of Central Tendency Measures
Mean:
  Responsive to the exact value of each score
  Only interval and ratio scales
  Takes the total of the scores into account: does not ignore any value
  Sum of deviations from the mean is always zero:
    $\sum_i (X_i - \bar{X}) = 0$
  Because of this: sensitive to outliers
    Presence/absence of scores at extreme values
  Stable between samples, and the basis for many other statistical measures

Properties of Central Tendency Measures
Median:
  Robust to extreme values
  Only cares about ordering, not the magnitude of intervals
  Often used with skewed distributions
[Figure: distribution with the positions of Mo, Mdn and Mean marked]
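
The contrast between the mean's sensitivity to outliers and the median's robustness can be seen on a small made-up data set. The scores below are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import statistics

# Hypothetical scores (not from the slides).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
# Same scores with the largest one replaced by an extreme value.
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]

print(statistics.mean(scores), statistics.median(scores))              # 5 4.5
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 15.125 4.5
```

A single extreme score triples the mean, while the median does not move because the ordering of the middle scores is unchanged.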

Properties of Central Tendency Measures
Contrasting Mode, Median, Mean
[Figures: two distributions with the positions of Mo, Mdn and Mean marked]

Dispersion and Variability
Mode, Median, Mean: only give central tendencies
[Figure: distribution with Mo, Mdn and Mean marked]
We need to measure the spread of the distribution

Dispersion as Ranges
Range: max(X) - min(X)
Semi-Interquartile Range: half the range in which the middle 50% of the scores lie
  $Q = \frac{Q_3 - Q_1}{2} = \frac{P_{75} - P_{25}}{2}$
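
A minimal Python sketch of the two range-based measures follows. It relies on `statistics.quantiles` with `method="inclusive"` to obtain the quartiles; that percentile convention is an assumption on our part and may differ slightly from the real-limits interpolation used in this course. The data and the helper names are hypothetical.

```python
import statistics


def value_range(values):
    """Range: max(X) - min(X)."""
    return max(values) - min(values)


def semi_interquartile_range(values):
    """Q = (Q3 - Q1) / 2, with quartiles from statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical scores
print(value_range(data))               # 8
print(semi_interquartile_range(data))  # 2.0
```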

Dispersion as Deviation
Look at dispersion as a function of the central tendency (mean)
We know the sum of deviations from the mean is zero:
  $\sum_i (X_i - \bar{X}) = 0$
But what if we look at the sum of absolute deviations?
  $\sum_i |X_i - \bar{X}|$
  A smaller sum indicates more clustering of the distribution around the mean
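
The two sums can be compared on a pair of hypothetical data sets that share the same mean but differ in spread. This is a sketch for illustration; with means that are not exactly representable in floating point, the first sum is only approximately zero.

```python
def deviations(values):
    """Deviation of each value from the arithmetic mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]


# Hypothetical data: both sets have mean 5.
spread = [2, 4, 4, 4, 5, 5, 7, 9]
tight = [4, 5, 5, 5, 5, 5, 5, 6]

for name, data in (("spread", spread), ("tight", tight)):
    devs = deviations(data)
    print(name, sum(devs), sum(abs(d) for d in devs))
# spread 0.0 12.0
# tight 0.0 2.0
```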

Variance
Statisticians prefer a different way of making every deviation non-negative: squaring instead of taking absolute values.
Sum of squares, shorthand for the sum of squared deviations from the mean:
  $SS_X = \sum_i (X_i - \bar{X})^2$
And normalizing for the size of the sample:
  $S^2 = \sigma^2 = \frac{SS_X}{N} = \frac{\sum_i (X_i - \bar{X})^2}{N}$
This is called the variance of the distribution.

Standard Deviation (std.)
Square root of the variance:
  $\sigma = \sqrt{\frac{SS_X}{N}} = \sqrt{\frac{\sum_i (X_i - \bar{X})^2}{N}}$
Robust to sampling variation: does not change very much with new samples of the population.
Perhaps the most common measure of dispersion.
Std as given here is defined for a population; the version computed for a sample is a bit different.
  We ignore this for now and return to it later.
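
The sum of squares, the variance and the standard deviation (all dividing by N, as in the population formulas of these slides) can be sketched in a few lines of Python. The function names and the data are our own; the standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities and are used here only as a cross-check.

```python
import math
import statistics


def sum_of_squares(values):
    """SS_X: sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def variance(values):
    """Population variance: SS_X / N."""
    return sum_of_squares(values) / len(values)


def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(variance(values))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean 5
print(sum_of_squares(scores))  # 32.0
print(variance(scores))        # 4.0
print(std_dev(scores))         # 2.0
print(statistics.pvariance(scores), statistics.pstdev(scores))  # same values
```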

Standard Scores
Mean, median, etc. behave predictably under constant translations:
  adding V to each value is the same as adding V to the central tendency measures.
We may also need to compare distributions that differ in range.
For instance, what's better:
  a score of 50, when the mean is 60
  a score of 60, when the mean is 80
  ....
Can compute z-scores of the raw scores.
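
The translation property can be checked numerically. The sketch below uses hypothetical scores and a hypothetical shift V = 10, and compares each central tendency measure before and after the shift.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
V = 10                             # hypothetical constant added to every score
shifted = [x + V for x in scores]

for measure in (statistics.mode, statistics.median, statistics.mean):
    print(measure.__name__, measure(scores), measure(shifted))
# mode 4 14
# median 4.5 14.5
# mean 5 15
```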

z Scores
Key idea: express all values in units of standard deviation:
  $z = \frac{X - \bar{X}}{\sigma_X}$
This allows comparison of values from different distributions,
  but only if the shapes of the distributions are similar.
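
A short Python sketch of the z-score computation follows. The helper names are our own. For the "what's better" question (50 with mean 60 versus 60 with mean 80) the slides give no standard deviations, so the values 5 and 20 used below are purely assumed for illustration.

```python
import math


def z_score(x, mean, sigma):
    """Express x in units of standard deviation from the mean."""
    return (x - mean) / sigma


def z_scores(values):
    """z-score of every value, using the population standard deviation.

    Note: fails with ZeroDivisionError if all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [z_score(x, mean, sigma) for x in values]


print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# Assumed standard deviations (5 and 20) for the two exams:
print(z_score(50, 60, 5))   # -2.0
print(z_score(60, 80, 20))  # -1.0
```

Under these assumed spreads the score of 60 is the better one, even though it lies further below its mean in raw points.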

Measuring Dispersion in Nominal Scales
Entropy:
  $H(X) = -\sum_X r_X \log r_X$
where $r_X$ is the rel f of the value X.
Entropy of 0 means that all values X are the same:
  rel f = 1.0 for some value X
Entropy grows positive when values become more dispersed,
  e.g., entropy of 1 (with base-2 logarithms) means all scores are split evenly between two values
Entropy is maximal when $r_X = 1/N$ for all values X, where N is the number of distinct values,
  i.e., a uniform distribution
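
Entropy over relative frequencies can be sketched in Python as follows. The function name `entropy` is ours, and base-2 logarithms are assumed so that an even split between two values gives exactly 1. The category labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(X) = -sum over distinct values of r_X * log2(r_X)."""
    n = len(values)
    return sum(-(f / n) * log2(f / n) for f in Counter(values).values())


print(entropy(["Linux"] * 4))                      # 0.0 (all values the same)
print(entropy(["Linux", "Linux", "BSD", "BSD"]))   # 1.0 (even split, two values)
print(entropy(["Linux", "BSD", "MacOS", "Windows"]))  # 2.0 (uniform, four values)
```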

Normalizing Entropy
Can normalize by dividing by the maximal entropy for n distinct values:
  $H_{\max} = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = -n \cdot \frac{1}{n} \log \frac{1}{n} = -\log \frac{1}{n} = \log n$
This allows comparing the entropy of distributions of different sizes.
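
Normalized entropy can be sketched by dividing the entropy by log n. In this illustration n is taken to be the number of distinct values actually observed, which is an assumption; if the set of possible categories is known in advance, that count could be used instead. A distribution with a single value is given normalized entropy 0 by convention here.

```python
from collections import Counter
from math import log2


def normalized_entropy(values):
    """Entropy divided by its maximum, log2(n), for n observed distinct values."""
    counts = Counter(values)
    n_distinct = len(counts)
    if n_distinct < 2:
        return 0.0  # convention chosen for this sketch
    total = len(values)
    h = sum(-(f / total) * log2(f / total) for f in counts.values())
    return h / log2(n_distinct)


print(normalized_entropy(["a", "b"]))            # 1.0
print(normalized_entropy(["a", "b", "c", "d"]))  # 1.0
print(normalized_entropy(["a", "a", "a", "b"]))  # between 0 and 1
```

Both uniform distributions reach 1.0 regardless of how many values they have, which is what makes distributions of different sizes comparable.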