Measures and indexes of variability
Stats48n
Measures of spread/dispersion/variability
- A measure of center needs to be complemented by a measure of spread around this center.
- The definitions of averages that we explored naturally lend themselves to measures of variability.
- Variance: average squared distance from the mean
$$V(x_1, \dots, x_n) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
- Standard deviation:
$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
- Note that R actually divides by $n - 1$ rather than $n$. This is because when $x_1, \dots, x_n$ are a sample from a larger population of possible values, dividing by $n - 1$ gives a “better” (unbiased) estimator of the population variance.
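The difference between dividing by n and by n − 1 can be seen on a small dataset. The sketch below is in Python rather than R, and the function names and data values are illustrative assumptions, not part of the slides. It compares hand-written versions of both formulas with the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n − 1).

```python
import statistics


def variance_n(xs):
    """Average squared distance from the mean (divides by n)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_n_minus_1(xs):
    """Sample variance (divides by n - 1), the convention R's var() follows."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Illustrative data, chosen only for this example.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance_n(data), statistics.pvariance(data))
print(variance_n_minus_1(data), statistics.variance(data))
```

The two versions differ by the factor n / (n − 1), so the gap shrinks as the number of observations grows.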
A note: data with frequencies
- Often data is summarized so that we have counts of occurrences of the same values: we have a set $v_1, \dots, v_m$ of possible values, with their frequencies $f_i$

Value       v1   v2   ...   vm
Frequency   f1   f2   ...   fm

- Calculating averages and standard deviations has to adapt to this different set-up
$$\bar{v} = \frac{1}{\sum_{i=1}^{m} f_i}\sum_{i=1}^{m} v_i f_i$$
$$\text{Variance} = \frac{1}{\sum_{i=1}^{m} f_i}\sum_{i=1}^{m} (v_i - \bar{v})^2 f_i$$
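As a minimal sketch of the frequency-weighted formulas above, the following Python functions compute the mean and variance from a list of values and a matching list of frequencies. The function names and the small example table are assumptions made for illustration.

```python
def freq_mean(values, freqs):
    """Mean of data given as values v_i with frequencies f_i."""
    total = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / total


def freq_variance(values, freqs):
    """Variance (dividing by the total count) of data given with frequencies."""
    total = sum(freqs)
    mean = freq_mean(values, freqs)
    return sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / total


# Illustrative table: the value 1 occurs twice, 2 once, 3 once.
values = [1, 2, 3]
freqs = [2, 1, 1]

print(freq_mean(values, freqs))      # same as the mean of [1, 1, 2, 3]
print(freq_variance(values, freqs))  # same as the variance of [1, 1, 2, 3]
```

Expanding the table back into the raw list `[1, 1, 2, 3]` and applying the unweighted formulas should give the same two numbers.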
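A note: the maximal variance of x1, . . . , xn
- Generally speaking, the variance of a dataset can be arbitrarily large.
- Let’s consider some restrictions that make the question of a maximum meaningful:
  - $x_i \ge 0 \ \forall i$
  - fix the total sum of values: $\sum_{i=1}^{n} x_i = n\bar{x}$

Expanding the sum of squared deviations:
$$\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} (x_i^2 + \bar{x}^2 - 2x_i\bar{x}) \\
&= \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} \bar{x}^2 - 2\bar{x}\sum_{i=1}^{n} x_i \\
&= \sum_{i=1}^{n} x_i^2 + n\bar{x}^2 - 2\bar{x}(n\bar{x}) \\
&= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{aligned}$$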
A note: the maximal variance of x1, . . . , xn
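So, $V(x_1, \dots, x_n) = \left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)/n$. Now, since every $x_i \ge 0$, the cross terms in $\left(\sum_{i=1}^{n} x_i\right)^2$ are non-negative, so $\sum_{i=1}^{n} x_i^2 \le \left(\sum_{i=1}^{n} x_i\right)^2$ and
$$\sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \le \left(\sum_{i=1}^{n} x_i\right)^2 - n\bar{x}^2 = n^2\bar{x}^2 - n\bar{x}^2 = \bar{x}^2 n(n - 1)$$
which means that
$$V(x_1, \dots, x_n) \le \bar{x}^2 (n - 1)$$
- Can we imagine a set of values of $x_1, \dots, x_n$ for which the variance is actually equal to this maximum?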
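One candidate answer to the question above can be checked numerically: put the whole total n·x̄ in a single observation and set all the others to zero. The Python sketch below compares the variance of such a dataset with the bound x̄²(n − 1). The function names and the values of n and x̄ are illustrative assumptions.

```python
def variance(xs):
    """Average squared distance from the mean (divides by n)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_bound(xs):
    """Upper bound mean^2 * (n - 1), valid for non-negative data."""
    n = len(xs)
    mean = sum(xs) / n
    return mean ** 2 * (n - 1)


# Illustrative choice: n = 5 observations with mean 4, so the total is 20.
n, mean = 5, 4.0
extreme = [0.0] * (n - 1) + [n * mean]  # one observation holds everything
even = [mean] * n                       # everyone holds the same amount

print(variance(extreme), variance_bound(extreme))  # the two values coincide
print(variance(even), variance_bound(even))        # variance 0, same bound
```

Both datasets share the same total, so they share the same bound; only the extreme one reaches it.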
Index of concentration
The opposite of spread-out is “concentrated.”
Let’s consider variables like the one we just talked about, that is, with only non-negative values. One such variable might be the income of households in a nation.
It is interesting to study how “concentrated” such income is. One can imagine that the total income of a nation is the total amount of a resource that could be distributed among its households.
Income inequality in the media
Income inequality in politics
How can we measure “income inequality”?
- Suppose we have a population of $n$ individuals, with incomes $x_1, \dots, x_n$.
- $n\bar{x}$ is the total income in the population (with $\bar{x} = \sum_{i=1}^{n} x_i / n$).
- What would be the values of $x_1, \dots, x_n$ in the case of maximal “income equality”?
- What would be the values of $x_1, \dots, x_n$ in the case of maximal “income inequality”?
- How are we going to judge cases in the middle?
  - Any known measure?
  - Any measure we can come up with given what we already know?
A graphical display for income distribution
We take the values $x_1, \dots, x_n$ and order them
$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$$
For simplicity, we are going to drop the parentheses from the index notation, and just remember that $x_1$ is the smallest value.
We now calculate two quantities:
$$F_i = \frac{i}{n} \qquad Q_i = \frac{\sum_{j=1}^{i} x_j}{\sum_{j=1}^{n} x_j}$$
- $F_1$ is the fraction of the population that corresponds to the bottom earner; $F_2$ is the fraction of the population that corresponds to the two bottom earners, etc.
- $Q_1$ is the fraction of the national income earned by the bottom earner; $Q_2$ is the fraction of the national income earned by the two bottom earners, etc.
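The two quantities just defined can be computed directly from a list of incomes. The Python sketch below sorts the incomes and returns the lists of F and Q values. The function name `lorenz_points` is an assumption made for this example, and the income values are the ones used for the plot in a later slide.

```python
from itertools import accumulate


def lorenz_points(incomes):
    """Return (F, Q): cumulative population shares and cumulative income shares."""
    xs = sorted(incomes)          # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    total = sum(xs)
    F = [i / n for i in range(1, n + 1)]        # F_i = i / n
    Q = [c / total for c in accumulate(xs)]     # Q_i = (x_1 + ... + x_i) / total
    return F, Q


incomes = [1, 2, 3, 10, 15, 15, 30, 50]
F, Q = lorenz_points(incomes)
for f, q in zip(F, Q):
    print(f"F = {f:.3f}   Q = {q:.3f}")
```

Plotting the pairs (F, Q), together with the starting point (0, 0), gives the kind of display these slides discuss.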
A graphical display for income distribution
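- Let’s think about the relation between $F_i$ and $Q_i$ in the case of perfect income equality.
- In general, $Q_i \le F_i$. To see this, start from the definitions, multiply both sides by $\sum_{j=1}^{n} x_j$ and divide by $i$:
$$Q_i \le F_i \iff \frac{\sum_{j=1}^{i} x_j}{\sum_{j=1}^{n} x_j} \le \frac{i}{n} \iff \frac{\sum_{j=1}^{i} x_j}{i} \le \frac{\sum_{j=1}^{n} x_j}{n}$$
and remember that the $x_i$ are increasing: the mean of the $i$ smallest values cannot exceed the mean of all $n$ values.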
A graphical display for income distribution
Income values = 1, 2, 3, 10, 15, 15, 30, 50
[Figure: Q plotted against F for these income values; both axes run from 0 to 1.]
A graphical display for income distribution
[Figure: two panels of Q plotted against F, both axes running from 0 to 1. Panel titles: “Perfect equality” and “Maximal inequality”.]
How could we use this to construct an Index?
An idea for the index
[Figure: Q plotted against F; both axes run from 0 to 1.]
From area to index
- Index varies between 0 and 1
- Area in between curves: $A = 1/2 -$ area under bottom curve
- Area under bottom curve: sum of areas of trapezoids. Thus, with the convention $F_0 = Q_0 = 0$,
$$A = \frac{1}{2} - \sum_{i=1}^{n} \frac{(F_i - F_{i-1})(Q_i + Q_{i-1})}{2}$$
- Gini’s index:
$$G = \frac{A}{1/2} = 1 - \sum_{i=1}^{n} (F_i - F_{i-1})(Q_i + Q_{i-1})$$
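The trapezoid formula translates almost line by line into code. The following Python sketch computes G for a list of individual incomes, using the convention F_0 = Q_0 = 0. The function name `gini` is an assumption, and the three small datasets are chosen to show an intermediate case and the two extremes.

```python
from itertools import accumulate


def gini(incomes):
    """Gini index via G = 1 - sum (F_i - F_{i-1}) (Q_i + Q_{i-1})."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Prepend the starting point F_0 = Q_0 = 0.
    F = [0.0] + [i / n for i in range(1, n + 1)]
    Q = [0.0] + [c / total for c in accumulate(xs)]
    trapezoids = sum((F[i] - F[i - 1]) * (Q[i] + Q[i - 1]) for i in range(1, n + 1))
    return 1.0 - trapezoids


print(gini([1, 2, 3, 10, 15, 15, 30, 50]))  # an intermediate case
print(gini([5, 5, 5, 5]))                   # perfect equality
print(gini([0, 0, 0, 8]))                   # one person holds everything
```

With n individuals, the "one person holds everything" case gives (n − 1)/n under this formula, so it approaches 1 only as n grows.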
How do things change if we have repetition?
- Data in the form

Value   x1 ≤ x2 ≤ · · · ≤ xk
Count   n1   n2   · · ·   nk

with $\sum_{j} n_j = n$
- Define
$$F_i = \frac{\sum_{j=1}^{i} n_j}{n} \qquad Q_i = \frac{\sum_{j=1}^{i} n_j x_j}{\sum_{j=1}^{k} n_j x_j}$$
- Everything else stays the same.
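A minimal sketch of the grouped version, assuming the data arrive as two parallel lists of distinct values and their counts (the function name `gini_grouped` and the argument layout are assumptions). Only the definitions of F and Q change; the trapezoid sum is the same as before.

```python
def gini_grouped(values, counts):
    """Gini index for data given as values x_j with counts n_j."""
    pairs = sorted(zip(values, counts))       # order by value
    n = sum(c for _, c in pairs)
    total = sum(v * c for v, c in pairs)

    f_prev = q_prev = 0.0                     # F_0 = Q_0 = 0
    cum_count = cum_income = 0
    trapezoids = 0.0
    for v, c in pairs:
        cum_count += c
        cum_income += v * c
        f, q = cum_count / n, cum_income / total
        trapezoids += (f - f_prev) * (q + q_prev)
        f_prev, q_prev = f, q
    return 1.0 - trapezoids


# The income values 1, 2, 3, 10, 15, 15, 30, 50 with the repeated 15 grouped.
print(gini_grouped([1, 2, 3, 10, 15, 30, 50], [1, 1, 1, 1, 2, 1, 1]))
```

Because the curve is a straight segment across a block of equal values, grouping them into one trapezoid gives the same area as treating each individual separately.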
Income distribution in USA 2015
Current Population Survey, Income Data
[Figure: proportion in income bracket (vertical axis, 0.00 to 0.08) against average income in bracket (horizontal axis, 0 to 4e+05), with one series each for All, Whites, Blacks, Asians, Hispanics.]
Revisiting the Income data
[Figure: Gini index (horizontal axis, about 0.47 to 0.50) by race: All, Asian, Black, Hispanic, White.]
Gini index for other data
Something to note
We can calculate the following summary of “mutual variability”
$$\Delta = \sum_{i=1}^{k}\sum_{j=1}^{k} |x_i - x_j| \, \frac{n_i}{n} \, \frac{n_j}{n}$$
And one can show that
$$G = \frac{\Delta}{2\bar{x}}$$
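The identity can be checked numerically on a small example. The Python sketch below computes Δ from values and counts, divides by 2x̄, and compares the result with the trapezoid-based index. All function names are assumptions made for illustration, and the data are the income values 1, 2, 3, 10, 15, 15, 30, 50 in grouped form.

```python
def mutual_variability(values, counts):
    """Delta: sum over all pairs of |x_i - x_j| weighted by (n_i/n)(n_j/n)."""
    n = sum(counts)
    return sum(
        abs(xi - xj) * (ni / n) * (nj / n)
        for xi, ni in zip(values, counts)
        for xj, nj in zip(values, counts)
    )


def gini_from_delta(values, counts):
    """Gini index computed as Delta / (2 * mean)."""
    n = sum(counts)
    mean = sum(v * c for v, c in zip(values, counts)) / n
    return mutual_variability(values, counts) / (2 * mean)


def gini_trapezoid(values, counts):
    """Gini index from the trapezoid formula, for comparison."""
    pairs = sorted(zip(values, counts))
    n = sum(counts)
    total = sum(v * c for v, c in pairs)
    f_prev = q_prev = 0.0
    cum_count = cum_income = 0
    area = 0.0
    for v, c in pairs:
        cum_count += c
        cum_income += v * c
        f, q = cum_count / n, cum_income / total
        area += (f - f_prev) * (q + q_prev)
        f_prev, q_prev = f, q
    return 1.0 - area


values = [1, 2, 3, 10, 15, 30, 50]
counts = [1, 1, 1, 1, 2, 1, 1]
print(gini_from_delta(values, counts), gini_trapezoid(values, counts))
```

The double sum costs on the order of k² operations, so for many distinct values the trapezoid route is the cheaper way to obtain the same number.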