1
Sampling and Inference
The Quality of Data and Measures
March 23, 2006
2
Why we talk about sampling
• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences
3
Why do we sample?
[Figure: cost (hassle factor) and benefit (precision) plotted against sample size N]
4
How do we sample?
• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
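A minimal sketch of the three draws, assuming a hypothetical population of 1,000 numbered units and two made-up strata "A" and "B". The function names are illustrative, not from any particular package, and cluster sampling is left out for brevity.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start among the first k units."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start in [0, k)
    return population[start::k][:n]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    sample = []
    for units in strata.values():
        sample.extend(rng.sample(units, round(len(units) * fraction)))
    return sample

rng = random.Random(0)
population = list(range(1000))                     # hypothetical sampling frame
strata = {"A": list(range(0, 700)), "B": list(range(700, 1000))}

srs = simple_random_sample(population, 50, rng)
systematic = systematic_sample(population, 50, rng)
stratified = stratified_sample(strata, 0.05, rng)
```

With proportional allocation the stratified draw always contains 35 units from stratum A and 15 from stratum B, whereas the simple random sample only matches those shares on average.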
5
Stratification
• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)
• Benefit: preserve or enhance variability
6
Cluster sampling
[Figure: nested sampling stages: block → household (HH) unit → individual]
7
Effects of samples
• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling of independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
8
Sampling on Independent Variable
[Figure: two scatterplots of y against x]
9
Sampling on Dependent Variable
[Figure: two scatterplots of y against x]
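A small simulation can make the contrast concrete. This is a sketch under assumed values: the true slope of 2, the noise s.d. of 5, and the cutoffs (x > 5, y > 10) are invented for illustration and do not come from the slides.

```python
import random

def ols_slope(xs, ys):
    """Bivariate least-squares slope: cov(x, y) / var(x)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]      # assumed true slope = 2

slope_all = ols_slope(xs, ys)

# Keep cases based on the independent variable: slope stays near 2.
kept_on_x = [(x, y) for x, y in zip(xs, ys) if x > 5]
slope_select_x = ols_slope(*zip(*kept_on_x))

# Keep cases based on the dependent variable: slope is pulled toward 0.
kept_on_y = [(x, y) for x, y in zip(xs, ys) if y > 10]
slope_select_y = ols_slope(*zip(*kept_on_y))

print(f"all cases: {slope_all:.2f}")
print(f"selected on x: {slope_select_x:.2f}")
print(f"selected on y: {slope_select_y:.2f}")
```

Selecting on y discards low-y cases mainly where x is small, which flattens the fitted line.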
10
Sampling
Consequences for Statistical Inference
11
Statistical Inference: Learning About the Unknown From the Known
• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.
• Reasoning backward: learning about the population mean when only the sample mean, s.d., and n are known
12
Reasoning Forward
13
Exponential Distribution Example
[Figure: histogram of inc; x-axis 0 to 1,000,000; y-axis Fraction, 0 to .271441]
Mean = 250,000; Median = 125,000; s.d. = 283,474; Min = 0; Max = 1,000,000
14
Consider 10 random samples, of n = 100 apiece
Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3
[Figure: histogram of inc; x-axis 0 to 1,000,000; y-axis Fraction, 0 to .271441]
15
Consider 10,000 samples of n = 100
[Figure: histogram of (mean) inc; x-axis 0 to 1,000,000; y-axis Fraction, 0 to .275972]
N = 10,000; Mean = 249,993; s.d. = 28,559; Skewness = 0.060; Kurtosis = 2.92
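The forward reasoning can be replayed with a short simulation. This sketch assumes a pure exponential population with mean 250,000 (so its s.d. is also 250,000, a little below the 283,474 of the data on the slide) and uses 2,000 replications rather than 10,000 to keep it quick.

```python
import random
import statistics

def sample_means(pop_mean, n, reps, rng):
    """Return `reps` sample means, each computed from n exponential draws."""
    return [
        statistics.fmean(rng.expovariate(1 / pop_mean) for _ in range(n))
        for _ in range(reps)
    ]

rng = random.Random(2006)
means = sample_means(pop_mean=250_000, n=100, reps=2_000, rng=rng)

center = statistics.fmean(means)   # should sit close to the population mean
spread = statistics.stdev(means)   # should sit close to 250,000 / sqrt(100)
print(f"mean of sample means = {center:,.0f}; s.d. = {spread:,.0f}")
```

The spread of the sample means is roughly the population s.d. divided by the square root of n, even though the population itself is strongly skewed.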
16
Consider 1,000 samples of various sizes
n per sample   Mean      s.d.     Skew    Kurt
10             250,105   90,891   0.38    3.13
100            250,498   28,297   0.02    2.90
1000           249,938   9,376    -0.50   6.80
[Figure: three histograms of (mean) inc, one per sample size; x-axis 0 to 1,000,000; y-axis Fraction, 0 to .731]
17
Difference of means example
[Figure: two histograms, inc (State 1) and inc2 (State 2); x-axis 0 to 1,000,000; y-axis Fraction]
State 1: Mean = 250,000
State 2: Mean = 300,000
18
Take 1,000 samples of 10, of each state, and compare them
First 10 samples
Sample   State 1       State 2
1        311,410   <   365,224
2        184,571   <   243,062
3        468,574   >   438,336
4        253,374   <   557,909
5        220,934   >   189,674
6        270,400   <   284,309
7        127,115   <   210,970
8        253,885   <   333,208
9        152,678   <   314,882
10       222,725   >   152,312
19
1,000 samples of 10
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes 0 to 1.1e+06]
State 2 > State 1: 673 times
20
1,000 samples of 100
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes 0 to 1.1e+06]
State 2 > State 1: 909 times
21
1,000 samples of 1,000
[Figure: scatterplot of (mean) inc2 against (mean) inc; both axes 0 to 1.1e+06]
State 2 > State 1: 1,000 times
22
Another way of looking at it: The distribution of Inc2 – Inc1
n       Mean     s.d.
10      51,845   124,815
100     49,704   38,774
1,000   49,816   13,932
[Figure: three histograms of diff, one per sample size; x-axis -400,000 to 600,000; y-axis Fraction, 0 to .565]
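The two-state comparison can be re-run as a simulation. This is a sketch that assumes both income distributions are pure exponentials with means 250,000 and 300,000, and it uses 300 replications per sample size instead of 1,000, so the shares will only roughly match the counts on the slides.

```python
import random
import statistics

def share_state2_higher(n, reps, rng, mean1=250_000, mean2=300_000):
    """Share of replications in which State 2's sample mean exceeds State 1's."""
    wins = 0
    for _ in range(reps):
        m1 = statistics.fmean(rng.expovariate(1 / mean1) for _ in range(n))
        m2 = statistics.fmean(rng.expovariate(1 / mean2) for _ in range(n))
        wins += m2 > m1
    return wins / reps

rng = random.Random(17)
shares = {n: share_state2_higher(n, reps=300, rng=rng) for n in (10, 100, 1000)}
for n, share in shares.items():
    print(f"n = {n:>4}: State 2 > State 1 in {share:.0%} of samples")
```

The true gap of 50,000 never changes; what changes with n is how reliably a single pair of samples reveals it.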
23
Play with some simulations
• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
• http://www.kuleuven.ac.be/ucs/java/index.htm
24
Reasoning Backward
When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$
25
Central Limit Theorem
As the sample size n increases, the distribution of the mean of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
26
Calculating Standard Errors
In general:
$\text{std. err.} = \frac{s}{\sqrt{n}}$
27
Most important standard errors
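Mean: $\frac{s}{\sqrt{n}}$
Proportion: $\sqrt{\frac{p(1-p)}{n}}$
Diff. of 2 means: $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Diff. of 2 proportions: $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$
Diff. of 2 means (paired data): $\frac{s_d}{\sqrt{n}}$
Regression (slope) coeff.: $\frac{s.e.r.}{\sqrt{n-1}} \times \frac{1}{s_x}$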
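The six standard errors can be written as small functions. This is a sketch with illustrative function names; the numeric inputs are the ones used in the worked examples of this deck (tuition, Bush approval, midterm seat loss).

```python
import math

def se_mean(s, n):
    return s / math.sqrt(n)

def se_proportion(p, n):
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired(s_d, n):
    # s_d is the s.d. of the within-pair differences
    return s_d / math.sqrt(n)

def se_slope(ser, n, s_x):
    # ser is the standard error of the regression (root MSE)
    return ser / (math.sqrt(n - 1) * s_x)

print(round(se_mean(2196, 15)))                           # tuition mean
print(round(se_proportion(0.37, 1000), 4))                # approval proportion
print(round(se_diff_means(2196, 15, 1894, 15)))           # private vs. public
print(round(se_diff_proportions(0.29, 300, 0.10, 300), 3))
print(round(se_paired(886, 15)))                          # 2003 vs. 2004
print(round(se_slope(13.8, 14, 8.14), 2))                 # seat-loss slope
```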
28
Using Standard Errors, we can construct “confidence intervals”
• Confidence interval (ci): an interval between two numbers, where there is a certain specified level of confidence that a population parameter lies
• ci = sample estimate ± multiple * sample standard error
29
Constructing Confidence Intervals
• Let’s say we draw a sample of tuitions from 15 private universities. Can we estimate what the average of all private university tuitions is?
• N = 15
• Average = 29,735
• S.d. = 2,196
• S.e. = $\frac{s}{\sqrt{n}} = \frac{2{,}196}{\sqrt{15}} = 567$
30
The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567
Mean: 29,735
±1 s.e.: 29,735-567 = 29,168; 29,735+567 = 30,302
±2 s.e.: 29,735-2*567 = 28,601; 29,735+2*567 = 30,869
31
Confidence Intervals for Tuition Example
• 68% confidence interval = 29,735 ± 567 = [29,168 to 30,302]
• 95% confidence interval = 29,735 ± 2*567 = [28,601 to 30,869]
• 99% confidence interval = 29,735 ± 3*567 = [28,034 to 31,436]
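The same arithmetic as a reusable helper. A minimal sketch: the function name is illustrative, and the multiples 1, 2, and 3 are the rounded rule-of-thumb values used on these slides rather than exact normal quantiles.

```python
def confidence_interval(estimate, std_err, multiple):
    """Return (low, high) = estimate -/+ multiple * standard error."""
    return (estimate - multiple * std_err, estimate + multiple * std_err)

# Tuition example: mean 29,735 with s.e. 567
tuition_intervals = {
    "68%": confidence_interval(29_735, 567, 1),
    "95%": confidence_interval(29_735, 567, 2),
    "99%": confidence_interval(29_735, 567, 3),
}
for level, (low, high) in tuition_intervals.items():
    print(f"{level}: [{low:,} to {high:,}]")
```

The helper works unchanged for any of the standard errors in this lecture (proportions, differences, slopes); only the estimate and its s.e. change.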
32
What if someone (ahead of time) had said, “I think the average tuition of major research universities is $25k”?
• Note that $25,000 is well out of the 99% confidence interval, [28,034 to 31,436]
• Q: How far away is the $25k estimate from the sample mean?
  – A: Do it in z-scores: (29,735-25,000)/567 = 8.35
33
Constructing confidence intervals of proportions
• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll) Can we estimate the % of all American adults who approve?
• N = 1000
• p = .37
• s.e. = $\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{.37(1-.37)}{1000}} \approx .02$
34
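The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
N = 1,000; p = .37; s.e. = √(p(1-p)/n) = .02
Mean: .37
±1 s.e.: .37-.02 = .35; .37+.02 = .39
±2 s.e.: .37-2*.02 = .33; .37+2*.02 = .41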
35
Confidence Intervals for Bush approval example
• 68% confidence interval = .37 ± .02 = [.35 to .39]
• 95% confidence interval = .37 ± 2*.02 = [.33 to .41]
• 99% confidence interval = .37 ± 3*.02 = [.31 to .43]
36
What Gallup said about the confidence interval
• Results are based on telephone interviews with 1,000 national adults, aged 18 and older, conducted March 13-16, 2006. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ±3 percentage points [because the actual standard error is 1.5%, not 2%].
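One way to see where a "maximum" margin of ±3 points can come from is sketched below. It assumes simple random sampling and the 1.96 multiplier for 95% confidence; Gallup's actual procedure (weighting, design effects) is not described in the quote and is not modeled here.

```python
import math

def margin_of_error(p, n, multiple=1.96):
    """Half-width of the interval for a proportion p from a sample of n."""
    return multiple * math.sqrt(p * (1 - p) / n)

n = 1000
max_margin = margin_of_error(0.5, n)      # p(1-p) is largest at p = .5
margin_at_37 = margin_of_error(0.37, n)   # the observed approval share

print(f"maximum margin: {max_margin:.3f}")
print(f"margin at p = .37: {margin_at_37:.3f}")
```

Because p(1-p) peaks at p = .5, quoting the margin at .5 gives a bound that covers every question in the poll.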
37
What if someone (ahead of time) had said, “I think Americans are equally divided in how they think about Bush.”
• Note that 50% is well out of the 99% confidence interval, [31% to 43%]
• Q: How far away is the 50% estimate from the sample proportion?
  – A: Do it in z-scores: (.37-.5)/.02 = -6.5 [-8.7 if we divide by 0.015]
38
Constructing confidence intervals of differences of means
• Let’s say we draw a sample of tuitions from 15 private and 15 public universities. Can we estimate what the difference in average tuitions is between the two types of universities?
• N = 15 in both cases
• Average = 29,735 (private); 5,498 (public); diff = 24,238
• s.d. = 2,196 (private); 1,894 (public)
• s.e. = $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{4{,}822{,}416}{15} + \frac{3{,}587{,}236}{15}} = 749$
39
The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
N = 15 twice; diff = 24,238; s.e. = 749
Mean: 24,238
±1 s.e.: 24,238-749 = 23,489; 24,238+749 = 24,987
±2 s.e.: 24,238-2*749 = 22,740; 24,238+2*749 = 25,736
40
Confidence Intervals for difference of tuition means example
• 68% confidence interval = 24,238 ± 749 = [23,489 to 24,987]
• 95% confidence interval = 24,238 ± 2*749 = [22,740 to 25,736]
• 99% confidence interval = 24,238 ± 3*749 = [21,991 to 26,485]
41
What if someone (ahead of time) had said, “Private universities are no more expensive than public universities”
• Note that $0 is well out of the 99% confidence interval, [$21,991 to $26,485]
• Q: How far away is the $0 estimate from the sample difference?
  – A: Do it in z-scores: (24,238-0)/749 = 32.4
42
Constructing confidence intervals of difference of proportions
• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll). We focus on the 600 who are either independents or Democrats. Can we estimate whether independents and Democrats view Bush differently?
• N = 300 ind; 300 Dem.
• p = .29 (ind.); .10 (Dem.); diff = .19
• s.e. = $\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} = \sqrt{\frac{.29(1-.29)}{300} + \frac{.10(1-.10)}{300}} = .03$
43
The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
diff. p. = .19; s.e. = .03
Mean: .19
±1 s.e.: .19-.03 = .16; .19+.03 = .22
±2 s.e.: .19-2*.03 = .13; .19+2*.03 = .25
44
Confidence Intervals for Bush Ind/Dem approval example
• 68% confidence interval = .19 ± .03 = [.16 to .22]
• 95% confidence interval = .19 ± 2*.03 = [.13 to .25]
• 99% confidence interval = .19 ± 3*.03 = [.10 to .28]
45
What if someone (ahead of time) had said, “I think Democrats and Independents are equally unsupportive of Bush”?
• Note that 0% is well out of the 99% confidence interval, [10% to 28%]
• Q: How far away is the 0% estimate from the sample difference in proportions?
  – A: Do it in z-scores: (.19-0)/.03 = 6.33
46
Constructing confidence intervals of differences of means in a paired sample
• Let’s say we draw a sample of tuitions from 15 private universities, in 2003 and again in 2004. Can we estimate what the difference in average tuitions is between the two years?
• N = 15
• Averages = 28,102 (2003); 29,735 (2004); diff = 1,632
• s.d. = 2,196 (private); 1,894 (public); 886 (diff)
• s.e. = $\frac{s_d}{\sqrt{n}} = \frac{886}{\sqrt{15}} = 229$
47
The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
N = 15; diff = 1,632; s.e. = 229
Mean: 1,632
±1 s.e.: 1,632-229 = 1,403; 1,632+229 = 1,861
±2 s.e.: 1,632-2*229 = 1,174; 1,632+2*229 = 2,090
48
Confidence Intervals for second difference of tuition means example
• 68% confidence interval = 1,632 ± 229 = [1,403 to 1,861]
• 95% confidence interval = 1,632 ± 2*229 = [1,174 to 2,090]
• 99% confidence interval = 1,632 ± 3*229 = [945 to 2,319]
49
What if someone (ahead of time) had said, “Private university tuitions did not grow from 2003 to 2004”
• Note that $0 is well out of the 99% confidence interval, [$945 to $2,319]
• Q: How far away is the $0 estimate from the sample difference?
  – A: Do it in z-scores: (1,632-0)/229 = 7.13
50
The Stata output
. gen difftuition=tuition2004-tuition2003
. ttest diff=0 in 1/15

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
difftu~n |      15      1631.6    228.6886     885.707    1141.112    2122.088
------------------------------------------------------------------------------
    mean = mean(difftuition)                                      t =   7.1346
Ho: mean = 0                                     degrees of freedom =       14

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
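Stata's 95% interval is wider than mean ± 2 s.e. because it uses a t multiplier with 14 degrees of freedom. The sketch below reproduces it by hand; the critical value 2.1448 is read from a standard t table (97.5th percentile, 14 df) and is an input I am supplying, not something printed in the output.

```python
estimate = 1631.6     # mean of difftuition, from the Stata output
std_err = 228.6886    # Std. Err., from the Stata output
t_crit = 2.1448       # assumed: t table value, 97.5th percentile, 14 df
z_crit = 1.96         # normal-curve counterpart

t_stat = estimate / std_err
t_interval = (estimate - t_crit * std_err, estimate + t_crit * std_err)
z_interval = (estimate - z_crit * std_err, estimate + z_crit * std_err)

print(f"t = {t_stat:.4f}")
print(f"95% interval using t: [{t_interval[0]:.1f}, {t_interval[1]:.1f}]")
print(f"95% interval using z: [{z_interval[0]:.1f}, {z_interval[1]:.1f}]")
```

With only 15 observations the t interval is noticeably wider than the z interval, which is the point of the z vs. t discussion later in the lecture.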
51
Constructing confidence intervals of regression coefficients
• Let’s look at the relationship between the seat loss by the President’s party at midterm and the President’s Gallup poll rating
[Figure: scatterplot with fitted line; points labeled by midterm year, 1938 to 2002; x-axis: Gallup approval rating (Nov.), 30 to 70; y-axis: Change in House seats, -80 to 0]
Slope = 1.97
N = 14
s.e.r. = 13.8
sx = 8.14
s.e.slope = $\frac{s.e.r.}{\sqrt{n-1}} \times \frac{1}{s_x} = \frac{13.8}{\sqrt{14-1}} \times \frac{1}{8.14} = 0.47$
52
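The Picture
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
N = 14; slope = 1.97; s.e. = 0.47
Mean: 1.97
±1 s.e.: 1.97-0.47 = 1.50; 1.97+0.47 = 2.44
±2 s.e.: 1.97-2*0.47 = 1.03; 1.97+2*0.47 = 2.91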
53
Confidence Intervals for regression example
• 68% confidence interval = 1.97 ± 0.47 = [1.50 to 2.44]
• 95% confidence interval = 1.97 ± 2*0.47 = [1.03 to 2.91]
• 99% confidence interval = 1.97 ± 3*0.47 = [0.56 to 3.38]
54
What if someone (ahead of time) had said, “There is no relationship between the president’s popularity and how his party’s House members do at midterm”?
• Note that 0 is well out of the 99% confidence interval, [0.56 to 3.38]
• Q: How far away is the 0 estimate from the sample slope?
  – A: Do it in z-scores: (1.97-0)/0.47 = 4.19
55
The Stata output
. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------
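Most of the summary numbers in this output follow from a handful of entries. The sketch below recomputes them from the sums of squares and the slope row; the critical value 2.1788 is taken from a standard t table (97.5th percentile, 12 df) and is my input, not part of the output.

```python
import math

ss_model, df_model = 3332.58872, 1
ss_resid, df_resid = 2280.83985, 12
ss_total, df_total = 5613.42857, 13
coef, std_err = 1.96812, 0.4700211
t_crit = 2.1788       # assumed: t table value, 97.5th percentile, 12 df

ms_resid = ss_resid / df_resid
root_mse = math.sqrt(ms_resid)                 # the "s.e.r." used earlier
r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * df_total / df_resid
f_stat = (ss_model / df_model) / ms_resid
t_stat = coef / std_err
slope_interval = (coef - t_crit * std_err, coef + t_crit * std_err)

print(f"Root MSE = {root_mse:.3f}, R-squared = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}, F = {f_stat:.2f}, t = {t_stat:.2f}")
print(f"95% interval for slope: [{slope_interval[0]:.4f}, {slope_interval[1]:.4f}]")
```

Note that with a single regressor F equals the square of the slope's t statistic.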
56
z vs. t
57
If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve
[Figure: normal curve centered on the mean, with bands at ±1, ±2, and ±3 s.e. marking 68%, 95%, and 99%]
61
Therefore….
• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table
Z = (H1 - H0) / s.e.
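Instead of a printed z table, the tail area can be computed directly. A minimal sketch using only the standard library; the function names are illustrative, and the inputs are the tuition example from earlier in the lecture.

```python
import math

def z_score(observed, hypothesized, std_err):
    """Distance between the estimate and the hypothesized value, in s.e. units."""
    return (observed - hypothesized) / std_err

def two_sided_p(z):
    """P(|Z| > |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

z_tuition = z_score(29_735, 25_000, 567)
print(f"z = {z_tuition:.2f}, two-sided p = {two_sided_p(z_tuition):.2e}")
print(f"p at z = 1.96: {two_sided_p(1.96):.3f}")
```

A z of 1.96 leaves about 5% in the two tails combined, which is why ±2 s.e. is used as the rough 95% interval.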
62
Reading a z table
63
64
Therefore….
• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table
t = (H1 - H0) / s.e.
65
t (when the sample is small)
[Figure: t-distribution overlaid on the z (normal) distribution; x-axis z, -4 to 4]
66
Reading a t table
67
A word about standard errors and collinearity
• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y
68
How does having another collinear independent variable affect standard
errors?
$s.e.(\hat{\beta}_1) = \sqrt{\frac{1}{N - n - 1}} \times \frac{S_Y}{S_{X_1}} \times \sqrt{\frac{1 - R_Y^2}{1 - R_{X_1}^2}}$
$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables
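To see how the auxiliary R2 inflates the standard error, the sketch below evaluates s.e.(β1) = sqrt(1/(N-n-1)) × (S_Y/S_X1) × sqrt((1-R2_Y)/(1-R2_X1)). I am assuming N is the number of observations and n the number of independent variables; every input value is hypothetical and chosen only to show the pattern.

```python
import math

def se_beta1(N, n, s_y, s_x1, r2_y, r2_x1):
    """Standard error of the coefficient on X1.

    r2_y  : R2 of the full regression of Y on all X's
    r2_x1 : R2 of the auxiliary regression of X1 on the other X's
    """
    return (
        math.sqrt(1 / (N - n - 1))
        * (s_y / s_x1)
        * math.sqrt((1 - r2_y) / (1 - r2_x1))
    )

# Hypothetical inputs: same data summary, different degrees of collinearity
base = dict(N=1575, n=3, s_y=30.0, s_x1=2.0, r2_y=0.36)
se_no_collinearity = se_beta1(r2_x1=0.0, **base)
se_moderate = se_beta1(r2_x1=0.5, **base)
se_high = se_beta1(r2_x1=0.75, **base)

print(f"{se_no_collinearity:.3f} {se_moderate:.3f} {se_high:.3f}")
```

Holding everything else fixed, an auxiliary R2 of .75 doubles the standard error relative to no collinearity, since sqrt(1/(1-.75)) = 2.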
69
Example: Effect of party, ideology, and religiosity on feelings toward Quincy Bush
                 Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings    1.0             .39        .57      .16
Conserv.                         1.0        .46      .18
Repub.                                      1.0      .06
Relig.                                               1.0
70
Regression table
             (1)            (2)            (3)            (4)
Intercept    32.7 (0.85)    32.9 (1.08)    32.6 (1.20)    29.3 (1.31)
Repub.       6.73 (0.244)   5.86 (0.27)    6.64 (0.241)   5.88 (0.27)
Conserv.     ---            2.11 (0.30)    ---            1.87 (0.30)
Relig.       ---            ---            7.92 (1.18)    5.78 (1.19)
N            1575           1575           1575           1575
R2           .32            .35            .35            .36
Standard errors in parentheses.