Source: dspace.mit.edu/.../lecture-notes/samp_and_inferce.pdf


Sampling and Inference

The Quality of Data and Measures

Why we talk about sampling

• General citizen education
• Understand the data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences

Why do we sample?

[Figure: cost/benefit trade-off, with benefit (precision) set against cost (hassle factor).]

How do we sample?

• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
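The designs listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 1,000 person IDs, the "north"/"south" strata, and the four-person households are all hypothetical, and the function names are not from the lecture.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement, each unit equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start inside the first interval."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random start in [0, k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample separately inside each stratum."""
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)
people = list(range(1000))                                # hypothetical frame
strata = {"north": people[:600], "south": people[600:]}   # hypothetical strata
households = {h: people[h * 4:(h + 1) * 4] for h in range(250)}  # hypothetical clusters

srs = simple_random_sample(people, 50, rng)
sys_s = systematic_sample(people, 50, rng)
strat = stratified_sample(strata, 25, rng)
clus = cluster_sample(households, 10, rng)
print(len(srs), len(sys_s), {k: len(v) for k, v in strat.items()}, len(clus))
```

Note that the stratified draw fixes how many units come from each stratum, while the cluster draw fixes only the number of clusters.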

Stratification

• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)

• Benefit: preserve or enhance variability

Cluster sampling

[Figure: cluster sampling diagram, from HH unit to individual.]

Effects of samples

• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling on the independent variable: greater precision in regression estimates
    • Sampling on the dependent variable: bias
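A small simulation can make the last two bullets concrete. The sketch below assumes a hypothetical population in which y = 2x + noise, then compares the OLS slope in the full data, in a subsample chosen on x (only extreme x values), and in a subsample chosen on y (only y > 0). The data-generating process and selection rules are illustrative assumptions, not taken from the lecture.

```python
import random

def ols_slope(xs, ys):
    """Bivariate OLS slope: sum of cross-products over sum of squares of x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def slope_of(rows):
    return ols_slope([x for x, _ in rows], [y for _, y in rows])

rng = random.Random(1)
TRUE_SLOPE = 2.0  # assumed population slope
data = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = TRUE_SLOPE * x + rng.gauss(0, 1)
    data.append((x, y))

slope_all = slope_of(data)
slope_sel_x = slope_of([(x, y) for x, y in data if abs(x) > 1])  # select on x
slope_sel_y = slope_of([(x, y) for x, y in data if y > 0])       # select on y

print(round(slope_all, 2), round(slope_sel_x, 2), round(slope_sel_y, 2))
```

Under these assumptions the slope from the x-selected subsample stays close to the true value, and keeping extreme x values widens the spread of x, which is what shrinks the slope's standard error for a given n. The y-selected subsample gives a slope pulled toward zero.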

Sampling on Independent Variable

Sampling on Dependent Variable

Sampling

Consequences for Statistical Inference

Statistical Inference: Learning About the Unknown From the Known

• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.
• Reasoning backward: learning about the population mean when only the sample, s.d., and n are known

Reasoning Forward

First, we play with some simulations

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

Exponential Distribution Example

[Figure: distribution of inc, horizontal axis from 0 to 1.0e+06.]

Mean = 250,000
Median = 125,000
s.d. = 283,474
Min = 0
Max = 1,000,000

Consider 10 random samples, of n = 100 apiece

Sample   mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: distribution of inc, horizontal axis from 0 to 1.0e+06.]

Consider 10,000 samples of n = 100

N = 10,000
Mean = 249,993
s.d. = 28,559
Skewness = 0.060
Kurtosis = 2.92

[Figure: distribution of (mean) inc, horizontal axis from 0 to 1.0e+06.]
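The experiment on this slide can be approximated with a short simulation. The sketch below assumes a pure exponential population with mean 250,000 (so its s.d. is also 250,000). The population on the slides has an s.d. of 283,474 and a maximum of 1,000,000, so the numbers here will not match the slide exactly.

```python
import math
import random
import statistics

POP_MEAN = 250_000  # assumed exponential population mean
rng = random.Random(42)

def sample_mean(n):
    """Mean of one random sample of size n from the assumed population."""
    return statistics.fmean(rng.expovariate(1 / POP_MEAN) for _ in range(n))

# 10,000 samples of n = 100, as on the slide
means = [sample_mean(100) for _ in range(10_000)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
theory_sd = POP_MEAN / math.sqrt(100)  # sigma / sqrt(n) for this population

print(round(mean_of_means), round(sd_of_means), round(theory_sd))
```

The mean of the sample means should sit near the population mean, and their s.d. should sit near the population s.d. divided by the square root of n.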

Consider 1,000 samples of various sizes

n        10         100        1000
Mean     250,105    250,498    249,938
s.d.     90,891     28,297     9,376
Skew     0.38       0.02       -0.50
Kurt     3.13       2.90       6.80

[Figure: distributions of (mean) inc for each sample size, horizontal axis from 0 to 1.0e+06.]
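As a rough check, the s.d. of the sample means at each sample size can be compared with the population s.d. divided by the square root of n. The sketch uses the population s.d. of 283,474 and the simulated s.d. values reported on the slides; the variable names are only illustrative.

```python
import math

POP_SD = 283_474                                    # population s.d. from the slides
observed = {10: 90_891, 100: 28_297, 1000: 9_376}   # s.d. of sample means from the slides

# Predicted s.d. of the sample mean: sigma / sqrt(n)
predicted = {n: POP_SD / math.sqrt(n) for n in observed}

for n in observed:
    print(n, round(predicted[n]), observed[n])
```

The predicted and simulated values agree to within a few percent, with the gap reflecting simulation noise from using only 1,000 samples.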

Difference of means example

[Figure: distribution of inc for State 1, horizontal axis from 0 to 1.0e+06.]

State 1: Mean = 250,000

[Figure: distribution of inc2 for State 2, horizontal axis from 0 to 1.0e+06.]

State 2: Mean = 300,000

Take 1,000 samples of 10, of each state, and compare them

First 10 samples

Sample   State 1         State 2
1        311,410    <    365,224
2        184,571    <    243,062
3        468,574    >    438,336
4        253,374    <    557,909
5        220,934    >    189,674
6        270,400    <    284,309
7        127,115    <    210,970
8        253,885    <    333,208
9        152,678    <    314,882
10       222,725    >    152,312

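1,000 samples of 10

[Figure: sample means of the two states plotted against each other, (mean) inc axes from 0 to 1.1e+06.]

State 2 > State 1: 673 times

1,000 samples of 100

[Figure: sample means of the two states plotted against each other, (mean) inc axes from 0 to 1.1e+06.]

State 2 > State 1: 909 times

1,000 samples of 1,000

[Figure: sample means of the two states plotted against each other, (mean) inc axes from 0 to 1.1e+06.]

State 2 > State 1: 1,000 times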
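The comparison can be re-run as a simulation. The sketch below assumes both states have exponential income distributions with means 250,000 and 300,000. The slides do not state the exact populations used, so the counts will be in the same neighborhood as the slide's but not identical.

```python
import random
import statistics

rng = random.Random(7)

def mean_income(pop_mean, n):
    """Sample mean of n draws from an assumed exponential population."""
    return statistics.fmean(rng.expovariate(1 / pop_mean) for _ in range(n))

def count_state2_wins(n, reps=1000):
    """Number of paired samples in which State 2's mean exceeds State 1's."""
    wins = 0
    for _ in range(reps):
        if mean_income(300_000, n) > mean_income(250_000, n):
            wins += 1
    return wins

wins = {n: count_state2_wins(n) for n in (10, 100, 1000)}
print(wins)
```

As n grows, the sample means cluster more tightly around 250,000 and 300,000, so the truly higher state wins more often.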

Another way of looking at it: The distribution of Inc2 – Inc1

         n = 10     n = 100    n = 1,000
Mean     51,845     49,704     49,816
s.d.     124,815    38,774     13,932

[Figure: distributions of diff for each sample size, horizontal axis from -400,000 to 600,000.]

Reasoning Backward

When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$

Central Limit Theorem

As the sample size n increases, the distribution of the mean of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$

Calculating Standard Errors

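In general:

std. err. = $s / \sqrt{n}$

Most important standard errors

Mean: $s / \sqrt{n}$
Proportion: $\sqrt{p(1-p)/n}$
Diff. of 2 means: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$
Regression (slope) coeff.: $\dfrac{\text{s.e.r.}}{\sqrt{n}} \times \dfrac{1}{s_x}$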
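The four standard errors can be written as small Python functions. This is a sketch: the function names are illustrative, and the difference-of-means version assumes two independent samples with unpooled variances. The mean and slope calls use the figures from the t-test slides later in these notes; the proportion call uses a hypothetical p = 0.5 and n = 1,000.

```python
import math

def se_mean(s, n):
    """Standard error of a mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference of two independent means (unpooled)."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_slope(ser, n, s_x):
    """Approximate s.e. of a bivariate slope: (s.e.r. / sqrt(n)) * (1 / s_x)."""
    return ser / math.sqrt(n) / s_x

print(round(se_mean(0.02323, 1448), 5))           # residual vote difference
print(round(se_slope(0.7115, 1861, 1.4788), 5))   # residual vote regression
print(round(se_proportion(0.5, 1000), 4))         # hypothetical poll
```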

Return to the applets for the regression standard error

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

The Idea Behind Classical Hypothesis Testing

[Figure: H0 = 0, the sample mean or regression coefficient, and the true mean or regression coefficient marked for comparison.]

What We Know

• We know:
  – The sample mean/coeff. will not equal the population mean/coeff.
  – The sample mean/coeff., sample s.d./s.e., & n
• The question:
  – Is the sample mean/coeff. “far” from H0 or “close” to H0?

If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve

[Figure: normal curve, horizontal axis marked from −4σ to 4σ, with the 68%, 95%, and 99% regions shown.]
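The areas under the normal curve can be computed directly from the error function in Python's standard library. The sketch below shows the share of the curve within 1, 2, and 3 standard deviations of the mean; the helper names are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_within(k):
    """Share of the normal curve within k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, a in areas.items():
    print(k, round(a, 4))
```

The slide's 68%, 95%, and 99% labels are rounded versions of these areas.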

Therefore….

• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table

Z = (H1 - H0) / s.e.

Reading a z table

z table for standardized normal distribution. Image removed for copyright reasons.

Therefore….

• When the sample size is small (i.e., < 150), convert the difference into t units and consult a t table

t = (H1 - H0) / s.e.

t (when the sample is small)

[Figure: t-distribution compared with the z (normal) distribution, horizontal axis from -4 to 4.]

Reading a t table

t table for standardized normal distribution. Image removed for copyright reasons.

Doing a t-test

[Figure: distribution of diff9692, horizontal axis from -.2 to .4.]

Q: How likely is it that the residual vote rate in 1996 was equal to the rate in 1992 (i.e., blank96 - blank92 = 0)?

Mean: 0.003069
s.d.: 0.02323
N: 1448

s.e. = $s / \sqrt{n}$ = $0.02323 / \sqrt{1448}$ = 0.00061

The picture

Mean: 0.003069
s.d.: 0.02323
N: 1448

[Figure: normal curve over newz, axis ticks at -.00059, .000017, .000627, .00124, .00185, .00246, and .003069.]

s.e. = $s / \sqrt{n}$ = $0.02323 / \sqrt{1448}$ = 0.00061

t = $(0.003069 - 0) / 0.00061$ = 5.028
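The arithmetic on this slide can be reproduced in a few lines. The sketch uses the mean, s.d., and N from the slide. The two-sided p-value is computed from the normal curve, which is an approximation chosen here because n is large; Stata's own output uses the t distribution.

```python
import math

mean_diff = 0.003069   # mean of blank96 - blank92, from the slide
sd = 0.02323           # s.d. of the difference, from the slide
n = 1448
h0 = 0.0               # null hypothesis: no change between 1992 and 1996

se = sd / math.sqrt(n)
t_stat = (mean_diff - h0) / se

# Two-sided p-value from the normal curve (large-sample approximation)
p_two_sided = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(se, 5), round(t_stat, 3), p_two_sided)
```

A difference about five standard errors from zero is very unlikely if the true difference were zero.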

The STATA output

. ttest blank96=blank92

Paired t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 blank96 |    1448    .0242941    .0005116    .0194689    .0232904    .0252977
 blank92 |    1448     .021225    .0005382    .0204813    .0201692    .0222808
---------+--------------------------------------------------------------------
    diff |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------

Ho: mean(blank96 - blank92) = mean(diff) = 0

 Ha: mean(diff) < 0       Ha: mean(diff) ~= 0       Ha: mean(diff) > 0
     t =   5.0278             t =   5.0278              t =   5.0278
 P < t =   1.0000         P > |t| =   0.0000        P > t =   0.0000

. ttest diff9692=0

One-sample t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
diff9692 |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------
Degrees of freedom: 1447

Ho: mean(diff9692) = 0

 Ha: mean < 0             Ha: mean ~= 0             Ha: mean > 0
     t =   5.0278             t =   5.0278              t =   5.0278
 P < t =   1.0000         P > |t| =   0.0000        P > t =   0.0000

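Final t-test

Q: Was there a relationship between residual vote and county size in 1996?

Slope coeff: -0.07510
s.e.r.: 0.7115
N: 1861
s_x: 1.4788

s.e. = $\dfrac{\text{s.e.r.}}{\sqrt{n}} \times \dfrac{1}{s_x}$ = $\dfrac{0.7115}{\sqrt{1861}} \times \dfrac{1}{1.4788}$ = 0.01649 × 0.6762 = 0.01115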

[Figure: blank96 and fitted values plotted against vap96_to, horizontal axis from 326 to 6.5e+06.]

Calculating t

t = $(-0.07510 - 0) / 0.01115$ = -6.7319
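To see how the slope's standard error formula behaves, the sketch below fits a bivariate regression by hand on synthetic data and compares the exact OLS standard error with the approximation (s.e.r./√n) × (1/s_x). The data-generating values (intercept -3, slope -0.075, noise s.d. 0.7, n = 500) are assumptions picked to loosely resemble the example, not the actual county data.

```python
import math
import random
import statistics

rng = random.Random(3)
n = 500
xs = [rng.gauss(10, 1.5) for _ in range(n)]                # synthetic regressor
ys = [-3.0 - 0.075 * x + rng.gauss(0, 0.7) for x in xs]    # synthetic outcome

x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar

# Standard error of the regression (root MSE), with n - 2 degrees of freedom
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ser = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
s_x = statistics.stdev(xs)

se_exact = ser / math.sqrt(sxx)        # textbook OLS standard error
se_approx = ser / math.sqrt(n) / s_x   # (s.e.r. / sqrt(n)) * (1 / s_x)
t_stat = slope / se_approx

print(round(slope, 4), round(se_exact, 5), round(se_approx, 5), round(t_stat, 2))
```

The two standard errors differ only by a factor of √((n−1)/n), which is negligible at sample sizes like the 1,861 counties in the example.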

The STATA output

. reg lblank96 lvap96

      Source |       SS       df       MS              Number of obs =    1861
-------------+------------------------------           F(  1,  1859) =   45.32
       Model |   22.941515     1   22.941515           Prob > F      =  0.0000
    Residual |  941.080329  1859  .506229332           R-squared     =  0.0238
-------------+------------------------------           Adj R-squared =  0.0233
       Total |  964.021844  1860  .518291314           Root MSE      =   .7115

------------------------------------------------------------------------------
    lblank96 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lvap96 |  -.0750985   .0111556    -6.73   0.000    -.0969774   -.0532197
       _cons |  -3.129858   .1113781   -28.10   0.000    -3.348298   -2.911419
------------------------------------------------------------------------------

A word about standard errors and collinearity

• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y

How does having another collinear independent variable affect standard errors?

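$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1 - R_Y^2}{1 - R_{X_1}^2}} \times \frac{S_Y}{S_{X_1}} \times \sqrt{\frac{1}{N - n - 1}}$$

$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables

Here $R_Y^2$ is the R2 of the full regression, $S_Y$ and $S_{X_1}$ are the standard deviations of Y and X1, N is the number of observations, and n is the number of independent variables.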

Example: Effect of party, ideology, and religiosity on feelings toward Quincy Bush

                 Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings    1.0             .39        .57      .16
Conserv.                         1.0        .46      .18
Repub.                                      1.0      .06
Relig.                                               1.0

Regression table

             (1)            (2)            (3)            (4)
Intercept    32.7 (0.85)    32.9 (1.08)    32.6 (1.20)    29.3 (1.31)
Repub.       6.73 (0.244)   5.86 (0.27)    6.64 (0.241)   5.88 (0.27)
Conserv.     ---            2.11 (0.30)    ---            1.87 (0.30)
Relig.       ---            ---            7.92 (1.18)    5.78 (1.19)
N            1575           1575           1575           1575
R2           .32            .35            .35            .36
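The table can be used to check the collinearity argument numerically. The sketch below assumes the standard result that the standard error of a coefficient is proportional to √((1 − R²_Y) / ((1 − R²_X1)(N − n − 1))), where R²_X1 comes from the auxiliary regression. It then predicts how the Repub. standard error should change between models (1) and (2), using the .46 correlation between Repub. and Conserv. and the R2 values from the table. Variable names are illustrative.

```python
import math

def se_inflation(r2_aux):
    """Factor by which collinearity alone inflates a coefficient's s.e."""
    return 1 / math.sqrt(1 - r2_aux)

# With only one other regressor, the auxiliary R2 is the squared correlation.
r_repub_conserv = 0.46
r2_aux = r_repub_conserv ** 2
inflation = se_inflation(r2_aux)

# Moving from model (1) to model (2): adding Conserv. also improves the fit.
N = 1575
r2_model1, r2_model2 = 0.32, 0.35
se_model1 = 0.244                                        # Repub. s.e. in model (1)
fit_gain = math.sqrt((1 - r2_model2) / (1 - r2_model1))  # better fit lowers the s.e.
df_change = math.sqrt((N - 1 - 1) / (N - 2 - 1))         # one fewer degree of freedom

se_model2_predicted = se_model1 * fit_gain * inflation * df_change
print(round(inflation, 3), round(se_model2_predicted, 3))
```

Under these assumptions the predicted standard error for Repub. in model (2) comes out close to the 0.27 reported in the table: the inflation from collinearity outweighs the gain from the better fit.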
