Source: dspace.mit.edu/.../lecture-notes/samp_and_inferce.pdf

Post on 25-Feb-2020


1

Sampling and Inference

The Quality of Data and Measures

2

Why we talk about sampling

• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences

3

Why do we sample?

[Figure: axes labeled N and Cost/benefit; curves labeled Benefit (precision) and Cost (hassle factor)]

4

How do we sample?

• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster

5

Stratification

• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)

• Benefit: preserve or enhance variability

6

Cluster sampling

[Figure: cluster sampling stages labeled Block, HH Unit, Individual]

7

Effects of samples

• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling of independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
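
The last two bullets can be checked with a small simulation. The sketch below is illustrative only: the data-generating process (y = x + noise, true slope 1.0), the selection rules (keep |x| > 1, keep y > 0), and the function names are assumptions, not part of the lecture. It compares the fitted slope on the full data, on a sample selected on x, and on a sample selected on y.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate(n=20_000, true_slope=1.0, seed=0):
    """Return slopes from the full data, an x-selected sample, and a y-selected sample."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0, 1) for x in xs]
    pairs = list(zip(xs, ys))
    on_x = [(x, y) for x, y in pairs if abs(x) > 1]  # assumed rule: keep extreme x
    on_y = [(x, y) for x, y in pairs if y > 0]       # assumed rule: keep high y
    return (
        ols_slope(xs, ys),
        ols_slope(*zip(*on_x)),
        ols_slope(*zip(*on_y)),
    )

full, sel_x, sel_y = simulate()
print(f"full data:     {full:.3f}")
print(f"selected on x: {sel_x:.3f}")
print(f"selected on y: {sel_y:.3f}")
```

Under these assumptions the slope from the x-selected sample should stay close to the true value, while the slope from the y-selected sample should be pulled toward zero.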

8

Sampling on Independent Variable

[Figure: two scatterplots of y against x]

9

Sampling on Dependent Variable

[Figure: two scatterplots of y against x]

10

Sampling

Consequences for Statistical Inference

11

Statistical Inference: Learning About the Unknown From the Known

• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.
• Reasoning backward: learning about the population mean when only the sample mean, s.d., and n are known

12

Reasoning Forward

13

First, we play with some simulations

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

14

Exponential Distribution Example

[Figure: histogram of inc (x-axis 0 to 1.0e+06; y-axis Fraction)]

Mean = 250,000
Median = 125,000
s.d. = 283,474
Min = 0
Max = 1,000,000

15

Consider 10 random samples, of n = 100 apiece

Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: histogram of inc (x-axis 0 to 1.0e+06; y-axis Fraction)]

16

Consider 10,000 samples of n = 100

N = 10,000
Mean = 249,993
s.d. = 28,559
Skewness = 0.060
Kurtosis = 2.92

[Figure: histogram of (mean) inc (x-axis 0 to 1.0e+06; y-axis Fraction)]

17

Consider 1,000 samples of various sizes

n       Mean      s.d.     Skew    Kurt
10      250,105   90,891   0.38    3.13
100     250,498   28,297   0.02    2.90
1000    249,938   9,376    -0.50   6.80

[Figure: three histograms of (mean) inc, one per sample size (x-axis 0 to 1.0e+06; y-axis Fraction)]
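
The shrinking spread of the sample means can be reproduced with a short simulation. This is a sketch under an assumption: it draws from an untruncated exponential population with mean 250,000 (so s.d. 250,000), which is not exactly the slide's population (s.d. 283,474, capped at 1,000,000). The function name `sd_of_sample_means` is made up for illustration. The point is that the s.d. of the sample means tracks the population s.d. divided by the square root of n.

```python
import random
import statistics

def sd_of_sample_means(n, draws=1000, mean=250_000, seed=1):
    """Standard deviation of `draws` sample means, each from a sample of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.expovariate(1 / mean) for _ in range(n))
        for _ in range(draws)
    ]
    return statistics.stdev(means)

for n in (10, 100, 1000):
    simulated = sd_of_sample_means(n)
    predicted = 250_000 / n ** 0.5  # population s.d. / sqrt(n)
    print(f"n={n:5d}  simulated s.d.={simulated:10,.0f}  predicted={predicted:10,.0f}")
```

The same rule applied to the slide's own population s.d. gives 283,474 / sqrt(100) ≈ 28,347, close to the reported 28,297 and 28,559.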

18

Difference of means example

[Figure: histograms of inc (State 1) and inc2 (State 2) (x-axis 0 to 1.0e+06; y-axis Fraction)]

State 1: Mean = 250,000
State 2: Mean = 300,000

19

Take 1,000 samples of 10, of each state, and compare them

First 10 samples

Sample   State 1        State 2
1        311,410    <   365,224
2        184,571    <   243,062
3        468,574    >   438,336
4        253,374    <   557,909
5        220,934    >   189,674
6        270,400    <   284,309
7        127,115    <   210,970
8        253,885    <   333,208
9        152,678    <   314,882
10       222,725    >   152,312

20

1,000 samples of 10

[Figure: scatterplot of (mean) inc2 against (mean) inc (both axes 0 to 1.1e+06)]

State 2 > State 1: 673 times

21

1,000 samples of 100

[Figure: scatterplot of (mean) inc2 against (mean) inc (both axes 0 to 1.1e+06)]

State 2 > State 1: 909 times

22

1,000 samples of 1,000

[Figure: scatterplot of (mean) inc2 against (mean) inc (both axes 0 to 1.1e+06)]

State 2 > State 1: 1,000 times

23

Another way of looking at it: The distribution of Inc2 - Inc1

n        Mean      s.d.
10       51,845    124,815
100      49,704    38,774
1,000    49,816    13,932

[Figure: three histograms of diff, one per sample size (x-axis -400000 to 600000; y-axis Fraction)]
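
The counts of "State 2 > State 1" can be tied to the reported means and s.d.s of the difference. The sketch below assumes the difference in sample means is approximately normal, which is only rough at n = 10, and computes the implied probability that the difference is positive. The names `normal_cdf` and `prob_positive` are illustrative.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_positive(mean_diff, sd_diff):
    """P(difference > 0) if the difference is normal with this mean and s.d."""
    return normal_cdf(mean_diff / sd_diff)

# (mean, s.d.) of Inc2 - Inc1 as reported on the slide, keyed by sample size
reported = {10: (51_845, 124_815), 100: (49_704, 38_774), 1000: (49_816, 13_932)}
# number of times (out of 1,000) State 2 exceeded State 1, from the earlier slides
observed = {10: 673, 100: 909, 1000: 1000}

for n, (mean_diff, sd_diff) in reported.items():
    p = prob_positive(mean_diff, sd_diff)
    print(f"n={n:5d}  normal approx.={p:.3f}  observed share={observed[n] / 1000:.3f}")
```

The approximated probabilities should land near the observed shares, and both rise toward 1 as n grows.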

24

Reasoning Backward

When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$

25

Central Limit Theorem

As the sample size n increases, the distribution of the mean of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$

26

Calculating Standard Errors

In general:

$$\text{std. err.} = \frac{s}{\sqrt{n}}$$

27

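Most important standard errors

Mean: $\dfrac{s}{\sqrt{n}}$

Proportion: $\sqrt{\dfrac{p(1-p)}{n}}$

Diff. of 2 means: $s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}$

Regression (slope) coeff.: $\dfrac{\text{s.e.r.}}{\sqrt{n}} \times \dfrac{1}{s_x}$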
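
The four standard errors are short enough to write as functions. This is a minimal sketch; the function and argument names are made up for illustration, and `s_p` is taken to be the pooled standard deviation.

```python
import math

def se_mean(s, n):
    """Standard error of a mean: s / sqrt(n)."""
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_diff_means(s_p, n1, n2):
    """Standard error of a difference of two means, with pooled s.d. s_p."""
    return s_p * math.sqrt(1 / n1 + 1 / n2)

def se_slope(ser, n, s_x):
    """Standard error of a regression slope: (s.e.r. / sqrt(n)) * (1 / s_x)."""
    return ser / math.sqrt(n) / s_x

# Population s.d. from the exponential example, samples of n = 100
print(round(se_mean(283_474, 100), 1))
# A proportion of 0.5 in a sample of 1,000 (hypothetical numbers)
print(round(se_proportion(0.5, 1000), 4))
```

The first call reproduces the 283,474 / sqrt(100) ≈ 28,347 spread seen in the simulated sample means earlier in the lecture.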

28

Return to the applets for the regression standard error

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

29

The Idea Behind Classical Hypothesis Testing

[Figure labels: H0 = 0; True mean or regression coefficient; Sample mean or regression coefficient]

30

What We Know

• We know:
  – The sample mean/coeff. will not equal the population mean/coeff.
  – The sample mean/coeff., sample s.d./s.e., & n
• The question:
  – Is the sample mean/coeff. “far” from H0 or “close” to H0?

31

If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve

[Figure: normal curve centered on the mean, x-axis marked from -4σ to 4σ, with the 68%, 95%, and 99% regions marked]
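
The 68% / 95% / 99% regions are the areas under the normal curve within 1, 2, and 3 standard deviations of the mean. A quick check using only the standard library (the helper name `coverage` is illustrative):

```python
import math

def coverage(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} s.d.: {coverage(k):.4f}")
```

The values come out at about 0.6827, 0.9545, and 0.9973, which the slide rounds to 68%, 95%, and 99%.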


35

Therefore….

• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table

Z = (H1 - H0) / s.e.

36

Reading a z table

z table for standardized normal distribution. Image removed for copyright reasons.

37

Therefore….

• When the sample size is small (i.e., < 150), convert the difference into t units and consult a t table

t = (H1 - H0) / s.e.

38

t (when the sample is small)

[Figure: t-distribution and z (normal) distribution plotted together (x-axis -4 to 4)]

39

Reading a t table

t table for the t distribution. Image removed for copyright reasons.

40

Doing a t-test

[Figure: histogram of diff9692 (x-axis -.2 to .4; y-axis Fraction)]

Q: How likely is it that the residual vote rate in 1996 was equal to the rate in 1992 (i.e., blank96 - blank92 = 0)?

Mean: 0.003069
s.d.: 0.02323
N: 1448

$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$

41

The picture

Mean: 0.003069
s.d.: 0.02323
N: 1448

[Figure: normal curve over newz, x-axis ticks from -.00059 to .003069]

$$\text{s.e.} = \frac{s}{\sqrt{n}} = \frac{0.02323}{\sqrt{1448}} = 0.00061$$

$$t = \frac{0.003069 - 0}{0.00061} = 5.028$$
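
The arithmetic on this slide can be reproduced directly from the three summary numbers. In the sketch below the two-sided p-value uses the normal approximation, which is an assumption that is reasonable here because N = 1448 is large. The variable names are illustrative.

```python
import math

mean_diff = 0.003069   # mean of blank96 - blank92
sd_diff = 0.02323      # s.d. of the difference
n = 1448               # number of observations
h0 = 0.0               # null hypothesis: no change in the residual vote rate

se = sd_diff / math.sqrt(n)          # standard error of the mean difference
t = (mean_diff - h0) / se            # distance from H0 in standard-error units
p_two_sided = math.erfc(abs(t) / math.sqrt(2))  # normal approximation

print(f"s.e. = {se:.5f}")
print(f"t    = {t:.3f}")
print(f"p    = {p_two_sided:.2e}")
```

The result agrees with the slide up to rounding: a difference about five standard errors from zero, with a two-sided p-value far below conventional thresholds.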

42

The STATA output

. ttest blank96=blank92

Paired t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 blank96 |    1448    .0242941    .0005116    .0194689    .0232904    .0252977
 blank92 |    1448     .021225    .0005382    .0204813    .0201692    .0222808
---------+--------------------------------------------------------------------
    diff |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------

Ho: mean(blank96 - blank92) = mean(diff) = 0

Ha: mean(diff) < 0      Ha: mean(diff) ~= 0      Ha: mean(diff) > 0
    t = 5.0278              t = 5.0278               t = 5.0278
P < t = 1.0000        P > |t| = 0.0000           P > t = 0.0000

. ttest diff9692=0

One-sample t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
diff9692 |    1448     .003069    .0006104    .0232279    .0018717    .0042664
------------------------------------------------------------------------------
Degrees of freedom: 1447

Ho: mean(diff9692) = 0

Ha: mean < 0            Ha: mean ~= 0            Ha: mean > 0
    t = 5.0278              t = 5.0278               t = 5.0278
P < t = 1.0000        P > |t| = 0.0000           P > t = 0.0000

43

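Final t-test

Q: Was there a relationship between residual vote and county size in 1996?

Slope coeff: -0.07510
s.e.r.: 0.7115
N: 1861
s_x: 1.4788

$$\text{s.e.} = \frac{\text{s.e.r.}}{\sqrt{n}} \times \frac{1}{s_x} = \frac{0.7115}{\sqrt{1861}} \times \frac{1}{1.4788} = 0.01649 \times 0.6762 = 0.01115$$

[Figure: scatterplot of blank96 and fitted values against vap96_to (x-axis 326 to 6.5e+06; y-axis .000281 to .298789)]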

44

Calculating t

$$t = \frac{-0.07510}{0.01115} = -6.7319$$

45

The STATA output

. reg lblank96 lvap96

      Source |       SS       df       MS              Number of obs =    1861
-------------+------------------------------           F(  1,  1859) =   45.32
       Model |   22.941515     1   22.941515           Prob > F      =  0.0000
    Residual |  941.080329  1859  .506229332           R-squared     =  0.0238
-------------+------------------------------           Adj R-squared =  0.0233
       Total |  964.021844  1860  .518291314           Root MSE      =   .7115

------------------------------------------------------------------------------
    lblank96 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      lvap96 |  -.0750985   .0111556    -6.73   0.000    -.0969774   -.0532197
       _cons |  -3.129858   .1113781   -28.10   0.000    -3.348298   -2.911419
------------------------------------------------------------------------------
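
The t statistic and confidence interval in the lvap96 row follow from the coefficient and its standard error. The sketch below recomputes them; it uses 1.96 as the critical value, a large-sample normal approximation and an assumption on my part, so the interval can differ from Stata's (which uses the t distribution with 1859 degrees of freedom) in the fifth decimal place.

```python
coef = -0.0750985      # slope on lvap96, from the regression output
std_err = 0.0111556    # its standard error, from the regression output
z_crit = 1.96          # assumed large-sample critical value for a 95% interval

t_stat = coef / std_err
ci_lower = coef - z_crit * std_err
ci_upper = coef + z_crit * std_err

print(f"t = {t_stat:.2f}")
print(f"95% CI approx. [{ci_lower:.4f}, {ci_upper:.4f}]")
```

Because the interval excludes zero, the slope is distinguishable from H0 = 0 at the 5% level, which is the same conclusion the t statistic gives.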

46

A word about standard errors and collinearity

• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y

47

How does having another collinear independent variable affect standard errors?

$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1}{N-n-1} \cdot \frac{S_Y^2}{S_{X_1}^2} \cdot \frac{1-R_Y^2}{1-R_{X_1}^2}}$$

$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables

48

Example: Effect of party, ideology, and religiosity on feelings toward Bush

                Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings   1.0             .39        .57      .16
Conserv.                        1.0        .46      .18
Repub.                                     1.0      .06
Relig.                                              1.0

49

Regression table

            (1)       (2)       (3)       (4)
Intercept   32.7      32.9      32.6      29.3
            (0.85)    (1.08)    (1.20)    (1.31)
Repub.      6.73      5.86      6.64      5.88
            (0.244)   (0.27)    (0.241)   (0.27)
Conserv.    ---       2.11      ---       1.87
                      (0.30)              (0.30)
Relig.      ---       ---       7.92      5.78
                                (1.18)    (1.19)
N           1575      1575      1575      1575
R2          .32       .35       .35       .36
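
The rise in the Repub. standard error from column (1) to column (2) can be roughly accounted for with the collinearity formula: s.e. = sqrt( 1/(N-n-1) × S_Y²/S_X1² × (1-R²_Y)/(1-R²_X1) ). The sketch below is an approximation under stated assumptions: with only one other regressor, the auxiliary R² is taken to be the squared correlation between Repub. and Conserv. (.46²), the R² values are the rounded ones from the table, and the ratio S_Y/S_X1 cancels, so it is set to 1. The function name `se_beta1` is illustrative.

```python
import math

def se_beta1(n_obs, n_vars, s_y, s_x1, r2_y, r2_x1):
    """Standard error of beta_1 given the fit (r2_y) and the auxiliary R^2 (r2_x1)."""
    return math.sqrt(
        (1 / (n_obs - n_vars - 1))
        * (s_y ** 2 / s_x1 ** 2)
        * (1 - r2_y) / (1 - r2_x1)
    )

# Column (1): Repub. alone, so the auxiliary R^2 is 0.
se_col1 = se_beta1(1575, 1, 1.0, 1.0, 0.32, 0.0)
# Column (2): Repub. and Conserv., correlated at .46 (assumed auxiliary R^2 = .46^2).
se_col2 = se_beta1(1575, 2, 1.0, 1.0, 0.35, 0.46 ** 2)

ratio = se_col2 / se_col1
print(f"predicted inflation of s.e.: {ratio:.3f}")
print(f"predicted column (2) s.e.:   {0.244 * ratio:.3f}")
```

The predicted column (2) standard error comes out near the 0.27 in the table: the better fit lowers the standard error slightly, but the collinearity with Conserv. raises it by more.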