Sampling and Inference

1

The Quality of Data and Measures

March 23, 2006

2

Why we talk about sampling

• General citizen education
• Understand data you’ll be using
• Understand how to draw a sample, if you need to
• Make statistical inferences

3

Why do we sample?

[Figure: cost/benefit plotted against sample size N, with one curve for benefit (precision) and one for cost (hassle factor)]

4

How do we sample?

• Simple random sample
  – Variant: systematic sample with a random start
• Stratified
• Cluster
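
The first two designs, plus stratification, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 1,000 unit IDs, the four made-up strata, and all function names are hypothetical and do not come from the lecture.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement, each unit equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start, with k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of fixed size within each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

rng = random.Random(0)
population = list(range(1000))      # hypothetical frame of 1,000 unit IDs
srs = simple_random_sample(population, 50, rng)
sys_sample = systematic_sample(population, 50, rng)
# Hypothetical strata: the unit ID modulo 4 stands in for a known characteristic.
strat = stratified_sample(population, lambda u: u % 4, 10, rng)
```

Cluster sampling is left out here because it needs a nested frame (blocks, then households, then individuals).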

5

Stratification

• Divide sample into subsamples, based on known characteristics (race, sex, religiosity, continent, department)

• Benefit: preserve or enhance variability

6

Cluster sampling

[Diagram: nested sampling units: Block → HH Unit → Individual]

7

Effects of samples

• Obvious: influences marginals
• Less obvious
  – Allows effective use of time and effort
  – Effect on multivariate techniques
    • Sampling of independent variable: greater precision in regression estimates
    • Sampling on dependent variable: bias
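
A small simulation can make the last two bullets concrete. The sketch below assumes a made-up linear model (true slope 2.0, normal noise with s.d. 5, x uniform on 0 to 10); none of these values come from the lecture. It fits a bivariate slope on the full data, on a sample selected on x, and on a sample selected on y.

```python
import random

def ols_slope(xs, ys):
    """Bivariate least-squares slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
TRUE_SLOPE = 2.0                     # assumed value, for illustration only
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [TRUE_SLOPE * x + rng.gauss(0, 5) for x in xs]

slope_full = ols_slope(xs, ys)

# Selecting on the independent variable: keep only low and high x.
on_x = [(x, y) for x, y in zip(xs, ys) if x < 2 or x > 8]
slope_on_x = ols_slope(*zip(*on_x))

# Selecting on the dependent variable: keep only cases with y above 10.
on_y = [(x, y) for x, y in zip(xs, ys) if y > 10]
slope_on_y = ols_slope(*zip(*on_y))

print(slope_full, slope_on_x, slope_on_y)
```

Under these assumptions the slope estimated after selecting on x stays close to the true value, while the slope estimated after selecting on y is pulled toward zero.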

8

Sampling on Independent Variable

[Figure: two scatterplots of y against x]

9

Sampling on Dependent Variable

[Figure: two scatterplots of y against x]

10

Sampling

Consequences for Statistical Inference

11

Statistical Inference: Learning About the Unknown From the Known

• Reasoning forward: distributions of sample means, when the population mean, s.d., and n are known.

• Reasoning backward: learning about the population mean when only the sample, s.d., and n are known

12

Reasoning Forward

13

Exponential Distribution Example

[Figure: histogram of inc (vertical axis: fraction), horizontal axis from 0 to 1,000,000]

Mean = 250,000
Median = 125,000
s.d. = 283,474
Min = 0
Max = 1,000,000

14

Consider 10 random samples, of n = 100 apiece

Sample   Mean
1        253,396.9
2        198,789.6
3        271,074.2
4        238,928.7
5        280,657.3
6        241,369.8
7        249,036.7
8        226,422.7
9        210,593.4
10       212,137.3

[Figure: histogram of inc (vertical axis: fraction), horizontal axis from 0 to 1,000,000]

15

Consider 10,000 samples of n = 100

[Figure: histogram of the sample means, (mean) inc, horizontal axis from 0 to 1,000,000]

N = 10,000
Mean = 249,993
s.d. = 28,559
Skewness = 0.060
Kurtosis = 2.92
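
The experiment can be reproduced in spirit with a short simulation. The sketch below is not the lecture's own code: it assumes an untruncated exponential population with mean 250,000 (the slide's population is capped at 1,000,000, so its s.d. differs), and it draws 2,000 samples rather than 10,000 to keep the run short.

```python
import random
import statistics

rng = random.Random(2006)
POP_MEAN = 250_000      # population mean taken from the slide
N_SAMPLES = 2_000       # assumption: fewer than the slide's 10,000, for speed
N = 100                 # size of each sample

def sample_mean(n):
    """Mean of one random sample of size n from an exponential population."""
    return statistics.fmean([rng.expovariate(1 / POP_MEAN) for _ in range(n)])

means = [sample_mean(N) for _ in range(N_SAMPLES)]
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# For an exponential distribution the s.d. equals the mean,
# so sigma / sqrt(n) is POP_MEAN / sqrt(N).
predicted_se = POP_MEAN / N ** 0.5

print(mean_of_means, sd_of_means, predicted_se)
```

The spread of the simulated means should sit close to sigma / sqrt(n), which is the quantity the Central Limit Theorem slide later calls the standard deviation of the sample mean.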

16

Consider 1,000 samples of various sizes

n       Mean      s.d.     Skew    Kurt
10      250,105   90,891    0.38   3.13
100     250,498   28,297    0.02   2.90
1000    249,938    9,376   -0.50   6.80

[Figure: three histograms of (mean) inc, one per sample size, horizontal axis from 0 to 1,000,000]

17

Difference of means example

[Figure: two histograms (vertical axis: fraction), inc for State 1 and inc2 for State 2, horizontal axis from 0 to 1,000,000]

State 1: Mean = 250,000

State 2: Mean = 300,000

18

Take 1,000 samples of 10, of each state, and compare them

First 10 samples

Sample   State 1       State 2
1        311,410   <   365,224
2        184,571   <   243,062
3        468,574   >   438,336
4        253,374   <   557,909
5        220,934   >   189,674
6        270,400   <   284,309
7        127,115   <   210,970
8        253,885   <   333,208
9        152,678   <   314,882
10       222,725   >   152,312

19

1,000 samples of 10

[Figure: scatterplot of (mean) inc2 against (mean) inc, both axes from 0 to 1.1e+06]

State 2 > State 1: 673 times

20

1,000 samples of 100

[Figure: scatterplot of (mean) inc2 against (mean) inc, both axes from 0 to 1.1e+06]

State 2 > State 1: 909 times

21

1,000 samples of 1,000

State 2 > State 1: 1,000 times

[Figure: scatterplot of (mean) inc2 against (mean) inc, both axes from 0 to 1.1e+06]
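
The three comparisons (673, 909, and 1,000 wins for State 2 out of 1,000) can be imitated with a simulation. This is a rough sketch under an assumption the slides do not state in full: both states' incomes are drawn from untruncated exponential distributions with means 250,000 and 300,000. The counts it produces will therefore differ somewhat from the slides.

```python
import random
import statistics

rng = random.Random(17)

def mean_of_sample(pop_mean, n):
    """Mean of one sample of size n from an exponential population."""
    return statistics.fmean([rng.expovariate(1 / pop_mean) for _ in range(n)])

def share_state2_higher(n, reps=1000):
    """Fraction of paired samples in which State 2's sample mean is larger."""
    wins = sum(
        mean_of_sample(300_000, n) > mean_of_sample(250_000, n)
        for _ in range(reps)
    )
    return wins / reps

shares = {n: share_state2_higher(n) for n in (10, 100, 1000)}
for n, share in shares.items():
    print(f"n = {n}: State 2 > State 1 in {share:.1%} of comparisons")
```

The point matches the slides: the true difference is the same at every sample size, but small samples often get the ordering wrong.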

22

Another way of looking at it: The distribution of Inc2 – Inc1

n        Mean      s.d.
10       51,845    124,815
100      49,704     38,774
1,000    49,816     13,932

[Figure: three histograms of diff (vertical axis: fraction), one per sample size, horizontal axis from -400,000 to 600,000]

23

Play with some simulations

• http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html

• http://www.kuleuven.ac.be/ucs/java/index.htm

24

Reasoning Backward

When you know $\bar{X}$, $s$, and $n$, but want to say something about $\mu$

25

Central Limit Theorem

As the sample size n increases, the distribution of the mean $\bar{X}$ of a random sample taken from practically any population approaches a normal distribution, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$

26

Calculating Standard Errors

In general:

$\text{std. err.} = s/\sqrt{n}$

27

Most important standard errors

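Mean: $s/\sqrt{n}$

Proportion: $\sqrt{p(1-p)/n}$

Diff. of 2 means: $\sqrt{s_1^2/n_1 + s_2^2/n_2}$

Diff. of 2 proportions: $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}$

Diff. of 2 means (paired data): $s_d/\sqrt{n}$

Regression (slope) coeff.: $\dfrac{\text{s.e.r.}}{\sqrt{n-1}} \times \dfrac{1}{s_x}$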
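
The six standard errors translate directly into small functions. The sketch below uses hypothetical function names; the inputs in the example calls are the figures used in the worked examples later in this deck.

```python
from math import sqrt

def se_mean(s, n):
    """Standard error of a sample mean: s / sqrt(n)."""
    return s / sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

def se_diff_means(s1, n1, s2, n2):
    """Standard error of a difference of two independent means."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of a difference of two independent proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def se_paired_diff(s_d, n):
    """Standard error of a mean difference in paired data: s_d / sqrt(n)."""
    return s_d / sqrt(n)

def se_slope(ser, n, s_x):
    """Standard error of a bivariate regression slope."""
    return ser / (sqrt(n - 1) * s_x)

print(round(se_mean(2196, 15)))                   # tuition example
print(round(se_diff_means(2196, 15, 1894, 15)))   # private vs. public tuition
print(round(se_paired_diff(886, 15)))             # 2003 vs. 2004 tuition
print(round(se_slope(13.8, 14, 8.14), 2))         # midterm seat loss regression
```

Note that the approval-poll proportion (p = .37, n = 1000) gives about 0.015 from `se_proportion`, which the slides round up to .02.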

28

Using Standard Errors, we can construct “confidence intervals”

• Confidence interval (ci): an interval between two numbers, where there is a certain specified level of confidence that a population parameter lies

• ci = sample parameter ± multiple * sample standard error

29

Constructing Confidence Intervals

• Let’s say we draw a sample of tuitions from 15 private universities. Can we estimate what the average of all private university tuitions is?

• N = 15
• Average = 29,735
• S.d. = 2,196
• S.e. = $s/\sqrt{n} = 2{,}196/\sqrt{15} = 567$

30

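The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

N = 15; avg. = 29,735; s.d. = 2,196; s.e. = s/√n = 567

Mean: 29,735
Mean ± 1 s.e.: 29,735-567 = 29,168; 29,735+567 = 30,302
Mean ± 2 s.e.: 29,735-2*567 = 28,601; 29,735+2*567 = 30,869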

31

Confidence Intervals for Tuition Example

• 68% confidence interval = 29,735 ± 567 = [29,168 to 30,302]

• 95% confidence interval = 29,735 ± 2*567 = [28,601 to 30,869]

• 99% confidence interval = 29,735 ± 3*567 = [28,034 to 31,436]
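
The same arithmetic as a small helper, using the slide's round multiples of 1, 2, and 3 standard errors. The function name is hypothetical; the numbers are the tuition figures above.

```python
def confidence_interval(estimate, std_err, multiple):
    """Return (lower, upper) = estimate -/+ multiple * std_err."""
    half_width = multiple * std_err
    return (estimate - half_width, estimate + half_width)

# Round multiples used on the slide for the 68%, 95%, and 99% intervals.
MULTIPLES = {"68%": 1, "95%": 2, "99%": 3}

intervals = {
    label: confidence_interval(29_735, 567, k)
    for label, k in MULTIPLES.items()
}
for label, (low, high) in intervals.items():
    print(f"{label} confidence interval: [{low:,} to {high:,}]")
```

The multiples 1, 2, and 3 are convenient approximations; the exact normal-curve multiples for 95% and 99% are about 1.96 and 2.58.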

32

What if someone (ahead of time) had said, “I think the average tuition of major research universities is $25k”?

• Note that $25,000 is well out of the 99% confidence interval, [28,034 to 31,436]
• Q: How far away is the $25k estimate from the sample mean?
  – A: Do it in z-scores: (29,735-25,000)/567 = 8.35

33

Constructing confidence intervals of proportions

• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll) Can we estimate the % of all American adults who approve?

• N = 1000
• p = .37
• s.e. = $\sqrt{p(1-p)/n} = \sqrt{.37(1-.37)/1000} \approx 0.02$

34

The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

N = 1,000; p = .37; s.e. = √(p(1-p)/n) = .02

Proportion: .37
± 1 s.e.: .37-.02 = .35; .37+.02 = .39
± 2 s.e.: .37-2*.02 = .33; .37+2*.02 = .41

35

Confidence Intervals for Bush approval example

• 68% confidence interval = .37 ± .02 = [.35 to .39]
• 95% confidence interval = .37 ± 2*.02 = [.33 to .41]
• 99% confidence interval = .37 ± 3*.02 = [.31 to .43]

36

What Gallup said about the confidence interval

• Results are based on telephone interviews with 1,000 national adults, aged 18 and older, conducted March 13-16, 2006. For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is ±3 percentage points [because the actual standard error is 1.5%, not 2%].

37

What if someone (ahead of time) had said, “I think Americans are equally divided in how they think about Bush.”

• Note that 50% is well out of the 99% confidence interval, [31% to 43%]

• Q: How far away is the 50% estimate from the sample proportion?
  – A: Do it in z-scores: (.37-.5)/.02 = -6.5 [-8.7 if we divide by 0.015]

38

Constructing confidence intervals of differences of means

• Let’s say we draw a sample of tuitions from 15 private and 15 public universities. Can we estimate what the difference in average tuitions is between the two types of universities?

• N = 15 in both cases
• Average = 29,735 (private); 5,498 (public); diff = 24,238
• s.d. = 2,196 (private); 1,894 (public)
• s.e. = $\sqrt{s_1^2/n_1 + s_2^2/n_2} = \sqrt{4{,}822{,}416/15 + 3{,}587{,}236/15} = 749$

39

The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

N = 15 twice; diff = 24,238; s.e. = 749

Difference: 24,238
± 1 s.e.: 24,238-749 = 23,489; 24,238+749 = 24,987
± 2 s.e.: 24,238-2*749 = 22,740; 24,238+2*749 = 25,736

40

Confidence Intervals for difference of tuition means example

• 68% confidence interval = 24,238 ± 749 = [23,489 to 24,987]
• 95% confidence interval = 24,238 ± 2*749 = [22,740 to 25,736]
• 99% confidence interval = 24,238 ± 3*749 = [21,991 to 26,485]

41

What if someone (ahead of time) had said, “Private universities are no more expensive than public universities”

• Note that $0 is well out of the 99% confidence interval, [$21,991 to $26,485]

• Q: How far away is the $0 estimate from the sample difference in means?
  – A: Do it in z-scores: (24,238-0)/749 = 32.4

42

Constructing confidence intervals of difference of proportions

• Let us say we drew a sample of 1,000 adults and asked them if they approved of the way George Bush was handling his job as president. (March 13-16, 2006 Gallup Poll). We focus on the 600 who are either independents or Democrats. Can we estimate whether independents and Democrats view Bush differently?

• N = 300 ind; 300 Dem.
• p = .29 (ind.); .10 (Dem.); diff = .19
• s.e. = $\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2} = \sqrt{.29(1-.29)/300 + .10(1-.10)/300} = .03$

43

The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

diff. p. = .19; s.e. = .03

Difference: .19
± 1 s.e.: .19-.03 = .16; .19+.03 = .22
± 2 s.e.: .19-2*.03 = .13; .19+2*.03 = .25

44

Confidence Intervals for Bush Ind/Dem approval example

• 68% confidence interval = .19 ± .03 = [.16 to .22]
• 95% confidence interval = .19 ± 2*.03 = [.13 to .25]
• 99% confidence interval = .19 ± 3*.03 = [.10 to .28]

45

What if someone (ahead of time) had said, “I think Democrats and Independents are equally unsupportive of Bush”?

• Note that 0% is well out of the 99% confidence interval, [10% to 28%]

• Q: How far away is the 0% estimate from the sample difference in proportions?
  – A: Do it in z-scores: (.19-0)/.03 = 6.33

46

Constructing confidence intervals of differences of means in a paired sample

• Let’s say we draw a sample of tuitions from 15 private universities, in 2003 and again in 2004. Can we estimate what the difference in average tuitions is between the two years?

• N = 15
• Averages = 28,102 (2003); 29,735 (2004); diff = 1,632
• s.d. = 2,196 (private); 1,894 (public); 886 (diff)
• s.e. = $s_d/\sqrt{n} = 886/\sqrt{15} = 229$

47

The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

N = 15; diff = 1,632; s.e. = 229

Difference: 1,632
± 1 s.e.: 1,632-229 = 1,403; 1,632+229 = 1,861
± 2 s.e.: 1,632-2*229 = 1,174; 1,632+2*229 = 2,090

48

Confidence Intervals for second difference of tuition means example

• 68% confidence interval = 1,632 ± 229 = [1,403 to 1,861]
• 95% confidence interval = 1,632 ± 2*229 = [1,174 to 2,090]
• 99% confidence interval = 1,632 ± 3*229 = [945 to 2,319]

49

What if someone (ahead of time) had said, “Private university tuitions did not grow from 2003 to 2004”

• Note that $0 is well out of the 99% confidence interval, [$945 to $2,319]

• Q: How far away is the $0 estimate from the sample mean difference?
  – A: Do it in z-scores: (1,632-0)/229 = 7.13

50

The Stata output

. gen difftuition=tuition2004-tuition2003

. ttest diff=0 in 1/15

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
difftu~n |      15      1631.6    228.6886     885.707    1141.112    2122.088
------------------------------------------------------------------------------
    mean = mean(difftuition)                                      t =   7.1346
Ho: mean = 0                                     degrees of freedom =       14

    Ha: mean < 0                 Ha: mean != 0                 Ha: mean > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

51

Constructing confidence intervals of regression coefficients

• Let’s look at the relationship between the seat loss by the President’s party at midterm and the President’s Gallup poll rating

[Figure: scatterplot of change in House seats (0 to -80) against Gallup approval rating in November (30 to 70), points labeled by midterm election year from 1938 to 2002, with a line of fitted values]

Slope = 1.97
N = 14
s.e.r. = 13.8
s_x = 8.14
s.e. slope = $\dfrac{\text{s.e.r.}}{\sqrt{n-1}} \times \dfrac{1}{s_x} = \dfrac{13.8}{\sqrt{13}} \times \dfrac{1}{8.14} = 0.47$

52

The Picture

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

N = 14; slope = 1.97; s.e. = 0.47

Slope: 1.97
± 1 s.e.: 1.97-0.47 = 1.50; 1.97+0.47 = 2.44
± 2 s.e.: 1.97-2*0.47 = 1.03; 1.97+2*0.47 = 2.91

53

Confidence Intervals for regression example

• 68% confidence interval = 1.97 ± 0.47 = [1.50 to 2.44]
• 95% confidence interval = 1.97 ± 2*0.47 = [1.03 to 2.91]
• 99% confidence interval = 1.97 ± 3*0.47 = [0.56 to 3.38]

54

What if someone (ahead of time) had said, “There is no relationship between the president’s popularity and how his party’s House members do at midterm”?

• Note that 0 is well out of the 99% confidence interval, [0.56 to 3.38]
• Q: How far away is the 0 estimate from the sample slope?
  – A: Do it in z-scores: (1.97-0)/0.47 = 4.19

55

The Stata output

. reg loss gallup if year>1948

      Source |       SS       df       MS              Number of obs =      14
-------------+------------------------------           F(  1,    12) =   17.53
       Model |  3332.58872     1  3332.58872           Prob > F      =  0.0013
    Residual |  2280.83985    12  190.069988           R-squared     =  0.5937
-------------+------------------------------           Adj R-squared =  0.5598
       Total |  5613.42857    13  431.802198           Root MSE      =  13.787

------------------------------------------------------------------------------
        loss |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gallup |    1.96812   .4700211     4.19   0.001     .9440315    2.992208
       _cons |  -127.4281   25.54753    -4.99   0.000    -183.0914   -71.76486
------------------------------------------------------------------------------

56

z vs. t

57

If n is sufficiently large, we know the distribution of sample means/coeffs. will obey the normal curve

[Figure: normal curve centered on the mean, with the central 68%, 95%, and 99% regions marked]

61

Therefore….

• When the sample size is large (i.e., > 150), convert the difference into z units and consult a z table

Z = (H1 - H0) / s.e.
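
In place of a printed z table, the normal curve in Python's standard library can convert a z value into a tail probability. The sketch below uses hypothetical function names and takes its example from the approval poll earlier in the deck (estimate .37, hypothesized .50, standard error .015).

```python
from statistics import NormalDist

def z_score(estimate, hypothesized, std_err):
    """Distance between estimate and hypothesized value, in standard errors."""
    return (estimate - hypothesized) / std_err

def two_sided_p(z):
    """Probability of a value at least |z| standard errors from the mean,
    in either direction, under the standard normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z_poll = z_score(0.37, 0.50, 0.015)
print(f"z = {z_poll:.2f}, two-sided p = {two_sided_p(z_poll):.3g}")
print(f"p at z = 1.96: {two_sided_p(1.96):.3f}")
```

A z of about 1.96 leaves 5% in the two tails combined, which is where the "roughly 2 standard errors" multiple for a 95% interval comes from.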

62

Reading a z table

63

64

Therefore….

• When the sample size is small (i.e., <150), convert the difference into t units and consult a t table

t = (H1 - H0) / s.e.

65

t (when the sample is small)

[Figure: t-distribution overlaid on the z (normal) distribution, horizontal axis from -4 to 4]

66

Reading a t table

67

A word about standard errors and collinearity

• The problem: if X1 and X2 are highly correlated, then it will be difficult to precisely estimate the effect of either one of these variables on Y

68

How does having another collinear independent variable affect standard errors?

$$\text{s.e.}(\hat{\beta}_1) = \sqrt{\frac{1}{N-n-1}\cdot\frac{S_Y^2}{S_{X_1}^2}\cdot\frac{1-R_Y^2}{1-R_{X_1}^2}}$$

$R_{X_1}^2$: R2 of the “auxiliary regression” of X1 on all the other independent variables
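
One way to get a feel for the auxiliary R2 is to compute how much it inflates a coefficient's standard error when everything else is held fixed. The sketch below relies on the standard result that the standard error scales with the square root of 1/(1 - R2 of the auxiliary regression); the function name and the example R2 values are illustrative assumptions, not figures from the lecture.

```python
from math import sqrt

def se_inflation(r2_aux):
    """Factor by which a coefficient's standard error grows, relative to the
    case where X1 is uncorrelated with the other independent variables,
    holding the other terms fixed."""
    if not 0 <= r2_aux < 1:
        raise ValueError("auxiliary R-squared must be in [0, 1)")
    return sqrt(1 / (1 - r2_aux))

# Illustrative auxiliary R-squared values (assumptions, not from the slides).
factors = {r2: se_inflation(r2) for r2 in (0.0, 0.5, 0.9, 0.99)}
for r2, factor in factors.items():
    print(f"auxiliary R2 = {r2:.2f}: s.e. multiplied by {factor:.2f}")
```

In a real regression, adding a collinear variable also changes the fit to Y, so the net effect on the standard error can be smaller or larger than this factor alone suggests.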

69

Example: Effect of party, ideology, and religiosity on feelings toward Quincy Bush

                Bush Feelings   Conserv.   Repub.   Religious
Bush Feelings   1.0             .39        .57      .16
Conserv.                        1.0        .46      .18
Repub.                                     1.0      .06
Relig.                                              1.0

70

Regression table

             (1)            (2)           (3)            (4)
Intercept    32.7 (0.85)    32.9 (1.08)   32.6 (1.20)    29.3 (1.31)
Repub.       6.73 (0.244)   5.86 (0.27)   6.64 (0.241)   5.88 (0.27)
Conserv.     ---            2.11 (0.30)   ---            1.87 (0.30)
Relig.       ---            ---           7.92 (1.18)    5.78 (1.19)
N            1575           1575          1575           1575
R2           .32            .35           .35            .36
