
Day 8: Sampling

Daniel J. Mallinson

School of Public Affairs, Penn State [email protected]

PADM-HADM 503

Mallinson Day 8 October 12, 2017 1 / 46

Road map

Why Sample?

Sampling terminology

Probability and Non-Probability Sampling

Sample Size

Where do the formulas come from?

An SPSS Example


Why Sample?

Often not feasible to study the entire population

Too costly, too time consuming, or both

Enables us to make generalizations about a large number of cases by studying a small number of them, with a reasonable degree of validity




Sampling Terminology

Sample

A selected group of units that are representative of a general population

Population

The entire group of units that are of interest to the researcher

Target population

A specifically defined population



Sampling Frame

The complete list of units from which a sample is selected (may not be the same as the population)

Unit of Analysis

Units about which information is collected and analyses are conducted

Sampling Unit

This may be different from the unit of analysis at different stages of sampling (see cluster sampling)



Parameter

A characteristic (measure) of the population

Statistic

A characteristic (measure) of the sample

Sampling Error

The difference between the parameter and the statistic



Standard Error

A measure (approximation) of sampling error

Sample Bias

Non-statistical errors, systematic misrepresentations of population characteristics

Sampling Fraction

Percentage of the population selected for the sample

Sampling Design

Procedure of selecting a sample


Example of Terms

Population: All motor vehicles owned in the state in the current fiscal year.

Sampling Frame: All vehicles appearing on the state list of Registered Motor Vehicles prepared July 1 of the current fiscal year by the DMV

Sampling Design: Probability sampling

Sample: 300 motor vehicles randomly selected from the sampling frame

Unit of analysis: Motor vehicle

Statistic: Average distance passenger cars in the sample were driven annually: 20,000 miles

Parameter: The actual average annual mileage of all passenger cars in the state


Group Task

You have decided to conduct a mail survey for the following study:

You are an administrator at the Dauphin County Department of Human Services. One of the programs under your jurisdiction is a smoking cessation program that targets pregnant women. You would like to evaluate the effectiveness of this program and determine why some women were successful at quitting and others were not. Remember that these factors could be personal and/or programmatic.

As a group, determine the population, a sampling frame, sampling design, sample size, unit of analysis, statistic, and the related parameter.


Two Groups of Sampling Designs

Probability Sampling Designs

Designs whose sizes and sampling errors can be estimated using statistical analyses

1 Simple Random Sampling

2 Systematic Sampling

3 Stratified Random Sampling

4 Cluster and Multistage Sampling

Non-Probability Sampling Designs

Designs whose sizes or sampling errors cannot be estimated using statistical analyses

1 Convenience designs

2 Purposive sampling

3 Quota sampling

4 Snowball sampling

Probability Sampling Designs

Simple Random Sampling

The original sampling method

The basis of basic sampling statistics

Statistical formulas used in our book are based on this; all others are variations on this model



Simple Random Sampling

The principle: Each unit should have the same chance of being selected

Two types:

1 With replacement

2 Without replacement - most commonly called simple random sampling


Excel Method

Create column of names

Type =RAND() in the second column

Drag the bottom corner to copy the formula down the list

Copy the random numbers, then paste them back using the "values only" option so they stop recalculating

Sort by the random numbers
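The spreadsheet steps above can be mirrored in a few lines of Python. This is only a sketch: the function name `excel_style_sample` and the list of names are hypothetical, and Python's `random` module stands in for the spreadsheet's RAND() column.

```python
import random

def excel_style_sample(names, n, seed=None):
    """Simple random sample without replacement, mimicking the RAND()-and-sort steps."""
    rng = random.Random(seed)                          # seed is only for reproducibility
    keyed = [(rng.random(), name) for name in names]   # a random number beside each name
    keyed.sort(key=lambda pair: pair[0])               # sort by the random numbers
    return [name for _, name in keyed[:n]]             # the first n rows form the sample

# Hypothetical sampling frame of 20 names
frame = [f"Person {i}" for i in range(1, 21)]
print(excel_style_sample(frame, 5, seed=8))
```

Because every name receives its own independent random number, each unit has the same chance of landing in the first n rows.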



Systematic Sampling

Statistical formulas are the same as for simple random sampling

Called quasi-random sampling

Units are ordered in a sequence

Skip interval = Number of units in the sampling frame / Number of units in the sample

Skip interval of 5: from the sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ... select units 1, 6, 11 ...
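As a rough illustration of the skip interval, the sketch below picks every k-th unit after a random start. The function name `systematic_sample` and the frame of 50 numbered units are assumptions made for the example, not part of the course materials.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th unit after a random start, where k is the skip interval."""
    units = list(frame)
    k = len(units) // n                        # skip interval = frame size / sample size
    if k < 1:
        raise ValueError("sample size cannot exceed the sampling frame")
    start = random.Random(seed).randrange(k)   # random start within the first interval
    return units[start::k][:n]

# Hypothetical frame of 50 ordered units, sample of 10, so the skip interval is 5
sample = systematic_sample(range(1, 51), 10, seed=12)
print(sample)
```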



Systematic Sampling

The problem of periodicity

Example: If you want to estimate the number of people on 2nd Street ("Restaurant Row") in Harrisburg, do not sample every 7th evening (e.g., Saturdays). You will get a biased sample.

A strategy to break up periodicity:

Select a starting point randomly and select half of the sample in the first round. Then select another starting point in the other half.



Stratified Random Sampling

Divide the population into strata and make random selections from each stratum

Results in better representation than simple random sampling, because each stratum is homogeneous

Requires a smaller sample size than simple random sampling



Stratified Random Sampling

Two Types:

1 Proportionate: Strata in a population will be represented proportionately

2 Disproportionate: Some strata may be over-sampled to ensure representation; results of combined dataset should be weighted
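A proportionate design can be sketched as follows. The function name `proportionate_stratified_sample`, the stratum labels, and the stratum sizes are all hypothetical; rounding each stratum's share means the total can differ slightly from the target when the shares are not whole numbers.

```python
import random

def proportionate_stratified_sample(strata, n, seed=None):
    """strata maps a stratum name to its list of units; returns a sample per stratum."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(n * len(units) / total)     # stratum's share of the overall sample
        sample[name] = rng.sample(units, n_h)   # simple random sample within the stratum
    return sample

# Hypothetical population of 1,000 units in three strata (60%, 30%, 10%)
strata = {
    "Stratum A": list(range(600)),
    "Stratum B": list(range(600, 900)),
    "Stratum C": list(range(900, 1000)),
}
result = proportionate_stratified_sample(strata, 50, seed=3)
print({name: len(units) for name, units in result.items()})
```

For a disproportionate design, the per-stratum counts would be set by the researcher instead, and the combined results would then need weighting.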



Cluster and Multistage Sampling

The weakest, but most commonly used method

Weakest means that this method requires the largest sample size for the same level of accuracy

Random-digit dialing emulates this method

If one stage is used, it is called cluster sampling

Gather data on all units within randomly selected clusters

If multiple stages, it is called multistage sampling




Examples of levels that can be used in multistage sampling:

State

County

Township, borough, city

Neighborhoods (Census tracts)

Blocks

Households

Particular individuals

Note that the sampling unit changes at each stage




Probability proportionate to size (PPS) technique: Larger units are given more chances to be selected
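One simple way to picture PPS is a weighted draw in which each cluster's weight is its size. The sketch below draws clusters with replacement using `random.choices`; the function name `pps_select` and the cluster sizes are made up for illustration, and real PPS designs often use more careful without-replacement schemes.

```python
import random

def pps_select(cluster_sizes, k, seed=None):
    """Draw k clusters with replacement, with probability proportional to size."""
    rng = random.Random(seed)
    names = list(cluster_sizes)
    weights = [cluster_sizes[name] for name in names]   # larger clusters get more chances
    return rng.choices(names, weights=weights, k=k)

# Hypothetical clusters and their sizes (e.g., number of households)
sizes = {"Cluster A": 5000, "Cluster B": 1000, "Cluster C": 500}
draws = pps_select(sizes, 1000, seed=5)
print({name: draws.count(name) for name in sizes})
```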


Ranking Sampling Designs in Terms of Accuracy

The best (most powerful, accurate) method yields the least amount of sampling error for the same sample size

In other words, the best method requires the smallest sample size for the same level of sampling error

The Ranking:

1. Stratified Random Sampling (Best)

2. Simple Random Sampling, Systematic Sampling

3. Cluster and Multistage Sampling (Worst)


Non-Probability Sampling Designs

1 Convenience designs (accidental sampling): Select whatever unit you want first

2 Purposive sampling (theory-based): There is a non-statistical reasoning behind the sampling strategy used

3 Quota sampling: This is like stratified sampling, but units are not selected randomly

4 Snowball sampling: One unit leads to the next one


Sample Size

The rule of thumb: the larger, the better

But the calculation of a sample size is more complex than this rule

Larger samples cost more and larger samples may be more prone to errors in the data collection process

So, we need to select samples that are large enough for the resources (money and time) we have and the level of measurement error we can tolerate


Sample Size

Sample size is determined by:

Population size

Population variability (homogeneity)

Confidence level

Accuracy desired


Sample Size

Population size:

Not a linear relationship with sample size (diminishing returns)

Is ignored for large population sizes, like a national population


Sample Size

Population variability:

Measured as standard deviation; the larger the variability, the larger the sample size should be

Think of this: If every unit is identical, a sample of one would be sufficient to represent all of the units


Sample Size

Confidence level:

Confidence in the validity of the results of an analysis on the sample

It is 1 - alpha level. We will talk more about alpha level later in the course

Bottom line: The more confidence desired, the larger the sample should be


Sample Size

Accuracy desired:

Measured by the standard error

A trade off between confidence level and accuracy


Sample Size

Confidence-Accuracy Trade Off

Confidence Level    Accuracy as Shown by Confidence Level
99%                 ±2.58
95%                 ±1.96
90%                 ±1.65
50%                 ±.68

Table: Table 5.5, pg. 155
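The multipliers in this table are z-scores from the standard normal curve, and they can be recomputed as a check. The sketch below uses Python's `statistics.NormalDist`; the helper name `z_for_confidence` is an assumption. The computed values should agree with the table to roughly two decimals (the table's 1.65 and .68 look like rounded-up values of about 1.645 and 0.674).

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided z-score: the central `level` share of the normal curve lies within ±z."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

for level in (0.99, 0.95, 0.90, 0.50):
    print(f"{level:.0%}: ±{z_for_confidence(level):.3f}")
```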


Sample Size Formulas

General Formula:

\[
\sqrt{n} = \frac{\text{Standard deviation of population} \times \text{Confidence level}}{\text{Accuracy desired (in standard error terms)}} \tag{1}
\]

√n = square root of sample size

Population size is ignored if it is relatively large (like the population of a nation)



What this Formula Means:

1 As variation in population ⇑, sample size ⇑

2 As desired confidence level ⇑, sample size ⇑

3 As desired level of accuracy ⇑, sample size ⇑

4 As tolerable level of error ⇑, sample size ⇓



Proportions (Dichotomous Variables):

\[
n = \frac{Z^2 \, p(1-p)}{d^2} \tag{2}
\]

Z is the z-score for confidence level (e.g., 1.96 for 95%)

d is the desired accuracy (e.g., ±4%), i.e., margin of error

If the population proportion p (and therefore the variability of the population) is unknown, use 50% (0.5), which gives the largest value of p(1 - p)
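The proportions formula can be worked through numerically. This is a minimal sketch: the function name `sample_size_proportion` is hypothetical, and the result is rounded up because a sample cannot include a fraction of a unit.

```python
import math

def sample_size_proportion(z, d, p=0.5):
    """n = Z^2 * p(1 - p) / d^2, rounded up to a whole unit."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

# 95% confidence (Z = 1.96), accuracy of ±4%, p unknown so 0.5 is used
print(sample_size_proportion(z=1.96, d=0.04))   # 600.25 rounds up to 601
```

Tightening the accuracy to ±2% with the same inputs roughly quadruples the required sample, since d enters the formula squared.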



Means (Interval or Ratio Variables):

\[
n = \frac{\sigma^2 Z^2}{d^2} \tag{3}
\]

σ is the population standard deviation (σ² is the population variance); either assumed, estimated from sample data, or taken from previous knowledge

Z is the z-score for confidence level (e.g., 1.96 for 95%)

d is the desired accuracy (e.g., ±4%)
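A numerical sketch of the means formula follows. The function name `sample_size_mean` and the input values (a standard deviation of 15 and an accuracy of ±2, both in the variable's own units) are hypothetical and chosen only to show the arithmetic.

```python
import math

def sample_size_mean(sigma, z, d):
    """n = sigma^2 * Z^2 / d^2, rounded up to a whole unit."""
    return math.ceil(sigma ** 2 * z ** 2 / d ** 2)

# Assumed inputs: sigma = 15, 95% confidence (Z = 1.96), accuracy of ±2 units
print(sample_size_mean(sigma=15, z=1.96, d=2))   # 216.09 rounds up to 217
```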



How to find n without using a formula

See the sample sizes for various degrees of accuracy and confidence levels (for small populations): Table 5.6, p. 158

See the sample sizes for various degrees of accuracy and confidence levels (for large populations): Table 5.7, p. 159


Where Do the Formulas Come From?

How many different samples can be drawn from the same population?

Figure: Musu-Gillette, Lauren 2016


https://nces.ed.gov/blogs/nces/post/statistical-concepts-in-brief-how-and-why-does-nces-use-sample-surveys

Where Do the Formulas Come From?

How many different samples can be drawn from the same population?

\[
\frac{n!}{r!\,(n-r)!} = \binom{n}{r} \tag{4}
\]

n is the size of the population, r is the sample size

Example: For a sample of size 3 from a population of 10, the formula would be:

\[
\frac{10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{3!\,(10-3)!} = 120 \tag{5}
\]
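The count of 120 can be checked two ways: with the formula and by listing every possible sample. The sketch below assumes a population labeled 1 through 10; the variable names are illustrative only.

```python
import math
from itertools import combinations

population = list(range(1, 11))   # assumed population of 10 labeled units
r = 3                             # sample size

count_by_formula = math.comb(len(population), r)                  # n! / (r!(n - r)!)
count_by_listing = sum(1 for _ in combinations(population, r))    # enumerate every sample

print(count_by_formula, count_by_listing)   # 120 120
```

Even for modest populations the count grows very quickly, which is why sampling distributions are described with formulas rather than enumerated.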



A sampling distribution is the distribution of the sample statistic we are interested in (e.g., the mean, or the percentage of voters for a candidate) across all possible samples. See Figure 5.6, p. 154.



The sampling distributions for particular populations can be plotted and their measures can be calculated

The normal distribution is the most common shape for a sampling distribution

The normal distribution is not the only possible shape, but it is the most basic



Figure: https://www.mathsisfun.com/data/standard-normal-distribution.html




The proportions of the area under the normal curve are fixed

If you move one or two standard deviations (Z units, Z scores) away from the mean, the area under the curve will always be the same percentage when a distribution is normal



Normal Distribution (cont.)

This is a key characteristic of the normal distribution that helps us make sampling estimations

Also the basis of statistical significance tests (e.g., the t-test), which we will discuss later

These areas under the curve can be used to calculate standardized scores (recall the measurement section):

\[
Z\text{-score} = \frac{\text{Score} - \text{Mean}}{\text{Standard Deviation}} \tag{6}
\]
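A short sketch ties the Z-score formula to the fixed areas under the normal curve. The helper names `z_score` and `area_within` and the example score are hypothetical; `statistics.NormalDist` supplies the standard normal curve.

```python
from statistics import NormalDist

def z_score(score, mean, sd):
    """Standardized score: distance from the mean in standard deviation units."""
    return (score - mean) / sd

def area_within(z):
    """Share of a normal distribution lying within ±z standard deviations of the mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

print(z_score(85, mean=70, sd=10))    # 1.5
print(round(area_within(1), 4))       # 0.6827
print(round(area_within(2), 4))       # 0.9545
```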



Central Limit Theorem

If the population is normally distributed, its sampling distribution will also be normal

If the population is not normally distributed but the sample is large enough, the sampling distribution of the mean will still be approximately normal

One test we will discuss (t-test) uses a modified version of the normal curve for its sampling distribution

Other tests have their own specific sampling distributions

Calculations of sampling error and confidence intervals are based on the idea of a normal sampling distribution
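The central limit theorem can be illustrated with a small simulation. The sketch below is an assumption-laden example: it draws repeated samples from a skewed (exponential) distribution with mean 1 and standard deviation 1, then looks at the resulting sample means. The function name `simulate_sample_means` and all of the settings are hypothetical.

```python
import random
import statistics

def simulate_sample_means(n, draws, seed=0):
    """Means of `draws` samples of size n from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(draws)
    ]

means = simulate_sample_means(n=50, draws=2000, seed=8)

# The sample means should centre near the population mean (1.0), and their
# spread should be close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141.
print(round(statistics.fmean(means), 3))
print(round(statistics.stdev(means), 3))
```

A histogram of `means` would look roughly bell-shaped even though the underlying population is strongly skewed.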



Standard Error:

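The standard deviation of the sampling distribution (the final term in each formula is the finite population correction factor (fpc), used for small populations)

\[
\text{(Proportions)} \quad SE_p = \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}} \tag{7}
\]

\[
\text{(Means)} \quad SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N-1}} \tag{8}
\]

Here N is the population size and n is the sample size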
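The standard error formulas can be sketched in code. This example assumes the usual form of the finite population correction, sqrt((N - n) / (N - 1)), and skips it when no population size is given; the function names and the numbers are hypothetical.

```python
import math

def fpc(n, N):
    """Finite population correction (assumed form): sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion; N is the population size, if small."""
    se = math.sqrt(p * (1 - p) / n)
    return se * fpc(n, N) if N is not None else se

def se_mean(sigma, n, N=None):
    """Standard error of a sample mean; N is the population size, if small."""
    se = sigma / math.sqrt(n)
    return se * fpc(n, N) if N is not None else se

print(se_proportion(0.5, 100))            # large population, no correction
print(se_proportion(0.5, 100, N=1000))    # small population, correction shrinks the SE
print(se_mean(10, 100))
```

The correction matters only when the sample is a noticeable share of the population; as N grows relative to n, the factor approaches 1.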



Confidence Interval:

Confidence level = 1− alpha level

If alpha is 0.05, then the confidence level will be .95 (95%)

A confidence interval is calculated by using the confidence level

If CL is .95 (95%) and we assume a normal distribution:

Lower limit (bound) = Sample Mean - 1.96 × $SE_{\bar{x}}$

Upper limit (bound) = Sample Mean + 1.96 × $SE_{\bar{x}}$

A 95% CI means that the confidence intervals produced by 95% of all possible samples would contain the population parameter
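The interval calculation can be sketched as below. The sample mean of 20,000 miles echoes the motor vehicle example earlier in the deck, but the standard error of 500 is a made-up value, and the function name `confidence_interval` is hypothetical.

```python
def confidence_interval(sample_mean, standard_error, z=1.96):
    """Lower and upper bounds: sample mean minus/plus z times the standard error."""
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Assumed values: sample mean of 20,000 miles and a standard error of 500 miles
lower, upper = confidence_interval(20_000, 500)
print(lower, upper)   # about 19,020 and 20,980
```

Passing a different `z` (for example 2.58 for 99%) widens the interval, which is the confidence-accuracy trade-off in numerical form.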


An SPSS Example


Questions?

Figure: Q&A by Libby Levi, CC BY-SA 2.0


https://www.flickr.com/photos/opensourceway/5556249000

http://blog.libbylevi.com