Day 8: Sampling
Daniel J. Mallinson
School of Public Affairs, Penn State [email protected]
PADM-HADM 503
Mallinson Day 8 October 12, 2017 1 / 46
Road map
Why Sample?
Sampling terminology
Probability and Non-Probability Sampling
Sample Size
Where do the formulas come from?
An SPSS Example
Why Sample?
Often not feasible to study the entire population
Too costly, too time consuming, or both
Enables us to make generalizations about a large number of cases by studying a small number of them, with a reasonable degree of validity
Sampling Terminology
Sample
A selected group of units that is representative of a general population
Population
The entire group of units that are of interest to the researcher
Target population
A specifically defined population
Sampling Terminology
Sampling Frame
The complete list of units from which a sample is selected (may not be the same as the population)
Unit of Analysis
Units about which information is collected and analyses are conducted
Sampling Unit
The unit selected at a given stage of sampling; it may differ from the unit of analysis at different stages of sampling (see cluster sampling)
Sampling Terminology
Parameter
A characteristic (measure) of the population
Statistic
A characteristic (measure) of the sample
Sampling Error
The difference between the parameter and the statistic
Sampling Terminology
Standard Error
A measure (approximation) of sampling error
Sample Bias
Non-statistical errors; systematic misrepresentations of population characteristics
Sampling Fraction
Percentage of the population selected for the sample
Sampling Design
Procedure of selecting a sample
Example of Terms
Population: All motor vehicles owned in the state in the current fiscal year.
Sampling Frame: All vehicles appearing on the state list of Registered Motor Vehicles prepared July 1 of the current fiscal year by the DMV
Sampling Design: Probability sampling
Sample: 300 motor vehicles randomly selected from the sampling frame
Unit of analysis: Motor vehicle
Statistic: Average distance passenger cars in the sample were driven annually: 20,000 miles
Parameter: The actual average annual mileage of all passenger cars in the state
Group Task
You have decided to conduct a mail survey for the following study:
You are an administrator at the Dauphin County Department of Human Services. One of the programs under your jurisdiction is a smoking cessation program that targets pregnant women. You would like to evaluate the effectiveness of this program and determine why some women were successful at quitting and others were not. Remember that these factors could be personal and/or programmatic.
As a group, determine the population, a sampling frame, sampling design, sample size, unit of analysis, statistic, and the related parameter.
Two Groups of Sampling Designs
Probability Sampling Designs
Designs whose sample sizes and sampling errors can be estimated using statistical analyses
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Random Sampling
4. Cluster and Multistage Sampling
Non-Probability Sampling Designs
Designs whose sample sizes or sampling errors cannot be estimated using statistical analyses
1. Convenience designs
2. Purposive sampling
3. Quota sampling
4. Snowball sampling
Probability Sampling Designs
Simple Random Sampling
The original sampling method
The basis of basic sampling statistics
Statistical formulas used in our book are based on this; all others are variations on this model
Probability Sampling Designs
Simple Random Sampling
The principle: Each unit should have the same chance of being selected
Two types:
1. With replacement
2. Without replacement (most commonly called simple random sampling)
Excel Method
Create a column of names
Type =RAND() in the second column
Drag the bottom corner to copy the formula down the list
Copy, paste, and select the “values only” option
Sort by the random numbers and take the first n names as the sample
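The same random-number-and-sort idea can be sketched in Python. This is a minimal illustration rather than part of the course materials; the sampling frame of ten names, the seed, and the helper name `random_sort_sample` are all hypothetical.

```python
import random

def random_sort_sample(units, n, seed=None):
    """Mimic the spreadsheet method: attach a random number to each unit,
    sort by that number, and keep the first n units.
    This is sampling WITHOUT replacement."""
    rng = random.Random(seed)
    keyed = [(rng.random(), unit) for unit in units]  # the RAND() column
    keyed.sort(key=lambda pair: pair[0])              # sort by the random numbers
    return [unit for _, unit in keyed[:n]]

# Hypothetical sampling frame of ten names
frame = [f"Person {i}" for i in range(1, 11)]

without_replacement = random_sort_sample(frame, 3, seed=503)

# Sampling WITH replacement: the same unit may be drawn more than once
rng = random.Random(503)
with_replacement = [rng.choice(frame) for _ in range(3)]

print(without_replacement)
print(with_replacement)
```

Sorting on a random key gives every unit the same chance of landing in the first n positions, which is the principle behind simple random sampling.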
Probability Sampling Designs
Systematic Sampling
Statistical formulas are the same as for simple random sampling
Called quasi-random sampling
Units are ordered in a sequence
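Skip interval = Number of units in the sampling frame / Number of units in the sample
Example with a skip interval of 5: from units 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 ..., select every 5th unit after the starting point (e.g., 1, 6, 11, ...)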
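A short Python sketch of the skip-interval procedure follows. It assumes a random starting point within the first interval, which is a common convention but is not stated on the slide; the function name `systematic_sample` and the 50-unit frame are hypothetical.

```python
import random

def systematic_sample(frame, sample_size, seed=None):
    """Select every k-th unit after a random start, where
    k = len(frame) // sample_size is the skip interval."""
    if sample_size <= 0 or sample_size > len(frame):
        raise ValueError("sample_size must be between 1 and len(frame)")
    k = len(frame) // sample_size
    start = random.Random(seed).randrange(k)  # random start within the first interval
    return frame[start::k][:sample_size]

# Hypothetical ordered frame of 50 units; a sample of 10 gives a skip interval of 5
frame = list(range(1, 51))
sample = systematic_sample(frame, 10, seed=1)
print(sample)
```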
Probability Sampling Designs
Systematic Sampling
The problem of periodicity
Example: If you want to estimate the number of people on 2nd Street (“Restaurant Row”) in Harrisburg, do not sample every 7th evening (e.g., Saturdays). You will get a biased sample.
A strategy to break up periodicity:
Select a starting point randomly and select half of the sample in the first round. Then select another random starting point for the other half.
Probability Sampling Designs
Stratified Random Sampling
Divide the population into strata and make random selections from each stratum
Results in better representation than simple random sampling, because each stratum is relatively homogeneous
Requires a smaller sample size than simple random sampling for the same level of accuracy
Probability Sampling Designs
Stratified Random Sampling
Two Types:
1. Proportionate: Strata in a population will be represented proportionately
2. Disproportionate: Some strata may be over-sampled to ensure representation; results of the combined dataset should be weighted
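The two allocation types can be sketched in Python as follows. The strata ("urban" and "rural"), their sizes, and all function names are hypothetical, and the weight shown (stratum size divided by stratum sample size) is one common way to weight a disproportionate sample, not necessarily the one used in the textbook.

```python
import random

def proportionate_allocation(strata, total_n):
    """Allocate the sample in proportion to stratum size.
    Rounding may make the total differ slightly from total_n."""
    population = sum(len(units) for units in strata.values())
    return {name: round(total_n * len(units) / population)
            for name, units in strata.items()}

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of allocation[name] units from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, allocation[name])
            for name, units in strata.items()}

def design_weights(strata, allocation):
    """Weight for each sampled unit = stratum size / stratum sample size."""
    return {name: len(units) / allocation[name]
            for name, units in strata.items()}

# Hypothetical population: 800 urban units and 200 rural units
strata = {
    "urban": [f"u{i}" for i in range(800)],
    "rural": [f"r{i}" for i in range(200)],
}

proportionate = proportionate_allocation(strata, 100)   # 80 urban, 20 rural
disproportionate = {"urban": 50, "rural": 50}           # rural is over-sampled

sample = stratified_sample(strata, disproportionate, seed=8)
weights = design_weights(strata, disproportionate)
print(proportionate, weights)
```

In the disproportionate design each sampled urban unit stands in for 16 population units and each rural unit for 4, which is why the combined dataset has to be weighted before estimating population figures.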
Probability Sampling Designs
Cluster and Multistage Sampling
The weakest, but most commonly used method
“Weakest” means that this method requires the largest sample size for the same level of accuracy
Random-digit dialing emulates this method
If one stage is used, it is called cluster sampling
Gather data on all units within randomly selected clusters
If multiple stages, it is called multistage sampling
Probability Sampling Designs
Cluster and Multistage Sampling
Examples of levels that can be used in multistage sampling:
State
County
Township, borough, city
Neighborhoods (Census tracts)
Blocks
Households
Particular individuals
Note that the sampling unit changes at each stage
Probability Sampling Designs
Cluster and Multistage Sampling
Probability proportionate to size (PPS) technique: Larger units are given more chances to be selected
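A minimal sketch of first-stage PPS selection is shown below. It assumes clusters are drawn with replacement, which keeps the example short (practical designs often use other PPS schemes); the county names, household counts, and the function name `pps_select` are hypothetical.

```python
import random

def pps_select(cluster_sizes, n_clusters, seed=None):
    """Select clusters with probability proportionate to size (with replacement)."""
    rng = random.Random(seed)
    names = list(cluster_sizes)
    sizes = [cluster_sizes[name] for name in names]
    return rng.choices(names, weights=sizes, k=n_clusters)

# Hypothetical first-stage clusters and their household counts
county_households = {"County A": 50_000, "County B": 10_000, "County C": 2_000}

total = sum(county_households.values())
# Chance that a single draw lands on each county
selection_probability = {name: size / total
                         for name, size in county_households.items()}

picked = pps_select(county_households, 2, seed=503)
print(selection_probability)
print(picked)
```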
Ranking Sampling Designs in Terms of Accuracy
The best (most powerful, accurate) method yields the least amount of sampling error for the same sample size
In other words, the best method requires the smallest sample size for the same level of sampling error
The Ranking:
1. Stratified Random Sampling (Best)
2. Simple Random Sampling, Systematic Sampling
3. Cluster and Multistage Sampling (Worst)
Non-Probability Sampling Designs
1. Convenience designs (accidental sampling): Select whatever unit you want first
2. Purposive sampling (theory-based): There is a non-statistical reasoning behind the sampling strategy used
3. Quota sampling: This is like stratified sampling, but units are not selected randomly
4. Snowball sampling: One unit leads to the next one
Sample Size
The rule of thumb: the larger, the better
But the calculation of a sample size is more complex than this rule
Larger samples cost more, and larger samples may be more prone to errors in the data collection process
So, we need to select samples that fit the resources (money and time) we have and are large enough for the level of error we can tolerate
Sample Size
Sample size is determined by:
Population size
Population variability (homogeneity)
Confidence level
Accuracy desired
Sample Size
Population size:
Not a linear relationship with sample size (diminishing returns)
Is ignored for large population sizes, like a national population
Sample Size
Population variability:
Measured as standard deviation; the larger the variability, the larger the sample size should be
Think of this: If every unit is identical, a sample of one would be sufficient to represent all of the units
Sample Size
Confidence level:
Confidence in the validity of the results of an analysis on the sample
It equals 1 minus the alpha level. We will talk more about the alpha level later in the course
Bottom line: The more confidence desired, the larger the sample should be
Sample Size
Accuracy desired:
Measured by the standard error
There is a trade-off between confidence level and accuracy
Sample Size
Confidence-Accuracy Trade Off
Confidence Level    Accuracy as Shown by Confidence Level
99%                 ±2.58
95%                 ±1.96
90%                 ±1.65
50%                 ±.68
Table: Table 5.5, pg. 155
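The multipliers in the table are z-scores of the standard normal distribution. As a rough check (not taken from the textbook), they can be recomputed with Python's standard library; the helper name `z_for_confidence` is an assumption for illustration, and the computed values may differ from the table in the last digit because of rounding.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided z multiplier: the central `confidence` share of a
    standard normal curve lies within plus or minus z standard errors."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

for level in (0.99, 0.95, 0.90, 0.50):
    print(f"{level:.0%}: ±{z_for_confidence(level):.3f}")
```

Reading down the list shows the trade-off: for a fixed sample, a higher confidence level comes with a wider (less precise) interval.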
Sample Size Formulas
General Formula:
$$\sqrt{n} = \frac{\text{Standard deviation of population} \times \text{Confidence level}}{\text{Accuracy desired (in standard error terms)}} \tag{1}$$
√n = square root of sample size
Population size is ignored if it is relatively large (like the population of a nation)
Sample Size Formulas
What this Formula Means:
1. As variation in the population ⇑, sample size ⇑
2. As desired confidence level ⇑, sample size ⇑
3. As desired level of accuracy ⇑, sample size ⇑
4. As tolerable level of error ⇑, sample size ⇓
Sample Size Formulas
Proportions (Dichotomous Variables):
$$n = \frac{Z^2 \, p(1-p)}{d^2} \tag{2}$$
Z is the z-score for the confidence level (e.g., 1.96 for 95%)
d is the desired accuracy (e.g., ±4%), i.e., the margin of error
p is the population proportion. If it is unknown, use 50% (0.5), which yields the largest required sample size
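As a worked example of the proportions formula, the sketch below computes n for a 95% confidence level and a ±4% margin of error. The function name `sample_size_proportion` is hypothetical, and rounding up to the next whole unit is an assumption made here so that the requested accuracy is met.

```python
import math

def sample_size_proportion(z, d, p=0.5):
    """n = Z^2 * p * (1 - p) / d^2, rounded up to a whole unit.
    p defaults to 0.5, the most conservative (largest-n) choice."""
    return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

n_unknown_p = sample_size_proportion(z=1.96, d=0.04)         # p assumed to be 0.5
n_known_p = sample_size_proportion(z=1.96, d=0.04, p=0.2)    # p believed to be 0.2
print(n_unknown_p, n_known_p)
```

With p = 0.5 the formula gives 600.25, so 601 units; with p = 0.2 it gives 384.16, so 385 units. The less evenly split the population, the smaller the sample needed.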
Sample Size Formulas
Means (Interval or Ratio Variables):
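$$n = \frac{\sigma^2 Z^2}{d^2} \tag{3}$$
σ is the population standard deviation (σ² is the variance); either assumed, estimated from sample data, or taken from previous knowledge
Z is the z-score for the confidence level (e.g., 1.96 for 95%)
d is the desired accuracy (e.g., ±4%), i.e., the margin of error in the units of the variable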
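A worked example of the means formula follows, treating σ as the population standard deviation so that σ² is the variance. The values (σ = 15, d = 2) and the function name `sample_size_mean` are hypothetical, and rounding up is an assumption made here.

```python
import math

def sample_size_mean(sigma, z, d):
    """n = sigma^2 * Z^2 / d^2, rounded up to a whole unit.
    sigma and d must be in the same units as the variable."""
    return math.ceil(sigma ** 2 * z ** 2 / d ** 2)

n_base = sample_size_mean(sigma=15, z=1.96, d=2)
n_more_variable = sample_size_mean(sigma=30, z=1.96, d=2)   # doubled variability
n_less_accurate = sample_size_mean(sigma=15, z=1.96, d=4)   # doubled tolerable error
print(n_base, n_more_variable, n_less_accurate)
```

Because σ and d enter the formula squared, doubling the variability roughly quadruples the required sample, while doubling the tolerable error cuts it to about a quarter.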
Sample Size Formulas
How to find n without using a formula
See the sample sizes for various degrees of accuracy and confidence levels (for small populations): Table 5.6, p. 158
See the sample sizes for various degrees of accuracy and confidence levels (for large populations): Table 5.7, p. 159
Where Do the Formulas Come From?
How many different samples can be drawn from the same population?
Figure: Musu-Gillette, Lauren 2016
Where Do the Formulas Come From?
How many different samples can be drawn from the same population?
$$\frac{n!}{r!\,(n-r)!} = \binom{n}{r} \tag{4}$$
n is the size of the population, r is the sample size
Example: For a sample of size 3 from a population of 10, the formula would be:
$$\frac{10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{3!\,(10-3)!} = 120 \tag{5}$$
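The count of possible samples can be checked in a few lines of Python. The function name `number_of_samples` is hypothetical; `math.comb` is the standard-library equivalent of the binomial coefficient.

```python
import math

def number_of_samples(population_size, sample_size):
    """n! / (r! * (n - r)!) with n = population size and r = sample size."""
    n, r = population_size, sample_size
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(number_of_samples(10, 3))   # 120
print(math.comb(10, 3))           # same count via the built-in
print(math.comb(100, 10))         # grows very quickly with population size
```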
Where Do the Formulas Come From?
A sampling distribution is the distribution of the sample statistic we are interested in (e.g., the mean, or the percentage of voters for a candidate) across all possible samples. See Figure 5.6, p. 154.
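For a very small population the sampling distribution can be listed in full. The sketch below uses a hypothetical population of five values and enumerates every possible sample of size 2; it is an illustration only and does not reproduce the textbook figure.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical population of five values
population = [2, 4, 6, 8, 10]

# Every possible sample of size 2 (without replacement) and its mean
all_samples = list(combinations(population, 2))
sample_means = [mean(sample) for sample in all_samples]

# The sampling distribution: how often each sample mean occurs
sampling_distribution = Counter(sample_means)

print(len(all_samples))
print(sorted(sampling_distribution.items()))
print(mean(sample_means), mean(population))
```

The sample means cluster around the population mean, and their average equals it exactly, which is what makes the sample mean a useful estimate of the parameter.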
Where Do the Formulas Come From?
The sampling distributions for particular populations can be plotted and their measures can be calculated
The normal distribution is the most common shape for a sampling distribution
The normal distribution is not the only shape, but it is the most basic
Where Do the Formulas Come From?
Figure: https://www.mathsisfun.com/data/standard-normal-distribution.html
Where Do the Formulas Come From?
The proportions of the area under the normal curve are fixed
If you move one or two standard deviations (Z units, Z scores) away from the mean, the area under the curve will always be the same percentage when a distribution is normal
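The fixed areas can be verified numerically. This sketch uses `statistics.NormalDist` from the Python standard library; the function name `area_within` and the second distribution (mean 100, standard deviation 15) are hypothetical choices for illustration.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution lying within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 1.96, 2, 3):
    # Same proportion regardless of the mean and standard deviation
    print(k, round(area_within(k), 4), round(area_within(k, mu=100, sigma=15), 4))
```

About 68% of the area lies within one standard deviation, about 95% within 1.96, and about 99.7% within three, whatever the mean and standard deviation happen to be.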
Where Do the Formulas Come From?
Normal Distribution (cont.)
This is a key characteristic of the normal distribution that helps us make sampling estimations
Also the basis of statistical significance tests (e.g., the t-test), which we will discuss later
These areas under the curve are read off using standardized scores (recall the measurement section):
$$Z\text{-score} = \frac{\text{Score} - \text{Mean}}{\text{Standard Deviation}} \tag{6}$$
Where Do the Formulas Come From?
Central Limit Theorem
If the population is normally distributed, the sampling distribution of the mean will also be normal
If the sample is large, the sampling distribution of the mean will be approximately normal even when the population is not normally distributed
One test we will discuss (the t-test) uses a modified version of the normal curve for its sampling distribution
Other tests have their own specific sampling distributions
Calculations of sampling error and confidence intervals are based on the idea of a normal sampling distribution
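A small simulation can illustrate the theorem. The sketch below draws repeated samples from a strongly skewed (exponential) population and looks at how the sample means behave; the population, the sample size of 50, the number of samples, and the seed are all arbitrary assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, pstdev

rng = random.Random(2017)
SAMPLE_SIZE = 50
N_SAMPLES = 2000

def draw_sample_mean():
    """Mean of one sample from a skewed population
    (exponential, with mean 1 and standard deviation 1)."""
    values = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return sum(values) / len(values)

sample_means = [draw_sample_mean() for _ in range(N_SAMPLES)]

center = mean(sample_means)                    # should be close to 1
spread = pstdev(sample_means)                  # should be close to 1 / sqrt(50)
expected_spread = 1 / math.sqrt(SAMPLE_SIZE)

# If the sample means are roughly normal, about 95% fall within 1.96 standard errors
share_within = mean(abs(m - 1.0) <= 1.96 * expected_spread for m in sample_means)

print(round(center, 3), round(spread, 3), round(expected_spread, 3), round(share_within, 3))
```

Even though individual values are far from normal, the sample means pile up symmetrically around the population mean with a spread close to σ/√n.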
Where Do the Formulas Come From?
Standard Error:
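The standard deviation of the sampling distribution (the finite population correction factor (fpc), the final square-root term, is applied for small populations)
$$\text{(Proportions)} \quad SE_p = \sqrt{\frac{p(1-p)}{n}} \times \sqrt{\frac{N-n}{N-1}} \tag{7}$$
$$\text{(Means)} \quad SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \times \sqrt{\frac{N-n}{N-1}} \tag{8}$$
N is the population size and n is the sample size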
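The standard error formulas can be sketched as below, assuming the standard form in which the finite population correction sits under a square root. The function names and the example numbers (a sample of 400 from a population of 2,000) are hypothetical.

```python
import math

def fpc(N, n):
    """Finite population correction: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion; pass N to apply the correction."""
    se = math.sqrt(p * (1 - p) / n)
    return se if N is None else se * fpc(N, n)

def se_mean(sigma, n, N=None):
    """Standard error of a sample mean; pass N to apply the correction."""
    se = sigma / math.sqrt(n)
    return se if N is None else se * fpc(N, n)

se_large_pop = se_proportion(p=0.5, n=400)           # population treated as very large
se_small_pop = se_proportion(p=0.5, n=400, N=2000)   # small population, corrected
print(round(se_large_pop, 4), round(se_small_pop, 4))
```

The correction shrinks the standard error when the sample is a sizeable share of the population, and it drops to zero when the whole population is sampled (n = N).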
Where Do the Formulas Come From?
Confidence Interval:
Confidence level = 1 − alpha level
If alpha is 0.05, then the confidence level will be .95 (95%)
A confidence interval is calculated by using the confidence level
If CL is .95 (95%) and we assume a normal distribution:
Lower limit (bound) = Sample Mean − 1.96 × $SE_{\bar{x}}$
Upper limit (bound) = Sample Mean + 1.96 × $SE_{\bar{x}}$
A 95% CI means that the confidence intervals produced by 95% of samples would contain the population parameter
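The sketch below computes a 95% confidence interval and then simulates the "95% of samples" interpretation. The sample mean of 20,000 miles and the sample size of 300 echo the motor vehicle example earlier in the deck; the standard deviation of 3,000 miles, the simulation settings, and the function names are hypothetical assumptions.

```python
import math
import random

Z_95 = 1.96

def confidence_interval(sample_mean, sigma, n, z=Z_95):
    """Sample mean plus or minus z standard errors (no finite population correction)."""
    se = sigma / math.sqrt(n)
    return sample_mean - z * se, sample_mean + z * se

# Hypothetical standard deviation of 3,000 miles
lower, upper = confidence_interval(sample_mean=20_000, sigma=3_000, n=300)
print(round(lower, 1), round(upper, 1))

def coverage(true_mean, sigma, n, trials, seed=None):
    """Share of simulated samples whose 95% CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = confidence_interval(sum(sample) / n, sigma, n)
        hits += low <= true_mean <= high
    return hits / trials

share = coverage(true_mean=20_000, sigma=3_000, n=30, trials=4_000, seed=8)
print(share)
```

Any single interval either contains the parameter or does not; the 95% describes how often the procedure succeeds across many samples.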
An SPSS Example
Questions?
Figure: Q&A by Libby Levi, CC BY-SA 2.0