Date post: | 26-Dec-2015 |
Upload: | brendan-dawson |
CHAPTER 7: SURVEY
SAMPLING & INFERENCE
CLASS SURVEYS...
STATISTIC VS. PARAMETER… Know the symbols and the meanings
STATISTICAL INFERENCE … Drawing conclusions about a population
on the basis of observing only a small subset of that population
Always involves some uncertainty
Does a given sample represent a particular population accurately?
Is the sample systematically ‘off’? By a lot? By a little?
WHAT COULD POSSIBLY GO WRONG? BIAS… Recall, bias is…
Being systematically ‘off’; a scale that reads 5 pounds heavy; a clock that runs 10 minutes fast, etc.
Textbook explains two different types of bias: measurement bias & sampling bias
Don’t need to know if a particular bias is measurement or sampling; just need to know the concept of bias
Let’s discuss…
BAD SAMPLING TECHNIQUES: VOLUNTARY RESPONSE SAMPLING…
• People who choose themselves by responding to a general appeal
• Biased because people with strong opinions, especially negative opinions, are most likely to respond
• Often very misleading
• With a partner, come up with one real-life example of voluntary response sampling with which you are familiar; 30 seconds & share out
BAD SAMPLING TECHNIQUES: VOLUNTARY RESPONSE SAMPLING…
TV, radio, and online sites pose a question & listeners/viewers call or text in a response
DWTS, America’s Next Top Model, etc.
BAD SAMPLING TECHNIQUES: CONVENIENCE SAMPLING….
Convenience Sampling: Choosing individuals who are easiest to reach.
With a partner, think of a real-life example of convenience sampling. 30 seconds, then share out.
BAD SAMPLING TECHNIQUES: CONVENIENCE SAMPLING…
Example: Mall interviews
• Not representative; mall-goers have $$
• Interviewers tend to choose ‘safe’ people to interview
• May not reflect views of all consumers
• Worthless data
BIAS... SITUATIONAL EXAMPLES…
• Conducting a survey asking if people believe in God at various Christian churches
• Taking a poll at a variety of liquor stores that asks if those customers drink alcohol
• Surveying a random sample of gun and ammunition store customers about whether they support the right to bear arms
BIAS...
Convenience sampling & voluntary response sampling are both biased sampling methods
How do we minimize/eliminate bias?
Let impersonal chance/randomness do the choosing; more on this...
RANDOM SAMPLING... (remember, in voluntary response people
chose to respond; in convenience sampling, interviewer made choice; in both situations, personal choice creates bias)
Simple Random Sample (SRS) – type of probability or random sample
SRS – chance selects the sample
The use of chance selecting the sample is the essential principle of statistical sampling
RANDOM SAMPLING... Several different types of random sampling,
all involve chance selecting the sample
Choosing samples by chance gives all individuals an equal chance to be chosen
We will focus on Simple Random Sampling (SRS)
SRS ensures that every set of n individuals has an equal chance to be in the sample/actually selected
SIMPLE RANDOM SAMPLE (SRS)
Easiest ways to use chance/SRS:
• Names in a hat
• Random digits generator in calculator or Minitab
• Random digits table
SIMPLE RANDOM SAMPLE
Random digits table: a long string of the digits 0, 1, ..., 9 in which:
• Each entry in the table is equally likely to be any of the 10 digits 0 – 9
• Entries are independent. Knowing the digits in one part of the table gives no information about any other part of the table
• Table in rows & columns; read either way (but usually rows); groups & rows – no meaning; just easier to read
RANDOM DIGITS TABLE...
0 – 9 equally likely 00 – 99 equally likely 000 – 999 equally likely
RANDOM DIGITS TABLE
Joan’s small accounting firm serves 30 business clients. Joan wants to interview a sample of 5 clients to find ways to improve client satisfaction. To avoid bias, she chooses a SRS of size 5.
RANDOM DIGITS TABLE... Enter table at a random row
Notice her clients are numbered (labeled) with 2-digit numbers (if this isn’t already done, you must label your list), so we read the table 2 digits at a time
Ignore all 2-digit numbers outside 01 to 30, including 00 (our data is numbered from 01 to 30)
Ignore duplicates
Continue until we have 5 distinct 2-digit numbers chosen & identify who those clients are
CHOOSING A SRS: 4 STEPS
1. Label... Assign a numerical label to every individual
2. Random Digits Table (or Minitab or names in hat)... Select labels at random
3. Stopping Rule ... Indicate when you should stop sampling
4. Identify Sample ... Use labels to identify subjects/individuals selected to be in the sample
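The four steps can be sketched in a few lines of Python. This is only an illustration: the function name srs_from_digits and the line of digits are made up for the example (the digits are not copied from any real random digits table), and the population is Joan's 30 clients labeled 01 to 30.

```python
def srs_from_digits(digits, population_size, sample_size):
    """Pick labels 01..population_size by reading 2-digit groups left to right."""
    chosen = []
    # Step 2: read the line of digits two digits at a time.
    for i in range(0, len(digits) - 1, 2):
        label = int(digits[i:i + 2])
        # Ignore labels outside 01..population_size and ignore duplicates.
        if 1 <= label <= population_size and label not in chosen:
            chosen.append(label)
        # Step 3: stopping rule, stop once we have enough distinct labels.
        if len(chosen) == sample_size:
            break
    return chosen

# Hypothetical line of random digits, invented for this sketch.
line = "6905216487130469348195320045077123"
sample = srs_from_digits(line, population_size=30, sample_size=5)
print(sample)  # [5, 21, 13, 4, 7]
```

Step 4 is then a lookup: clients 05, 21, 13, 04 and 07 on the labeled list are the sample. Groups such as 69, 64, 87 and 00 are skipped because no client has those labels.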
REMINDERS.... CAUTIONS.... IF USING RDT
Be certain all labels have the same # of digits if using RDT (ensures individuals have the same chance to be chosen)
Use shortest possible label, i.e., 1 digit for populations up to 10 members (can use labels from 0 to 9), 2 digits for populations from 11 – 100 members (can use labels from 00 to 99), etc. --this is just a good standard of practice...
SRS OF OUR CLASS... FOR CANDY
Label students; what labels should we use?
Label candy; what labels should we use?
SRS of 3 students using Random Digits Table; enter table on line ___
SRS of 3 students using my Minitab (will my Minitab be different from your Minitab?)
Should we allow duplicates? When should we and when should we not?
CAUTIONS ABOUT SAMPLE SURVEYS...
Most samples suffer from some degree of undercoverage (another type of bias)
What is bias again?? ....
... Bias is systematically favoring a particular outcome
Undercoverage occurs when one or more groups are left out of the process of choosing the sample, in part or entirely
UNDERCOVERAGE...
Talk in your groups and come up with an example of undercoverage (some groups in the population are left out of the process of choosing the sample)
Examples... Household surveys will miss students in dorms, prison inmates, the homeless
Telephone surveys will miss those without phones; how about those with unlisted phone numbers?
CAUTIONS ABOUT SAMPLE SURVEYS...
Another source of bias in many/most sample surveys is non-response, when a selected individual cannot be contacted or refuses to cooperate
Big problem; happens very often, even with aggressive follow-up
Almost impossible to eliminate non-response; we can just try to minimize as much as possible
Note: Most media polls won’t/don’t tell us the rate of non-response
CAUTION ABOUT SAMPLE SURVEYS...
Response bias...occurs when respondents are untruthful, especially if asked about illegal or unpopular beliefs or behaviors
Example: Salaries, amount/frequency of alcohol consumed, jail time, use of illegal drugs, weight, age, votes or not, etc.
VIDEO CLIP OF JIMMY KIMMEL LIVE
Who won the “First Lady Debate?”
http://www.youtube.com/watch?v=EohGmG-QUhA
http://perezhilton.com/tv/JIMMY_KIMMEL_LIVE_Who_Won_The_Presidential_Debate_BEFORE_It_Even_Happened/?id=79b7cb451ec00#.Vb-lN_NViko
http://www.mrctv.org/videos/kimmel-public-weighs-first-lady-debate
WORDING IN SAMPLE SURVEYS... MORE POSSIBLE BIAS
Should we ban disposable diapers?
A survey paid for by makers of disposable diapers found that 84% of the sample opposed banning disposable diapers. Here’s the actual question:
It is estimated that disposable diapers account for less than 2% of the trash in today’s landfills. In contrast, beverage containers, third-class mail and yard wastes are estimated to account for about 21% of the trash in landfills. Given this, in your opinion, would it be fair to ban disposable diapers?
Remember our survey ...
CAUTION ABOUT SAMPLE SURVEYS...
Even if we use SRS/probability sampling, and we are very careful in reducing bias as much as possible, the statistics we get from a given sample will likely be different from the statistics we get from another sample.
Statistics vary from sample to sample
We can improve our results (our sample statistic can get closer to what our population parameter actually is) by increasing our random sample size
Remember: statistics vary from sample to sample; parameters are fixed
CAUTION ABOUT SAMPLE SURVEYS…
But remember, no matter how large our sample is, a large sample size does not ‘fix’ underlying issues like bad wording, undercoverage, convenience sampling, etc.
QUESTIONS TO ASK YOURSELF BEFORE YOU BELIEVE A POLL/SURVEY...
Who carried out the survey?
How was the sample selected?
How large was the sample?
What was the response rate?
How were the subjects contacted?
When was the survey conducted?
What was the exact question asked?
MEASURING THE QUALITY OF A SURVEY…BIAS & VARIABILITY
True value of population parameter (p or μ) is like a bull’s eye on a target
Sample statistic (p̂, x̄, etc.) is like an arrow fired at the target; sometimes it hits the bull’s eye and sometimes it misses
Keep in mind… we (very) often don’t know how close to the bull’s eye we are…
BIAS & VARIABILITY (AIM & PRECISION)
When we take many samples from a population (sampling distribution), bias & variability can look like the following:
BIAS & VARIABILITY (AIM & PRECISION)... We want: low bias & low variability (good aim
& good precision)
Properly chosen statistics computed from random samples of sufficient size will have low bias & low variability (good aim & precision)
Hits the bull’s eye on the target
Can’t eliminate bias & variability (bad aim & precision); can just do all that we can to reduce bias & variability
SAMPLING DISTRIBUTIONS...
The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
GROUP/PARTNER ACTIVITY...
The population we will consider is the scores of 10 students on an exam as follows:
The parameter of interest is the mean score in this population, which is 69.4.
The sample is a SRS drawn from the population.
Let’s use RDT (enter at a random line) to draw an SRS of size n = 4 from this population. Calculate the mean of the sample scores. This statistic is an estimate of the population parameter.
Repeat this process 4 times. Write your 4 x̄’s on the board
Input all x̄’s written on the board into Minitab & create a histogram. You are constructing a simulated sampling distribution of x̄.
What is the approximate value of the center of your histogram? What is the shape of this histogram?
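The same activity can be simulated in Python instead of with the RDT and Minitab. The ten scores below are hypothetical stand-ins (the class's actual score list is not reproduced here, so the mean is 70 rather than 69.4); the point is only to show how the x̄'s pile up around the population mean.

```python
import random
import statistics

# Hypothetical exam scores for a population of 10 students (assumed values).
scores = [52, 58, 61, 65, 68, 70, 74, 79, 83, 90]
mu = statistics.mean(scores)  # the population parameter

random.seed(1)  # fixed seed so the run is repeatable
# Draw many SRSs of size n = 4 (without replacement); record each sample mean.
xbars = [statistics.mean(random.sample(scores, 4)) for _ in range(2000)]

center = statistics.mean(xbars)   # center of the simulated sampling distribution
spread = statistics.stdev(xbars)  # how much x-bar varies from sample to sample
print(mu, round(center, 2), round(spread, 2))
```

The center of the x̄'s should land very close to the population mean, and the x̄'s vary much less than the individual scores do. Plotting xbars as a histogram gives the picture the class builds on the board.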
SIMULATED SAMPLING DISTRIBUTION …
PRECISION OF AN ESTIMATOR…
The precision (variability; how much it varies) of an estimator does not depend on the size of the population; it depends only on the sample size. An estimator based on a sample of size 10 is just as precise in a population of 1000 people as in a population of a million.
Let’s explore this idea more…
DESCRIBING SAMPLING DISTRIBUTIONS...
Consider 1000 SRS’s of n = 100 for proportion of U.S. adults who watched Survivor Guatemala in 2005
Discuss in your groups some observations you have about this distribution; share out.
SOCS: symmetric, uni-modal, no outliers, center at about 0.37, spread from about 0.21 to about 0.53, ≈ Normal
These are values of p̂ (not p); some are low; some are high; most are about 0.37
Remember, this is data from sampling distribution; center of sampling distribution (a statistic, not a parameter) = 0.37
In reality, we rarely know population parameters (center p, or spread σ); why? Discuss.
[Figure: two simulated sampling distributions, 1000 SRSs each; one panel with n = 100, the other with n = 1000]
What do you notice?
BIG IDEA FOR LARGER N...
Given that random sampling is used properly, the larger the SRS size (n), the smaller the spread (the more tightly clustered; the more precise) of the sampling distribution.
Center doesn’t change significantly
Shape doesn’t change significantly
Spread (range, standard deviation) does change significantly
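A quick calculation shows how much the spread shrinks. This sketch uses the standard formula for the standard deviation of a sample proportion, sqrt(p(1 − p)/n), and takes p = 0.37 from the center of the simulated distribution; the function name sd_of_phat is just a label for this example.

```python
import math

def sd_of_phat(p, n):
    """Standard deviation of the sample proportion for an SRS of size n."""
    return math.sqrt(p * (1 - p) / n)

p = 0.37  # center of the simulated sampling distributions
sd_100 = sd_of_phat(p, 100)
sd_1000 = sd_of_phat(p, 1000)
print(round(sd_100, 4), round(sd_1000, 4))
```

Making n ten times larger cuts the spread by a factor of about √10 ≈ 3.2, while the center stays at 0.37. With n = 100 the standard deviation is close to 0.05, which fits the observed spread of roughly 0.21 to 0.53 (about three standard deviations either side of 0.37).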
SPREAD OF SAMPLING DISTRIBUTION VS. SIZE OF POPULATION
Red, white, & blue marbles (1/3 each) in a 64 ounce cup; well mixed
VERSUS
Red, white, & blue marbles (1/3 each) in a large cargo shipping container; well mixed
Variability (spread, standard deviation) of % of red marbles depends only on size of my scoop (SRS size; n); not the population size
Teaspoon size SRS out of cargo container is not going to vary less than teaspoon size SRS out of 64 ounce cup
RED, WHITE, & BLUE MARBLES (1/3 EACH) IN A 128 OUNCE CUP; WELL MIXED VERSUS RED, WHITE, & BLUE MARBLES (1/3 EACH) IN A LARGE CARGO SHIPPING CONTAINER; WELL MIXED
SRS of scoop size 4 ounces out of either (cup or cargo container) will vary equally
SRS of scoop size 6 ounces will have less variability than SRS of 4 ounces
SRS of scoop size 8 ounces out of either (cargo container or cup) will vary equally
Variability/precision does not depend on the size of the population but rather on the size of the SRS
ANOTHER EXAMPLE...
An SRS of 2,500 from the U.S. population of 300 million is just as precise (same amount of variability) as an SRS of the same size, 2,500, from the San Francisco population of 750,000
Both just as precise (given that population is well-mixed); equally trustworthy
Not about population size; it’s about sample size (n)
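This claim can be checked by simulation. The sketch below is an illustration under assumptions: half of each population answers "yes" (p = 0.5 is a made-up value), each population is just the integers 0 to N − 1, and simulated_sd is a hypothetical helper name. It draws 300 SRSs of size 2,500 from each population and compares how much p̂ varies.

```python
import random
import statistics

def simulated_sd(pop_size, n, p=0.5, reps=300, seed=0):
    """SD of the sample proportion across repeated SRSs from one population."""
    rng = random.Random(seed)
    yes_count = int(p * pop_size)  # individuals 0..yes_count-1 answer "yes"
    phats = []
    for _ in range(reps):
        # SRS without replacement; range() avoids building the whole population.
        sample = rng.sample(range(pop_size), n)
        phats.append(sum(1 for i in sample if i < yes_count) / n)
    return statistics.stdev(phats)

sd_city = simulated_sd(750_000, 2500)        # San Francisco-sized population
sd_nation = simulated_sd(300_000_000, 2500)  # U.S.-sized population
print(round(sd_city, 4), round(sd_nation, 4))
```

Both standard deviations should come out near sqrt(0.5 × 0.5 / 2500) = 0.01, even though one population is 400 times larger than the other.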
BOWL SIZE IS POPULATION…
SO WHY NOT JUST HAVE REALLY LARGE SAMPLE SIZES ALL THE TIME?
Increased sample size improves precision/ reduces variability
Surveys based on larger sample sizes have smaller standard error (SE) and therefore better precision (less variability)
Trade-offs…
Cost increases, time-consuming, etc.
UNBIASED ESTIMATORS… IF Conditions are met:
Sample randomly selected (with or without replacement)
If sampling without replacement, population must be at least 10 times the sample size (rule of thumb)
REMEMBER THE NORMAL DISTRIBUTION?
Life was good and easy with the Normal distribution
Could easily calculate probabilities
Good working model
If we could use the Normal distribution with sampling distributions for proportions, life would be great
Guess what? We can. Meet the Central Limit Theorem
CENTRAL LIMIT THEOREM (CLT)
Has many versions (one for proportions, one for means, etc.)
Let’s discuss proportions for now
To use CLT with proportions, three conditions must be met
CONDITIONS FOR USING CLT (NORMAL DISTRIBUTION) WITH PROPORTIONS…
Random and independent (samples collected randomly from population and observations independent)
Large sample; np ≥ 10 and n(1 – p) ≥ 10; the expected numbers of successes and failures are each at least 10
Big population (population at least 10 times sample size)
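The two numerical conditions are easy to check in code. The sketch below is a minimal illustration; clt_conditions_ok is a hypothetical helper name, and the population size of 200 million in the example is an assumed round number, not a figure from the slides. The first condition (random and independent) is about how the data were collected and cannot be checked by arithmetic.

```python
def clt_conditions_ok(n, p, population_size):
    """Check the large-sample and big-population conditions for a proportion."""
    # Expected successes and expected failures must each be at least 10.
    large_sample = n * p >= 10 and n * (1 - p) >= 10
    # The population must be at least 10 times the sample size.
    big_population = population_size >= 10 * n
    return large_sample and big_population

# n = 100 and p = 0.37, with an assumed population of 200 million.
ok_large = clt_conditions_ok(100, 0.37, 200_000_000)
# n = 20 and p = 0.10 gives only 2 expected successes, so the check fails.
ok_small = clt_conditions_ok(20, 0.10, 200_000_000)
print(ok_large, ok_small)  # True False
```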
CONFIDENCE INTERVALS...WHAT PROPORTION OF ALL COC STUDENTS HAVE AT LEAST ONE TATTOO?
What proportion of us have at least one tattoo? So our sample statistic, p̂ =
If we were to ask another group of COC students, we would get another (likely different) p̂
445 Math 075 students were asked this last Spring; 133/445 = 0.299 = 29.9% had at least one tattoo
Remember, larger n, generally less variation; but still centered at same value (unbiased estimator)
We want to be able to say with a high level of certainty what proportion of all COC students have at least one tattoo. But we don’t know the true, unknown population parameter, p.
SO WHAT CAN WE SAY ABOUT OUR COC STUDENT POPULATION AND THEIR TATTOOS?
We don’t know p (population parameter)
We do know p̂ (sample statistic)
Our estimator is unbiased (what does that mean?)
SD (SE) = ?
Sample size is large; ‘randomly’ selected; big population
So.... our distribution is ≈ Normal, centered around p̂; 68% within 1 SD; 95% within 2 SDs; 99.7% within 3 SDs
SO WHAT CAN WE SAY ABOUT OUR COC STUDENT POPULATION?
Our distribution is ≈ Normal, centered around p̂; 68% within 1 SD; 95% within 2 SDs; 99.7% within 3 SDs
So, we are highly confident (95%) that the unknown population parameter, the proportion of all COC students that have at least one tattoo, is between ___ and ___.
This is a confidence interval
SO WHAT CAN WE SAY ABOUT OUR COC STUDENT POPULATION?
Our distribution is ≈ Normal, centered around p̂; 68% within 1 SD; 95% within 2 SDs; 99.7% within 3 SDs
So, we are highly confident (99.7%) that the unknown population parameter, the proportion of all COC students that have at least one tattoo, is between ___ and ___.
This is a confidence interval
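As an illustration, here is how the 2-SD and 3-SD intervals could be computed for the Math 075 figures quoted earlier (133 of 445 students). This is a sketch, not the class answer: it assumes the 445 students can be treated like a random sample and uses the usual standard error formula sqrt(p̂(1 − p̂)/n).

```python
import math

x, n = 133, 445  # students with at least one tattoo, students asked
p_hat = x / n
se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated SD of p-hat

intervals = {}
for k, label in ((2, "95%"), (3, "99.7%")):
    # Go k standard errors either side of p-hat.
    intervals[label] = (p_hat - k * se, p_hat + k * se)
    low, high = intervals[label]
    print(f"{label}: {low:.3f} to {high:.3f}")
```

The 99.7% interval is wider than the 95% interval: more confidence costs a longer interval when n stays the same.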
STATISTICAL INFERENCE... Statistical inference provides methods for
drawing conclusions about a population based on sample data
Methods used for statistical inference assume that the data was produced by properly randomized design
Confidence intervals, which are based on sampling distributions of statistics, now; then will discuss Hypothesis Testing (another form of inference)
INFERENCE: CONFIDENCE INTERVALS...
Estimator ± margin of error (MOE)
Margin of error tells us how far ‘off’ our estimate is likely to be, at most
Margin of error helps account for sampling variability (NOT any of the biases we discussed... voluntary response, non-response, etc.)
CONFIDENCE LEVELS: HOW CONFIDENT ARE YOU...
that the average temperature in Santa Clarita in degrees Fahrenheit is between -50 and 150?
that the average temperature in Santa Clarita in degrees Fahrenheit is between 70 and 70.001?
In general, a wide interval goes with a high confidence level; a narrow interval goes with a lower confidence level
TYPICAL CONFIDENCE LEVELS... 99% confidence level
95% confidence level
90% confidence level
Typically we want both: a reasonably high confidence level AND a reasonably small interval; but there are trade-offs; more on this in a little bit
WE ARE 95% CONFIDENT IN OUR METHOD... GIVES CORRECT RESULTS 95% OF THE TIME...
CONFIDENCE INTERVALS... Will we ever know for sure if we
captured the true unknown population parameter p? No. Actual p is unknown.
Memorize: “I am ___% confident that the true, unknown population proportion of (context) is between ____ and ____.”
PRACTICE: TRUE OR FALSE?
There is a 95% probability (chance) that the interval from 0.80 to 0.92 contains p.
False. The probability is either 0 or 1 (but we don’t know which)
PRACTICE: TRUE OR FALSE?
There is a 95% chance that the interval (0.17, 0.24) contains p̂.
False. The general form of a CI is p̂ ± MOE. So p̂ will always be at the center of the CI.
PRACTICE: TRUE OR FALSE?
There’s a 95% probability that p = .112
False. Never use this wording. Don’t use ‘probability’; there’s no mention of the CI either
PRACTICE: TRUE OR FALSE?
We are 95% confident that the true, unknown population parameter, p, the proportion of freshmen at this particular university who have genius-level IQ scores, is between 0.010 and 0.018
True; perfect wording for CI interpretations. It doesn’t use the word “probability,” and it includes the context and the confidence level.
TO REVIEW...WHAT AFFECTS THE LENGTH OF CI’S?
The lower the confidence level (say 10% confident), the shorter the CI
The higher the confidence level (say 99% confident), the wider the CI
What else affects the length of the CI?
Larger the n, shorter the CI (small MOE)
Smaller the n, longer the CI (large MOE)
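Both effects can be seen by computing the margin of error directly. This sketch uses the standard formula MOE = z* · sqrt(p̂(1 − p̂)/n), gets z* from Python's statistics.NormalDist, and uses p̂ = 0.5 and the sample sizes below purely as example values; margin_of_error is a name chosen for this illustration.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """Margin of error for a proportion at the given confidence level."""
    # z* leaves (1 - confidence)/2 in each tail of the standard Normal curve.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z_star * math.sqrt(p_hat * (1 - p_hat) / n)

# Same n, different confidence levels: higher confidence, larger MOE.
for cl in (0.90, 0.95, 0.99):
    print(cl, round(margin_of_error(0.5, 1000, cl), 4))

# Same confidence level, different n: larger n, smaller MOE.
for n in (250, 1000, 4000):
    print(n, round(margin_of_error(0.5, n, 0.95), 4))
```

Note that quadrupling n only cuts the MOE in half, which is part of why very small margins of error get expensive.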
EXAMPLE... HIGH CL & SMALL MOE
So, if you want (need) a high confidence level AND a small(er) interval (margin of error), it is possible if you are willing to increase n
Can be expensive, time-consuming
Sometimes not realistic (why?)
In reality, you may need to compromise on the confidence level (lower confidence level)
PRACTICE...
Alcohol abuse is considered by some as the number one problem on a campus. How common is it? A 2001 SRS of 10,904 U.S. college students collected information on drinking behavior and alcohol-related problems. The researchers defined “frequent binge drinking” as having 5 or more drinks in a row 3 or more times in the past 2 weeks. According to this definition, 2486 students were classified as frequent binge drinkers.
Based on these data, what can we say about the proportion of all college students who have engaged in frequent binge drinking?
N = 10,904; X = 2,486; CONFIDENCE LEVEL: 99%
Check conditions to create a confidence interval... Randomization, Normality, Independence
Randomization: SRS stated in problem
Normality (via CLT): np ≥ 10; n(1 – p) ≥ 10
Independence: population must be at least 10 times sample size.
N = 10,904; X = 2,486; CONFIDENCE LEVEL: 99%
Perform Minitab calculations
1 sample, proportion
Options, 99 CL
Data, summarized data, events, trials
“+”, data labels
Always conclude with interpretation, in context
I am 99% confident that the true, unknown population parameter, p, the proportion of all college students who have engaged in frequent binge drinking, is between 21.8% and 23.8%.
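For anyone without Minitab, the calculation can be approximated in Python. This is a sketch using the Normal-approximation formula p̂ ± z* · sqrt(p̂(1 − p̂)/n); Minitab may use a slightly different (exact) method by default, so its endpoints can differ in the third or fourth decimal place.

```python
import math
from statistics import NormalDist

x, n, confidence = 2486, 10904, 0.99
p_hat = x / n

# Normality check, using p-hat in place of the unknown p.
conditions_ok = n * p_hat >= 10 and n * (1 - p_hat) >= 10

# z* for 99% confidence, then the margin of error and the interval.
z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
moe = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - moe, p_hat + moe
print(conditions_ok, f"{low:.3f} to {high:.3f}")
```

Here p̂ is about 0.228 and the margin of error is about 0.010, so the interval runs from roughly 0.218 to 0.238.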
CHOOSING A SAMPLE SIZE... Often researchers choose the MOE & CL
they want ahead of time/before survey
So they need to have a particular n to achieve the MOE and the CL they want.
$$ \text{MOE} = z^{*}\sqrt{\frac{p(1-p)}{n}} $$
CHOOSING A SAMPLE SIZE...
Often researchers choose the MOE
A common CL is 95%, so z* ≈ 2
Can solve for n & get formula in textbook
$$ m = 2\sqrt{\frac{0.5(1-0.5)}{n}} $$
PRACTICE... CHOOSING A SAMPLE SIZE...
A company has received complaints about its customer service. They intend to hire a consultant to carry out a survey of customers. Before contacting the consultant, the company president wants some idea of the sample size that she will be required to pay for.
One critical question is the degree of satisfaction with the company's customer service. The president wants to estimate the proportion p of customers who are satisfied. She decides that she wants the estimate to be within 3% (0.03) at a 95% confidence level.
CHOOSING A SAMPLE SIZE...
No idea of the true proportion p of satisfied customers; so use p = 0.5. The sample size required is given by
$$ \text{MOE} = z^{*}\sqrt{\frac{p(1-p)}{n}} $$
$$ 0.03 = 2\sqrt{\frac{0.5(1-0.5)}{n}} $$
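Solving the MOE equation for n gives n = (z*/MOE)² · p(1 − p), rounded up to the next whole person. The sketch below applies that to the president's requirements; required_n is a helper name made up for this example, and the z* = 1.96 line is an extra comparison, not part of the slide.

```python
import math

def required_n(moe, z_star=2.0, p=0.5):
    """Smallest sample size whose margin of error is at most moe."""
    # Solve moe = z* * sqrt(p(1 - p)/n) for n, then round up.
    return math.ceil((z_star / moe) ** 2 * p * (1 - p))

n_rule_of_thumb = required_n(0.03)            # z* ≈ 2, as on the slide
n_precise = required_n(0.03, z_star=1.96)     # more precise z* for 95%
print(n_rule_of_thumb, n_precise)  # 1112 1068
```

Using p = 0.5 is the cautious choice: p(1 − p) is largest at 0.5, so the resulting n is big enough whatever the true proportion turns out to be.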