From the Data at Hand to the World at Large, Part I
Introduction to Statistics
Siana Halim
Slide 2
TOPICS
- Population and Sample
- Sampling distribution models
- Confidence interval for proportions

References:
- De Veaux, Velleman, Bock, Stats: Data and Models, Pearson Addison Wesley, International Edition, 2005.
- John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
Slide 3
Slide 4
Sampling and Population
- We'd like to know about an entire population of individuals, but examining all of them is usually impractical, if not impossible. So we settle for examining a smaller group of individuals (a sample) selected from the population.
- We should select individuals for the sample at random. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.
- The fraction of the population that you've sampled doesn't matter. It's the sample size itself that's important.
Slide 5
Sampling and Population
Does a census make sense?
- It can be difficult to complete a census.
- Populations rarely stand still.
- Taking a census can be more complex than sampling.
Slide 6
Population and Parameters
Models use mathematics to represent reality. Parameters are the key numbers in those models. A parameter used in a model for a population is called a population parameter. Any summary found from the data is a statistic.

Name                     Statistic    Parameter
Mean                     $\bar{y}$    $\mu$ (mu)
Standard deviation       $s$          $\sigma$ (sigma)
Correlation              $r$          $\rho$ (rho)
Regression coefficient   $b$          $\beta$ (beta)
Proportion               $\hat{p}$    $p$
Slide 7
Simple Random Samples
We need to be sure that the statistics we compute from the sample reflect the corresponding parameters accurately, that is, that the sample is representative. How would we select a representative sample?
- A Simple Random Sample (SRS): every possible sample of the size we plan to draw has an equal chance of being selected. Each combination of people has an equal chance of being selected as well.
- The sampling frame is a list of individuals from which the sample is drawn.
- Samples drawn at random generally differ from one another. Each draw of random numbers selects different people for our sample. These differences lead to different values for the variables we measure. We call these sample-to-sample differences sampling variability.
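As a minimal sketch of drawing an SRS, the Python snippet below uses the standard library's `random.sample`, which selects without replacement so that every subset of the requested size is equally likely. The sampling frame of 1,000 ID numbers and the helper name `simple_random_sample` are hypothetical, chosen only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct individuals; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: ID numbers 1..1000.
frame = list(range(1, 1001))

# Two independent draws select different individuals (sampling variability).
sample_a = simple_random_sample(frame, 10, seed=1)
sample_b = simple_random_sample(frame, 10, seed=2)
print(sample_a)
print(sample_b)
```

Running the two draws side by side shows the sample-to-sample differences described above: the same procedure applied to the same frame yields different people.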
Slide 8
Stratified Sampling
All statistical sampling designs have in common the idea that chance, rather than human choice, is used to select the sample. Designs that are used to sample from large populations, especially populations residing across large areas, are often more complicated than simple random samples. Sometimes the population is first sliced into homogeneous groups, called strata, before the sample is selected. Then simple random sampling is used within each stratum before the results are combined. This common sampling design is called stratified random sampling.
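The idea of sampling within each stratum can be sketched as follows. This is an illustration only: the strata names, their sizes, and the choice of proportional allocation (the same sampling fraction in every stratum) are assumptions, not something the slides prescribe.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take a simple random sample of the given fraction within each stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        chosen[name] = rng.sample(members, k)
    return chosen

# Hypothetical population already split into two homogeneous strata.
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "rural": [f"r{i}" for i in range(400)],
}
result = stratified_sample(strata, fraction=0.05, seed=0)
for name, members in result.items():
    print(name, len(members))
```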
Slide 9
Cluster and Multistage Sampling
Splitting the population into similar parts, or clusters, can make sampling more practical. Then we could simply select one or a few clusters at random and perform a census within each of them. This design is called cluster sampling. Sampling schemes that combine several methods are called multistage samples. Sometimes we draw a sample by selecting individuals systematically. This is called a systematic sample.
Slide 10
Sampling Distribution Models
Why do sample proportions vary at all? How can surveys conducted at essentially the same time by the same organization asking the same questions get different results? The answer is at the heart of statistics: each survey is based on a different sample. The proportions vary from sample to sample because the samples are composed of different people.
Slide 11
Modeling the Distribution of Sample Proportions
Most models are useful only when specific assumptions are true. In the case of the model for the distribution of sample proportions, there are two assumptions:
1. The sampled values must be independent of each other.
2. The sample size, n, must be large enough.
The corresponding conditions to check before using the Normal to model the distribution of sample proportions are:
1. 10% condition: if sampling has not been made with replacement, then the sample size, n, must be no larger than 10% of the population.
2. Success/failure condition: the sample size has to be big enough so that both np and nq are at least 10, where q = 1 - p.
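The two conditions can be checked mechanically. The sketch below is one possible helper (the name `check_proportion_conditions` and its arguments are hypothetical); it uses the "at least 10" threshold for the success/failure condition and skips the 10% check when no population size is supplied.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Return (ten_percent_ok, success_failure_ok) for a sample of size n."""
    q = 1 - p
    # 10% condition: only checkable when the population size is known.
    ten_percent_ok = population_size is None or n <= 0.10 * population_size
    # Success/failure condition: expect at least 10 successes and 10 failures.
    success_failure_ok = n * p >= 10 and n * q >= 10
    return ten_percent_ok, success_failure_ok

# n = 90 with p = 0.13 gives np = 11.7 and nq = 78.3, so the condition holds.
print(check_proportion_conditions(90, 0.13, population_size=5000))
# n = 50 with p = 0.10 gives np = 5, so the Normal model is not appropriate.
print(check_proportion_conditions(50, 0.10, population_size=5000))
```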
Slide 12
The Sampling Distribution Model of a Proportion
Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with mean $\mu(\hat{p}) = p$ and standard deviation $SD(\hat{p}) = \sqrt{pq/n}$, where $q = 1 - p$.
The sample proportion is $\hat{p} = y/n$, where y is the number of successes and n is the sample size.
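A small simulation can make the model concrete. The sketch below draws many samples, computes the sample proportion for each, and compares the spread of those proportions with the model's standard deviation sqrt(pq/n). The values p = 0.13 and n = 90, the number of repetitions, and the seed are assumptions made for illustration.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=0):
    """Simulate reps samples of size n and return each sample proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.13, 90  # illustrative values, not data
props = simulate_sample_proportions(p, n, reps=5000)

model_sd = math.sqrt(p * (1 - p) / n)  # SD of p-hat under the Normal model
print("mean of simulated proportions:", round(statistics.mean(props), 4))
print("SD of simulated proportions:  ", round(statistics.stdev(props), 4))
print("model SD sqrt(pq/n):          ", round(model_sd, 4))
```

The simulated mean should land close to p and the simulated standard deviation close to the model value, which is what the sampling distribution model claims.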
Slide 13
The Central Limit Theorem (CLT)
As the sample size, n, increases, the mean of n independent values has a sampling distribution that tends toward a Normal model with mean equal to the population mean, $\mu$, and standard deviation $\sigma/\sqrt{n}$.
The CLT requires remarkably few assumptions, so there are few conditions to check:
1. Random sampling condition.
2. Independence assumption.
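To see the CLT at work on a population that is far from Normal, the sketch below averages draws from an Exponential(1) population, which is strongly right-skewed and has mean 1 and standard deviation 1. The choice of population, the sample size n = 50, and the seed are assumptions for illustration only.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Means of n independent draws from an Exponential(1) population."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

n = 50
means = simulate_sample_means(n, reps=4000)

# CLT prediction: centre at mu = 1 and spread sigma / sqrt(n) = 1 / sqrt(50).
predicted_sd = 1.0 / math.sqrt(n)
print("mean of sample means:", round(statistics.mean(means), 3))
print("SD of sample means:  ", round(statistics.stdev(means), 3))
print("predicted SD:        ", round(predicted_sd, 3))
```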
Slide 14
Sampling Distribution Model for a Mean
If the assumptions of independence and random sampling are met, and the sample size is large enough, the sampling distribution of the sample mean is modeled by a Normal model with a mean equal to the population mean, $\mu$, and a standard deviation equal to $SD(\bar{y}) = \sigma/\sqrt{n}$.
The parameters in the population are estimated by:
- Sample mean: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
- Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$
Slide 15
Working with Sampling Distribution Models
Example 1. About 13% of the population is left-handed. A 200-seat school auditorium has been built with 15 "leftie seats", seats that have the built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what's the probability that there will not be enough seats for the left-handed students?
Step-by-step:
- State what we want to know.
- Check the conditions.
- State the parameters and the sampling distribution model.
- Make a picture. Sketch the model and shade the area we're interested in.
- Find the z-score of the cutoff proportion.
- Find the resulting probability from a table of Normal probabilities.
- Discuss the probability in the context of the question.
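The steps for Example 1 can be traced numerically. The sketch below uses the Normal model for the sample proportion with no continuity correction, and replaces the table lookup with `statistics.NormalDist` from the Python standard library; the variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

p = 0.13          # proportion of left-handers in the population
n = 90            # class size
lefty_seats = 15  # seats with the desk on the left

# Conditions: np = 11.7 and nq = 78.3 are both at least 10.
assert n * p >= 10 and n * (1 - p) >= 10

sd_phat = math.sqrt(p * (1 - p) / n)  # SD of the sample proportion
cutoff = lefty_seats / n              # more than 15 of 90 means too few seats
z = (cutoff - p) / sd_phat            # z-score of the cutoff proportion
prob_not_enough = 1 - NormalDist().cdf(z)

print("SD(p-hat):", round(sd_phat, 4))
print("z:", round(z, 2))
print("P(not enough seats):", round(prob_not_enough, 3))
```

Under these assumptions the z-score is a little above 1 and the probability of running short of leftie seats comes out at roughly 15%.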
Slide 16
Working with Sampling Distribution Models
Example 2. Suppose that the mean adult weight is 175 pounds with a standard deviation of 25 pounds. An elevator in our building has a weight limit of 10 persons or 2000 pounds. What's the probability that the 10 people who get on the elevator overload its weight limit?
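Example 2 can be worked the same way, by noting that a total above 2000 pounds for 10 people is the same event as a mean weight above 200 pounds. The sketch below assumes the 10 riders behave like a random sample and that the Normal model for the mean is reasonable at n = 10 (which requires adult weights to be roughly Normal); both are assumptions, not facts given in the example.

```python
import math
from statistics import NormalDist

mu = 175        # population mean weight (pounds)
sigma = 25      # population standard deviation (pounds)
n = 10          # people in the elevator
limit = 2000    # total weight limit (pounds)

mean_cutoff = limit / n              # overload means a mean weight above 200
sd_mean = sigma / math.sqrt(n)       # SD of the sample mean
z = (mean_cutoff - mu) / sd_mean
prob_overload = 1 - NormalDist().cdf(z)

print("SD(y-bar):", round(sd_mean, 2))
print("z:", round(z, 2))
print("P(overload):", round(prob_overload, 4))
```

The cutoff sits more than 3 standard deviations above the mean, so the probability of overload is well under 1 in 1,000.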
Slide 17
Standard Error
When we estimate the standard deviation of a sampling distribution using statistics found from the data, the estimate is called a standard error.
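For a sample mean, the standard error replaces the unknown population standard deviation with the sample standard deviation s, giving s divided by the square root of n. The short sketch below computes it for a made-up data set; the eight weights are hypothetical values used only to show the arithmetic.

```python
import math
import statistics

# Hypothetical sample of adult weights in pounds (illustrative values only).
weights = [172, 181, 165, 190, 158, 177, 169, 184]

n = len(weights)
y_bar = statistics.mean(weights)   # sample mean
s = statistics.stdev(weights)      # sample SD (divides by n - 1)
se_mean = s / math.sqrt(n)         # standard error of the mean

print("sample mean:", y_bar)
print("sample SD:", round(s, 2))
print("SE of the mean:", round(se_mean, 2))
```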
Slide 18
Confidence Interval for Proportion
We are 95% confident that the true proportion of the population is in our interval.
Slide 19
Confidence Interval (Example) Sea fans, one spectacular kind of
coral, in the Caribbean Sea have been under attack by the disease
aspergillosis. In June of 2000, the sea fan disease team from Dr.
Drew Harvell's lab randomly sampled some sea fans at the Las Redes
Reef in Akumal, Mexico, at a depth of 40 feet. They found that 54
of the 104 sea fans they sampled were infected with the disease.
What might this say about the prevalence of this disease among sea
fans in general?
Slide 20
Confidence Interval (Example)
What can we say about the population proportion, p? Is the infected proportion of all sea fans 51.9%? Probably not exactly. We do know, though, that the sampling distribution model of $\hat{p}$ is centered at p, and we know that the standard deviation of the sampling distribution is $\sqrt{pq/n}$. But we don't know p; instead we'll use $\hat{p}$ and find the standard error, $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.519)(0.481)/104} = 0.049$.
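The standard error for the sea fan sample can be checked with a few lines of arithmetic. The sketch below uses the counts from the example (54 infected out of 104); the variable names are chosen for illustration.

```python
import math

infected = 54
n = 104

p_hat = infected / n                 # sample proportion
q_hat = 1 - p_hat
se = math.sqrt(p_hat * q_hat / n)    # standard error of p-hat

print(round(p_hat, 3), round(se, 3))  # prints 0.519 0.049
```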
Slide 21
Now we know the sampling model for $\hat{p}$: a Normal model centered at p, with a standard deviation estimated by SE = 0.049. Because it's Normal, it says that about 68% of all samples of 104 sea fans will have $\hat{p}$'s within 1 SE, 0.049, of p. And about 95% of all these samples will be within p ± 2 SEs.
BUT where is our sample proportion in this picture? We do know that for 95% of random samples, $\hat{p}$ will be no more than 2 SEs away from p. So let's look at this from $\hat{p}$'s point of view. If I'm $\hat{p}$, there's a 95% chance that p is no more than 2 SEs away from me. If I reach out 2 SEs, or 2 × 0.049, away from me on both sides, I'm 95% sure that p will be within my grasp. Now, I've got him! Probably.
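The "reach out 2 SEs" argument can be checked by simulation: draw many samples from a population with a known proportion and count how often the interval of 2 SEs around the sample proportion captures it. In the sketch below the true proportion p = 0.5 is a made-up value (the real p for the reef is unknown), and the number of repetitions and the seed are arbitrary.

```python
import math
import random

def coverage_of_2se_interval(p, n, reps, seed=0):
    """Fraction of simulated samples whose p-hat +/- 2 SE interval contains p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        if abs(p_hat - p) <= 2 * se:
            hits += 1
    return hits / reps

# Hypothetical true proportion; sample size matches the sea fan study.
coverage = coverage_of_2se_interval(p=0.5, n=104, reps=5000)
print("share of intervals that capture p:", coverage)
```

The share should come out near 95%, which is the sense in which we are "95% sure" that p is within reach.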
Slide 22
So what can we really say about p?
1. 51.9% of all sea fans on the Las Redes Reef are infected. NO WAY!
2. It is probably true that 51.9% of all sea fans on the Las Redes Reef are infected. NO.
3. We don't know exactly what proportion of sea fans on the Las Redes Reef are infected, but we know that it's within the interval 51.9% ± 2 × 4.9%. That is, it's between 42.1% and 61.7%. GETTING CLOSER!
4. We don't know exactly what proportion of sea fans on the Las Redes Reef are infected, but the interval from 42.1% to 61.7% probably contains the true proportion. TRUE, but a bit wishy-washy.
5. We are 95% confident that between 42.1% and 61.7% of Las Redes Reef sea fans are infected. YES!
Statements like these are called confidence intervals. They're the best we can do. The interval is called a one-proportion z-interval.
"Far better an approximate answer to the right question, than an exact answer to the wrong question." - John W. Tukey
Slide 23
Margin of Error
A confidence interval (CI) has the form $\hat{p} \pm 2\,SE(\hat{p})$. The extent of the interval on either side of $\hat{p}$ is called the margin of error (ME). In general, CIs look like this: estimate ± ME. The more confident we want to be, the larger the margin of error must be.
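The trade-off between confidence and margin of error can be seen by computing the ME at several confidence levels for the same data. The sketch below reuses the sea fan counts (54 of 104) and obtains the multiplier from the standard Normal model via `statistics.NormalDist`; the helper name `margin_of_error` and the choice of levels are illustrative assumptions.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence):
    """ME = multiplier from the Normal model times the standard error of p-hat."""
    multiplier = NormalDist().inv_cdf(0.5 + confidence / 2)
    return multiplier * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat, n = 54 / 104, 104
me90 = margin_of_error(p_hat, n, 0.90)
me95 = margin_of_error(p_hat, n, 0.95)
me99 = margin_of_error(p_hat, n, 0.99)
for level, me in ((0.90, me90), (0.95, me95), (0.99, me99)):
    print(f"{level:.0%} confidence: ME = {me:.3f}")
```

The margin grows as the confidence level rises, so a more confident statement is necessarily a less precise one.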
Slide 24
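Critical Value
[Figure: standard Normal curves with central area 0.95 between -1.96 and 1.96, and central area 0.90 between -1.645 and 1.645]
The values z* = 1.96 (95% confidence) and z* = 1.645 (90% confidence) are called critical values. The CIs for the population proportion and the population mean can be formulated as follows: $\hat{p} \pm z^* \times SE(\hat{p})$ and $\bar{y} \pm z^* \times SE(\bar{y})$.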
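The critical values quoted on this slide can be recovered from the standard Normal model rather than read from a table. The sketch below is one way to do that with the Python standard library; the function name `critical_value` is chosen here for illustration.

```python
from statistics import NormalDist

def critical_value(confidence):
    """z* that leaves (1 - confidence) / 2 in each tail of the standard Normal."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(critical_value(0.95), 3))  # prints 1.96
print(round(critical_value(0.90), 3))  # prints 1.645
```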
Slide 25
Assumptions and Conditions
Independence Assumption: check three conditions.
- Plausible independence condition. This condition depends on your knowledge of the situation.
- Randomization condition. Were the data sampled at random or generated from a properly randomized experiment?
- 10% condition. The sample should be no larger than 10% of the population.
Sample Size Assumption: check the success/failure condition.
- We must expect at least 10 successes and at least 10 failures.
Slide 26
One-proportion z-interval
When the conditions are met, we are ready to find the confidence interval for the population proportion, p. The confidence interval is $\hat{p} \pm z^* \times SE(\hat{p})$, where the standard error of the proportion is estimated by $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$.
Slide 27
Example
In May 2002, the Gallup Poll asked 537 randomly sampled adults the question "Generally speaking, do you believe the death penalty is applied fairly or unfairly in this country today?" Of these, 53% answered "Fairly" and 7% said they didn't know. What can we conclude from this survey?
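One way to answer is a one-proportion z-interval for the share of adults who would answer "Fairly". The sketch below assumes a 95% confidence level (the slide does not state one) and checks the success/failure condition before computing the interval; variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 537
p_hat = 0.53          # proportion answering "Fairly"
q_hat = 1 - p_hat

# Success/failure condition: at least 10 expected successes and failures.
assert n * p_hat >= 10 and n * q_hat >= 10

z_star = NormalDist().inv_cdf(0.975)   # assumed 95% confidence level
se = math.sqrt(p_hat * q_hat / n)      # standard error of p-hat
me = z_star * se                       # margin of error
lower, upper = p_hat - me, p_hat + me

print(f"SE = {se:.4f}, ME = {me:.4f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

Under these assumptions the interval runs from about 48.8% to about 57.2%, so the survey is consistent with either slightly less or somewhat more than half of adults believing the death penalty is applied fairly.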
Slide 28
Student's t Distribution
The t-distribution has a shape similar to the Normal distribution, but it has longer tails.
[Figure: density curves centered at 0 for the standard Normal (t with df = ∞), t (df = 13), and t (df = 5)]
Note: t → Z as n increases.
Slide 29
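t-Distribution Table (upper tail area)

df    .25     .10     .05
1     1.000   3.078   6.314
2     0.817   1.886   2.920
3     0.765   1.638   2.353

The entries in the body of the table are values of t, not probabilities.
Example: let n = 3, so df = n - 1 = 2. With α = .10, α/2 = .05 in each tail, and the table gives t = 2.920.
Using the t distribution, the CI for the mean can be formulated as $\bar{y} \pm t^*_{n-1} \times SE(\bar{y})$, where $SE(\bar{y}) = s/\sqrt{n}$.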
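A t-interval for a mean can be sketched using the table values from this slide instead of a statistics library. In the code below, the lookup dictionary holds only the entries shown in the table, the three data values are hypothetical, and the interval uses the standard one-sample form of sample mean plus or minus t* times s divided by the square root of n, which is assumed here because the slide's own formula did not survive.

```python
import math
import statistics

# Upper-tail critical values from the slide's table: {df: {tail_area: t}}.
T_TABLE = {
    1: {0.25: 1.000, 0.10: 3.078, 0.05: 6.314},
    2: {0.25: 0.817, 0.10: 1.886, 0.05: 2.920},
    3: {0.25: 0.765, 0.10: 1.638, 0.05: 2.353},
}

def t_interval(data, tail_area):
    """CI for the mean: y-bar +/- t* x s / sqrt(n), with df = n - 1."""
    n = len(data)
    t_star = T_TABLE[n - 1][tail_area]
    y_bar = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return y_bar - t_star * se, y_bar + t_star * se

# Hypothetical measurements with n = 3, so df = 2; tail area .05 gives a 90% CI.
data = [4.2, 5.1, 4.8]
lower, upper = t_interval(data, tail_area=0.05)
print(f"90% CI for the mean: {lower:.3f} to {upper:.3f}")
```

With only three observations the critical value 2.920 is much larger than the Normal value 1.645, which is how the longer tails of the t-distribution widen the interval for small samples.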