8/4/2019 Stage-3.1 Distributions and Sampling
Concepts (Review of Probability)
In probability, we assume that the population and its parameters are known and compute the probability of drawing a particular sample.
In statistics, we assume that the population and its parameters are unknown and the sample is used to infer the values of the parameters.
Sampling variability: different samples give different estimates of population parameters.
Sampling variability leads to sampling error.
Probability is deductive (general -> particular)
Statistics is inductive (particular -> general)
Probability Concepts
Random experiment: a procedure whose outcome cannot be predicted in advance. E.g. toss a coin twice.
Sample space (S): mutually exclusive, collectively exhaustive listing of all possible outcomes.
S = {(H,H), (H,T), (T,H), (T,T)}
Event (A): a set of outcomes (a subset of S). E.g. no heads: A = {(T,T)}
Union (or): e.g. A = heads on first, B = heads on second: A ∪ B = {(H,T), (H,H), (T,H)}
Intersection (and): e.g. A = heads on first, B = heads on second: A ∩ B = {(H,H)}
Complement of event A: the set of all outcomes not in A. E.g. A = {(T,T)}, A^c = {(H,H), (H,T), (T,H)}
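The two-toss events above map directly onto set operations. The following is a minimal Python sketch; the variable names are illustrative and not part of the slides.

```python
from itertools import product

# Sample space for two coin tosses: ordered pairs of H/T.
S = set(product("HT", repeat=2))

A = {o for o in S if o[0] == "H"}   # heads on first toss
B = {o for o in S if o[1] == "H"}   # heads on second toss
no_heads = {("T", "T")}             # the event "no heads"

union = A | B                  # A or B
intersection = A & B           # A and B
complement = S - no_heads      # all outcomes not in "no heads"

print(sorted(union))
print(sorted(intersection))
print(sorted(complement))
```

Representing outcomes as tuples keeps (H,T) and (T,H) distinct, which is what makes the sample space have four equally likely points.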
Probability Model
Example: There is a 95% chance that the sale of apples on Monday will be 120 kg.
Estimate=______
Probability of error = ______
Interpretation:
To reach this kind of judgement we need data and a fitted probability model.
Process of getting a probability model
Specify experiment -> Recognize all outcomes (sample space) -> Assign a number to each outcome (random variable x) -> Determine the probability for each value of x
The determination of the probability distribution completes the process of describing the probability model.
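As an illustration of this process, the sketch below walks through the steps for the two-coin-toss experiment. The choice of random variable (number of heads) is an assumption made for the example; the slides do not fix one.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Steps 1-2: specify the experiment and list all outcomes (sample space).
outcomes = list(product("HT", repeat=2))

# Step 3: assign a number to each outcome (random variable x = number of heads).
def num_heads(outcome):
    return outcome.count("H")

# Step 4: determine the probability of each value of x,
# assuming all four outcomes are equally likely (fair coin).
counts = Counter(num_heads(o) for o in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```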
Some definitions
Random Experiment
Sample Point
Sample Space
Random Variable
Event: any subset of the sample space is called an event.
A random sample gives a non-zero (equal) chance to every unit of the population to enter the sample.
Examples:
Experiment: Flip a Coin
Outcomes: Two discrete outcomes: H, T
Sample Space: Discrete & Finite
Random Variable: Define {x=1} if H occurs and {x=0} if T occurs
Experiment: Taking an Exam
Outcomes: Grades A,B,C,D,E,F
Sample Space: Discrete and Finite
Random Variable: {y=4} if Grade is A
{y=3} if Grade is B etc
Random Variables
Continuous Random Variables
Cumulative Distribution Function
Two Famous Theorems
iid : independent identically distributed
Simply stating:
The Law of Large Numbers (LLN) says that as n → ∞, the sample mean converges to the population mean, i.e.,
$\bar{x} - \mu \to 0$ as $n \to \infty$
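A small simulation can make the LLN statement concrete. This is only a sketch: the fair-coin population, the seed and the sample sizes are arbitrary choices for illustration, not values from the slides.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable

def sample_mean(n):
    """Mean of n simulated fair-coin tosses coded as 1 (H) / 0 (T)."""
    return sum(rng.randint(0, 1) for _ in range(n)) / n

population_mean = 0.5  # true mean of the 0/1 coin variable

# Absolute gap between sample mean and population mean for growing n.
errors = {n: abs(sample_mean(n) - population_mean) for n in (10, 1_000, 100_000)}

for n, err in errors.items():
    print(n, err)
```

The gap for the largest n should be small, although for any single run it need not shrink monotonically from one n to the next.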
Probability Distribution
A probability distribution is defined for a random variable X which takes values x1, x2, ..., xn with probabilities P(x1), P(x2), ..., P(xn).
The function P(xi) is called the probability mass function; this symbol is used if the variable is discrete. f(x) is called the probability density function; this notation is used if the variable is continuous.
The summary of the distinct values xi of a random variable X together with their probabilities P(x) or f(x) is known as the probability distribution of the random variable.
Discrete prob. Distributions: Binomial, Poisson
Continuous Prob. Distribution : Normal
Selected Discrete Distributions
$^{n}C_{x} = \dfrac{n!}{x!\,(n-x)!}$
Binomial Rule
P(two sixes from 3 dice)
n = 3 trials
Define success: occurrence of a six
Define failure: non-occurrence of a six
r = 2 successes
Formula: if the probability of success in any one trial is p, the probability of r successes in n trials is
$P(r) = \dfrac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{n-r}$
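The binomial rule can be checked numerically. The sketch below assumes a fair die (p = 1/6 for a six); the function name is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# P(two sixes from 3 dice): n = 3, r = 2, p = 1/6 (fair die assumed).
p_two_sixes = binomial_pmf(2, 3, 1 / 6)
print(round(p_two_sixes, 4))  # 15/216, about 0.0694
```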
Solve the Following Questions
P(2 sixes from 3 dice)
P(3 sixes from five throws)
P(less than 2 sixes from 4 dice)
Binomial Theorem
Pascal's Triangle
Poisson Distribution
The Poisson distribution can be used to determine the
probability of a designated number of events occurring, when
these events occur in a continuum of time.
A long-run mean (λ) number of events for the specific time interval of interest is required to find the probability of a designated number of events.
The probability of x successes in a Poisson distribution is given by
$P(x \mid \lambda) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}$
The mean of the Poisson process is always proportional to the length of time; therefore, if the mean is available for one length of time, then the mean for any other required time period can be determined.
Question on Poisson
On average, 12 people per hour ask questions of a decorating consultant in a fabric store. What is the probability that three or more will approach the consultant with questions during a 10 min period?
Solution: Average per hour = 12
10 min = 1/6 of an hour
Average per 10 min: λ = 1/6 * 12 = 2
P(x ≥ 3 | λ=2) = P(x=3 | λ=2) + P(x=4 | λ=2) + ...
= 0.1804 + 0.0902 + 0.0361 + 0.0120 + ...
= 1 - P(x ≤ 2 | λ=2) ≈ 0.3233
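The infinite sum in this solution is easier to evaluate through the complement, 1 - P(x ≤ 2). A minimal sketch follows, with illustrative function and variable names.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(x | lam) = lam^x * e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

lam_hour = 12                      # mean arrivals per hour (from the question)
lam_10min = lam_hour * (10 / 60)   # the mean scales with the length of time

# P(x >= 3) = 1 - [P(0) + P(1) + P(2)]
p_at_least_3 = 1 - sum(poisson_pmf(x, lam_10min) for x in range(3))
print(round(p_at_least_3, 4))
```

Using the complement avoids truncating the tail of the series at an arbitrary term.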
Selected Discrete Distributions(cont)
Selected Continuous Distributions
Normal Distribution
Many years ago I called the Laplace-Gauss curve the NORMAL
curve, which name, while it avoids an international question of
priority, has the disadvantage of leading people to believe that all
other distributions of frequency are in one sense or another
ABNORMAL.
That belief is, of course, not justifiable.
Karl Pearson, 1920.
Normal Distribution(Bell-curve, Gaussian)
Normal Distribution
Transformation of Normal Random Variables
Finding probabilities using the standard normal distribution
Finding values of the standard normal random variable
Excel Functions:
NORMSDIST(z): returns the area to the left of z in the standard normal distribution.
NORMSINV(p): returns the value of z on the standard normal distribution for cumulative probability p.
NORMINV(p, m, s): returns the value of the variable x, normally distributed with mean m and SD s, for cumulative probability p.
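For readers working outside Excel, the same three lookups can be sketched with Python's standard-library `statistics.NormalDist`. The numeric inputs (1.96, 0.975, mean 50, SD 10) are illustrative assumptions, not values prescribed by this slide.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Area to the left of z (counterpart of NORMSDIST(z)).
area_left = std_normal.cdf(1.96)

# z value for a cumulative probability (counterpart of NORMSINV(p)).
z_value = std_normal.inv_cdf(0.975)

# x value for a normal variable with mean m and SD s
# (counterpart of NORMINV(p, m, s)); m = 50 and s = 10 are example values.
x_value = NormalDist(mu=50, sigma=10).inv_cdf(0.975)

# Standardizing x recovers the same z: z = (x - m) / s.
z_from_x = (x_value - 50) / 10

print(round(area_left, 4), round(z_value, 4), round(x_value, 2))
```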
Transformation of normal random variable to standard normal variable
We move the distribution from its centre of 50 to a centre of 0. This is done by subtracting 50 from all the values of X. Thus, we shift the distribution 50 units back so that its new centre is 0. The second thing we need to do is to make the width of the distribution, the standard deviation, equal to 1. This is done by squeezing the width down from 10 to 1. Because the total probability under the curve must remain 1.00, the distribution must grow upwards to maintain the same area. Mathematically, squeezing the curve to make the width 1 is equivalent to dividing the random variable by the standard deviation. The area under the curve adjusts so that the total remains the same. All probabilities adjust accordingly.
Mathematically, $Z = (x - \mu)/\sigma$
[Figure: transformation of X (μ = 50, σ = 10) to Z (centre 0, SD 1.0): subtraction x - μ, then division by σ]
Normal Probability Plots
After collecting data, a common problem is deciding whether a population or random variable is normally distributed.
Since the distribution of a random sample from a population will approximate the distribution of the population (a larger sample provides a better approximation), if the population is normally distributed, then a graph of a sample should reflect it.
A sensitive graphical technique called the normal probability plot helps.
Normal Probability Plots
A Normal Probability Plot is a plot of sample data versus normal scores.
Normal scores are the data we would expect to get by taking a sample of the same size from the standard normal distribution.
If the sample is from a normal population, then the normal probability plot should be linear (or roughly linear).
Guidelines for Inference from Normal Probability Plot
These guidelines should be interpreted loosely for small samples, but can be interpreted rather strictly for large samples.
If the plot is roughly linear, then accept as reasonable that the population is approximately normally distributed.
If the plot shows systematic deviations from linearity (e.g. if it displays significant curvature), then conclude that the population is probably not approximately normally distributed.
** Shapes of these plots are based on ideal situations, i.e. large samples from exact distributions.
Exhibit-1
The Internal Revenue Service publishes data on federal individual income tax returns in Statistics of Income, Individual Income Tax Returns. A sample of 12 returns reveals the adjusted gross incomes, in thousands of dollars, shown below:
9.7 93.1 33.0 21.2
81.4 51.1 43.5 10.6
12.8 7.8 18.1 12.7
a) Construct a normal probability plot for these data.
b) Assess the normality of adjusted gross income.
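The coordinates of such a plot can be sketched in Python as follows. The slides do not say which formula they use for normal scores; this sketch assumes Blom's approximation, (i - 0.375) / (n + 0.25), which is one common choice, so published tables may differ slightly.

```python
from statistics import NormalDist

# Adjusted gross incomes (thousands of dollars) from the exhibit.
incomes = [9.7, 93.1, 33.0, 21.2, 81.4, 51.1, 43.5, 10.6, 12.8, 7.8, 18.1, 12.7]

def normal_scores(n):
    """Approximate expected standard-normal order statistics.

    Assumption: Blom's plotting positions; other conventions exist.
    """
    std = NormalDist()
    return [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Pair each normal score with the ordered observation; plotting these
# points (score on one axis, data on the other) gives the probability plot.
pairs = list(zip(normal_scores(len(incomes)), sorted(incomes)))

for score, value in pairs:
    print(f"{score:6.2f}  {value:5.1f}")
```

If the points fall roughly on a straight line, approximate normality is plausible; marked curvature suggests otherwise.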
The normal probability plot in the figure above displays significant curvature. Evidently, adjusted gross incomes are not approximately normally distributed.
Detecting Outliers using Normal Probability Plot
The Dept. of Agriculture publishes data on chicken consumption. Last year's chicken consumption, in kg, for randomly selected people is displayed in the table below. Use a normal probability plot to discuss the distribution of chicken consumption and to detect any outliers.
47 39 62 49 50 70 59 45 72
53 55 0 65 63 53 51 50
On removing the outlier 0 kg from the data set, it appears plausible that among people who eat chicken, the amounts they consume annually are approximately normally distributed.
Statistics - The Easiest Subject
Sir Ronald A. Fisher (1890-1962)
Wrote the first books on statistical methods (1926 & 1936):
"A student should not be made to read Fisher's books unless he has read them before."
George W. Snedecor (1882-1974)
Taught at Iowa State Univ., where he wrote a college textbook (1937):
"Thank God for Snedecor; now we can understand Fisher." (named the distribution for Fisher)
Sampling
Procedure by which some members of the defined population are selected as representative of the entire population.
Sampling Methods
Example 1: I know that the market for product X is Chinesevillages, is my assumption about them right?
Example 2: I know the product is for a peculiar population. Is
Delhi a right place to market the product?
Population -> Sample or Sampling Distribution (using C.L.T.) -> Analysis of data collected from previous step
What do you know about the population?
How can you be sure that your sample is fit for analysis?
What can you say about the population now?
What kind of analysis shall we carry out? Why?
Why do we use samples?
Get information from large populations:
At minimal cost.
At maximum speed (least time).
At increased accuracy.
Using enhanced tools.
Sampling Terminology
Population: the relevant target group for study.
Census: data collection from the entire population.
Sample: a subset of the target population, selected to represent the population.
Sample Unit: elements of the targeted population available for selection during sampling.
Sample Frame: a list or other way of identifying units from which a sample is to be drawn.
Sampling Terminology
Sample Representativeness: degree to which the sample is similar to the target population in terms of key characteristics.
Incidence Rate: percentage of people in the general population or on a list that fit the qualifications of those the researcher wishes to describe.
Sampling Error: discrepancies between data generated from a sample and the actual population data as a result of sampling instead of a census.
Non-Sampling Error: all other biases at any stage, including an inaccurate population definition, an old sampling frame, errors in measurement, etc., that can occur regardless of whether a sample or census is used.
Steps in planning sample study
Step 1: Define the target population (key characteristics).
Step 2: Define the data collection method and margin of error.
Step 3: Obtain the designated sampling frame.
Step 4: Determine the sampling method.
Considering time / area / budget / precision
Non-probability / probability method of sampling
Step 5: Determine sample size.
Step 6: Develop operational procedure.
Types of Sampling Methods
Non-Probability Samples: Convenience, Snowball, Judgment, Quota
Probability Samples: Simple Random, Systematic, Stratified, Cluster, Multi-Stage Sampling
Stratified Random Sampling
Separates the population into mutually exclusive sets (strata), and then draws simple random samples from each stratum.
Strata are similar to blocks in an experiment.
With this procedure we can acquire information about:
the whole population
each stratum
the relationships among strata.
Examples of strata:
Sex: Male, Female
Age: under 20, 20-30, 31-40, 41-50
Occupation: professional, clerical, blue-collar
Stratified Random Sampling
There are several ways to build a stratified sample. For example, keep the proportion of each stratum in the population.
Stratum  Income            Population proportion  Stratum size
1        under $15,000     25%                    250
2        $15,000-29,999    40%                    400
3        $30,000-50,000    30%                    300
4        over $50,000      5%                     50
Total                                             1,000
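The proportional allocation in this table can be reproduced with a few lines of Python. This is a sketch only; the dictionary keys simply restate the income brackets, and rounding to whole units is an assumption that happens to be exact here.

```python
# Population proportion of each income stratum (from the table).
proportions = {
    "under $15,000": 0.25,
    "$15,000-29,999": 0.40,
    "$30,000-50,000": 0.30,
    "over $50,000": 0.05,
}
total_sample = 1000

# Proportional allocation: each stratum keeps its population share.
sizes = {name: round(share * total_sample) for name, share in proportions.items()}

for name, size in sizes.items():
    print(f"{name:16s} {size}")
```

With less convenient proportions, rounded stratum sizes may not add up to the total, so a real allocation would need a rule for distributing the remainder.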
Determining Sample Size
NON-STATISTICAL APPROACH
Arbitrary % of population.
Conventional: suggested by past research, industry standards.
Cost / time constraints driven.
Determining Sample Size
Statistical Approach: Using Confidence Intervals
3 Factors in Determining Sample Size
Confidence Intervals (confidence in estimates)
Sampling Error: precision or tolerance for error around the estimate, stated in percentage points.
Estimated Standard Deviation: estimate of variability of population characteristics based on prior information.
Confidence level  Z
90%               1.65
95%               1.96
98%               2.33
99%               2.58
Determining Sample Size: 3 Questions
1. How close do you want your sample estimate to be to the unknown parameter? The answer to this question is denoted by e, the desired accuracy range.
2. What do you want the confidence level to be so that the distance between the estimate and the parameter is less than or equal to e?
3. What is your estimate of the variance (or standard deviation) of the population in question?
Ans: this is often unknown and we need to estimate it: use the range (if you are sure no outliers are present), σ = Range/6; or, if the population is approximately normal and you can get the 95% bounds on values in the population, divide the difference between the upper and lower bound by 4; or conduct a pilot survey to estimate σ.
Sample size formula for means (Interval or Ratio data)
$n = \dfrac{z^{2}\,\sigma^{2}}{e^{2}} \times g$
Where,
n = required sample size
z = value of Z for the desired confidence level
σ = estimated population SD
e = desired accuracy range
g = design effect
Sample size formula for proportions (Nominal or Ordinal data)
$n = \dfrac{z^{2}\,(P \times Q)}{e^{2}} \times g$
where,
n = required sample size
z = the Z value for the desired confidence level
P = estimate of the population proportion, in percent
Q = (100 - P)
e = desired accuracy range
g = design effect
Exhibit 1:
A market research firm wants to conduct a survey to estimate the average amount spent on entertainment by each person visiting a popular resort. The people who plan the survey would like to be able to determine the average amount spent by all people visiting the resort to within $120, with 95% confidence. From past operation of the resort, an estimate of the population standard deviation is $400. What is the minimum required sample size?
$n = \dfrac{z^{2}\,\sigma^{2}}{e^{2}} \times g = \dfrac{(1.96)^{2} \times 160000}{120^{2}} = 42.684 \approx 43$ (taking g = 1)
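The calculation for this exhibit can be sketched in Python. The function name and the default design effect g = 1 are assumptions for illustration; the result is rounded up because a fractional respondent is not possible.

```python
from math import ceil

def sample_size_mean(z, sigma, e, g=1.0):
    """n = z^2 * sigma^2 / e^2 * g (g = design effect, assumed 1 by default)."""
    return z**2 * sigma**2 / e**2 * g

# Resort survey: 95% confidence (z = 1.96), sigma = $400, e = $120.
n_raw = sample_size_mean(z=1.96, sigma=400, e=120)
n_required = ceil(n_raw)  # round up to the next whole respondent

print(round(n_raw, 3), n_required)
```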
Exhibit 2:
The manufacturer of a sports car wants to estimate the proportion of people in a given income bracket who are interested in the model. The company wants to know the population proportion p to within 0.10 with 99% confidence. Current company records indicate that the proportion p may be around 0.25. What is the minimum required sample size for this survey?
$n = \dfrac{z^{2}\,p\,q}{e^{2}} \times g = \dfrac{(2.576)^{2} \times (0.25)(0.75)}{0.10^{2}} = 124.42 \approx 125$ (taking g = 1)
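A matching sketch for the proportion case is shown below. Here p and e are expressed as fractions (0.25 and 0.10) rather than percentages, and the design effect defaults to 1; both are assumptions of this example.

```python
from math import ceil

def sample_size_proportion(z, p, e, g=1.0):
    """n = z^2 * p * q / e^2 * g, with q = 1 - p (p and e as fractions)."""
    q = 1 - p
    return z**2 * p * q / e**2 * g

# Sports car survey: 99% confidence (z = 2.576), p about 0.25, e = 0.10.
n_raw = sample_size_proportion(z=2.576, p=0.25, e=0.10)
n_required = ceil(n_raw)  # round up to the next whole respondent

print(round(n_raw, 2), n_required)
```

When no prior estimate of p is available, p = 0.5 gives the largest value of p*q and therefore the most conservative sample size.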
Sample Size Calculations
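Simple random / systematic sampling:
$n = \dfrac{z^{2}\,p\,q}{d^{2}} = \dfrac{1.96^{2} \times 0.15 \times 0.85}{0.03^{2}} \approx 544$
Cluster sampling:
$n = g \times \dfrac{z^{2}\,p\,q}{d^{2}} = \dfrac{2 \times 1.96^{2} \times 0.15 \times 0.85}{0.03^{2}} \approx 1088$
Where,
z = Z score associated with the level of significance
p = expected prevalence
q = 1 - p
d = absolute precision
g = design effect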
Sampling Cost
Sampling Cost = Fixed cost + Variable cost
Fixed cost is independent of sample size, e.g. cost of planning and organizing the sampling experiment.
Variable cost increases with sample size; it includes the cost of selection, measurement and recording of each sampled item.
Error Cost: more difficult to estimate than sampling cost. Usually it increases more rapidly than linearly with the amount of error. Often a quadratic formula is used.
Total Cost = Sampling cost + Error cost
Sampling Distributions: Definitions and Key Concepts
A sample statistic used to estimate an unknown population parameter is called an estimate.
The discrepancy between the estimate and the true parameter value is known as sampling error.
A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.
Distribution of Sample Means
How do the sample mean and variance vary in repeated samples of size n drawn from the population?
Generally, the exact distribution is difficult to calculate.
What can be said about the distribution of the sample mean when the sample is drawn from an arbitrary population?
In many cases we can approximate the distribution of the sample mean, when n is large, by a normal distribution.
The famous Central Limit Theorem
Central Limit Theorem
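The normal approximation for the sample mean can be illustrated by simulation. Everything specific in this sketch is an assumption chosen for illustration: an exponential population with mean 1 and SD 1 (clearly non-normal), samples of size n = 50, 5000 replicates and a fixed seed.

```python
import random
from math import sqrt
from statistics import stdev

rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
n = 50                  # size of each sample
replicates = 5000       # number of repeated samples

def one_sample_mean():
    """Mean of n draws from an exponential population (mean 1, SD 1)."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

sample_means = [one_sample_mean() for _ in range(replicates)]

grand_mean = sum(sample_means) / replicates
theoretical_se = 1.0 / sqrt(n)          # sigma / sqrt(n)
observed_se = stdev(sample_means)

# Under a normal approximation, roughly 95% of sample means
# should lie within 2 standard errors of the population mean.
share_within_2se = sum(abs(m - 1.0) <= 2 * theoretical_se for m in sample_means) / replicates

print(round(grand_mean, 3), round(observed_se, 3), round(theoretical_se, 3), share_within_2se)
```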
Estimators and their properties
Unbiased: if the estimator's expected value is equal to the population parameter it estimates. The sample mean is an unbiased estimator of the population mean.
Efficiency: an estimator is efficient if it has a relatively small variance (not S.D.).
Consistency: if the estimator's probability of being close to the parameter it estimates increases with sample size.
Sufficient: an estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates.
Estimators of the population mean can be the sample mean and the sample median.
S² is an unbiased estimator of σ², but the SD (S) is not an unbiased estimator of the population SD σ. We still use S as an estimator, ignoring the small bias, relying on the fact that S² is an unbiased estimator of σ².
             MEAN                              MEDIAN
UNBIASED     YES                               YES
EFFICIENCY   HIGHER THAN MEDIAN
SUFFICIENCY  USES ALL VALUES FOR CALCULATION   USES ONLY POSITION
Degrees of freedom
When we calculate the sample variance, the deviations are taken from $\bar{x}$ and not from μ. The reason is simple: while sampling, the population mean is almost always unknown and we have to estimate μ using $\bar{x}$. This reduces our degrees of freedom from n to (n-1). But taking squared deviations from $\bar{x}$ induces a downward bias in the deviations.
Dividing the sum of squared deviations by only its d.f. will yield an unbiased estimate of the population variance.
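The effect of dividing by (n-1) rather than n can be verified exactly on a tiny population by enumerating every possible sample. This sketch assumes sampling with replacement (independent draws), which is the setting in which the (n-1) divisor is exactly unbiased; the five values are borrowed from the basketball exhibit purely as example data.

```python
from itertools import product
from statistics import mean, pvariance

population = [76, 78, 79, 81, 86]
sigma2 = pvariance(population)  # population variance (divides by N)

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    m = mean(sample)
    return sum((x - m) ** 2 for x in sample)

n = 2
# All 25 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = mean(sum_sq_dev(s) / n for s in samples)                # biased downward
avg_div_n_minus_1 = mean(sum_sq_dev(s) / (n - 1) for s in samples)  # unbiased

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, the (n-1) version equals the population variance, while the n version comes out too small by the factor (n-1)/n.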
Exhibit: Sampling Error & Need of Sampling Distribution
Sampling Error: error resulting from using a sample, instead of a census, to estimate a population quantity.
Suppose the population of interest consists of the heights (in inches) of the five starting players on a men's basketball team.
76 78 79 81 86 (μ = 80)
(i) Determine the sampling distribution of the mean for a random sample of (a) two heights, (b) four heights, from the population of five heights.
(ii) Make some observations about sampling error when the mean of a random sample of (a) two heights, (b) four heights, is used to estimate the population mean μ.
(iii) Employ the sampling distribution of the mean obtained above to find the probability that the sampling error made in estimating the population mean, μ, by the sample mean will be at most 1 inch; that is, determine the probability that the sample mean will be within 1 inch of μ.
Considering a sample size of two
Sample  76,78  76,79  76,81  76,86  78,79  78,81  78,86  79,81  79,86  81,86
Mean    77.0   77.5   78.5   81.0   78.5   79.5   82.0   80.0   82.5   83.5
Probability of any one sample being selected = 1/10 = 0.1
Probability distribution of the random variable $\bar{x}$ (the sampling distribution of the mean):
Mean         77.0  77.5  78.5  79.5  80.0  81.0  82.0  82.5  83.5
Probability  0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.1
(ii) Using the above results we can make some simple but significant observations about sampling error when the mean, $\bar{x}$, of a random sample of two heights is used to estimate the population mean μ.
It is unlikely that the mean of the sample selected will be equal to 80. In fact, only 1 of the 10 samples has mean 80; thus in this case the chance is only 0.1 that $\bar{x}$ will equal μ; some sampling error is likely.
(iii) Since μ = 80 inches, we need to find:
$P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.5,\ 80.0,\ \text{or}\ 81.0) = P(79.5) + P(80.0) + P(81.0) = 0.1 + 0.1 + 0.1 = 0.3$
If we take a random sample of two heights, there is a 30% chance that the mean of the sample will be within 1 inch of the population mean.
Considering a sample size of four
Sample  76,78,79,81  76,78,79,86  76,78,81,86  76,79,81,86  78,79,81,86
Mean    78.50        79.75        80.25        80.50        81.00
Probability of any one sample being selected = 1/5 = 0.2
Probability distribution of the random variable $\bar{x}$ (the sampling distribution of the mean):
Mean         78.50  79.75  80.25  80.50  81.00
Probability  0.2    0.2    0.2    0.2    0.2
(ii) Using the above results we observe that none of the samples of four heights has a mean equal to the population mean 80; thus when the mean, $\bar{x}$, of a random sample of four heights is used to estimate the population mean, μ, some sampling error is certain.
(iii) Since μ = 80 inches, we need to find:
$P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.75,\ 80.25,\ 80.50,\ \text{or}\ 81.0) = P(79.75) + P(80.25) + P(80.50) + P(81.0) = 0.2 + 0.2 + 0.2 + 0.2 = 0.8$
If we take a random sample of four heights, there is an 80% chance that the mean of the sample will be within 1 inch of the population mean.
Hence, as sample size n -> ∞, S.E. -> 0
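Both sampling distributions in this exhibit can be reproduced by enumerating every possible sample without replacement. The sketch below uses exact fractions to avoid rounding; the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations

heights = [76, 78, 79, 81, 86]
mu = Fraction(sum(heights), len(heights))  # population mean

def prob_within(n, tol=1):
    """P(|sample mean - mu| <= tol) for samples of size n without replacement.

    Every combination is assumed equally likely (simple random sampling).
    """
    means = [Fraction(sum(s), n) for s in combinations(heights, n)]
    hits = sum(1 for m in means if abs(m - mu) <= tol)
    return Fraction(hits, len(means))

print(prob_within(2))  # sample size two
print(prob_within(4))  # sample size four
```

Comparing the two results shows the chance of landing within 1 inch of μ rising as the sample size grows from two to four.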