8/4/2019 Stage-3.1 Distributions and Sampling

1/60

Concepts(Review of Probability)

In probability, we assume that the population and its parameters are known and compute the probability of drawing a particular sample.

In statistics, we assume that the population and its parameters are unknown and the sample is used to infer the values of the parameters.

Sampling variability: different samples give different estimates of population parameters.

Sampling variability leads to sampling error.

Probability is deductive (general -> particular)

Statistics is inductive (particular -> general)


2/60

Probability Concepts

Random experiment: a procedure whose outcome cannot be predicted in advance. E.g. toss a coin twice.

Sample space (S): a mutually exclusive, collectively exhaustive listing of all possible outcomes.

$S = \{(H,H), (H,T), (T,H), (T,T)\}$

Event (A): a set of outcomes (a subset of S). E.g. no heads: $A = \{(T,T)\}$

Union (or): E.g. A = heads on first toss, B = heads on second toss; $A \cup B = \{(H,T), (H,H), (T,H)\}$

Intersection (and): E.g. A = heads on first toss, B = heads on second toss; $A \cap B = \{(H,H)\}$

Complement of event A: the set of all outcomes not in A. E.g. $A = \{(T,T)\}$, $A^c = \{(H,H), (H,T), (T,H)\}$
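The set operations above can be checked mechanically. The following is a minimal sketch, assuming outcomes are represented as ordered pairs of "H"/"T"; the variable names are illustrative and not part of the slides.

```python
from itertools import product

# Sample space for tossing a coin twice: ordered pairs (first toss, second toss).
S = set(product("HT", repeat=2))

A = {s for s in S if s[0] == "H"}  # event: heads on first toss
B = {s for s in S if s[1] == "H"}  # event: heads on second toss
no_heads = {("T", "T")}            # event: no heads at all

union = A | B                # A or B
intersection = A & B         # A and B
complement = S - no_heads    # everything not in the "no heads" event

print("S      =", sorted(S))
print("A u B  =", sorted(union))
print("A n B  =", sorted(intersection))
print("not TT =", sorted(complement))
```

Python's built-in `set` type maps directly onto union, intersection and complement, which keeps the sketch close to the notation on the slide.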


3/60

Probability Model

Example: There is a 95% chance that the sale of apples on Monday will be 120 kg.

Estimate = ______

Probability of error = ______

Interpretation:

To reach this kind of judgement, we need data and a fitted probability model.


4/60

Process of getting a probability model

Specify experiment → Recognize all outcomes → Sample space → Assign a number to each outcome → Random variable x → Determine probability for each value of x

The determination of the probability distribution completes the process of describing the probability model.


5/60

Some definitions

Random Experiment

Sample Point

Sample Space

Random Variable

Event: any subset of the sample space of a random variable is called an event.

A random sample gives a non-zero (equal) chance to every unit of the population to enter the sample.


6/60

Examples:

Experiment: Flip a Coin

Outcomes: Two discrete outcomes: H, T

Sample Space: Discrete and Finite

Random Variable: Define {x=1} if H occurs and {x=0} if T occurs

Experiment: Taking an Exam

Outcomes: Grades A,B,C,D,E,F

Sample Space: Discrete and Finite

Random Variable: {y=4} if Grade is A

{y=3} if Grade is B etc


7/60

Random Variables


8/60

Continuous Random Variables


9/60

Cumulative Distribution Function


10/60

Two Famous Theorems

iid : independent identically distributed


11/60

Simply stating:

The Law of Large Numbers (LLN) says that as $n \to \infty$, the sample mean converges to the population mean, i.e.,

$\bar{x} - \mu \to 0$ as $n \to \infty$
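A small simulation can make the Law of Large Numbers concrete. This is only a sketch under assumptions not in the slides: a fair coin coded as 0/1 (so the population mean is 0.5), a fixed seed for repeatability, and a hypothetical helper name `sample_mean_of_flips`.

```python
import random

def sample_mean_of_flips(n, seed=0):
    """Mean of n simulated fair-coin flips coded as 0 (tail) or 1 (head)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.randint(0, 1) for _ in range(n)) / n

# The population mean is 0.5; the sample mean should drift towards it as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_flips(n))
```

The exact values depend on the seed, so none are quoted here; the point is only that the gap between the sample mean and 0.5 tends to shrink as n increases.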


12/60

Probability Distribution

A probability distribution is defined for a random variable x which takes values x1, x2, ..., xn with probabilities P(x1), P(x2), ..., P(xn).

The function P(xi) is called the probability mass function; this symbol is used if the variable is discrete. f(x) is called the probability density function; this notation is used if the variable is continuous.

The summary of distinct values xi of a random variable X together with their probabilities P(x) or f(x) is known as the probability distribution of the random variable.

Discrete probability distributions: Binomial, Poisson

Continuous probability distribution: Normal


13/60

Selected Discrete Distributions

$$^nC_x = \frac{n!}{x!\,(n-x)!}$$


14/60

Binomial Rule

P(two sixes from 3 dice)

n = 3 trials

Define success: occurrence of a six

Define failure: non-occurrence of a six

r = 2 successes

Formula: if the probability of success in any one trial is p, the probability of r successes in n trials is

$$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$$
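The binomial rule can be evaluated directly. The sketch below assumes a fair die (p = 1/6); `binomial_pmf` is an illustrative helper name, not something defined in the slides.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Two sixes from three dice: n = 3, r = 2, p = 1/6 (fair die assumed).
p_two_sixes = binomial_pmf(2, 3, 1 / 6)
print(round(p_two_sixes, 4))  # 15/216, about 0.0694
```

`math.comb(n, r)` computes n!/(r!(n-r)!) exactly for integers, so the only floating-point step is the multiplication by the powers of p and 1 - p.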


15/60

Solve Following Questions

P(2 sixes from 3 dice)

P(3 sixes from five throws)

P(less than 2 sixes from 4 dice)

Binomial Theorem

Pascal's Triangle


16/60

Poisson Distribution

The Poisson distribution can be used to determine the probability of a designated number of events occurring, when these events occur in a continuum of time.

A long-run mean ($\lambda$) number of events for the specific time interval of interest is required to find the probability of a designated number of events.

The probability of x successes in a Poisson distribution is given by

$$P(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$$

The mean of the Poisson process is always proportional to the length of time; therefore, if the mean is available for one length of time, then the mean for any other required time period can be determined.


17/60

Question on Poisson

On average, 12 people per hour ask questions of a decorating consultant in a fabric store. What is the probability that three or more will approach the consultant with questions during a 10 min period?

Solution: Average per hour = 12

10 min = 1/6 of an hour

Average per 10 min = 1/6 × 12 = 2

$P(x \ge 3 \mid \lambda = 2) = P(x = 3 \mid \lambda = 2) + P(x = 4 \mid \lambda = 2) + \dots$

$= 0.1804 + 0.0902 + 0.0361 + 0.0120 + \dots$

$= 1 - P(x \le 2 \mid \lambda = 2) \approx 0.3233$
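The consultant question can be reproduced numerically. This is a minimal sketch; `poisson_pmf` is an illustrative helper, and the infinite tail is handled through the complement 1 - P(x ≤ 2) rather than by summing terms.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean number of events is lam."""
    return lam**x * exp(-lam) / factorial(x)

# Rescale the hourly mean (12) to a 10-minute window: 12 * (10/60) = 2.
lam_10_min = 12 * (10 / 60)

# P(x >= 3) = 1 - [P(0) + P(1) + P(2)]
p_three_or_more = 1 - sum(poisson_pmf(k, lam_10_min) for k in range(3))

print(round(poisson_pmf(3, 2), 4))  # first term of the tail sum
print(round(p_three_or_more, 4))    # total tail probability
```

Using the complement avoids truncating the infinite sum and makes the proportional rescaling of the mean explicit.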



18/60

Selected Discrete Distributions(cont)


19/60

Selected Continuous Distributions


20/60

Normal Distribution

Many years ago I called the Laplace-Gauss curve the NORMAL

curve, which name, while it avoids an international question of

priority, has the disadvantage of leading people to believe that all

other distributions of frequency are in one sense or another

ABNORMAL.

That belief is, of course, not justifiable.

Karl Pearson, 1920.



21/60

Normal Distribution(Bell-curve, Gaussian)


22/60


23/60

Normal Distribution

Transformation of normal random variables

Finding probabilities using the standard normal distribution

Finding values of the standard normal random variable

Excel functions:

NORMSDIST(z) returns the area to the left of z in the standard normal distribution.

NORMSINV(p) returns the value of z on the standard normal distribution for cumulative probability p (area to the left).

NORMINV(p,m,s) returns the value of a variable x with normal distribution having cumulative probability p, mean m and SD s.
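For readers without Excel, the same three lookups can be sketched with Python's standard library. The mapping to the Excel functions is an assumed equivalence for illustration, and the inputs (z = 1.96, p = 0.975, mean 50, SD 10) are example values chosen here.

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, SD 1

# Area to the left of z, in the role of NORMSDIST(1.96).
area_left = std.cdf(1.96)

# z value for a given left-tail probability, in the role of NORMSINV(0.975).
z = std.inv_cdf(0.975)

# x value for a normal variable with mean 50 and SD 10,
# in the role of NORMINV(0.975, 50, 10).
x = NormalDist(mu=50, sigma=10).inv_cdf(0.975)

print(round(area_left, 4), round(z, 4), round(x, 2))
```

`statistics.NormalDist` is available from Python 3.8 onwards; `cdf` gives left-tail areas and `inv_cdf` inverts them.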



24/60

Transformation of normal random variable to standard normal variable

We move the distribution from its centre of 50 to a centre of 0. This is done by subtracting 50 from all the values of X. Thus, we shift the distribution 50 units back so that its new centre is 0. The second thing we need to do is to make the width of the distribution, the standard deviation, equal to 1. This is done by squeezing the width down from 10 to 1. Because the total probability under the curve must remain 1.00, the distribution must grow upwards to maintain the same area. Mathematically, squeezing the curve to make the width 1 is equivalent to dividing the random variable by the standard deviation. The area under the curve adjusts so that the total remains the same. All probabilities adjust accordingly. Mathematically, $Z = (x - \mu)/\sigma$.

[Figure: transformation of X ($\mu = 50$, $\sigma = 10$) to Z (centre 0, $\sigma = 1.0$): subtraction ($x - \mu$), then division by $\sigma$]
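The shift-and-squeeze argument can be verified numerically: a probability computed on the original scale should equal the one computed after standardising. The sketch uses the slide's mean of 50 and SD of 10; the value x = 65 and the helper name `to_z` are assumptions made for illustration.

```python
from statistics import NormalDist

mu, sigma = 50, 10  # centre and width of X, as on the slide

def to_z(x, mu, sigma):
    """Standardise x: subtract the mean, then divide by the SD."""
    return (x - mu) / sigma

x = 65                 # hypothetical value of X
z = to_z(x, mu, sigma)  # (65 - 50) / 10 = 1.5

p_via_z = NormalDist().cdf(z)             # P(Z < 1.5) on the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)   # P(X < 65) on the original scale

print(z, round(p_via_z, 4), round(p_direct, 4))
```

Both routes give the same area, which is what "all probabilities adjust accordingly" amounts to in practice.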


25/60

Normal Probability Plots

After collecting data, a common problem is deciding whether a population or random variable is normally distributed.

Since the distribution of a random sample from a population will approximate the distribution of the population (a larger sample provides a better approximation), if the population is normally distributed, then a graph of a sample should reflect it.

A sensitive graphical technique called the normal probability plot helps.


26/60

Normal Probability Plots

A normal probability plot is a plot of sample data versus normal scores.

Normal scores are the data we would expect to get by taking a sample of the same size from the standard normal distribution.

If the sample is from a normal population, then the normal probability plot should be linear (or roughly linear).


27/60

Guidelines for Inference from Normal Probability Plot

These guidelines should be interpreted loosely for small samples, but can be interpreted rather strictly for large samples.

If the plot is roughly linear, then accept as reasonable that the population is approximately normally distributed.

If the plot shows systematic deviations from linearity (e.g. if it displays significant curvature), then conclude that the population is probably not approximately normally distributed.

** Shapes of these plots are based on ideal situations, i.e. large samples from exact distributions.


28/60

Exhibit-1

The Internal Revenue Service publishes data on federal individual income tax returns in Statistics of Income, Individual Income Tax Returns. A sample of 12 returns reveals the adjusted gross incomes, in thousands of dollars, shown below.

 9.7  93.1  33.0  21.2
81.4  51.1  43.5  10.6
12.8   7.8  18.1  12.7

a) Construct a normal probability plot for these data.

b) Assess the normality of adjusted gross income.


29/60

The normal probability plot for these data displays significant curvature. Evidently, adjusted gross incomes are not approximately normally distributed.
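The coordinates of such a plot can be sketched without a plotting library. Two assumptions here are not from the slides: normal scores are approximated with Blom's plotting positions (i - 0.375)/(n + 0.25), one of several conventions in use, and the straightness of the plot is summarised by the correlation between the ordered data and the scores. The helper names are illustrative.

```python
from statistics import NormalDist, mean

# Adjusted gross incomes (thousands of dollars) from the exhibit.
incomes = [9.7, 93.1, 33.0, 21.2, 81.4, 51.1, 43.5, 10.6, 12.8, 7.8, 18.1, 12.7]

def normal_scores(n):
    """Approximate expected standard-normal order statistics (Blom's formula, assumed)."""
    std = NormalDist()
    return [std.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def pearson_r(xs, ys):
    """Plain Pearson correlation, used here as a rough straightness measure."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

ordered = sorted(incomes)
scores = normal_scores(len(ordered))
r = pearson_r(ordered, scores)

# Each pair is one point of the normal probability plot.
for value, score in zip(ordered, scores):
    print(f"{value:6.1f}  {score:6.2f}")
print("correlation:", round(r, 3))
```

A correlation clearly below 1 is consistent with the curvature noted above, although the visual check of the plot remains the primary tool described in these slides.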



30/60

Detecting Outliers using Normal Probability Plot

The Dept. of Agriculture publishes data on chicken consumption. Last year's chicken consumption, in kg, for randomly selected people is displayed in the table below. Use a normal probability plot to discuss the distribution of chicken consumption and to detect any outliers.

47 39 62 49 50 70 59 45 72

53 55 0 65 63 53 51 50


31/60

On removing the outlier 0 kg from the data set, it appears plausible that, among people who eat chicken, the amounts they consume annually are approximately normally distributed.


32/60

Statistics- The Easiest Subject

Sir Ronald A. Fisher (1890-1962)

Wrote the first books on statistical methods (1926 & 1936):

"A student should not be made to read Fisher's books unless he has read them before."

George W. Snedecor (1882-1974)

Taught at Iowa State Univ., where he wrote a college textbook (1937):

"Thank God for Snedecor; now we can understand Fisher."

(Named the distribution for Fisher.)


33/60

Sampling

Procedure by which some members of the defined population are selected as representative of the entire population.


34/60

Sampling Methods

Example 1: I know that the market for product X is Chinese villages. Is my assumption about them right?

Example 2: I know the product is for a peculiar population. Is Delhi the right place to market the product?

Population → Sample or Sampling Distribution (using C.L.T.) → Analysis of data collected from previous step

What do you know about the population?

How can you be sure that your sample is fit for analysis?

What kind of analysis shall we carry out? Why?

What can you say about the population now?


35/60

Why do we use samples?

Get information from large populations:

At minimal cost.

At maximum speed (least time).

At increased accuracy.

Using enhanced tools.


36/60

Sampling Terminology

Population: the relevant target group for study.

Census: data collection from the entire population.

Sample: a subset of the target population, selected to represent the population.

Sample Unit: elements of the targeted population available for selection during sampling.

Sample Frame: a list or other way of identifying units from which a sample is to be drawn.


37/60

Sampling Terminology

Sample Representativeness: degree to which the sample is similar to the target population in terms of key characteristics.

Incidence Rate: percentage of people in the general population or on a list that fit the qualifications of those the researcher wishes to describe.

Sampling Error: discrepancies between data generated from a sample and the actual population data, as a result of sampling instead of taking a census.

Non-Sampling Error: all other biases at any stage, including an inaccurate population, an old sampling frame, errors in measurement etc., that can occur regardless of whether a sample or census is used.


38/60

Steps in planning sample study

Step 1: Define the target population (key characteristics).

Step 2: Define the data collection method and margin of error.

Step 3: Obtain the designated sampling frame.

Step 4: Determine the sampling method.

Considering time / area / budget / precision

Non-probability / probability method of sampling

Step 5: Determine sample size.

Step 6: Develop operational procedure.


39/60

Types of Sampling Methods

Non-Probability Samples: Convenience, Snowball, Judgment, Quota

Probability Samples: Simple Random, Systematic, Stratified, Cluster, Multi-Stage Sampling


40/60

Stratified Random Sampling

Separates the population into mutually exclusive sets (strata), and then draws simple random samples from each stratum.

Strata are similar to blocks in an experiment.

With this procedure we can acquire information about:

the whole population

each stratum

the relationships among strata

Sex: male, female

Age: under 20, 20-30, 31-40, 41-50

Occupation: professional, clerical, blue-collar


41/60

Stratified Random Sampling

There are several ways to build a stratified sample. For example, keep the proportion of each stratum in the population.

Stratum   Income            Population proportion   Stratum size
1         under $15,000     25%                     250
2         $15,000-29,999    40%                     400
3         $30,000-50,000    30%                     300
4         over $50,000      5%                      50
Total                                               1,000
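Proportional allocation, as in the table above, is a single multiplication per stratum. The sketch below reuses the table's proportions and total of 1,000; the function name and the dictionary layout are illustrative assumptions.

```python
def proportional_allocation(proportions, total_sample):
    """Split a total sample size across strata in proportion to population shares."""
    return {name: round(share * total_sample) for name, share in proportions.items()}

# Population shares of each income stratum, taken from the table.
strata = {
    "under $15,000": 0.25,
    "$15,000-29,999": 0.40,
    "$30,000-50,000": 0.30,
    "over $50,000": 0.05,
}

sizes = proportional_allocation(strata, 1000)
for name, size in sizes.items():
    print(f"{name:16s} {size}")
```

With shares that do not divide the total evenly, rounding each stratum separately can leave the sizes a unit or two off the target total, so a real design would need a rule for distributing the remainder.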


42/60

Determining Sample Size

NON STATISTICAL APPROACH

Arbitrary % of population.

Conventional: suggested by past research, industry standards.

Cost/time constraint driven.


43/60

Determining Sample Size

Statistical Approach: Using Confidence Intervals

3 Factors in Determining Sample Size

Confidence Intervals (confidence in estimates)

Sampling Error: precision or tolerance for error around the estimate, stated in percentage points.

Estimated Standard Deviation: estimate of variability of population characteristics based on prior information.

Confidence level   Z
90%                1.65
95%                1.96
98%                2.33
99%                2.58
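The Z values in the table are two-sided critical values of the standard normal distribution. A minimal sketch of how they can be recomputed, assuming a symmetric interval so that each tail holds (1 - level)/2; `z_for_confidence` is an illustrative name.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value: the z with `level` of the area between -z and +z."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  {z_for_confidence(level):.3f}")
```

The 90% value comes out near 1.645, which tables commonly round to either 1.64 or 1.65; the slide uses 1.65.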



44/60

Determining Sample Size: 3 Questions

1. How close do you want your sample estimate to be to the unknown parameter? The answer to this question is denoted by e, the desired accuracy range.

2. What do you want the confidence level to be so that the distance between the estimate and the parameter is less than or equal to e?

3. What is your estimate of the variance (or standard deviation) of the population in question?

Ans: This is often unknown and we need to estimate it. Use the range (if you are sure no outliers are present), $\sigma \approx$ Range/6; or, if the population is approximately normal and you can get the 95% bounds on values in the population, divide the difference between the upper and lower bound by 4; or conduct a pilot survey to estimate $\sigma$.



45/60

Sample size formula for means (Interval or Ratio data)

$$n = \frac{z^2 \sigma^2}{e^2} \times g$$

Where,

n = required sample size

z = value of Z for the desired confidence level

$\sigma$ = estimated SD for the population mean

e = desired accuracy range

g = design effect


46/60

Sample size formula for proportions (Nominal or Ordinal data)

$$n = \frac{z^2 (P \times Q)}{e^2} \times g$$

where,

n = required sample size

z = the Z value for the desired confidence level

P = estimate of the population proportion (in percent)

Q = (100 - P)

e = desired accuracy range

g = design effect


47/60

Exhibit 1:

A market research firm wants to conduct a survey to estimate the average amount spent on entertainment by each person visiting a popular resort. The people who plan the survey would like to be able to determine the average amount spent by all people visiting the resort to within $120, with 95% confidence. From past operation of the resort, an estimate of the population standard deviation is $400. What is the minimum required sample size?

Taking g = 1:

$$n = \frac{z^2 \sigma^2}{e^2} \times g = \frac{(1.96)^2 \times 160000}{120^2} = 42.684 \approx 43$$
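The resort calculation can be reproduced in a few lines. This is a sketch only: `sample_size_for_mean` is an illustrative name, the design effect defaults to 1 as in the worked answer, and rounding up reflects the usual convention for a minimum sample size.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e, g=1.0):
    """Raw sample size n = z^2 * sigma^2 * g / e^2 for estimating a mean."""
    return z**2 * sigma**2 * g / e**2

# Resort example: 95% confidence (z = 1.96), sigma = $400, accuracy e = $120.
raw_n = sample_size_for_mean(1.96, 400, 120)
n = ceil(raw_n)  # round up so the accuracy target is still met

print(round(raw_n, 3), n)
```

Halving e to $60 would quadruple the required sample, because e enters the formula squared.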


48/60

Exhibit 2:

The manufacturer of a sports car wants to estimate the proportion of people in a given income bracket who are interested in the model. The company wants to know the population proportion p to within 0.10 with 99% confidence. Current company records indicate that the proportion p may be around 0.25. What is the minimum required sample size for this survey?

Taking g = 1:

$$n = \frac{z^2\, p\, q}{e^2} \times g = \frac{(2.576)^2 (0.25)(0.75)}{0.10^2} = 124.42 \approx 125$$
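The proportion version follows the same pattern. In this sketch p and e are entered as fractions (0.25, 0.10) rather than percentages, and `sample_size_for_proportion` is an illustrative name; the second pair of calls uses a hypothetical prevalence of 0.15 with precision 0.03 to show how a design effect of 2 doubles the requirement.

```python
from math import ceil

def sample_size_for_proportion(z, p, e, g=1.0):
    """Raw sample size n = z^2 * p * (1 - p) * g / e^2 for estimating a proportion."""
    q = 1 - p
    return z**2 * p * q * g / e**2

# Sports-car example: 99% confidence (z = 2.576), p about 0.25, accuracy e = 0.10.
raw_n = sample_size_for_proportion(2.576, 0.25, 0.10)
n = ceil(raw_n)
print(round(raw_n, 2), n)

# Effect of the design effect g (hypothetical inputs: p = 0.15, e = 0.03).
srs = sample_size_for_proportion(1.96, 0.15, 0.03)
cluster = sample_size_for_proportion(1.96, 0.15, 0.03, g=2)
print(round(srs, 1), round(cluster, 1))
```

Keeping g as an explicit parameter makes it easy to compare simple random sampling (g = 1) with designs that need an inflated sample.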


49/60

Sample Size Calculations

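Simple random / systematic sampling:

$$n = \frac{z^2\, p\, q}{d^2} = \frac{1.96^2 \times 0.15 \times 0.85}{0.03^2} \approx 544$$

Cluster sampling:

$$n = g \times \frac{z^2\, p\, q}{d^2} = \frac{2 \times 1.96^2 \times 0.15 \times 0.85}{0.03^2} \approx 1088$$

Where,

z = Z score associated with the level of significance

p = expected prevalence

q = 1 - p

d = absolute precision

g = design effect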


50/60

Sampling Cost

Sampling Cost = Fixed Cost + Variable Cost

Fixed cost is independent of sample size, e.g. the cost of planning and organizing the sampling experiment.

Variable cost increases with sample size; it includes the cost of selection, measurement and recording of each sampled item.

Error Cost: more difficult to estimate than sampling cost. Usually it increases more rapidly than linearly with the amount of error. Often a quadratic formula is used.

Total Cost = Sampling Cost + Error Cost



51/60

Sampling Distributions: Definitions and Key Concepts

A sample statistic used to estimate an unknown population parameter is called an estimate.

The discrepancy between the estimate and the true parameter value is known as sampling error.

A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.


52/60

Distribution of Sample Means

How do the sample mean and variance vary in repeated samples of size n drawn from the population?

Generally, the exact distribution is difficult to calculate.

What can be said about the distribution of the sample mean when the sample is drawn from an arbitrary population?

In many cases we can approximate the distribution of the sample mean, when n is large, by a normal distribution.

The famous Central Limit Theorem
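A simulation gives a feel for this approximation. Everything specific below is an assumption chosen for illustration: the population is exponential with mean 1 and SD 1 (clearly non-normal), each sample has n = 50 observations, and 2,000 samples are drawn with a fixed seed.

```python
import random
from statistics import mean, stdev

def sample_means(n, reps, seed=0):
    """Means of `reps` samples, each of size n, from an exponential(1) population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

means = sample_means(n=50, reps=2000)

# If the normal approximation holds, the sample means should centre on the
# population mean (1) with spread close to sigma / sqrt(n) = 1 / sqrt(50).
print("mean of sample means:", round(mean(means), 3))
print("SD of sample means:  ", round(stdev(means), 3))
print("sigma / sqrt(n):     ", round(1 / 50**0.5, 3))
```

A histogram of `means` would look roughly bell-shaped even though the underlying population is strongly skewed; the printed summaries are only a rough numerical check of that.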


53/60

Central Limit Theorem



54/60

Estimators and their properties

Unbiased: an estimator is unbiased if its expected value is equal to the population parameter it estimates. The sample mean is an unbiased estimator of the population mean.

Efficiency: an estimator is efficient if it has a relatively small variance (not S.D.).

Consistency: an estimator is consistent if its probability of being close to the parameter it estimates increases with sample size.

Sufficient: an estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates.

Estimators of the population mean can be the sample mean and the sample median.

$S^2$ is an unbiased estimator of $\sigma^2$, but the SD ($S$) is not an unbiased estimator of the population SD $\sigma$. We still use $S$ as an estimator, ignoring the small bias, relying on the fact that $S^2$ is an unbiased estimator of $\sigma^2$.

              MEAN                              MEDIAN
UNBIASED      YES                               YES
EFFICIENCY    HIGHER THAN MEDIAN
SUFFICIENCY   USES ALL VALUES FOR CALCULATION   USES ONLY POSITION


55/60

Degree of freedom

When we calculate the sample variance, the deviations are taken from $\bar{x}$ and not from $\mu$. The reason is simple: while sampling, the population mean $\mu$ is almost always unknown and we have to estimate it using $\bar{x}$. This reduces our degrees of freedom from n to (n-1). But taking squared deviations from $\bar{x}$ induces a downward bias in the deviations.

Dividing the sum of squared deviations by only its d.f. will yield an unbiased estimate of the population variance.
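The downward bias can be shown exactly on a tiny population by enumerating every possible sample. The sketch assumes sampling with replacement (so draws are independent) from the five heights used in the basketball exhibit, with samples of size 2; the helper name is illustrative.

```python
from itertools import product
from statistics import mean

population = [76, 78, 79, 81, 86]
mu = mean(population)
sigma2 = mean((x - mu) ** 2 for x in population)  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean (not from mu)."""
    xbar = mean(sample)
    return sum((x - xbar) ** 2 for x in sample)

n = 2
# All 25 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = mean(sum_sq_dev(s) / n for s in samples)                # divide by n
avg_div_n_minus_1 = mean(sum_sq_dev(s) / (n - 1) for s in samples)  # divide by d.f.

print("population variance:", sigma2)
print("average, dividing by n:    ", avg_div_n)
print("average, dividing by n - 1:", avg_div_n_minus_1)
```

Averaged over all samples, dividing by n - 1 recovers the population variance, while dividing by n falls short by the factor (n - 1)/n.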



56/60

Exhibit: Sampling Error & need of Sampling Distribution

Sampling Error: error resulting from using a sample, instead of a census, to estimate a population quantity.

Suppose the population of interest consists of the heights (in inches) of the five starting players on a men's basketball team.

76 78 79 81 86 ($\mu = 80$)

(i) Determine the sampling distribution of the mean for a random sample of (a) two heights, (b) four heights, from the population of five heights.

(ii) Make some observations about sampling error when the mean of a random sample of (a) two heights, (b) four heights, is used to estimate the population mean $\mu$.

(iii) Employ the sampling distribution of the mean obtained above to find the probability that the sampling error made in estimating the population mean, $\mu$, by the sample mean will be at most 1 inch; that is, determine the probability that the sample mean will be within 1 inch of $\mu$.


57/60

Considering the sample size of two

Sample   Mean
76,78    77.0
76,79    77.5
76,81    78.5
76,86    81.0
78,79    78.5
78,81    79.5
78,86    82.0
79,81    80.0
79,86    82.5
81,86    83.5

[Figure: plot of the sample means]

Probability of one sample being selected = 1/10 = 0.1

Probability distribution of the random variable $\bar{x}$ (the sampling distribution of the mean):

Mean          77.0  77.5  78.5  79.5  80.0  81.0  82.0  82.5  83.5
Probability   0.1   0.1   0.2   0.1   0.1   0.1   0.1   0.1   0.1


58/60

(ii) Using the above results, we can make some simple but significant observations about sampling error when the mean, $\bar{x}$, of a random sample of two heights is used to estimate the population mean $\mu$.

It is unlikely that the mean of the sample selected will be equal to 80. In fact, only 1 of the 10 samples has the mean 80; thus, in this case, the chance is only 0.1 that $\bar{x}$ will equal $\mu$; some sampling error is likely.

(iii) Since $\mu = 80$ inches, we need to find:

$$P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.5,\ 80.0 \text{ or } 81.0) = P(79.5) + P(80.0) + P(81.0) = 0.1 + 0.1 + 0.1 = 0.3$$

If we take a random sample of two heights, there is a 30% chance that the mean of the sample will be within 1 inch of the population mean.


59/60

Considering the sample size of four

Sample        Mean
76,78,79,81   78.50
76,78,79,86   79.75
76,78,81,86   80.25
76,79,81,86   80.50
78,79,81,86   81.00

[Figure: plot of the sample means]

Probability of one sample being selected = 1/5 = 0.2

Probability distribution of the random variable $\bar{x}$ (the sampling distribution of the mean):

Mean          78.50  79.75  80.25  80.50  81.00
Probability   0.2    0.2    0.2    0.2    0.2

60/60

(ii) Using the above results, we observe that none of the samples of four heights has a mean equal to the population mean 80; thus, when the mean, $\bar{x}$, of a random sample of four heights is used to estimate the population mean, $\mu$, some sampling error is certain.

(iii) Since $\mu = 80$ inches, we need to find:

$$P(79 \le \bar{x} \le 81) = P(\bar{x} = 79.75,\ 80.25,\ 80.50 \text{ or } 81.0) = P(79.75) + P(80.25) + P(80.50) + P(81.0) = 0.2 + 0.2 + 0.2 + 0.2 = 0.8$$

If we take a random sample of four heights, there is an 80% chance that the mean of the sample will be within 1 inch of the population mean.

Hence, as the sample size $n \to \infty$, S.E. $\to 0$.
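Both parts of the exhibit can be reproduced by enumerating every sample without replacement. This is a minimal sketch; the function names are illustrative, and exact fractions are used so that the probabilities come out as 3/10 and 4/5 without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

heights = [76, 78, 79, 81, 86]
mu = Fraction(sum(heights), len(heights))  # population mean = 80

def sampling_distribution(population, n):
    """Map each possible sample mean to its probability (all samples equally likely)."""
    samples = list(combinations(population, n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

def prob_within(dist, mu, tol):
    """Probability that the sample mean is within `tol` of mu."""
    return sum(p for m, p in dist.items() if abs(m - mu) <= tol)

for n in (2, 4):
    dist = sampling_distribution(heights, n)
    print(f"n = {n}: P(|xbar - mu| <= 1) = {float(prob_within(dist, mu, 1))}")
```

Running the same two functions for n = 3 or n = 5 shows how the chance of a small sampling error changes with sample size for this population.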