
Introduction to Data Analysis Sampling and Probability Distributions.

Date post: 16-Jan-2016
Upload: dulcie-wilkins
Transcript
Page 1: Introduction to Data Analysis Sampling and Probability Distributions.

Introduction to Data Analysis

Sampling and Probability Distributions

Page 2: Introduction to Data Analysis Sampling and Probability Distributions.

Today’s lecture

Sampling (A&F 2). Why sample. Sampling methods.

Probability distributions (A&F 4) Normal distribution.

Sampling distributions = normal distributions. Standard errors (part 1).

Page 3: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling introduction

Last week we were talking about populations (albeit in some cases small ones, such as my friends).

Often when we see numbers used they are not numbers relating to a population, but a sample of that population.

Newspapers report the percentage of the electorate thinking Tony Blair is trustworthy, but this is really the percentage of their sample (say 1000 people) that they asked about Blair’s trustworthiness.

Page 4: Introduction to Data Analysis Sampling and Probability Distributions.

Samples and populations

For that statistic from the newspaper’s sample to be useful, the sample has to be ‘representative’.

i.e. the % saying Blair is trustworthy in the newspaper’s survey (the sample of 1000 people) needs to be similar to the % in the electorate (the population of 40 million people).

An intuitively obvious way of doing this is to pick 1000 people at random.

For the survey, metaphorically (or literally with a big hat) put every elector’s name into a hat and pull out 1000 names.

For a random sample of people in a large classroom, I could sample every 10th person along each row.

Page 5: Introduction to Data Analysis Sampling and Probability Distributions.

Why sample?

Cost. We could ask all 40 million people that are eligible to vote in Britain. This would prove somewhat expensive. The last British census cost £220 million…

Speed. Equally, the last British census took 5 years to process the data…

Impossibility. Consuming every bottle of wine from a vineyard to assess its quality leaves no wine to sell…

Page 6: Introduction to Data Analysis Sampling and Probability Distributions.

Why random?

Random sampling allows us to apply probability theory to our samples.

This means that we can assess how likely it is (given how big our sample is) that our sample is representative.

Deal with this in more detail later on.

Intuitively, non-random sampling doesn’t seem a very good idea.

Who’s heard of Alf Landon?

Page 7: Introduction to Data Analysis Sampling and Probability Distributions.

Alf vs. FDR

In 1936 the Literary Digest magazine predicted that the Republican Presidential candidate (Landon) would beat FDR.

The LD sent out 10 million questionnaires. Of the 2½ million that were sent back, a large majority claimed to be voting Republican at the election.

The LD wanted to estimate the % of voters for each candidate (the parameter), and used the proportion from their sample (the statistic) to estimate this.

But, FDR won…

Page 8: Introduction to Data Analysis Sampling and Probability Distributions.

Why did the LD get it wrong?

LD’s sample was large, but unrepresentative.

They did not send questionnaires to randomly selected people, but rather to lists of people with club memberships and lists of car/telephone owners.

These people were wealthier and therefore more likely to vote Republican; the sample was not representative of the US electorate as a whole.

The LD’s sampling frame was not the population (the electorate), but a wealthy subset of the population.

Page 9: Introduction to Data Analysis Sampling and Probability Distributions.

Non-probability sampling

The moral being… If we don’t sample randomly, and instead use non-probability sampling, then we are likely to get sample statistics that are not similar to the population parameters.

e.g. Newspapers and TV regularly invite readers/viewers to ring up and ‘register their opinion’.

The Scottish Daily Mirror ran a poll in 2001 on who should be the new First Minister. One of Jack McConnell’s fellow MSPs rang up 169 times to indicate he should take the position…

If the Daily Mail and Independent hold phone-in polls on the same issue, the results will be different as the samples are different to one another in a non-random way (social class, ideology, etc.).

Page 10: Introduction to Data Analysis Sampling and Probability Distributions.

Experimental designs

Randomness is also useful in experimental sciences, just as with observational data.

If we are giving one set of subjects a treatment and one group nothing, then ideally we would randomly select who is in each group.

e.g. psychiatrists studying a drug for manic depressives would give the drug to one group and a placebo to the other.

Their results are no good if the groups are initially different (say by age, sex, etc.).

Random selection into a group makes these differences unlikely, and allows us to test how likely it is that the drug has a real effect.

Page 11: Introduction to Data Analysis Sampling and Probability Distributions.

Simple random sampling (SRS)

‘Names out of a hat’ sampling. Choose the sample size n that we want, and then randomly pick n observations from the population.

Each member of the population is equally likely to be sampled.

e.g. if I wanted a sample from the room, then I might give everyone a number, and then use a table of random numbers to pick out 10 people. Any method that picks people randomly is acceptable.
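The ‘names out of a hat’ idea can be sketched in a few lines of Python. This is only an illustration: the room of 40 numbered people and the seed are assumptions, and a seeded pseudo-random generator stands in for the table of random numbers.

```python
import random

# Hypothetical population: 40 people in the room, numbered 1 to 40.
population = list(range(1, 41))

# A seeded generator stands in for the table of random numbers.
rng = random.Random(2016)

# Draw n = 10 distinct people; every person is equally likely to be chosen.
srs = rng.sample(population, k=10)
print(sorted(srs))
```

Changing the seed gives a different, equally valid simple random sample.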

Page 12: Introduction to Data Analysis Sampling and Probability Distributions.

Problems with SRS

A random sample may not include enough of a particular interesting group for analysis.

If we are interested in experiences of racism, 100 random people will on average include 85 whites, and an individual sample could include even fewer non-whites than that (maybe even zero).

Can be costly and difficult. A random sample of 50 school-children might include 49 in England and Wales, and one in the Orkneys.

A complete list of every school-child might be possible to obtain, but what of every person living in Britain? A list of the population of interest is not always available.

Page 13: Introduction to Data Analysis Sampling and Probability Distributions.

Solutions (1)

Stratified random sampling. Two stages: classify population members into groups, then select by SRS within those groups.

e.g. ‘over-sample’ non-whites for our racism study.

Once we had divided the population by race, we would SRS within those racial groups.

Might take 50 whites and 50 non-whites for our sample if we were interested in comparing experiences of racism.
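A rough sketch of the two stages in Python. The sampling frame of 1000 people (850 white, 150 non-white, echoing the 85 per cent figure used earlier) and the ID numbers are made up for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical sampling frame: 850 white and 150 non-white respondents.
frame = [("white", i) for i in range(850)] + [("non-white", i) for i in range(150)]

# Stage 1: classify population members into groups (strata).
strata = {}
for group, person_id in frame:
    strata.setdefault(group, []).append((group, person_id))

# Stage 2: SRS of 50 within each group, which over-samples the smaller group.
stratified_sample = []
for group, members in strata.items():
    stratified_sample.extend(rng.sample(members, k=50))

counts = {g: sum(1 for grp, _ in stratified_sample if grp == g) for g in strata}
print(counts)
```

Because the smaller group is deliberately over-sampled, any estimate for the population as a whole would need to weight the two groups back to their true shares.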

Page 14: Introduction to Data Analysis Sampling and Probability Distributions.

Solutions (2)

Cluster random sampling. If population members are naturally clustered, then we SRS those clusters and then SRS the population members within those clusters.

Pupils in schools are naturally grouped by school.

We may not have a list of every school-child, but we do have a list of every school.

Again two stages. We randomly pick 5 schools, and then randomly pick 10 children in each school.
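The same two-stage idea for schools, sketched in Python. The 40 schools with 200 pupils each are invented for the example; only the ‘5 schools, then 10 children in each’ design comes from the slide.

```python
import random

rng = random.Random(7)

# Hypothetical list of 40 schools, each with 200 pupils (names are made up).
schools = {f"school_{s}": [f"pupil_{s}_{p}" for p in range(200)] for s in range(40)}

# Stage 1: SRS of 5 schools from the list of schools.
chosen_schools = rng.sample(sorted(schools), k=5)

# Stage 2: SRS of 10 children within each chosen school.
cluster_sample = {name: rng.sample(schools[name], k=10) for name in chosen_schools}

total = sum(len(children) for children in cluster_sample.values())
print(chosen_schools, total)
```

Note that only the list of schools is needed up front; pupil lists are required just for the 5 schools that are picked.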

Page 15: Introduction to Data Analysis Sampling and Probability Distributions.

One further problem

This is not to say that all problems with random sampling are soluble.

Non-response. Not all members of our chosen sample may respond, particularly when sanctions are nil and incentives are low (or in fact usually negative…).

This can matter if non-response is non-random, i.e. if certain types of people tend to respond and others do not.

Page 16: Introduction to Data Analysis Sampling and Probability Distributions.

Non-response

In 1992 opinion polls predicted a Labour victory, yet the Conservatives were returned by a large majority of votes (if not seats).

One of the (many) factors that may have caused this bias in the polls was that Conservative voters were less likely to respond to surveys than other voters.

If the members of the sample that choose to not respond are different to those that do then we have a biased sample. More on bias later on.

Ultimately, tricky to deal with. Some more on this later this semester.
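To see how non-random non-response biases a poll, here is a small simulation in Python. All the numbers are assumptions chosen for illustration (a party with 45% true support whose supporters respond half the time, while everyone else responds 70% of the time); they are not estimates from any real election.

```python
import random

rng = random.Random(1992)

# Hypothetical electorate: 45% support party A, 55% support other parties.
# Assumed response rates: A supporters respond 50% of the time, others 70%.
TRUE_SHARE_A = 0.45
RESPONSE_RATE = {True: 0.5, False: 0.7}

def poll(n_contacted):
    """Contact a random sample and return party A's share among respondents."""
    responses = []
    for _ in range(n_contacted):
        supports_a = rng.random() < TRUE_SHARE_A
        if rng.random() < RESPONSE_RATE[supports_a]:
            responses.append(supports_a)
    return sum(responses) / len(responses)

estimate = poll(100_000)
print(round(estimate, 3))
```

Even with a very large random sample, the share among respondents settles near 0.225 / (0.225 + 0.385), about 0.37, rather than the true 0.45. A bigger sample does not fix this kind of bias.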

Page 17: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling – a summary

Sampling is an easy way of collecting information about a population.

SRS means everyone in the population of interest has the same chance of being selected.

We often use slightly different methods to SRS to overcome certain problems.

Random sampling allows us to estimate the probability of the sample being similar to the population.

Page 18: Introduction to Data Analysis Sampling and Probability Distributions.

Probability – an idiot’s guide

We’re interested in how likely, how probable, it is that our sample is similar to the population.

In order to make this judgement, we need to think about probability a little bit.

In particular we need to think about probability distributions.

Page 19: Introduction to Data Analysis Sampling and Probability Distributions.

Probability

The proportion of times that an outcome would occur in a long run of repeated observations.

Imagine tossing a coin: on any one flip the coin can land heads or tails.

If we flip the coin lots of times then the number of heads is likely to be similar to the number of tails (law of large numbers).

Thus the probability of a coin landing heads on any one flip is ½, or 0.5, or in bookmakers’ terms ‘evens’.

If the coin was double headed then the probability of heads would be 1 – a certainty.
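The ‘long run’ idea is easy to check by simulation. A minimal Python sketch, assuming a fair coin and an arbitrary seed:

```python
import random

rng = random.Random(42)

def proportion_heads(n_flips):
    """Flip a fair coin n_flips times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 100, 10_000, 100_000):
    print(n, proportion_heads(n))
```

With only 10 flips the proportion can be well away from 0.5; as the number of flips grows it settles ever closer to 0.5.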

Page 20: Introduction to Data Analysis Sampling and Probability Distributions.

Probability Distribution (1)

The mean of a probability distribution of a variable is:

$\mu = \sum_y y\,P(y)$ if y is discrete.

$\mu = \int y\,P(y)\,dy$ if y is continuous.

Also called the expected value: E(y) = probability times payoff.

Standard deviation (σ) of a probability distribution measures variability. Larger σ = more spread-out distribution.
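As a worked example of the discrete case, take one roll of a fair six-sided die (an example chosen here for illustration, not taken from the lecture). The sketch below computes the mean as the sum of y times P(y), and the standard deviation as the square root of the sum of (y − µ)² times P(y).

```python
import math

# Probability distribution of one roll of a fair six-sided die.
dist = {y: 1 / 6 for y in range(1, 7)}

# mu = sum of y * P(y)
mu = sum(y * p for y, p in dist.items())

# sigma = square root of the sum of (y - mu)^2 * P(y)
sigma = math.sqrt(sum((y - mu) ** 2 * p for y, p in dist.items()))

print(round(mu, 3), round(sigma, 3))  # prints 3.5 1.708
```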

Page 21: Introduction to Data Analysis Sampling and Probability Distributions.

Probability distribution (2)

Lists the possible outcomes together with their probabilities.

Now, let’s take a continuous-level variable, like hours spent working by students per week.

The mean = 20, and standard deviation = 5. But what about the distribution…?

Assign probabilities to intervals of numbers, for example the probability of students working between 0 and 10 hours is (let’s say) 2½ per cent.

Can graph this, with the area under the curve for a certain interval representing the probability of the variable falling in that interval.

Page 22: Introduction to Data Analysis Sampling and Probability Distributions.

Probability distribution (3)

[Figure: distribution of time spent working (hours), x-axis from 0 to 40. The area between 0 and 10 is 2.5 per cent of the total area beneath the curve.]

Page 23: Introduction to Data Analysis Sampling and Probability Distributions.

Probability distribution (4)

Given this distribution, there is a 0.025 probability (2.5%, or 39-1 for the gamblers) that if I picked a student at random they would have done fewer than 10 hours’ work in a week.

A lot of continuous variables have a certain distribution – this is known as the normal distribution.

The student work distribution is ‘normal’.

Page 24: Introduction to Data Analysis Sampling and Probability Distributions.

What is a ‘normal distribution’?

NDs are symmetrical.

The distribution higher than the mean is the same as the distribution lower than the mean.

Unlike income, which has a skewed distribution.

For any normal distribution, the probability of falling within z standard deviations of the mean is the same, regardless of the distribution’s standard deviation.

The Empirical Rule tells us:

For 1 s.d. (or a z-value of 1) the probability is .68

For 2 s.d. (actually 1.96) the probability is .95

For 3 s.d. the probability is almost 1.
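These figures can be checked with the standard library’s `statistics.NormalDist`. A short sketch; the helper name `prob_within` is just for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def prob_within(z):
    """Probability that a normal variable falls within z s.d. of its mean."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1, 1.96, 2, 3):
    print(z, round(prob_within(z), 4))
```

The results are roughly 0.68 for 1 s.d., 0.95 for 1.96 s.d. and 0.997 for 3 s.d.; exactly 2 s.d. gives slightly more than 0.95.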

Page 25: Introduction to Data Analysis Sampling and Probability Distributions.

Brief aside—what is Z?

The Z-score for a value Y on a variable is the number of standard deviations that Y falls from µ.

We can use Z-scores to determine the probability in the tail of a normal distribution that is beyond a number Y.

$$z = \frac{y - \mu}{\sigma}$$
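Applied to the student work example (mean 20, standard deviation 5), the z-score for 10 hours and the probability in the tail below it can be computed like this. A minimal sketch using Python’s `statistics.NormalDist` in place of a printed z table:

```python
from statistics import NormalDist

mu, sigma = 20, 5      # mean and s.d. of weekly hours from the slides
y = 10                 # value of interest

# z is the number of standard deviations that y lies from mu.
z = (y - mu) / sigma

# P(Y < y) for a normal variable, read off the standard normal curve.
tail = NormalDist().cdf(z)

print(z, round(tail, 4))
```

This gives z = −2 and a tail probability of about 0.023. The rounder 2.5 per cent figure corresponds to z = −1.96, which is why the 2 s.d. rule is described as ‘actually 1.96’.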

Page 26: Introduction to Data Analysis Sampling and Probability Distributions.

Normal distribution (1)

[Figure: normal curve of time spent working (hours), x-axis from 0 to 40, with 1 s.d. less than the mean (15) and 1 s.d. more than the mean (25) marked.]

The area under the curve between 15 and 25 is 0.68 of the total area under the curve.

Hence the probability of working between 15 and 25 hours is 0.68.

Page 27: Introduction to Data Analysis Sampling and Probability Distributions.

Normal distribution (2)

For any value of z (i.e. not just whole numbers but say 2.34 s.d.), there is a corresponding probability.

Most stats books have z tables in their front/back covers.

Thus if we were to pick a student out of our population of known distribution we could work out how likely it would be that she was a hard worker.

Even non-normal distributions can be transformed to produce approximately normal distributions.

For example, incomes are not normally distributed, but we can ‘log’ them to make an approximately normal distribution (more on this later).

Page 28: Introduction to Data Analysis Sampling and Probability Distributions.

What’s the point?

But, surely we don’t know the distribution or mean of the population (that’s probably why we’re sampling it after all), so what use is all this…?

Page 29: Introduction to Data Analysis Sampling and Probability Distributions.

Back to sampling

The reason that normal distributions are of relevance to us is that sample means are (approximately) normally distributed.

In order to understand what this means let’s take an example of sampling.

I want to take a driving trip around the world, visit every country and pay no attention to speed limits.

I don’t particularly want to go to prison however, so what to do…

Page 30: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling example (1)

The plan is to bribe all policemen when caught speeding. Thus I want to measure how much it costs on average to bribe a policeman to avoid a speeding ticket.

It’s costly to collect this information, so I don’t want to investigate every country before I set off.

Therefore I sample the countries to try and estimate the average bribe I will need to pay.

[Images: James’ car; Ryan’s car]

Page 31: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling example (2)

I randomly sample 5 countries and measure the cost of the bribe.

Imagine for the minute I know what the population distribution looks like (it happens to be normal with a mean of $500).

[Figure: population distribution, marking the mean of the population ($500), the sample mean ($450) and one observation ($700).]

Page 32: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling distributions (1)

If we took lots of samples we would get a distribution of sample means, or the sampling distribution.

It so happens that this sampling distribution (the distribution of sample means, or of many other statistics) is approximately normally distributed.

Due to averaging the sample mean does not vary as widely as the individual observations.

Moreover, if we took lots of samples then the distribution of the sample means would be centred around the population mean.

Page 33: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling distributions (2)

Imagine I took lots of samples. There would be a normal distribution of their means, centred around the population mean.

[Figure: population distribution and sampling distribution, with the mean of the population and the mean of all sample means marked.]

Page 34: Introduction to Data Analysis Sampling and Probability Distributions.

3 Very Important Things

If we have lots of sample means then the average will be the same as the population mean. In technical language the sample mean is an unbiased estimator of the population mean.

If the sample size is large(ish), the distribution of sample means (what is called the sampling distribution) is approximately normal. This is true regardless of the shape of the population distribution.

As n (the sample size) increases the sampling distribution looks more and more like a normal distribution. This is called the central limit theorem.
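All three points can be seen in a simulation. The sketch below assumes a deliberately non-normal population: a skewed (exponential) distribution of bribes with mean $500. The exponential shape, the 5000 repeated samples and the seed are assumptions made for the example; only the $500 mean comes from the bribery example.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed population: exponential with mean 500 (not normal).
def draw_sample(n):
    return [rng.expovariate(1 / 500) for _ in range(n)]

def sample_means(n, n_samples=5000):
    """Means of n_samples independent samples, each of size n."""
    return [statistics.fmean(draw_sample(n)) for _ in range(n_samples)]

for n in (5, 30, 100):
    means = sample_means(n)
    centre = statistics.fmean(means)
    spread = statistics.stdev(means)
    # Share of sample means within 1 s.d. of the centre; about 0.68 if normal.
    within = sum(abs(m - centre) <= spread for m in means) / len(means)
    print(n, round(centre, 1), round(spread, 1), round(within, 3))
```

For every n the sample means centre on roughly 500, and their spread shrinks as n grows, even though the individual bribes are far from normally distributed.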

Page 35: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling distributions (4)

[Figure: population distribution and sampling distribution, with the mean of the population and the mean of all sample means marked.]

Page 36: Introduction to Data Analysis Sampling and Probability Distributions.

Sampling distributions (5)

Page 37: Introduction to Data Analysis Sampling and Probability Distributions.

‘Accurate’/‘inaccurate’ samples (1)

Some sampling distributions are bigger than others…

The top sampling distribution is better for estimating the population mean as more of the sample means lie near the population mean.

[Figure: two sampling distributions around the mean of the population, each with the central 68% of the distribution marked.]

Page 38: Introduction to Data Analysis Sampling and Probability Distributions.

‘Accurate’/‘inaccurate’ samples (2)

Sampling distributions that are tightly clustered will give us a more accurate estimate on average than those that are more dispersed.

Remember, high standard deviations give us a ‘short and flabby’ distribution and low standard deviations give us a ‘tall and tight’ distribution.

We need to estimate what our sampling distribution’s standard deviation is.

But how do we do this…?

Page 39: Introduction to Data Analysis Sampling and Probability Distributions.

A (little) bit of math now…

Before we work out what the sampling distribution looks like, some important terms.

Page 40: Introduction to Data Analysis Sampling and Probability Distributions.

Standard error (1)

For my bribery sample, we know the following:

But, we want to know the standard deviation of the sampling distribution, so we can see what the typical deviation from the population mean will be.

Page 41: Introduction to Data Analysis Sampling and Probability Distributions.

Standard error (2)

Fortunately for us:

The standard error is an estimate of how far any sample mean ‘typically’ deviates from the population mean.
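For reference, the standard textbook result for the standard error of a sample mean is given below. The symbols are assumptions of this note rather than the lecture’s own notation: σ is the population standard deviation, s the sample standard deviation used to estimate it, and n the sample size.

```latex
\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}},
\qquad
\widehat{\sigma}_{\bar{y}} = \frac{s}{\sqrt{n}}
```

Since σ is rarely known, in practice the second form is the one that gets computed from a single sample.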

Page 42: Introduction to Data Analysis Sampling and Probability Distributions.

Standard error (3)

For my bribery sample.

Thus, the ‘typical’ deviation of a sample mean from the population mean (of $500) would be $64, if we repeatedly sampled the population.

Page 43: Introduction to Data Analysis Sampling and Probability Distributions.

2 More Very Important Things

The formula for standard error means that as:

…the n of the sample increases, the sampling distribution is tighter. This makes sense: the bigger the sample, the better it is at estimating the population mean.

…the distribution of the population becomes tighter, the sampling distribution is also tighter. This also makes sense. If a population is dispersed it will be more unlikely to get observations near the mean.
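Both effects are easy to tabulate. The sketch below assumes the usual formula (standard deviation divided by the square root of n); the standard deviations and sample sizes are hypothetical values picked to show the pattern.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

# Hypothetical population standard deviations and sample sizes.
for sd in (50, 100):
    for n in (5, 50, 500):
        print(f"sd={sd:>3}  n={n:>3}  se={standard_error(sd, n):6.2f}")
```

Reading down the output, the standard error falls as n rises, and halving the population’s spread halves the standard error. Because of the square root, cutting the standard error in half by sample size alone takes four times as many observations.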

Page 44: Introduction to Data Analysis Sampling and Probability Distributions.

Binary variables

This works for binary variables too, where the mean is just the proportion…

Page 45: Introduction to Data Analysis Sampling and Probability Distributions.

Do we trust Blair?

Take the example of Blair’s trustworthiness.

1000 people in the sample, 30% trust him (i.e. the mean is 0.30).

Given this, we can work out the standard error.

The typical deviation of a sample proportion from the population proportion would be about 1.4 percentage points if we took lots of samples.
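The arithmetic behind that figure, sketched in Python. It assumes the usual result that a 0/1 variable with proportion p has standard deviation √(p(1 − p)), and uses the sample proportion as a stand-in for the unknown population proportion.

```python
import math

n = 1000   # sample size from the slide
p = 0.30   # sample proportion who trust Blair

# For a 0/1 variable the standard deviation is sqrt(p * (1 - p)).
sd = math.sqrt(p * (1 - p))

# Standard error = standard deviation / sqrt(n).
se = sd / math.sqrt(n)

print(round(se, 4))  # prints 0.0145
```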

Page 46: Introduction to Data Analysis Sampling and Probability Distributions.

And finally, standard error (4)

Don’t forget that we know the shape of the distribution of the sample means (it’s normal).

We know the sample mean, the shape of the distribution of all the sample means, and how dispersed the distribution of sample means is.

So (at last) we can calculate the probability of the sample mean being ‘near’ to the population mean (i.e. calculate a z-score and look up corresponding probability). But wait, there’s more…
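As a rough illustration of that last step, take the Blair poll again and define ‘near’ as within 3 percentage points. The 3-point margin is an arbitrary choice for this example, and the sample proportion is used in place of the unknown population proportion when working out the standard error.

```python
import math
from statistics import NormalDist

n, p = 1000, 0.30
se = math.sqrt(p * (1 - p) / n)

margin = 0.03            # hypothetical definition of 'near': within 3 points
z = margin / se          # how many standard errors the margin represents

# Probability that a sample proportion lands within the margin either side.
prob_near = NormalDist().cdf(z) - NormalDist().cdf(-z)

print(round(z, 2), round(prob_near, 3))
```

The margin is a little over 2 standard errors, so the probability comes out at roughly 0.96.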

