Higher Statistics Notes 2016/17 - LT Scotland

Date post: 06-Jan-2022
Higher Statistics 2016/17, Dr. Hamilton (HSOG)

Higher Statistics Notes 2016/17

Course Contents

1.1 Applying statistical literacy skills to data

Types of data

Random sampling

Frequency table

Contingency table

Bar chart

Histogram

Stem and leaf diagram

Box plot

Outliers

1.2 Applying statistical skills to normally distributed data

Normal distribution

Mean

Standard deviation

Skewed distribution

Median

Interquartile range

1.3 Applying statistical skills to correlation and linear regression

Scatter plot

Correlation coefficient

Line of best fit

Linear regression

Outliers

1.4 Applying statistical skills to data analysis, interpretation and communication

Hypothesis test

Significance

Z-test

T-test

Confidence Intervals

Causes of errors


2.1 Undertaking a correlation and regression analysis

Using software to test a linear relationship

Solving a problem and reporting on it

2.2 Undertaking a data analysis

Using software to do a hypothesis test

Solving a problem and reporting on it


Introduction

Statistics is the analysis of data to spot trends. The goal is to learn about a large population when there is only access to a limited sample. For example, to determine how people might vote in an election you can’t ask them all, so you have to rely on a sample. Likewise, before any new drug comes to market its effectiveness is tested with a small trial.

The first question a statistician asks is therefore: what sample should be taken? Once the results are collected the second question is: what can be learned from the sample? If a sample is well chosen and well analysed, it gives a lot of information about the population. But there is always some uncertainty, as perhaps the sample was not a typical one. So the third question is: how much confidence is there in the findings?

These three questions are difficult to answer, though the second at least (analysing a sample with statistical tests) can be left to a computer. That leaves the statistician with the role of devising the sampling procedure, programming a computer, then reporting on the results. This final step is the most crucial, as it is notoriously easy to produce misleading conclusions, either by accident or even deliberately when trying to promote a particular product or viewpoint. By completing this course you will be able to present your findings accurately, and recognise when others have not.


1.1 Applying statistical literacy skills to data

Types of Data

Quantitative data is about quantities and so is numerical. It can be continuous and take any value (e.g. weight of ice cream) or discrete and take only certain values (e.g. number of scoops of ice cream).

Qualitative data is about qualities and so is non-numerical. It can be ordinal, meaning it can be put into an order (e.g. choosing a rating out of ‘bad’, ‘medium’, ‘good’), nominal, meaning it consists of names (e.g. choice of ice cream flavour), or categorical, meaning it assigns people to different groups (e.g. sex, race, nationality).

Primary data is data collected by yourself; secondary data is data collected by someone else.

Random Sampling

In a simple random sample every member of a population has an equal chance of being selected. One way to do this is to assign everyone a number, then use a random number generator. However, suppose your sample of the UK population happened just to be men aged 50-55: would you be happy that you had a good sample of the whole UK population?

In a systematic sample the population is first ordered by some measure (e.g. house number), then a random starting point is chosen and every nth person is selected, for example every 10th house (e.g. number 63, 73, 83, …). In a stratified random sample, the population is first grouped into strata by some measure (e.g. gender). Then equal numbers (or proportional numbers if the groups are of different sizes) are picked from each group, e.g. picking half of the sample women and half men.
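The three sampling methods described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered people, the "F"/"M" labels, the sample size of 10 and the fixed seed are all invented for the example.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable (an assumption)

# Hypothetical population: 100 people numbered 1 to 100,
# the first 50 labelled "F" and the rest "M".
population = list(range(1, 101))
group = {p: ("F" if p <= 50 else "M") for p in population}

# Simple random sample: every member has an equal chance of selection.
simple = random.sample(population, 10)

# Systematic sample: random starting point, then every 10th member.
start = random.randrange(10)
systematic = population[start::10]

# Stratified random sample: equal numbers picked from each stratum.
women = [p for p in population if group[p] == "F"]
men = [p for p in population if group[p] == "M"]
stratified = random.sample(women, 5) + random.sample(men, 5)

print("Simple:    ", sorted(simple))
print("Systematic:", systematic)
print("Stratified:", sorted(stratified))
```

Running it a few times with different seeds shows that a simple random sample can, by chance, be unbalanced between the two groups, whereas the stratified sample never is.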

1. An audit of polar bears recorded their sex, weight, number of young and eye colour. What types of data are these?

Sex is categorical, weight is numerical continuous, number of young is numerical discrete and eye colour is qualitative.

https://www.tes.com/teaching-resource/classifying-data-starter-6332016

https://www.tes.com/teaching-resource/types-of-data-discrete-vs-continuous-6337366


In a cluster sample a sample is chosen for convenience, for example surveying everyone who entered the Ice Cream Parlour on Sunday morning. This is not as good statistically as sampling throughout the week, but for practical reasons may be the only option.

Frequency Table

When data arrives one by one it is recorded by tallying using a bar-and-gate system. This can then be totalled to show how frequently something happens. The features are labelled columns, and a total at the end.

(see Excel sheet ‘Ice Cream’)

Flavour                  Tally Marks   Frequency
Belgian Chocolate        |||| ||||     9
Strawberries and Cream   ||            2
Luxury Vanilla           |||           3
Salted Caramel           ||||          5
Blackcurrant Sorbet                    0
Total                                  19
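The tallying step can also be done by a computer. The sketch below uses Python's `collections.Counter` to rebuild the frequency table above. The list of individual sales is an assumption: only the totals come from the table, and the order in which scoops were sold is not recorded in these notes.

```python
from collections import Counter

# One entry per scoop sold, as if recorded one at a time on a tally sheet.
# The order is invented; only the totals come from the table.
sales = (["Belgian Chocolate"] * 9 + ["Strawberries and Cream"] * 2
         + ["Luxury Vanilla"] * 3 + ["Salted Caramel"] * 5)

flavours = ["Belgian Chocolate", "Strawberries and Cream", "Luxury Vanilla",
            "Salted Caramel", "Blackcurrant Sorbet"]

frequency = Counter(sales)  # counts how often each flavour appears

for flavour in flavours:
    # A flavour that was never sold is reported as 0 by Counter
    print(f"{flavour:<24}{frequency[flavour]:>3}")

total = sum(frequency.values())
print(f"{'Total':<24}{total:>3}")
```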

2. Do Premiership Footballers speak more than one language? Give four different ways of selecting twenty random Premiership Footballers to investigate, demonstrating the four different types of random sample.

To perform a simple random sample number all the footballers and use a random number generator, to perform a systematic sample list them all alphabetically and take twenty equally spaced, to perform a stratified random sample choose one footballer at random from each of the twenty teams, to perform a cluster sample pick the first twenty footballers you can think of.

https://www.tes.com/teaching-resource/maths-gsce-statistics-sampling-techniques-6082332

https://www.tes.com/teaching-resource/methods-of-sampling-6071355


Contingency Table

This is a table which displays the results of categorising data based on two different criteria. For example, suppose 100 people were asked for their gender and which sport they preferred out of Football, Hockey and Cricket.

          Football   Hockey   Cricket   TOTAL
Male      21         16       9         46
Female    22         30       2         54
TOTAL     43         46       11        100

Note that the questions are designed here so that everyone falls into exactly one category: you can’t choose both Football and Hockey as your preferred sport, nor can you choose neither. The column totals show how many chose each sport, and the row totals how many of each gender there were. Notice that the row totals add up to 100 (as everyone has been counted exactly once), and the column totals also add up to 100 (for the same reason).

The natural question to ask when looking at a table like the one above is whether there is a link between gender and favourite sport. The fact that the females much preferred hockey and rarely chose cricket suggests there is a link, but a statistical test would need to be done to determine how significant this effect is, and whether or not it could just be down to chance in the particular people asked (the name of the test is Chi-squared, and is beyond the scope of this course).
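As a quick check of the totals, the table can be stored as a nested dictionary and the row and column totals recomputed. This is a minimal sketch; the variable names are chosen for this example only.

```python
# Counts copied from the contingency table above
table = {
    "Male":   {"Football": 21, "Hockey": 16, "Cricket": 9},
    "Female": {"Football": 22, "Hockey": 30, "Cricket": 2},
}
sports = ["Football", "Hockey", "Cricket"]

# Row totals: how many of each gender were asked
row_totals = {gender: sum(counts.values()) for gender, counts in table.items()}

# Column totals: how many people chose each sport
column_totals = {s: sum(table[g][s] for g in table) for s in sports}

# Both sets of totals must add up to the number of people asked
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)

# Share of each gender preferring hockey, as a first look at a possible link
for gender in table:
    share = table[gender]["Hockey"] / row_totals[gender]
    print(f"{gender}: {share:.0%} prefer hockey")
```

Comparing shares rather than raw counts matters here because there are more females than males in the sample.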

Bar Chart

This is a visual display of data that gives an instant overview. It is used for qualitative or quantitative discrete data (continuous data is better displayed with a histogram). The features are a title, labelled axes and bars of equal width that do not touch.

3. A dice is rolled 20 times with the following results

1 3 4 5 2 4 2 4 5 2 4 6 6 4 3 5 4 3 2 1

Record this data in a frequency table

https://www.tes.com/teaching-resource/investigation-frequency-table-and-graphs-6448279

4. In a class of pupils there were 10 girls with pets, 8 boys with pets, 7 girls with no pets and 5 boys with no pets. Represent this data in a frequency table.


(see Excel sheet ‘Ice Cream’)

Histogram

A type of bar chart used for continuous data. The data must first be grouped, and the groups don’t all need to be the same size. The features are a title, labelled axes and bars of (potentially) variable width that touch (if a bar is wider than the rest, it must be lower, so the area of the bar shows the total).

[Bar chart: Flavours chosen at Luca's Ice Cream Parlour, Sunday 9am-10am. Vertical axis: Number of Scoops (0 to 10). Horizontal axis: Belgian Chocolate, Strawberries and Cream, Luxury Vanilla, Salted Caramel, Blackcurrant Sorbet.]

5. Based on the bar chart above, how many more scoops of Salted Caramel than Luxury Vanilla were sold?

6. Draw a bar chart to visually demonstrate how many of each different denomination of coin you have in your pockets


(see Excel sheet ‘Ice Cream’)

Stem and Leaf Diagram

This is a visual display that gives an overview, while preserving the value of each data point. It can only be used for quantitative data, and is only sensible when there are many data points which are well spread out. The features are a title and a key.

Age of customers in the ice cream shop

 0-9  | 5 6 7 9
10-19 | 2 3 4 4 5 5 8 8
20-29 | 1 2 2 3 3 5 6 7 9
30-39 | 3 3 6 8
40-49 | 1 3 4 5 6 7 7 8
50-59 | 3 4 5 6 7
60-69 | 7 9

[Histogram: Age of customers. Vertical axis: Frequency (0 to 10). Horizontal axis: Age Groups (10-20, 20-30, 30-40, 40-50, 50-60).]

7. For each type of data, determine if a bar chart or histogram is more appropriate

(a) A sample of colours of Smarties

(b) A sample of weights of cats

(c) A sample of pupil arrival times at school

(d) A sample of number of siblings of pupils

For (a) a bar chart is best as the data is qualitative, for (b) and (c) a histogram as the data is numerical continuous, for (d) a bar chart as the data is numerical discrete


1|2 means 12 years old

The youngest customer was 5 and the oldest 69, and the most common age band was 20-29.

A back-to-back stem and leaf diagram visually compares two sets of data. The features are a title and key, and both sets of data increase away from the middle (so the numbers on the left hand side count down instead of up).

Age of customers in the ice cream shop

        Men |  Age  | Women
      7 6 5 |  0-9  | 9
    8 5 4 3 | 10-19 | 2 4 5 8
  9 7 3 3 2 | 20-29 | 1 3 5 6
      8 6 3 | 30-39 | 3
        4 1 | 40-49 | 3 5 6 7 7 8
          7 | 50-59 | 3 4 5 6
            | 60-69 | 7 9

1|2 means 12 years old

The diagram shows that, since there are in general more women in the higher age bands, the female customers were on average older than the men.
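A stem and leaf diagram is easy to build by program: the tens digit is the stem and the units digit is the leaf. The sketch below is one possible way to do this in Python, using the women's ages read from the back-to-back diagram; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

# Women's ages read from the back-to-back diagram
women = [9, 12, 14, 15, 18, 21, 23, 25, 26, 33,
         43, 45, 46, 47, 47, 48, 53, 54, 55, 56, 67, 69]

diagram = stem_and_leaf(women)

# Print every stem from smallest to largest, even if it has no leaves
for stem in range(min(diagram), max(diagram) + 1):
    leaves = " ".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem} | {leaves}")
print("Key: 1 | 2 means 12 years old")
```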


Box Plot

This is a visual display showing these five pieces of information: minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), maximum (these terms are all defined in section 1.2). The features are a title, plot of the five figures, scale and label for the scale.

8. (a) Draw a back to back stem and leaf diagram with the height in cm of 20 cats and 20 dogs

Cats: 38 31 32 24 33 35 32 30 29 28 40 30 33 32 36 37 43 44 49 41

Dogs: 58 32 54 64 73 35 44 51 34 32 35 67 50 71 62 40 41 43 50 60

(b) Make two comments comparing the heights of cats and dogs

(c) State the minimum and maximum Per Capita GNP from this graph:

For part (b), the dog heights are on average higher, and more spread out.

For part (c), the minimum GNP is an income of $1.80, the maximum an income of $8.90.

https://www.tes.com/teaching-resource/stem-and-leaf-diagrams-gcse-worksheet-6161978

https://www.tes.com/teaching-resource/interpreting-stem-and-leaf-diagrams-6387426


Outlier

An outlier is a data value that is out of keeping with the other values. This could either be caused by a measurement or recording error (e.g. recording a long jump distance as 67.2 metres instead of 6.72 metres), or a genuine freak result (e.g. a long jump of 8.95 metres, which stood as the World Record for 30 years).

Informally, outliers can be determined by eye. More formally, a statistical test can determine if a data value is an outlier.

It is important to identify outliers and, if it is appropriate, remove them from the data, as they can affect any conclusions drawn.

9. (a) Fifty pupils measured the size of the TV in their living room in cm (measuring the screen diagonally). After analysis, the following statistics were generated:

𝑀𝑖𝑛 = 40, 𝑄1 = 50, 𝑄2 = 55, 𝑄3 = 70, 𝑀𝑎𝑥 = 90

Display this information in a box and whisker plot

(b) Using the box and whisker plot, interpret the data

For part (b), the extended tail to the right indicates that the data has a positive skew, and there are a few pupils with very large TVs.

10. Make two comparisons between Set A and Set B

For question 10, the comparisons are that Set A has a higher average (median) and that Set B is more spread out.

https://www.tes.com/teaching-resource/comparing-box-and-whisker-plots-6395162


11. The time for Primary School pupils to run 100 metres was recorded, to the nearest second

17 18 16 19 21 15 9 14 16 18 20 53 24 23 16

(a) Identify the two outliers

(b) Give possible explanations for each outlier

(c) Estimate the average time

For (a) there is a low outlier of 9 seconds, and a high outlier of 53 seconds.

For (b) the low outlier is likely a recording error (perhaps it was really 19 seconds). The high outlier could also be a recording error, or something like a pupil walking on crutches.

For (c) the outliers should be ignored when calculating the average, which is about 18 seconds.
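The effect of the two outliers on the average can be checked directly. A small sketch using the times from question 11, with the mean taken from Python's `statistics` module:

```python
from statistics import mean

# 100 metre times from question 11, to the nearest second
times = [17, 18, 16, 19, 21, 15, 9, 14, 16, 18, 20, 53, 24, 23, 16]
outliers = {9, 53}

# Keep only the times that were not identified as outliers
cleaned = [t for t in times if t not in outliers]

mean_all = mean(times)
mean_cleaned = mean(cleaned)

print(f"Mean with outliers:    {mean_all:.1f} seconds")
print(f"Mean without outliers: {mean_cleaned:.1f} seconds")
```

Leaving the outliers in pulls the mean up by almost two seconds, mostly because of the single time of 53 seconds.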

https://www.tes.com/teaching-resource/misleading-graphs-11461831


1.2 Applying statistical skills to normally distributed data

Normal distribution

The normal distribution refers to the likelihood of the different outcomes produced by a random process. For example, if a coin is thrown 20 times the most likely number of Heads is 10, followed by 9 or 11 Heads, then 8 or 12 Heads and so on. Getting something like 5 or 15 Heads is relatively unlikely. The full bar chart for throwing a coin 20 times is shown below, and is overlaid with the normal curve.

(See Excel sheet ‘Coins’)
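The probabilities behind the coin example can be worked out exactly. For a fair coin thrown 20 times, the chance of exactly k Heads is the number of ways of choosing which k throws are Heads, divided by the 2^20 equally likely sequences. The sketch below prints a rough text version of the bar chart; the scaling of the bars (200 characters per unit of probability) is an arbitrary choice for display.

```python
from math import comb

tosses = 20

# P(k Heads) = C(20, k) / 2^20 for a fair coin
probability = {k: comb(tosses, k) / 2 ** tosses for k in range(tosses + 1)}

for k in range(tosses + 1):
    bar = "#" * round(probability[k] * 200)  # arbitrary scale for display
    print(f"{k:>2} {probability[k]:.4f} {bar}")

most_likely = max(probability, key=probability.get)
print("Most likely number of Heads:", most_likely)
```

The printed bars show the symmetric bell shape, with 5 or 15 Heads each having a probability of under 2%.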

The normal distribution has the following properties:

- Symmetrical about the mean

- Values far from the mean are much less common

- Distinctive bell shape

Many processes that combine a large number of small random effects tend to follow a normal distribution. For example, if a coin is thrown twenty or more times the number of Heads is approximately normally distributed. The more times the coin is thrown, the more closely the distribution of likely outcomes follows a normal distribution.

Any sample is characterised by the location of the centre, and how spread out the data is. For a normal distribution the best measures of location and spread are the mean and standard deviation.


Mean

When people talk about an average they usually mean the mean. For data that is normally distributed, the mean is the most appropriate measure of location (i.e. the most appropriate average). The mean can be found by ‘adding up and dividing’. For example, suppose the number of offspring of five gibbon mothers was

2 3 2 5 3

The mean is

$$\frac{2 + 3 + 2 + 5 + 3}{5} = 3$$

A more instructive way to see this calculation is as

$$\frac{2}{5} + \frac{3}{5} + \frac{2}{5} + \frac{5}{5} + \frac{3}{5} = 3$$

This shows that, since there are five gibbons, a fifth of the number of offspring of each one is taken. This helps to show that the mean is a very fair average, in that it takes a little contribution from all of the data values, giving an equal say to each one. However, the weakness of the mean is that if there are any extreme values in the data (outliers) they have a large effect on the mean.

The calculation of the mean can also be described in more mathematical language

$$\bar{x} = \frac{\sum x}{n}$$

Where $n$ represents the number of data values ($n = 5$ in the example of the gibbons), $\sum x$ represents the sum of the data values ($\sum x = 15$) and $\bar{x}$ (pronounced x bar) is the mean ($\bar{x} = 3$).

12. The wingspan of several hundred eagles was recorded in cm in the histogram below

(a) How many eagles had a wingspan between 120 and 130 cm?

(b) What was the mean wingspan?

(c) What fraction of eagles had a wingspan above 155 cm?

(d) What fraction of eagles had a wingspan above 200 cm?

For (a) there are about 20, for (b) about 155 (between 150 and 160), for (c) half of the eagles had a wingspan above average as the data is symmetric, for (d) very few eagles had a wingspan above 200 cm (perhaps one fiftieth).

https://www.tes.com/teaching-resource/statistics-1-normal-distribution-ks4-and-s1-3013736

Standard Deviation

This is a measure of spread that is appropriate for normally distributed data. Roughly, it works out the average distance of each data value from the mean. So if the data are far from the mean the standard deviation is high, and if they are close to the mean the standard deviation is small.

The standard deviation of a population of data values is defined as

$$\sigma = \sqrt{\frac{\sum (\bar{x} - x_i)^2}{n}}$$

Where $\sigma$ (sigma) is the Greek letter s (standing for standard deviation), $n$ is the number of data values, $\bar{x}$ represents the mean, $x_i$ represents each data value and $\sum (\bar{x} - x_i)^2$ is the sum of the squared differences from the mean.

13. Find the mean of each set of numbers:

(a) 4 5 6 7 8

(b) 4 5 6 7 8 60

(c) 40 50 60 70 80 600

(d) 41 51 61 71 81 601

(a) $\frac{30}{5} = 6$. This can also be found just by noting that the numbers are equally spaced around 6 in the middle

(b) $\frac{90}{6} = 15$. Notice the effect of the outlier (60) in pulling the mean up

(c) 150. This can be calculated easily, as each number is 10 times bigger than part (b)

(d) 151. Each number is one bigger than in part (c).


The standard deviation can be calculated by hand using a table. For example, when the number of offspring of five gibbon mothers is

2 3 5 3 2

then $n = 5$, and the mean is $\bar{x} = 3$. The table looks like

$x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
2       3           -1                  1
3       3           0                   0
5       3           2                   4
3       3           0                   0
2       3           -1                  1
$\sum (\bar{x} - x_i)^2$                6

(See Excel sheet ‘Gibbons’)

The standard deviation can then be calculated using $\sum (\bar{x} - x_i)^2 = 6$ and $n = 5$.

$$\sigma = \sqrt{\frac{\sum (\bar{x} - x_i)^2}{n}} = \sqrt{\frac{6}{5}} = 1.10$$

(to two decimal places)
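The table method translates directly into code. The sketch below follows the same steps for the gibbon data (mean, squared differences, divide by n, square root) and then compares the result with `statistics.pstdev`, Python's built-in population standard deviation.

```python
from math import sqrt
from statistics import pstdev

offspring = [2, 3, 5, 3, 2]

n = len(offspring)
x_bar = sum(offspring) / n                       # the mean

# One squared difference from the mean per data value (last table column)
squared_differences = [(x_bar - x) ** 2 for x in offspring]

# Population standard deviation: divide by n, then take the square root
sigma = sqrt(sum(squared_differences) / n)

print("Mean:", x_bar)
print("Sum of squared differences:", sum(squared_differences))
print(f"Standard deviation: {sigma:.2f}")
print(f"Check with pstdev:  {pstdev(offspring):.2f}")
```

Note that `statistics.stdev` (without the p) divides by n − 1 instead of n and so gives a slightly larger answer; the formula in these notes matches `pstdev`.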

The standard deviation is a good measure of spread, as it uses all of the data values. However, the standard deviation can only be used if the data is normally distributed (or approximately normally distributed). Every normal distribution has a mean and standard deviation. The mean determines where the peak of the curve is (the most likely outcome), the standard deviation how spread out the curve is. For example, the three curves below are all of normal distributions with a mean of 100, but with different standard deviations.


The height of the graph shows the likelihood of that result. For the small standard deviation, the result is very likely to be close to the mean of 100. A high value such as 120 is reasonably likely when 𝜎 = 20, rare when 𝜎 = 10 and has almost zero probability when 𝜎 = 5. For every normal distribution there is a precise relationship between the standard deviation and how spread out the data is:

- 68% of the values are within one standard deviation of the mean

- 95% of the values are within about two (in fact 1.96) standard deviations of the mean

- about 99% (more precisely 99.7%) of the values are within three standard deviations of the mean

These facts are more often used in reverse, for example to say that 5% of values are more than about two standard deviations away from the mean. Note that this 5% is ‘shared’ equally between values below the mean and above the mean, so in fact 2.5% of values are more than two standard deviations below the mean, and 2.5% more than two standard deviations above the mean.

As an example with numbers, suppose the distance a cricket ball is thrown is normally distributed with mean 50 metres, and standard deviation 5 metres

- 68% of throws are between 45 and 55 metres

- 95% of throws are between 40 and 60 metres

- about 99% (more precisely 99.7%) of throws are between 35 and 65 metres

This is shown in the graph below


When the data is normally distributed, any value that is more than two standard deviations away from the mean is considered to be unusual (an outlier). This means that 95% of data values are not considered unusual and the final 5% are. For the example of the cricket ball throws, any throw below 40 metres would be considered unusually short (two standard deviations below the mean) and any throw above 60 metres would be considered unusually far (two standard deviations above the mean). On average 2.5% of throws will be below 40 metres and 2.5% above 60 metres.

[Graph: normal curve for the cricket ball throws, horizontal axis marked at 35, 40, 45, 50, 55, 60 and 65 metres.]
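The percentages for the cricket ball example can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch for checking the rule of thumb, not part of the course method; the helper name `share_between` is made up for the example.

```python
from statistics import NormalDist

# Throw distance: mean 50 metres, standard deviation 5 metres
throws = NormalDist(mu=50, sigma=5)

def share_between(low, high):
    """Proportion of throws expected to land between low and high metres."""
    return throws.cdf(high) - throws.cdf(low)

# Within one, two and three standard deviations of the mean
for k in (1, 2, 3):
    low, high = 50 - 5 * k, 50 + 5 * k
    print(f"{low} to {high} metres: {share_between(low, high):.1%}")

# The two tails beyond two standard deviations are equal by symmetry
below_40 = throws.cdf(40)
above_60 = 1 - throws.cdf(60)
print(f"Below 40 metres: {below_40:.2%}, above 60 metres: {above_60:.2%}")
```

The exact figures are about 68.3%, 95.4% and 99.7%. The figure of 95% in the notes corresponds to 1.96 standard deviations, which is why exactly two standard deviations gives slightly more than 95% inside and slightly less than 2.5% in each tail.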


Skewed Distribution

A normal distribution is symmetrical, and has as many low numbers (relative to the mean) as high numbers. In a skewed distribution there is a tail in one direction, as shown below.

14. Calculate the standard deviation of the six numbers on a dice

The average is $\bar{x} = 3.5$, the sum is $\sum (\bar{x} - x_i)^2 = 17.5$ and the population size is $n = 6$ so

$$\sigma = \sqrt{\frac{\sum (\bar{x} - x_i)^2}{n}} = \sqrt{\frac{17.5}{6}} = 1.708$$

15. The number of accidents per year on a busy section of road is normally distributed with a mean of 120 accidents and a standard deviation of 20 accidents

(a) Find the probability of more than 120 accidents

(b) Find the probability of more than 140 accidents

(c) Find the probability of more than 160 accidents

(d) Find the probability of more than 180 accidents

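(a) 50%, as 120 is the mean and the distribution is symmetric

(b) 16%, as 32% of values are more than one standard deviation from the mean, and half of these are above the mean

(c) 2.5%, as 5% of values are more than two standard deviations from the mean, and half of these are above the mean

(d) About 0.5%, as about 1% of values are more than three standard deviations from the mean, and half of these are above the mean (more precisely 0.3% and 0.15%)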

16. The mean rainfall in October in Skye is 152 mm with a standard deviation of 24 mm. In 2012 there was 188 mm of rain. Is this an unusual amount of rain?

188 mm is 36 mm above the mean, which is only 1.5 standard deviations. This is less than two standard deviations, so 188 mm of rain is not considered unusual.

https://www.tes.com/teaching-resource/introduction-to-mean-and-standard-deviation-6132355

https://www.tes.com/teaching-resource/introduction-to-standard-deviation-via-pokemon-data-11417227


A distribution with a few very large values is called positively skewed. For example, the numbers

2 2 3 3 3 4 4 4 14 17

are positively skewed, as low numbers are more probable than high numbers so there is a long tail to the right. A distribution with a few very small values is called negatively skewed. For example, the numbers

3 5 20 20 21 21 21 21 22 22

are negatively skewed, as high numbers are more probable than low numbers so there is a long tail to the left.

For a skewed distribution

- the mean is no longer an appropriate measure of location (because it is overly influenced by the extreme values)

- the standard deviation is no longer an appropriate measure of spread (because with asymmetric data it is no longer true that for example 95% of values are within 2 standard deviations of the mean)

In the case of data which is not normally distributed, the appropriate measures of location and spread are the median and interquartile range (defined later in this section). This applies even if the data is not clearly positively or negatively skewed. The table below summarises which measure of location and spread to use.

                          Measure of Location   Measure of Spread     Shape of Graph
Normal distribution       Mean                  Standard Deviation    Bell Curve
Non-normal distribution   Median                Interquartile Range   (anything else)
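The two skewed examples above show why the mean is a poor measure of location for skewed data. In the sketch below the mean and median of each list are computed with Python's `statistics` module.

```python
from statistics import mean, median

positive_skew = [2, 2, 3, 3, 3, 4, 4, 4, 14, 17]         # long tail to the right
negative_skew = [3, 5, 20, 20, 21, 21, 21, 21, 22, 22]   # long tail to the left

for name, data in [("Positive skew", positive_skew),
                   ("Negative skew", negative_skew)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

In both cases the mean is dragged towards the tail (up to 5.6 for the first list, down to 17.6 for the second), while the median (3.5 and 21) stays with the bulk of the values.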

17. Determine which of the three distributions below is normally distributed, negatively skewed and positively skewed.

The first is negatively skewed, the second normally distributed, the third positively skewed

18. Visitor times to the website williamshatner.com were recorded and displayed in the table below. What is the appropriate measure of location and spread for this data?

Since the data is not normally distributed (it is positively skewed) the appropriate measures of location and spread are the median and the interquartile range.

https://www.tes.com/teaching-resource/investigating-skewness-6349338

Median

The median is an average which is appropriate when the data is not normally distributed. It is defined as the middle value, once the data is arranged from lowest to highest (or highest to lowest, which gives the same result). In each example below the median is found from the central value or values:

            n   Data              Median
Example 1   5   6 7 9 10 10       9
Example 2   5   6 7 9 10 1000     9
Example 3   6   6 7 7 9 10 10     8
Example 4   6   6 7 7 8 10 10     7.5

Note that when there is an even number of data values the two central values are used to generate a median, and their average (mean) is taken. The median can therefore be a value that does not appear on the list.

The median is an average that only cares about the middle values. This is a very different type of average to the mean, which cares about all the values equally. The advantage of using the median is that it is robust in the presence of extreme data values. In Example 2 above the median does not change even though the last data value goes from 10 to 1000. Because of this the median is preferred when little is known about the data (e.g. when it cannot be assumed to be normally distributed). The disadvantage of using the median is that it does not reflect the entire population as well as the mean.
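The four examples can be reproduced with `statistics.median`, which sorts the data and averages the two central values when there is an even number of them. The mean is printed alongside to show how differently the two averages react to the value of 1000.

```python
from statistics import mean, median

examples = {
    "Example 1": [6, 7, 9, 10, 10],
    "Example 2": [6, 7, 9, 10, 1000],
    "Example 3": [6, 7, 7, 9, 10, 10],
    "Example 4": [6, 7, 7, 8, 10, 10],
}

# Median of each example, keyed by its name
medians = {name: median(data) for name, data in examples.items()}

for name, data in examples.items():
    print(f"{name}: median = {medians[name]}, mean = {mean(data):.1f}")
```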


Interquartile range (IQR)

For data that is not normally distributed the standard deviation is not appropriate, so instead the interquartile range (IQR) is used. Just like the standard deviation, if the IQR is small that means the data is tightly clustered together, and if the IQR is large the data is spread out.

The interquartile range is the range between the quartiles, from $Q_1$ to $Q_3$:

$$IQR = Q_3 - Q_1$$

The quartiles divide the data into quarters, in a similar way to how the median divides it in half. Once the data is lined up in order from lowest to highest

- The lower quartile or first quartile (Q1) is the value 25% of the way along

- The median (Q2) is the value 50% of the way along

- The upper quartile or third quartile (Q3) is the value 75% of the way along

19. Find the median of each set of data

(a) 2 10 3 7 3 11 6

(b) 2 10 3 7 3 1000 6

(c) 2 10 3 7 3 1000 6 5

(a) When placed in order the numbers are 2,3,3,6,7,10,11.

Here 𝑛 = 7 so the median is the 4th value, i.e. 6

(b) When placed in order the numbers are 2,3,3,6,7,10,1000.

Here 𝑛 = 7 so the median is the 4th value, i.e. 6

Note that changing the largest value from 11 to 1000 does not affect the median

(c) When placed in order the numbers are 2,3,3,5,6,7,10,1000.

Here 𝑛 = 8 so the median is the average of the 4th and 5th value, i.e. 5.5

20. For each set of data above, determine if the mean or median is the most appropriate average

Data set (a) is approximately normally distributed (no outliers) so the mean should be used. Data sets (b) and (c) are both skewed (positively skewed, as they have one very large positive value) so the median should be used.

https://www.tes.com/teaching-resource/mean-median-mode-and-range-worksheet-6416468

https://www.tes.com/teaching-resource/mean-challenge-worksheet-6018605


To find the quartiles first find the median, then remove the median value (or pair of values) and find the centre of each of the halves. This gives 𝑄1 and 𝑄3. Four examples of this are given below.

Ex 1 (n = 15): 9 12 13 21 22 25 25 31 33 34 40 44 50 51 53
Q1 = 21, Q2 = 31, Q3 = 44

Ex 2 (n = 16): 9 12 13 21 22 25 25 31 33 34 40 44 50 51 53 62
Q1 = 21, Q2 = 32, Q3 = 50

Ex 3 (n = 17): 9 12 13 21 22 25 25 31 33 34 40 44 50 51 53 62 70
Q1 = 21.5, Q2 = 33, Q3 = 50.5

Ex 4 (n = 18): 9 12 13 21 22 25 25 31 33 34 40 44 50 51 53 62 70 75
Q1 = 21.5, Q2 = 33.5, Q3 = 52
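The method described above (find the median, remove the median value or central pair of values, then take the median of each half) can be sketched as a short function. The function name `quartiles` is made up for this example. Other conventions exist, so spreadsheet or library quartile functions may give slightly different values from this method.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by the method in these notes:
    remove the median value (odd n) or the central pair (even n),
    then take the median of each remaining half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        lower, upper = data[:half], data[half + 1:]      # drop the single median
    else:
        lower, upper = data[:half - 1], data[half + 1:]  # drop the central pair
    return median(lower), median(data), median(upper)

ex2 = [9, 12, 13, 21, 22, 25, 25, 31, 33, 34, 40, 44, 50, 51, 53, 62]
ex4 = ex2 + [70, 75]

for name, data in [("Ex 2", ex2), ("Ex 4", ex4)]:
    q1, q2, q3 = quartiles(data)
    print(f"{name}: Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
```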

As an example, suppose the weight in kg of 27 mountain goats is measured:

77 79 83 73 73 65 68 70 82 62 62 80 81 79 83 63 63 77 75 69 69 72 62 73 69 70 76 68

This data is then ordered and the quartiles calculated: 62 62 62 63 63 65 68 68 69 69 69 70 70 72 73 73 73 75 76 77 77 79 79 80 81 82 83 83

The minimum is 62, 𝑄1 = 68, 𝑄2 = 72, 𝑄3 = 77 and the maximum is 83. This means that one quarter of the mountain goats weigh from 62 to 68 kg, one quarter weigh from 68 to 72 kg, one quarter weigh from 72 to 77 kg, and one quarter weigh from 77 to 83 kg. This is displayed in the box plot below.

The interquartile range is 𝐼𝑄𝑅 = 𝑄3 − 𝑄1

𝐼𝑄𝑅 = 77 − 68

𝐼𝑄𝑅 = 9 𝑘𝑔


The IQR gives the range of the central 50% of values, from the 25% mark (𝑄1) to the 75% mark (𝑄3). This means it is not affected by the more extreme data values, so is more robust than the standard deviation.

21. The frame width of seventeen pairs of sunglasses, in centimetres, was measured as

11 8 9 10 10 11 13 12 11 10 11 11 12 12 8 7 9

Find the median, quartiles and interquartile range

Here 𝑛 = 17, the median is 11, the quartiles are 𝑄1 = 9 and 𝑄3 = 11.5, so the interquartile range is 2.5.

22. Data comparing the heights of boys and girls was recorded below

(note that ‘60’ means 1 metre 60 centimetres tall)

(a) Calculate the median and interquartile range for boys and girls

(b) Make two comparisons between the boys and girls

(a) For boys the median is 1 m 69 and the interquartile range is 4 cm. For girls the median is 1 m 65.5 and the interquartile range is 5 cm.

(b) The boys are taller on average (median 1 m 69 is above 1 m 65.5) and have more consistent heights (IQR of 4 is less than 5).

https://www.tes.com/teaching-resource/interpreting-box-plots-treasure-hunt-game-6124072 https://www.tes.com/teaching-resource/quartiles-and-box-plots-6278340 https://www.tes.com/teaching-resource/calculating-the-interquartile-range-iqr-6321362

1.3 Applying statistical skills to correlation and linear regression

Scatter Plot

A scatter plot is a way of representing two related sets of numerical data. For example, suppose the height and weight of 10 penguins are measured:

Penguin      Height in cm   Weight in kg
Paul         55             2.2
Percy        54             2.5
Pat          61             3.1
Penny        65             3.2
Polly        62             2.8
Pam          70             3.4
Penelope     56             2.5
Page         76             3.6
Phoebe       65             3.3
Priscilla    56             3.8

A scatter plot shows the relationship between height and weight more clearly:

(See Excel sheet ‘Penguins’)

The general trend is now visible: taller penguins weigh more than shorter penguins. The plot also exposes one possible outlier that does not fit the trend, the short heavy penguin in the top left of the plot (Priscilla).

The features of a scatter plot are

- Two labelled axes, representing different measurements

- Each data point represents one member of the population

[Figure: scatter plot ‘Height and Weight of Penguins’, with Height in cm on the horizontal axis and Weight in kg on the vertical axis]

Correlation Coefficient

Two variables are correlated if there is a relationship between them.

- Positive correlation means if one goes up the other goes up

- Negative correlation means if one goes up the other goes down

- No correlation means there is no relationship

Correlation can be approximately determined by looking at the scatter graph. Three examples are given below. Note that for the examples of positive and negative correlation, a line of best fit has been added (more on this later).

23. For journeys from London the distance in miles and price of a 2nd class train ticket are shown below

Destination   Miles   Cost
Hull          187     £55
Carlisle      315     £83
Brighton      67      £45
Coventry      107     £33
Glasgow       415     £120
Bury          306     £90
Liverpool     220     £55
Bath          115     £25
Perth         401     £112

(a) Plot the data in a scatter graph

(b) Identify the general trend

(c) Identify a possible outlier

The general trend is that longer distance journeys cost more, and the possible outlier is Brighton which is a short but relatively expensive journey.

In the example of the penguins above there was positive correlation, as taller penguins tended to be heavier. The exact strength of the correlation can be calculated. The correlation coefficient, 𝑅, which goes from −1 to 1, measures how close the points are to the line of best fit. It is therefore a measure of the strength of the linear correlation.

- 𝑅 = 1 represents perfect positive linear correlation (every point on the line of best fit)

- 𝑅 = 0 represents no linear correlation

- 𝑅 = −1 represents perfect negative linear correlation (every point on the line of best fit)

Some examples of scatter plots and the values of 𝑅 are given below.

If there is a strong correlation it means that the variable measured on the 𝑥-axis is a good predictor of the variable on the 𝑦-axis. For example, looking at the graph above with 𝑅 = 0.95, a large 𝑥-value almost certainly indicates a large 𝑦-value. In fact, the value of 𝑅² (not 𝑅) gives the proportion of variation in the 𝑦-variable which is attributable to the 𝑥-variable. As an example, suppose that with the penguin heights and weights 𝑅 = 0.7, so 𝑅² = 0.49. That means that 49% of the variation in the weight of a penguin is attributable to its height. So if a penguin is heavy that is partly (49%) explained by it being tall, and partly (51%) explained by other factors.

The correlation coefficient is normally calculated by computer.
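As a rough illustration of what the computer does, the sketch below calculates 𝑅 for the ten penguins using the standard Pearson formula. The notes do not state which formula the software uses, so treating 𝑅 as Pearson's coefficient is an assumption, and the function name `pearson_r` is made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy), built from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Heights (cm) and weights (kg) of the ten penguins, Paul to Priscilla
heights = [55, 54, 61, 65, 62, 70, 56, 76, 65, 56]
weights = [2.2, 2.5, 3.1, 3.2, 2.8, 3.4, 2.5, 3.6, 3.3, 3.8]

r = pearson_r(heights, weights)
print(f"R = {r:.3f}, R^2 = {r * r:.3f}")  # R = 0.615, R^2 = 0.378
```

In a spreadsheet the same value would normally come from a built-in correlation function rather than from a hand-written formula.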

24. Estimate the correlation coefficient for each scatter plot below

The first two graphs show positive correlation, with perhaps 𝑅 = 0.8 and 𝑅 = 0.4

The second two graphs show negative correlation, with perhaps 𝑅 = −0.5 and 𝑅 = −0.7

25. The size of a baby’s head and their intelligence were found to be correlated with 𝑅 = 0.1

(a) What does the correlation coefficient measure?

(b) Comment on the strength of this relationship

(c) What proportion of variation in intelligence can be attributed to head size?

(a) The strength of the linear relationship between head size and intelligence

(b) 𝑅 = 0.1 is a very weak positive correlation.

(c) 𝑅² = 0.01 so 1% of the variation in intelligence can be attributed to head size.

Line of Best Fit

A line of best fit summarises a scatter plot, and can be used to estimate missing data values

The graph above shows how the hours studied affect the grade for 35 pupils. There is a moderate positive correlation, indicated by the uphill blue line of best fit. When drawing the line of best fit by hand, note that

- It should have roughly the same number of points above as below

- It does not have to go through any of the data points exactly.

- It does not have to go through the origin of the graph

To determine the exact effect of extra study on the grade the equation of the line of best fit must be calculated, which is normally done by computer. The equation consists of two parts, the gradient (steepness) of the line and the intercept (where it crosses the vertical axis). The equation is of the form

𝑦 = 𝑚𝑥 + 𝑐

- 𝑥 represents the variable measured on the horizontal axis, the independent variable

- 𝑦 represents the variable measured on the vertical axis, the dependent variable

- 𝑚 is the gradient of the line, positive for positive correlation and negative for negative correlation

- 𝑐 is the intercept, which is the value of 𝑦 when 𝑥 = 0

In the example above the independent variable is the Hours studied and the dependent variable the Grade (called dependent as it depends on the hours studied). The gradient is about 5, because each extra hour of study increases the Grade by about 5 marks. The intercept is about 53, because studying for 0 hours still gives a Grade of 53. Hence the equation of the line of best fit is:

𝑦 = 𝑚𝑥 + 𝑐
𝑦 = 5𝑥 + 53

In words this is 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 𝐺𝑟𝑎𝑑𝑒 = 53 + (5 × 𝐻𝑜𝑢𝑟𝑠 𝑆𝑡𝑢𝑑𝑖𝑒𝑑)

25. The graph below shows the correlation between marks in the maths test and marks in the science test

(a) Using the line of best fit, estimate a pupil’s mark on the Science Test if they scored 0 on the Maths test

(b) Using the line of best fit, estimate a pupil’s mark on the Science Test if they scored 40 on the Maths test

(c) Using the information from part (a) and (b), work out the gradient and intercept of the line of best fit

Linear Regression

Linear regression consists of using the line of best fit as a simplified model of the relationship between two variables. If the correlation is strong (positive or negative) and one variable is known, the line of best fit can be used to make a prediction about the other variable. For example, a pupil who studied for six hours would be expected to get a Grade of

𝑦 = 5𝑥 + 53
𝐺𝑟𝑎𝑑𝑒 = (5 × 6) + 53
𝐺𝑟𝑎𝑑𝑒 = 83

A pupil who studied for six hours is estimated to get a grade of 83. Note that even though there are already pupils who studied for six hours who could be used for the prediction, the line of best fit is used instead, as it uses all the data points so is more accurate. The line of best fit can also be used in reverse to estimate how long someone studied for if, for example, they got a grade of 93

𝑦 = 5𝑥 + 53
93 = 5𝑥 + 53
𝑥 = 8

A pupil who got a grade of 93 would be estimated to have studied for 8 hours.

Interpolation means using the line of best fit to estimate missing data values that are among the existing values (inter means between or among). The two examples above are of interpolation.

Extrapolation means using the line of best fit to estimate values that are outside the current range (extra means beyond). For example, trying to estimate the effect of 11 hours of study

𝑦 = 5𝑥 + 53
𝐺𝑟𝑎𝑑𝑒 = (5 × 11) + 53
𝐺𝑟𝑎𝑑𝑒 = 108

This is not a sensible prediction, as the line of best fit was based on people who studied for less than 11 hours. Extrapolation should be avoided (or at least treated with caution), as it can lead to some unreliable results. For example, it may be that the maximum grade possible is only 100.
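The three calculations above can be reproduced with a short Python sketch. The gradient 5 and intercept 53 are the approximate values read from the graph; the function names are hypothetical and chosen only for this illustration.

```python
GRADIENT = 5     # marks gained per extra hour of study (approximate, from the graph)
INTERCEPT = 53   # predicted grade for 0 hours of study (approximate, from the graph)


def predicted_grade(hours):
    """Use the line of best fit forwards: hours studied -> grade."""
    return GRADIENT * hours + INTERCEPT


def estimated_hours(grade):
    """Use the line of best fit in reverse: grade -> hours studied."""
    return (grade - INTERCEPT) / GRADIENT


print(predicted_grade(6))    # 83  (interpolation)
print(estimated_hours(93))   # 8.0 (interpolation, line used in reverse)
print(predicted_grade(11))   # 108 (extrapolation, so not a reliable estimate)
```

The code will happily return a value for any input, which is exactly why extrapolation needs care: the formula does not know which inputs lie outside the range of the original data.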

26. The line of best fit between marks in the Maths test and Science test is found to be

𝑆𝑐𝑖𝑒𝑛𝑐𝑒 𝑇𝑒𝑠𝑡 𝑚𝑎𝑟𝑘 = ½ × (𝑀𝑎𝑡ℎ𝑠 𝑇𝑒𝑠𝑡 𝑚𝑎𝑟𝑘) + 10

(a) Estimate a pupil’s mark in the Science Test if they scored 4 in the Maths Test

(b) Estimate a pupil’s mark in the Science Test if they scored 24 in the Maths Test

(c) Estimate a pupil’s mark in the Maths Test if they scored 25 in the Science Test

(d) Estimate a pupil’s mark in the Maths Test if they scored 6 in the Science Test

(e) Which of the estimates above are likely to be accurate?

The marks in the Science Test are 12 in (a) and 22 in (b). The marks in the Maths Test are 30 in (c) and −8 in (d). This last estimate is not accurate because the line of best fit was based on scores in the Science Test that were all above 6 (so this is extrapolation), and also of course because a score of −8 is impossible!

27. The graph below shows how the age of some fun runners affects the amount of sponsorship money raised

(a) Estimate the correlation coefficient

(b) Using the line of best fit, estimate the amount raised by a 16-year-old and a 26-year-old

(c) Calculate the gradient and find the line of best fit (you can use that the intercept is −$180)

(d) Use the line of best fit to estimate the amount raised by a 13-year-old

(e) Why can the line of best fit not be used to estimate the amount raised by a 30-year-old?

(a) Strong positive correlation, so perhaps 𝑅 = 0.9

(b) For a 16-year-old about $300, and for a 26-year-old about $600

(c) The gradient is 30 (as in 10 years the amount raised goes up by $300). So the equation of the line of best fit is 𝑀𝑜𝑛𝑒𝑦 = 30 × 𝐴𝑔𝑒 − $180

(d) 30 × 13 − $180 = $210

(e) Because the line of best fit is only formed for 10 to 26 year olds, so that would be extrapolation and is unreliable

Outliers

An outlier is an unusual or extreme value. Because of the way a line of best fit is calculated, a single outlier can have a large effect on the line of best fit, and dramatically weaken the correlation coefficient. It is therefore important to consider removing outliers before calculating the line of best fit. To demonstrate this, the scatter plot of penguins is analysed with and without the outlier.

(See Excel sheet ‘Penguins’)

In the scatter plot above, which includes an outlier, the correlation coefficient is calculated as 𝑅 = 0.615, so 𝑅² = 0.378. This means that just 37.8% of the variation in the weight of the penguins is attributable to the difference in height (so if one penguin is heavier than another it’s probably partly due to being taller, but also due to other things too). The line of best fit has been found to have equation

𝑦 = 0.045𝑥 + 0.26

The best estimate of the weight of a penguin with height 55 cm is

𝑦 = 0.045𝑥 + 0.26
𝑦 = 0.045(55) + 0.26
𝑦 = 2.735 𝑘𝑔

[Figure: scatter plot ‘Height and Weight of Penguins’ including the outlier, Weight in kg against Height in cm, with line of best fit y = 0.045x + 0.26]

(See Excel sheet ‘Penguins’)

Above is the same plot, with the outlier removed. The correlation coefficient is recalculated as 𝑅 = 0.931, so 𝑅² = 0.866. This means that 86.6% of the variation in the weight of the penguins is attributable to the difference in height, which is a strong correlation. As expected, removing the outlier has led to a much stronger correlation. The line of best fit has also moved, as can be seen by the new equation.

𝑦 = 0.061𝑥 − 0.87

Now each extra cm of height corresponds to an extra 0.061 kg (61 grams), compared to 0.045 kg (45 grams) before. The reason for this big move is that the outlier was having a large effect on the shape of the line. Using this line of best fit the best estimate of the weight of a penguin with height 55 cm is

𝑦 = 0.061𝑥 − 0.87
𝑦 = 0.061(55) − 0.87
𝑦 = 2.49 𝑘𝑔

This is a lot lighter than the 2.735 kg calculated earlier. The reason for the drop in weight is that the very heavy outlier has been removed, which was pulling the line of best fit up.

Depending on the nature of the outlier, it may or may not be statistically valid to remove it. If it is caused by measurement or recording error, then of course it should be removed. But if it is a valid data point, and the only problem with it is that it makes the data look worse, then it should not be removed (or at least, if it is removed, this should be well documented and justified; you can’t just get rid of data points because they don’t fit your theory!)
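The effect of the outlier on the line of best fit can be checked with a short Python sketch. It assumes the line is the ordinary least-squares line (gradient Sxy / Sxx, passing through the point of means), which the notes do not state explicitly; the function name `least_squares` is made up for this example.

```python
def least_squares(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x   # the line passes through (mean_x, mean_y)
    return m, c


# Heights (cm) and weights (kg); the last entry is the outlier, Priscilla
heights = [55, 54, 61, 65, 62, 70, 56, 76, 65, 56]
weights = [2.2, 2.5, 3.1, 3.2, 2.8, 3.4, 2.5, 3.6, 3.3, 3.8]

m_all, c_all = least_squares(heights, weights)
m_cut, c_cut = least_squares(heights[:-1], weights[:-1])

print(f"with outlier:    y = {m_all:.3f}x + {c_all:.2f}")   # y = 0.045x + 0.26
print(f"without outlier: y = {m_cut:.3f}x + {c_cut:.2f}")   # y = 0.061x + -0.87

# Predicted weight of a 55 cm penguin under each line (unrounded coefficients)
print(round(m_all * 55 + c_all, 2), round(m_cut * 55 + c_cut, 2))
```

Because the code keeps the unrounded gradient and intercept, its predictions for a 55 cm penguin can differ in the last digit from those obtained by substituting into the rounded equations.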

[Figure: scatter plot ‘Height and Weight of Penguins’ with the outlier removed, Weight in kg against Height in cm, with line of best fit y = 0.061x − 0.87]

28. On a treadmill test patients of different ages recorded their peak heart rate.

Age   Peak heart rate
24    190
30    189
24    197
28    193
44    202
35    188
50    173
40    183
32    187
39    173

(See Excel sheet ‘Treadmill’)

(a) Plot the data above in a scatter plot, and identify the outlier

(b) Using all of the data, calculate the correlation coefficient, line of best fit, and estimate the peak heart rate of a 33-year-old patient

(c) Remove the outlier, and repeat the analysis above

(a) The outlier is the 44-year-old patient with a peak heart rate of 202

(b) The correlation coefficient is 𝑅 = −0.476 and the line of best fit is 𝑦 = −0.5131𝑥 + 205.25, meaning that each extra year of age lowers the peak heart rate by about half a beat. A 33-year-old patient has an estimated peak heart rate of 188

(c) Without the outlier 𝑅 = −0.886 and the line of best fit is 𝑦 = −0.8658𝑥 + 214.94, meaning that each extra year of age lowers the peak heart rate by nearly one beat. A 33-year-old patient has an estimated peak heart rate of 186

1.4 Applying statistical skills to data analysis, interpretation and communication

Hypothesis Test

A hypothesis is a prediction about the data.

- The null hypothesis (𝑯𝟎) is a statement that indicates nothing unusual is happening, for example ‘this coin produces exactly 50% heads’

- The alternative hypothesis (𝑯𝟏) is a statement about the data that might be true, for example ‘this coin produces more than 50% heads’.

Rather than try to prove the alternative hypothesis, which is generally not possible, the process is instead to show the null hypothesis is unlikely. This is an important and subtle point. In statistics nothing is ever proved, some things are just found to be highly unlikely. If the null hypothesis is sufficiently unlikely to be true, it is rejected and the alternative hypothesis accepted instead. If there is not enough evidence to reject the null hypothesis, it is not rejected. This does not mean that the null hypothesis is proved to be true, it is just not rejected (and it may yet be rejected in future experiments).

One-tailed and Two-tailed

There are two different types of alternative hypothesis. In a one-tailed test the alternative hypothesis is that the data is unusual in a particular direction only, for example, a coin having an unusually large number of Heads. In a two-tailed test the alternative hypothesis is that the data is unusual in either direction, for example, a coin having an unusually large number of Heads or Tails.

Because a two-tailed test has two different ways of passing (either too many Heads or too many Tails) it might seem that it is easier to pass. However, to compensate for this a more extreme result is required for a two-tailed test to reject the null hypothesis.

For example, suppose a coin is being tested for fairness by counting the proportion of Heads, and 𝐻0 is that the coin is fair. Any proportion of Heads that would happen only 5% or less of the time if the coin is fair is considered grounds for rejecting the null hypothesis and concluding that the coin is unfair.

- In a one-tailed test the null hypothesis is rejected if the number of Heads is unusually high, in particular in the top 5%

- In a two-tailed test the null hypothesis is rejected if the number of Heads is unusually low or high, in the bottom 2.5% or the top 2.5% of anticipated outcomes

On the left is a one-tailed test, on the right a two-tailed test. In each case the grey area represents where a data value must fall for the null hypothesis to be rejected. Note that for the one-tailed test there is only one grey region, but it is larger.

A second example is in drug trials. Consider the null hypothesis and three different alternative hypotheses below.

𝐻0 – the drug has no effect

𝐻1 – the drug has a beneficial effect (one-tailed)
𝐻1 – the drug has a harmful effect (one-tailed)
𝐻1 – the drug has an effect (two-tailed)

Significance

A key idea in statistics is significance. An event is called significant if the probability of it happening by chance is sufficiently low. For example, consider flipping a coin ten times, to determine if it is biased in favour of Heads. This is a one-tailed test.

𝐻0 – the coin is not biased
𝐻1 – the coin is biased in favour of Heads

Even if the coin is fair, it is unlikely that there will be exactly 5 Heads and 5 Tails. Indeed, it’s possible just by chance that a fair coin will produce 10 Heads in a row. The issue addressed by significance testing is how many Heads are required to reject the null hypothesis that the coin is fair. Intuition says that 7 or 8 Heads should not be considered unlikely, but 9 or 10 should. In statistics there is a more exact measure. The null hypothesis is rejected (and the coin is declared biased) if the result would only occur 5% of the time or less if the coin were fair. This is called 95% significance testing (as 95% of outcomes are accepted).

29. An investigation is done to determine whether it snows as often as expected on Christmas Day.

(a) State the null hypothesis

(b) State an alternative hypothesis for both a one-tailed and two-tailed test

(a) 𝐻0 – the number of White Christmases is as expected given the time of year

(b) 𝐻1 – the number of White Christmases is not what is expected given the time of year (two-tailed)

𝐻1 – the number of White Christmases is above (or below) what is expected given the time of year (one-tailed)

The threshold of 95% is chosen for practical reasons. It is high enough that it will only rarely be met by a coin which is actually fair, but not so high that it will never be met. If the coin actually is fair then the 95% threshold will still be met 5% of the time, hence 5% of the time the null hypothesis is rejected incorrectly. Thus if 20 tests are performed, all on fair coins, then on average one of those tests will produce a sufficiently unlikely number of heads to lead to the conclusion that the fair coin is biased! Or, alternatively, if 20 homeopathic medicines are tested to see if they are effective then on average one of them will pass the test! This is unfortunate but necessary, else nothing would ever be declared significant. For more stringent testing, a 99% (or higher) significance level can be used.

The 95% threshold for a fair coin can be calculated exactly. The distribution of possible outcomes is given by the table below, and is also shown in the graph.

(See Excel sheet ‘Coins’)

Note that, as expected, the most likely number of heads is 5, and the extreme values are much less likely. The graph is symmetrical, and actually is very close to a normal distribution. Note also that even though 5 is the most likely number of Heads in 10 throws, it will still only happen about 25% of the time. For a 95% significance test a result that only happens 5% of the time if the coin is fair causes the null hypothesis to be rejected. Using the data from the table:

[Figure: bar chart ‘Number of Heads in 10 throws’, with Number of Heads (0 to 10) on the horizontal axis and Probability on the vertical axis]

Number of Heads   Probability
0                 0.1%
1                 1.0%
2                 4.4%
3                 11.7%
4                 20.5%
5                 24.6%
6                 20.5%
7                 11.7%
8                 4.4%
9                 1.0%
10                0.1%

- The probability of getting 10 heads is 0.1%. This is less than 5%, so 10 heads would be enough to reject the null hypothesis.

- The probability of getting 9 (or more) heads is 1.0 + 0.1 = 1.1%. This is less than 5%, so 9 heads would be enough to reject the null hypothesis. (Note that the probability for 9 or more Heads is added, not just 9 Heads.)

- The probability of getting 8 (or more) heads is 4.4 + 1.0 + 0.1 = 5.5%. This is more than 5%, so 8 heads would not be enough to reject the null hypothesis.
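These tail probabilities can be checked exactly with the binomial distribution. The sketch below is a minimal illustration in Python; the function name `prob_at_least` is made up for this example. It works from exact probabilities rather than the rounded table entries, so its totals may differ in the last digit from sums of rounded percentages.

```python
from math import comb


def prob_at_least(k, n=10, p=0.5):
    """Probability of k or more Heads in n throws of a coin with P(Head) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for heads in (10, 9, 8):
    tail = prob_at_least(heads)
    verdict = "reject H0" if tail <= 0.05 else "do not reject H0"
    print(f"{heads} or more Heads: {tail:.1%} -> {verdict}")

# Critical value: the smallest number of Heads whose tail probability is 5% or less
critical = min(k for k in range(11) if prob_at_least(k) <= 0.05)
print("critical value:", critical)
```

The same function can be reused for other numbers of throws by changing `n`, which shows how the critical value depends on the size of the experiment.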

Thus the 95% significance level requires at least 9 out of 10 heads to declare a coin biased. Note that although 8 out of 10 heads is not enough evidence to reject the null hypothesis it does not mean that the coin is fair, simply that there is not enough evidence to show it’s biased. This is an important and subtle difference. The coin may well still be biased, there just isn’t enough evidence yet.

For this example, the test-statistic is simply the number of Heads the coin produces. The critical value of the test-statistic is 9 Heads out of 10. For other tests the calculation of the test-statistic may be more complicated, and the critical value will often be found by looking up in a statistical table. In general, if the value of the test-statistic reaches or exceeds the critical value the result is considered significant.

With the advent of computer software this process has been somewhat simplified. Most computer software will now analyse the data in even more detail, and generate a p-value. The p-value is the probability of getting a result at least as extreme as the observed sample, given that the null hypothesis is true. At a 95% significance level, if the p-value is below 0.05 then the null hypothesis is rejected. This is because a p-value of less than 0.05 indicates a result that happens 5% of the time or less. With a p-value there is no need to compare the test-statistic with a critical value, as this has already been done. Note that for a two-tailed test the computer software will adjust the p-value so that 0.05 still corresponds to a 95% significance level.

All significance tests therefore follow the same basic pattern

- State the null and alternative hypothesis, choosing if the test is one-tailed or two-tailed

- Choose a significance level, such as 95%

- Run the test to generate a test-statistic and compare it to a critical value, or get a p-value

- State either the test-statistic and critical value, or the p-value

- Interpret if the result is significant and draw a conclusion

Z-test

The z-test is the simplest significance test. It is used to determine if something is unusual, by measuring how many standard deviations it is away from the mean. This has many applications, for example to determine if

- a single data-value is unusual

- a sample average is different from the population average

- two samples have different means

The third of these is the most applicable to Higher Statistics, and is done by calculating the mean difference between the samples (expected to be 0 if the null hypothesis is true) and the standard deviation of this difference. In reality this is calculated by a computer, to produce a p-value, which is used to determine if a result is significant.

For example, it was conjectured that, within one hospital, Doctors don’t smoke very much. 50 Doctors and 200 non-Doctors were surveyed. A one-tailed test was chosen with a 95% significance level.

𝐻0 – the proportion of Doctors who smoke is the same as non-Doctors
𝐻1 – the proportion of Doctors who smoke is lower than non-Doctors

Group         Smokers   Sample size
Doctors       5         50
Non-Doctors   36        200

30. To determine if there is a correlation between the cost of making a film and the amount of gross profit an audit was made of all major films released in 2016.

(a) State a null hypothesis and one-tailed alternative hypothesis

(b) Assuming an analysis on Excel produced a p-value of 0.04, interpret this result

(c) Assuming an analysis on Excel produced a p-value of 0.06, interpret this result

(a) 𝐻0 – there is no correlation between the production cost and gross profits of films

𝐻1 – there is a positive correlation between the production cost and gross profits of films, i.e. films that are more expensive to make tend to make more money

(b) 𝑝 < 0.05 so the null hypothesis is rejected at the 95% level, and there is evidence here that more expensive films make more money

(c) 𝑝 > 0.05 so the null hypothesis is not rejected at the 95% level, and there is no strong evidence here that more expensive films make more money

Although only 10% of Doctors smoke compared to 18% of non-Doctors, the z-test returns a p-value of 0.1718 (this is the two-tailed value; the one-tailed value is half this, about 0.086), which is above 0.05, so the null hypothesis cannot be rejected.

The z-test is only applicable

- when the sample mean can be assumed to be normally distributed, which in practice means the sample has to be large (above 30)

- when the population standard deviation is known (typically because the sample is the entire population)

If these conditions are not met a t-test must be used instead (see later).
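The Doctors example can be reproduced with a short Python sketch of a two-proportion z-test. The notes do not say exactly how the software calculates the test, so this assumes the common pooled-proportion version; the function names are made up for this illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, one-tailed p for 'group 1 is lower', two-tailed p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # assumed: a common proportion under H0
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    p_lower = normal_cdf(z)
    p_two = 2 * min(p_lower, 1 - p_lower)
    return z, p_lower, p_two


# 5 smokers out of 50 Doctors, 36 smokers out of 200 non-Doctors
z, p_one, p_two = two_proportion_z_test(5, 50, 36, 200)
print(f"z = {z:.2f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

Under these assumptions the two-tailed p-value comes out at about 0.17 and the one-tailed value at about half that. Both are above 0.05, so the conclusion is the same either way.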

T-test

A t-test is used when a z-test is not applicable. This is when

- The sample is small (below 30)

- The population standard deviation is not known

Just like the z-test, a t-test works by calculating the value of the test-statistic and comparing it to the critical value to see if the result is significant, or just calculating a p-value and seeing if it is sufficiently low (below 0.05 for a 95% significance test). The t-test is used (among other things) to determine if two populations are different, when only a sample is available. There are two types of t-test used in this course.

31. It was conjectured that more women than men live to one hundred years old.

(a) State a null and alternative hypothesis

(b) An audit found that 23 out of 9,500 women and 11 out of 10,000 men were centenarians. In the sample, did more women live to 100?

(c) What statistical test should be used to determine if there is a significant difference?

(d) A z-test returns a p-value of 0.0146. Interpret this.

(a) 𝐻0 – the proportion of men and women who become centenarians is the same

𝐻1 – the proportion of female centenarians is greater (one-tailed test)

(b) In the sample, 0.24% of women and 0.11% of men reach 100, so in the sample at least there are more female than male centenarians.

(c) A z-test, as proportions are being compared and both samples are large so can be assumed to be normally distributed

(d) This p-value is below 0.05, so the result is significant at the 95% level. There is evidence to reject the null hypothesis and conclude that more women than men live to 100 years old.

In an unpaired t-test two independent samples are compared (to determine if their means are similar). In a paired t-test the same sample is measured twice, to see if the mean has changed. This is called a dependent sample.

Unpaired t-test

An unpaired t-test attempts to determine if two samples come from populations with significantly different means. It does not require knowledge of the population means or the population variances. As an example, does a sample of twenty rugby players and ten hockey players indicate that rugby players and hockey players differ in average weight? Note that the two samples are not the same size, but can still be compared. A two-tailed hypothesis test is set up:

𝐻0 – rugby and hockey players weigh the same
𝐻1 – rugby and hockey players have different weights

The data is collected:

Sample           Sample size   Sample mean weight   Sample standard deviation
Rugby Players    𝑛1 = 20       𝑥̄1 = 78.4 kg         𝑠1 = 8.5 kg
Hockey Players   𝑛2 = 10       𝑥̄2 = 67.1 kg         𝑠2 = 7.9 kg

(See Excel sheet ‘Sports’)

The t-value is calculated using a computer, using the option t-test with unequal variances. This produces a t-value of 3.56. By looking up a table of values it could be determined that this is above the critical value, or in fact the computer program might just produce a p-value. In this case the p-value (for a two-tailed test) is 0.002 (about 0.2%) so the result is significant, and the null hypothesis is rejected. There is evidence here that rugby and hockey players have different weights.

Paired t-test

In a paired t-test the same sample is tested twice, at different times. This is typically to determine if there has been a change, for example an improvement in health after some intervention. In a paired test the two samples must be the same size.

As an example, the number of chin ups that eight athletes can do is measured before and after a week of intensive training. A 95% significance level is chosen and the null and alternative hypothesis are stated:

𝐻0 – training makes no difference
𝐻1 – training improves performance (one-tailed)

The results in detail are:

Athlete             Before   After   Improvement
Abby                10       12      2
Ally                11       13      2
Andy                9        10      1
Adrianna            8        7       -1
Amelie              12       12      0
Arthur              15       18      3
Anthony             13       15      2
Ashley              4        5       1
Mean Improvement                     1.25
StDev Improvement                    1.28

(See Excel sheet ‘Training’)

The value of the t-statistic is 𝑡 = 2.76, which is above the critical value of 1.895 (found in a lookup table), so the null hypothesis is rejected and it can be said that training improves performance. Alternatively, the computer program will provide a p-value of about 0.014 (one-tailed), which is below 0.05, indicating again that the result is significant.
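The paired calculation can be sketched in Python directly from the Before and After columns as printed in the table. This is an illustration only: it assumes the standard deviation is the sample standard deviation (dividing by 𝑛 − 1), and the variable names are made up for this example.

```python
import math

# Chin ups before and after training, Abby to Ashley
before = [10, 11, 9, 8, 12, 15, 13, 4]
after = [12, 13, 10, 7, 12, 18, 15, 5]

# A paired test works on the individual improvements, not the two columns separately
improvements = [a - b for b, a in zip(before, after)]
n = len(improvements)
mean_imp = sum(improvements) / n
sd_imp = math.sqrt(sum((d - mean_imp) ** 2 for d in improvements) / (n - 1))

# t-statistic: mean improvement divided by its standard error
t_stat = mean_imp / (sd_imp / math.sqrt(n))

CRITICAL_VALUE = 1.895  # one-tailed, 95% level, n - 1 = 7 degrees of freedom
print(f"mean = {mean_imp:.2f}, sd = {sd_imp:.2f}, t = {t_stat:.2f}")
print("reject H0" if t_stat > CRITICAL_VALUE else "do not reject H0")
```

A spreadsheet or statistics package would normally also report the p-value; that step is left out here because it needs the t-distribution, which is not in Python's standard library.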

32. In each case, determine which test is appropriate (z-test, paired t-test, unpaired t-test)

(a) Determining if more male than female pupils own pets in one particular class

(b) Determining if coconuts weigh more than pineapples, based on a sample of 50 of each

(c) Determining if eggs change weight when they are cooked

(a) A z-test for proportions

(b) An unpaired t-test

(c) A paired t-test

33. To determine if holding your breath for a minute decreases heart rate a one-tailed paired t-test was used with a 95% significance level.

(a) Interpret the result if p = 0.03

(b) Interpret the result if p = 0.06

(a) Since p < 0.05 the null hypothesis can be rejected, and there is evidence that holding your

breath decreases heart rate

(b) Since p > 0.05 the null hypothesis cannot be rejected, and there is no evidence that

holding your breath decreases heart rate



Confidence interval

A 95% confidence interval is a range, calculated from a sample, within which the true population value is expected to lie 95% of the time. For example, based on a small sample the 95% interval for the average weight of eggs marked as Medium might be (150,160), meaning that there is a 95% chance that the mean weight of the eggs is between 150 and 160 grams. If a larger sample is taken, this 95% confidence interval could be reduced to perhaps (156,158).

A typical use for a confidence interval is after a t-test has been used to compare two samples and determine if the population means are different. A computer program will produce a 95% confidence interval for the difference between the means. If this interval includes 0, then it is possible that there is no difference between the means, so the result is not significant.

As an example, suppose the average price of a can of Coke was 75 pence and the average price of a can of Pepsi was 68 pence. On average Coke costs 7 pence more. However, this 7 pence difference is only based on a sample, and so has an error attached to it. The 95% confidence interval for the difference in prices is calculated, usually by computer. There are two possibilities:

- The 95% confidence interval for the price difference includes zero, for example (-1, 15). This means that it is possible the average difference is in fact zero, so it can’t be concluded that Coke really is more expensive.

- The 95% confidence interval excludes zero, for example (3, 10). Since zero is not a plausible value for the difference, it can be concluded that Coke really is more expensive.
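To show what the computer is doing, the sketch below builds a 95% confidence interval for a mean difference and checks whether it includes zero. No price data is given for the Coke and Pepsi example, so it reuses the Improvement column from the chin-up table in the paired t-test example. The multiplier 2.365 is the two-sided 95% t-table value for 7 degrees of freedom, and the function name `confidence_interval` is made up for illustration.

```python
import math
import statistics

# Improvement column from the chin-up example (After minus Before).
improvements = [2, 2, 1, -1, 0, 3, 2, 1]

# Two-sided 95% t-table value for 7 degrees of freedom (n - 1 = 7).
T_CRITICAL_95 = 2.365


def confidence_interval(values, t_critical):
    """Return (lower, upper) limits: mean plus or minus t_critical standard errors."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    margin = t_critical * standard_error
    return mean - margin, mean + margin


lower, upper = confidence_interval(improvements, T_CRITICAL_95)
includes_zero = lower <= 0 <= upper

print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")
print("includes zero" if includes_zero else "excludes zero")
```

An interval that excludes zero corresponds to the second possibility in the list: a difference of zero is not plausible at the 95% level.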

Note that in the popular press, confidence intervals are virtually never quoted, and there is no significance testing. So when it is reported, for example, that the unemployment rate has dropped by 0.1%, this may in fact have a 95% confidence interval of (-0.02%, 0.18%), meaning that nothing meaningful can be said about whether the unemployment rate has changed.


34. It was conjectured that rabbits live longer than guinea pigs.

(a) State a suitable test

(b) An unpaired t-test found that a sample of rabbits lived on average 1.2 years longer than a sample of guinea pigs. The p-value was 0.03. Interpret this result

(c) The 95% confidence interval for how much longer rabbits live on average was (0.8, 2.1). How does this support the p-value?

(a) An unpaired t-test (one-tailed)

(b) Since p < 0.05 the result is significant at the 95% level so there is evidence to reject the null hypothesis, and conclude that rabbits live longer than guinea pigs

(c) The 95% confidence interval does not include 0. This indicates that it is not plausible (at a 95% level) for the difference to be 0, which supports the p-value in indicating there is evidence to reject the null hypothesis.

35. It was conjectured that cutting gluten from the diet causes weight loss in a group of 30 subjects

(a) State a suitable test

(b) A paired t-test found subjects lost an average of 0.3 kg, with a p-value of 0.17. Interpret this result

(c) The 95% confidence interval for weight lost was (-2.4, 2.9). How does this support the p-value?

(a) A paired t-test (one-tailed)

(b) Since p > 0.05 the result is not significant at the 95% level and there is insufficient evidence to reject the null hypothesis, so it cannot be concluded that cutting gluten from the diet causes weight loss

(c) The 95% confidence interval includes 0. This indicates that it is plausible (at a 95% level) for the difference to be 0, which supports the p-value in indicating there is not enough evidence here to reject the null hypothesis.

Sampling error

Statistics is the art of using a sample to try and understand a population. It should therefore be understood that any conclusions about the population drawn from the sample are necessarily approximate, and could be different if a different sample were chosen. There are two types of error:

- Sampling error occurs if the sample is not typical of the population, and accidentally introduces some bias (a problem with the design of the sample)

- Non-sampling error occurs if data is incorrectly recorded, or there are gaps in the data as it was not possible to sample fully (a problem with the execution of the sample)

A common cause of sampling error is taking the whole sample at one time. Sampling error is reduced by taking a large sample, although this is time consuming, so instead an attempt is made to describe the uncertainty, for example with confidence intervals.
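The effect of sample size can be illustrated with a small simulation. This is only a sketch: the population is invented (10,000 values with mean 60 and standard deviation 5), and the function name `spread_of_sample_means` is made up for illustration. It draws many random samples of a given size and measures how much their means vary from one sample to the next.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

# Invented population for illustration only: 10,000 values, mean 60, SD 5.
population = [random.gauss(60, 5) for _ in range(10_000)]


def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the means of `repeats` random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


small_sample_spread = spread_of_sample_means(10)
large_sample_spread = spread_of_sample_means(100)

print(f"spread of means, samples of 10:  {small_sample_spread:.2f}")
print(f"spread of means, samples of 100: {large_sample_spread:.2f}")
```

The means of the larger samples should vary much less, which is why a larger sample gives a narrower confidence interval.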

36. A survey at my local Post Office last Tuesday found that 85% of respondents agreed that

“Young people have much less respect for their elders than they used to”. Does the figure of

85% represent the whole population?

(a) State two possible causes of sampling error

(b) State two possible causes of non-sampling error

(a) The sample is very limited. Because the sample was made at just one location it does not

reflect the whole population well. The sample was also just made at one time.

(b) Respondents may not have filled in the survey honestly. The figure of 85% may have been

reached by a counting error. Some surveys may have got lost.

