
Post on 11-Jan-2016


SQA Statistics
David Young
Department of Mathematics and Statistics, University of Strathclyde
Royal Hospital for Sick Children, Yorkhill NHS Trust

Introduction/Overview
• Course content
• Computer labs – usernames and software (Minitab, Excel, R)
  www.minitab.com
• Different approach than AH Statistics
• Emphasis on applications but theory needed to apply the correct approach – not mathematically intensive
• Applications of statistics to your own field of expertise e.g. Geography, Modern Studies, Psychology, Science
• http://personal.strath.ac.uk/david.young/SQA/
• Lab sessions – opportunity to talk to staff from other disciplines

What is Statistics?
The American Heritage Dictionary defines statistics as …
"A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of data."

Introduction
• what is statistics and why do we need it?
• statistics is the science of collecting, analysing, presenting and interpreting data
• it enables the objective evaluation of research questions of interest
• it provides the means to weigh up how much evidence the collected data provide for and against the research hypothesis of interest

Examples of Research Hypotheses
• Can the symptom of oral dryness (xerostomia) in terminal cancer patients be relieved using a mucin-containing oral spray?
• Is a new painkilling drug more effective than the best alternative currently on the market?
• How common is the problem of aggression towards hospital staff?
• How does deprivation impact life expectancy in Scotland?
• How is tourism impacting wildlife in the Galapagos Islands?

Statistics and Medical Research
• statistics plays an increasingly important role in research
• it is not possible, for example, to have a new drug treatment approved for use without solid, statistical evidence to support claims of efficacy and safety
• over the last few decades, many new statistical methods have been developed which have particular relevance for researchers in different fields e.g. psychology, clustering, big data
• these methods can be applied routinely using statistical software packages

Importance of Statistics
• all researchers should understand some basic statistical concepts to ensure …
• appropriate study design and data collection
• application of the correct method of statistical analysis when using software
• accurate and honest reporting of data gathered
• adequate understanding of claims made by other researchers when reviewing reports


Basic Terminology
• data: values of the variable(s) of interest recorded on one or more individuals e.g. age, gender, duration of illness
• population: the large group of individuals under study (e.g. all cancer patients)
• sample: a subset of the population of interest ‘selected’ as being representative of the population as a whole

Sampling
• draw useful conclusions about a population of interest
• ‘target population’ too large – practical problems in terms of time, money, staff, resources
• study a sample of the population of interest
• use the sample of individuals to infer something useful about the underlying population
• vast amounts of data can be collected which should be summarised in a useful way – graphically or numerically

Sampling
[Diagram: POPULATION, SAMPLE, inference]
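
The idea of studying a sample and inferring something about the population can be sketched in a few lines of Python. This is only an illustration: the population below is simulated and hypothetical (it is not data from the course), and the names `population` and `sample` are chosen for this sketch.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated lengths of stay in days (not real data).
population = [random.randint(1, 30) for _ in range(10_000)]

# Study a simple random sample of 100 individuals instead of the whole population.
sample = random.sample(population, k=100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

In practice the population mean is unknown. It is computed here only to show that a random sample of modest size usually gives an estimate close to it.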

Types of Data
categorical data
• nominal – the data can be classified into a number of specific categories with no particular ordering e.g. eye colour, blood type, marital status, gender
• ordinal – the data can be classified into a number of specific categories which can be placed in some order of importance e.g. pain scores (mild, moderate, severe or unbearable), deprivation category score within Glasgow (ranges 1–7 from affluent to poor, classified by postcode)

Types of Data
numerical data
• discrete – data are recorded as a whole number and usually only take specific values e.g. age in years, number of cigarettes smoked in a day, number of children
• continuous – data are recorded to the precision of the measuring instrument and usually take any value within a certain range e.g. height, weight, blood pressure

Problem 1
Which of the following are categorical variables?
(a) gender
(b) no. of children
(c) diastolic blood pressure
(d) diagnosis
(e) height

Problem 2
Which of the following are continuous numerical variables?
(a) blood type
(b) peak expiratory flow rate
(c) age last birthday
(d) exact age
(e) family size

Graphical Summaries
• enable checking of underlying assumptions required for any formal analysis (e.g. are data normally distributed?)
• help to identify patterns or outliers
• give an idea of relationships between variables
• interpretation of patterns is subjective
• beware of data dredging

[Figure: Pie Chart]

[Figure: Bar Chart]

[Figure: Scatter Plot]

[Figure: Histogram]

[Figures: Types of Histogram (four example slides)]

Numerical Summaries
• measures of location
  – mean, median, mode
• mean (average value)
  – if there are n observations x_1, x_2, …, x_n in a data set then the mean is calculated as
    $$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
  – which can be written mathematically as
    $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Measures of Location
• median (‘middle value’): to calculate the median …
  – arrange the data in ascending order
  – if n is odd then the median is the middle value
  – if n is even then the median is the average of the ‘two middle’ values
• e.g. 15 17 17 19 24 36 42 has median value 19
• e.g. 1 2 5 7 7 8 has a median value (5+7)/2=6
• the median is often denoted as Q2
• mode (most common value)
  – in the two examples above the modal values are 17 and 7 respectively
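
The two small data sets above can be checked with Python's standard `statistics` module. This is a minimal sketch: the variable names `odd_n` and `even_n` are chosen for this illustration only.

```python
import statistics

odd_n = [15, 17, 17, 19, 24, 36, 42]   # n = 7, so the median is the 4th value
even_n = [1, 2, 5, 7, 7, 8]            # n = 6, so the median averages the 3rd and 4th values

for data in (odd_n, even_n):
    print("mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```

In the first data set the mean (about 24.3) sits well above the median (19), because the two largest values pull it upwards.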

Measures of Spread
• measures of spread
  – variance, standard deviation, range, inter-quartile range
• variance
  – the variance of n observations x_1, x_2, …, x_n is given by
    $$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
• standard deviation
  – s (i.e. the square root of the variance) is called the standard deviation and has the same units of measurement as the data
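
As a worked illustration, the sample variance (with divisor n − 1) and the standard deviation can be computed step by step in Python. The data set is the second median example from the earlier slide; the variable names are chosen for this sketch.

```python
import math
import statistics

data = [1, 2, 5, 7, 7, 8]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)

print(variance, round(std_dev, 3))

# The standard library functions use the same n - 1 divisor.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))
```

Here the squared deviations sum to 42, so the variance is 42/5 = 8.4 and the standard deviation is about 2.9, in the same units as the data.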

Measures of Spread
• range
  – the difference between the maximum and minimum values in a data set
• inter-quartile range
  – arrange the data in ascending order and calculate the upper (Q3) and lower quartiles (Q1)
  – the quartiles, along with the median, divide the data set into four equal parts (i.e. quarters)
  – the inter-quartile range is Q3-Q1
• unlike the range, the inter-quartile range is unaffected by outliers
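
A short Python sketch can show how an outlier affects the range but not the inter-quartile range. Two things here are assumptions made for illustration: the helper `spread` is a name invented for this sketch, and the quartiles use the `"exclusive"` method of `statistics.quantiles`, which places Q1 at position (n+1)/4 in the ordered data. Other software may use a different quartile convention and give slightly different values.

```python
import statistics

def spread(values):
    """Return (range, inter-quartile range) for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return max(values) - min(values), q3 - q1

data = [15, 17, 17, 19, 24, 36, 42]
with_outlier = [15, 17, 17, 19, 24, 36, 420]  # hypothetical recording error in the largest value

print(spread(data))
print(spread(with_outlier))
```

Changing the largest value from 42 to 420 moves the range from 27 to 405, while Q1 = 17 and Q3 = 36 stay where they were, so the inter-quartile range remains 19.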

Example 1
• the following numbers refer to the length of time spent in hospital (days) for 7 patients after a particular operation …
  2 2 3 2 15 1 3
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?

Example 2
• the following numbers are the ages of children admitted to an outpatient department with burns …
  7 6 6 9 10 8 7 7 7 7 6
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?

Relationship between Mean and Median
[Figure: length of stay (days), axis 30 to 120; Mean = 66.47, Median = 53.00]

Choice of Summary Statistics
• for data which are normally distributed (or symmetric), the mean and standard deviation are the appropriate measures of location and spread respectively
• when data are skewed, the mean and standard deviation are influenced by outliers and the appropriate measures of location and spread for reporting this type of data are the median and inter-quartile range
• note that for normally distributed data (and symmetric data) the mean and median will be approximately the same
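
The effect of a single extreme value on these summaries can be illustrated with the length-of-stay data from Example 1. This is a sketch only: setting aside the 15-day stay is done purely for comparison, and the quartiles use the `"exclusive"` convention of `statistics.quantiles`, which is an assumption (other software may differ slightly).

```python
import statistics

stays = [2, 2, 3, 2, 15, 1, 3]               # days in hospital, from Example 1
without_15 = [x for x in stays if x != 15]    # same data with the 15-day stay set aside

for label, data in (("all 7 patients", stays), ("without the 15-day stay", without_15)):
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    print(label,
          "| mean:", round(statistics.mean(data), 2),
          "sd:", round(statistics.stdev(data), 2),
          "| median:", statistics.median(data),
          "IQR:", q3 - q1)
```

With the 15-day stay included, the mean rises from about 2.2 to 4 and the standard deviation from under 1 to almost 5. The median stays at 2 and the inter-quartile range barely moves.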

Problem 3
When a distribution is skewed to the right …
(a) the median is greater than the mean
(b) the median is equal to the mean
(c) the tail on the left is shorter than the tail on the right
(d) the standard deviation is less than the variance
(e) the majority of observations are less than the mean

Age and Weight Minitab Example

Descriptive Statistics: Age, Weight

Variable   N    Mean   StDev  Minimum      Q1  Median      Q3  Maximum
Age       20  34.150   3.602   27.000  31.250  34.500  36.750   42.000
Weight    20   65.65   12.04    52.00   56.00   63.00   74.25    92.00

[Figure: Histogram of Age Variable]

[Figure: Boxplot of Age Variable]

[Figure: Histogram of Weight Variable]

[Figure: Boxplot of Weight Variable]

The Normal Distribution
• the normal (or Gaussian) distribution is the most important of all the distributions since it has a wide range of practical applications
• the characteristic bell-shape of this distribution describes many distributions which occur in practice
• the distribution is symmetric about the mean
• set proportions of the area under the curve fall within set limits from the mean …
  – 68.3% lies within one standard deviation of the mean
  – 95.4% lies within two standard deviations of the mean
  – 99.7% lies within three standard deviations of the mean
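
The proportions quoted above (roughly 68%, 95% and 99.7%) can be checked numerically. The sketch below uses the standard result that, for a normal distribution, the area within k standard deviations of the mean equals erf(k/√2). The function name `proportion_within` is chosen for this illustration.

```python
import math

def proportion_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
```

These proportions do not depend on the particular mean or standard deviation, which is why the same limits apply to any normally distributed variable.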

[Figure: Histogram of Normal Data]

Uses of the Normal Distribution
• many physical measurements are closely approximated by the normal distribution, particularly measurements in which there is natural variation (e.g. some biological measurements like height and weight are normally distributed)
• data which are normally distributed can be more easily analysed by using parametric methods of statistical testing (see later)

Example
• In a published paper it is reported that the mean age of 500 injecting drug users in Glasgow is 28 years with a standard deviation of 5 years.
• Assuming that the authors have reported the correct summary statistics for the data they have gathered, what can you assume about the distribution of the ages?
• What do the reported summary statistics tell you about the location and spread of the ages of injecting drug users in Glasgow?

Interpretation
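
One possible reading of the reported figures is sketched below. It assumes that the ages are approximately normally distributed, which the choice of mean and standard deviation as summaries suggests but which the summary statistics alone cannot confirm. The variable name `ages` is chosen for this sketch.

```python
from statistics import NormalDist

ages = NormalDist(mu=28, sigma=5)  # assumption: ages are approximately normal

for k in (1, 2, 3):
    low, high = 28 - k * 5, 28 + k * 5
    share = ages.cdf(high) - ages.cdf(low)
    print(f"{low} to {high} years: about {share:.1%} of users")
```

Under that assumption, roughly two thirds of the users would be aged between 23 and 33, and about 95% between 18 and 38.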

Problem 4
In general, which of the following statements is FALSE?
(a) the sample mean is more sensitive to extreme values than the median
(b) the sample inter-quartile range is more sensitive to extreme values than the standard deviation
(c) the sample standard deviation is a measure of spread around the sample mean
(d) the sample standard deviation is a measure of central tendency around the median
(e) if a distribution is normal, then the mean will be equal to the median

Problem 5
If diastolic blood pressure is normally distributed …
(a) there would be fewer observations below the mean than above it
(b) the standard deviation would be equal to the mean
(c) the majority of observations would be less than two standard deviations from the mean
(d) the standard deviation would estimate the spread of blood pressure measurements
(e) the mean will be equal to the median