Date posted: 11-Jan-2016
SQA Statistics
David Young
Department of Mathematics and Statistics, University of Strathclyde
Royal Hospital for Sick Children, Yorkhill NHS Trust
Introduction/Overview
• Course content
• Computer labs – usernames and software (Minitab, Excel, R); see www.minitab.com
• Different approach than AH Statistics
• Emphasis on applications but theory needed to apply the correct approach – not mathematically intensive
• Applications of statistics to your own field of expertise e.g. Geography, Modern Studies, Psychology, Science
• http://personal.strath.ac.uk/david.young/SQA/
• Lab sessions – opportunity to talk to staff from other disciplines
What is Statistics?
The American Heritage Dictionary defines statistics as …
"A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of data."
Introduction
• what is statistics and why do we need it?
• statistics is the science of collecting, analysing, presenting and interpreting data
• it enables the objective evaluation of research questions of interest
• it provides the means to weigh up how much evidence the collected data provide for and against the research hypothesis of interest
Examples of Research Hypotheses
• Can the symptom of oral dryness (xerostomia) in terminal cancer patients be relieved using a mucin-containing oral spray?
• Is a new painkilling drug more effective than the best alternative currently on the market?
• How common is the problem of aggression towards hospital staff?
• How does deprivation impact life expectancy in Scotland?
• How is tourism impacting wildlife in the Galapagos Islands?
Statistics and Medical Research
• statistics plays an increasingly important role in research
• it is not possible, for example, to have a new drug treatment approved for use without solid statistical evidence to support claims of efficacy and safety
• over the last few decades, many new statistical methods have been developed which have particular relevance for researchers in different fields e.g. psychology, clustering, big data
• these methods can be applied routinely using statistical software packages
Importance of Statistics
• all researchers should understand some basic statistical concepts to ensure …
• appropriate study design and data collection
• application of the correct method of statistical analysis when using software
• accurate and honest reporting of data gathered
• adequate understanding of claims made by other researchers when reviewing reports
Basic Terminology
• data: values of the variable(s) of interest recorded on one or more individuals e.g. age, gender, duration of illness
• population: the large group of individuals under study (e.g. all cancer patients)
• sample: a subset of the population of interest ‘selected’ as being representative of the population as a whole
Sampling
• draw useful conclusions about a population of interest
• ‘target population’ too large – practical problems in terms of time, money, staff, resources
• study a sample of the population of interest
• use the sample of individuals to infer something useful about the underlying population
• vast amounts of data can be collected which should be summarised in a useful way – graphically or numerically
Sampling
[Diagram: a SAMPLE is drawn from the POPULATION; inference runs from the sample back to the population]
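The idea of using a sample to infer something about a population can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (10,000 hypothetical ages drawn from a normal distribution), and the sample size of 100 and the seed are arbitrary choices, not values from the course.

```python
import random
import statistics

random.seed(2016)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated ages (not real data).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Draw a simple random sample of 100 individuals without replacement.
sample = random.sample(population, k=100)

# In practice only the sample mean would be available;
# the population mean is shown here purely for comparison.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical; the gap between them is the sampling error that formal inference has to account for.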
Types of Data
categorical data
• nominal – the data can be classified into a number of specific categories with no particular ordering e.g. eye colour, blood type, marital status, gender
• ordinal – the data can be classified into a number of specific categories which can be placed in a meaningful order e.g. pain scores (mild, moderate, severe or unbearable), deprivation category score within Glasgow (ranges from 1 to 7, affluent to poor, classified by postcode)
Types of Data
numerical data
• discrete – data are recorded as a whole number and usually only take specific values e.g. age in years, number of cigarettes smoked in a day, number of children
• continuous – data are recorded to the precision of the measuring instrument and usually take any value within a certain range e.g. height, weight, blood pressure
Problem 1
Which of the following are categorical variables?
(a) gender
(b) no. of children
(c) diastolic blood pressure
(d) diagnosis
(e) height
Problem 2
Which of the following are continuous numerical variables?
(a) blood type
(b) peak expiratory flow rate
(c) age last birthday
(d) exact age
(e) family size
Graphical Summaries
• enable checking of underlying assumptions required for any formal analysis (e.g. are the data normally distributed?)
• help to identify patterns or outliers
• give an idea of relationships between variables
• interpretation of patterns is subjective
• beware of data dredging
Pie Chart
Bar Chart
Scatter Plot
Histogram
Types of Histogram
Numerical Summaries
• measures of location
– mean, median, mode
• mean (average value)
– if there are n observations x1, x2,…, xn in a data set then the mean is calculated as …
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
– which can be written mathematically as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Measures of Location
• median (‘middle value’): to calculate the median …
– arrange the data in ascending order
– if n is odd then the median is the middle value
– if n is even then the median is the average of the ‘two middle’ values
• e.g. 15 17 17 19 24 36 42 has median value 19
• e.g. 1 2 5 7 7 8 has a median value (5+7)/2=6
• the median is often denoted as Q2
• mode (most common value)
– in the two examples above the modal values are 17 and 7 respectively
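The three measures of location can be checked on the two example data sets above with a short Python sketch. It uses only the standard-library `statistics` module; the function name `summarise_location` is made up for this illustration.

```python
import statistics

odd_data = [15, 17, 17, 19, 24, 36, 42]   # n = 7 (odd)
even_data = [1, 2, 5, 7, 7, 8]            # n = 6 (even)

def summarise_location(values):
    """Return (mean, median, mode) for a list of numbers."""
    mean = sum(values) / len(values)       # (x1 + x2 + ... + xn) / n
    median = statistics.median(values)     # middle value, or average of the two middle values
    mode = statistics.mode(values)         # most common value
    return mean, median, mode

odd_summary = summarise_location(odd_data)
even_summary = summarise_location(even_data)

print("odd n: ", odd_summary)
print("even n:", even_summary)
```

The medians and modes should agree with the values quoted on the slide (19 and 6 for the medians, 17 and 7 for the modes).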
Measures of Spread
• measures of spread
– variance, standard deviation, range, inter-quartile range
• variance
– the variance of n observations x1, x2,…, xn is given by …
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
– where $\bar{x}$ is the mean of the observations
• standard deviation
– s (i.e. the square root of the variance) is called the standard deviation and has the same units of measurement as the data
Measures of Spread
• range
– the difference between the maximum and minimum values in a data set
• inter-quartile range
– arrange the data in ascending order and calculate the lower (Q1) and upper (Q3) quartiles
– the quartiles, along with the median, divide the data set into four equal parts (i.e. quarters)
– the inter-quartile range is Q3-Q1
• unlike the range, the inter-quartile range is largely unaffected by outliers
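As a rough sketch, the measures of spread can be computed in Python for the data set 15 17 17 19 24 36 42 used earlier for the median. The variance follows the n − 1 formula directly. Quartile conventions differ between packages; the `method="exclusive"` option of `statistics.quantiles` is an assumption made here, so other software may give slightly different quartiles for small samples.

```python
import math
import statistics

data = [15, 17, 17, 19, 24, 36, 42]

n = len(data)
mean = sum(data) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance (same units as the data).
std_dev = math.sqrt(variance)

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Quartiles (Q1, Q2 = median, Q3) and the inter-quartile range Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```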
Example 1
• the following numbers refer to the length of time spent in hospital (days) for 7 patients after a particular operation …
• 2 2 3 2 15 1 3
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?
Example 2
• the following numbers are the ages of children admitted to an outpatient department with burns …
• 7 6 6 9 10 8 7 7 7 7 6
• are these data categorical or numerical?
• calculate the mean, median and mode of these data
• compute the range, inter-quartile range and standard deviation of these data
• are any of the data values unusual?
• is it reasonable to assume that these data are normally distributed?
Relationship between Mean and Median
[Figure: distribution of length of stay (days), horizontal axis from 30 to 120, with Mean=66.47 and Median=53.00 marked]
Choice of Summary Statistics
• for data which are normally distributed (or symmetric), the mean and standard deviation are the appropriate measures of location and spread respectively
• when data are skewed, the mean and standard deviation are influenced by outliers, so the appropriate measures of location and spread for reporting this type of data are the median and inter-quartile range
• note that for normally distributed data (and symmetric data) the mean and median will be approximately the same
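The sensitivity of the mean to a single extreme value can be seen with the length-of-stay data from Example 1 (2 2 3 2 15 1 3). The sketch below compares the mean and median with and without the value 15; dropping that value is done here purely for illustration and is not a recommendation to discard data.

```python
import statistics

stays = [2, 2, 3, 2, 15, 1, 3]                    # length of stay (days), Example 1
stays_without_15 = [x for x in stays if x != 15]  # illustration only

mean_all = statistics.mean(stays)
median_all = statistics.median(stays)

mean_without = statistics.mean(stays_without_15)
median_without = statistics.median(stays_without_15)

print(f"all values: mean = {mean_all:.2f}, median = {median_all}")
print(f"without 15: mean = {mean_without:.2f}, median = {median_without}")
```

The mean drops from 4 days to about 2.2 days when the single long stay is left out, while the median stays at 2 days in both cases.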
Problem 3
When a distribution is skewed to the right …
(a) the median is greater than the mean
(b) the median is equal to the mean
(c) the tail on the left is shorter than the tail on the right
(d) the standard deviation is less than the variance
(e) the majority of observations are less than the mean
Age and Weight Minitab Example
Descriptive Statistics: Age, Weight

Variable   N     Mean    StDev  Minimum       Q1   Median       Q3  Maximum
Age       20   34.150    3.602   27.000   31.250   34.500   36.750   42.000
Weight    20    65.65    12.04    52.00    56.00    63.00    74.25    92.00
Histogram of Age Variable
Boxplot of Age Variable
Histogram of Weight Variable
Boxplot of Weight Variable
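The Minitab summary values for Age and Weight can be used to make a quick check of shape and possible outliers. The sketch below takes the numbers from the descriptive statistics output above. The 1.5 × IQR rule for flagging outliers is an assumption here (a common boxplot convention, not something stated in these slides), and the function name `describe_shape` is made up for the illustration.

```python
# Summary statistics copied from the Minitab output for Age and Weight.
summary = {
    "Age":    {"mean": 34.150, "min": 27.000, "q1": 31.250,
               "median": 34.500, "q3": 36.750, "max": 42.000},
    "Weight": {"mean": 65.65, "min": 52.00, "q1": 56.00,
               "median": 63.00, "q3": 74.25, "max": 92.00},
}

def describe_shape(stats):
    """Return the IQR, the 1.5 x IQR fences and simple shape indicators."""
    iqr = stats["q3"] - stats["q1"]
    lower_fence = stats["q1"] - 1.5 * iqr   # assumed outlier rule
    upper_fence = stats["q3"] + 1.5 * iqr
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # True if the minimum or maximum falls outside the fences.
        "possible_outliers": stats["min"] < lower_fence or stats["max"] > upper_fence,
        # Positive values suggest a longer right tail, negative a longer left tail.
        "mean_minus_median": stats["mean"] - stats["median"],
    }

results = {name: describe_shape(stats) for name, stats in summary.items()}
for name, result in results.items():
    print(name, result)
```

For Age the mean and median are close, which is consistent with a roughly symmetric distribution. For Weight the mean exceeds the median by more than 2 units, which hints at some right skew.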
The Normal Distribution
• the normal (or Gaussian) distribution is the most important of all the distributions since it has a wide range of practical applications
• the characteristic bell-shape of this distribution describes many distributions which occur in practice
• the distribution is symmetric about the mean
• set proportions of the area under the curve fall within set limits from the mean …
– approximately 68.3% lies within one standard deviation of the mean
– approximately 95.4% lies within two standard deviations of the mean
– approximately 99.7% lies within three standard deviations of the mean
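The proportions within one, two and three standard deviations can be reproduced from the standard result that, for a normal distribution, the proportion within k standard deviations of the mean is erf(k/√2). The sketch below uses `math.erf` from the Python standard library; the function name `proportion_within` is made up for this illustration. The exact one-standard-deviation figure is about 68.27%, which some sources quote as 68.2%.

```python
import math

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.2f}%")
```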
Histogram of Normal Data
Uses of the Normal Distribution
• many physical measurements are closely approximated by the normal distribution, particularly measurements in which there is natural variation (e.g. some biological measurements like height and weight are approximately normally distributed)
• data which are normally distributed can be more easily analysed by using parametric methods of statistical testing (see later)
Example
• In a published paper it is reported that the mean age of 500 injecting drug users in Glasgow is 28 years with a standard deviation of 5 years.
• Assuming that the authors have reported the correct summary statistics for the data they have gathered, what can you assume about the distribution of the ages?
• What do the reported summary statistics tell you about the location and spread of the ages of injecting drug users in Glasgow?
Interpretation
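One way to read the reported mean of 28 years and standard deviation of 5 years is sketched below. It assumes the ages are approximately normally distributed, which is an assumption suggested by the choice of summary statistics rather than something the paper is known to state. Under that assumption, roughly 95% of the ages would fall within two standard deviations of the mean.

```python
from statistics import NormalDist

# Assumption: ages are approximately normal with the reported mean and SD.
ages = NormalDist(mu=28, sigma=5)

# Limits two standard deviations either side of the mean.
low, high = 28 - 2 * 5, 28 + 2 * 5

# Proportion of ages expected between those limits under the normal model.
share_within_2sd = ages.cdf(high) - ages.cdf(low)

print(f"about {100 * share_within_2sd:.1f}% of ages between {low} and {high} years")
```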
Problem 4
In general, which of the following statements is FALSE?
(a) the sample mean is more sensitive to extreme values than the median
(b) the sample inter-quartile range is more sensitive to extreme values than the standard deviation
(c) the sample standard deviation is a measure of spread around the sample mean
(d) the sample standard deviation is a measure of central tendency around the median
(e) if a distribution is normal, then the mean will be equal to the median
Problem 5
If diastolic blood pressure is normally distributed …
(a) there would be fewer observations below the mean than above it
(b) the standard deviation would be equal to the mean
(c) the majority of observations would be less than two standard deviations from the mean
(d) the standard deviation would estimate the spread of blood pressure measurements
(e) the mean will be equal to the median