Date post: | 03-Dec-2015 |
Category: |
Documents |
Upload: | james-smith |
Sanjay Rampal Summarizing Data 1 1
Associate Professor Dr Sanjay Rampal, MBBS, MPH, PhD, CPH (US NBPHE), AMM
Faculty of Medicine, University of Malaya
Descriptive Statistics
Sept 2015
CONTENTS
• Measures of central tendency
• Mean, Median, & Mode
• Variability and Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation
• Other measures of location
• Normal Distribution & skewness
Measures of central tendency
• Central tendency is an estimate of the “centre”
of a distribution of values.
• There are three major types of estimates of
central tendency
A) Mean
B) Median
C) Mode
Mean (1)
• The average value (sum of all observed values
divided by the total number of observations)
Mean, $\bar{X} = \frac{\sum x_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$
$\sum$ = (sigma) (means add)
$x_i$ = observed values
$n$ = total number of observations
Mean (arithmetic) (2)
• Used when the numbers can be added
(characteristics are measured on a numerical
scale)
• Should not be used with qualitative data
• Should not be used with an ordinal scale because
of the arbitrary nature of ordinal scales
• Can be estimated from a frequency table.
A weighted-average estimate of the mean is
formed by multiplying each data value by its number of
observations, adding the products and dividing the
sum by the total number of observations
Mean (arithmetic) (2)
1, 3, 5, 7, 7, 8, 8, 9
n=8
$\sum x_i = 1+3+5+7+7+8+8+9 = 48$
$\bar{X} = \frac{\sum x_i}{n} = \frac{48}{8} = 6$
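The worked example, and the frequency-table estimate described earlier, can be checked with a minimal Python sketch. The function names `arithmetic_mean` and `mean_from_frequency_table` are illustrative and do not come from the lecture.

```python
from collections import Counter

def arithmetic_mean(values):
    """Sum of all observed values divided by the number of observations."""
    return sum(values) / len(values)

def mean_from_frequency_table(table):
    """Weighted mean: sum(value * frequency) / sum(frequency)."""
    total = sum(value * freq for value, freq in table.items())
    n = sum(table.values())
    return total / n

data = [1, 3, 5, 7, 7, 8, 8, 9]
print(arithmetic_mean(data))                      # 6.0
print(mean_from_frequency_table(Counter(data)))   # 6.0
```

Both routes give the same answer because the frequency table keeps every distinct value; with grouped classes the result would only be an estimate.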
Mean: Advantages
• It is familiar to most people
• It reflects the inclusion of every item in the data set
• Utilize all values
• It always exists
• It is unique
• It is easily used with other statistical measurements
• The mean is the centre of gravity of the data, and is easy to understand and to calculate
• It is most appropriate when the distribution is symmetrical
• Important for statistical analyses and its applications
Mean: Disadvantages
• It can be affected by extreme values in the
data set, called outliers, and therefore be
biased
• It represents the typical value poorly when the
distribution is skewed
• Including or excluding any single data value will
change the mean
• It is more tedious to calculate by hand
Other types of means
• Geometric mean
• Harmonic mean
• Generalized means
• Weighted arithmetic mean
• Truncated mean
• Inter-quartile mean
Mean (Geometric)
• It is an average that is useful for sets of
numbers that are interpreted according to
their product and not their sum (as is the case
with the arithmetic mean), e.g. disease rates
• The geometric mean is useful to determine “average factors”
E.g. if the incidence rate of a disease increased by 10% in
Y1 and 20% in Y2, and decreased by 15% in Y3
• The geometric mean of the yearly factors 1.10, 1.20 and 0.85
= $(1.10 \times 1.20 \times 0.85)^{1/3} \approx 1.039$
Conclusion: the incidence rate increased by 3.9 percent per
year, on average
Arithmetic Vs Geometric Mean
Arithmetic mean is relevant any time several quantities add
together to produce a total
• The arithmetic mean answers the question, "if all the
quantities had the same value, what would that value have
to be in order to achieve the same total?"
Geometric mean is relevant any time several quantities
multiply together to produce a product
• The geometric mean answers the question, "if all the
quantities had the same value, what would that value have
to be in order to achieve the same product?"
• Suppose there is a disease rate which increases by
10% in 2004, 50% in 2005, and 30% in 2006. What is
its average increase in disease incidence?
• It is not the arithmetic mean, because these
numbers signify that in 2004 the disease
incidence was multiplied by (not added to) 1.10,
in 2005 it was multiplied by 1.50, and in 2006 it was
multiplied by 1.30
• The relevant quantity is the geometric mean of these
three numbers, which is about 1.28966, i.e. an average
annual increase of about 29% in disease rates
• It is important to know whether arithmetic mean or
geometric mean should be used
• When averaging ratios, use the geometric mean
Consider two extreme results. If one experiment yields a ratio of 10,000
and the next yields a ratio of 0.0001, an arithmetic
mean would misleadingly report that the average
ratio was near 5000. Taking a geometric mean will
more honestly represent the fact that the average
ratio was 1.
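The geometric-mean examples above can be reproduced with a short Python sketch. The helper name `geometric_mean` is illustrative, and it assumes all inputs are positive.

```python
import math

def geometric_mean(factors):
    """n-th root of the product of n positive values."""
    return math.prod(factors) ** (1 / len(factors))

# Yearly growth factors from the two disease-rate examples.
print(round(geometric_mean([1.10, 1.20, 0.85]), 3))   # 1.039
print(geometric_mean([1.10, 1.50, 1.30]))             # about 1.29

# Averaging two extreme ratios.
ratios = [10_000, 0.0001]
print(sum(ratios) / len(ratios))   # arithmetic mean, about 5000
print(geometric_mean(ratios))      # geometric mean, about 1
```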
Truncated Means
• This is a useful measure of central tendency in the
presence of extreme values or outliers
• The observations in the dataset are truncated:
those at either end, comprising n% of the data, are
discarded and the mean is calculated on the remainder,
where n ranges from 5% to 50%
• 90% truncated mean: the 5% of observations at either
extreme are discarded
Interquartile mean
• A type of truncated mean
• When the distribution is skewed or in the presence of
extreme values, an alternative measure of central
tendency is the interquartile mean
• 25% of the observations at either end of the
distribution are discarded,
leaving the middle 50% (Q1 – Q3);
an arithmetic mean is then calculated on the
remaining group of observations.
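A minimal sketch of a truncated mean, under one simplifying assumption: the number of observations dropped at each end is rounded down to a whole number (other conventions weight the boundary observations fractionally). The name `truncated_mean` and the sample data are illustrative.

```python
def truncated_mean(values, proportion_each_end):
    """Mean after discarding a proportion (< 0.5) of observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion_each_end)   # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 1000]
print(sum(data) / len(data))        # 128.5, pulled up by the outlier
print(truncated_mean(data, 0.25))   # 4.5, mean of the middle 50% (3, 4, 5, 6)
```

With `proportion_each_end=0.25` this behaves as an interquartile mean; with `0.05` it gives the 90% truncated mean.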
Median (1)
• Is the middle observation point (50th percentile)
• It is the point at which half of the observations are
smaller and half are larger
• The median like the mean, may also be estimated
from frequency table
Median (1)
• Calculate the median by:
1) Arranging the observations from smallest to
largest
2) Find the middle value
e.g. 9, 7, 6, 5, 3, 1, 1
Median (2)
• Odd Number of Measurements (n=odd value)
The median is the value of the middle-most
observation in ascending order.
x = [ 1 2 3 4 5 6 7 ]
n =7
median = 4 (4th observation)
• Even Number of Measurements (n=even value)
The median is the average value of the two
middle-most observations in ascending order.
x = [ 1 2 3 4 5 6 7 8 ]
n=8
median = (4+5)/2= 4.5
Median (3)
• If odd number of observations, median
= the ((n+1)/2)th observation
Or
• If even number of observations, median
= [(n/2)th observation + ((n/2)+1)th observation] / 2
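The odd and even cases can be sketched in Python as follows. The function name `median` is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle observation of the ordered data (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n+1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n/2)th and ((n/2)+1)th averaged

print(median([1, 2, 3, 4, 5, 6, 7]))      # 4
print(median([1, 2, 3, 4, 5, 6, 7, 8]))   # 4.5
print(median([9, 7, 6, 5, 3, 1, 1]))      # 5
```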
Median: Advantages
• Fairly easy to calculate and always exists
• Relatively easy to interpret - half of the sample (normally) lies above/below the median
• Is not affected by extreme data values
• Used when distribution of data is skewed
• Does not include values of observations, only their ranks
• Can be used with ordinal observations because the calculation does not use the actual values of the observations
• Does not need a complete data set to calculate the rank
Median: Disadvantages
• Manually tedious to find for a large sample which is
not in order (Requires ordering)
• Does not utilize all data values
Mode (1)
• The mode of a set of observations is the specific
value that occurs with the greatest frequency
• There may be more than one mode in a set of
observations, if there are several values that all
occur with the greatest frequency
• A mode may also not exist; this is true if all the
observations occur with the same frequency
Mode (2)
• Arrange the numbers in order by size
• Determine the number of instances of each numerical value
• The numerical value that has the most instances is the mode
E.g.
What is the mode for the following data?
2, 4, 5, 5, 5, 7, 8, 8, 9, 12
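The counting procedure for the mode can be sketched in Python. The helper `modes` is an illustrative name; it returns every value tied for the highest frequency and, following the convention stated earlier in these slides, an empty list when all observations occur equally often.

```python
from collections import Counter

def modes(values):
    """Return every value with the highest frequency; [] if all values tie."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                      # every value equally frequent: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 4, 5, 5, 5, 7, 8, 8, 9, 12]))   # [5]
print(modes([1, 1, 2, 2, 3]))                   # [1, 2]  (bimodal)
print(modes([1, 2, 3, 4]))                      # []
```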
Mode (3)
• When a set of data has two modes, it is called bimodal
• What diseases have bimodal distributions?
• For a frequency table or a small number of observations, the mode is sometimes estimated by the modal class, which is the class having the largest number of observations
Mode (4)
Advantages
• Quick and easy to calculate
• Unaffected by extreme values
Disadvantages
• May not be representative of the whole
sample as it does not use all values
• Seldom gives statistical significance
1, 2, 3, 3, 4, 5
• Mean ?
• Median ?
• Mode ?
Mean, Median, Mode
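[Figure: histogram of the observations, values 1 to 9 on the x-axis and frequencies 0 to 6 on the y-axis]
Mean = $\frac{1(1) + 2(2) + \dots + 8(2) + 9(1)}{25} = \frac{125}{25} = 5$
Median = 5 (13th of the 25 ordered values: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 7 7 7 8 8 9)
Mode = 5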
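As a check on this example, the sketch below rebuilds the observations from their frequencies (read off the ordered list of values, which contains 25 observations) and computes all three measures with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

# value -> frequency, as read from the ordered list of observations
frequencies = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1}
# Expand the frequency table back into the individual observations.
data = [value for value, freq in frequencies.items() for _ in range(freq)]

print(len(data))                 # 25
print(statistics.mean(data))     # 5
print(statistics.median(data))   # 5
print(statistics.mode(data))     # 5
```

All three coincide here because the distribution is symmetric with a single peak.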
Using central tendency (1)
• The choice of measure will depend on the following factors:
1) Scale of measurement
2) Shape of the distribution of observations
• Mean is used for numerical data and symmetric (not
skewed) distributions
• The median is used for ordinal data or for numerical data if
the distribution is skewed
• The mode is used primarily for bimodal distributions
• The geometric mean is used primarily for observations
measured on a logarithmic scale
Using central tendency (2)
• If the outlying values are small, the distribution is
skewed to the left (negatively skewed)
• If the outlying values are large, the distribution is
skewed to the right (positively skewed)
• Mean = median (symmetrical)
Mean > median (distribution skewed to right)
Mean < median (distribution skewed to left)
Guidelines of central tendency
• Mean is used for numerical data and symmetric (not
skewed) distributions
• The median is used for ordinal data or for numerical
data if the distribution is skewed
• The mode is used primarily for bimodal distributions
• The geometric mean is used primarily for observations
measured on a logarithmic scale
Variability / Dispersion
• Dispersion is the variability of observed values around the measures of
central tendency
• Data values in a sample are not all the same; the variation
between values is called dispersion
• When the dispersion is large, the values are widely
scattered; when it is small they are tightly clustered
• The width of diagrams such as dot plots, box plots, stem
and leaf plots is greater for samples with more dispersion
and vice versa
• How spread out are the values?
a) All values the same = no variability
b) Small difference among values = small
variability
c) Big difference between values = large
variability
Variability of a sample selected
from a population
Population distributions of height & weight
Measures of Dispersion
1) Range
2) Interquartile range
3) Variance
4) Standard deviation
5) Coefficient of variation
Range (1)
• The difference between the highest and the
lowest values in a set of data
Max. value - Min. value
• The range is affected by the furthest outliers at either
end of the distribution
• The range is of limited use as a measure of
dispersion, because it reflects only the two
extreme values
Range (2)
E.g.
0 1 2 3 4 5 6
Range ?
0 1 2 3 4 5 6 51
Range ?
Interquartile range
• More on this in later slides
Measuring dispersion
• Real difference: xi - µ
• Absolute difference: |xi - µ|
• Mean absolute difference: $\frac{1}{n}\sum |x_i - m(X)|$
where m(X) ~ Mean, Median, Mode
Note:
• The sample mean absolute deviation is a biased estimator
of the population mean absolute deviation
• The sample median absolute deviation is an unbiased
estimator of the population median absolute deviation
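A minimal sketch of the mean absolute difference around a chosen centre m(X), assuming the usual definition (the average of the absolute differences). The function name `mean_absolute_deviation` and the data, borrowed from the arithmetic-mean example, are illustrative.

```python
import statistics

def mean_absolute_deviation(values, centre):
    """Average absolute distance of the observations from a chosen centre m(X)."""
    return sum(abs(x - centre) for x in values) / len(values)

data = [1, 3, 5, 7, 7, 8, 8, 9]
print(mean_absolute_deviation(data, statistics.mean(data)))     # 2.25 (around the mean, 6)
print(mean_absolute_deviation(data, statistics.median(data)))   # 2.0  (around the median, 7)
```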
Deviation
• Deviation: Distance and Direction from the mean
• Deviation value: Value – mean
E.g.
Mean = 52
Scores = 45, 53, 50, 60
Deviation scores = -7, 1, -2, 8 (respectively)
Note: A deviation tells you how far a value lies from the mean and whether it is above or below it
Variance (1)
• The variance is a measure of how spread out a distribution is
• The average of squared deviations of the data points from the mean
Variance = SD²
• E.g. for the numbers 1, 2, and 3, the mean is 2 and the variance is:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = 0.667$
Variance (2)
• The formula for the variance in a population is
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
where µ=mean and N=number of observations / scores
• The formula for the variance in a sample is
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
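The difference between the population and sample formulas can be seen on the numbers 1, 2, 3 from the earlier example. This is a minimal sketch; the variable names are illustrative, and `statistics.pvariance` and `statistics.variance` are the standard-library equivalents.

```python
import statistics

data = [1, 2, 3]
mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / len(data)     # divide by N
sample_variance = sum(squared_deviations) / (len(data) - 1)   # divide by n - 1

print(round(population_variance, 3))   # 0.667
print(sample_variance)                 # 1.0
```

Dividing by n − 1 rather than N makes the sample variance slightly larger, which matters most for small samples.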
Standard deviation (1)
• The SD is the most commonly used measure of dispersion
with medical and health data
• Measure of the spread of data about their mean
(very important for statistical inference)
• Numerically, the standard deviation is the square root
of the variance
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Standard deviation (2)
• Measure of the spread of data about their mean
(Describe how observations cluster around the
mean and very important in statistical inference)
• Finds the average distance between each
score/datapoint and the mean
Standard deviation (3)
• To calculate SD of a population it is first
necessary to calculate that population's
variance
• Numerically, the standard deviation is the
square root of the variance
Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
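As an illustration, the sketch below applies both standard-deviation formulas to the four scores from the deviation example (45, 53, 50, 60; mean 52). Treating those scores first as a whole population and then as a sample is an assumption made only for the demonstration.

```python
import math
import statistics

scores = [45, 53, 50, 60]
mean = statistics.mean(scores)                  # 52
sum_sq = sum((x - mean) ** 2 for x in scores)   # 49 + 1 + 4 + 64 = 118

population_sd = math.sqrt(sum_sq / len(scores))     # sigma: divides by N
sample_sd = math.sqrt(sum_sq / (len(scores) - 1))   # s: divides by n - 1

print(round(population_sd, 2))   # 5.43
print(round(sample_sd, 2))       # 6.27
```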
Standard Deviation and Descriptive
Statistics
• Remember the goal of descriptive statistics is
to summarize and describe a set of data
• When you are given mean and standard
deviation, you should be able to visualize the
distribution
E.g. Mean= SD=4, tells you that the majority
of the values are within 4 points of the mean
These values are concrete and meaningful
Mean and Standard deviation
Coefficient of variation (1)
• Useful when comparing the variation of two or
more quantitative data sets that are on different
scales or units
• An extension of the SD concept
• A measure of relative dispersion
• Adjusts the scales/units to be comparable
Coefficient of variation (2)
• An attribute of a distribution: its standard deviation divided by its mean
CV = (standard deviation / mean) × 100%
• It generally expresses the standard deviation as a percentage of the sample mean
Coefficient of variation (3)
• Useful measure of relative spread in data
E.g.
Mean blood glucose (mg/dl)=152.1, SD=54.7
Mean serum cholesterol =217.0, SD=38.8
CV blood glucose =54.7/152.1 X 100≈36%
CV serum cholesterol =38.8/217.0 X 100≈18%
Variation in blood glucose > serum cholesterol
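The two coefficients of variation above can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_glucose = coefficient_of_variation(54.7, 152.1)
cv_cholesterol = coefficient_of_variation(38.8, 217.0)
print(round(cv_glucose))       # 36
print(round(cv_cholesterol))   # 18
```

Because the CV is unit-free, the two variables can be compared directly even though their means and standard deviations differ in scale.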
Other measures of location
• Quantiles
• Box plot
• Scatter plot
Quantiles
• Quantiles are a set of 'cut points' that divide
a sample of data into groups containing (as
far as possible) equal numbers of
observations
E.g. quantiles include:
quartiles, quintiles, deciles, percentiles
Quartiles
• Quartiles divide an ordered data set into four
equal groups
[Figure: ordered data split at the cut points Q1 (25%), Q2 (50%, the median), Q3 (75%) and Q4 (100%)]
Quartiles (2)
E.g.
– Data: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36
Ordered Data: 6, 7, 15, 36, 39, 41, 41, 43, 43,
47, 49
Median (Q2) = 41
Third quartile cut off (Q3) = 43
Lower quartile cut off (Q1) = 15
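The quartile cut-offs in this example can be reproduced with Python's `statistics.quantiles`. Its default "exclusive" method places the p-th quantile at position (n + 1) × p in the ordered data, which happens to match the values shown here; other software may use different conventions and give slightly different cut-offs.

```python
import statistics

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)   # 15.0 41.0 43.0
```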
Quintiles
• Quintiles are values that divide a sample of
data into 5 groups containing (as far as
possible) equal numbers of observations
[Figure: ordered data split at the cut points Q1 (20%), Q2 (40%), Q3 (60%), Q4 (80%) and Q5]
Percentiles
• The use of percentiles in the presentation of data
• 50th percentile = median
Summary of quantiles
k | Quantile name | No of quantiles | Description in ordered set
2 | Median | 1 | 50% of observations both above and below median
4 | Quartiles | 3 | 25% of observations below 1st, above 3rd and between successive quartiles
5 | Quintiles | 4 | 20% of observations below 1st, above 4th and between successive quintiles
10 | Deciles | 9 | 10% of observations below 1st, above 9th and between successive deciles
100 | Percentiles | 99 | 1% of observations below 1st, above 99th and between successive percentiles
Why use quantiles?
• It is an efficient way of dividing data into groups – groups are approximately equal sized
• Useful when studying relationships of skewed variables
• Not as efficient when the data variability is low: with a small range, the categories do not differ much
The Interquartile Range (IQR)
(Q3 – Q1)
2 3 4 4 5 5 6 6 6 7 7 8 8 9 10 11
Q1 Median Q3
• What are the advantages of Interquartile
range over the range?
• E.g.
0 1 2 3 4 5 6 7 8 9 10
Mean? Range? Median ? IQR?
0 1 2 3 4 5 6 7 8 9 1000
Mean? Range? Median ? IQR?
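One way to explore this exercise is the sketch below, which computes the four measures for both data sets. The helper `summary` is an illustrative name, and the quartiles use the default convention of Python's `statistics.quantiles` (position (n + 1) × p), which is an assumption; other conventions can shift Q1 and Q3 slightly.

```python
import statistics

def summary(values):
    """Mean, range, median and interquartile range of a data set."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "median": q2,
        "iqr": q3 - q1,
    }

print(summary(list(range(11))))            # data: 0, 1, ..., 10
print(summary(list(range(10)) + [1000]))   # data: 0, 1, ..., 9, 1000
```

Replacing 10 with 1000 moves the mean from 5 to 95 and the range from 10 to 1000, while the median (5) and the IQR (6) are unchanged.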
Box-and-whisker plots (Boxplots) (1)
[Figure: annotated boxplot labelling the median, Q1 = P25, Q3 = P75, the whiskers, "Median + 1.5 IQR", outliers, and extreme values (*)]
Box-and-whisker plots (Boxplots) (2)
• The box-length represents the interquartile
range
• The whiskers extend to the smallest and
largest observations
• The outliers and extreme values are indicated
by separate symbols, such as *
Normal distribution
• The Normal Curve is bell-shaped and
symmetrical.
• It is unimodal (mean = median = mode)
• Tails of the normal curve are asymptotic to
the horizontal axis (−∞ to +∞); i.e. the curve
approaches the horizontal axis but never
touches it
Normal distribution (2)
The Normal curve is determined by its
probability density function (pdf), given by the
formula
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2\right]$
[Figure: The Normal Distribution, density plotted against x]
• Shape of curve depends on two parameters:
mean and variance (µ and σ²)
[Figure: Effects of µ on the probability density function of a Normal random variable; curves with Mean = 5 and Mean = 6, x from 1.5 to 8.5, density from 0.0 to 0.4]
[Figure: Effects of σ² on the probability density function of a Normal random variable; curves with Variance = 1 and Variance = 4, x from 1.5 to 8.5, density from 0.0 to 0.4]
Properties of a Standard Normal
Distribution (3)
• 68.3% of values lie within µ ± 1SD
• 95.5% of values lie within µ ± 2SD
• 99.7% of values lie within µ ± 3SD
[Figure: Normal curve with the x-axis marked at µ - 3SD, µ - 2SD, µ - 1SD, µ, µ + 1SD, µ + 2SD, µ + 3SD]
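These percentages can be checked numerically with `statistics.NormalDist` from Python's standard library. The helper name `share_within` is illustrative. The exact two-SD share is about 95.45%, which the slide rounds to 95.5%.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # standard Normal distribution

def share_within(k):
    """Percentage of a Normal distribution lying within k SD of the mean."""
    return (standard.cdf(k) - standard.cdf(-k)) * 100

for k in (1, 2, 3):
    print(k, round(share_within(k), 2))   # 68.27, 95.45, 99.73

# Peak height of the density: a larger variance gives a lower, wider curve.
print(round(NormalDist(mu=5, sigma=1).pdf(5), 3))   # 0.399 (variance = 1)
print(round(NormalDist(mu=5, sigma=2).pdf(5), 3))   # 0.199 (variance = 4)
```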
Skewed Distributions
• Skewness is defined as asymmetry in the
distribution of the sample data values
• Values on one side of the distribution tend to be
further from the 'middle' than values on the
other side
Skewness
• Skewness measures the extent a distribution
of values deviates from symmetry around the
mean
• Simplest measurement is Mean-Median
– If Mean-Median >0, then +ve skew
– If Mean-Median <0, then -ve skew
Skewed distribution
+ve skewness -ve skewness
+ve skewness indicates a greater number of smaller values.
-ve skewness indicates a greater number of larger values.
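Positively Skewed Distribution
[Figure: positively skewed distribution (% against X) with the median and mean marked]
Negatively Skewed Distribution
[Figure: negatively skewed distribution (% against X) with the median and mean marked]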
For further reading
Pearson’s coefficient of skewness
• Developed in the 1890s by Karl Pearson
• The value of sk will fall within the range of
-3 to +3, with a value of 0 associated with a
perfectly symmetrical distribution
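The formula for sk did not survive on this slide. The sketch below assumes it is Pearson's second (median) skewness coefficient, sk = 3 × (mean − median) / SD, which is consistent with the stated range of -3 to +3 and with the Mean − Median rule given earlier. The function name and the sample data are illustrative.

```python
import statistics

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / SD (assumed form)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large outlying value
left_skewed = [-x for x in right_skewed]   # mirror image
symmetric = [1, 2, 3, 4, 5]

print(pearson_skewness(right_skewed) > 0)   # True: mean > median
print(pearson_skewness(left_skewed) < 0)    # True: mean < median
print(pearson_skewness(symmetric))          # 0.0
```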
Kurtosis
• “Curvature”
• Defined as a measure reflecting the degree to
which a distribution is peaked
• Provides information regarding the height of a
distribution relative to the value of its standard
deviation
• Can be divided into:
– Mesokurtic: bell shaped
– Leptokurtic: ↑ peak (clustered around the mean)
– Platykurtic: ↓ peak (more dispersed)
For further reading
Testing for Normality
• D’Agostino-Pearson test
• Kolmogorov-Smirnov test
• Lilliefors test
• Shapiro-Wilk W test (7≤n≤2000 )
• Shapiro-Francia W' test (5 ≤n≤5000)
For further reading
Transformation to normality
• If there is evidence of marked non-normality
then we may be able to remedy this by
applying suitable transformations
• The more commonly used transformations
for data which are skewed to the right
(positive skew) are, in order of increasing strength,
sqrt(x), log(x) and 1/x,
where the x's are the data values
Commonly used transformations
• If skewed to the right (positive skew): in order of
increasing strength, sqrt(x), log(x) and 1/x
• If skewed to the left (negative skew): in order of
increasing strength, squaring, cubing, and
exp(x)
where the x's are the data values
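A small illustration of how a log transformation can remove positive skew, using made-up data chosen so that the result is exactly symmetric; real data will rarely behave this neatly. The gap between mean and median is used as the simple skewness check described earlier.

```python
import math
import statistics

data = [1, 10, 100, 1000, 10000]         # strongly right-skewed (illustrative values)
logged = [math.log10(x) for x in data]   # approximately 0, 1, 2, 3, 4

gap_raw = statistics.mean(data) - statistics.median(data)
gap_logged = statistics.mean(logged) - statistics.median(logged)

print(gap_raw)      # large positive gap: mean far above the median
print(gap_logged)   # approximately 0: symmetric after the transformation
```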
Transformation when dealing with
associations between 2 variables
• The circle of powers — sometimes called
the ladder of powers — provides a general
guideline for choosing an appropriate
transformation
• If the plotted data resemble Quadrant I, a
transformation that is either “up” on x or “up” on
y should be used. In other words, we would raise
either x or y to a power greater than p = 1
• The more curvature in the data, the higher the
value of p needed to achieve linearity
• In general, we prefer to transform x whenever
possible