Descriptive Statistics


Associate Professor Dr Sanjay Rampal, MBBS, MPH, PhD, CPH (US NBPHE), AMM

Faculty of Medicine, University of Malaya


Descriptive Statistics, Sept 2015

CONTENTS

• Measures of central tendency

• Mean, Median, & Mode

• Variability and Measures of Dispersion

1) Range

2) Interquartile range

3) Variance

4) Standard deviation

5) Coefficient of variation

• Other measures of location

• Normal Distribution & skewness


Measures of central tendency

• Central tendency is an estimate of the “centre”

of a distribution of values.

• There are three major types of estimates of

central tendency

A) Mean

B) Median

C) Mode

Mean (1)

• The average value (sum of all observed values

divided by the total number of observations)

Mean, $\bar{X} = \frac{\sum x_i}{n} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n}$

$\sum$ = (sigma) (means add)

$x_i$ = observed values

n = total number of observations


• Used when the numbers can be added

(characteristics are measured on a numerical

scale)

• Should not be used with qualitative data

• Should not be used with an ordinal scale because of the arbitrary nature of ordinal scores

• Can be estimated from a frequency table. The weighted average estimate of the mean is formed by multiplying each data value by its number of observations (frequency), adding the products and dividing the sum by the total number of observations
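
The frequency-table estimate described above can be sketched in a few lines of Python. The function name and the dictionary layout (data value mapped to frequency) are illustrative assumptions, not part of the slides; the frequencies are those of the data set 1, 3, 5, 7, 7, 8, 8, 9 used in the worked example of this section.

```python
def mean_from_frequency_table(table):
    """Weighted average: table maps each data value to its frequency."""
    # Multiply each value by its frequency and add the products
    total = sum(value * freq for value, freq in table.items())
    # Divide by the total number of observations
    n = sum(table.values())
    return total / n

# Frequency table for the data 1, 3, 5, 7, 7, 8, 8, 9
freq = {1: 1, 3: 1, 5: 1, 7: 2, 8: 2, 9: 1}
print(mean_from_frequency_table(freq))  # 6.0
```

The result agrees with the mean obtained by summing the eight raw observations and dividing by 8.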

Mean (arithmetic) (2)


1, 3, 5, 7, 7, 8, 8, 9

n=8

$\sum x_i$ = 1+3+5+7+7+8+8+9 = 48

$\bar{X} = \frac{\sum x_i}{n} = \frac{48}{8} = 6$


Mean: Advantages

• It is familiar to most people

• It reflects the inclusion of every item in the data set

• Utilizes all values

• It always exists

• It is unique

• It is easily used with other statistical measurements

• The mean is the center of gravity of the data and is easy to understand and to calculate

• Appropriate when the distribution is symmetrical

• Important for statistical analyses and its applications

Mean: Disadvantages

• It can be affected by extreme values in the

data set, called outliers, and therefore be

biased

• Loss of accuracy when the distribution is

skewed

• Including or excluding a single data value will change the mean

• More tedious to calculate by hand


Other types of means

• Geometric mean

• Harmonic mean

• Generalized means

• Weighted arithmetic mean

• Truncated mean

• Inter-quartile mean

Mean (Geometric)

• It is an average that is useful for sets of numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. disease rates


• The geometric mean is useful to determine “average factors”

E.g. if the incidence rate of a disease increased by 10% in

Y1, 20% in Y2 and decreased 15% in Y3

• The geometric mean of the disease rates 1.10, 1.20 and 0.85 = $(1.10 \times 1.20 \times 0.85)^{1/3} \approx 1.039$

Conclusion - the incidence rate increased 3.9 percent per

year, on average
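
As a quick check of the arithmetic above, here is a minimal Python sketch of the geometric mean as the n-th root of the product of n factors. The function and variable names are illustrative assumptions.

```python
import math

def geometric_mean(factors):
    """n-th root of the product of n positive factors."""
    return math.prod(factors) ** (1 / len(factors))

# +10%, +20% and -15% expressed as multiplicative factors
yearly_factors = [1.10, 1.20, 0.85]
gm = geometric_mean(yearly_factors)
print(round(gm, 3))  # 1.039
```

A factor of about 1.039 corresponds to the 3.9 percent average yearly increase stated above.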

Arithmetic Vs Geometric Mean

Arithmetic mean is relevant any time several quantities add

together to produce a total

• The arithmetic mean answers the question, "if all the

quantities had the same value, what would that value have

to be in order to achieve the same total?"

Geometric mean is relevant any time several quantities

multiply together to produce a product

• The geometric mean answers the question, "if all the

quantities had the same value, what would that value have

to be in order to achieve the same product?"


• Suppose there is a disease rate which increases by

10% in 2004, 50% in 2005, and 30% in 2006. What is

its average increase in disease incidence?

• It is not the arithmetic mean, because what these numbers signify is that in 2004 the disease incidence was multiplied by (not added to) 1.10, in 2005 it was multiplied by 1.50, and in 2006 it was multiplied by 1.30

• The relevant quantity is the geometric mean of these

three numbers, which is about 1.28966 or about 29%

average annual increase in disease rates

• It is important to know whether arithmetic mean or

geometric mean should be used

• When averaging ratios, use the geometric mean

Consider two extreme results. If one experiment yields a ratio of 10,000 and the next yields a ratio of 0.0001, an arithmetic mean would misleadingly report that the average ratio was near 5000. Taking a geometric mean will more honestly represent the fact that the average ratio was 1.
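
The two examples above (the ratios 10,000 and 0.0001, and the yearly factors 1.10, 1.50 and 1.30) can be reproduced with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are assumptions.

```python
import statistics

# Two experiments with extreme ratios
ratios = [10_000, 0.0001]
arithmetic = statistics.mean(ratios)            # about 5000
geometric = statistics.geometric_mean(ratios)   # about 1

# Yearly multiplication factors for 2004, 2005 and 2006
growth = [1.10, 1.50, 1.30]
avg_growth = statistics.geometric_mean(growth)  # about 1.28966

print(arithmetic, geometric, avg_growth)
```

The arithmetic mean is dominated by the single large ratio, while the geometric mean treats a ratio and its reciprocal symmetrically.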


Truncated Means

• This is a useful measure of central tendency in the

presence of extreme values or outliers

• The observations in the dataset are truncated: observations on either side, comprising n% of the data, are discarded and the mean is calculated on the remainder, where n ranges from 5% to 50%

• 90% truncated mean: 5% of the observations at either extreme are discarded

Interquartile mean

• A type of truncated mean

• When the distribution is skewed or in the presence of extreme values, an alternative measure of central tendency is the interquartile mean

• 25% of the observations at either end of the distribution are discarded, leaving the middle 50% (Q1 – Q3); an arithmetic mean is then calculated on the remaining observations
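
A minimal sketch of a truncated mean follows. It assumes the simplest convention, in which a whole number of observations is dropped from each end (statistical packages may instead weight partially included observations); the function name, the `proportion` parameter (fraction dropped per end) and the data are illustrative.

```python
def truncated_mean(values, proportion):
    """Mean after discarding `proportion` of the observations from EACH end.

    proportion must be below 0.5, otherwise nothing is left to average.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # observations dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]
print(sum(data) / len(data))       # ordinary mean, pulled up by the value 1000
print(truncated_mean(data, 0.25))  # interquartile mean of the middle 50%: 5.5
```

With 25% dropped per end the function gives the interquartile mean; with 5% dropped per end it gives the 90% truncated mean.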


Median (1)

• Is the middle observation point (50th percentile)

• It is the point at which half of the observations are

smaller and half are larger

• The median, like the mean, may also be estimated from a frequency table

Median (1)

• Calculate the median by:

1) Arranging the observations from smallest to

largest

2) Find the middle value

e.g. 9, 7, 6, 5, 3, 1, 1


Median (2)

• Odd Number of Measurements (n=odd value)

The median is the value of the middle-most observation in ascending order.

x = [ 1 2 3 4 5 6 7 ]

n =7

median = 4 (4th observation)

• Even Number of Measurements (n=even value)

The median is the average value of the two

middle-most observations in ascending order.

x = [ 1 2 3 4 5 6 7 8 ]

n=8

median = (4+5)/2= 4.5

Median (3)


• If odd number of observations, median = the [(n+1)/2]th observation

Or

• If even number of observations, median = the average of the (n/2)th and [(n/2)+1]th observations
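
The odd and even cases can be sketched in Python as follows. The function name is an illustrative assumption; the three test data sets are the ones used in the median slides above.

```python
def median(values):
    """Middle observation of the ordered data (average of two if n is even)."""
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: the ((n+1)/2)-th observation, which is index `mid` counting from 0
        return ordered[mid]
    # even n: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 3, 4, 5, 6, 7]))     # 4
print(median([1, 2, 3, 4, 5, 6, 7, 8]))  # 4.5
print(median([9, 7, 6, 5, 3, 1, 1]))     # 5
```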

Median: Advantages

• Fairly easy to calculate and always exists

• Relatively easy to interpret - half of the sample (normally) lies above/below the median

• Is not affected by extreme data values

• Used when distribution of data is skewed

• Does not include values of observations, only their ranks

• Can be used with ordinal observations because the calculation does not use the actual values of the observations

• Does not need a complete data set to calculate the rank


Median: Disadvantages

• Manually tedious to find for a large sample which is

not in order (Requires ordering)

• Does not utilize all data values

Mode (1)

• The mode of a set of observations is the specific

value that occurs with the greatest frequency

• There may be more than one mode in a set of

observations, if there are several values that all

occur with the greatest frequency

• A mode may also not exist; this is true if all the

observations occur with the same frequency


• Arrange the numbers in order by size

• Determine the number of instances of each numerical value

• The numerical value that has the most instances is the mode

E.g.

What is the mode for the following data?

2, 4, 5, 5, 5, 7, 8, 8, 9, 12
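
The counting procedure above can be sketched with `collections.Counter`. The function name is an illustrative assumption, and returning an empty list when every value is equally frequent is one possible way to represent "no mode".

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the greatest frequency."""
    counts = Counter(values)                 # number of instances of each value
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                            # all values equally frequent: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 4, 5, 5, 5, 7, 8, 8, 9, 12]))  # [5]
print(modes([1, 1, 2, 2, 3]))                  # [1, 2], two modes
print(modes([1, 2, 3]))                        # [], no mode
```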

Mode (2)

• When a set of data has two modes, it is called bimodal

• What diseases have bimodal distributions?

• For a frequency table or a small number of observations, the mode is sometimes estimated by the modal class, which is the class having the largest number of observations

Mode (3)


Advantages

• Quick and easy to calculate

• Unaffected by extreme values

Disadvantages

• May not be representative of the whole sample as it does not use all values

• Seldom gives statistical significance

Mode (4)

1, 2, 3, 3, 4, 5

• Mean ?

• Median ?

• Mode ?


Mean, Median, Mode

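[Figure: histogram of the data below; values 1 to 9 on the horizontal axis, frequency 0 to 6 on the vertical axis]

Data: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 7 7 7 8 8 9 (n = 25)

Mean = [1(1) + 2(2) + ... + 8(2) + 9(1)] / 25 = 5

Median = 5 (the 13th of the 25 ordered values)

Mode = 5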

Using central tendency (1)

• The choice of measure will depend on the following factors:

1) Scale of measurement

2) Shape of the distribution of observations

• Mean is used for numerical data and symmetric (not

skewed) distributions

• The median is used for ordinal data or for numerical data if

the distribution is skewed

• The mode is used primarily for bimodal distributions

• The geometric mean is used primarily for observations

measured on a logarithmic scale


• If the outlying values are small, the distribution is

skewed to the left (negatively skewed)

• If the outlying values are large, the distribution is skewed to the right (positively skewed)

• Mean=median (symmetrical)

Mean>median (distribution skewed to right)

Mean<median (distribution skewed to left)
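
The comparison of mean and median above can be sketched as a small Python helper. It is a rule of thumb only; the function name, return strings and data sets are illustrative assumptions.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: compare the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right (positive)"
    if mean < median:
        return "left (negative)"
    return "none detected"

print(skew_direction([1, 2, 3, 4, 5]))     # none detected
print(skew_direction([1, 2, 3, 4, 50]))    # right (positive): large outlying value
print(skew_direction([-40, 6, 7, 8, 9]))   # left (negative): small outlying value
```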

Using central tendency (2)








Variability / Dispersion

• The variability of observed values around the measures of central tendency

• Data values in a sample are not all the same; the variation between values is called dispersion

• When the dispersion is large, the values are widely

scattered; when it is small they are tightly clustered

• The width of diagrams such as dot plots, box plots, stem

and leaf plots is greater for samples with more dispersion

and vice versa


• How spread out are the values?

a) All values the same = no variability

b) Small difference among values = small

variability

c) Big difference between values = large

variability

Variability of a sample selected

from a population


Population distributions of height & weight

Measures of Dispersion

1) Range

2) Interquartile range

3) Variance

4) Standard deviation

5) Coefficient of variation




Range (1)

• The difference between the highest and the

lowest values in a set of data

Max. value - Min. value

• The range is affected by the furthest outliers at either end of the distribution

• Range is of limited use as a measure of dispersion, because it reflects information about the extreme values only

Range (2)

E.g.

0 1 2 3 4 5 6

Range ?

0 1 2 3 4 5 6 51

Range ?
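
A minimal sketch of the range calculation for the two data sets in this example; the function name is an illustrative assumption.

```python
def value_range(values):
    """Max. value minus Min. value."""
    return max(values) - min(values)

print(value_range([0, 1, 2, 3, 4, 5, 6]))      # 6
print(value_range([0, 1, 2, 3, 4, 5, 6, 51]))  # 51
```

One additional extreme value changes the range from 6 to 51, although the other observations are unchanged.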


Interquartile range

• More on this in later slides

Measuring dispersion

• Real difference: xi - µ

• Absolute difference: |xi - µ|

• Mean absolute difference (mean absolute deviation): $\frac{1}{n}\sum |x_i - m(X)|$, where m(X) is the chosen measure of central tendency (mean, median or mode)

Note:

• The sample mean absolute deviation is a biased estimator of the population mean absolute deviation

• The sample median absolute deviation is an unbiased estimator of the population median absolute deviation


Deviation

• Deviation: Distance and Direction from the mean

• Deviation value: Values – mean

E.g.

Mean = 52

Scores =45, 53, 50, 60

Deviations scores -7, 1, -2, 8 (respectively)

Note: A deviation tells you how far a value is from the mean and whether it lies above or below the mean
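
The deviation scores in this example can be reproduced with a short Python sketch (variable names are illustrative assumptions). It also shows that the deviations from the mean add up to zero, which is why they are squared before being averaged in the variance.

```python
scores = [45, 53, 50, 60]
mean = sum(scores) / len(scores)            # 52.0

# Deviation = value - mean (the sign gives the direction)
deviations = [x - mean for x in scores]
print(deviations)        # [-7.0, 1.0, -2.0, 8.0]
print(sum(deviations))   # 0.0
```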

Variance (1)

• The variance is a measure of how spread out a distribution is

• The average of squared deviations of the data points from the mean

Variance = SD²

• E.g. for the numbers 1, 2, and 3, the mean is 2 and the variance is:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = 0.667$


Variance (2)

• The formula for the variance in a population is

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

where µ = mean and N = number of observations / scores

• The formula for the variance in a sample is

$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
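
The two variance formulas differ only in the divisor (N for a population, n - 1 for a sample). A minimal Python sketch, with an illustrative function name and a `sample` flag that is an assumption of this example:

```python
def variance(values, sample=False):
    """Average squared deviation from the mean.

    sample=False divides by N (population formula);
    sample=True divides by n - 1 (sample formula).
    """
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1) if sample else sum_of_squares / n

data = [1, 2, 3]
print(round(variance(data), 3))      # 0.667 (population)
print(variance(data, sample=True))   # 1.0   (sample)
```

Python's standard library offers the same two calculations as `statistics.pvariance` and `statistics.variance`.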

• The SD is most commonly used measure of dispersion

with medical and health data

• Measure of the spread of data about their mean

(very important for statistical inference)

• Numerically, the standard deviation is the square root

of the variance

Standard deviation (1)

Population: $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$

Sample: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$



• Measure of the spread of data about their mean

(Describe how observations cluster around the

mean and very important in statistical inference)

• Finds the average distance between each

score/datapoint and the mean


• To calculate SD of a population it is first

necessary to calculate that population's

variance

• Numerically, the standard deviation is the

square root of the variance
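
As a sketch of the square-root relationship, the standard deviation of the scores 45, 53, 50, 60 from the deviation example can be computed with Python's `statistics` module. Treating the four scores first as a whole population and then as a sample is an assumption made for illustration.

```python
import math
import statistics

scores = [45, 53, 50, 60]

# Population: square root of the variance with divisor N
pop_sd = math.sqrt(statistics.pvariance(scores))
# Sample: square root of the variance with divisor n - 1
samp_sd = math.sqrt(statistics.variance(scores))

print(round(pop_sd, 2))    # 5.43
print(round(samp_sd, 2))   # 6.27
```

`statistics.pstdev` and `statistics.stdev` return the same two values directly.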



Standard Deviation and Descriptive

Statistics

• Remember the goal of descriptive statistics is

to summarize and describe a set of data

• When you are given mean and standard

deviation, you should be able to visualize the

distribution

E.g. Mean= SD=4, tells you that the majority

of the values are within 4 points of the mean

These values are concrete and meaningful

Mean and Standard deviation


Coefficient of variation (1)

• Useful when comparing the variation of two or

more quantitative data sets that are on different

scales or units

• An extension of the SD concept

• A measure of relative dispersion

• Adjusts the scales/units to be comparable


• An attribute of a distribution: its standard deviation divided by its mean

CV = (standard deviation / mean) × 100%

• It generally expresses the standard deviation as a percentage of the sample mean



• Useful measure of relative spread in data

E.g.

Mean blood glucose (mg/dl)=152.1, SD=54.7

Mean serum cholesterol =217.0, SD=38.8

CV blood glucose =54.7/152.1 X 100≈36%

CV serum cholesterol =38.8/217.0 X 100≈18%

Variation in blood glucose > serum cholesterol
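
The two coefficients of variation above can be checked with a short Python sketch; the function name is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_glucose = coefficient_of_variation(54.7, 152.1)
cv_cholesterol = coefficient_of_variation(38.8, 217.0)

print(round(cv_glucose))       # 36
print(round(cv_cholesterol))   # 18
```

Because the CV is unit-free, the two results can be compared directly even when the variables are measured on different scales.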







Other measures of location

• Quantiles

• Box plot

• Scatter plot

Quantiles

• Quantiles are a set of 'cut points' that divide

a sample of data into groups containing (as

far as possible) equal numbers of

observations

E.g. quantiles include:

quartiles, quintiles, deciles, percentiles


Quartiles

• Quartiles divide an ordered data set into four quarters

[Figure: ordered data with cut points Q1 at 25%, Q2 (Median) at 50%, Q3 at 75% and Q4 at 100%]

Quartiles (2)

E.g.

Data: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36

Ordered Data: 6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49

Lower quartile cut off (Q1) = 15

Median (Q2) = 41

Third quartile cut off (Q3) = 43
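
The quartile cut-offs in this example can be reproduced with `statistics.quantiles` from the Python standard library. Its default `method="exclusive"` happens to match the values on the slide for these 11 observations; other conventions (for example `method="inclusive"`, or the defaults of other statistical packages) can give different cut points, so treat this as a sketch under that assumption.

```python
import statistics

data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]

# n=4 asks for the three cut points that split the data into four groups;
# the function sorts the data internally
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)  # 15.0 41.0 43.0
```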


Quintiles

• Quintiles are values that divide a sample of

data into 5 quintiles containing (as far as

possible) equal numbers of observations

[Figure: ordered data split at the 20%, 40%, 60% and 80% cut points into quintiles Q1, Q2, Q3, Q4 and Q5]

Percentiles

• The use of percentiles in the presentation of data

• 50th percentile = median


Summary of quantiles

k     Quantile name   No. of quantiles   Description in ordered set

2     Median          1                  50% of observations both above and below median

4     Quartiles       3                  25% of observations below 1st, above 3rd and between successive quartiles

5     Quintiles       4                  20% of observations below 1st, above 4th and between successive quintiles

10    Deciles         9                  10% of observations below 1st, above 9th and between successive deciles

100   Percentiles     99                 1% of observations below 1st, above 99th and between successive percentiles

Why use quantiles?

• It is an efficient way of dividing data into groups – groups are approximately equal sized

• Useful when studying relationships of skewed variables

• Not as efficient when the data variability is low: the range is small, so the categories do not differ much


The Interquartile Range (IQR)

(Q3 – Q1)

2 3 4 4 5 5 6 6 6 7 7 8 8 9 10 11

Q1 Median Q3

• What are the advantages of Interquartile

range over the range?

• E.g.

0 1 2 3 4 5 6 7 8 9 10

Mean? Range? Median ? IQR?

0 1 2 3 4 5 6 7 8 9 1000

Mean? Range? Median ? IQR?
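
One way to explore the questions above is to compute all four summaries for both data sets. This sketch uses `statistics.quantiles` with its default "exclusive" method for the quartiles, which is an assumption (other quartile conventions may give slightly different IQR values); the function name is illustrative.

```python
import statistics

def summary(values):
    """Mean, range, median and interquartile range of a data set."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

without_outlier = summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
with_outlier = summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

print(without_outlier)
print(with_outlier)
```

Replacing 10 by 1000 changes the mean and the range, while the median and the IQR stay the same.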


Box-and-whisker plots (Boxplots) (1)

[Figure: annotated boxplot with labels, from top to bottom: extreme values (*), outlier, whisker ("Median + 1.5 IQR"), Q3 = P75, Median, Q1 = P25]

Box-and-whisker plots (Boxplots) (2)

• The box-length represents the interquartile

range

• The whiskers extend to the smallest and largest observations that are not outliers or extreme values

• The outliers and extreme values are indicated by separate symbols (e.g. * for extreme values)







Normal distribution

• The Normal Curve is bell-shaped and

symmetrical.

• It is unimodal (mean = median = mode)

• Tails of the normal curve are asymptotic to the horizontal axis (−∞ to +∞); i.e. the curve approaches the horizontal axis but never touches it


The Normal Distribution

• The Normal curve is determined by its probability density function (pdf), given by the formula

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$

• Shape of curve depends on two parameters: mean and variance (µ and σ²)
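
A minimal Python sketch of the Normal probability density function, parameterised by the mean and the variance. The function and argument names are illustrative assumptions; the means 5 and 6 and the variances 1 and 4 are the values shown in the figures of this section.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of a Normal distribution with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Changing the mean shifts the curve without changing its height
print(round(normal_pdf(5, mu=5, sigma2=1), 4))  # 0.3989
print(round(normal_pdf(6, mu=6, sigma2=1), 4))  # 0.3989

# Increasing the variance flattens and widens the curve
print(round(normal_pdf(5, mu=5, sigma2=4), 4))  # 0.1995
```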

[Figure: Effects of µ on the Probability Density Function of a Normal Random Variable; two curves with Mean = 5 and Mean = 6, x from 1.5 to 8.5, density from 0.0 to 0.4]

[Figure: Effects of σ² on the Probability Density Function of a Normal Random Variable; two curves with Variance = 1 and Variance = 4, x from 1.5 to 8.5, density from 0.0 to 0.4]

Normal distribution (2)


Properties of a Standard Normal

Distribution (3)

• µ − 1SD to µ + 1SD covers 68.3% of the distribution

• µ − 2SD to µ + 2SD covers 95.5%

• µ − 3SD to µ + 3SD covers 99.7%
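
These percentages can be checked numerically. For a Normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2), which Python exposes as `math.erf`; the function name below is an illustrative assumption.

```python
import math

def prob_within(k):
    """P(mu - k*SD < X < mu + k*SD) for a Normal random variable."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

The middle value, 95.45%, is the figure quoted as 95.5% above.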

Skewed Distributions

• Skewness is defined as asymmetry in the

distribution of the sample data values

• Values on one side of the distribution tend to be

further from the 'middle' than values on the

other side


Skewness

• Skewness measures the extent a distribution

of values deviates from symmetry around the

mean

• Simplest measurement is Mean-Median

– If Mean-Median >0, then +ve skew

– If Mean-Median <0, then -ve skew

Skewed distribution

+ve skewness -ve skewness

+ve skewness indicates a greater number of smaller values.

-ve skewness indicates a greater number of larger values.


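[Figure: Positively Skewed Distribution; percentage of observations (%) plotted against X, with the median and mean marked]

[Figure: Negatively Skewed Distribution; percentage of observations (%) plotted against X, with the median and mean marked]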

For further reading

Pearson’s coefficient of skewness

• Developed in the 1890s by Karl Pearson

• The value of sk will fall within the range of -3 to +3, with a value of 0 associated with a perfectly symmetrical distribution
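
The formula itself did not survive on this slide. The sketch below assumes the commonly quoted median-based form, sk = 3 × (mean − median) / SD, which is consistent with the stated range of -3 to +3; the function name and the data sets are illustrative.

```python
import statistics

def pearson_skewness(values):
    """Assumed form: sk = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)      # sample standard deviation
    return 3 * (mean - median) / sd

print(pearson_skewness([1, 2, 3, 4, 5]))    # 0.0, symmetrical
print(pearson_skewness([1, 2, 3, 4, 50]))   # positive: skewed to the right
print(pearson_skewness([-40, 6, 7, 8, 9]))  # negative: skewed to the left
```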


Kurtosis

• “Curvature”

• Defined as a measure reflecting the degree to which a distribution is peaked

• Provides information regarding the height of a

distribution relative to the value of its standard

deviation

• Can be divided into:

– Mesokurtic: bell shaped

– Leptokurtic: higher peak (clustered around the mean)

– Platykurtic: lower peak (more dispersed)

For further reading

Testing for Normality

• D’Agostino-Pearson test

• Kolmogorov-Smirnov test

• Lilliefors test

• Shapiro-Wilk W test (7≤n≤2000 )

• Shapiro-Francia W' test (5 ≤n≤5000)


For further reading

Transformation to normality

• If there is evidence of marked non-normality

then we may be able to remedy this by

applying suitable transformations

• The more commonly used transformations which are appropriate for data which are skewed to the right (positive skew) are, with increasing strength, sqrt(x), log(x) and 1/x, where the x's are the data values

Commonly used transformations

• If skewed to the right (positive skew): with increasing strength, sqrt(x), log(x) and 1/x

• If skewed to the left (negative skew): with increasing strength, squaring, cubing, and exp(x)

where the x's are the data values
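
A small illustration of how a log transformation can pull in a long right tail. The data are invented for the example (five values spaced by factors of ten), and the gap between mean and median is used as a simple indicator of skew.

```python
import math
import statistics

# Invented right-skewed data: each value is ten times the previous one
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(x) for x in raw]

print(statistics.mean(raw), statistics.median(raw))        # mean far above the median
print(statistics.mean(logged), statistics.median(logged))  # mean and median agree
```

On the raw scale the mean is far above the median; after taking log(x) the two coincide, as expected for a symmetrical set of values.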


Transformation when dealing with

associations between 2 variables

• The circle of powers — sometimes called

the ladder of powers — provides a general

guideline for choosing an appropriate

transformation


• If the plotted data resemble Quadrant I, a transformation that is either “up” on x or “up” on y should be used. In other words, we would raise either x or y to a power greater than p = 1

• The more curvature in the data, the higher the

value of p needed to achieve linearity

• In general, we prefer to transform x whenever

possible
