
Descriptive Statistics

Page 1: Descriptive Statistics

Sanjay Rampal Summarizing Data 1 1

Associate Professor Dr Sanjay Rampal, MBBS, MPH, PhD, CPH (US NBPHE), AMM

Faculty of Medicine, University of Malaya

[email protected] / [email protected]

Descriptive Statistics

Sept 2015

CONTENTS

• Measures of central tendency

• Mean, Median, & Mode

• Variability and Measures of Dispersion

1) Range

2) Interquartile range

3) Variance

4) Standard deviation

5) Coefficient of variation

• Other measures of location

• Normal Distribution & skewness

Page 2: Descriptive Statistics

Measures of central tendency

• Central tendency is an estimate of the “centre”

of a distribution of values.

• There are three major types of estimates of

central tendency

A) Mean

B) Median

C) Mode

Mean (1)

• The average value (sum of all observed values

divided by the total number of observations)

Mean, $\bar{X} = \dfrac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \dfrac{\sum x_i}{n}$

$\sum$ = (sigma) (means add)

$x_i$ = observed values

$n$ = total number of observations

Page 3: Descriptive Statistics

Mean (arithmetic) (2)

• Used when the numbers can be added (characteristics are measured on a numerical scale)

• Should not be used with qualitative data

• Should not be used with ordinal data, because of the arbitrary nature of the ordinal scale

• Can be estimated from a frequency table. The weighted-average estimate of the mean is formed by multiplying each data value by its number of observations (frequency), adding the products, and dividing the sum by the total number of observations

E.g. 1, 3, 5, 7, 7, 8, 8, 9

n = 8

$\sum x_i$ = 1+3+5+7+7+8+8+9 = 48

$\bar{X} = \dfrac{\sum x_i}{n} = \dfrac{48}{8} = 6$
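The calculation above can be checked with a short Python sketch. The function names and the frequency table are illustrative (not from the slides); the table simply restates the same eight observations as value/frequency pairs.

```python
def arithmetic_mean(values):
    """Sum of the observed values divided by the number of observations."""
    return sum(values) / len(values)


def mean_from_frequency_table(table):
    """Weighted mean from a {value: frequency} table."""
    total = sum(value * freq for value, freq in table.items())
    n = sum(table.values())
    return total / n


data = [1, 3, 5, 7, 7, 8, 8, 9]
# Same observations written as a frequency table (illustrative).
freq_table = {1: 1, 3: 1, 5: 1, 7: 2, 8: 2, 9: 1}

print(arithmetic_mean(data))                  # 6.0
print(mean_from_frequency_table(freq_table))  # 6.0
```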

Page 4: Descriptive Statistics

Mean: Advantages

• It is familiar to most people

• It reflects the inclusion of every item in the data set

• Utilizes all values

• It always exists

• It is unique

• It is easily used with other statistical measurements

• The mean is the center of gravity of the data and is easy to understand and to calculate

• Well suited to symmetrical distributions

• Important for statistical analyses and its applications

Mean: Disadvantages

• It can be affected by extreme values in the

data set, called outliers, and therefore be

biased

• Loss of accuracy when the distribution is

skewed

• Including or excluding a data value will change the mean

• Manually, more tedious to calculate

Page 5: Descriptive Statistics

Other types of means

• Geometric mean

• Harmonic mean

• Generalized means

• Weighted arithmetic mean

• Truncated mean

• Inter-quartile mean

Mean (Geometric)

• It is an average that is useful for sets of numbers that are interpreted according to their product and not their sum (as is the case with the arithmetic mean), e.g. disease rates

Page 6: Descriptive Statistics

• The geometric mean is useful to determine “average factors”

E.g. if the incidence rate of a disease increased by 10% in Y1 and 20% in Y2, and decreased by 15% in Y3

• The geometric mean of the yearly factors 1.10, 1.20 and 0.85 = $(1.10 \times 1.20 \times 0.85)^{1/3} \approx 1.039$

Conclusion - the incidence rate increased by 3.9 percent per year, on average

Arithmetic Vs Geometric Mean

Arithmetic mean is relevant any time several quantities add

together to produce a total

• The arithmetic mean answers the question, "if all the

quantities had the same value, what would that value have

to be in order to achieve the same total?"

Geometric mean is relevant any time several quantities

multiply together to produce a product

• The geometric mean answers the question, "if all the

quantities had the same value, what would that value have

to be in order to achieve the same product?"

Page 7: Descriptive Statistics

• Suppose there is a disease rate which increases by 10% in 2004, 50% in 2005, and 30% in 2006. What is its average increase in disease incidence?

• It is not the arithmetic mean, because what these numbers signify is that in 2004 the disease incidence was multiplied by (not added to) 1.10, in 2005 it was multiplied by 1.50, and in 2006 it was multiplied by 1.30

• The relevant quantity is the geometric mean of these three numbers, $(1.10 \times 1.50 \times 1.30)^{1/3}$, which is about 1.28966, or about a 29% average annual increase in disease rates

• It is important to know whether arithmetic mean or

geometric mean should be used

• When averaging ratios, use the geometric mean

Consider two extreme results. If one experiment yields a ratio of 10,000 and the next yields a ratio of 0.0001, an arithmetic mean would misleadingly report that the average ratio was near 5000. Taking a geometric mean will more honestly represent the fact that the average ratio was 1.
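The three geometric-mean examples above can be reproduced with a minimal Python sketch. The helper name `geometric_mean` is illustrative; it takes the n-th root of the product and assumes all inputs are positive.

```python
import math


def geometric_mean(factors):
    """n-th root of the product of n positive values."""
    return math.prod(factors) ** (1 / len(factors))


# Yearly factors for +10%, +20%, -15%
print(round(geometric_mean([1.10, 1.20, 0.85]), 3))  # 1.039

# Yearly factors for +10%, +50%, +30%
print(round(geometric_mean([1.10, 1.50, 1.30]), 5))  # 1.28966

# Two extreme ratios: the arithmetic mean is dominated by the large one.
ratios = [10_000, 0.0001]
arithmetic = sum(ratios) / len(ratios)   # close to 5000
geometric = geometric_mean(ratios)       # close to 1
print(arithmetic, geometric)
```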

Page 8: Descriptive Statistics

Truncated Means

• This is a useful measure of central tendency in the

presence of extreme values or outliers

• The observations in the dataset are truncated: the observations on either side comprising n% are discarded and the mean is calculated on the remainder, where n ranges from 5% to 50%

• 90% truncated mean: the 5% of observations at either extreme are discarded

Inter-quartile mean

• A type of truncated mean

• When the distribution is skewed or in the presence of extreme values, an alternative measure of central tendency is the inter-quartile mean

• 25% of the observations at either end of the distribution are discarded, leaving the middle 50% (Q1 – Q3); an arithmetic mean is then calculated on the remaining group of observations.
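A minimal Python sketch of the truncated and inter-quartile means follows. It rests on assumptions not stated in the slides: the number of observations dropped from each end is rounded down, the function names are illustrative, and the data set is hypothetical (the earlier example with its largest value replaced by an outlier of 90).

```python
def truncated_mean(values, proportion):
    """Mean after discarding `proportion` of the observations from EACH end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of observations dropped per side (rounded down: an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def interquartile_mean(values):
    """Truncated mean that keeps only the middle 50% of the observations."""
    return truncated_mean(values, 0.25)


# Hypothetical data with one extreme value.
data = [1, 3, 5, 7, 7, 8, 8, 90]

print(sum(data) / len(data))     # 16.125 (pulled up by the outlier)
print(interquartile_mean(data))  # 6.75   (mean of 5, 7, 7, 8)
```

With small samples the rounding rule matters: a 5% trim of eight observations drops nothing, so the result equals the ordinary mean.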

Page 9: Descriptive Statistics

Median (1)

• Is the middle observation point (50th percentile)

• It is the point at which half of the observations are

smaller and half are larger

• The median, like the mean, may also be estimated from a frequency table

Median (1)

• Calculate the median by:

1) Arranging the observations from smallest to

largest

2) Find the middle value

e.g. 9, 7, 6, 5, 3, 1, 1

Page 10: Descriptive Statistics

Median (2)

• Odd Number of Measurements (n=odd value)

The median is the value of the middle-most observation in ascending order.

x = [ 1 2 3 4 5 6 7 ]

n =7

median = 4 (4th observation)

• Even Number of Measurements (n=even value)

The median is the average value of the two

middle-most observations in ascending order.

x = [ 1 2 3 4 5 6 7 8 ]

n=8

median = (4+5)/2= 4.5

Median (3)

Page 11: Descriptive Statistics

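• If odd number of observations, median observation = (n+1)/2, i.e. the [(n+1)/2]th observation in the ordered data

Or

• If even number of observations, median = $\dfrac{(n/2)\text{th observation} + [(n/2)+1]\text{th observation}}{2}$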
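The odd/even rule can be sketched in Python as follows. The function name is illustrative; positions in the comments are 1-based to match the slides, while Python lists are 0-based.

```python
def median(values):
    """Middle observation (odd n) or mean of the two middle ones (even n)."""
    ordered = sorted(values)          # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n+1)/2)-th observation
    # average of the (n/2)-th and ((n/2)+1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 2, 3, 4, 5, 6, 7]))     # 4
print(median([1, 2, 3, 4, 5, 6, 7, 8]))  # 4.5
print(median([9, 7, 6, 5, 3, 1, 1]))     # 5
```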

Median: Advantages

• Fairly easy to calculate and always exists

• Relatively easy to interpret - half of the sample (normally) lies above/below the median

• Is not affected by extreme data values

• Used when distribution of data is skewed

• Does not include values of observations, only their ranks

• Can be used with ordinal observations because the calculation does not use the actual values of the observations

• Does not need a complete data set, only the ranks of the observations

Page 12: Descriptive Statistics

Median: Disadvantages

• Manually tedious to find for a large sample which is

not in order (Requires ordering)

• Does not utilize all data values

Mode (1)

• The mode of a set of observations is the specific

value that occurs with the greatest frequency

• There may be more than one mode in a set of

observations, if there are several values that all

occur with the greatest frequency

• A mode may also not exist; this is true if all the

observations occur with the same frequency

Page 13: Descriptive Statistics

• Arrange the numbers in order by size

• Determine the number of instances of each numerical value

• The numerical value that has the most instances is the mode

E.g.

What is the mode for the following data?

2, 4, 5, 5, 5, 7, 8, 8, 9, 12
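The counting procedure above can be sketched in Python. The function name `modes` is illustrative; it returns every value that reaches the greatest frequency, and an empty list when all values occur equally often (the "no mode" case described earlier).

```python
from collections import Counter


def modes(values):
    """All values that occur with the greatest frequency ([] if there is none)."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)                 # instances of each value
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                            # every value equally frequent
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 4, 5, 5, 5, 7, 8, 8, 9, 12]))  # [5]
print(modes([1, 1, 2, 2, 3]))                  # [1, 2]  (bimodal)
print(modes([1, 2, 3]))                        # []      (no mode)
```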

Mode (2)

• When a set of data has two modes, it is called bimodal

• What diseases have bimodal distributions?

• For a frequency table or a small number of observations, the mode is sometimes estimated by the modal class, which is the class having the largest number of observations

Mode (3)

Page 14: Descriptive Statistics

Advantages

• Quick and easy to calculate

• Unaffected by extreme values

Disadvantages

• May not be representative of the whole sample, as it does not use all values

• Seldom gives statistical significance

Mode (4)

1, 2, 3, 3, 4, 5

• Mean ?

• Median ?

• Mode ?

Page 15: Descriptive Statistics

Mean, Median, Mode

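[Figure: histogram of the data listed below; horizontal axis 1 to 9, vertical axis 0 to 6]

Data: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 7 7 7 8 8 9 (n = 25)

Mean = $\dfrac{1(1) + 2(2) + \dots + 8(2) + 9(1)}{25} = \dfrac{125}{25} = 5$

Median = 5 (the 13th of the 25 ordered values)

Mode = 5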
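As a quick check, the three measures can be computed with Python's standard `statistics` module. The frequency table below is read off the listed data values; the variable names are illustrative.

```python
import statistics

# value -> number of times it appears in the listed data
frequencies = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 4, 7: 3, 8: 2, 9: 1}

# Expand the table back into the individual observations.
data = [value for value, freq in frequencies.items() for _ in range(freq)]

print(len(data))                # 25
print(statistics.mean(data))    # 5
print(statistics.median(data))  # 5
print(statistics.mode(data))    # 5
```

For this symmetric, single-peaked data set the three measures coincide.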

Using central tendency (1)

• The choice of measure will depend on the following factors:

1) Scale of measurement

2) Shape of the distribution of observations

• Mean is used for numerical data and symmetric (not

skewed) distributions

• The median is used for ordinal data or for numerical data if

the distribution is skewed

• The mode is used primarily for bimodal distributions

• The geometric mean is used primarily for observations

measured on a logarithmic scale

Page 16: Descriptive Statistics

• If the outlying values are small, the distribution is

skewed to the left (negatively skewed)

• If the outlying values are large, the distribution is skewed to the right (positively skewed)

• Mean=median (symmetrical)

Mean>median (distribution skewed to right)

Mean<median (distribution skewed to left)

Using central tendency (2)

Page 17: Descriptive Statistics

Variability / Dispersion

• The variability of observed values around the measures of central tendency

• Data values in a sample are not all the same; the variation between values is called dispersion

• When the dispersion is large, the values are widely

scattered; when it is small they are tightly clustered

• The width of diagrams such as dot plots, box plots, stem

and leaf plots is greater for samples with more dispersion

and vice versa

Page 18: Descriptive Statistics

• How spread out are the values?

a) All values the same = no variability

b) Small difference among values = small

variability

c) Big difference between values = large

variability

Variability of a sample selected

from a population

Page 19: Descriptive Statistics

Population distributions of height & weight

Measures of Dispersion

1) Range

2) Interquartile range

3) Variance

4) Standard deviation

5) Coefficient of variation

Page 20: Descriptive Statistics

Range (1)

• The difference between the highest and the

lowest values in a set of data

Max. value - Min. value

• The range is affected by furthest outliers at either

end of the distribution

• Range is of limited use as a measure of dispersion, because it reflects information only about the two extreme values

Range (2)

E.g.

0 1 2 3 4 5 6

Range ?

0 1 2 3 4 5 6 51

Range ?
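The two range questions can be worked through with a short Python sketch (the function name is illustrative). It shows how a single extreme value changes the range.

```python
def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


print(value_range([0, 1, 2, 3, 4, 5, 6]))      # 6
print(value_range([0, 1, 2, 3, 4, 5, 6, 51]))  # 51
```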

Page 21: Descriptive Statistics

Interquartile range

• More on this in later slides

Measuring dispersion

• Real difference: xi - µ

• Absolute difference: |xi - µ|

• Mean absolute difference: $\dfrac{1}{n}\sum_{i=1}^{n} |x_i - m(X)|$

where m(X) is the chosen measure of central tendency (mean, median, or mode)

Note:

• The sample mean absolute deviation is a biased estimator of the population mean absolute deviation

• The sample median absolute deviation is an unbiased estimator of the population median absolute deviation

Page 22: Descriptive Statistics

Deviation

• Deviation: distance and direction from the mean

• Deviation value: value − mean

E.g.

Mean = 52

Scores = 45, 53, 50, 60

Deviation scores = -7, 1, -2, 8 (respectively)

Note: The deviation tells you how far a value is from the mean and whether it is above or below it

Variance (1)

• The variance is a measure of how spread out a distribution is

• The average of squared deviations of the data points from the mean

Variance = (SD)², i.e. $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

• E.g. for the numbers 1, 2, and 3, the mean is 2 and the variance is:

$\sigma^2 = \dfrac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = 0.667$

Page 23: Descriptive Statistics

Variance (2)

• The formula for the variance in a population is

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

where µ = mean and N = number of observations / scores

• The formula for the variance in a sample is

$s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Standard deviation (1)

• The SD is the most commonly used measure of dispersion with medical and health data

• Measure of the spread of data about their mean (very important for statistical inference)

• Numerically, the standard deviation is the square root of the variance

Population: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$

Sample: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$

Page 24: Descriptive Statistics

Standard deviation (2)

• Measure of the spread of data about their mean

(Describe how observations cluster around the

mean and very important in statistical inference)

• Finds the average distance between each

score/datapoint and the mean

Standard deviation (3)

• To calculate SD of a population it is first

necessary to calculate that population's

variance

• Numerically, the standard deviation is the

square root of the variance

Population: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$

Sample: $s = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$
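The population and sample formulas differ only in the divisor (N versus n − 1). A minimal Python sketch, with illustrative function names, applies both to the numbers 1, 2, 3 used in the variance example.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [1, 2, 3]

print(round(population_variance(data), 3))             # 0.667
print(sample_variance(data))                           # 1.0
print(round(math.sqrt(population_variance(data)), 3))  # 0.816 (population SD)
print(math.sqrt(sample_variance(data)))                # 1.0   (sample SD)
```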

Page 25: Descriptive Statistics

Standard Deviation and Descriptive

Statistics

• Remember the goal of descriptive statistics is

to summarize and describe a set of data

• When you are given mean and standard

deviation, you should be able to visualize the

distribution

E.g. Mean= SD=4, tells you that the majority

of the values are within 4 points of the mean

These values are concrete and meaningful

Mean and Standard deviation

Page 26: Descriptive Statistics

Coefficient of variation (1)

• Useful when comparing the variation of two or

more quantitative data sets that are on different

scales or units

• An extension of the SD concept

• A measure of relative dispersion

• Adjusts the scales/units to be comparable

Coefficient of variation (2)

• An attribute of a distribution: its standard deviation divided by its mean

$CV = \dfrac{\text{Standard deviation}}{\text{Mean}} \times 100\%$

• It generally expresses the standard deviation as a percentage of the sample mean

Page 27: Descriptive Statistics

Coefficient of variation (3)

• Useful measure of relative spread in data

E.g.

Mean blood glucose (mg/dl)=152.1, SD=54.7

Mean serum cholesterol =217.0, SD=38.8

CV blood glucose =54.7/152.1 X 100≈36%

CV serum cholesterol =38.8/217.0 X 100≈18%

Variation in blood glucose > serum cholesterol
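The two coefficients of variation can be reproduced with a small Python sketch; the function name is illustrative and the inputs are the means and SDs quoted above.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


cv_glucose = coefficient_of_variation(54.7, 152.1)
cv_cholesterol = coefficient_of_variation(38.8, 217.0)

print(round(cv_glucose))      # 36
print(round(cv_cholesterol))  # 18
```

Because the CV is unit-free, the two values can be compared directly even when the underlying measurements are on different scales.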

Page 28: Descriptive Statistics

Other measures of location

• Quantiles

• Box plot

• Scatter plot

Quantiles

• Quantiles are a set of 'cut points' that divide

a sample of data into groups containing (as

far as possible) equal numbers of

observations

E.g. quantiles include:

quartiles, quintiles, deciles, percentiles

Page 29: Descriptive Statistics

Quartiles

• Quartiles divide an ordered data set into four equal parts

[Figure: ordered data divided at the 25%, 50% and 75% points by Q1, Q2 (median) and Q3; Q4 marks 100%]

Quartiles (2)

E.g.

Data: 6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36

Ordered Data: 6, 7, 15, 36, 39, 41, 41, 43, 43, 47, 49

Median (Q2) = 41

Third quartile cut off (Q3) = 43

Lower quartile cut off (Q1) = 15
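Several conventions exist for computing quartiles, and the slides do not say which one they use. The sketch below assumes the "median of each half" convention (the overall median is excluded from both halves when n is odd), which reproduces the cut-offs quoted in this example; the function names are illustrative.

```python
def _median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return _median(lower), _median(ordered), _median(upper)


data = [6, 47, 49, 15, 43, 41, 7, 39, 43, 41, 36]
print(quartiles(data))  # (15, 41, 43)
```

Other conventions (for example, interpolation between neighbouring observations) can give slightly different Q1 and Q3 values for the same data.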

Page 30: Descriptive Statistics

Quintiles

• Quintiles are values that divide a sample of data into 5 groups containing (as far as possible) equal numbers of observations

[Figure: ordered data divided at the 20%, 40%, 60% and 80% points into quintile groups Q1 to Q5]

Percentiles

The use of percentiles in the presentation of data

50th percentile = median

Page 31: Descriptive Statistics

Summary of quantiles

k   | Quantile name | No. of quantiles | Description in ordered set
2   | Median        | 1                | 50% of observations both above and below the median
4   | Quartiles     | 3                | 25% of observations below 1st, above 3rd and between successive quartiles
5   | Quintiles     | 4                | 20% of observations below 1st, above 4th and between successive quintiles
10  | Deciles       | 9                | 10% of observations below 1st, above 9th and between successive deciles
100 | Percentiles   | 99               | 1% of observations below 1st, above 99th and between successive percentiles

Why use quantiles?

• It is an efficient way of dividing data into groups – groups are approximately equal sized

• Useful when studying relationships of skewed variables

• Not as efficient when the data variability is low: with a small range, the categories do not differ much

Page 32: Descriptive Statistics

The Interquartile Range (IQR)

(Q3 – Q1)

2 3 4 4 5 5 6 6 6 7 7 8 8 9 10 11

Q1 Median Q3

• What are the advantages of Interquartile

range over the range?

• E.g.

0 1 2 3 4 5 6 7 8 9 10

Mean? Range? Median ? IQR?

0 1 2 3 4 5 6 7 8 9 1000

Mean? Range? Median ? IQR?
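The comparison can be worked through in Python. This is a sketch under one assumption: quartiles are taken as the medians of the lower and upper halves of the ordered data (other conventions may give slightly different IQR values). Function names are illustrative.

```python
import statistics


def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the lower/upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return statistics.median(upper) - statistics.median(lower)


def summary(values):
    """Mean, range, median and IQR of a data set."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "median": statistics.median(values),
        "iqr": iqr(values),
    }


without_outlier = summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
with_outlier = summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

print(without_outlier)  # mean 5, range 10, median 5, IQR 6
print(with_outlier)     # mean 95, range 1000, median 5, IQR 6
```

Replacing 10 with 1000 moves the mean and the range a long way, while the median and the IQR stay the same; this is the advantage of the IQR over the range.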

Page 33: Descriptive Statistics

Box-and-whisker plots (Boxplots) (1)

[Figure: annotated boxplot with the labels: extreme values (*), outlier, whisker, Median + 1.5 IQR, Q3 = P75, median, Q1 = P25]

Box-and-whisker plots (Boxplots) (2)

• The box-length represents the interquartile

range

• The whiskers extend to the smallest and

largest observations

• The outliers and extreme values are indicated by separate symbols (e.g. *)

Page 34: Descriptive Statistics

Normal distribution

• The Normal Curve is bell-shaped and

symmetrical.

• It is unimodal (mean = median = mode)

• Tails of the normal curve are asymptotic to the horizontal axis (−∞ to +∞); i.e. the curve approaches the horizontal axis but never touches it

Page 35: Descriptive Statistics

Normal distribution (2)

The Normal curve is determined by its probability density function (pdf), given by the formula

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma}\right)^2\right]$

• Shape of curve depends on two parameters: mean and variance ($\mu$ and $\sigma^2$)

[Figure: Effects of $\mu$ on the probability density function of a normal random variable; curves for Mean = 5 and Mean = 6, x from 1.5 to 8.5, density from 0.0 to 0.4]

[Figure: Effects of $\sigma^2$ on the probability density function of a normal random variable; curves for Variance = 1 and Variance = 4, x from 1.5 to 8.5, density from 0.0 to 0.4]
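The effect of the two parameters can be explored numerically. The sketch below implements the standard normal density, $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp[-\frac{1}{2}(\frac{x-\mu}{\sigma})^2]$, with an illustrative function name, and evaluates it for the means and variances named in the two figures.

```python
import math


def normal_pdf(x, mu, variance):
    """Density of a normal distribution with mean `mu` and the given variance."""
    sigma = math.sqrt(variance)
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Changing the mean shifts the curve; the peak height is unchanged.
print(round(normal_pdf(5, mu=5, variance=1), 4))  # 0.3989
print(round(normal_pdf(6, mu=6, variance=1), 4))  # 0.3989

# Increasing the variance flattens and widens the curve.
print(round(normal_pdf(5, mu=5, variance=4), 4))  # 0.1995
```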

Page 36: Descriptive Statistics

Properties of a Standard Normal Distribution (3)

[Figure: normal curve with the horizontal axis marked at µ - 3SD, µ - 2SD, µ - 1SD, µ, µ + 1SD, µ + 2SD, µ + 3SD]

µ ± 1SD: 68.3% of values

µ ± 2SD: 95.5% of values

µ ± 3SD: 99.7% of values
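These coverage percentages can be checked with the error function from Python's `math` module, using the identity P(|Z| < k) = erf(k/√2) for a normal variable. The function name is illustrative.

```python
import math


def prob_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```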

Skewed Distributions

• Skewness is defined as asymmetry in the

distribution of the sample data values

• Values on one side of the distribution tend to be

further from the 'middle' than values on the

other side

Page 37: Descriptive Statistics

Skewness

• Skewness measures the extent a distribution

of values deviates from symmetry around the

mean

• Simplest measurement is Mean-Median

– If Mean-Median >0, then +ve skew

– If Mean-Median <0, then -ve skew

Skewed distribution

+ve skewness -ve skewness

+ve skewness indicates a greater number of smaller values.

-ve skewness indicates a greater number of larger values.

Page 38: Descriptive Statistics

Positively Skewed Distribution

[Figure: positively skewed distribution with the median and mean marked; axes: X (horizontal), % (vertical)]

Page 39: Descriptive Statistics

Negatively Skewed Distribution

[Figure: negatively skewed distribution with the median and mean marked; axes: X (horizontal), % (vertical)]

For further reading

Pearson’s coefficient of skewness

• Developed in the 1890s by Karl Pearson

• The value of sk will fall within the range of -3 to +3, with a value of 0 associated with a perfectly symmetrical distribution
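The slides do not reproduce the formula. A common form, assumed here, is Pearson's second (median) coefficient, sk = 3 × (mean − median) / SD, which is consistent with the −3 to +3 range quoted above. The Python sketch below uses the sample SD, illustrative function names, and hypothetical data sets.

```python
import statistics


def pearson_skewness(values):
    """Pearson's median skewness: 3 * (mean - median) / sample SD (assumed form)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return 3 * (mean - median) / sd


right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # hypothetical: long right tail
symmetric = [1, 2, 3, 4, 5]                  # hypothetical: symmetric
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail

print(pearson_skewness(right_skewed) > 0)    # True  (+ve skew, mean > median)
print(pearson_skewness(symmetric) == 0)      # True  (mean = median)
print(pearson_skewness(left_skewed) < 0)     # True  (-ve skew, mean < median)
```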

Page 40: Descriptive Statistics

Kurtosis

• “Curvature”

• Defined as a measure reflecting the degree to which a distribution is peaked

• Provides information regarding the height of a

distribution relative to the value of its standard

deviation

• Can be divided into:

– Mesokurtic: bell shaped

– Leptokurtic: ↑ peak (Clustered around the mean)

– Platykurtic: ↓ peak (More dispersed)

For further reading

Testing for Normality

• D’Agostino-Pearson test

• Kolmogorov-Smirnov test

• Lilliefors test

• Shapiro-Wilk W test (7≤n≤2000 )

• Shapiro-Francia W' test (5 ≤n≤5000)

Page 41: Descriptive Statistics

For further reading

Transformation to normality

• If there is evidence of marked non-normality

then we may be able to remedy this by

applying suitable transformations

• The more commonly used transformations which are appropriate for data which are skewed to the right (positive skew) are, in order of increasing strength, sqrt(x), log(x) and 1/x, where the x's are the data values

Commonly used transformations

• If skewed to the right (positive skew): in order of increasing strength, sqrt(x), log(x) and 1/x

• If skewed to the left (negative skew): in order of increasing strength, squaring, cubing, and exp(x)

where the x's are the data values
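A small Python sketch can illustrate how these transformations act on right-skewed data. The data set is hypothetical, and skewness is summarized with 3 × (mean − median) / SD, an assumed choice of measure rather than one prescribed by the slides.

```python
import math
import statistics


def median_skewness(values):
    """3 * (mean - median) / sample SD, used here as a simple skewness summary."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


# Hypothetical right-skewed data (all values must be positive for log and sqrt).
data = [1, 2, 3, 4, 100]

sk_raw = median_skewness(data)
sk_sqrt = median_skewness([math.sqrt(x) for x in data])
sk_log = median_skewness([math.log(x) for x in data])

# For this data set the skewness shrinks as the transformation gets stronger.
print(sk_raw > sk_sqrt > sk_log)  # True
```

In practice the transformed data would still be inspected (for example with a histogram or a normality test) before assuming the transformation has worked.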

Page 42: Descriptive Statistics

Transformation when dealing with

associations between 2 variables

• The circle of powers — sometimes called

the ladder of powers — provides a general

guideline for choosing an appropriate

transformation

Page 43: Descriptive Statistics

• If the plotted data resemble Quadrant I, a transformation that is either “up” on x or “up” on y should be used. In other words, we would raise either x or y to a power p greater than 1

• The more curvature in the data, the higher the value of p needed to achieve linearity

• In general, we prefer to transform x whenever possible

