
RESEARCH STATISTICS

Jobayer Hossain, PhD

Larry Holmes, Jr, PhD

October 16, 2008

Data Summarization

In research, the first step of data analysis is to describe the distribution of the variables included in the study.

Describing the data helps us to:

– Get a quick overall idea of the study

– Get a quick idea of the differences among the groups being compared

– Check the balance, across groups, of the demographic and other prognostic variables that could unduly influence the outcome.

Is the balance of the distribution of prognostic factors in comparing groups important?

Example: Primary biliary cirrhosis trial (a chronic and fatal liver disease)

– A randomized double-blind trial

– Study treatment groups: Azathioprine vs placebo

– Objective: To compare the survival time of the two treatment groups

– Primary end point: Time to death from randomization

Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai


Example (contd.)

Bilirubin is a strong predictor of survival time.

Table 1: Summary statistics for bilirubin level (µmol/L) at baseline

Statistic          Placebo    Azathioprine
No. of patients    94         97
Mean               53.75      67.40
Std. deviation     70.5       88.95
Median             30.90      38.02
Minimum            5.13       7.24
Maximum            436.52     537.03

Is the baseline imbalanced?

You may expect a higher mortality rate in the Azathioprine group. Why? How does it affect the primary end point of survival time?



Model        Covariate            Hazard Ratio   P-value   95% CI
Unadjusted   Treatment (A vs P)   0.86           0.455     0.57 - 1.28
Adjusted     Treatment (A vs P)   0.65           0.044     0.43 - 0.99
Adjusted     Log Bilirubin        2.80           <0.001    2.25 - 3.48

Table 3: Adjusted and unadjusted hazard ratios of death from the Cox proportional hazards model

There was no significant difference between the two treatment groups (p-value = 0.455) before adjustment for the covariate bilirubin.

But after adjustment for bilirubin, a significant difference was found between the treatment groups (p-value = 0.044). Log bilirubin itself was a strong predictor of death (p-value < 0.001).


Looking at Data

How are the data distributed?

– Where is the center?

– What is the range?

– What is the shape of the distribution (symmetric, skewed)?

Are there outliers?

Are there data points that don’t make sense?

Distribution of a variable

Distribution (of a variable) tells us what values the variable takes and how often it takes these values. For example, the ages (1 to 6 years) of 26 pediatric patients at AIDHC are distributed as follows:

Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2

Statistical Description/Summarization of Data

Statistics describes the distribution of a numeric set of data by its

– Center (mean, median, mode, etc.)

– Variability (standard deviation, range, etc.)

– Shape (skewness, kurtosis, etc.)

Statistics describes the distribution of a categorical set of data by

– Frequency, percentage, or proportion of each category

Statistical Description/Summarization of Data

Examples of numerical and categorical variables:

– Numerical variable: Age, blood pressure (systolic and diastolic), time, weight, height, BMI (body mass index)

– Categorical variable: Treatment group, disease status, race, gender, blood type (O, A, B, AB), age groups (such as 1-5 years, 6-9 years, etc.)

Statistical Presentation of Data

There are two types of statistical presentation of data: graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots, and boxplots are used for numerical variables.

Statistical Presentation of Data

Statistics presents the data either graphically or numerically.

In graphical presentation, we look for the overall pattern (distribution) and for striking deviations from that pattern.

An individual value that falls outside the overall pattern is called an outlier.

– The overall pattern of numerical data is usually described by the shape, center, and spread of the data. Commonly used graphs are the histogram, stem-and-leaf plot, and boxplot.

– The overall pattern of categorical data is usually described by frequencies and percentages. Commonly used graphs are the bar plot and pie chart.

Data Presentation – Categorical Variable

Bar Diagram: Lists the categories and presents the percent or count of individuals who fall in each category.

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.00              100
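The proportion and percent columns follow directly from the frequencies. A minimal sketch in Python, using the three group counts from the table (the variable names are illustrative, not from the slides):

```python
# Frequencies per treatment group, taken from the table above.
frequencies = {1: 15, 2: 25, 3: 20}

total = sum(frequencies.values())

# Proportion = count / total; percent = 100 * proportion.
proportions = {group: count / total for group, count in frequencies.items()}
percents = {group: round(100 * p, 1) for group, p in proportions.items()}

for group, count in frequencies.items():
    print(group, count, round(proportions[group], 3), percents[group])
```

The proportions should always sum to 1 and the percents to 100 (up to rounding), which is a quick check on any hand-built frequency table.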

Figure 1: Bar Chart of Subjects in Treatment Groups (x-axis: Treatment Group, 1 to 3; y-axis: Number of Subjects, 0 to 30)

Data Presentation – Categorical Variable

Pie Chart: Lists the categories and presents the percent or count of individuals who fall in each category.

Figure 2: Pie Chart of Subjects in Treatment Groups (group 1: 25%, group 2: 42%, group 3: 33%)


Data Presentation – Categorical Variable (Frequency Distribution)

Consider a data set of 26 children of ages 1-6 years. Then the frequency distribution of the variable 'age' can be tabulated as follows:

Frequency Distribution of Age:

Age        1  2  3  4  5  6
Frequency  5  3  7  5  4  2

Grouped Frequency Distribution of Age:

Age Group  1-2  3-4  5-6
Frequency  8    12   6

Data Presentation – Categorical Variable (Frequency Distribution)

Cumulative frequency of the data on the previous page:

Age                   1  2  3   4   5   6
Frequency             5  3  7   5   4   2
Cumulative Frequency  5  8  15  20  24  26

Age Group             1-2  3-4  5-6
Frequency             8    12   6
Cumulative Frequency  8    20   26
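A cumulative frequency is a running total of the frequencies. The sketch below reproduces both cumulative rows in Python. The age frequencies 5, 3, 7, 5, 4, 2 are taken from the earlier age distribution of the 26 patients, and grouping ages in pairs (1-2, 3-4, 5-6) is how the grouped table is assumed to have been built.

```python
from itertools import accumulate

ages = [1, 2, 3, 4, 5, 6]
frequencies = [5, 3, 7, 5, 4, 2]

# Running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Grouped frequencies: combine ages two at a time (1-2, 3-4, 5-6).
grouped = [sum(frequencies[i:i + 2]) for i in range(0, len(frequencies), 2)]
grouped_cumulative = list(accumulate(grouped))

print(cumulative)
print(grouped, grouped_cumulative)
```

The last cumulative value in each row equals the total number of observations (26).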

Data Presentation – Numerical Variable

Histogram: The overall pattern can be described by its shape, center, and spread. The following age distribution is right skewed. The center lies between 80 and 100 months. There are no outliers.

Figure 3: Age Distribution (histogram; x-axis: Age in Months, bins at 40, 60, 80, 100, 120, 140, More; y-axis: Number of Subjects, 0 to 16)

Mean 90.41666667

Standard Error 3.902649518

Median 84

Mode 84

Standard Deviation 30.22979318

Sample Variance 913.8403955

Kurtosis -1.183899591

Skewness 0.389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Graphical Presentation – Numerical Variable

Boxplot:

– A boxplot is a graph of the five number summary. The central box spans the quartiles.

– A line within the box marks the median.

– Lines extending above and below the box mark the smallest and the largest observations (i.e. the range).

– Outlying samples may be additionally plotted outside the range.

Graphical Presentation – Numerical Variable

Box-Plot: The box contains the middle 50% of the data. The upper and lower whiskers contain the top 25% and the bottom 25% of the ordered data, respectively.

Figure 3: Distribution of Age (box plot; vertical axis 0 to 160, with min, q1, median, q3, and max marked)


The shape of the distribution is right skewed, as the upper part of the box and the upper whisker are longer than the corresponding lower parts.

Side by Side Boxplot

Figure: Side-by-side boxplots of ages of three treatment groups (Trt 1, Trt 2, Trt 3; vertical axis 60 to 140)

Figure: Box Plot: Age of patients (vertical axis: Years, 0.0 to 100.0), annotated with the minimum, 25th percentile, median, 75th percentile, maximum, and interquartile range

Numerical Presentation

A fundamental concept in summary statistics is that of a central value for a set of observations and the extent to which the central value characterizes the whole set of data. Measures of central value such as the mean or median must be coupled with measures of data dispersion (e.g., average distance from the mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let us consider the following two sets of data:

A: 30, 50, 70

B: 40, 50, 60

The mean of both data sets is 50. But the distance of the observations from the mean is larger in data set A than in data set B. Thus, the mean of data set B is a better representation of its data than the mean of data set A is of its data.
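The comparison of data sets A and B can be checked numerically. The sketch below uses the average absolute distance from the mean as the dispersion measure, which is one reading of the "average distance from the mean" mentioned in the text. The helper name `mean_abs_deviation` is illustrative.

```python
from statistics import mean

def mean_abs_deviation(values):
    """Average absolute distance of the observations from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])

set_a = [30, 50, 70]
set_b = [40, 50, 60]

for name, values in (("A", set_a), ("B", set_b)):
    print(name, "mean =", mean(values),
          "average distance =", round(mean_abs_deviation(values), 2))
```

Both sets share the same mean, but the average distance for A is twice that for B, so the mean describes B more closely.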

Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset.

Commonly used methods are mean, median, mode, geometric mean, etc.

Mean: The sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable. Then the mean of this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$


Median: The middle value in an ordered sequence of observations. That is, to find the median we need to order the data set and then find the middle value. In case of an even number of observations, the average of the two middle values is the median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data, giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle values from the sorted sequence, in this case (5 + 6) / 2 = 5.5.
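The sort-then-pick-the-middle procedure can be written out directly. This is a minimal sketch of the rule described above, not a replacement for a statistics package:

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([9, 3, 6, 7, 5]))      # odd number of observations
print(median([9, 3, 6, 7, 5, 2]))   # even number of observations
```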

Mean or Median

The mean is affected by outliers, but the median is not.

Data 1, 2, 3, 4, 5 (dot plot on an axis from 0 to 10): Mean = (1+2+3+4+5)/5 = 15/5 = 3, Median = 3

Data 1, 2, 3, 4, 10 (dot plot on an axis from 0 to 10): Mean = (1+2+3+4+10)/5 = 20/5 = 4, Median = 3

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Mean or Median

The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g. family income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40. So, the mean 270 really fails to give a realistic picture of the major part of the data. It is influenced by the extreme value 990.
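The effect of the extreme value 990 is easy to reproduce with Python's standard `statistics` module. The comparison without the outlier is added here only for illustration.

```python
from statistics import mean, median

values = [20, 30, 40, 990]
without_outlier = [20, 30, 40]

# With the extreme value, the mean is pulled far above most of the data.
print("mean:", mean(values), "median:", median(values))

# Without it, mean and median agree.
print("mean:", mean(without_outlier), "median:", median(without_outlier))
```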


Mode: The value that is observed most frequently. The mode is undefined for sequences in which no observation is repeated. A variable with a single mode is unimodal; one with two modes is bimodal.
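A short sketch of this definition, counting how often each value occurs. Returning an empty list when no value repeats is a choice made here to mirror the "undefined" case in the text; the function name `modes` and the sample data are illustrative.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no observation is repeated: mode undefined per the text
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3]))      # unimodal
print(modes([1, 1, 2, 3, 3]))   # bimodal
print(modes([1, 2, 3]))         # no repeated value
```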

Figure: A bimodal histogram, with each modal class labeled

Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100-2) = 98. It is a crude measure of variability.


Variance: The variance of a set of observations is the average of the squares of the deviations of the observations from their mean, where the sample variance divides by n − 1 rather than n. In symbols, the variance of the n observations $x_1, x_2, \ldots, x_n$ is

$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

$$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3 - 1} = 4$$

Standard Deviation (SD): Square root of the variance. The SD of the above example is 2.

If the distribution is bell shaped (symmetric), then the range is approximately 6 × SD.
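The worked example for 5, 7, 3 can be checked with a few lines of Python. This sketch uses the n − 1 divisor that the worked example uses; the function name is illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    center = sum(values) / n
    return sum((v - center) ** 2 for v in values) / (n - 1)

data = [5, 7, 3]
variance = sample_variance(data)
sd = sqrt(variance)

print("variance:", variance, "SD:", sd)
```

Python's standard library offers the same calculation as `statistics.variance` and `statistics.stdev`.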

Standard deviation of different distributions with the same center

Figure: Dot plots of three data sets on a common axis from 11 to 21

Data A: Mean = 15.5, S = 3.338

Data B: Mean = 15.5, S = 0.926

Data C: Mean = 15.5, S = 4.570

Figure: Histogram titled "Std Dev of Shock Index" (x-axis: SI, 0.0 to 2.0; y-axis: Count, 0.0 to 250.0)

Std. dev is a measure of the “average” scatter around the mean.

Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 1.4/6 = 0.23

Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb


Quartiles: Quartiles are values that divide the sorted dataset into four equal parts so that each part contains 25% of the sorted data.

The first quartile (Q1) is the value below which 25% of the observations fall and above which 75% fall. This is the median of the first half of the ordered dataset.

The second quartile (Q2) is the median of the data.

The third quartile (Q3) is the value below which 75% of the observations fall and above which 25% fall. This is the median of the second half of the ordered dataset.

In notation, the qth quartile of a data set is the ((n+1)/4)q-th observation of the ordered data, where q is the desired quartile and n is the number of observations.

An example with 15 numbers: 3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

In this example, Q1 is the ((15+1)/4)×1 = 4th observation of the ordered data. The 4th observation is 11, so Q1 = 11. The second quartile is Q2 = 40 (this is also the median). The third quartile is Q3 = 61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of this example is 61 - 11 = 50. The middle half of the ordered data lie between 11 and 61.
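The quartiles of the 15-number example can be reproduced with Python's standard library (version 3.8 or later). As far as I can tell, `statistics.quantiles` with `method="exclusive"` places the cut points at the (n+1)q/4 positions used above, but other software may use slightly different conventions and give slightly different quartiles.

```python
from statistics import quantiles

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

# Inter-quartile range: difference between the third and first quartiles.
iqr = q3 - q1

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```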


Figure: Positions of Q1, Q2, and Q3 for left skewed, symmetric, and right skewed distributions; each of the four parts contains 25% of the data

Deciles and Percentiles

Deciles: If data are ordered and divided into 10 parts, then the cut points are called deciles.

Percentiles: If data are ordered and divided into 100 parts, then the cut points are called percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th percentile is Q3.

Suppose PC = ((n+1)/100)p, where n is the number of observations and p is the desired percentile. If PC is an integer, then the pth percentile of a data set is the (PC)th observation of the ordered data. Otherwise, let PI be the integer part of PC and f be the fractional part of PC. Then the pth percentile = OI + (OII - OI) × f, where OI is the (PI)th observation of the ordered data and OII is the (PI+1)th observation of the ordered data.

For example, consider the following ordered set of data: 3, 5, 7, 8, 9, 11, 13, 15.

Here n = 8, so PC = (9/100)p.

For the 25th percentile, PC = 2.25 (not an integer), so PI = 2 and f = 0.25. Then

25th percentile = 5 + (7 - 5) × 0.25 = 5.5
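The percentile rule above translates into a short function. This is a sketch of that rule only. Clamping to the smallest or largest observation when PC falls outside 1..n is an assumption added here, since the slides do not cover that case.

```python
def percentile(values, p):
    """p-th percentile using PC = (n + 1) * p / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    pc = (n + 1) * p / 100
    pi = int(pc)   # integer part of PC
    f = pc - pi    # fractional part of PC
    # Assumption: clamp when the position falls outside the data.
    if pi < 1:
        return ordered[0]
    if pi >= n:
        return ordered[-1]
    lower = ordered[pi - 1]  # the (PI)th observation, 1-based
    upper = ordered[pi]      # the (PI+1)th observation
    return lower + (upper - lower) * f

data = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(data, 25))
print(percentile(data, 50))
```

When PC is an integer, f is 0 and the function returns the (PC)th observation, as the rule requires.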

Coefficient of Variation

Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually expressed in percent.

Coefficient of Variation = 100 × (standard deviation / mean)
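As a small illustration, the coefficient of variation for the earlier variance example (5, 7, 3, with mean 5 and SD 2) can be computed as follows. The sketch assumes the sample standard deviation (n − 1 divisor) is the one intended.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return 100 * stdev(values) / mean(values)

cv = coefficient_of_variation([5, 7, 3])
print(round(cv, 1), "%")
```

Because it is unit free, the coefficient of variation allows the spread of variables measured on different scales to be compared.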

Five Number Summary

Five Number Summary: The five number summary of a distribution consists of the smallest (minimum) observation, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the largest (maximum) observation, written in order from smallest to largest.
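A sketch of the five number summary for the 15-number quartile example, using Python's standard library (3.8 or later). The quartile convention (`method="exclusive"`) is assumed to match the (n+1)/4 rule used in these slides; the function name is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)
```

These five values are exactly what a boxplot draws: the whisker ends, the box edges, and the line inside the box.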

Choosing a Summary

The five number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers. The mean and standard deviation are reasonable for symmetric distributions that are free of outliers.

In real life we can't always expect the data to be symmetric. It is common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data. We can include other summary statistics such as Q1, Q3, and the coefficient of variation if they are considered important for describing the data.

Shape of Data

Shape of data is measured by

– Skewness

– Kurtosis

Skewness

Skewness measures the asymmetry of the data.

– Positive or right skewed: Longer right tail

– Negative or left skewed: Longer left tail

– Symmetric: e.g. bell shaped

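Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Figure: Example histograms of a right skewed and a left skewed distribution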


Bell-shaped Histograms


Kurtosis Formula

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}} - 3$$

Kurtosis

Kurtosis relates to the relative flatness or peakedness of a distribution. A standard normal distribution (blue line: µ = 0; σ = 1) has kurtosis = 0. A distribution like that illustrated with the red curve has kurtosis > 0 with a lower peak relative to its tails.
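Both shape measures can be computed from sums of powers of the deviations from the mean. The sketch below assumes the moment-based definitions: skewness = √n Σ(x − x̄)³ / (Σ(x − x̄)²)^(3/2), and kurtosis = n Σ(x − x̄)⁴ / (Σ(x − x̄)²)² − 3. Spreadsheet and statistics packages often apply small-sample corrections, so their reported values may differ somewhat from these.

```python
from math import sqrt

def skewness(values):
    """Moment-based skewness: positive for a longer right tail."""
    n = len(values)
    center = sum(values) / n
    s2 = sum((v - center) ** 2 for v in values)
    s3 = sum((v - center) ** 3 for v in values)
    return sqrt(n) * s3 / s2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution is near 0."""
    n = len(values)
    center = sum(values) / n
    s2 = sum((v - center) ** 2 for v in values)
    s4 = sum((v - center) ** 4 for v in values)
    return n * s4 / s2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```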

Normal Distribution

The Normal Distribution is a density curve based on the following formula. It is completely defined by two parameters: the mean µ and the standard deviation σ.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad -\infty < x < \infty$$

A density function describes the overall pattern of a distribution.

The total area under the curve is always 1.0. The normal distribution is symmetrical. The mean, median, and mode are all the same.


The 68-95-99.7 Rule

In the normal distribution with mean µ and standard deviation σ:

68% of the observations fall within σ of the mean µ.

95% of the observations fall within 2σ of the mean µ.

99.7% of the observations fall within 3σ of the mean µ.
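The three percentages are rounded values of the probability that a normal variable falls within k standard deviations of its mean, which equals erf(k/√2). A quick check with Python's `math` module:

```python
from math import erf, sqrt

def prob_within(k):
    """P(|Z| < k) for a standard normal Z."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * prob_within(k):.2f}%")
```

The result does not depend on µ or σ, because standardizing any normal variable gives the same standard normal distribution.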

68-95-99.7 Rule

Figure: Normal curve with bands marking 68%, 95%, and 99.7% of the data

Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb


Standardizing and z-Scores

If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is

$$z = \frac{x - \mu}{\sigma}$$

A standardized value is often called a z-score. If x is a normal variable with mean µ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.

The density function of a standard normal variable z is

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty$$
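As an illustration, the sketch below standardizes the smallest and largest ages from the age summary shown earlier (mean 90.41666667, standard deviation 30.22979318, minimum 48, maximum 143). Using the sample mean and sample SD in place of µ and σ is an assumption made for the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Sample statistics for age, used here as stand-ins for mu and sigma.
age_mean = 90.41666667
age_sd = 30.22979318

z_max = z_score(143, age_mean, age_sd)
z_min = z_score(48, age_mean, age_sd)

print(round(z_max, 2), round(z_min, 2))
```

The maximum lies further from the mean (in SD units) than the minimum, consistent with the right skew noted for this variable.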

Normal Distribution

Let x1, x2, …, xn be n independent normal random variables, each with mean µ and standard deviation σ. Then their sum ∑xi is also normal, with mean nµ and standard deviation σ√n. The distribution of the mean $\bar{x}$ is also normal, with mean µ and standard deviation σ/√n.

The standardized score of the mean is

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

The mean of this standardized random variable is 0 and its standard deviation is 1.
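The quantity σ/√n is the standard error of the mean. The sketch below recomputes the "Standard Error" reported in the age summary (standard deviation 30.22979318, n = 60), using the sample SD as a stand-in for σ. The inputs in the final line are hypothetical, chosen only to show the standardized score of a mean.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the mean of n independent observations."""
    return sigma / sqrt(n)

def z_for_mean(xbar, mu, sigma, n):
    """Standardized score of a sample mean."""
    return (xbar - mu) / standard_error(sigma, n)

se_age = standard_error(30.22979318, 60)
print(round(se_age, 6))

# Hypothetical values: sample mean 52, population mean 50, sigma 10, n 25.
print(z_for_mean(52, 50, 10, 25))
```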

SPSS demo – Data Summarization: Categorical Variable

Frequencies/percentages: Analyze -> Frequencies -> Select variables (sex, grp, shades, ped) -> ok

sex

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
f       30          50.0      50.0            50.0
m       30          50.0      50.0            100.0
Total   60          100.0     100.0

grp

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
1       20          33.3      33.3            33.3
2       20          33.3      33.3            66.7
3       20          33.3      33.3            100.0
Total   60          100.0     100.0

SPSS demo – Data Summarization: Categorical Variable

Shades

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
1       30          50.0      50.0            50.0
2       30          50.0      50.0            100.0
Total   60          100.0     100.0

Ped

Valid   Frequency   Percent   Valid Percent   Cumulative Percent
1       30          50.0      50.0            50.0
2       30          50.0      50.0            100.0
Total   60          100.0     100.0

SPSS demo – Bar Chart

Analyze -> Frequencies -> Select variables (sex, grp, shades, ped), then select the Charts option -> Select chart type (bar, histogram, pie chart) and select percentages or frequencies -> Continue -> ok

Or

Graphs -> Bar -> Select type (simple, clustered, stacked) -> Define -> Select what the bars represent (N of cases, % of cases) -> Select the variable for the category axis (e.g. grp) and click Titles to write titles -> Continue -> ok


SPSS demo – Data Summarization: Numerical Variable

Analyze -> Descriptive Statistics -> Descriptives -> Select variable(s) (e.g. age, hgt) and click the transfer button to move the variable(s) to the other window, then select Options -> Continue -> ok

Or

Analyze -> Compare Means -> Means -> Select variable(s) for the dependent list (age, hgt) and the independent list (grp, sex), then select Options -> Continue -> ok

SPSS demo – Data Summarization: Numerical Variable

Descriptive Statistics

Variable             N    Minimum   Maximum   Mean      Std. Deviation
age                  60   48.00     143.00    90.4167   30.22979
hgt                  60   38.99     58.09     47.9931   5.77357
Valid N (listwise)   60

Report

grp     Statistic        age        hgt
1       Mean             86.6500    47.4651
        N                20         20
        Std. Deviation   32.53545   6.07346
        Median           69.0000    44.9115
        Minimum          53.00      40.03
        Maximum          142.00     58.09
2       Mean             95.8000    48.7766
        N                20         20
        Std. Deviation   28.00113   5.30175
        Median           93.0000    48.6208
        Minimum          48.00      40.30
        Maximum          143.00     57.16
3       Mean             88.8000    47.7377
        N                20         20
        Std. Deviation   30.77183   6.12433
        Median           81.5000    46.9664
        Minimum          51.00      38.99
        Maximum          138.00     57.68
Total   Mean             90.4167    47.9931
        N                60         60
        Std. Deviation   30.22979   5.77357
        Median           84.0000    46.6863
        Minimum          48.00      38.99
        Maximum          143.00     58.09

SPSS demo – Boxplots

Graphs -> Boxplot -> Simple -> Define -> Select variable(s) (e.g. PLUC_pre) and category axis (e.g. grp) -> OK

MS Excel demo: Summary Statistics – Categorical Variable

Frequency: Type bins -> Insert -> Function -> Statistical -> Frequency -> Select ranges for data (grp) and bins -> place the cursor to the left of the equal sign and then press Ctrl, Shift, and Enter simultaneously.

Bin   Freq
1     20
2     20
3     20

Pie Chart: Select Frequency -> Chart -> Pie -> Series: write category labels (1, 2, 3) -> Next -> click Titles and write the title, click Data Labels and select Show percent, then click Next.

Figure: Pie Chart: Patients by treatment group (groups 1, 2, 3)

MS Excel demo: Summary Statistics – Numerical Variable

age

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

Questions