RESEARCH STATISTICS
Jobayer Hossain, PhD
Larry Holmes, Jr, PhD
October 16, 2008
Data Summarization
In research, the first step of data analysis is to describe the
distribution of the variables included in the study.
The advantages of describing the data are:
– To get a quick overall idea of the study
– To get a quick idea of the differences among the groups being compared
– To check the balance, across groups, of the demographic and
other prognostic variables that may unduly influence the outcome.
Is the balance of the distribution of prognostic factors in comparing groups important?
Example: Primary biliary cirrhosis trial (primary biliary cirrhosis is a chronic and
fatal liver disease)
– A randomized double-blind trial
– Study treatment groups: Azathioprine vs placebo
– Objective: To compare the survival time of two treatment
groups
– Primary end point: Time to death from randomization
Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai
Is the balance of the distribution of prognostic factors in comparing groups important?
Example … contd.
Bilirubin is a strong predictor of survival time.
Statistics        Placebo   Azathioprine
No. of patients   94        97
Mean              53.75     67.40
Std. deviation    70.5      88.95
Median            30.90     38.02
Minimum           5.13      7.24
Maximum           436.52    537.03
Table 1: Summary statistics for bilirubin level (µmol/L) at baseline
Is the baseline imbalanced?
You may expect a higher mortality rate in the Azathioprine group. Why? How does it affect the primary end point of survival time?
Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai
Is the balance of the distribution of prognostic factors in comparing groups important?
Covariate adjustment   Parameter            Hazard Ratio   P-value   95% CI
Unadjusted             Treatment (A vs P)   0.86           0.455     0.57 – 1.28
Adjusted               Treatment (A vs P)   0.65           0.044     0.43 – 0.99
                       Log Bilirubin        2.80           <0.001    2.25 – 3.48
Table 3: Adjusted and unadjusted hazard ratios of death from the Cox proportional hazards model
There was no significant difference between the two treatment groups (p-value = 0.455) before adjustment for the covariate bilirubin.
But after adjustment for bilirubin, a significant difference was found between the treatment groups (p-value = 0.044). Log bilirubin itself was a strong predictor of death (p-value < 0.001).
Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai
Looking at Data
How are the data distributed?
– Where is the center?
– What is the range?
– What is the shape of the distribution (symmetric, skewed)?
Are there outliers?
Are there data points that don’t make sense?
Distribution of a variable
The distribution of a variable tells us what values the
variable takes and how often it takes these values. E.g.,
the ages (1 to 6 years) of 26 pediatric patients
at AIDHC are distributed as follows:
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Statistical Description/Summarization of Data
Statistics describes the distribution of a numeric set of data by
its
Center (mean, median, mode, etc.)
Variability (standard deviation, range, etc.)
Shape (skewness, kurtosis, etc.)
Statistics describes the distribution of a categorical set of data by the
frequency, percentage, or proportion of each category
Statistical Description/Summarization of Data
Examples of numerical and categorical variables:
– Numerical variable: Age, blood pressure (systolic and
diastolic), time, weight, height, BMI (body mass index)
– Categorical variable: Treatment group, disease status,
race, gender, blood type (O, A, B, AB), age groups
(such as 1-5 years, 6-9 years, etc.)
Statistical Presentation of Data
There are two types of statistical presentation of data: graphical and numerical.
Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.
Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Statistical Presentation of Data
Statistics presents the data either graphically or numerically.
In graphical presentation, we look for the overall pattern (distribution) and for striking deviations from that pattern.
An individual value that falls outside the overall pattern is called an outlier.
– The overall pattern of numerical data is usually described by the shape, center, and spread of the data. Commonly used graphs are the histogram, stem-and-leaf plot, and boxplot.
– The overall pattern of categorical data is usually described by frequencies and percentages. Commonly used graphs are the bar plot and pie chart.
Data Presentation – Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall in each category.
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          (15/60)=0.250    25.0
2                 25          (25/60)=0.417    41.7
3                 20          (20/60)=0.333    33.3
Total             60          1.00             100
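As a rough illustration, the frequency, proportion, and percent columns of a table like this one can be computed with the Python sketch below. The list `groups` is an assumed raw-data layout (one group label per subject) built to match the counts 15, 25, and 20; it is not part of the original slides.

```python
from collections import Counter

# Assumed raw data: one treatment-group label per subject (15, 25, and 20 subjects).
groups = [1] * 15 + [2] * 25 + [3] * 20

counts = Counter(groups)
total = sum(counts.values())

# One row per category: (group, frequency, proportion, percent).
table = [
    (g, n, round(n / total, 3), round(100 * n / total, 1))
    for g, n in sorted(counts.items())
]

for g, n, proportion, percent in table:
    print(f"group {g}: frequency={n}, proportion={proportion}, percent={percent}")
print(f"total: {total}")
```

Group 2, with 25 of the 60 subjects, has proportion 25/60 = 0.417 (41.7%), and group 3 has 20/60 = 0.333 (33.3%).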
Figure 1: Bar Chart of Subjects in Treatment Groups (x-axis: Treatment Group, 1 to 3; y-axis: Number of Subjects, 0 to 30)
Data Presentation – Categorical Variable
Pie Chart: Shows each category as a slice of a circle, with the slice size proportional to the percent or count of individuals who fall in that category.
Figure 2: Pie Chart of Subjects in Treatment Groups (group 1: 25%, group 2: 42%, group 3: 33%)
Treatment Group   Frequency   Proportion       Percent (%)
1                 15          (15/60)=0.250    25.0
2                 25          (25/60)=0.417    41.7
3                 20          (20/60)=0.333    33.3
Total             60          1.00             100
Data Presentation – Categorical Variable (Frequency Distribution)
Consider a data set of 26 children of ages 1-6 years. Then the frequency distribution of the variable ‘age’ can be tabulated as follows:
Frequency Distribution of Age:
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Grouped Frequency Distribution of Age:
Age Group 1-2 3-4 5-6
Frequency 8 12 6
Data Presentation – Categorical Variable (Frequency Distribution)
Cumulative frequency of the data on the previous slide:
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Cumulative Frequency 5 8 15 20 24 26
Age Group 1-2 3-4 5-6
Frequency 8 12 6
Cumulative Frequency 8 20 26
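The grouped and cumulative frequencies follow mechanically from the ungrouped frequency table. The Python sketch below shows one possible way to derive them; the dictionary layout and the names `freq` and `age_groups` are illustrative choices, not part of the slides.

```python
from itertools import accumulate

# Frequency distribution of age from the slides: age -> number of children.
freq = {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}

# Cumulative frequency: running total of the frequencies in age order.
cumulative = list(accumulate(freq[age] for age in sorted(freq)))

# Grouped frequency: combine adjacent ages into the groups 1-2, 3-4, and 5-6.
age_groups = {"1-2": (1, 2), "3-4": (3, 4), "5-6": (5, 6)}
grouped = {
    label: sum(freq[age] for age in range(low, high + 1))
    for label, (low, high) in age_groups.items()
}
grouped_cumulative = list(accumulate(grouped.values()))

print(cumulative)          # [5, 8, 15, 20, 24, 26]
print(grouped)             # {'1-2': 8, '3-4': 12, '5-6': 6}
print(grouped_cumulative)  # [8, 20, 26]
```

The last cumulative value always equals the total number of observations (26 here), which is a quick check on the table.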
Data Presentation – Numerical Variable
Figure 3: Age Distribution (histogram; x-axis: Age in Months, bins from 40 to 140 and More; y-axis: Number of Subjects, 0 to 16)
Histogram: The overall pattern can be described by its shape, center, and spread. This age distribution is right skewed, its center lies between 80 and 100 months, and there are no outliers.
Mean 90.41666667
Standard Error 3.902649518
Median 84
Mode 84
Standard Deviation 30.22979318
Sample Variance 913.8403955
Kurtosis -1.183899591
Skewness 0.389872725
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
Graphical Presentation – Numerical Variable
Boxplot:
– A boxplot is a graph of the five number summary. The central box
spans the quartiles.
– A line within the box marks the median.
– Lines extending above and below the box mark the smallest and
the largest observations (i.e. the range).
– Outlying samples may be additionally plotted outside the range.
Graphical Presentation – Numerical Variable
Box Plot: The box contains the middle 50% of the data. The upper and lower whiskers cover the top 25% and bottom 25% of the ordered data.
Figure 3: Distribution of Age (box plot with the minimum, 25th percentile, median, 75th percentile, and maximum marked). The summary statistics shown beside the plot are the same as those listed with the histogram.
The distribution is right skewed, as the upper part of the box and the upper whisker are longer than the corresponding lower parts.
Side by Side Boxplot
Figure: Side by side boxplots of ages of three treatment groups (Trt 1, Trt 2, Trt 3)
Figure: Box Plot: Age of patients (y-axis: Years)
Annotated on the plots: minimum, 25th percentile, median, 75th percentile, maximum, interquartile range
Numerical Presentation
To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distances of the observations from
the mean are larger in data set A than in data set B. Thus, the mean of data
set B represents its data better than the mean of data set A does.
A fundamental concept in summary statistics is that of a central value for a set
of observations and the extent to which the central value characterizes the
whole set of data. Measures of central value such as the mean or median must
be coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.
Methods of Center Measurement
Commonly used methods are the mean, median, mode, geometric mean, etc.
Mean: The sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.
Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable. Then the mean of this variable is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Center measurement is a summary measure of the overall level of a dataset
Methods of Center Measurement
Median: The middle value in an ordered sequence of observations.
That is, to find the median we need to order the data set and then
find the middle value. In case of an even number of observations
the average of the two middle most values is the median. For
example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the
number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the
median is the average of the two middle values from the sorted
sequence, in this case, (5 + 6) / 2 = 5.5.
Mean or Median
The mean is affected by outliers, but the median is not.
Data 1, 2, 3, 4, 5 (plotted on a 0 to 10 scale): Mean = (1+2+3+4+5)/5 = 15/5 = 3, Median = 3
Data 1, 2, 3, 4, 10 (plotted on a 0 to 10 scale): Mean = (1+2+3+4+10)/5 = 20/5 = 4, Median = 3
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and is thus a better
measure than the mean for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these
four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40,
so the mean of 270 fails to give a realistic picture of the major part of the data. It is
pulled upward by the extreme value 990.
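The effect of a single extreme value can be checked directly. The short Python sketch below uses the standard-library `statistics` module on the numbers from this example; the variable names are illustrative.

```python
from statistics import mean, median

values = [20, 30, 40]
with_outlier = values + [990]  # add the extreme observation

# Without the outlier, the mean and the median agree.
print(mean(values), median(values))

# With the outlier, the mean jumps to 270 while the median only moves to 35.
print(mean(with_outlier), median(with_outlier))
```

The median shifts only from 30 to 35, whereas the mean shifts from 30 to 270.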
Methods of Center Measurement
Mode: The value that is observed most frequently. The
mode is undefined for sequences in which no observation
is repeated. A variable with a single mode is unimodal; one with
two modes is bimodal.
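A minimal Python sketch of this definition is given below. The helper `modes` is a hypothetical function written for illustration; it returns an empty list when no value repeats, mirroring the convention above that the mode is then undefined, and it assumes the input is not empty.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] if no value is repeated."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []  # no observation is repeated: mode treated as undefined
    return sorted(value for value, count in counts.items() if count == top)

# Ages of the 26 children from the frequency table (5, 3, 7, 5, 4, 2 children).
ages = [1] * 5 + [2] * 3 + [3] * 7 + [4] * 5 + [5] * 4 + [6] * 2

print(modes(ages))             # [3]     -> unimodal
print(modes([2, 2, 5, 7, 7]))  # [2, 7]  -> bimodal
print(modes([9, 3, 6, 7, 5]))  # []      -> no repeated value
```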
Figure: A bimodal histogram, with a modal class labeled
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz
Methods of Variability Measurement
Commonly used methods: range, variance, standard deviation, interquartile range, coefficient of variation etc.
Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.
Variability (or dispersion) measures the amount of scatter in a dataset.
Methods of Variability Measurement
Variance: The variance of a set of observations is, roughly, the average of the squares of the deviations of the observations from their mean, with the sum of squares divided by n − 1 rather than n. In symbols, the variance of the n observations $x_1, x_2, \ldots, x_n$ is
$$S^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is
$$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3 - 1} = 4$$
Standard Deviation (SD): The square root of the variance. The SD in the above example is √4 = 2.
If the distribution is bell shaped (symmetric), then the range is approximately 6 × SD.
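The worked example for 5, 7, 3 can be reproduced step by step. The Python sketch below computes the sample variance with the n − 1 divisor and compares it with `statistics.variance` and `statistics.stdev`, which use the same divisor; the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

data = [5, 7, 3]
n = len(data)
x_bar = sum(data) / n                                # mean = 5.0

# Sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # variance = 4.0
s = sqrt(s2)                                         # standard deviation = 2.0

print(x_bar, s2, s)
print(variance(data), stdev(data))                   # library results for comparison
```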
Standard deviation of different distributions with the same center
Figure: Three data sets plotted on a common axis from 11 to 21
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Std Dev of Shock Index
Figure: Histogram of Shock Index (x-axis: SI, 0.0 to 2.0; y-axis: Count, 0.0 to 250.0)
The standard deviation is a measure of the “average” scatter around the mean.
Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 1.4/6 ≈ 0.23.
Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb
Methods of Variability Measurement
Quartiles: Quartiles are values that divide the sorted dataset into four equal parts so that each part contains 25% of the sorted data.
The first quartile (Q1) is the value below which 25% of the observations fall and above which 75% fall. It is the median of the 1st half of the ordered dataset.
The second quartile (Q2) is the median of the data.
The third quartile (Q3) is the value below which 75% of the observations fall and above which 25% fall. It is the median of the 2nd half of the ordered dataset.
In notation, the qth quartile is the (q(n+1)/4)th observation of the ordered data, where q is the desired quartile (1, 2, or 3) and n is the number of observations.
Methods of Variability Measurement
An example with 15 numbers: 3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
In this example, Q1 is the ((15+1)/4)×1 = 4th observation of the ordered data. The 4th observation is 11, so Q1 = 11.
Likewise, the second quartile is the 8th observation, Q2 = 40 (this is also the median), and the third quartile is the 12th observation, Q3 = 61.
Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile range in this example is 61 − 11 = 50. The middle half of the ordered data lies between 11 and 61.
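The position rule q(n+1)/4 can be applied to the 15 numbers of this example as in the Python sketch below. The helper `quartile_position` is a hypothetical name; the sketch relies on the positions being whole numbers here (4, 8, and 12), and fractional positions would need interpolation between neighbouring observations.

```python
data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
ordered = sorted(data)
n = len(ordered)

def quartile_position(n, q):
    """1-based position of the q-th quartile among n ordered observations."""
    return q * (n + 1) / 4

positions = [quartile_position(n, q) for q in (1, 2, 3)]   # [4.0, 8.0, 12.0]

# Positions are whole numbers for n = 15, so each quartile is an observed value.
q1, q2, q3 = (ordered[int(position) - 1] for position in positions)
iqr = q3 - q1   # inter-quartile range = Q3 - Q1

print(positions)
print(q1, q2, q3, iqr)   # 11 40 61 50
```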
Methods of Variability Measurement
Figure: Positions of Q1, Q2, and Q3 for symmetric, left-skewed, and right-skewed distributions; the quartiles split each distribution into four parts of 25% each
Deciles and Percentiles
Deciles: If the data are ordered and divided into 10 equal parts, the cut points are called deciles.
Percentiles: If the data are ordered and divided into 100 equal parts, the cut points are called percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th percentile is Q3.
Suppose PC = ((n+1)/100)p, where n is the number of observations and p is the desired percentile. If PC is an integer, then the pth percentile of the data set is the (PC)th observation of the ordered data. Otherwise, let PI be the integer part of PC and f be the fractional part of PC. Then the pth percentile = OI + (OII − OI) × f, where OI is the (PI)th observation and OII is the (PI+1)th observation of the ordered data.
For example, consider the following ordered set of data: 3, 5, 7, 8, 9, 11, 13, 15.
Here n = 8, so PC = (9/100)p.
For the 25th percentile, PC = 2.25 (not an integer), so PI = 2, f = 0.25, OI = 5, and OII = 7. Then
25th percentile = 5 + (7 − 5) × 0.25 = 5.5
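The interpolation rule above translates almost line by line into code. The Python sketch below is one possible implementation; the function name `percentile` and the clamping of positions that fall outside the data (below the 1st or above the nth observation) are assumptions added here, since the slides do not cover that case.

```python
def percentile(data, p):
    """p-th percentile using the position PC = (n + 1) * p / 100 with interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    pc = (n + 1) * p / 100
    pi = int(pc)        # integer part of PC
    f = pc - pi         # fractional part of PC

    # Assumption (not in the slides): clamp positions outside 1..n to the extremes.
    if pi < 1:
        return ordered[0]
    if pi >= n:
        return ordered[-1]

    o1 = ordered[pi - 1]   # the (PI)th observation
    o2 = ordered[pi]       # the (PI + 1)th observation
    return o1 + (o2 - o1) * f

data = [3, 5, 7, 8, 9, 11, 13, 15]
print(percentile(data, 25))   # 5.5, as in the worked example
print(percentile(data, 50))   # median
print(percentile(data, 75))   # third quartile
```

When PC is a whole number, f is 0 and the function returns the (PC)th observation itself.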
Coefficient of Variation
Coefficient of Variation: The standard deviation of the data divided
by its mean. It is usually expressed in percent.
$$\text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100$$
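As a small illustration, the Python sketch below computes the coefficient of variation for the data 5, 7, 3 used in the variance example (SD 2, mean 5). The function name is hypothetical, and the sketch assumes the sample standard deviation (n − 1 divisor) and a nonzero mean.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Standard deviation as a percentage of the mean (mean assumed nonzero)."""
    return 100 * stdev(data) / mean(data)

cv = coefficient_of_variation([5, 7, 3])
print(f"CV = {cv:.1f}%")   # SD 2 relative to mean 5 gives 40%
```

Because it is unit free, the coefficient of variation allows the spread of variables measured on different scales to be compared.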
Five Number Summary
Five Number Summary: The five number summary of a
distribution consists of the smallest (Minimum) observation, the
first quartile (Q1), the median (Q2), the third quartile (Q3), and the
largest (Maximum) observation, written in order from smallest to
largest.
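A five number summary can be assembled from the minimum, the three quartiles, and the maximum. The Python sketch below uses `statistics.quantiles` (Python 3.8 or later), whose default "exclusive" method places quartiles at (n+1)-based positions; the function name `five_number_summary` is illustrative, and the data are the 15 numbers from the quartile example.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4)   # default method="exclusive"
    return min(data), q1, q2, q3, max(data)

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)
```

Other software may use slightly different quartile rules, so Q1 and Q3 can differ a little between packages for small samples.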
Choosing a Summary
The five number summary is usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with extreme
outliers. The mean and standard deviation are reasonable for symmetric
distributions that are free of outliers.
In real life we can’t always expect the data to be symmetric. It is common
practice to report the number of observations (n), mean, median, standard
deviation, and range when summarizing data. We can
include other summary statistics, such as Q1, Q3, and the coefficient of variation, if they are
considered important for describing the data.
Shape of Data
Shape of data is measured by
– Skewness
– Kurtosis
Skewness
Measures the asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail
– Symmetric: Bell shaped
Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,
$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$
Figure: Histograms of a right skewed and a left skewed distribution
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz
Bell-shaped Histograms
Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz
Kurtosis Formula
Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,
$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$$
Kurtosis
Kurtosis relates to the relative flatness or peakedness of a distribution.
A standard normal distribution (blue line: µ = 0; σ = 1) has kurtosis = 0.
A distribution like that illustrated with the red curve has kurtosis > 0 with a lower peak relative to its tails.
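The Python sketch below implements moment-based versions of the two shape measures: skewness as √n·Σ(x − x̄)³ / (Σ(x − x̄)²)^(3/2) and kurtosis as n·Σ(x − x̄)⁴ / (Σ(x − x̄)²)² − 3. The function names are illustrative. Spreadsheet and SPSS outputs typically apply small-sample corrections, so their values can differ slightly from these.

```python
from math import sqrt

def skewness(data):
    """Moment-based skewness: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(data)
    x_bar = sum(data) / n
    ss2 = sum((x - x_bar) ** 2 for x in data)
    ss3 = sum((x - x_bar) ** 3 for x in data)
    return sqrt(n) * ss3 / ss2 ** 1.5

def kurtosis(data):
    """Moment-based excess kurtosis: n * sum(d^4) / (sum(d^2))^2 - 3."""
    n = len(data)
    x_bar = sum(data) / n
    ss2 = sum((x - x_bar) ** 2 for x in data)
    ss4 = sum((x - x_bar) ** 4 for x in data)
    return n * ss4 / ss2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]

print(skewness(symmetric))      # 0.0 for symmetric data
print(skewness(right_skewed))   # positive: longer right tail
print(kurtosis(symmetric))      # negative: flatter than a normal curve
```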
Normal Distribution
The Normal Distribution is a density curve based on the following formula. It is completely defined by two parameters: the mean µ and the standard deviation σ.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$
A density function describes the overall pattern of a distribution.
The total area under the curve is always 1.0. The normal distribution is symmetrical. The mean, median, and mode are all the same.
Normal Distribution
The 68-95-99.7 Rule
In the normal distribution with mean µ and standard deviation σ:
Approximately 68% of the observations fall within σ of the mean µ.
Approximately 95% of the observations fall within 2σ of the mean µ.
Approximately 99.7% of the observations fall within 3σ of the mean µ.
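The three percentages in this rule are rounded values of normal probabilities. As a quick check, the Python sketch below uses the identity P(|X − µ| ≤ kσ) = erf(k/√2) for a normal variable; the function name `prob_within` is illustrative.

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * prob_within(k):.1f}%")
```

The more precise values are about 68.3%, 95.4%, and 99.7%.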
68-95-99.7 Rule
Figure: Normal curve with the central regions containing 68%, 95%, and 99.7% of the data marked
Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb
Normal Distribution
Standardizing and z-Scores
If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
A standardized value is often called a z-score. If x is a normal variable with mean µ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.
The density function of a standard normal variable is
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty$$
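A minimal Python sketch of standardizing follows. The numbers used (µ = 50, σ = 10, x = 65) are hypothetical values chosen only for illustration, and the function names are not from the slides.

```python
from math import exp, pi, sqrt

def z_score(x, mu, sigma):
    """Standardized value of x for a distribution with mean mu and SD sigma."""
    return (x - mu) / sigma

def standard_normal_pdf(z):
    """Density of the standard normal distribution at z."""
    return exp(-z * z / 2) / sqrt(2 * pi)

# Hypothetical example: an observation of 65 from a distribution with mean 50 and SD 10.
z = z_score(65, mu=50, sigma=10)
print(z)                        # 1.5 standard deviations above the mean
print(standard_normal_pdf(0))   # peak height of the standard normal curve
```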
Normal Distribution
Let x1, x2, …, xn be n independent normal random variables, each with mean µ and standard deviation σ.
Then their sum ∑xi is also normal, with mean nµ and standard deviation σ√n. The distribution of the mean $\bar{x}$ is also normal, with mean µ and standard deviation σ/√n.
The standardized score of the mean is
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
The mean of this standardized random variable is 0 and its standard deviation is 1.
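The quantity σ/√n is the standard deviation of the sample mean, usually called the standard error. The Python sketch below evaluates it for the age data summarized in these slides (standard deviation 30.22979318, n = 60), using the sample standard deviation in place of σ as an assumption. The function names and the numbers passed to `z_for_mean` are hypothetical.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard deviation of the mean of n observations: sd / sqrt(n)."""
    return sd / sqrt(n)

def z_for_mean(x_bar, mu, sigma, n):
    """Standardized score of a sample mean."""
    return (x_bar - mu) / standard_error(sigma, n)

# Age data from the slides: SD = 30.22979318, n = 60 (sample SD used in place of sigma).
se = standard_error(30.22979318, 60)
print(round(se, 4))   # close to the "Standard Error" value in the summary output

# Hypothetical example: sample mean 52 from n = 25 observations, mu = 50, sigma = 10.
print(z_for_mean(52, mu=50, sigma=10, n=25))
```

The result for the age data agrees with the Standard Error of about 3.90 reported in the summary statistics.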
SPSS demo - Data Summarization: Categorical variable
Frequencies/percentages: Analyze -> Frequencies -> Select variables (sex, grp, shades, ped) -> OK
sex
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  f       30          50.0      50.0            50.0
       m       30          50.0      50.0            100.0
       Total   60          100.0     100.0
grp
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  1       20          33.3      33.3            33.3
       2       20          33.3      33.3            66.7
       3       20          33.3      33.3            100.0
       Total   60          100.0     100.0
SPSS demo - Data Summarization: Categorical variable
Shades
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  1       30          50.0      50.0            50.0
       2       30          50.0      50.0            100.0
       Total   60          100.0     100.0
Ped
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  1       30          50.0      50.0            50.0
       2       30          50.0      50.0            100.0
       Total   60          100.0     100.0
SPSS demo - Bar Chart
Analyze -> Frequencies -> Select variables (sex, grp, shades, ped), then
select the Charts option -> Select chart type (bar chart, pie chart, histogram) and
select percentages or frequencies -> Continue -> OK
Or
Graphs -> Bar -> Select type (simple, clustered, stacked) -> Define
-> Select what the bars represent (N of cases, % of cases) -> Select the variable for
the category axis (e.g. grp) and click Titles to write titles -> Continue ->
OK
SPSS demo - Bar Chart
grp
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  1       20          33.3      33.3            33.3
       2       20          33.3      33.3            66.7
       3       20          33.3      33.3            100.0
       Total   60          100.0     100.0
SPSS demo - Data Summarization: Numerical variable
Analyze -> Descriptive Statistics -> Descriptives -> Select
variable(s) (e.g. age, hgt) and click the button to
transfer the variable(s) to the other window, then select
Options -> Continue -> OK
Or
Analyze -> Compare Means -> Means -> Select variable(s)
for the dependent (age, hgt) and independent (grp, sex) lists, and
then select Options -> Continue -> OK
SPSS demo - Data Summarization: Numerical variable
Descriptive Statistics
                     N    Minimum   Maximum   Mean      Std. Deviation
age                  60   48.00     143.00    90.4167   30.22979
hgt                  60   38.99     58.09     47.9931   5.77357
Valid N (listwise)   60
Report
grp     Statistic        age        hgt
1       Mean             86.6500    47.4651
        N                20         20
        Std. Deviation   32.53545   6.07346
        Median           69.0000    44.9115
        Minimum          53.00      40.03
        Maximum          142.00     58.09
2       Mean             95.8000    48.7766
        N                20         20
        Std. Deviation   28.00113   5.30175
        Median           93.0000    48.6208
        Minimum          48.00      40.30
        Maximum          143.00     57.16
3       Mean             88.8000    47.7377
        N                20         20
        Std. Deviation   30.77183   6.12433
        Median           81.5000    46.9664
        Minimum          51.00      38.99
        Maximum          138.00     57.68
Total   Mean             90.4167    47.9931
        N                60         60
        Std. Deviation   30.22979   5.77357
        Median           84.0000    46.6863
        Minimum          48.00      38.99
        Maximum          143.00     58.09
SPSS demo – Boxplots
Graph -> Boxplots -> Simple -> Define -> Select variables (e.g.
PLUC_pre) and category axis (e.g. grp) -> OK
MS Excel demo: Summary Statistics - Categorical Variable
Frequency: Type bins -> Insert -> Function -> Statistical -> Frequency -> Select ranges for data (grp) and bins -> place the cursor to the left of the equal sign and then press Ctrl, Shift, and Enter simultaneously.
Pie Chart: Select Frequency -> Chart -> Pie -> Series: write category labels (1, 2, 3) -> Next -> click Titles and write the title, click Data Labels and select Show percent, then click Next.
Bin   Freq
1     20
2     20
3     20
Figure: Pie Chart: Patients by treatment group (groups 1, 2, 3)
MS Excel demo: Summary Statistics - Numerical Variable
age
Mean 90.41666667
Standard Error 3.902649518
Median 84
Mode 84
Standard Deviation 30.22979318
Sample Variance 913.8403955
Kurtosis -1.183899591
Skewness 0.389872725
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
Questions