Lecture 3 – Data Summary Measures and Graphical
Display of Results
Univariate Data –
Analysis of one variable at a time
Why Think About/Explore Data?
• Done to accomplish:
– Checking for data entry errors
– Describing demographic and study characteristics
– Examining distributions of outcomes
  • Central tendency
  • Variability
– Checking for outliers
– Checking assumptions for subsequent analyses
– Giving a picture of your sample
• To choose appropriate statistics, first establish the measurement level of the outcome(s) and predictor(s).
Dependent variable = outcome
Independent variable = predictor
Types of Data
Nominal – Qualitative Data (summarize with %’s)
Measured in unordered categories
Examples: Race, Blood Type, Dead/Alive, Gender, On Dialysis/Not on Dialysis
Ordinal – Qualitative Data (summarize with %’s)
Measured in ordered categories
Examples: Cancer Stages, Socio-economic Status (low, med, hi), Likert (unlikely, somewhat unlikely, neutral, likely, very likely)
Continuous – Quantitative Data (summarize with many summary measures)
Measured on a continuum
Examples: Serum Creatinine, Height/Weight/BMI, Systolic Blood Pressure, Diastolic Blood Pressure, Others???
Continuous (Numerical): Measures of Location
Mean
– Arithmetic average
– Sum of values / number of values
– Nice mathematical/statistical properties
Median (a.k.a. 50th percentile)
– Value where half the sample is above, half the sample is below
– Better measure for skewed data; robust to extreme values
Mode
– Most frequently occurring value in sample
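As a minimal sketch of the three location measures, the snippet below uses Python's standard `statistics` module. The BMI values are hypothetical, invented only for illustration; they are not the lecture's data.

```python
import statistics

# Hypothetical BMI values (not from the lecture), with one extreme value.
bmi = [24.1, 27.5, 29.0, 29.0, 31.8, 33.4, 48.9]

mean_bmi = statistics.mean(bmi)      # sum of values / number of values
median_bmi = statistics.median(bmi)  # middle value of the sorted sample
mode_bmi = statistics.mode(bmi)      # most frequently occurring value

print(f"mean={mean_bmi:.2f} median={median_bmi} mode={mode_bmi}")
```

The single extreme value pulls the mean above the median, which is why the median is described as robust.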
Continuous (Numerical)
[Figure: NORMAL DISTRIBUTION]
Measures of Variability
• Range = (maximum - minimum)
• Interquartile range = (Q3 – Q1) = (75th - 25th percentile); always covers the middle half of the sample
• Variance = average of the squares of the deviations of the observations from their mean (using n - 1 in the denominator for a sample):
  $\mathrm{var} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Standard deviation = $\sqrt{\mathrm{Variance}}$
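The variability measures above can be sketched with the standard library. The data are hypothetical. Quartiles depend on the interpolation convention; `statistics.quantiles` uses its default "exclusive" method here, which is an assumption and may differ slightly from other software.

```python
import statistics

# Hypothetical sample, invented for illustration.
values = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0]

data_range = max(values) - min(values)

# Quartiles (Q1, median, Q3) under the default "exclusive" convention.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Sample variance with n - 1 in the denominator, and its square root.
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

print(data_range, iqr, variance, round(std_dev, 3))
```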
Continuous (Numerical): NORMAL DISTRIBUTION
http://www.stattucino.com/berrie/dsl/index.html
Describing Data using Numerical Summaries
Descriptive statistics:
Explore data in order to describe their main features
Get an initial picture of data sample
Let’s Talk Data…
Categorical

Gender     N      %
Female     6163   61.6%
Male       3837   38.4%

Dialysis   N      %
No         8093   80.9%
Yes        1907   19.1%
[Bar charts: Gender (Female, Male) and Dialysis (No, Yes); y-axis is percent of sample]
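Categorical variables are summarized with counts and percentages. A minimal sketch, using the Dialysis counts from the table above (the function name `percentages` is just an illustrative helper):

```python
# Counts taken from the Dialysis table in this lecture.
dialysis_counts = {"No": 8093, "Yes": 1907}

def percentages(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}

dialysis_pct = percentages(dialysis_counts)
for category, pct in dialysis_pct.items():
    print(f"{category}: {pct:.1f}%")
```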
Categorical

Race       N      %
Black      1942   19.4%
Hispanic   723    7.2%
Other      1068   10.7%
White      6267   62.7%

Education          N      %
Elementary         1491   14.9%
High School Grad   2640   26.4%
College Grad       3246   32.5%
Post Graduate      2616   26.2%
[Bar charts: Education (Elementary, High School Grad, College Grad, Post Graduate) and Race/Ethnicity (Black, Hispanic, Other, White); y-axis is percent of sample]
Continuous

Measure           BMI
Mean              32.2
Std Dev           5.46
Median            31.8
Minimum           16.0
Maximum           50.7
25th Percentile   28.2
75th Percentile   35.9
Mode              29.0

N = 115

Measure           BMI
Mean              32.0
Std Dev           5.34
Median            31.2
Minimum           21.8
Maximum           44.5
25th Percentile   28.5
75th Percentile   34.8
Mode              .
[Histograms: three panels, y-axis "Fraction"]
BMI: Mean 32.2, Std 5.4, Median 31.8
Second panel: Mean 136.3, Std 17.1, Median 135
Third panel: Mean 189.77, Std 148.9, Median 154.11
Shape of a distribution
• Symmetric
• Skewed to the right: Mean greater than Median (positively skewed)
• Skewed to the left: Mean less than Median (negatively skewed)
Examples:
Mean: 136.3, Std: 17.1, Median: 135, Skewness: 0.38
Mean: 189.77, Std: 148.9, Median: 154.11, Skewness: 5.63
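One common way to put a number on skewness is the moment coefficient $m_3 / m_2^{3/2}$. The sketch below assumes that definition; the lecture does not say which formula produced the values above, and statistical packages differ in small-sample adjustments. The data are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples, invented for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skewed", right_skewed), ("symmetric", symmetric)]:
    print(name, statistics.fmean(data), statistics.median(data),
          round(sample_skewness(data), 2))
```

In the right-skewed sample the mean exceeds the median and the coefficient is positive, in line with the rule on the slide.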
NORMAL DISTRIBUTION
Normal Distribution – Has Excellent Statistical Properties
Many statistical techniques assume normally distributed data
If the data do not have a Normal distribution, consider alternative techniques appropriate for the data
Box (and Whisker) Plots
• A graph of the 5 number summary with suspected outliers plotted individually
• 5 number summary: Min, Q1, Median, Q3, Max
• A line somewhere inside the box marks the Median
• IQR = Q3 – Q1
• Cases more than 1.5*IQR below Q1 or above Q3 are plotted individually (possible outliers)
• Lines from the box extend to the smallest and largest values that are not more than 1.5*IQR from the box
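The box plot rules above can be sketched as a small function. This is a minimal illustration with hypothetical data; the quartile convention (the default of `statistics.quantiles`) is an assumption, and plotting software may compute quartiles slightly differently.

```python
import statistics

def box_plot_summary(values):
    """Return the pieces a box-and-whisker plot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme values still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical sample with one extreme value.
summary = box_plot_summary([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0])
print(summary)
```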
[Figure: annotated box plot marking the median, mean (+), 25th percentile, 75th percentile, 1.5 x IQR whisker limit, and an outlier; example panels: Skewed to the right, Symmetric, Skewed to the left]
Normal Probability Plot
• Plot that can help assess normality.
• Idea: plot the observed levels of the variable against the expected levels corresponding to a Normal distribution.
• If data lie in a reasonably straight diagonal line, then assumption of Normality is reasonable.
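A minimal sketch of how the points of a normal probability plot can be computed, pairing each sorted observation with the value expected under a fitted Normal distribution. The plotting positions (i - 0.5)/n are one common convention and are an assumption here; other software uses slightly different positions. The data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_plot_points(values):
    """Pair each sorted observation with its expected value under Normality."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Expected value for the i-th smallest observation (assumed convention).
    expected = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, data))

# Hypothetical sample, invented for illustration.
points = normal_plot_points([1.0, 2.0, 3.0, 4.0, 5.0])
for expected_value, observed_value in points:
    print(f"expected={expected_value:.2f} observed={observed_value}")
```

Plotting observed against expected and checking for a roughly straight diagonal line is the visual check described above.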
Normal Probability Plots
[Figure: normal probability plots for BMI and Triglycerides]
Error Bar Plots
Circle denotes the mean and the bars denote the standard deviation (in this case).
Part II – Measures of Association
(plus a little more)
Measures of Association
• Continuous Variables
– Correlation
– Agreement (reliability)
• Categorical Variables
– Two-way layout (2×2 tables)
– “Risk” measures
– Agreement
– Others
Two Continuous Variables
Correlation
– General sense: the relationship between two variables (quantitative or qualitative)
– Narrow (statistical) sense: measure of interdependence between two continuous random variables
• The degree to which increases or decreases in Y occur with increases or decreases in X
• Values range between -1 (perfect discordance) and 1 (perfect concordance)
• A value of 0 indicates no association
Pearson Correlation
Purpose - measures linear association between two continuous variables X and Y
Data
Subject #   X    Y
1           x1   y1
2           x2   y2
.           .    .
.           .    .
.           .    .
n           xn   yn
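Pearson Correlation
The Pearson (product-moment) correlation coefficient can be calculated for 2 continuous variables in a sample (regardless of distribution) using the formula:
$\hat{\rho} = r = r_{xy} = \dfrac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2 \, \sum_{i=1}^{N}(Y_i - \bar{Y})^2}}$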
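The product-moment formula translates directly into code. The sketch below is a plain-Python illustration (the function name `pearson_r` and the sample values are hypothetical), not a replacement for a statistics package.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds).
height_in = [60.0, 63.0, 66.0, 70.0, 74.0]
weight_lb = [115.0, 140.0, 135.0, 170.0, 190.0]

r_imperial = pearson_r(height_in, weight_lb)
# Changing units (cm, kg) rescales both variables but leaves r unchanged.
r_metric = pearson_r([h * 2.54 for h in height_in],
                     [w * 0.4536 for w in weight_lb])
print(round(r_imperial, 3), round(r_metric, 3))
```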
Correlation Figures
[Figure: five scatter plots of Y against X]
A. No relationship (ρ = 0)
B. Perfect positive relationship (ρ = 1)
C. Perfect negative relationship (ρ = -1)
D. Moderate positive relationship (ρ = 0.5)
E. Strong negative relationship (ρ = -0.8)
Correlation Inference
• Easy “large sample” test for H0: ρ=0
For n ≥ 25, compute
$z = \frac{1}{2}\log_e\left(\frac{1+\hat{\rho}}{1-\hat{\rho}}\right)$
which has an approximate N(0, 1/(n-3)) distribution under H0
• This test assumes X,Y ~ NBiv(μX, μY, σX², σY², ρ)
Many times a tenuous assumption!
• Beware positive skewness & outliers
• Beware data not truly continuous
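A minimal sketch of the large-sample test: transform the sample correlation, divide by the standard error $1/\sqrt{n-3}$, and refer the result to a standard Normal distribution. The two-sided p-value and the example inputs (r = 0.5, n = 28) are illustrative assumptions, not values from the lecture.

```python
import math
from statistics import NormalDist

def fisher_z_test(r, n):
    """Large-sample test of H0: rho = 0 using the log transformation of r."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # transformed correlation
    standard_error = 1 / math.sqrt(n - 3)   # sqrt of the variance 1/(n-3)
    statistic = z / standard_error          # approximately N(0, 1) under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return z, statistic, p_value

# Hypothetical inputs: a sample correlation of 0.5 from 28 subjects.
z, statistic, p_value = fisher_z_test(0.5, 28)
print(round(z, 4), round(statistic, 4), round(p_value, 4))
```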
Timeout: ASSUMPTIONS
• As with any mathematical or physical model, model assumptions are critical to making the correct inference
• Dealing with assumptions has led to development of:
– Nonparametric statistics: techniques that reduce or eliminate dependence on the underlying distribution of the data
– Robust statistics: techniques that are affected little by departures from assumptions
Correlation (resumed)
• A nonparametric version of the correlation coefficient: Spearman’s Rank Correlation
• Like ρ, rs:
– ranges from -1 to 1
– 0 no correlation, 1 perfect agreement
– only requires ordinal data
$r_s = 1 - \dfrac{6\sum_{i=1}^{n}\left[R(X_i) - R(Y_i)\right]^2}{n(n^2 - 1)}$
where R( ) is the rank of the variable
Correlation Example: SBP and DBP
SBP DBP R(SBP) R(DBP)
141.8 89.7 12 14
140.2 74.4 8.5 1
131.8 83.5 3 4
132.5 77.8 4 2
135.7 85.8 7 7
141.2 86.5 11 10
143.9 89.4 14 13
140.2 89.3 8.5 12
140.8 88.0 10 11
131.7 82.2 2 3
130.8 84.6 1 6
135.6 84.4 6 5
143.6 86.3 13 9
133.2 85.9 5 8
Correlation Example: SBP and DBP
[Figure: scatter plot of SBP (125 to 145) against DBP (70 to 90)]
• All Data: ρ = 0.42; rs = 0.71
• Outlier deleted: ρ = 0.75; rs = 0.82
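The rank correlation for all 14 subjects can be reproduced from the SBP and DBP values in the example table. This sketch assigns average ranks to ties (as the table does for the two SBP values of 140.2) and applies the rank-difference formula; the helper names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# SBP and DBP values from the example table in this lecture.
sbp = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
       140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
dbp = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
       89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

rs = spearman_rs(sbp, dbp)
print(round(rs, 2))  # 0.71, the all-data rs reported above
```

With ties present the rank-difference formula is an approximation; computing the Pearson correlation of the ranks is the usual alternative and can differ slightly.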
Questions –
1. Can we calculate a correlation coefficient between the incomes of a group of people and what city they live in?
Correlation Coefficient
No, we cannot, since city is a categorical variable. Correlation requires that both variables be quantitative.
Questions –
2. Does it change the correlation between height and weight if we measure height in inches rather than centimeters and weight in pounds rather than kilograms?
Correlation Coefficient
No. Because ρ (and r) use the standardized values of the observations, ρ does not change when we change the units of measurement of x, y, or both. The correlation ρ itself has no unit of measure; it is just a number.
Question –
3. Does ρ = 0 mean there is no relationship between X and Y?
Correlation Coefficient
Not necessarily. Correlation only measures the strength of the linear relationship between two variables. Correlation does not describe nonlinear relationships between two variables, no matter how strong they are.
[Figure: scatter plot of y against x]
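Correlation and Regression
[Figure: scatter plots of Y against X with fitted lines: moderate positive relationship (ρ = 0.5) and strong negative relationship (ρ = -0.8)]
For the regression line Y = α + βX:
$\hat{\beta} = \hat{\rho}\,\dfrac{\hat{\sigma}_Y}{\hat{\sigma}_X} = \hat{\rho}\,\sqrt{\dfrac{\sum_i (Y_i - \bar{Y})^2 / (n-1)}{\sum_i (X_i - \bar{X})^2 / (n-1)}}$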
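Correlation and Regression
[Figure: scatter plot of SBP (125 to 145) against DBP (70 to 90) with fitted regression lines]
SBP = 40.1 + 1.12×DBP
DBP = 16.3 + 0.51×SBP
SBP and DBP example (continued)
σSBP = 4.9 (mmHg)
σDBP = 3.3 (mmHg)
ρ = 0.75
Slope of SBP on DBP: $0.75 \times \frac{4.9}{3.3}$
Slope of DBP on SBP: $0.75 \times \frac{3.3}{4.9}$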
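The slope arithmetic can be checked in a couple of lines. The inputs are the rounded values quoted in the example (ρ = 0.75, σSBP = 4.9, σDBP = 3.3), so the results agree with the fitted slopes only up to rounding; the function name is illustrative.

```python
def slope_from_correlation(r, sd_outcome, sd_predictor):
    """Regression slope of outcome on predictor: r * sd(outcome) / sd(predictor)."""
    return r * sd_outcome / sd_predictor

r, sd_sbp, sd_dbp = 0.75, 4.9, 3.3  # rounded values from the example

slope_sbp_on_dbp = slope_from_correlation(r, sd_sbp, sd_dbp)
slope_dbp_on_sbp = slope_from_correlation(r, sd_dbp, sd_sbp)

print(round(slope_sbp_on_dbp, 2), round(slope_dbp_on_sbp, 2))
```

The product of the two slopes equals r squared, a quick check that both regressions describe the same correlation.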
Correlation and Covariance
• Suppose two random variables, X and Y:
E(X) = μX, V(X) = σX²; E(Y) = μY, V(Y) = σY²; and Corr(X,Y) = ρ
• Define Cov(X,Y) = E[(X-μX)(Y-μY)]
Note: Cov(X,X) = E[(X-μX)(X-μX)] = E(X-μX)² = σX²
• Population correlation (ρ) is defined as:
$\rho = \dfrac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \dfrac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$
• Thus Cov(X,Y) = ρσXσY
Correlation and Covariance
What’s the big deal about covariance?
Use it to find the variance of functions of random variables, e.g.:
$V(X+Y) = \sigma_X^2 + \sigma_Y^2 + 2\,\mathrm{Cov}(X,Y)$
$V(X-Y) = \sigma_X^2 + \sigma_Y^2 - 2\,\mathrm{Cov}(X,Y)$
In general:
$V(aX+bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{Cov}(X,Y)$
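The same identity holds exactly for sample variances and covariances, which gives a quick numerical check. The data and coefficients below are hypothetical, and `sample_cov` is an illustrative helper using the n - 1 denominator.

```python
def sample_cov(x, y):
    """Sample covariance with n - 1 in the denominator; sample_cov(x, x) is the variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical paired data and coefficients.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
a, b = 2.0, -3.0

combo = [a * xi + b * yi for xi, yi in zip(x, y)]
direct = sample_cov(combo, combo)  # variance of aX + bY computed directly
via_formula = (a ** 2 * sample_cov(x, x) + b ** 2 * sample_cov(y, y)
               + 2 * a * b * sample_cov(x, y))

print(direct, via_formula)
```

Setting a = 1 and b = -1 reproduces the V(X-Y) case, where the covariance term enters with a minus sign.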
Correlation as Agreement
SBP1 SBP2
141.8 139.7
140.2 144.4
131.8 133.5
132.5 127.8
135.7 135.8
141.2 146.5
143.9 139.4
140.2 139.3
140.8 138.0
131.7 132.2
130.8 134.6
135.6 134.4
143.6 146.3
133.2 135.9
Suppose two nurses are measuring SBP in the same patients and each nurse measures SBP 3 times in each patient.
Correlation as Agreement
• Could use Pearson correlation
• Another measure, intraclass correlation
– Can separate the variance into two sources: between-subject and within-subject
– The intraclass correlation is the ratio of the between-subject variance to the total variance (i.e., within + between)
– By definition, intraclass correlation ranges from 0 to 1
– Best measure of the “individual” touch
• In SBP example:
ρ(Pearson) = 0.809 ρ(Intraclass) = 0.814
Things to Remember About Correlation
• 5 warnings (adapted from Huck):
1. Does not speak to cause-and-effect
2. Beware outliers
3. Assumes linear relationship
4. Correlation vs. Independence: zero correlation implies independence only when X and Y are jointly Normal
5. Strength of relationship with respect to trend
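Categorical Outcomes: Two-way Tables
• Prospective Design: Relative Risk (RR)
$RR = \dfrac{P(\text{Disease in Exposed Group})}{P(\text{Disease in Unexposed Group})} = \dfrac{P(D|E)}{P(D|\bar{E})}$
• Retrospective Design: Odds Ratio (OR)
$OR = \dfrac{P(\text{Exposure in Cases}) / [1 - P(\text{Exposure in Cases})]}{P(\text{Exposure in Controls}) / [1 - P(\text{Exposure in Controls})]} = \dfrac{P(E|D)/P(\bar{E}|D)}{P(E|\bar{D})/P(\bar{E}|\bar{D})} = \dfrac{P(E|D)\,P(\bar{E}|\bar{D})}{P(\bar{E}|D)\,P(E|\bar{D})}$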
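Two-way Tables
                   Disease
                   Yes    No
Exposure   Yes     a      b      a+b
           No      c      d      c+d
                   a+c    b+d    n=a+b+c+d

Prospective
$P(D|E) = a/(a+b)$
$P(D|\bar{E}) = c/(c+d)$
$RR = \dfrac{a/(a+b)}{c/(c+d)} = \dfrac{ac+ad}{ac+bc}$

Retrospective
$P(E|D) = a/(a+c)$
$P(E|\bar{D}) = b/(b+d)$
$OR_E = \dfrac{\frac{a}{a+c}\cdot\frac{d}{b+d}}{\frac{c}{a+c}\cdot\frac{b}{b+d}} = \dfrac{ad}{bc}$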
Two-way Tables
• Prospective design and relative risk (RR) are optimal
• Retrospective designs and odds ratio (OR) are easiest (cheapest)
• Can compute OR for prospective design
$OR_D = \dfrac{\frac{a}{a+b}\cdot\frac{d}{c+d}}{\frac{b}{a+b}\cdot\frac{c}{c+d}} = \dfrac{ad}{bc}$
Two-way Table
• Why we like the odds ratio…
The exposure odds ratio is equivalent to the disease odds ratio!
• Regardless of study design (i.e., which margin is fixed) the estimate of the OR is the same
$OR_D = \dfrac{\frac{a}{a+b}\cdot\frac{d}{c+d}}{\frac{b}{a+b}\cdot\frac{c}{c+d}} = \dfrac{ad}{bc} = \dfrac{\frac{a}{a+c}\cdot\frac{d}{b+d}}{\frac{b}{b+d}\cdot\frac{c}{a+c}} = OR_E$
Two-way Tables
                Cancer
                Yes    No
Smoke   Yes     35     25     60
        No      5      35     40
                40     60     100

$RR = \dfrac{35/(35+25)}{5/(5+35)} = 4.7$
$OR = \dfrac{35 \times 35}{25 \times 5} = 9.8$
Two-way Table
Why we like the odds ratio – Part II
• For retrospective design, if…
– Cases are representative of the population of all cases
– Controls are representative of the population of all controls
– The disease is “rare” (i.e., prevalence <20%)
Then OR ≈ RR
Two-way Tables
                Cancer
                Yes    No
Smoke   Yes     75     325    400
        No      25     575    600
                100    900    1000

$RR = \dfrac{75/(75+325)}{25/(25+575)} = 4.5$
$OR = \dfrac{75 \times 575}{325 \times 25} = 5.3$
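Both smoking and cancer tables can be run through the a, b, c, d formulas with a short sketch. The function names are illustrative; the counts are the ones shown in the two tables.

```python
def relative_risk(a, b, c, d):
    """RR = P(D|E) / P(D|not E) for a 2x2 table with exposure in rows."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = ad / bc; the same whichever margin is fixed by the design."""
    return (a * d) / (b * c)

# (a, b, c, d) = (exposed cases, exposed non-cases,
#                 unexposed cases, unexposed non-cases)
common_disease = (35, 25, 5, 35)    # first table: 40 of 100 have cancer
rarer_disease = (75, 325, 25, 575)  # second table: 100 of 1000 have cancer

for label, table in [("common", common_disease), ("rarer", rarer_disease)]:
    print(label, round(relative_risk(*table), 1), round(odds_ratio(*table), 1))
```

The gap between OR and RR narrows from 9.8 versus 4.7 to 5.3 versus 4.5 as the disease becomes less common, which is the "rare disease" approximation at work.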
Other Measures From Clinical Trials
                           Outcome
                           Yes    No
Treatment   Experimental   15     135    150
            Control        100    150    250
                           115    285    400

P(O|E) = 15/150 = 0.1
P(O|C) = 100/250 = 0.4
RR = P(O|E)/P(O|C) = 0.25
• Absolute Risk Reduction (ARR) = P(O|C) - P(O|E) = 0.3
• Relative Risk Reduction (RRR) = 1 – RR = 0.75
• Number Needed to Treat (NNT) = 1/ARR = 3.33 (number needed to treat in the population to prevent 1 outcome event)
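The trial measures follow from the two event proportions. A minimal sketch using the counts in the table above; the function name and dictionary keys are illustrative.

```python
def trial_measures(events_exp, n_exp, events_ctl, n_ctl):
    """Risk-based summaries for a two-arm trial with a binary outcome."""
    risk_exp = events_exp / n_exp   # P(O|E)
    risk_ctl = events_ctl / n_ctl   # P(O|C)
    rr = risk_exp / risk_ctl
    arr = risk_ctl - risk_exp       # absolute risk reduction
    return {"RR": rr, "ARR": arr, "RRR": 1 - rr, "NNT": 1 / arr}

# Counts from the clinical trial table in this lecture.
measures = trial_measures(15, 150, 100, 250)
for name, value in measures.items():
    print(f"{name} = {value:.2f}")
```

In practice NNT is often rounded up to the next whole patient (4 here), since a fraction of a patient cannot be treated.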
Things to Remember About Measures of Association
1. Beware: some sources use “odds ratio” and “relative risk” interchangeably
– The OR is further from 1 than the RR, so reading an OR as an RR overstates the association (only slightly when the disease is rare)
2. Be on guard when considering ARR, RRR, and NNT
– Almost never see a SE or CI estimate
– Should be based on large, well planned, prospective studies
Categorical Measures of Agreement
• The “kappa” coefficient or κ
• Example: two physicians diagnosing a disease
                          DOCTOR B
                          Disease   No Disease
DOCTOR A   Disease        pa        pb           pA
           No Disease     pc        pd           qA
                          pB        qB           1

Here pa, pb, pc, pd are the proportions of subjects, not the number of subjects.
$\hat{\kappa} = \dfrac{2(p_a p_d - p_b p_c)}{p_A q_B + p_B q_A}$
Kappa Example
                             Psychiatrist B
                             Neurosis   Normal
Psychiatrist A   Neurosis    0.04       0.06     0.10
                 Normal      0.01       0.89     0.90
                             0.05       0.95     1.00

$\hat{\kappa} = \dfrac{2(0.04 \times 0.89 - 0.06 \times 0.01)}{0.10 \times 0.95 + 0.05 \times 0.90} = 0.50$
• Kappa is a categorical analog of the intraclass correlation
• Kappa can be computed for any “square” (k×k) tables
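The psychiatrist example can be checked numerically. The sketch below computes kappa for a 2×2 table of proportions in two ways: with the 2×2 formula used in this lecture, and with the general "observed versus chance agreement" form, which is the one that extends to k×k tables. Function names are illustrative.

```python
def kappa_2x2(pa, pb, pc, pd):
    """Kappa from cell proportions via 2(pa*pd - pb*pc) / (pA*qB + pB*qA)."""
    p_A, q_A = pa + pb, pc + pd   # rater A's marginal proportions
    p_B, q_B = pa + pc, pb + pd   # rater B's marginal proportions
    return 2 * (pa * pd - pb * pc) / (p_A * q_B + p_B * q_A)

def kappa_observed_vs_chance(pa, pb, pc, pd):
    """Equivalent form: (observed agreement - chance agreement) / (1 - chance)."""
    observed = pa + pd
    chance = (pa + pb) * (pa + pc) + (pc + pd) * (pb + pd)
    return (observed - chance) / (1 - chance)

# Cell proportions from the psychiatrist example in this lecture.
cells = (0.04, 0.06, 0.01, 0.89)
print(round(kappa_2x2(*cells), 2), round(kappa_observed_vs_chance(*cells), 2))
```

Raw agreement here is 93%, yet kappa is only 0.50, because two raters who both call nearly everyone "Normal" would agree 86% of the time by chance alone.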
Schedule