Conversing with Data
• Statistical Analysis
– Exploratory/Confirmatory
• Formulating the questions
– Regression Prediction as Formulation
• Refining the questions
• Comparing
– Testing
– Predicting
Diploma in Statistics 2013 Introduction to Regression Week 6
Hypothesis Testing - if you must
• Scientific Hypothesis
– Proof/Disproof
• Thesis/Antithesis
• Devil’s advocate
• Specific (default Null) and Alternative Hypotheses, H0 and HA
– Test Statistic
• p values
• Critical values
• Stat Significance
• Reject/Fail to reject/Accept
http://en.wikipedia.org/wiki/Statistical_significance_test
PEmax revisited
The regression equation is
PEmax = 62.1 - 12.5 Sex + 3.77 Age - 0.013 FRC
Predictor Coef SE Coef T P
Constant 62.13 42.84 1.45 0.162
Sex -12.54 11.26 -1.11 0.278
Age 3.771 1.441 2.62 0.016
FRC -0.0134 0.1673 -0.08 0.937
S = 27.4069 R-Sq = 41.2%
Analysis of Variance
Source DF SS MS F P
Regression 3 11058.8 3686.3 4.91 0.010
Residual Error 21 15773.9 751.1
Total 24 26832.6
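The headline quantities in this output can be recovered by hand from the ANOVA table alone. A minimal sketch (pure arithmetic on the printed sums of squares; no data needed, and not part of the original MINITAB material):

```python
# Recompute R-Sq, F and S from the PEmax ANOVA table above.
ss_reg, df_reg = 11058.8, 3
ss_err, df_err = 15773.9, 21
ss_tot = ss_reg + ss_err            # ~26832.7, matching the printed Total (rounding)

r_sq = ss_reg / ss_tot              # proportion of variation 'explained'
f_stat = (ss_reg / df_reg) / (ss_err / df_err)
s = (ss_err / df_err) ** 0.5        # residual standard deviation S

print(round(100 * r_sq, 1), round(f_stat, 2), round(s, 4))
```

This reproduces R-Sq = 41.2%, F = 4.91 and S = 27.4069 from the output above.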
p-values - Probability of ??? (the data? the coeff/diff/ratio?) p < 0.05
• All probability depends on 'info'
– Requires a precise statement of the 'event'
– Care with Pr(A and B) = Pr(A)Pr(B): valid only under independence
• Under the specific (default Null) hypothesis, the probability is < 0.05 of observing 'data like this' as or more extreme than 'this data set'
http://en.wikipedia.org/wiki/Statistical_significance_test
and technical assumptions
95% Confidence Intervals
• Informal
– Margin for error associated with this ‘coeff’
– Computed for ‘data like this’
– Values in this interval are ‘statistically consistent with the data’
• Formal
– List of the specific hypotheses that would not be rejected by the statistical test procedure at the 5% significance level
• NOT 'the probability is 95% that the true value is in this interval'
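The informal version can be computed directly from the printed coefficient table. A sketch for the Age coefficient in the PEmax fit, taking the critical value 2.080 from t-tables (21 residual df); this is an illustration, not part of the original output:

```python
# 95% CI for the Age coefficient: coef ± t_crit * SE
coef, se = 3.771, 1.441       # Age row of the PEmax output
t_crit = 2.080                # t(0.975, df=21), from t-tables
lo, hi = coef - t_crit * se, coef + t_crit * se
print(round(lo, 2), round(hi, 2))  # interval excludes 0: consistent with p = 0.016
```

Values inside (lo, hi) are the slopes 'statistically consistent with the data'; zero is not among them.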
Role of t-ratios

t_k = β̂_k / SE(β̂_k)

Informally, if t_k is not large (> 2 in magnitude; p ≈ 0.05), then the coefficient of x_k can be given a value of zero - equivalently x_k can be dropped from the model - with little appreciable impact.

Caution: applies one variable at a time.
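The rule of thumb can be checked against the PEmax coefficient table directly; a minimal sketch (numbers taken from the printed output):

```python
# t_k = coef_k / SE(coef_k); |t| > 2 roughly corresponds to p < 0.05.
rows = {"Sex": (-12.54, 11.26), "Age": (3.771, 1.441), "FRC": (-0.0134, 0.1673)}
for name, (coef, se) in rows.items():
    t_ratio = coef / se                     # matches the printed T column
    verdict = "keep" if abs(t_ratio) > 2 else "candidate to drop (one at a time!)"
    print(f"{name}: t = {t_ratio:.2f} -> {verdict}")
```

Only Age clears the threshold; Sex and FRC are candidates to drop, but only one variable at a time, since the t-ratios change when the model changes.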
ANOVA tests: Composite
• F – in the simplest case, = sum of squared t-ratios
• Null Hypothesis
– No ‘systematic variation’
– All variation ‘random’
– All slope coeffs = 0
• Alternative Hypothesis
– Some ‘systematic variation’
– At least one coeff is non-zero
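In the simplest case (one x-variable) the link is exact: the ANOVA F equals the squared t-ratio. A sketch on a small made-up data set (hypothetical numbers, used only to show the identity):

```python
# Simple regression by closed-form least squares; then verify F = t^2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                       # slope
b0 = ybar - b1 * xbar                # intercept
fits = [b0 + b1 * xi for xi in x]
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, fits))
ss_reg = sum((fi - ybar) ** 2 for fi in fits)
mse = ss_err / (n - 2)
t_ratio = b1 / (mse / sxx) ** 0.5    # slope t-ratio
f_stat = ss_reg / mse                # ANOVA F (1 regression df)
print(round(f_stat, 3), round(t_ratio ** 2, 3))  # identical
```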
As or more extreme
My view
• Data a (small?) part of the info available
– Prob – given data and theory
– What data not in study?
• Beware "Fooled by Randomness" – the narrative fallacy (Taleb)
– Rely on theory - including • Scientific common sense
• Philosophy of science - proof/disproof
• Use of p-values for screening ‘false pos/neg’
• Be prepared to struggle with editors/supervisors/peers
'What you see is all there is' (WYSIATI) – Kahneman
Regression Model
• Some variation in Y explicable by variation in x-vars
– by some weighted average of x-vars
– x-vars act together 'without interaction'
• Inexplicable variation (residuals) is so subtle that
– it is not worth the hassle of pursuit
– we might as well regard it as unpredictable/random, with common magnitude
• If so, rules for random variation
– can help provide some guidelines on big/small
Linear Model Theory

Classic Linear Model:
Y_i = β_0 + β_1 x_1i + β_2 x_2i + .... + β_p x_pi + ε_i
where ε_i ~ N(0, σ² = Var) or N(0, σ = SD)

Statistical Theory:
– assumes the unpredictable ε_i has a Normal dist (technical)
– makes NO assumptions re the dist of x_1, x_2, ...
– makes the assumption of additivity
– makes the assumption that Var(Y) does not change with the x-vars (crucial)
Linearity - Weighted Sum

Linear Model: Y_i = β_0 + β_1 x_1i + β_2 x_2i + .... + β_p x_pi + ε_i
– a weighted sum of variables (coeffs are the weights)
– a weighted sum of coefficients (vars are the weights)

y = β_0 + β_1 x: straight line in (y, x); linear in β_0, β_1
y = β_0 + β_1 x + β_2 x²: quadratic relationship in (y, x); linear in β_0, β_1, β_2
log y = β_0 + β_1 log x: non-linear in (y, x); linear in β_0, β_1
log(y) = β_0 + β_1 log x_1 + β_2 log x_2: linear in β_0, β_1, β_2
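'Linear' here means linear in the coefficients, so a quadratic in x is still a linear model, fit by ordinary least squares on a design matrix with an x² column. A sketch with made-up noise-free data, purely to show the mechanics:

```python
# Fit y = b0 + b1*x + b2*x^2 by least squares: linear in (b0, b1, b2).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x - 0.5 * x**2                     # exact quadratic, for clarity
X = np.column_stack([np.ones_like(x), x, x**2])    # design matrix
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))                               # recovers [1, 2, -0.5]
```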
Multiple Linear Regression
• Simple weighted average
– Variables act ‘one-at-a-time’
– Simultaneous effect is additive
• Interaction
– X1 acting in combination with X2
– Simultaneous effect is non-additive
– Simplest non-additive is multiplicative
wiki/Interaction_(statistics)
A drug X might be desirable for treating a certain condition, but not if you are also taking drug Y: taken together, X and Y have a bad consequence from their combination – a bad drug "interaction".
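The simplest non-additive (multiplicative) interaction is fit by adding the product x1*x2 as an extra column. A sketch with made-up data where the interaction weight is exactly 2, so least squares recovers it:

```python
# Interaction term as a derived column in the design matrix.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(0, 1, 50)
y = 1 + x1 + x2 + 2 * x1 * x2                       # exact, no noise
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))                                # [1, 1, 1, 2]
```

The last weight is the interaction: the effect of x1 on y now changes with the level of x2.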
Derived x-variables
• No extra explanatory power in
– Linear transform of individual x-vars
– Weighted averages of individual x-vars
• Possibly extra explanatory power in
– Non-linear transforms
– Multiplicative combinations, also ratios
– Other non-linear combinations
• Possibly simpler model
Fishing?
Transforming for Simplicity
• Recall: Multiple Linear Regression
– General purpose tool – customise
• Use the language/imagery of users
– Are changes/differences expressed as scale free?
• Temperature, Date
• Twice, percent, proportion
• Automatically avoid howlers / satisfy constraints
– Predict negative values?
– Create prediction intervals with negative values?
– When to log transform?
CEO salaries
[Fitted line plot: Compensation = 1437416 + 61.57 Sales; S = 1367165, R-Sq = 13.7%, R-Sq(adj) = 13.5%; regression line with 95% PI]
[Fitted line plot: log10(Compensation) = 5.076 + 0.3086 log10(Sales); S = 0.274596, R-Sq = 24.0%, R-Sq(adj) = 23.9%; regression line with 95% PI]
[The same log-log fit, displayed with log scales on both axes]
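Predictions from the log10-scale fit must be back-transformed to the money scale. A sketch using the fitted equation above (the sales figure is an arbitrary illustration):

```python
# Back-transform a log10-scale prediction with 10**().
import math

def predict_compensation(sales):
    log_comp = 5.076 + 0.3086 * math.log10(sales)   # fitted line in the log scale
    return 10 ** log_comp                            # back to the money scale

print(round(predict_compensation(10000)))            # roughly 2 million
```

Note the coefficient 0.3086 now reads as an elasticity: a 1% increase in Sales goes with roughly a 0.31% increase in Compensation.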
Mammal Weights
[Fitted line plot: BrainW = 91.00 + 0.9665 BodyW; S = 334.720, R-Sq = 87.3%, R-Sq(adj) = 87.1%; regression line with 95% PI]
[Fitted line plot: log10(BrainW) = 0.9271 + 0.7517 log10(BodyW); S = 0.301528, R-Sq = 92.1%, R-Sq(adj) = 91.9%; regression line with 95% PI]
[The same log-log fit, displayed with log scales on both axes]
Random Variation Additive?
[Plot: Exp decay; random variation decreases with time]
[Plot: Exp decay on log scale; random variation constant in time]
Log scale on one or both axes
Non-linear Transformation
• Evidence for
– Curvature in the regression line
– Non-constant scatter (residuals) around the line
• Constant scatter
• Constant percentage scatter
• Constrained scatter
– Non-normal residuals
• Reasons for
– Simpler, more natural
– Technical
Residuals Unusual Observations
Lactic
Obs Acid Taste Fit SE Fit Residual St Resid
15 1.52 54.90 29.45 3.04 25.45 2.63R
In fact barely outlying, despite the 2.63. Recall: one residual has to be the largest!
Options with Large Residuals
• Examine carefully:
– Why outlying?
– Anything special about this case/obs?
– Refit without it
• does its removal change anything important?
• If deleted, then formally
– Conclusions are based on 'something like this never happening in future'
– Is this a meaningful statement?
Residuals, Standardized Residuals, Deleted T-residuals
Standardised Residual_i = (Residual_i from best line using all data) / SD(all residuals)

Better? = (Residual_i from best line using all data) / SD(all other residuals)

Deleted t-residual_i = (Residual_i from best line using all other data) / SD(all other residuals)
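A literal rendering of the three scalings on a toy simple-regression fit with one deliberately unusual point (made-up data; these informal versions omit the leverage adjustments that software such as MINITAB applies):

```python
# Compare: residual/SD(all), residual/SD(others), and the deleted residual.
import numpy as np

def fit_line(x, y):
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 12.0])  # last point unusual

b0, b1 = fit_line(x, y)
res = y - (b0 + b1 * x)

i = len(x) - 1                                  # examine the last observation
std_res = res[i] / res.std(ddof=1)              # vs SD of ALL residuals
others = np.delete(res, i)
better = res[i] / others.std(ddof=1)            # vs SD of the OTHER residuals

mask = np.arange(len(x)) != i                   # refit with the point deleted
b0d, b1d = fit_line(x[mask], y[mask])
del_res = y[i] - (b0d + b1d * x[i])             # residual from the other-data line
print(round(std_res, 2), round(better, 2), round(del_res, 2))
```

The point inflates the fit it is judged against: each successive version makes the unusual observation look progressively more extreme.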
Multiple Linear Regression Variants
• ANOVA and T-tests
• Best subsets
– For prediction – barely relevant
– For understanding – dangerous
• Auto-regression
– Time series; focus on prediction (and prediction intervals)
• Weighted regression (WLS)
– Y variables not equally 'precise'
• Correlated residuals (GLS)
Weighted Least Squares
Data points:
– means (x and y)
– unequal sample sizes
– weights proportional to sample size
More generally:
– weights inversely proportional to the variance of the y-values at each x
Hans Rosling: Gapminder
OLS: β̂ values chosen to minimise Σ res_i²
WLS: β̂ values chosen to minimise Σ w_i res_i²
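A sketch of the sample-size case (made-up data): WLS on the group means, with weights equal to the group sizes, reproduces the OLS fit to all the raw points.

```python
# WLS on means with weights = sample sizes vs OLS on the raw points.
import numpy as np

# raw data: each x has an unequal number of replicate y values (hypothetical)
groups = {1.0: [2.0, 2.2], 2.0: [3.1], 3.0: [3.9, 4.1, 4.0], 4.0: [5.2, 4.8]}

# OLS on all raw points
xr = np.array([x for x, ys in groups.items() for _ in ys])
yr = np.array([y for ys in groups.values() for y in ys])
A = np.column_stack([np.ones_like(xr), xr])
b_ols, *_ = np.linalg.lstsq(A, yr, rcond=None)

# WLS on the group means, weights proportional to sample size
xm = np.array(list(groups))
ym = np.array([np.mean(ys) for ys in groups.values()])
w = np.array([len(ys) for ys in groups.values()], dtype=float)
W = np.diag(w)
Xm = np.column_stack([np.ones_like(xm), xm])
b_wls = np.linalg.solve(Xm.T @ W @ Xm, Xm.T @ W @ ym)  # minimises sum w_i*res_i^2

print(np.round(b_ols, 6), np.round(b_wls, 6))          # identical fits
```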
Multiple Linear Regression Extensions
Logistic Regression
• Y variable 0/1
– Success/Failure at differing levels of x
• Y variable proportions p_i (summaries of binary)
– Y constrained in (0, 1)
– Possible alternative: the logistic transformation log(p_i / (1 − p_i)), regressed on x_i
• Y variable nominal/ordinal
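The logistic transformation maps proportions in (0, 1) onto the whole real line, so a straight line in x can no longer produce fitted values outside the constraint. A minimal sketch of the transform and its inverse:

```python
# logit and its inverse: the round trip recovers the proportion.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    assert abs(inv_logit(z) - p) < 1e-12   # round-trip check
    print(p, round(z, 3))
```

Any fitted line in the logit scale, back-transformed with inv_logit, stays strictly between 0 and 1.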
Multiple Linear Regression Extensions
Generalised Linear Modelling glm
includes:
– (Multiple) Linear regression
– Logistic regression
– Non-Normal random variation
– eg count data, survival times
• f(E[Y]) = α + βx
• Var[Y] depends on E[Y]
Multiple Linear Regression Extensions
• Smoothers – Generalised Additive Modelling
• E[Y] = smooth function of x no coefficients
• Var[Y] depends on E[Y]
• ‘Modern methods’ p>>n – Data mining
– Machine Learning
– L1 penalties (lasso)
– Sparse models
Exam
• 3 questions
• General Question
– Prepare in advance
• Use MINITAB output
– To show competence
– To illustrate principles
Exam
Derived variables, indicator variables and transformations greatly extend the reach of regression.
Compute and plot fitted regressions.
Different foci of MLR and implications for analysis.
Network diagrams; no ‘discovery’
[Network diagram: Tree Vol related to Diam, Ht and the derived variable Diam²·Ht, the latter suggested by theory]
General Questions
• Reward
– independent thinking & learning
– As evidenced by ability to illustrate
• No more than 1 page
– Diagram
– Example
– Principle
Exam in 2012
No F-tables or t-tables.
In the exam, use the 'fit ± 2S' approximation for PIs. MINITAB uses (a) technical formulae for curvature and (b) percentiles from t-tables.
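A sketch of that approximation applied to the earlier CEO log-log fit (S = 0.274596 in the log10 scale; the sales figure is an arbitrary illustration). Note the interval becomes multiplicative once back-transformed, and the approximation ignores the curvature terms MINITAB includes:

```python
# 'fit ± 2S' approximate 95% PI, back-transformed from the log10 scale.
import math

s = 0.274596
fit_log = 5.076 + 0.3086 * math.log10(10000)   # predicted log10(Compensation)
lo, hi = 10 ** (fit_log - 2 * s), 10 ** (fit_log + 2 * s)
factor = 10 ** (2 * s)                          # multiplicative margin, ~3.5-fold
print(round(lo), round(hi), round(factor, 2))
```

The log-scale interval is symmetric; in the money scale the prediction can be about 3.5 times smaller or larger, and can never be negative.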
Q2, 2011
Analysis in the Log Scale
Regression Analysis: Log10Cu versus Log10Sh, Log10Dist
The regression equation is
Log10Cu = - 0.906 + 3.03 Log10Sh - 0.988 Log10Dist
Predictor Coef SE Coef T P
Constant -0.90610 0.01024 -88.51 0.000
Log10Sh 3.03146 0.02479 122.29 0.000
Log10Dist -0.98809 0.01508 -65.51 0.000
S = 0.0295692 R-Sq = 99.9%
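Back-transforming this fit gives a power law: Cu = 10^(-0.906) × Sh^3.03 × Dist^(-0.988). A sketch of what the exponents mean in practice (coefficients taken from the output above):

```python
# Multiplicative interpretation of the log10-scale coefficients.
import math

def predict_cu(sh, dist):
    return 10 ** (-0.906 + 3.03 * math.log10(sh) - 0.988 * math.log10(dist))

ratio_sh = predict_cu(2, 1) / predict_cu(1, 1)      # doubling shell: x 2^3.03
ratio_dist = predict_cu(1, 2) / predict_cu(1, 1)    # doubling dist: x 2^-0.988
print(round(ratio_sh, 2), round(ratio_dist, 2))
```

Doubling shell multiplies predicted Cu roughly eightfold; doubling dist roughly halves it (the exponent -0.988 is close to an exact inverse law).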
Obs 5 is highlighted above as
Obs 19 is highlighted above as
[Matrix plot of Log10Cu, Log10Sh, Log10Dist]
[Residual plots for Log10Cu: normal probability plot, residuals versus fits, histogram, residuals versus order]
[Fitted line plot: Log10Cu = - 0.000000 + 1.000 FITLog; S = 0.0290364, R-Sq = 99.9%, R-Sq(adj) = 99.9%; regression line with 95% PI]
Analysis in the Linear Scale
Obs 5 is highlighted above as
Obs 19 is highlighted above as
Regression Analysis: Cu versus shell, dist
The regression equation is
Cu = 0.169 + 0.633 shell - 0.209 dist
Predictor Coef SE Coef T P
Constant 0.1687 0.2796 0.60 0.551
shell 0.6328 0.1107 5.72 0.000
dist -0.20927 0.05207 -4.02 0.000
S = 0.537055 R-Sq = 61.9%
Unusual Observations
Obs shell Cu Fit SE Fit Resid St Resid
5 2.38 3.5412 1.572 0.2057 1.968 3.97R
19 4.68 2.4961 2.119 0.3260 0.376 0.88X
R denotes an observation with a large
standardized residual.
X denotes an obs whose X values gives it large
leverage.
[Matrix plot of Cu, shell, dist]
[Residual plots for Cu: normal probability plot, residuals versus fits, histogram, residuals versus order]
[Fitted line plot: Cu = - 0.0000 + 1.000 FITLin; S = 0.527377, R-Sq = 61.9%, R-Sq(adj) = 60.5%; regression line with 95% PI]
Q3, 2011
Math Marks – Summary statistics
[Matrix plot of Stat, Anal, Alg, Vect, Mech]
Correlations: Stat, Anal, Alg, Vect, Mech
Stat Anal Alg Vect
Anal 0.607
Alg 0.665 0.711
Vect 0.436 0.485 0.610
Mech 0.389 0.409 0.547 0.553
Cell Contents: Pearson correlation
Descriptive Stats: Stat, Anal, Alg, Vect, Mech
Variable Mean StDev
Stat 42.31 17.26
Anal 46.68 14.85
Alg 50.60 10.62
Vect 50.59 13.15
Mech 38.95 17.49
[Network diagram linking Vectors, Algebra, Mechanics, Statistics, Analysis]
Q3, 2010
1. Appendix 3: Gas contains an analysis of the relationship between gas consumption and temperature for a single building, before and after the installation of insulation.
(a) A simple comparison of the means, as below, suggests that insulation has had no impact. Discuss the study design and its implications for this analysis.
Two-sample T for Gas
Insulated N Mean StDev SE Mean
0 16 4.006 0.622 0.16
1 15 3.960 0.485 0.13
Difference = mu (0) - mu (1)
Estimate for difference: 0.046
95% CI for difference: (-0.366, 0.458)
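The printed comparison can be reconstructed from the summary statistics alone; a sketch of the difference in means and its (Welch-form) standard error:

```python
# Two-sample comparison from summaries: difference and its SE.
n0, m0, s0 = 16, 4.006, 0.622   # before insulation
n1, m1, s1 = 15, 3.960, 0.485   # after insulation

diff = m0 - m1                                  # 0.046, as printed
se = (s0**2 / n0 + s1**2 / n1) ** 0.5           # ~0.20
print(round(diff, 3), round(se, 2))             # tiny difference vs its SE
```

The difference is a fraction of its standard error, which is why the CI comfortably covers zero. The scatterplot below shows why this marginal comparison misleads: it ignores temperature.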
[Scatterplot of Gas vs Temperature, grouped by Insulated (0/1)]