Conversing with Data
• Statistical Analysis
– Exploratory/Confirmatory
• Formulating the questions
– Regression Prediction as Formulation
• Refining the questions
• Comparing
– Testing
– Predicting
Diploma in Statistics 2013 Introduction to Regression Week 6
Hypothesis Testing - if you must
• Scientific Hypothesis
– Proof/Disproof
• Thesis/Antithesis
• Devil’s advocate
• Specific (default Null) and Alternative Hypotheses, H0 and HA
– Test Statistic
• p values
• Critical values
• Stat Significance
• Reject/Fail to reject/Accept
http://en.wikipedia.org/wiki/Statistical_significance_test
PEmax revisited
The regression equation is
PEmax = 62.1 - 12.5 Sex + 3.77 Age - 0.013 FRC
Predictor Coef SE Coef T P
Constant 62.13 42.84 1.45 0.162
Sex -12.54 11.26 -1.11 0.278
Age 3.771 1.441 2.62 0.016
FRC -0.0134 0.1673 -0.08 0.937
S = 27.4069 R-Sq = 41.2%
Analysis of Variance
Source DF SS MS F P
Regression 3 11058.8 3686.3 4.91 0.010
Residual Error 21 15773.9 751.1
Total 24 26832.6
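The headline quantities in this output can be recovered by hand from the ANOVA table alone. A minimal sketch (pure arithmetic on the printed sums of squares; no data needed, and not part of the original MINITAB material):

```python
# Recompute R-Sq, F and S from the PEmax ANOVA table above.
ss_reg, df_reg = 11058.8, 3
ss_err, df_err = 15773.9, 21
ss_tot = ss_reg + ss_err            # ~26832.7, matching the printed Total (rounding)

r_sq = ss_reg / ss_tot              # proportion of variation 'explained'
f_stat = (ss_reg / df_reg) / (ss_err / df_err)
s = (ss_err / df_err) ** 0.5        # residual standard deviation S

print(round(100 * r_sq, 1), round(f_stat, 2), round(s, 4))
```

This reproduces R-Sq = 41.2%, F = 4.91 and S = 27.4069 from the output above.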
p-values - Probability of ??? (the data? the coeff/diff/ratio?) p < 0.05
• All probability depends on 'info'
– Requires a precise statement of the 'event'
– Care with Pr(A and B) = Pr(A)Pr(B): valid only under independence
• Under the specific (default Null) hypothesis, the probability is < 0.05 of observing 'data like this' as or more extreme than 'this data set'
http://en.wikipedia.org/wiki/Statistical_significance_test
and technical assumptions
95% Confidence Intervals
• Informal
– Margin for error associated with this ‘coeff’
– Computed for ‘data like this’
– Values in this interval are ‘statistically consistent with the data’
• Formal
– List of the specific hypotheses that would not be rejected by the statistical test procedure at the 5% significance level
• NOT 'the probability is 95% that the true value is in this interval'
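The informal version can be computed directly from the printed coefficient table. A sketch for the Age coefficient in the PEmax fit, taking the critical value 2.080 from t-tables (21 residual df); this is an illustration, not part of the original output:

```python
# 95% CI for the Age coefficient: coef ± t_crit * SE
coef, se = 3.771, 1.441       # Age row of the PEmax output
t_crit = 2.080                # t(0.975, df=21), from t-tables
lo, hi = coef - t_crit * se, coef + t_crit * se
print(round(lo, 2), round(hi, 2))  # interval excludes 0: consistent with p = 0.016
```

Values inside (lo, hi) are the slopes 'statistically consistent with the data'; zero is not among them.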
Role of t-ratios

t_k = β̂_k / SE(β̂_k)

Informally, if t_k is not large (> 2 in magnitude; p ≈ 0.05), then the coefficient of x_k can be given a value of zero - equivalently x_k can be dropped from the model - with little appreciable impact.

Caution: applies one variable at a time.
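The rule of thumb can be checked against the PEmax coefficient table directly; a minimal sketch (numbers taken from the printed output):

```python
# t_k = coef_k / SE(coef_k); |t| > 2 roughly corresponds to p < 0.05.
rows = {"Sex": (-12.54, 11.26), "Age": (3.771, 1.441), "FRC": (-0.0134, 0.1673)}
for name, (coef, se) in rows.items():
    t_ratio = coef / se                     # matches the printed T column
    verdict = "keep" if abs(t_ratio) > 2 else "candidate to drop (one at a time!)"
    print(f"{name}: t = {t_ratio:.2f} -> {verdict}")
```

Only Age clears the threshold; Sex and FRC are candidates to drop, but only one variable at a time, since the t-ratios change when the model changes.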
ANOVA tests: Composite
• F – in the simplest case, = sum of squared t-ratios
• Null Hypothesis
– No ‘systematic variation’
– All variation ‘random’
– All slope coeffs = 0
• Alternative Hypothesis
– Some ‘systematic variation’
– At least one coeff is non-zero
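In the simplest case (one x-variable) the link is exact: the ANOVA F equals the squared t-ratio. A sketch on a small made-up data set (hypothetical numbers, used only to show the identity):

```python
# Simple regression by closed-form least squares; then verify F = t^2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                       # slope
b0 = ybar - b1 * xbar                # intercept
fits = [b0 + b1 * xi for xi in x]
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, fits))
ss_reg = sum((fi - ybar) ** 2 for fi in fits)
mse = ss_err / (n - 2)
t_ratio = b1 / (mse / sxx) ** 0.5    # slope t-ratio
f_stat = ss_reg / mse                # ANOVA F (1 regression df)
print(round(f_stat, 3), round(t_ratio ** 2, 3))  # identical
```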
As or more extreme
My view
• Data a (small?) part of the info available
– Prob – given data and theory
– What data not in study?
• Beware "Fooled by Randomness" – the narrative fallacy (Taleb)
– Rely on theory - including • Scientific common sense
• Philosophy of science - proof/disproof
• Use of p-values for screening ‘false pos/neg’
• Be prepared to struggle with editors/supervisors/peers
'What you see is all there is' (WYSIATI) – Kahneman
Regression Model
• Some variation in Y explicable by variation in x-vars
– by some weighted average of x-vars
– x-vars act together 'without interaction'
• Inexplicable variation (residuals) is so subtle that
– it is not worth the hassle of pursuit
– we might as well regard it as unpredictable/random, with common magnitude
• If so, rules for random variation
– can help provide some guidelines on big/small
Linear Model Theory

Classic Linear Model:
Y_i = β_0 + β_1 x_1i + β_2 x_2i + .... + β_p x_pi + ε_i
where ε_i ~ N(0, σ² = Var) or N(0, σ = SD)

Statistical Theory:
– assumes the unpredictable ε_i has a Normal dist (technical)
– makes NO assumptions re the dist of x_1, x_2, ...
– makes the assumption of additivity
– makes the assumption that Var(Y) does not change with the x-vars (crucial)
Linearity - Weighted Sum

Linear Model: Y_i = β_0 + β_1 x_1i + β_2 x_2i + .... + β_p x_pi + ε_i
– a weighted sum of variables (coeffs are the weights)
– a weighted sum of coefficients (vars are the weights)

y = β_0 + β_1 x: straight line in (y, x); linear in β_0, β_1
y = β_0 + β_1 x + β_2 x²: quadratic relationship in (y, x); linear in β_0, β_1, β_2
log y = β_0 + β_1 log x: non-linear in (y, x); linear in β_0, β_1
log(y) = β_0 + β_1 log x_1 + β_2 log x_2: linear in β_0, β_1, β_2
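'Linear' here means linear in the coefficients, so a quadratic in x is still a linear model, fit by ordinary least squares on a design matrix with an x² column. A sketch with made-up noise-free data, purely to show the mechanics:

```python
# Fit y = b0 + b1*x + b2*x^2 by least squares: linear in (b0, b1, b2).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x - 0.5 * x**2                     # exact quadratic, for clarity
X = np.column_stack([np.ones_like(x), x, x**2])    # design matrix
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))                               # recovers [1, 2, -0.5]
```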
Multiple Linear Regression
• Simple weighted average
– Variables act ‘one-at-a-time’
– Simultaneous effect is additive
• Interaction
– X1 acting in combination with X2
– Simultaneous effect is non-additive
– Simplest non-additive is multiplicative
wiki/Interaction_(statistics)
A drug X might be desirable for treating a certain condition, but not if you are also taking drug Y: taken together, X and Y have a bad consequence from their combination – a bad drug "interaction".
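The simplest non-additive (multiplicative) interaction is fit by adding the product x1*x2 as an extra column. A sketch with made-up data where the interaction weight is exactly 2, so least squares recovers it:

```python
# Interaction term as a derived column in the design matrix.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(0, 1, 50)
y = 1 + x1 + x2 + 2 * x1 * x2                       # exact, no noise
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 6))                                # [1, 1, 1, 2]
```

The last weight is the interaction: the effect of x1 on y now changes with the level of x2.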
Derived x-variables
• No extra explanatory power in
– Linear transform of individual x-vars
– Weighted averages of individual x-vars
• Possibly extra explanatory power in
– Non-linear transforms
– Multiplicative combinations, also ratios
– Other non-linear combinations
• Possibly simpler model
Fishing?
Transforming for Simplicity
• Recall: Multiple Linear Regression
– General purpose tool – customise
• Use the language/imagery of users
– Are changes/differences expressed as scale free?
• Temperature, Date
• Twice, percent, proportion
• Automatically avoid howlers / satisfy constraints
– Predict negative values?
– Create prediction intervals with negative values?
– When to log transform?
CEO salaries
[Fitted line plot: Compensation = 1437416 + 61.57 Sales; S = 1367165, R-Sq = 13.7%, R-Sq(adj) = 13.5%; regression line with 95% PI]
[Fitted line plot: log10(Compensation) = 5.076 + 0.3086 log10(Sales); S = 0.274596, R-Sq = 24.0%, R-Sq(adj) = 23.9%; regression line with 95% PI]
[The same log-log fit, displayed with log scales on both axes]
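Predictions from the log10-scale fit must be back-transformed to the money scale. A sketch using the fitted equation above (the sales figure is an arbitrary illustration):

```python
# Back-transform a log10-scale prediction with 10**().
import math

def predict_compensation(sales):
    log_comp = 5.076 + 0.3086 * math.log10(sales)   # fitted line in the log scale
    return 10 ** log_comp                            # back to the money scale

print(round(predict_compensation(10000)))            # roughly 2 million
```

Note the coefficient 0.3086 now reads as an elasticity: a 1% increase in Sales goes with roughly a 0.31% increase in Compensation.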
Mammal Weights
[Fitted line plot: BrainW = 91.00 + 0.9665 BodyW; S = 334.720, R-Sq = 87.3%, R-Sq(adj) = 87.1%; regression line with 95% PI]
[Fitted line plot: log10(BrainW) = 0.9271 + 0.7517 log10(BodyW); S = 0.301528, R-Sq = 92.1%, R-Sq(adj) = 91.9%; regression line with 95% PI]
[The same log-log fit, displayed with log scales on both axes]
Random Variation Additive?
[Plot: Exp decay; random variation decreases with time]
[Plot: Exp decay on log scale; random variation constant in time]
Log scale on one or both axes
Non-linear Transformation
• Evidence for
– Curvature in the regression line
– Non-constant scatter (residuals) around the line
• Constant scatter
• Constant percentage scatter
• Constrained scatter
– Non-normal residuals
• Reasons for
– Simpler, more natural
– Technical
Residuals Unusual Observations
Lactic
Obs Acid Taste Fit SE Fit Residual St Resid
15 1.52 54.90 29.45 3.04 25.45 2.63R
In fact barely outlying, despite the 2.63. Recall: one residual has to be the largest!
Options with Large Residuals
• Examine carefully:
– Why outlying?
– Anything special about this case/obs?
– Refit without it
• does its removal change anything important?
• If deleted, then formally
– Conclusions are based on 'something like this never happening in future'
– Is this a meaningful statement?
Residuals, Standardized Residuals, Deleted T-residuals
Standardised Residual_i = (Residual_i from best line using all data) / SD(all residuals)

Better? = (Residual_i from best line using all data) / SD(all other residuals)

Deleted t-residual_i = (Residual_i from best line using all other data) / SD(all other residuals)
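A literal rendering of the three scalings on a toy simple-regression fit with one deliberately unusual point (made-up data; these informal versions omit the leverage adjustments that software such as MINITAB applies):

```python
# Compare: residual/SD(all), residual/SD(others), and the deleted residual.
import numpy as np

def fit_line(x, y):
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # OLS slope
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 12.0])  # last point unusual

b0, b1 = fit_line(x, y)
res = y - (b0 + b1 * x)

i = len(x) - 1                                  # examine the last observation
std_res = res[i] / res.std(ddof=1)              # vs SD of ALL residuals
others = np.delete(res, i)
better = res[i] / others.std(ddof=1)            # vs SD of the OTHER residuals

mask = np.arange(len(x)) != i                   # refit with the point deleted
b0d, b1d = fit_line(x[mask], y[mask])
del_res = y[i] - (b0d + b1d * x[i])             # residual from the other-data line
print(round(std_res, 2), round(better, 2), round(del_res, 2))
```

The point inflates the fit it is judged against: each successive version makes the unusual observation look progressively more extreme.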
Multiple Linear Regression Variants
• ANOVA and T-tests
• Best subsets
– For prediction – barely relevant
– For understanding – dangerous
• Auto-regression
– Time series; focus on prediction (and prediction intervals)
• Weighted regression (WLS)
– Y variables not equally 'precise'
• Correlated residuals (GLS)
Weighted Least Squares
Data points:
– means (x and y)
– unequal sample sizes
– weights proportional to sample size
More generally:
– weights inversely proportional to the variance of the y-values at each x
Hans Rosling: Gapminder
OLS: β̂ values chosen to minimise Σ res_i²
WLS: β̂ values chosen to minimise Σ w_i res_i²
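A sketch of the sample-size case (made-up data): WLS on the group means, with weights equal to the group sizes, reproduces the OLS fit to all the raw points.

```python
# WLS on means with weights = sample sizes vs OLS on the raw points.
import numpy as np

# raw data: each x has an unequal number of replicate y values (hypothetical)
groups = {1.0: [2.0, 2.2], 2.0: [3.1], 3.0: [3.9, 4.1, 4.0], 4.0: [5.2, 4.8]}

# OLS on all raw points
xr = np.array([x for x, ys in groups.items() for _ in ys])
yr = np.array([y for ys in groups.values() for y in ys])
A = np.column_stack([np.ones_like(xr), xr])
b_ols, *_ = np.linalg.lstsq(A, yr, rcond=None)

# WLS on the group means, weights proportional to sample size
xm = np.array(list(groups))
ym = np.array([np.mean(ys) for ys in groups.values()])
w = np.array([len(ys) for ys in groups.values()], dtype=float)
W = np.diag(w)
Xm = np.column_stack([np.ones_like(xm), xm])
b_wls = np.linalg.solve(Xm.T @ W @ Xm, Xm.T @ W @ ym)  # minimises sum w_i*res_i^2

print(np.round(b_ols, 6), np.round(b_wls, 6))          # identical fits
```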
Multiple Linear Regression Extensions
Logistic Regression
• Y variable 0/1
– Success/Failure at differing levels of x
• Y variable proportions p_i (summaries of binary)
– Y constrained in (0, 1)
– Possible alternative: the logistic transformation log(p_i / (1 − p_i)), regressed on x_i
• Y variable nominal/ordinal
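The logistic transformation maps proportions in (0, 1) onto the whole real line, so a straight line in x can no longer produce fitted values outside the constraint. A minimal sketch of the transform and its inverse:

```python
# logit and its inverse: the round trip recovers the proportion.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    assert abs(inv_logit(z) - p) < 1e-12   # round-trip check
    print(p, round(z, 3))
```

Any fitted line in the logit scale, back-transformed with inv_logit, stays strictly between 0 and 1.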
Multiple Linear Regression Extensions
Generalised Linear Modelling glm
includes:
– (Multiple) Linear regression
– Logistic regression
– Non-Normal random variation
– eg count data, survival times
• f(E[Y]) = α + βx
• Var[Y] depends on E[Y]
Multiple Linear Regression Extensions
• Smoothers – Generalised Additive Modelling
• E[Y] = smooth function of x no coefficients
• Var[Y] depends on E[Y]
• ‘Modern methods’ p>>n – Data mining
– Machine Learning
– L1 penalties (lasso)
– Sparse models
Exam
• 3 questions
• General Question
– Prepare in advance
• Use MINITAB output
– To show competence
– To illustrate principles
Exam
Derived variables, indicator variables and transformations greatly extend the reach of regression.
Compute and plot fitted regressions.
Different foci of MLR and implications for analysis.
Network diagrams; no ‘discovery’
[Network diagram: Tree Vol related to Diam, Ht and the derived variable Diam²·Ht, the latter suggested by theory]
General Questions
• Reward
– independent thinking & learning
– As evidenced by ability to illustrate
• No more than 1 page
– Diagram
– Example
– Principle
Exam in 2012
No F-tables or t-tables.
In the exam, use the 'fit ± 2S' approximation for PIs. MINITAB uses (a) technical formulae for curvature and (b) percentiles from t-tables.
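A sketch of that approximation applied to the earlier CEO log-log fit (S = 0.274596 in the log10 scale; the sales figure is an arbitrary illustration). Note the interval becomes multiplicative once back-transformed, and the approximation ignores the curvature terms MINITAB includes:

```python
# 'fit ± 2S' approximate 95% PI, back-transformed from the log10 scale.
import math

s = 0.274596
fit_log = 5.076 + 0.3086 * math.log10(10000)   # predicted log10(Compensation)
lo, hi = 10 ** (fit_log - 2 * s), 10 ** (fit_log + 2 * s)
factor = 10 ** (2 * s)                          # multiplicative margin, ~3.5-fold
print(round(lo), round(hi), round(factor, 2))
```

The log-scale interval is symmetric; in the money scale the prediction can be about 3.5 times smaller or larger, and can never be negative.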
Q2, 2011
Analysis in the Log Scale
Regression Analysis: Log10Cu versus Log10Sh, Log10Dist
The regression equation is
Log10Cu = - 0.906 + 3.03 Log10Sh - 0.988 Log10Dist
Predictor Coef SE Coef T P
Constant -0.90610 0.01024 -88.51 0.000
Log10Sh 3.03146 0.02479 122.29 0.000
Log10Dist -0.98809 0.01508 -65.51 0.000
S = 0.0295692 R-Sq = 99.9%
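Back-transforming this fit gives a power law: Cu = 10^(-0.906) × Sh^3.03 × Dist^(-0.988). A sketch of what the exponents mean in practice (coefficients taken from the output above):

```python
# Multiplicative interpretation of the log10-scale coefficients.
import math

def predict_cu(sh, dist):
    return 10 ** (-0.906 + 3.03 * math.log10(sh) - 0.988 * math.log10(dist))

ratio_sh = predict_cu(2, 1) / predict_cu(1, 1)      # doubling shell: x 2^3.03
ratio_dist = predict_cu(1, 2) / predict_cu(1, 1)    # doubling dist: x 2^-0.988
print(round(ratio_sh, 2), round(ratio_dist, 2))
```

Doubling shell multiplies predicted Cu roughly eightfold; doubling dist roughly halves it (the exponent -0.988 is close to an exact inverse law).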
Obs 5 is highlighted above as
Obs 19 is highlighted above as
[Matrix plot of Log10Cu, Log10Sh, Log10Dist]
[Residual plots for Log10Cu: normal probability plot, residuals versus fits, histogram, residuals versus order]
[Fitted line plot: Log10Cu = - 0.000000 + 1.000 FITLog; S = 0.0290364, R-Sq = 99.9%, R-Sq(adj) = 99.9%; regression line with 95% PI]
Analysis in the Linear Scale
Obs 5 is highlighted above as
Obs 19 is highlighted above as
Regression Analysis: Cu versus shell, dist
The regression equation is
Cu = 0.169 + 0.633 shell - 0.209 dist
Predictor Coef SE Coef T P
Constant 0.1687 0.2796 0.60 0.551
shell 0.6328 0.1107 5.72 0.000
dist -0.20927 0.05207 -4.02 0.000
S = 0.537055 R-Sq = 61.9%
Unusual Observations
Obs shell Cu Fit SE Fit Resid St Resid
5 2.38 3.5412 1.572 0.2057 1.968 3.97R
19 4.68 2.4961 2.119 0.3260 0.376 0.88X
R denotes an observation with a large
standardized residual.
X denotes an obs whose X values gives it large
leverage.
[Matrix plot of Cu, shell, dist]
[Residual plots for Cu: normal probability plot, residuals versus fits, histogram, residuals versus order]
[Fitted line plot: Cu = - 0.0000 + 1.000 FITLin; S = 0.527377, R-Sq = 61.9%, R-Sq(adj) = 60.5%; regression line with 95% PI]
Q3, 2011
Math Marks – Summary statistics
[Matrix plot of Stat, Anal, Alg, Vect, Mech]
Correlations: Stat, Anal, Alg, Vect, Mech
Stat Anal Alg Vect
Anal 0.607
Alg 0.665 0.711
Vect 0.436 0.485 0.610
Mech 0.389 0.409 0.547 0.553
Cell Contents: Pearson correlation
Descriptive Stats: Stat, Anal, Alg, Vect, Mech
Variable Mean StDev
Stat 42.31 17.26
Anal 46.68 14.85
Alg 50.60 10.62
Vect 50.59 13.15
Mech 38.95 17.49
[Network diagram linking Vectors, Algebra, Mechanics, Statistics, Analysis]
Q3, 2010
1. Appendix 3: Gas contains an analysis of the relationship between gas consumption and temperature for a single building, before and after the installation of insulation.
(a) A simple comparison of the means, as below, suggests that insulation has had no impact. Discuss the study design and its implications for this analysis.
Two-sample T for Gas
Insulated N Mean StDev SE Mean
0 16 4.006 0.622 0.16
1 15 3.960 0.485 0.13
Difference = mu (0) - mu (1)
Estimate for difference: 0.046
95% CI for difference: (-0.366, 0.458)
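The printed comparison can be reconstructed from the summary statistics alone; a sketch of the difference in means and its (Welch-form) standard error:

```python
# Two-sample comparison from summaries: difference and its SE.
n0, m0, s0 = 16, 4.006, 0.622   # before insulation
n1, m1, s1 = 15, 3.960, 0.485   # after insulation

diff = m0 - m1                                  # 0.046, as printed
se = (s0**2 / n0 + s1**2 / n1) ** 0.5           # ~0.20
print(round(diff, 3), round(se, 2))             # tiny difference vs its SE
```

The difference is a fraction of its standard error, which is why the CI comfortably covers zero. The scatterplot below shows why this marginal comparison misleads: it ignores temperature.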
[Scatterplot of Gas vs Temperature, grouped by Insulated (0/1)]