Statistics - Statistical Methods for Data Analysis
Chapter 3 Statistical Methods Paul C. Taylor University of Hertfordshire 28th March 2001

3.1 Introduction

- Generalized Linear Models
- Special Topics in Regression Modelling
- Classical Multivariate Analysis
- Summary

3.2 Generalized Linear Models

- Regression
- Analysis of Variance
- Log-linear Models
- Logistic Regression
- Analysis of Survival Data

The fitting of generalized linear models is currently the most frequently applied statistical technique. Generalized linear models are used to describe the relationship between the mean, sometimes called the trend, of one variable and the values taken by several other variables.

3.2.1 Regression

How is a variable, y, related to one, or more, other variables, x_1, x_2, ..., x_k?

Names for y: response; dependent variable; output.
Names for the x_j's: regressors; explanatory variables; independent variables; inputs.

Here, we will use the terms output and inputs.

Common reasons for doing a regression analysis include:

- the output is expensive to measure, but the inputs are not, and so cheap predictions of the output are sought;
- the values of the inputs are known earlier than the output is, and a working prediction of the output is required;
- we can control the values of the inputs, we believe there is a causal link between the inputs and the output, and so we want to know what values of the inputs should be chosen to obtain a particular target value for the output;
- it is believed that there is a causal link between some of the inputs and the output, and we wish to identify which inputs are related to the output.

The (general) linear model is

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i, \qquad i = 1, \dots, n,   (3.1)

where the \epsilon_i's are independently and identically distributed as N(0, \sigma^2) and n is the number of data points. The model is linear in the \beta's:

    E(y_i) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji}   (3.2)

(a weighted sum of the \beta's).

The main reasons for the use of the linear model:

- The maximum likelihood estimators of the \beta's are the same as the least squares estimators; see Section 2.4 of Chapter 2.
- There are explicit formulae and rapid, reliable numerical methods for finding the least squares estimators of the \beta's.
- Many problems can be framed as general linear models. For example,

      y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i   (3.3)

  can be converted by setting x_{1i} = x_i, x_{2i} = x_i^2 and x_{3i} = x_i^3 (see the sketch after this list).
- Even when the linear model is not strictly appropriate, there is often a way to transform the output and/or the inputs, so that a linear model can provide useful information.
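For concreteness, here is a minimal sketch (not from the original chapter) of fitting (3.3) as a general linear model by least squares with NumPy; the simulated data and coefficient values are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from the cubic model (3.3): y = 1 + 2x - 0.5x^2 + 0.1x^3 + noise.
x = rng.uniform(-3, 3, size=100)
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3 + rng.normal(0, 1, size=100)

# Convert to a general linear model by setting x1 = x, x2 = x^2, x3 = x^3,
# and adding a column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Least squares estimates of the betas (equal to the maximum likelihood
# estimates under normally distributed errors).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("estimated betas:", beta_hat)
```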

Non-linear Regression

Two examples are given at (3.4) and (3.5); in each, the expected value of y_i is a non-linear function of the parameters, and the \epsilon_i's and n are as in (3.1).

Problems

1. Estimation is carried out using iterative methods which require good choices of starting values, might not converge, might converge to a local optimum rather than the global optimum, and will require human intervention to overcome these difficulties.

2. The statistical properties of the estimates and predictions from the model are not known, so we cannot perform statistical inference for non-linear regression.
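As an illustration of the first point, the sketch below fits a hypothetical exponential-decay model by iterative least squares using scipy.optimize.curve_fit; the model, data and starting values p0 are assumptions for illustration, not the examples (3.4)-(3.5) from the slides, and a poor choice of p0 can make the fit fail or stop at a local optimum.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def model(x, b0, b1, b2):
    # A hypothetical non-linear regression function: the mean of y is
    # not a linear function of the parameters b0, b1, b2.
    return b0 + b1 * np.exp(-b2 * x)

x = np.linspace(0, 10, 80)
y = model(x, 2.0, 5.0, 0.7) + rng.normal(0, 0.3, size=x.size)

# Iterative estimation: curve_fit needs starting values and may not converge
# (or may converge to a local optimum) if they are badly chosen.
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.1])
print("estimated parameters:", params)
```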

Generalized Linear Models

The generalization is in two parts.

1. The distribution of the output does not have to be the normal, but can be any of the distributions in the exponential family.

2. Instead of the expected value of the output being a linear function of the \beta's, we have

    g(E(y_i)) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji},   (3.6)

where g(\cdot) is a monotone differentiable function. The function g(\cdot) is called the link function.

There is a reliable general algorithm for fitting generalized linear models.
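A minimal fitting sketch, assuming the statsmodels package is available: a Poisson output with a log link on made-up data. The reliable general algorithm referred to above is iteratively reweighted least squares, which is what implementations such as statsmodels use internally.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Made-up inputs and a Poisson-distributed output with a log link:
# log(E(y)) = 0.5 + 0.8*x1 - 0.3*x2.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x1 - 0.3 * x2)
y = rng.poisson(mu)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)   # estimates of beta_0, beta_1, beta_2
```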

Generalized Additive Models

Generalized additive models are a generalization of generalized linear models. The generalization is that g(E(y_i)) need not be a linear function of a set of \beta's, but has the form

    g(E(y_i)) = \sum_{j=1}^{k} f_j(x_{ji}),   (3.7)

where the f_j's are arbitrary, usually smooth, functions.

An example of the model produced using a type of scatterplot smoother is shown in Figure 3.1.

[Figure 3.1: Diabetes data with a spline smooth (df = 3); Age against Log C-peptide.]

Methods for fitting generalized additive models exist and are generally reliable. The main drawback is that the framework of statistical inference that is available for generalized linear models has not yet been developed for generalized additive models. Despite this drawback, generalized additive models can be fitted by several of the major statistical packages already.

3.2.2 Analysis of Variance

The analysis of variance, or ANOVA, is primarily a method of identifying which of the \beta's in a linear model are non-zero. This technique was developed for the analysis of agricultural field experiments, but is now used quite generally.

Example 27: Turnips for Winter Fodder. The data in Table 3.1 are from an experiment to investigate the growth of turnips. These types of turnips would be grown to provide food for farm animals in winter. The turnips were harvested and weighed by staff and students of the Departments of Agriculture and Applied Statistics of The University of Reading, in October 1990.

Table 3.1
   Treatments                              Blocks
Variety   Date      Density   Label     I      II     III    IV
Barkant   21/8/90   1 kg/ha   A        2.7    1.4    1.2    3.8
                    2 kg/ha   B        7.3    3.8    3.0    1.2
                    4 kg/ha   C        6.5    4.6    4.7    0.8
                    8 kg/ha   D        8.2    4.0    6.0    2.5
          28/8/90   1 kg/ha   E        4.4    0.4    6.5    3.1
                    2 kg/ha   F        2.6    7.1    7.0    3.2
                    4 kg/ha   G       24.0   14.9   14.6    2.6
                    8 kg/ha   H       12.2   18.9   15.6    9.9
Marco     21/8/90   1 kg/ha   J        1.2    1.3    1.5    1.0
                    2 kg/ha   K        2.2    2.0    2.1    2.5
                    4 kg/ha   L        2.2    6.2    5.7    0.6
                    8 kg/ha   M        4.0    2.8   10.8    3.1
          28/8/90   1 kg/ha   N        2.5    1.6    1.3    0.3
                    2 kg/ha   P        5.5    1.2    2.0    0.9
                    4 kg/ha   Q        4.7   13.2    9.0    2.9
                    8 kg/ha   R       14.9   13.3    9.3    3.6

The following linear model,

    y_i = \beta_0 + \alpha_B x_{Bi} + \alpha_C x_{Ci} + \cdots + \alpha_R x_{Ri}
          + \gamma_{II} x_{IIi} + \gamma_{III} x_{IIIi} + \gamma_{IV} x_{IVi} + \epsilon_i, \qquad i = 1, \dots, 64,   (3.8)

or an equivalent one, could be fitted to these data. The inputs take the values 0 or 1 and are usually called dummy or indicator variables. On first sight, (3.8) should also include an \alpha_A and a \gamma_I, but we do not need them.

The first question that we would try to answer about these data is

    Does a change in treatment produce a change in the turnip yield?

which is equivalent to asking

    Are any of \alpha_B, \alpha_C, ..., \alpha_R non-zero?

which is the sort of question that can be answered using ANOVA.

This is how the ANOVA works. Recall the general linear model of (3.1),

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i, \qquad i = 1, \dots, n.

The estimate of \beta_j is \hat\beta_j. Fitted values:

    \hat y_i = \hat\beta_0 + \sum_{j=1}^{k} \hat\beta_j x_{ji}.   (3.9)

Residuals:

    r_i = y_i - \hat y_i.   (3.10)

The size of the residuals is related to the size of \sigma^2, the variance of the \epsilon_i's. It turns out that we can estimate \sigma^2 by

    s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{n - k - 1}.   (3.11)
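A small sketch, on made-up data, of computing the fitted values (3.9), residuals (3.10) and the variance estimate s^2 of (3.11) with NumPy; the data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: n = 50 points, k = 2 inputs, true sigma = 1.5.
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.5, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

y_fitted = X @ beta_hat                    # fitted values, equation (3.9)
residuals = y - y_fitted                   # residuals, equation (3.10)
s2 = np.sum(residuals**2) / (n - k - 1)    # estimate of sigma^2, equation (3.11)
print("s^2 =", s2)                         # should be close to 1.5**2 = 2.25
```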

The key facts about s^2 that allow us to compare different linear models are:

- if the fitted model is adequate ('the right one'), then s^2 is a good estimate of \sigma^2;
- if the fitted model includes redundant terms (that is, includes some \beta's that are really zero), then s^2 is still a good estimate of \sigma^2;
- if the fitted model does not include one or more inputs that it ought to, then s^2 will tend to be larger than the true value of \sigma^2.

So if we omit a useful input from our model, the estimate of \sigma^2 will shoot up, whereas if we omit a redundant input from our model, the estimate of \sigma^2 should not change much. Note that omitting one of the inputs from the model is equivalent to forcing the corresponding \beta to be zero.

Example 28: Turnips for Winter Fodder continued. Let M_1 be the model at (3.8), and M_0 be the following model:

    y_i = \beta_0 + \gamma_{II} x_{IIi} + \gamma_{III} x_{IIIi} + \gamma_{IV} x_{IVi} + \epsilon_i, \qquad i = 1, \dots, 64.   (3.18)

So M_0 is the special case of M_1 in which all of \alpha_B, \alpha_C, ..., \alpha_R are zero.

Table 3.2
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   2.278016   0.08867543
Residuals   60   1437.5382   23.95897

Table 3.3
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   5.690430   0.002163810
treat       15   1005.9270   67.06182   6.991906   0.000000171
Residuals   45    431.6110    9.59135
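The treatment line in Table 3.3 can be reproduced by hand from the residual lines of the two fits: the extra sum of squares is 1437.538 - 431.611 = 1005.927 on 60 - 45 = 15 degrees of freedom, and dividing its mean square by the residual mean square of the larger model gives the F value. A short check, assuming SciPy is available:

```python
from scipy.stats import f

rss0, df0 = 1437.538, 60   # residuals from the blocks-only model (Table 3.2)
rss1, df1 = 431.611, 45    # residuals from the model with treatments (Table 3.3)

extra_ss = rss0 - rss1               # 1005.927, attributable to 'treat'
extra_df = df0 - df1                 # 15
f_value = (extra_ss / extra_df) / (rss1 / df1)
p_value = f.sf(f_value, extra_df, df1)
print(f_value, p_value)              # approx. 6.99 and 1.7e-07, as in Table 3.3
```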

Table 3.4 shows the ANOVA that would usually be produced for the turnip data. Notice that the 'block' and 'Residuals' rows are the same as in Table 3.3. The basic difference between Tables 3.3 and 3.4 is that the treatment information is broken down into its constituent parts in Table 3.4.

Table 3.4
                          Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block                      3    163.7367   54.5789    5.69043    0.0021638
variety                    1     83.9514   83.9514    8.75282    0.0049136
sowing                     1    233.7077  233.7077   24.36650    0.0000114
density                    3    470.3780  156.7927   16.34730    0.0000003
variety:sowing             1     36.4514   36.4514    3.80045    0.0574875
variety:density            3      8.6467    2.8822    0.30050    0.8248459
sowing:density             3    154.7930   51.5977    5.37960    0.0029884
variety:sowing:density     3     17.9992    5.9997    0.62554    0.6022439
Residuals                 45    431.6108    9.5914

3.2.3 Log-linear Models

The data shown in Table 3.7 show the sort of problem attacked by log-linear modelling. There are five categorical variables displayed in Table 3.7:

- centre: one of three health centres for the treatment of breast cancer;
- age: the age of the patient when her breast cancer was diagnosed;
- survived: whether the patient survived for at least three years from diagnosis;
- appear: appearance of the patient's tumour, either malignant or benign;
- inflam: amount of inflammation of the tumour, either minimal or greater.

Table 3.7: State of tumour (appearance) by inflammation, centre, age and survival.

                                    Minimal Inflammation    Greater Inflammation
Centre     Age          Survived    Malignant    Benign     Malignant    Benign
Tokyo      Under 50     No               9           7           4           3
                        Yes             26          68          25           9
           50-69        No               9           9          11           2
                        Yes             20          46          18           5
           70 or over   No               2           3           1           0
                        Yes              1           6           5           1
Boston     Under 50     No               6           7           6           0
                        Yes             11          24           4           0
           50-69        No               8          20           3           2
                        Yes             18          58          10           3
           70 or over   No               9          18           3           0
                        Yes             15          26           1           1
Glamorgan  Under 50     No              16           7           3           0
                        Yes             16          20           8           1
           50-69        No              14          12           3           0
                        Yes             27          39          10           4
           70 or over   No               3           7           3           0
                        Yes             12          11           4           1

For these data, the output is the number of patients in each cell. The model is

    \log(E(y_i)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}.   (3.21)

Since all the variables of interest are categorical, we need to use indicator variables as inputs in the same way as in (3.8).
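A hedged sketch of fitting (3.21) as a Poisson generalized linear model with a log link, assuming statsmodels and pandas are available. The data frame below holds only the eight Tokyo, under-50 cells of Table 3.7 and the formula uses main effects of three of the variables; the analysis behind Table 3.8 uses all five variables and their interactions over the full table.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Illustrative fragment of the cell counts (Tokyo, under 50 only).
cells = pd.DataFrame({
    "count":    [9, 26, 7, 68, 4, 25, 3, 9],
    "survived": ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
    "appear":   ["Malignant", "Malignant", "Benign", "Benign"] * 2,
    "inflam":   ["Minimal"] * 4 + ["Greater"] * 4,
})

# Log-linear model: Poisson output, log link; the formula interface codes the
# categorical inputs as 0/1 indicator variables automatically.
fit = smf.glm("count ~ survived + appear + inflam",
              data=cells, family=sm.families.Poisson()).fit()
print(fit.params)
```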

Table 3.8: Terms added sequentially (first to last).

                      Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                        71      860.0076
centre                 2     9.3619         69      850.6457    0.0092701
age                    2   105.5350         67      745.1107    0.0000000
survived               1   160.6009         66      584.5097    0.0000000
inflam                 1   291.1986         65      293.3111    0.0000000
appear                 1     7.5727         64      285.7384    0.0059258
centre:age             4    76.9628         60      208.7756    0.0000000
centre:survived        2    11.2698         58      197.5058    0.0035711
centre:inflam          2    23.2484         56      174.2574    0.0000089
centre:appear          2    13.3323         54      160.9251    0.0012733
age:survived           2     3.5257         52      157.3995    0.1715588
age:inflam             2     0.2930         50      157.1065    0.8637359
age:appear             2     1.2082         48      155.8983    0.5465675
survived:inflam        1     0.9645         47      154.9338    0.3260609
survived:appear        1     9.6709         46      145.2629    0.0018721
inflam:appear          1    95.4381         45       49.8248    0.0000000

To summarise this model, I would construct its conditional independence graph and present tables corresponding to the interactions. Tables are in the book. The conditional independence graph is shown in Figure 3.2.

[Figure 3.2: Conditional independence graph linking centre, age, survived, inflam and appear.]

3.2.4 Logistic Regression

In logistic regression, the output is the number of successes out of a number of trials, each trial resulting in either a success or a failure. For the breast cancer data, we can regard each patient as a 'trial', with success corresponding to the patient surviving for three years. The output would simply be given as the number of successes, either 0 or 1, for each of the 764 patients involved in the study.

The model that we will fit is y_i \sim \mathrm{Binomial}(1, p_i) with

    \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}.   (3.22)

Again, the inputs here will be indicators for the breast cancer data, but this is not generally true; there is no reason why any of the inputs should not be quantitative.
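A minimal sketch of fitting (3.22), assuming statsmodels is available; the 0/1 outputs and the single indicator input are simulated stand-ins, since the patient-level data are not reproduced in the slides.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Simulated stand-in: 764 'patients', one 0/1 indicator input, and a
# Bernoulli output whose log-odds are 0.5 + 0.8 * x.
n = 764
x = rng.integers(0, 2, size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + 0.8 * x)))
y = rng.binomial(1, p)

X = sm.add_constant(x.astype(float))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)   # estimates of the intercept and the coefficient of x
```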

Table 3.15
                              Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                                763     898.5279
centre                         2   11.26979         761     887.2582    0.0035711
age                            2    3.52566         759     883.7325    0.1715588
appear                         1    9.69100         758     874.0415    0.0018517
inflam                         1    0.00653         757     874.0350    0.9356046
centre:age                     4    7.42101         753     866.6140    0.1152433
centre:appear                  2    1.08077         751     865.5332    0.5825254
centre:inflam                  2    3.39128         749     862.1419    0.1834814
age:appear                     2    2.33029         747     859.8116    0.3118773
age:inflam                     2    0.06318         745     859.7484    0.9689052
appear:inflam                  1    0.24812         744     859.5003    0.6184041
centre:age:appear              4    2.04635         740     857.4540    0.7272344
centre:age:inflam              4    7.04411         736     850.4099    0.1335756
centre:appear:inflam           2    5.07840         734     845.3315    0.0789294
age:appear:inflam              2    4.34374         732     840.9877    0.1139642
centre:age:appear:inflam       3    0.01535         729     840.9724    0.9994964

The fitted model is simple enough in this case for the parameter estimates to be included here; they are shown in Table 3.16 in the form that a statistical package would present them.

Table 3.16
Coefficients:
  (Intercept)     centre2      centre3      appear
   1.080257    -0.6589141   -0.4944846   0.5157151

Using the estimates given in Table 3.16, the fitted model is

    \log\!\left(\frac{\hat p}{1 - \hat p}\right) = 1.0803 - 0.6589\, x_{centre2} - 0.4945\, x_{centre3} + 0.5157\, x_{appear}.   (3.23)
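To read (3.23) on the probability scale, invert the logit. For example, under the assumption that centre2, centre3 and appear are 0/1 indicators, a patient at the baseline centre with appear = 1 has linear predictor 1.0803 + 0.5157 = 1.5960, giving a fitted three-year survival probability of about 0.83:

```python
import numpy as np

eta = 1.0803 + 0.5157           # linear predictor from (3.23): baseline centre, appear = 1
p_hat = 1.0 / (1.0 + np.exp(-eta))
print(round(p_hat, 3))          # about 0.831
```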

3.2.5 Analysis of Survival Data

Survival data are data concerning how long it takes for a particular event to happen. In many medical applications the event is death of a patient with an illness, and so we are analysing the patient's survival time. In industrial applications the event is often the failure of a component in a machine.

The output in this sort of problem is the survival time. As with all the other problems that we have seen in this section, the task is to fit a regression model to describe the relationship between the output and some inputs. In the medical context, the inputs are usually qualities of the patient, such as age and sex, or are determined by the treatment given to the patient.

We will skip this topic.

3.3 Special Topics in Regression Modelling

- Multivariate Analysis of Variance
- Repeated Measures Data
- Random Effects Models

The topics in this section are special in the sense that they are extensions to the basic idea of regression modelling. The techniques have been developed in response to methods of data collection in which the usual assumptions of regression modelling are not justified.

3.3.1 Multivariate Analysis of Variance

Model:

    \underset{(p \times 1)}{\mathbf{y}_i} = \boldsymbol\beta_0 + \boldsymbol\beta_1 x_{1i} + \cdots + \boldsymbol\beta_k x_{ki} + \boldsymbol\epsilon_i, \qquad i = 1, \dots, n,   (3.26)

where the \boldsymbol\epsilon_i's are independently and identically distributed as N_p(\mathbf{0}, \Sigma) and n is the number of data points. The (p \times 1) under \mathbf{y}_i indicates the dimensions of the vector, in this case p rows and 1 column; the \boldsymbol\beta's are also (p \times 1) vectors.

This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the p dimensions of the output, one at a time.

Having fitted the model, we can obtain fitted values

    \hat{\mathbf{y}}_i = \hat{\boldsymbol\beta}_0 + \sum_{j=1}^{k} \hat{\boldsymbol\beta}_j x_{ji}, \qquad i = 1, \dots, n,

and hence residuals

    \mathbf{y}_i - \hat{\mathbf{y}}_i, \qquad i = 1, \dots, n.

The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be

    R = \sum_{i=1}^{n} (\mathbf{y}_i - \hat{\mathbf{y}}_i)(\mathbf{y}_i - \hat{\mathbf{y}}_i)^T.
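A small NumPy sketch, on simulated data, of fitting the multivariate linear model one output dimension at a time and forming the residual sums of squares and products matrix R defined above; the sizes and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

n, k, p = 100, 2, 3            # n data points, k inputs, p-dimensional output
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # (n, k+1)
B_true = rng.normal(size=(k + 1, p))                         # true coefficient vectors
Y = X @ B_true + rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Least squares fits all p output dimensions at once (equivalent to fitting
# a separate linear model to each column of Y).
B_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ B_hat                                        # (n, p) residual vectors

# Matrix of residual sums of squares and products: sum_i r_i r_i^T.
R = resid.T @ resid
print(R.shape)   # (p, p)
```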

3.3.2 Repeated Measures Data

Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals. Usually, the covariates are also observed at the same time points as the output; so the inputs are time-dependent too. Thus, as in Section 3.3.1, the output is a vector of measurements.

In principle, we can simply apply the techniques of Section 3.3.1 to analyse repeated measures data. Instead, we usually try to use the fact that we have the same set of variables (output and inputs) at several times, rather than a collection of different variables making up a vector output.

Repeated measures data are often called longitudinal data, especially in the social sciences. The term cross-sectional is often used to mean 'not longitudinal'.

3.3.3 Random Effects Models

Overdispersion

In a logistic regression we might replace (3.22) with

    \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \delta_i,   (3.29)

where the \delta_i's are independently and identically distributed as N(0, \sigma_\delta^2). We can think of \delta_i as representing either the effect of a missing input on p_i or simply as random variation in the success probabilities for individuals that have the same values for the input variables.
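A short simulation sketch of what (3.29) implies, with assumed parameter values: adding the random term \delta_i spreads the success probabilities of individuals who share the same inputs, so grouped counts show more variance than the plain binomial model predicts (overdispersion).

```python
import numpy as np

rng = np.random.default_rng(6)

n_groups, group_size = 200, 20
eta0 = 0.3                        # common linear predictor beta_0 + beta' x
sigma_delta = 1.0                 # standard deviation of the random effects

delta = rng.normal(0, sigma_delta, size=n_groups)
p = 1.0 / (1.0 + np.exp(-(eta0 + delta)))      # per-group success probabilities
successes = rng.binomial(group_size, p)        # observed counts per group

p_plain = 1.0 / (1.0 + np.exp(-eta0))
binomial_var = group_size * p_plain * (1 - p_plain)   # variance without delta
print(successes.var(), "vs plain binomial variance", binomial_var)
```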

Hierarchical models

In the turnip experiment, the growth of the turnips is affected by the different blocks, but the effects (the \gamma's) for each block are likely to be different in different years. So we could think of the \gamma's for each block as coming from a population of \gamma's for blocks. If we did this, then we could replace the model in (3.8) with

    y_i = \beta_0 + \alpha_B x_{Bi} + \cdots + \alpha_R x_{Ri}
          + \eta_{I} x_{Ii} + \eta_{II} x_{IIi} + \eta_{III} x_{IIIi} + \eta_{IV} x_{IVi} + \epsilon_i, \qquad i = 1, \dots, 64,   (3.30)

where \eta_{I}, \eta_{II}, \eta_{III} and \eta_{IV} are independently and identically distributed as N(0, \sigma_\eta^2).

3.4 Classical Multivariate Analysis

- Principal Components Analysis
- Correspondence Analysis
- Multidimensional Scaling
- Cluster Analysis and Mixture Decomposition
- Latent Variable and Covariance Structure Models

3.4.1 Principal Components Analysis

Principal components analysis is a way of transforming a set of p-dimensional vector observations, \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n, into another set of p-dimensional vectors, \mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_n. The \mathbf{y}'s have the property that most of their information content is stored in the first few dimensions (features).

This will allow dimensionality reduction, so that we can do things like:

- obtaining (informative) graphical displays of the data in 2-D;
- carrying out computer intensive methods on reduced data;
- gaining insight into the structure of the data, which was not apparent in p dimensions.

[Figure 3.3: Scatterplot matrix of Fisher's Iris Data (collected by Anderson); panels plot Sepal L., Sepal W., Petal L. and Petal W. against each other.]

The main idea behind principal components analysis is that high information corresponds to high variance. So, if we wanted to reduce the \mathbf{x}'s to a single dimension we would transform \mathbf{x} to

    y = \mathbf{a}^T \mathbf{x},

choosing \mathbf{a} so that y has the largest variance possible. It turns out that \mathbf{a} should be the eigenvector corresponding to the largest eigenvalue of the variance (covariance) matrix of \mathbf{x}, \Sigma.

It is also possible to show that, of all the directions orthogonal to the direction of highest variance, the (second) highest variance is in the direction parallel to the eigenvector of the second largest eigenvalue of \Sigma. These results extend all the way to p dimensions.

The estimate of \Sigma is

    S = \frac{1}{n - 1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T,   (3.31)

where \bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i.

The eigenvalues of S are \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0. The eigenvectors of S corresponding to \lambda_1, \lambda_2, ..., \lambda_p are \mathbf{a}_1, \mathbf{a}_2, ..., \mathbf{a}_p, respectively. The vectors \mathbf{a}_1, \mathbf{a}_2, ..., \mathbf{a}_p are called the principal axes (\mathbf{a}_1 is the first principal axis, etc.). The (p \times p) matrix whose j-th column is \mathbf{a}_j will be denoted A.

The principal axes (can be and) are chosen so that they are of length 1 and are orthogonal (perpendicular). Algebraically, this means that

    \mathbf{a}_j^T \mathbf{a}_k = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k. \end{cases}   (3.32)

The vector \mathbf{y} defined as

    \mathbf{y} = \begin{pmatrix} \mathbf{a}_1^T \\ \mathbf{a}_2^T \\ \vdots \\ \mathbf{a}_p^T \end{pmatrix} \mathbf{x} = A^T \mathbf{x}

is called the vector of principal component scores of \mathbf{x}. The j-th principal component score of \mathbf{x} is y_j = \mathbf{a}_j^T \mathbf{x}; sometimes the principal component scores are referred to as the principal components.

1. The elements of \mathbf{y} are uncorrelated and the sample variance of the j-th principal component score is \lambda_j. In other words, the sample variance matrix of \mathbf{y} is

    \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p).

2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of \mathbf{x}. That is,

    \sum_{j=1}^{p} \lambda_j = \sum_{j=1}^{p} s_j^2,

where s_j^2 is the sample variance of x_j.
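A NumPy sketch of the whole computation on simulated 4-dimensional data (an illustrative stand-in, in the spirit of the iris measurements): the sample covariance matrix S of (3.31), its eigenvalues and eigenvectors, the scores of the mean-centred data, and checks of the two properties above plus the proportion of variance accounted for.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated stand-in for a data matrix with n rows and p = 4 columns.
n, p = 150, 4
X = rng.multivariate_normal(np.zeros(p),
                            [[3.0, 1.0, 0.5, 0.2],
                             [1.0, 2.0, 0.3, 0.1],
                             [0.5, 0.3, 1.0, 0.0],
                             [0.2, 0.1, 0.0, 0.5]], size=n)

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / (n - 1)     # sample covariance matrix, (3.31)

lam, A = np.linalg.eigh(S)                  # eigenvalues (ascending) and eigenvectors
order = np.argsort(lam)[::-1]               # reorder so lambda_1 >= ... >= lambda_p
lam, A = lam[order], A[:, order]            # columns of A are the principal axes

Y = (X - xbar) @ A                          # principal component scores A^T x (centred x)

print(np.allclose(Y.var(axis=0, ddof=1), lam))                 # property 1
print(np.isclose(lam.sum(), X.var(axis=0, ddof=1).sum()))      # property 2
print("proportion of variance, first two PCs:", lam[:2].sum() / lam.sum())
```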

[Figure 3.4: Principal component scores (y1 to y4) for Fisher's Iris Data, plotted as a scatterplot matrix. Compare with Figure 3.3.]

Effective Dimensionality

1. The proportion of variance accounted for. Take the first r principal components and add up their variances. Divide by the sum of all the variances, to give

    \frac{\sum_{j=1}^{r} \lambda_j}{\sum_{j=1}^{p} \lambda_j},

which is called the proportion of variance accounted for by the first r principal components. Usually, projections accounting for over 75% of the total variance are considered to be good. Thus, a 2-D picture will be considered a reasonable representation if

    \frac{\lambda_1 + \lambda_2}{\sum_{j=1}^{p} \lambda_j} \ge 0.75.

2. The size of important variance. The idea here is to consider the variance if all directions were equally important. In this case the variances would be approximately

    \bar\lambda = \frac{1}{p} \sum_{j=1}^{p} \lambda_j.

The argument runs: if \lambda_j < \bar\lambda, then the j-th principal direction is less interesting than average, and this leads us to discard principal components that have sample variances below \bar\lambda.

3. Scree diagram. A scree diagram is an index plot of the principal component variances; in other words, it is a plot of \lambda_j against j. An example of a scree diagram, for the Iris Data, is shown in Figure 3.5.

[Figure 3.5: Scree diagram (\lambda_j against j) for the Iris Data.] We look for the elbow; in this case we only need the first component.

Normalising

The data can be normalised by carrying out the following steps.

- Centre each variable. In other words, subtract the mean of each variable to give \tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}.
- Divide each element of \tilde{\mathbf{x}}_i by its standard deviation; as a formula this means calculate

      z_{ji} = \frac{\tilde{x}_{ji}}{s_j},

  where s_j is the sample standard deviation of x_j.

[Figure 3.6: If we don't normalise. Two panels of the iris data plotting against Sepal W.: mean centred data (Petal L.) and scaled data (5 x Petal L.).]

Interpretation

The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components. See the book for an interpretation for Fisher's Iris Data.

3.4.2 Correspondence Analysis

Correspondence analysis is a way to represent the structure within incidence matrices. Incidence matrices are also called two-way contingency tables. An example of a 5 x 4 incidence matrix, with marginal totals, is shown in Table 3.17.

Table 3.17: Smoking Category by Staff Group.

Staff Group         None   Light   Medium   Heavy   Total
Senior Managers        4       2        3       2      11
Junior Managers        4       3        7       4      18
Senior Employees      25      10       12       4      51
Junior Employees      18      24       33      13      88
Secretaries           10       6        7       2      25
Total                 61      45       62      25     193

Two Stages

- Transform the values in a way that relates to a test for association between rows and columns (chi-squared test).
- Use a dimensionality reduction method to allow us to draw a picture of the relationships between rows and columns in 2-D.

Details are like principal components analysis mathematically; see the book.
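For the first stage, the chi-squared test of association on Table 3.17 can be reproduced directly; a sketch assuming SciPy is available (the observed table is copied from Table 3.17 without the marginal totals):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: staff groups; columns: None, Light, Medium, Heavy (Table 3.17).
observed = np.array([
    [ 4,  2,  3,  2],   # Senior Managers
    [ 4,  3,  7,  4],   # Junior Managers
    [25, 10, 12,  4],   # Senior Employees
    [18, 24, 33, 13],   # Junior Employees
    [10,  6,  7,  2],   # Secretaries
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)   # chi-squared statistic, its p-value, and (5-1)*(4-1) = 12 df
```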

3.4.3 Multidimensional Scaling

Multidimensional scaling is the process of converting a set of pairwise dissimilarities for a set of points into a set of co-ordinates for the points.

Examples of dissimilarities could be:

- the price of an airline ticket between pairs of cities;
- road distances between towns (as opposed to straight-line distances);
- a coefficient indicating how different the artefacts found in pairs of tombs within a graveyard are.

Classical Scaling

Classical scaling is also known as metric scaling and as principal co-ordinates analysis. The name 'metric' scaling is used because the dissimilarities are assumed to be distances, or in mathematical terms the measure of dissimilarity is the euclidean metric. The name 'principal co-ordinates analysis' is used because there is a link between this technique and principal components analysis. The name 'classical' is used because it was the first widely used method of multidimensional scaling, and pre-dates the availability of electronic computers.

The derivation of the method used to obtain the configuration is given in the book.
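The book's derivation is not reproduced here, but the standard recipe for classical scaling is short enough to sketch: double-centre the matrix of squared dissimilarities and take the leading eigenvectors, scaled by the square roots of their eigenvalues, as co-ordinates. A NumPy sketch on a small made-up distance matrix (this is the usual algorithm, not code from the chapter):

```python
import numpy as np

def classical_scaling(D, dims=2):
    """Classical (metric) scaling of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared dissimilarities
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:dims]    # largest eigenvalues first
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0.0))

# Made-up pairwise distances between four points (symmetric, zero diagonal).
D = np.array([[0.0, 2.0, 5.0, 6.0],
              [2.0, 0.0, 4.0, 5.0],
              [5.0, 4.0, 0.0, 2.0],
              [6.0, 5.0, 2.0, 0.0]])

coords = classical_scaling(D, dims=2)
print(coords)   # a 2-D configuration whose distances approximate D
```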

The results of applying classical scaling to British road distances are shown in Figure 3.7. These road distances correspond to the routes recommended by the Automobile Association; these recommended routes are intended to give the minimum travelling time, not the minimum journey distance. An effect of this, which is visible in Figure 3.7, is that the towns and cities have lined up in positions related to the motorway network. The map also features distortions from the geographical map, such as the position of Holyhead (holy), which appears to be much closer to Liverpool (lver) and Manchester than it really is, and the position of the Cornish peninsula (the part ending at Penzance, penz), which is further from Carmarthen (carm) than it is physically.

[Figure 3.7: Classical scaling of the British road distances; towns and cities (abbreviated, e.g. abdn, edin, lond, penz) plotted against the first two components.]

Ordinal Scaling

Ordinal scaling is used for the same purposes as classical scaling, but for dissimilarities that are not metric, that is, they are not what we would think of as distances. Ordinal scaling is sometimes called non-metric scaling, because the dissimilarities are not metric. Some people call it Shepard-Kruskal scaling, because Shepard and Kruskal are the names of two pioneers of ordinal scaling.

In ordinal scaling, we seek a configuration in which the pairwise distances between points have the same rank order as the corresponding dissimilarities. So, if \delta_{ij} is the dissimilarity between points i and j, and d_{ij} is the distance between the same points in the derived configuration, then we seek a configuration in which

    d_{ij} < d_{k\ell} \quad \text{if} \quad \delta_{ij} < \delta_{k\ell}.

3.4.4 Cluster Analysis and Mixture Decomposition

Cluster analysis and mixture decomposition are both techniques to do with identification of concentrations of individuals in a space.

Cluster Analysis

Cluster analysis is used to identify groups of individuals in a sample. The groups are not pre-defined, nor, usually, is the number of groups. The groups that are identified are referred to as clusters.

- hierarchical
  - agglomerative
  - divisive
- non-hierarchical

Distances between clusters can be defined in several ways (see the sketch after this list):

- Minimum distance, or single-link.
- Maximum distance, or complete-link.
- Average distance.
- Centroid distance defines the distance between two clusters as the squared distance between the mean vectors (that is, the centroids) of the two clusters.
- Sum of squared deviations defines the distance between two clusters as the sum of the squared distances of individuals from the joint centroid of the two clusters minus the sum of the squared distances of individuals from their separate cluster means.
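A brief sketch of agglomerative hierarchical clustering with these linkage choices, assuming SciPy is available; the data are simulated, and 'single', 'complete', 'average', 'centroid' and 'ward' are SciPy's closest equivalents of the criteria listed above (Ward's method corresponds to the sum-of-squared-deviations idea, and SciPy's centroid linkage uses the distance between centroids rather than its square).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(8)

# Simulated sample: two concentrations of individuals in 2-D.
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                       # agglomerative clustering
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the tree at 2 clusters
    print(method, np.bincount(labels)[1:])              # cluster sizes
```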

[Figure 3.8: Usual way to present results of hierarchical clustering; the vertical axis shows the distance between clusters at which individuals 1-9 are merged.]

Non-hierarchical clustering is essentially trying to partition the sample so as to optimize some measure of clustering. The choice of measure of clustering is usually based on properties of sums of squares and products matrices, like those met in Section 3.3.1, because the aim in MANOVA is to measure differences between groups.

The main difficulty here is that there are too many different ways to partition the sample for us to try them all, unless the sample is very small. Thus our only way, in general, of guaranteeing that the global optimum is achieved is to use a method such as branch-and-bound. One of the best known non-hierarchical clustering methods is the k-means method.
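A compact NumPy sketch of the k-means method (Lloyd's algorithm) on simulated data; the value of k, the initialisation and the data are illustrative choices, not specifics from the chapter.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centre as the mean of its assigned points
        # (keeping the old centre if a cluster happens to be empty).
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), rng.normal(3, 0.5, size=(30, 2))])
labels, centres = k_means(X, k=2)
print(np.bincount(labels), centres)
```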

Mixture Decomposition

Mixture decomposition is related to cluster analysis in that it is used to identify concentrations of individuals. The basic difference between cluster analysis and mixture decomposition is that there is an underlying statistical model in mixture decomposition, whereas there is no such model in cluster analysis. The probability density that has generated the sample data is assumed to be a mixture of several underlying distributions. So we have

    f(\mathbf{x}) = \sum_{j=1}^{c} \pi_j f_j(\mathbf{x}; \theta_j),

where c is the number of underlying distributions, the f_j's are the densities of the underlying distributions, the \theta_j's are the parameters of the underlying distributions, the \pi_j's are positive and sum to one, and f is the density from which the sample has been generated. Details are in one of Hand's books.
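A sketch of fitting a two-component Gaussian mixture in one dimension by the EM algorithm, written directly in NumPy; the component count, starting values and simulated data are illustrative assumptions (EM is the standard way such mixtures are estimated, though the chapter does not spell out an algorithm).

```python
import numpy as np

rng = np.random.default_rng(10)

# Simulated sample from a mixture: 0.4 N(-2, 1) + 0.6 N(3, 1.5^2).
x = np.concatenate([rng.normal(-2, 1.0, size=400), rng.normal(3, 1.5, size=600)])

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Crude starting values for the mixing proportions, means and std deviations.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(200):
    # E-step: responsibility of each component for each observation.
    dens = pi * np.column_stack([normal_pdf(x, mu[j], sigma[j]) for j in range(2)])
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing proportions, means and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)   # close to (0.4, 0.6), (-2, 3), (1, 1.5)
```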

3.4.5 Latent Variable and Covariance Structure Models

I have never used the techniques in this section, so I do not consider myself expert enough to give a presentation on them. Not enough time to cover everything.

3.5 Summary

The techniques presented in this chapter do not form anything like an exhaustive list of useful statistical methods. These techniques were chosen because they are either widely used or ought to be widely used.

The regression techniques are widely used, though there is some reluctance amongst researchers to make the jump from linear models to generalized linear models. The multivariate analysis techniques ought to be used more than they are. One of the main obstacles to the adoption of these techniques may be that their roots are in linear algebra.

I feel the techniques presented in this chapter, and their extensions, will remain or become the most widely used statistical techniques. This is why they were chosen for this chapter.