Post on 08-Dec-2016
transcript

Chapter 3: Statistical Methods

Paul C. Taylor
University of Hertfordshire
28th March 2001
3.1 Introduction

- Generalized Linear Models
- Special Topics in Regression Modelling
- Classical Multivariate Analysis
- Summary
3.2 Generalized Linear Models

- Regression
- Analysis of Variance
- Log-linear Models
- Logistic Regression
- Analysis of Survival Data
The fitting of generalized linear models is currently the most frequently applied statistical technique. Generalized linear models are used to describe the relationship between the mean, sometimes called the trend, of one variable and the values taken by several other variables.
3.2.1 Regression

How is a variable, y, related to one, or more, other variables, x1, x2, ..., xk?

Names for y: response; dependent variable; output.
Names for the xj's: regressors; explanatory variables; independent variables; inputs.

Here, we will use the terms output and inputs.
Common reasons for doing a regression analysis include:

- the output is expensive to measure, but the inputs are not, and so cheap predictions of the output are sought;
- the values of the inputs are known earlier than the output is, and a working prediction of the output is required;
- we can control the values of the inputs, we believe there is a causal link between the inputs and the output, and so we want to know what values of the inputs should be chosen to obtain a particular target value for the output;
- it is believed that there is a causal link between some of the inputs and the output, and we wish to identify which inputs are related to the output.
The (general) linear model is

    yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi,   i = 1, 2, ..., n,   (3.1)

where the εi's are independently and identically distributed as N(0, σ²) and n is the number of data points.

The model is linear in the β's:

    E(yi) = β0 + Σj βj xji.   (3.2)

(A weighted sum of the β's.)
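The least squares estimators for the single-input case of (3.1) have a closed form. A minimal pure-Python sketch on made-up data (the function name and numbers are illustrative):

```python
# Least-squares fit of the linear model (3.1) with a single input,
# y_i = b0 + b1*x_i + e_i, using the closed-form normal equations.

def fit_simple_linear(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # slope estimate
    b0 = my - b1 * mx       # intercept estimate
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.1, 8.9]   # roughly y = 1 + 2x
b0, b1 = fit_simple_linear(xs, ys)
```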
The main reasons for the use of the linear model:

- The maximum likelihood estimators of the β's are the same as the least squares estimators; see Section 2.4 of Chapter 2.
- Explicit formulae and rapid, reliable numerical methods exist for finding the least squares estimators of the β's.
- Many problems can be framed as general linear models. For example,

      E(y) = β0 + β1 x1 + β2 x2 + β3 x1x2 + β4 x1² + β5 x2²   (3.3)

  can be converted by setting x3 = x1x2, x4 = x1² and x5 = x2².
- Even when the linear model is not strictly appropriate, there is often a way to transform the output and/or the inputs, so that a linear model can provide useful information.
Non-linear Regression

Two examples are:

    yi = β0 exp(β1 xi) + εi,   i = 1, 2, ..., n,   (3.4)

    yi = β0 / (1 + β1 exp(β2 xi)) + εi,   i = 1, 2, ..., n,   (3.5)

where the εi's and n are as in (3.1). In both cases the mean is not a linear function of the β's.
Problems

1. Estimation is carried out using iterative methods which require good choices of starting values, might not converge, might converge to a local optimum rather than the global optimum, and will require human intervention to overcome these difficulties.

2. The statistical properties of the estimates and predictions from the model are not known, so we cannot perform statistical inference for non-linear regression.
Generalized Linear Models

The generalization is in two parts.

1. The distribution of the output does not have to be the normal, but can be any of the distributions in the exponential family.

2. Instead of the expected value of the output being a linear function of the β's, we have

       g(E(yi)) = β0 + Σj βj xji,   (3.6)

   where g(·) is a monotone differentiable function. The function g(·) is called the link function.

There is a reliable general algorithm for fitting generalized linear models.
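In most packages that reliable general algorithm is iteratively reweighted least squares (IRLS), equivalent here to Newton's method on the log-likelihood; the slides do not name it, so take that as an assumption. A minimal sketch for a Poisson GLM with log link, g(E(y)) = log(μ) = b0 + b1·x, on made-up data:

```python
# IRLS (Newton) fit of a Poisson GLM with log link on toy data.
import math

def irls_poisson(xs, ys, iters=50):
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate X'WX (weights w = mu for Poisson) and the score X'(y - mu).
        s0 = s1 = s2 = g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)   # fitted mean
            w = mu                       # Poisson variance equals the mean
            s0 += w; s1 += w * x; s2 += w * x * x
            g0 += (y - mu); g1 += (y - mu) * x
        det = s0 * s2 - s1 * s1
        # Solve the 2x2 Newton system (X'WX) delta = X'(y - mu).
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (s0 * g1 - s1 * g0) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2, 3, 4, 6]
b0, b1 = irls_poisson(xs, ys)
```

At convergence the score equations hold, so the fitted means reproduce the observed totals.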
Generalized Additive Models

Generalized additive models are a generalization of generalized linear models. The generalization is that g(E(yi)) need not be a linear function of a set of β's, but has the form

    g(E(yi)) = Σj fj(xji),   (3.7)

where the fj's are arbitrary, usually smooth, functions.

An example of the model produced using a type of scatterplot smoother is shown in Figure 3.1.
[Figure 3.1: Diabetes Data, spline smooth (df = 3) of log C-peptide against age.]
Methods for fitting generalized additive models exist and are generally reliable. The main drawback is that the framework of statistical inference that is available for generalized linear models has not yet been developed for generalized additive models. Despite this drawback, generalized additive models can already be fitted by several of the major statistical packages.
3.2.2 Analysis of Variance

The analysis of variance, or ANOVA, is primarily a method of identifying which of the β's in a linear model are non-zero. This technique was developed for the analysis of agricultural field experiments, but is now used quite generally.

Example 27 Turnips for Winter Fodder. The data in Table 3.1 are from an experiment to investigate the growth of turnips. These types of turnips would be grown to provide food for farm animals in winter. The turnips were harvested and weighed by staff and students of the Departments of Agriculture and Applied Statistics of The University of Reading, in October 1990.
Table 3.1
          Treatments                              Blocks
Variety   Date      Density   Label    I      II     III    IV
Barkant   21/8/90   1 kg/ha   A        2.7    1.4    1.2    3.8
                    2 kg/ha   B        7.3    3.8    3.0    1.2
                    4 kg/ha   C        6.5    4.6    4.7    0.8
                    8 kg/ha   D        8.2    4.0    6.0    2.5
          28/8/90   1 kg/ha   E        4.4    0.4    6.5    3.1
                    2 kg/ha   F        2.6    7.1    7.0    3.2
                    4 kg/ha   G       24.0   14.9   14.6    2.6
                    8 kg/ha   H       12.2   18.9   15.6    9.9
Marco     21/8/90   1 kg/ha   J        1.2    1.3    1.5    1.0
                    2 kg/ha   K        2.2    2.0    2.1    2.5
                    4 kg/ha   L        2.2    6.2    5.7    0.6
                    8 kg/ha   M        4.0    2.8   10.8    3.1
          28/8/90   1 kg/ha   N        2.5    1.6    1.3    0.3
                    2 kg/ha   P        5.5    1.2    2.0    0.9
                    4 kg/ha   Q        4.7   13.2    9.0    2.9
                    8 kg/ha   R       14.9   13.3    9.3    3.6
The following linear model,

    yi = α xAi + β xBi + ... + ρ xRi + βII xIIi + βIII xIIIi + βIV xIVi + εi,   i = 1, 2, ..., 64,   (3.8)

or an equivalent one, could be fitted to these data. The inputs take the values 0 or 1 and are usually called dummy or indicator variables.

On first sight, (3.8) should also include a constant term and a term for block I, but we do not need them.
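Building the 0/1 indicator variables of (3.8) is mechanical; a minimal sketch (the helper name and example labels are hypothetical, and dropping the first level mirrors the omission of the block I term):

```python
# Map a categorical factor to 0/1 indicator (dummy) columns.

def indicators(levels, drop_first=False):
    """Return a dict of 0/1 indicator columns, one per category level."""
    cats = sorted(set(levels))
    if drop_first:
        cats = cats[1:]   # avoid redundancy with a constant term
    return {c: [1 if v == c else 0 for v in levels] for c in cats}

blocks = ["I", "II", "III", "IV", "I", "II", "III", "IV"]
cols = indicators(blocks, drop_first=True)
```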
The first question that we would try to answer about these data is

    Does a change in treatment produce a change in the turnip yield?

which is equivalent to asking

    Are any of α, β, ..., ρ non-zero?

which is the sort of question that can be answered using ANOVA.
This is how the ANOVA works. Recall the general linear model of (3.1),

    yi = β0 + β1 x1i + ... + βk xki + εi,   i = 1, 2, ..., n.

The estimate of βj is β̂j. Fitted values:

    ŷi = β̂0 + Σj β̂j xji.   (3.9)

Residuals:

    ei = yi − ŷi.   (3.10)

The size of the residuals is related to the size of σ², the variance of the εi's. It turns out that we can estimate σ² by

    s² = Σi (yi − ŷi)² / (n − k − 1).   (3.11)
18
The
keyfacts
about? �
isthatallow
usto
compare
differentlinearm
odelsare:
�if
thefitted
modelis
adequate(‘the
rightone’),
then? �
isa
goodestim
ateof�
�;
�
ifthe
fittedm
odelincludesredundant
terms
(thatis
includessom
e
’sthat
arereally
zero),then? �
isstilla
goodestim
ateof�
�;
�
ifthefitted
modeldoes
notincludeone
orm
oreinputs
thatitoughtto,then
? �
willtend
tobe
largerthan
thetrue
valueof�
�.
So
ifw
eom
ita
usefulinput
fromour
model,
theestim
ateof�
�
will
shootup,
whereas
ifwe
omita
redundantinputfromour
model,the
estimate
of��
shouldnotchange
much.
Note
thatomitting
oneofthe
inputsfrom
them
odelisequiv-
alenttoforcing
thecorresponding
tobe
zero.
19
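These facts are what make nested-model comparisons work: fit the model with and without an input, form s² from (3.11), and see how much the residual sum of squares drops. A minimal sketch on made-up data, using the usual F ratio for one extra term:

```python
# Compare an intercept-only model with an intercept-plus-slope model.

def rss_intercept_only(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_with_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(xs)
rss0 = rss_intercept_only(ys)     # model omitting the input
rss1 = rss_with_slope(xs, ys)     # model including the input
s2 = rss1 / (n - 2)               # s^2 from (3.11), with k = 1
f_stat = (rss0 - rss1) / 1 / s2   # large F means the input is useful
```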
Example 28 Turnips for Winter Fodder continued. Let the full model be the model at (3.8), and the reduced model be the following model:

    yi = βII xIIi + βIII xIIIi + βIV xIVi + εi,   i = 1, 2, ..., 64.   (3.18)

So, the reduced model is the special case of the full model in which all of α, β, ..., ρ are zero.

Table 3.2
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   2.278016   0.08867543
Residuals   60   1437.538    23.95897

Table 3.3
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   5.690430   0.002163810
treat       15   1005.927    67.06182   6.991906   0.000000171
Residuals   45    431.611     9.59135
Table 3.4 shows the ANOVA that would usually be produced for the turnip data. Notice that the 'block' and 'Residuals' rows are the same as in Table 3.3. The basic difference between Tables 3.3 and 3.4 is that the treatment information is broken down into its constituent parts in Table 3.4.

Table 3.4
                          Df   Sum of Sq   Mean Sq     F Value    Pr(F)
block                      3    163.7367    54.5789    5.69043    0.0021638
variety                    1     83.9514    83.9514    8.75282    0.0049136
sowing                     1    233.7077   233.7077   24.36650    0.0000114
density                    3    470.3780   156.7927   16.34730    0.0000003
variety:sowing             1     36.4514    36.4514    3.80045    0.0574875
variety:density            3      8.6467     2.8822    0.30050    0.8248459
sowing:density             3    154.7930    51.5977    5.37960    0.0029884
variety:sowing:density     3     17.9992     5.9997    0.62554    0.6022439
Residuals                 45    431.6108     9.5914
3.2.3 Log-linear Models

The data shown in Table 3.7 show the sort of problem attacked by log-linear modelling. There are five categorical variables displayed in Table 3.7:

centre     one of three health centres for the treatment of breast cancer;
age        the age of the patient when her breast cancer was diagnosed;
survived   whether the patient survived for at least three years from diagnosis;
appear     appearance of the patient's tumour, either malignant or benign;
inflam     amount of inflammation of the tumour, either minimal or greater.
Table 3.7
                                        State of Tumour
                           Minimal Inflammation    Greater Inflammation
Centre     Age        Survived  Malignant  Benign   Malignant  Benign
Tokyo      Under 50   No            9        7          4        3
                      Yes          26       68         25        9
           50-69      No            9        9         11        2
                      Yes          20       46         18        5
           70 or over No            2        3          1        0
                      Yes           1        6          5        1
Boston     Under 50   No            6        7          6        0
                      Yes          11       24          4        0
           50-69      No            8       20          3        2
                      Yes          18       58         10        3
           70 or over No            9       18          3        0
                      Yes          15       26          1        1
Glamorgan  Under 50   No           16        7          3        0
                      Yes          16       20          8        1
           50-69      No           14       12          3        0
                      Yes          27       39         10        4
           70 or over No            3        7          3        0
                      Yes          12       11          4        1
For these data, the output is the number of patients in each cell. The model is

    yi ~ Poisson(λi),   log(λi) = β0 + β1 x1i + ... + βk xki.   (3.21)

Since all the variables of interest are categorical, we need to use indicator variables as inputs in the same way as in (3.8).
Table 3.8  Terms added sequentially (first to last)
                    Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                    71       860.0076
centre               2     9.3619       69       850.6457    0.0092701
age                  2   105.5350       67       745.1107    0.0000000
survived             1   160.6009       66       584.5097    0.0000000
inflam               1   291.1986       65       293.3111    0.0000000
appear               1     7.5727       64       285.7384    0.0059258
centre:age           4    76.9628       60       208.7756    0.0000000
centre:survived      2    11.2698       58       197.5058    0.0035711
centre:inflam        2    23.2484       56       174.2574    0.0000089
centre:appear        2    13.3323       54       160.9251    0.0012733
age:survived         2     3.5257       52       157.3995    0.1715588
age:inflam           2     0.2930       50       157.1065    0.8637359
age:appear           2     1.2082       48       155.8983    0.5465675
survived:inflam      1     0.9645       47       154.9338    0.3260609
survived:appear      1     9.6709       46       145.2629    0.0018721
inflam:appear        1    95.4381       45        49.8248    0.0000000
To summarise this model, I would construct its conditional independence graph and present tables corresponding to the interactions. Tables are in the book. The conditional independence graph is shown in Figure 3.2.

[Figure 3.2: conditional independence graph linking age, centre, survived, inflam and appear.]
3.2.4 Logistic Regression

In logistic regression, the output is the number of successes out of a number of trials, each trial resulting in either a success or a failure. For the breast cancer data, we can regard each patient as a 'trial', with success corresponding to the patient surviving for three years. The output would simply be given as number of successes, either 0 or 1, for each of the 764 patients involved in the study.

The model that we will fit is

    yi ~ Binomial(ni, pi)

and

    log(pi / (1 − pi)) = β0 + β1 x1i + ... + βk xki.   (3.22)

Again, the inputs here will be indicators for the breast cancer data, but this is not generally true; there is no reason why any of the inputs should not be quantitative.
Table 3.15
                             Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                             763      898.5279
centre                        2   11.26979       761      887.2582    0.0035711
age                           2    3.52566       759      883.7325    0.1715588
appear                        1    9.69100       758      874.0415    0.0018517
inflam                        1    0.00653       757      874.0350    0.9356046
centre:age                    4    7.42101       753      866.6140    0.1152433
centre:appear                 2    1.08077       751      865.5332    0.5825254
centre:inflam                 2    3.39128       749      862.1419    0.1834814
age:appear                    2    2.33029       747      859.8116    0.3118773
age:inflam                    2    0.06318       745      859.7484    0.9689052
appear:inflam                 1    0.24812       744      859.5003    0.6184041
centre:age:appear             4    2.04635       740      857.4540    0.7272344
centre:age:inflam             4    7.04411       736      850.4099    0.1335756
centre:appear:inflam          2    5.07840       734      845.3315    0.0789294
age:appear:inflam             2    4.34374       732      840.9877    0.1139642
centre:age:appear:inflam      3    0.01535       729      840.9724    0.9994964
The fitted model is simple enough in this case for the parameter estimates to be included here; they are shown in Table 3.16, in the form that a statistical package would present them.

Table 3.16
Coefficients:
  (Intercept)     centre2      centre3      appear
   1.080257    -0.6589141   -0.4944846   0.5157151

Using the estimates given in Table 3.16, the fitted model is

    log(p̂i / (1 − p̂i)) = 1.080257 − 0.6589141 centre2i − 0.4944846 centre3i + 0.5157151 appeari.   (3.23)
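Given coefficients like those in Table 3.16, fitted survival probabilities follow by inverting the logit. A small sketch (the helper name and the 0/1 coding of the indicators are mine; the coefficient values are those of Table 3.16):

```python
# Turn the linear predictor of (3.23) into a fitted probability.
import math

COEF = {"(Intercept)": 1.080257, "centre2": -0.6589141,
        "centre3": -0.4944846, "appear": 0.5157151}

def survival_prob(centre2, centre3, appear):
    eta = (COEF["(Intercept)"] + COEF["centre2"] * centre2
           + COEF["centre3"] * centre3 + COEF["appear"] * appear)
    return 1.0 / (1.0 + math.exp(-eta))   # inverse logit

# Baseline cell: first centre, appear indicator 0.
p_base = survival_prob(0, 0, 0)
```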
3.2.5 Analysis of Survival Data

Survival data are data concerning how long it takes for a particular event to happen. In many medical applications the event is the death of a patient with an illness, and so we are analysing the patient's survival time. In industrial applications the event is often the failure of a component in a machine.

The output in this sort of problem is the survival time. As with all the other problems that we have seen in this section, the task is to fit a regression model to describe the relationship between the output and some inputs. In the medical context, the inputs are usually qualities of the patient, such as age and sex, or are determined by the treatment given to the patient.

We will skip this topic.
3.3 Special Topics in Regression Modelling

- Multivariate Analysis of Variance
- Repeated Measures Data
- Random Effects Models

The topics in this section are special in the sense that they are extensions to the basic idea of regression modelling. The techniques have been developed in response to methods of data collection in which the usual assumptions of regression modelling are not justified.
3.3.1 Multivariate Analysis of Variance

Model:

    yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi,   i = 1, 2, ..., n,   (3.26)

where the εi's are independently and identically distributed as N(0, Σ) and n is the number of data points. Here yi is a (p×1) vector, that is, p rows and 1 column; the β's are also (p×1) vectors.

This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the p dimensions of the output, one at a time.
Having fitted the model, we can obtain fitted values

    ŷi = β̂0 + Σj β̂j xji,   i = 1, 2, ..., n,

and hence residuals

    yi − ŷi,   i = 1, 2, ..., n.

The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be

    R = Σi (yi − ŷi)(yi − ŷi)ᵀ.
3.3.2 Repeated Measures Data

Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals. Usually, the covariates are also observed at the same time points as the output; so the inputs are time-dependent too. Thus, as in Section 3.3.1 the output is a vector of measurements. In principle, we can simply apply the techniques of Section 3.3.1 to analyse repeated measures data. Instead, we usually try to use the fact that we have the same set of variables (output and inputs) at several times, rather than a collection of different variables making up a vector output.

Repeated measures data are often called longitudinal data, especially in the social sciences. The term cross-sectional is often used to mean 'not longitudinal'.
3.3.3 Random Effects Models

Overdispersion

In a logistic regression we might replace (3.22) with

    log(pi / (1 − pi)) = β0 + β1 x1i + ... + βk xki + ui,   (3.29)

where the ui's are independently and identically distributed as N(0, σu²). We can think of ui as representing either the effect of a missing input on pi, or simply as random variation in the success probabilities for individuals that have the same values for the input variables.
Hierarchical Models

In the turnip experiment, the growth of the turnips is affected by the different blocks, but the effects (the β's) for each block are likely to be different in different years. So we could think of the β's for each block as coming from a population of β's for blocks. If we did this, then we could replace the model in (3.8) with

    yi = α xAi + β xBi + ... + ρ xRi + bI xIi + bII xIIi + bIII xIIIi + bIV xIVi + εi,   i = 1, 2, ..., 64,   (3.30)

where bI, bII, bIII and bIV are independently and identically distributed as N(0, σb²).
3.4 Classical Multivariate Analysis

- Principal Components Analysis
- Correspondence Analysis
- Multidimensional Scaling
- Cluster Analysis and Mixture Decomposition
- Latent Variable and Covariance Structure Models
3.4.1 Principal Components Analysis

Principal components analysis is a way of transforming a set of p-dimensional vector observations, x1, x2, ..., xn, into another set of p-dimensional vectors, y1, y2, ..., yn. The y's have the property that most of their information content is stored in the first few dimensions (features).

This will allow dimensionality reduction, so that we can do things like:

- obtaining (informative) graphical displays of the data in 2-D;
- carrying out computer-intensive methods on reduced data;
- gaining insight into the structure of the data, which was not apparent in p dimensions.
[Figure 3.3: scatterplot matrix of Fisher's Iris Data (collected by Anderson): Sepal L., Sepal W., Petal L., Petal W.]
The main idea behind principal components analysis is that high information corresponds to high variance.

So, if we wanted to reduce the x's to a single dimension, we would transform x to

    y = aᵀx,

choosing a so that y has the largest variance possible. It turns out that a should be the eigenvector corresponding to the largest eigenvalue of the variance (covariance) matrix of x, Σ.

It is also possible to show that, of all the directions orthogonal to the direction of highest variance, the (second) highest variance is in the direction parallel to the eigenvector of the second largest eigenvalue of Σ. These results extend all the way to p dimensions.
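One way to compute the direction of highest variance is power iteration on the sample covariance matrix; the slides do not prescribe an algorithm, so treat the method choice as an assumption. A minimal 2-D sketch on made-up data:

```python
# First principal axis of 2-D data via power iteration on the
# sample covariance matrix (pure Python, toy data).

def first_principal_axis(pts, iters=200):
    n = len(pts)
    m1 = sum(p[0] for p in pts) / n
    m2 = sum(p[1] for p in pts) / n
    # Entries of the 2x2 sample covariance matrix S.
    s11 = sum((p[0] - m1) ** 2 for p in pts) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in pts) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pts) / (n - 1)
    # Power iteration converges to the eigenvector of the largest eigenvalue.
    a = (1.0, 0.0)
    for _ in range(iters):
        w = (s11 * a[0] + s12 * a[1], s12 * a[0] + s22 * a[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        a = (w[0] / norm, w[1] / norm)
    return a

# Points scattered along the 45-degree line, so the axis is near (1,1)/sqrt(2).
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.9), (4, 4.1)]
axis = first_principal_axis(pts)
```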
The estimate of Σ is

    S = (1/(n−1)) Σi (xi − x̄)(xi − x̄)ᵀ,   (3.31)

where x̄ = (1/n) Σi xi.

- The eigenvalues of S are λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.
- The eigenvectors of S corresponding to λ1, λ2, ..., λp are a1, a2, ..., ap, respectively.
- The vectors a1, a2, ..., ap are called the principal axes. (a1 is the first principal axis, etc.)
- The (p×p) matrix whose jth column is aj will be denoted as A.
The principal axes can be (and are) chosen so that they are of length 1 and are orthogonal (perpendicular). Algebraically, this means that

    ajᵀ ak = 1 if j = k, and 0 if j ≠ k.   (3.32)

The vector y, whose rows are the projections of x onto the principal axes, is defined as

    y = Aᵀ x

and is called the vector of principal component scores of x. The jth principal component score of x is yj = ajᵀ x; sometimes the principal component scores are referred to as the principal components.
1. The elements of y are uncorrelated, and the sample variance of the jth principal component score is λj. In other words, the sample variance matrix of y is

       diag(λ1, λ2, ..., λp).

2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of x. That is,

       Σj λj = Σj sj²,

   where sj² is the sample variance of xj.
[Figure 3.4: principal component scores (y1, y2, y3, y4) for Fisher's Iris Data. Compare with Figure 3.3.]
Effective Dimensionality

1. The proportion of variance accounted for. Take the first r principal components and add up their variances. Divide by the sum of all the variances, to give

       (Σ_{j=1..r} λj) / (Σ_{j=1..p} λj),

   which is called the proportion of variance accounted for by the first r principal components.

   Usually, projections accounting for over 75% of the total variance are considered to be good. Thus, a 2-D picture will be considered a reasonable representation if

       (λ1 + λ2) / (Σ_{j=1..p} λj) ≥ 0.75.
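The proportion of variance accounted for is a one-line computation on the eigenvalues; a small sketch with hypothetical λ values:

```python
# Proportion of variance accounted for by the first r principal components.

def prop_variance(eigvals, r):
    return sum(eigvals[:r]) / sum(eigvals)

lams = [4.2, 0.24, 0.08, 0.02]   # hypothetical eigenvalues, sorted descending
p2 = prop_variance(lams, 2)      # proportion captured by a 2-D view
```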
2. The size of important variance. The idea here is to consider the variance if all directions were equally important. In this case the variances would be approximately

       λ̄ = (1/p) Σj λj.

   The argument runs: if λj < λ̄, then the jth principal direction is less interesting than average; and this leads us to discard principal components that have sample variances below λ̄.

3. Scree diagram. A scree diagram is an index plot of the principal component variances. In other words, it is a plot of λj against j. An example of a scree diagram, for the Iris Data, is shown in Figure 3.5.
[Figure 3.5: scree diagram (λj against j) for the Iris Data.] We look for the elbow; in this case we only need the first component.
Normalising

The data can be normalised by carrying out the following steps.

- Centre each variable. In other words, subtract the mean of each variable to give

      x̃j = xj − x̄j.

- Divide each element of x̃ by its standard deviation; as a formula, this means calculate

      zj = x̃j / sj,

  where sj is the sample standard deviation of xj.
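The two steps can be sketched as follows (made-up numbers):

```python
# Normalise one variable: centre it, then divide by the sample sd.

def normalise(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5   # sample sd
    return [(x - m) / s for x in xs]

zs = normalise([2.0, 4.0, 6.0, 8.0])   # result has mean 0, variance 1
```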
[Figure 3.6: if we don't normalise. Two panels of Petal L. against Sepal W.: 'Mean Centred Data' and 'Scaled Data' (with Petal L. multiplied by 5).]
Interpretation

The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components. See the book for an interpretation for Fisher's Iris Data.
3.4.2 Correspondence Analysis

Correspondence analysis is a way to represent the structure within incidence matrices. Incidence matrices are also called two-way contingency tables. An example of a (5×4) incidence matrix, with marginal totals, is shown in Table 3.17.

Table 3.17
                         Smoking Category
Staff Group          None   Light   Medium   Heavy   Total
Senior Managers        4      2       3        2       11
Junior Managers        4      3       7        4       18
Senior Employees      25     10      12        4       51
Junior Employees      18     24      33       13       88
Secretaries           10      6       7        2       25
Total                 61     45      62       25      193
Two Stages

- Transform the values in a way that relates to a test for association between rows and columns (chi-squared test).
- Use a dimensionality reduction method to allow us to draw a picture of the relationships between rows and columns in 2-D.

Details are like principal components analysis mathematically; see the book.
3.4.3 Multidimensional Scaling

Multidimensional scaling is the process of converting a set of pairwise dissimilarities for a set of points into a set of co-ordinates for the points.

Examples of dissimilarities could be:

- the price of an airline ticket between pairs of cities;
- road distances between towns (as opposed to straight-line distances);
- a coefficient indicating how different the artefacts found in pairs of tombs within a graveyard are.
Classical Scaling

Classical scaling is also known as metric scaling and as principal co-ordinates analysis. The name 'metric' scaling is used because the dissimilarities are assumed to be distances, or in mathematical terms the measure of dissimilarity is the euclidean metric. The name 'principal co-ordinates analysis' is used because there is a link between this technique and principal components analysis. The name 'classical' is used because it was the first widely used method of multidimensional scaling, and pre-dates the availability of electronic computers.

The derivation of the method used to obtain the configuration is given in the book.
The results of applying classical scaling to British road distances are shown in Figure 3.7. These road distances correspond to the routes recommended by the Automobile Association; these recommended routes are intended to give the minimum travelling time, not the minimum journey distance.

- An effect of this, visible in Figure 3.7, is that the towns and cities have lined up in positions related to the motorway network.
- The map also features distortions from the geographical map, such as the position of Holyhead (holy), which appears to be much closer to Liverpool (lver) and Manchester than it really is, and the position of the Cornish peninsula (the part ending at Penzance, penz), which is further from Carmarthen (carm) than it is physically.
[Figure 3.7: classical scaling configuration (components 1 and 2) for British road distances, with towns labelled by four-letter codes (abdn, abry, ..., lond, ..., york).]
Ordinal Scaling

Ordinal scaling is used for the same purposes as classical scaling, but for dissimilarities that are not metric, that is, they are not what we would think of as distances. Ordinal scaling is sometimes called non-metric scaling, because the dissimilarities are not metric. Some people call it Shepard-Kruskal scaling, because Shepard and Kruskal are the names of two pioneers of ordinal scaling.

In ordinal scaling, we seek a configuration in which the pairwise distances between points have the same rank order as the corresponding dissimilarities. So, if δrs is the dissimilarity between points r and s, and drs is the distance between the same points in the derived configuration, then we seek a configuration in which

    drs < dtu   if   δrs < δtu.
3.4.4 Cluster Analysis and Mixture Decomposition

Cluster analysis and mixture decomposition are both techniques to do with identification of concentrations of individuals in a space.
Cluster Analysis

Cluster analysis is used to identify groups of individuals in a sample. The groups are not pre-defined, nor, usually, is the number of groups. The groups that are identified are referred to as clusters.

- hierarchical
  - agglomerative
  - divisive
- non-hierarchical
- Minimum distance or single-link.
- Maximum distance or complete-link.
- Average distance.
- Centroid distance defines the distance between two clusters as the squared distance between the mean vectors (that is, the centroids) of the two clusters.
- Sum of squared deviations defines the distance between two clusters as the sum of the squared distances of individuals from the joint centroid of the two clusters, minus the sum of the squared distances of individuals from their separate cluster means.
[Figure 3.8: dendrogram (distance between clusters plotted against nine individuals), the usual way to present results of hierarchical clustering.]
Non-hierarchical clustering is essentially trying to partition the sample so as to optimize some measure of clustering. The choice of measure of clustering is usually based on properties of sums of squares and products matrices, like those met in Section 3.3.1, because the aim in the MANOVA is to measure differences between groups.

The main difficulty here is that there are too many different ways to partition the sample for us to try them all, unless the sample is very small. Thus our only way, in general, of guaranteeing that the global optimum is achieved is to use a method such as branch-and-bound.

One of the best known non-hierarchical clustering methods is the k-means method.
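The k-means method alternates two steps: assign each point to its nearest cluster centre, then move each centre to the mean of its points. A minimal 1-D sketch on made-up data with k = 2:

```python
# k-means on 1-D data: alternate assignment and centre-update steps.

def kmeans_1d(xs, centres, iters=20):
    for _ in range(iters):
        groups = [[] for _ in centres]
        for x in xs:
            # Assign x to the nearest centre.
            j = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            groups[j].append(x)
        # Move each centre to the mean of its assigned points.
        centres = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centres)]
    return centres

xs = [1.0, 1.2, 0.8, 9.0, 9.4, 8.6]   # two obvious concentrations
centres = kmeans_1d(xs, [0.0, 5.0])
```

Note that, as the slides say, this only guarantees a local optimum: the result can depend on the starting centres.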
Mixture Decomposition

Mixture decomposition is related to cluster analysis in that it is used to identify concentrations of individuals. The basic difference between cluster analysis and mixture decomposition is that there is an underlying statistical model in mixture decomposition, whereas there is no such model in cluster analysis. The probability density that has generated the sample data is assumed to be a mixture of several underlying distributions. So we have

    f(x) = Σ_{j=1..c} πj gj(x; θj),

where c is the number of underlying distributions, the gj's are the densities of the underlying distributions, the θj's are the parameters of the underlying distributions, the πj's are positive and sum to one, and f is the density from which the sample has been generated.

Details are in one of Hand's books.
3.4.5 Latent Variable and Covariance Structure Models

I have never used the techniques in this section, so I do not consider myself expert enough to give a presentation on them. Not enough time to cover everything.
3.5 Summary

The techniques presented in this chapter do not form anything like an exhaustive list of useful statistical methods. These techniques were chosen because they are either widely used or ought to be widely used.

The regression techniques are widely used, though there is some reluctance amongst researchers to make the jump from linear models to generalized linear models. The multivariate analysis techniques ought to be used more than they are. One of the main obstacles to the adoption of these techniques may be that their roots are in linear algebra.

I feel the techniques presented in this chapter, and their extensions, will remain or become the most widely used statistical techniques. This is why they were chosen for this chapter.