Post on 08-Dec-2016
transcript

Chapter 3: Statistical Methods

Paul C. Taylor
University of Hertfordshire
28th March 2001
3.1 Introduction

- Generalized Linear Models
- Special Topics in Regression Modelling
- Classical Multivariate Analysis
- Summary
3.2 Generalized Linear Models

- Regression
- Analysis of Variance
- Log-linear Models
- Logistic Regression
- Analysis of Survival Data
The fitting of generalized linear models is currently the most frequently applied statistical technique. Generalized linear models are used to describe the relationship between the mean, sometimes called the trend, of one variable and the values taken by several other variables.
3.2.1 Regression

How is a variable, y, related to one, or more, other variables, x1, x2, ..., xk?

Names for y: response; dependent variable; output.
Names for the xj's: regressors; explanatory variables; independent variables; inputs.

Here, we will use the terms output and inputs.
Common reasons for doing a regression analysis include:

- the output is expensive to measure, but the inputs are not, and so cheap predictions of the output are sought;
- the values of the inputs are known earlier than the output is, and a working prediction of the output is required;
- we can control the values of the inputs, we believe there is a causal link between the inputs and the output, and so we want to know what values of the inputs should be chosen to obtain a particular target value for the output;
- it is believed that there is a causal link between some of the inputs and the output, and we wish to identify which inputs are related to the output.
The (general) linear model is

    yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi,   i = 1, 2, ..., n,   (3.1)

where the εi's are independently and identically distributed as N(0, σ²) and n is the number of data points.

The model is linear in the β's:

    E(yi) = β0 + Σj βj xji.   (3.2)

(A weighted sum of the β's.)
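The least squares estimators for the single-input case of (3.1) have a closed form. A minimal pure-Python sketch on made-up data (the function name and numbers are illustrative):

```python
# Least-squares fit of the linear model (3.1) with a single input,
# y_i = b0 + b1*x_i + e_i, using the closed-form normal equations.

def fit_simple_linear(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # slope estimate
    b0 = my - b1 * mx       # intercept estimate
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.1, 8.9]   # roughly y = 1 + 2x
b0, b1 = fit_simple_linear(xs, ys)
```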
The main reasons for the use of the linear model:

- The maximum likelihood estimators of the β's are the same as the least squares estimators; see Section 2.4 of Chapter 2.
- Explicit formulae and rapid, reliable numerical methods exist for finding the least squares estimators of the β's.
- Many problems can be framed as general linear models. For example,

      E(y) = β0 + β1 x1 + β2 x2 + β3 x1x2 + β4 x1² + β5 x2²   (3.3)

  can be converted by setting x3 = x1x2, x4 = x1² and x5 = x2².
- Even when the linear model is not strictly appropriate, there is often a way to transform the output and/or the inputs, so that a linear model can provide useful information.
Non-linear Regression

Two examples are:

    yi = β0 exp(β1 xi) + εi,   i = 1, 2, ..., n,   (3.4)

    yi = β0 / (1 + β1 exp(β2 xi)) + εi,   i = 1, 2, ..., n,   (3.5)

where the εi's and n are as in (3.1). In both cases the mean is not a linear function of the β's.
Problems

1. Estimation is carried out using iterative methods which require good choices of starting values, might not converge, might converge to a local optimum rather than the global optimum, and will require human intervention to overcome these difficulties.

2. The statistical properties of the estimates and predictions from the model are not known, so we cannot perform statistical inference for non-linear regression.
Generalized Linear Models

The generalization is in two parts.

1. The distribution of the output does not have to be the normal, but can be any of the distributions in the exponential family.

2. Instead of the expected value of the output being a linear function of the β's, we have

       g(E(yi)) = β0 + Σj βj xji,   (3.6)

   where g(·) is a monotone differentiable function. The function g(·) is called the link function.

There is a reliable general algorithm for fitting generalized linear models.
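In most packages that reliable general algorithm is iteratively reweighted least squares (IRLS), equivalent here to Newton's method on the log-likelihood; the slides do not name it, so take that as an assumption. A minimal sketch for a Poisson GLM with log link, g(E(y)) = log(μ) = b0 + b1·x, on made-up data:

```python
# IRLS (Newton) fit of a Poisson GLM with log link on toy data.
import math

def irls_poisson(xs, ys, iters=50):
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate X'WX (weights w = mu for Poisson) and the score X'(y - mu).
        s0 = s1 = s2 = g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            mu = math.exp(b0 + b1 * x)   # fitted mean
            w = mu                       # Poisson variance equals the mean
            s0 += w; s1 += w * x; s2 += w * x * x
            g0 += (y - mu); g1 += (y - mu) * x
        det = s0 * s2 - s1 * s1
        # Solve the 2x2 Newton system (X'WX) delta = X'(y - mu).
        b0 += (s2 * g0 - s1 * g1) / det
        b1 += (s0 * g1 - s1 * g0) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2, 3, 4, 6]
b0, b1 = irls_poisson(xs, ys)
```

At convergence the score equations hold, so the fitted means reproduce the observed totals.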
Generalized Additive Models

Generalized additive models are a generalization of generalized linear models. The generalization is that g(E(yi)) need not be a linear function of a set of β's, but has the form

    g(E(yi)) = Σj fj(xji),   (3.7)

where the fj's are arbitrary, usually smooth, functions.

An example of the model produced using a type of scatterplot smoother is shown in Figure 3.1.
[Figure 3.1: Diabetes Data, spline smooth (df = 3) of log C-peptide against age.]
Methods for fitting generalized additive models exist and are generally reliable. The main drawback is that the framework of statistical inference that is available for generalized linear models has not yet been developed for generalized additive models. Despite this drawback, generalized additive models can already be fitted by several of the major statistical packages.
3.2.2 Analysis of Variance

The analysis of variance, or ANOVA, is primarily a method of identifying which of the β's in a linear model are non-zero. This technique was developed for the analysis of agricultural field experiments, but is now used quite generally.

Example 27 Turnips for Winter Fodder. The data in Table 3.1 are from an experiment to investigate the growth of turnips. These types of turnips would be grown to provide food for farm animals in winter. The turnips were harvested and weighed by staff and students of the Departments of Agriculture and Applied Statistics of The University of Reading, in October 1990.
Table 3.1
          Treatments                              Blocks
Variety   Date      Density   Label    I      II     III    IV
Barkant   21/8/90   1 kg/ha   A        2.7    1.4    1.2    3.8
                    2 kg/ha   B        7.3    3.8    3.0    1.2
                    4 kg/ha   C        6.5    4.6    4.7    0.8
                    8 kg/ha   D        8.2    4.0    6.0    2.5
          28/8/90   1 kg/ha   E        4.4    0.4    6.5    3.1
                    2 kg/ha   F        2.6    7.1    7.0    3.2
                    4 kg/ha   G       24.0   14.9   14.6    2.6
                    8 kg/ha   H       12.2   18.9   15.6    9.9
Marco     21/8/90   1 kg/ha   J        1.2    1.3    1.5    1.0
                    2 kg/ha   K        2.2    2.0    2.1    2.5
                    4 kg/ha   L        2.2    6.2    5.7    0.6
                    8 kg/ha   M        4.0    2.8   10.8    3.1
          28/8/90   1 kg/ha   N        2.5    1.6    1.3    0.3
                    2 kg/ha   P        5.5    1.2    2.0    0.9
                    4 kg/ha   Q        4.7   13.2    9.0    2.9
                    8 kg/ha   R       14.9   13.3    9.3    3.6
The following linear model,

    yi = α xAi + β xBi + ... + ρ xRi + βII xIIi + βIII xIIIi + βIV xIVi + εi,   i = 1, 2, ..., 64,   (3.8)

or an equivalent one, could be fitted to these data. The inputs take the values 0 or 1 and are usually called dummy or indicator variables.

On first sight, (3.8) should also include a constant term and a term for block I, but we do not need them.
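Building the 0/1 indicator variables of (3.8) is mechanical; a minimal sketch (the helper name and example labels are hypothetical, and dropping the first level mirrors the omission of the block I term):

```python
# Map a categorical factor to 0/1 indicator (dummy) columns.

def indicators(levels, drop_first=False):
    """Return a dict of 0/1 indicator columns, one per category level."""
    cats = sorted(set(levels))
    if drop_first:
        cats = cats[1:]   # avoid redundancy with a constant term
    return {c: [1 if v == c else 0 for v in levels] for c in cats}

blocks = ["I", "II", "III", "IV", "I", "II", "III", "IV"]
cols = indicators(blocks, drop_first=True)
```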
The first question that we would try to answer about these data is

    Does a change in treatment produce a change in the turnip yield?

which is equivalent to asking

    Are any of α, β, ..., ρ non-zero?

which is the sort of question that can be answered using ANOVA.
This is how the ANOVA works. Recall the general linear model of (3.1),

    yi = β0 + β1 x1i + ... + βk xki + εi,   i = 1, 2, ..., n.

The estimate of βj is β̂j. Fitted values:

    ŷi = β̂0 + Σj β̂j xji.   (3.9)

Residuals:

    ei = yi − ŷi.   (3.10)

The size of the residuals is related to the size of σ², the variance of the εi's. It turns out that we can estimate σ² by

    s² = Σi (yi − ŷi)² / (n − k − 1).   (3.11)
18
The
keyfacts
about? �
isthatallow
usto
compare
differentlinearm
odelsare:
�if
thefitted
modelis
adequate(‘the
rightone’),
then? �
isa
goodestim
ateof�
�;
�
ifthe
fittedm
odelincludesredundant
terms
(thatis
includessom
e
’sthat
arereally
zero),then? �
isstilla
goodestim
ateof�
�;
�
ifthefitted
modeldoes
notincludeone
orm
oreinputs
thatitoughtto,then
? �
willtend
tobe
largerthan
thetrue
valueof�
�.
So
ifw
eom
ita
usefulinput
fromour
model,
theestim
ateof�
�
will
shootup,
whereas
ifwe
omita
redundantinputfromour
model,the
estimate
of��
shouldnotchange
much.
Note
thatomitting
oneofthe
inputsfrom
them
odelisequiv-
alenttoforcing
thecorresponding
tobe
zero.
19
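These facts are what make nested-model comparisons work: fit the model with and without an input, form s² from (3.11), and see how much the residual sum of squares drops. A minimal sketch on made-up data, using the usual F ratio for one extra term:

```python
# Compare an intercept-only model with an intercept-plus-slope model.

def rss_intercept_only(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def rss_with_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
n = len(xs)
rss0 = rss_intercept_only(ys)     # model omitting the input
rss1 = rss_with_slope(xs, ys)     # model including the input
s2 = rss1 / (n - 2)               # s^2 from (3.11), with k = 1
f_stat = (rss0 - rss1) / 1 / s2   # large F means the input is useful
```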
Example 28 Turnips for Winter Fodder continued. Let the full model be the model at (3.8), and the reduced model be the following model:

    yi = βII xIIi + βIII xIIIi + βIV xIVi + εi,   i = 1, 2, ..., 64.   (3.18)

So, the reduced model is the special case of the full model in which all of α, β, ..., ρ are zero.

Table 3.2
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   2.278016   0.08867543
Residuals   60   1437.538    23.95897

Table 3.3
            Df   Sum of Sq   Mean Sq    F Value    Pr(F)
block        3    163.7375   54.57891   5.690430   0.002163810
treat       15   1005.927    67.06182   6.991906   0.000000171
Residuals   45    431.611     9.59135
Table 3.4 shows the ANOVA that would usually be produced for the turnip data. Notice that the 'block' and 'Residuals' rows are the same as in Table 3.3. The basic difference between Tables 3.3 and 3.4 is that the treatment information is broken down into its constituent parts in Table 3.4.

Table 3.4
                          Df   Sum of Sq   Mean Sq     F Value    Pr(F)
block                      3    163.7367    54.5789    5.69043    0.0021638
variety                    1     83.9514    83.9514    8.75282    0.0049136
sowing                     1    233.7077   233.7077   24.36650    0.0000114
density                    3    470.3780   156.7927   16.34730    0.0000003
variety:sowing             1     36.4514    36.4514    3.80045    0.0574875
variety:density            3      8.6467     2.8822    0.30050    0.8248459
sowing:density             3    154.7930    51.5977    5.37960    0.0029884
variety:sowing:density     3     17.9992     5.9997    0.62554    0.6022439
Residuals                 45    431.6108     9.5914
3.2.3 Log-linear Models

The data shown in Table 3.7 show the sort of problem attacked by log-linear modelling. There are five categorical variables displayed in Table 3.7:

centre     one of three health centres for the treatment of breast cancer;
age        the age of the patient when her breast cancer was diagnosed;
survived   whether the patient survived for at least three years from diagnosis;
appear     appearance of the patient's tumour, either malignant or benign;
inflam     amount of inflammation of the tumour, either minimal or greater.
Table 3.7
                                        State of Tumour
                           Minimal Inflammation    Greater Inflammation
Centre     Age        Survived  Malignant  Benign   Malignant  Benign
Tokyo      Under 50   No            9        7          4        3
                      Yes          26       68         25        9
           50-69      No            9        9         11        2
                      Yes          20       46         18        5
           70 or over No            2        3          1        0
                      Yes           1        6          5        1
Boston     Under 50   No            6        7          6        0
                      Yes          11       24          4        0
           50-69      No            8       20          3        2
                      Yes          18       58         10        3
           70 or over No            9       18          3        0
                      Yes          15       26          1        1
Glamorgan  Under 50   No           16        7          3        0
                      Yes          16       20          8        1
           50-69      No           14       12          3        0
                      Yes          27       39         10        4
           70 or over No            3        7          3        0
                      Yes          12       11          4        1
For these data, the output is the number of patients in each cell. The model is

    yi ~ Poisson(λi),   log(λi) = β0 + β1 x1i + ... + βk xki.   (3.21)

Since all the variables of interest are categorical, we need to use indicator variables as inputs in the same way as in (3.8).
Table 3.8  Terms added sequentially (first to last)
                    Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                    71       860.0076
centre               2     9.3619       69       850.6457    0.0092701
age                  2   105.5350       67       745.1107    0.0000000
survived             1   160.6009       66       584.5097    0.0000000
inflam               1   291.1986       65       293.3111    0.0000000
appear               1     7.5727       64       285.7384    0.0059258
centre:age           4    76.9628       60       208.7756    0.0000000
centre:survived      2    11.2698       58       197.5058    0.0035711
centre:inflam        2    23.2484       56       174.2574    0.0000089
centre:appear        2    13.3323       54       160.9251    0.0012733
age:survived         2     3.5257       52       157.3995    0.1715588
age:inflam           2     0.2930       50       157.1065    0.8637359
age:appear           2     1.2082       48       155.8983    0.5465675
survived:inflam      1     0.9645       47       154.9338    0.3260609
survived:appear      1     9.6709       46       145.2629    0.0018721
inflam:appear        1    95.4381       45        49.8248    0.0000000
To summarise this model, I would construct its conditional independence graph and present tables corresponding to the interactions. Tables are in the book. The conditional independence graph is shown in Figure 3.2.

[Figure 3.2: conditional independence graph linking age, centre, survived, inflam and appear.]
3.2.4 Logistic Regression

In logistic regression, the output is the number of successes out of a number of trials, each trial resulting in either a success or a failure. For the breast cancer data, we can regard each patient as a 'trial', with success corresponding to the patient surviving for three years. The output would simply be given as number of successes, either 0 or 1, for each of the 764 patients involved in the study.

The model that we will fit is

    yi ~ Binomial(ni, pi)

and

    log(pi / (1 − pi)) = β0 + β1 x1i + ... + βk xki.   (3.22)

Again, the inputs here will be indicators for the breast cancer data, but this is not generally true; there is no reason why any of the inputs should not be quantitative.
Table 3.15
                             Df   Deviance   Resid. Df   Resid. Dev   Pr(Chi)
NULL                                             763      898.5279
centre                        2   11.26979       761      887.2582    0.0035711
age                           2    3.52566       759      883.7325    0.1715588
appear                        1    9.69100       758      874.0415    0.0018517
inflam                        1    0.00653       757      874.0350    0.9356046
centre:age                    4    7.42101       753      866.6140    0.1152433
centre:appear                 2    1.08077       751      865.5332    0.5825254
centre:inflam                 2    3.39128       749      862.1419    0.1834814
age:appear                    2    2.33029       747      859.8116    0.3118773
age:inflam                    2    0.06318       745      859.7484    0.9689052
appear:inflam                 1    0.24812       744      859.5003    0.6184041
centre:age:appear             4    2.04635       740      857.4540    0.7272344
centre:age:inflam             4    7.04411       736      850.4099    0.1335756
centre:appear:inflam          2    5.07840       734      845.3315    0.0789294
age:appear:inflam             2    4.34374       732      840.9877    0.1139642
centre:age:appear:inflam      3    0.01535       729      840.9724    0.9994964
The fitted model is simple enough in this case for the parameter estimates to be included here; they are shown in Table 3.16, in the form that a statistical package would present them.

Table 3.16
Coefficients:
  (Intercept)     centre2      centre3      appear
   1.080257    -0.6589141   -0.4944846   0.5157151

Using the estimates given in Table 3.16, the fitted model is

    log(p̂i / (1 − p̂i)) = 1.080257 − 0.6589141 centre2i − 0.4944846 centre3i + 0.5157151 appeari.   (3.23)
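Given coefficients like those in Table 3.16, fitted survival probabilities follow by inverting the logit. A small sketch (the helper name and the 0/1 coding of the indicators are mine; the coefficient values are those of Table 3.16):

```python
# Turn the linear predictor of (3.23) into a fitted probability.
import math

COEF = {"(Intercept)": 1.080257, "centre2": -0.6589141,
        "centre3": -0.4944846, "appear": 0.5157151}

def survival_prob(centre2, centre3, appear):
    eta = (COEF["(Intercept)"] + COEF["centre2"] * centre2
           + COEF["centre3"] * centre3 + COEF["appear"] * appear)
    return 1.0 / (1.0 + math.exp(-eta))   # inverse logit

# Baseline cell: first centre, appear indicator 0.
p_base = survival_prob(0, 0, 0)
```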
3.2.5 Analysis of Survival Data

Survival data are data concerning how long it takes for a particular event to happen. In many medical applications the event is the death of a patient with an illness, and so we are analysing the patient's survival time. In industrial applications the event is often the failure of a component in a machine.

The output in this sort of problem is the survival time. As with all the other problems that we have seen in this section, the task is to fit a regression model to describe the relationship between the output and some inputs. In the medical context, the inputs are usually qualities of the patient, such as age and sex, or are determined by the treatment given to the patient.

We will skip this topic.
3.3 Special Topics in Regression Modelling

- Multivariate Analysis of Variance
- Repeated Measures Data
- Random Effects Models

The topics in this section are special in the sense that they are extensions to the basic idea of regression modelling. The techniques have been developed in response to methods of data collection in which the usual assumptions of regression modelling are not justified.
3.3.1 Multivariate Analysis of Variance

Model:

    yi = β0 + β1 x1i + β2 x2i + ... + βk xki + εi,   i = 1, 2, ..., n,   (3.26)

where the εi's are independently and identically distributed as N(0, Σ) and n is the number of data points. Here yi is a (p×1) vector, that is, p rows and 1 column; the β's are also (p×1) vectors.

This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the p dimensions of the output, one at a time.
Having fitted the model, we can obtain fitted values

    ŷi = β̂0 + Σj β̂j xji,   i = 1, 2, ..., n,

and hence residuals

    yi − ŷi,   i = 1, 2, ..., n.

The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be

    R = Σi (yi − ŷi)(yi − ŷi)ᵀ.
3.3.2 Repeated Measures Data

Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals. Usually, the covariates are also observed at the same time points as the output; so the inputs are time-dependent too. Thus, as in Section 3.3.1 the output is a vector of measurements. In principle, we can simply apply the techniques of Section 3.3.1 to analyse repeated measures data. Instead, we usually try to use the fact that we have the same set of variables (output and inputs) at several times, rather than a collection of different variables making up a vector output.

Repeated measures data are often called longitudinal data, especially in the social sciences. The term cross-sectional is often used to mean 'not longitudinal'.
3.3.3 Random Effects Models

Overdispersion

In a logistic regression we might replace (3.22) with

    log(pi / (1 − pi)) = β0 + β1 x1i + ... + βk xki + ui,   (3.29)

where the ui's are independently and identically distributed as N(0, σu²). We can think of ui as representing either the effect of a missing input on pi, or simply as random variation in the success probabilities for individuals that have the same values for the input variables.
Hierarchical Models

In the turnip experiment, the growth of the turnips is affected by the different blocks, but the effects (the β's) for each block are likely to be different in different years. So we could think of the β's for each block as coming from a population of β's for blocks. If we did this, then we could replace the model in (3.8) with

    yi = α xAi + β xBi + ... + ρ xRi + bI xIi + bII xIIi + bIII xIIIi + bIV xIVi + εi,   i = 1, 2, ..., 64,   (3.30)

where bI, bII, bIII and bIV are independently and identically distributed as N(0, σb²).
3.4 Classical Multivariate Analysis

- Principal Components Analysis
- Correspondence Analysis
- Multidimensional Scaling
- Cluster Analysis and Mixture Decomposition
- Latent Variable and Covariance Structure Models
3.4.1 Principal Components Analysis

Principal components analysis is a way of transforming a set of p-dimensional vector observations, x1, x2, ..., xn, into another set of p-dimensional vectors, y1, y2, ..., yn. The y's have the property that most of their information content is stored in the first few dimensions (features).

This will allow dimensionality reduction, so that we can do things like:

- obtaining (informative) graphical displays of the data in 2-D;
- carrying out computer-intensive methods on reduced data;
- gaining insight into the structure of the data, which was not apparent in p dimensions.
[Figure 3.3: scatterplot matrix of Fisher's Iris Data (collected by Anderson): Sepal L., Sepal W., Petal L., Petal W.]
The main idea behind principal components analysis is that high information corresponds to high variance.

So, if we wanted to reduce the x's to a single dimension, we would transform x to

    y = aᵀx,

choosing a so that y has the largest variance possible. It turns out that a should be the eigenvector corresponding to the largest eigenvalue of the variance (covariance) matrix of x, Σ.

It is also possible to show that, of all the directions orthogonal to the direction of highest variance, the (second) highest variance is in the direction parallel to the eigenvector of the second largest eigenvalue of Σ. These results extend all the way to p dimensions.
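One way to compute the direction of highest variance is power iteration on the sample covariance matrix; the slides do not prescribe an algorithm, so treat the method choice as an assumption. A minimal 2-D sketch on made-up data:

```python
# First principal axis of 2-D data via power iteration on the
# sample covariance matrix (pure Python, toy data).

def first_principal_axis(pts, iters=200):
    n = len(pts)
    m1 = sum(p[0] for p in pts) / n
    m2 = sum(p[1] for p in pts) / n
    # Entries of the 2x2 sample covariance matrix S.
    s11 = sum((p[0] - m1) ** 2 for p in pts) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in pts) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pts) / (n - 1)
    # Power iteration converges to the eigenvector of the largest eigenvalue.
    a = (1.0, 0.0)
    for _ in range(iters):
        w = (s11 * a[0] + s12 * a[1], s12 * a[0] + s22 * a[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        a = (w[0] / norm, w[1] / norm)
    return a

# Points scattered along the 45-degree line, so the axis is near (1,1)/sqrt(2).
pts = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.9), (4, 4.1)]
axis = first_principal_axis(pts)
```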
The estimate of Σ is

    S = (1/(n−1)) Σi (xi − x̄)(xi − x̄)ᵀ,   (3.31)

where x̄ = (1/n) Σi xi.

- The eigenvalues of S are λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0.
- The eigenvectors of S corresponding to λ1, λ2, ..., λp are a1, a2, ..., ap, respectively.
- The vectors a1, a2, ..., ap are called the principal axes. (a1 is the first principal axis, etc.)
- The (p×p) matrix whose jth column is aj will be denoted as A.
The principal axes can be (and are) chosen so that they are of length 1 and are orthogonal (perpendicular). Algebraically, this means that

    ajᵀ ak = 1 if j = k, and 0 if j ≠ k.   (3.32)

The vector y, whose rows are the projections of x onto the principal axes, is defined as

    y = Aᵀ x

and is called the vector of principal component scores of x. The jth principal component score of x is yj = ajᵀ x; sometimes the principal component scores are referred to as the principal components.
1. The elements of y are uncorrelated, and the sample variance of the jth principal component score is λj. In other words, the sample variance matrix of y is

       diag(λ1, λ2, ..., λp).

2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of x. That is,

       Σj λj = Σj sj²,

   where sj² is the sample variance of xj.
[Figure 3.4: principal component scores (y1, y2, y3, y4) for Fisher's Iris Data. Compare with Figure 3.3.]
Effective Dimensionality

1. The proportion of variance accounted for. Take the first r principal components and add up their variances. Divide by the sum of all the variances, to give

       (Σ_{j=1..r} λj) / (Σ_{j=1..p} λj),

   which is called the proportion of variance accounted for by the first r principal components.

   Usually, projections accounting for over 75% of the total variance are considered to be good. Thus, a 2-D picture will be considered a reasonable representation if

       (λ1 + λ2) / (Σ_{j=1..p} λj) ≥ 0.75.
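The proportion of variance accounted for is a one-line computation on the eigenvalues; a small sketch with hypothetical λ values:

```python
# Proportion of variance accounted for by the first r principal components.

def prop_variance(eigvals, r):
    return sum(eigvals[:r]) / sum(eigvals)

lams = [4.2, 0.24, 0.08, 0.02]   # hypothetical eigenvalues, sorted descending
p2 = prop_variance(lams, 2)      # proportion captured by a 2-D view
```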
2. The size of important variance. The idea here is to consider the variance if all directions were equally important. In this case the variances would be approximately

       λ̄ = (1/p) Σj λj.

   The argument runs: if λj < λ̄, then the jth principal direction is less interesting than average; and this leads us to discard principal components that have sample variances below λ̄.

3. Scree diagram. A scree diagram is an index plot of the principal component variances. In other words, it is a plot of λj against j. An example of a scree diagram, for the Iris Data, is shown in Figure 3.5.
[Figure 3.5: scree diagram (λj against j) for the Iris Data.] We look for the elbow; in this case we only need the first component.
Normalising

The data can be normalised by carrying out the following steps.

- Centre each variable. In other words, subtract the mean of each variable to give

      x̃j = xj − x̄j.

- Divide each element of x̃ by its standard deviation; as a formula, this means calculate

      zj = x̃j / sj,

  where sj is the sample standard deviation of xj.
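The two steps can be sketched as follows (made-up numbers):

```python
# Normalise one variable: centre it, then divide by the sample sd.

def normalise(xs):
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / (n - 1)) ** 0.5   # sample sd
    return [(x - m) / s for x in xs]

zs = normalise([2.0, 4.0, 6.0, 8.0])   # result has mean 0, variance 1
```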
[Figure 3.6: if we don't normalise. Two panels of Petal L. against Sepal W.: 'Mean Centred Data' and 'Scaled Data' (with Petal L. multiplied by 5).]
Interpretation

The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components. See the book for an interpretation for Fisher's Iris Data.
3.4.2 Correspondence Analysis

Correspondence analysis is a way to represent the structure within incidence matrices. Incidence matrices are also called two-way contingency tables. An example of a (5×4) incidence matrix, with marginal totals, is shown in Table 3.17.

Table 3.17
                         Smoking Category
Staff Group          None   Light   Medium   Heavy   Total
Senior Managers        4      2       3        2       11
Junior Managers        4      3       7        4       18
Senior Employees      25     10      12        4       51
Junior Employees      18     24      33       13       88
Secretaries           10      6       7        2       25
Total                 61     45      62       25      193
Two Stages

- Transform the values in a way that relates to a test for association between rows and columns (chi-squared test).
- Use a dimensionality reduction method to allow us to draw a picture of the relationships between rows and columns in 2-D.

Details are like principal components analysis mathematically; see the book.
3.4.3 Multidimensional Scaling

Multidimensional scaling is the process of converting a set of pairwise dissimilarities for a set of points into a set of co-ordinates for the points.

Examples of dissimilarities could be:

- the price of an airline ticket between pairs of cities;
- road distances between towns (as opposed to straight-line distances);
- a coefficient indicating how different the artefacts found in pairs of tombs within a graveyard are.
Classical Scaling

Classical scaling is also known as metric scaling and as principal co-ordinates analysis. The name 'metric' scaling is used because the dissimilarities are assumed to be distances, or in mathematical terms the measure of dissimilarity is the euclidean metric. The name 'principal co-ordinates analysis' is used because there is a link between this technique and principal components analysis. The name 'classical' is used because it was the first widely used method of multidimensional scaling, and pre-dates the availability of electronic computers.

The derivation of the method used to obtain the configuration is given in the book.
The results of applying classical scaling to British road distances are shown in Figure 3.7. These road distances correspond to the routes recommended by the Automobile Association; these recommended routes are intended to give the minimum travelling time, not the minimum journey distance.

- An effect of this, visible in Figure 3.7, is that the towns and cities have lined up in positions related to the motorway network.
- The map also features distortions from the geographical map, such as the position of Holyhead (holy), which appears to be much closer to Liverpool (lver) and Manchester than it really is, and the position of the Cornish peninsula (the part ending at Penzance, penz), which is further from Carmarthen (carm) than it is physically.
[Figure 3.7: classical scaling configuration (components 1 and 2) for British road distances, with towns labelled by four-letter codes (abdn, abry, ..., lond, ..., york).]
Ordinal Scaling

Ordinal scaling is used for the same purposes as classical scaling, but for dissimilarities that are not metric, that is, they are not what we would think of as distances. Ordinal scaling is sometimes called non-metric scaling, because the dissimilarities are not metric. Some people call it Shepard-Kruskal scaling, because Shepard and Kruskal are the names of two pioneers of ordinal scaling.

In ordinal scaling, we seek a configuration in which the pairwise distances between points have the same rank order as the corresponding dissimilarities. So, if δrs is the dissimilarity between points r and s, and drs is the distance between the same points in the derived configuration, then we seek a configuration in which

    drs < dtu   if   δrs < δtu.
3.4.4 Cluster Analysis and Mixture Decomposition

Cluster analysis and mixture decomposition are both techniques to do with identification of concentrations of individuals in a space.
Cluster Analysis

Cluster analysis is used to identify groups of individuals in a sample. The groups are not pre-defined, nor, usually, is the number of groups. The groups that are identified are referred to as clusters.

- hierarchical
  - agglomerative
  - divisive
- non-hierarchical
- Minimum distance or single-link.
- Maximum distance or complete-link.
- Average distance.
- Centroid distance defines the distance between two clusters as the squared distance between the mean vectors (that is, the centroids) of the two clusters.
- Sum of squared deviations defines the distance between two clusters as the sum of the squared distances of individuals from the joint centroid of the two clusters, minus the sum of the squared distances of individuals from their separate cluster means.
[Figure 3.8: dendrogram (distance between clusters plotted against nine individuals), the usual way to present results of hierarchical clustering.]
Non-hierarchical clustering is essentially trying to partition the sample so as to optimize some measure of clustering. The choice of measure of clustering is usually based on properties of sums of squares and products matrices, like those met in Section 3.3.1, because the aim in the MANOVA is to measure differences between groups.

The main difficulty here is that there are too many different ways to partition the sample for us to try them all, unless the sample is very small. Thus our only way, in general, of guaranteeing that the global optimum is achieved is to use a method such as branch-and-bound.

One of the best known non-hierarchical clustering methods is the k-means method.
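The k-means method alternates two steps: assign each point to its nearest cluster centre, then move each centre to the mean of its points. A minimal 1-D sketch on made-up data with k = 2:

```python
# k-means on 1-D data: alternate assignment and centre-update steps.

def kmeans_1d(xs, centres, iters=20):
    for _ in range(iters):
        groups = [[] for _ in centres]
        for x in xs:
            # Assign x to the nearest centre.
            j = min(range(len(centres)), key=lambda j: abs(x - centres[j]))
            groups[j].append(x)
        # Move each centre to the mean of its assigned points.
        centres = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centres)]
    return centres

xs = [1.0, 1.2, 0.8, 9.0, 9.4, 8.6]   # two obvious concentrations
centres = kmeans_1d(xs, [0.0, 5.0])
```

Note that, as the slides say, this only guarantees a local optimum: the result can depend on the starting centres.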
Mixture Decomposition

Mixture decomposition is related to cluster analysis in that it is used to identify concentrations of individuals. The basic difference between cluster analysis and mixture decomposition is that there is an underlying statistical model in mixture decomposition, whereas there is no such model in cluster analysis. The probability density that has generated the sample data is assumed to be a mixture of several underlying distributions. So we have

    f(x) = Σ_{j=1..c} πj gj(x; θj),

where c is the number of underlying distributions, the gj's are the densities of the underlying distributions, the θj's are the parameters of the underlying distributions, the πj's are positive and sum to one, and f is the density from which the sample has been generated.

Details are in one of Hand's books.
3.4.5 Latent Variable and Covariance Structure Models

I have never used the techniques in this section, so I do not consider myself expert enough to give a presentation on them. Not enough time to cover everything.
3.5 Summary

The techniques presented in this chapter do not form anything like an exhaustive list of useful statistical methods. These techniques were chosen because they are either widely used or ought to be widely used.

The regression techniques are widely used, though there is some reluctance amongst researchers to make the jump from linear models to generalized linear models. The multivariate analysis techniques ought to be used more than they are. One of the main obstacles to the adoption of these techniques may be that their roots are in linear algebra.

I feel the techniques presented in this chapter, and their extensions, will remain or become the most widely used statistical techniques. This is why they were chosen for this chapter.