Week 8: Paired sites tests, gene frequencies, continuous characters
Genome 570
March, 2012
Week 8: Paired sites tests, gene frequencies, continuous characters – p.1/47
An example – two trees
[Figure: two candidate trees, Tree I and Tree II, for the species Mouse, Bovine, Gibbon, Orang, Gorilla, Chimp, and Human]
The differences of log likelihoods
Site:     1, 2, 3, 4, 5, 6, ..., 231, 232
Tree I:   −2.971 −4.483 −5.673 −5.883 −2.691 ... −8.003 −2.971 −2.691   (ln L = −1405.61)
Tree II:  −2.983 −4.494 −5.685 −5.898 −2.700 −7.572 −2.987 −2.705 ...   (ln L = −1408.80)
Diff:     +0.012 +0.013 +0.010 −0.431 +0.015 +0.111 +0.012 +0.010 ...   (total +3.19)
The histogram of differences of log-likelihoods
[Figure: histogram of the per-site differences; horizontal axis "Difference in log likelihood at site", with ticks from −0.50 to 2.0]
Paired sites tests
Winning sites test (Prager and Wilson, 1988). Do a sign test on the signs of the differences.
z test (me, 1993 in PHYLIP documentation). Assume differences are normal, do a z test of whether the mean (hence sum) difference is significant.
t test (Swofford et al., 1996). Do a paired t test.
Wilcoxon signed ranks test (Templeton, 1983).
RELL test (Kishino and Hasegawa, 1989, per my suggestion). Bootstrap resample sites, get the distribution of the difference of totals.
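Three of the paired sites tests listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example list `diffs` are hypothetical, ties are simply dropped in the sign test, and the t test and the rank-based test are left out.

```python
import math
import random

def winning_sites_test(diffs):
    """Two-sided sign test; returns (number of positive sites, P value)."""
    pos = sum(1 for d in diffs if d > 0)
    n = sum(1 for d in diffs if d != 0)          # sites with no difference are dropped
    k = max(pos, n - pos)
    tail = sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return pos, min(1.0, 2 * tail)

def z_test(diffs):
    """z statistic for the summed difference, with a two-sided normal P value."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    z = sum(diffs) / math.sqrt(n * var)          # sum divided by its standard deviation
    return z, math.erfc(abs(z) / math.sqrt(2))

def rell_test(diffs, replicates=10000, seed=1):
    """Bootstrap the sites; returns (fraction of positive sums, two-sided P)."""
    rng = random.Random(seed)
    n = len(diffs)
    positive = sum(1 for _ in range(replicates)
                   if sum(rng.choices(diffs, k=n)) > 0)
    frac = positive / replicates
    return frac, min(1.0, 2 * min(frac, 1 - frac))

# Hypothetical per-site differences in log-likelihood (tree I minus tree II).
diffs = [0.012, 0.013, 0.010, -0.431, 0.015, 0.111, 0.012, 0.010]
print(winning_sites_test(diffs))
print(z_test(diffs))
print(rell_test(diffs, replicates=1000))
```

With so few sites the three tests can disagree sharply, because the sign test ignores the size of each difference while the z and RELL tests are driven by the total.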
In our example ...
Winning sites test. 160 of 232 sites favor tree I. P < 3.279 × 10^−9
z test. Difference of log-likelihood totals is 0.948104 standard deviations from 0, P = 0.343077. Not significant.
t test. Same as z test for this large a number of sites.
Wilcoxon signed ranks test. Rank sum is 4.82805 standard deviations below its expected value, P = 0.000001378765
RELL test. 8,326 out of 10,000 samples have a positive sum, P = 0.3348 (two-sided)
The Shimodaira-Hasegawa test
Starts with a set of user-specified trees
Gets the sitewise log-likelihoods
Adjusts each tree's log-likelihoods to add up to the same value
Then resamples columns (sites) from these
Asks how often a tree will be more than X worse than the best
For each X (the log-likelihood difference between one tree and the best one) we can get a P value.
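The steps above can be sketched as follows. This is a simplified illustration of the resampling idea under stated assumptions, not the full published procedure: the function name `sh_test` and the tiny `site_lnl` table are hypothetical, and the adjustment is done by subtracting each tree's mean sitewise log-likelihood so that every tree's total becomes zero.

```python
import random

def sh_test(site_lnl, replicates=1000, seed=1):
    """site_lnl[t][s] is the log-likelihood of site s on tree t.

    Returns one P value per tree.
    """
    rng = random.Random(seed)
    n_trees, n_sites = len(site_lnl), len(site_lnl[0])
    totals = [sum(row) for row in site_lnl]
    observed = [max(totals) - t for t in totals]   # X for each tree
    # Adjust each tree's sitewise values so that all trees have the same total.
    centred = [[x - tot / n_sites for x in row]
               for row, tot in zip(site_lnl, totals)]
    exceed = [0] * n_trees
    for _ in range(replicates):
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]   # resample sites
        boot = [sum(row[c] for c in cols) for row in centred]
        best = max(boot)
        for t in range(n_trees):
            if best - boot[t] >= observed[t]:
                exceed[t] += 1
    return [e / replicates for e in exceed]

# Hypothetical sitewise log-likelihoods for two trees at three sites.
site_lnl = [[-2.971, -4.483, -5.673],
            [-2.983, -4.494, -5.685]]
print(sh_test(site_lnl))
```

The tree with the highest observed total has X = 0, so its P value is 1 by construction; the interesting values are those of the other trees.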
An outcome of Brownian motion on a 5-species tree
Brownian motion along a tree
[Figure: a 7-species tree with tip values x1 to x7, interior-node values x8 to x12, and root value x0. The branches have variances v1 to v12, and the change along each branch is the difference between the values at its two ends: x1 − x8, x2 − x8, x8 − x9, x3 − x9, x9 − x0, x4 − x12, x5 − x11, x6 − x10, x7 − x10, x10 − x11, x11 − x12, x12 − x0.]
Distribution of tips on a tree under Brownian Motion
[Figure: a two-species tree. A branch of variance v3 leads from the root (value 0) to an interior node, from which branches of variance v1 and v2 lead to tips 1 and 2.]
Tip 1 is the sum of two independent changes, each of which is drawn from a normal distribution (with mean 0 and variances v3 and v1), so it is normally distributed with mean 0 and variance v3 + v1.
Similarly for tip 2 (variance is v3 + v2).
They share branch 3, and the change there affects both random variables. So they are neither independent nor uncorrelated.
Variance is the expectation of the square (of the deviation from the mean), and covariance is the expectation of the product of those deviations for the two variables.
In fact the covariance of the values at tip 1 and tip 2 is the variance of the shared term that is the same in both of them, so it is v3.
Covariances of species on the tree
\[
\begin{bmatrix}
v_1+v_8+v_9 & v_8+v_9 & v_9 & 0 & 0 & 0 & 0 \\
v_8+v_9 & v_2+v_8+v_9 & v_9 & 0 & 0 & 0 & 0 \\
v_9 & v_9 & v_3+v_9 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & v_4+v_{12} & v_{12} & v_{12} & v_{12} \\
0 & 0 & 0 & v_{12} & v_5+v_{11}+v_{12} & v_{11}+v_{12} & v_{11}+v_{12} \\
0 & 0 & 0 & v_{12} & v_{11}+v_{12} & v_6+v_{10}+v_{11}+v_{12} & v_{10}+v_{11}+v_{12} \\
0 & 0 & 0 & v_{12} & v_{11}+v_{12} & v_{10}+v_{11}+v_{12} & v_7+v_{10}+v_{11}+v_{12}
\end{bmatrix}
\]
Covariances are of form
\[
\begin{bmatrix}
a & b & c & 0 & 0 & 0 & 0 \\
b & d & c & 0 & 0 & 0 & 0 \\
c & c & e & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & f & g & g & g \\
0 & 0 & 0 & g & h & i & i \\
0 & 0 & 0 & g & i & j & k \\
0 & 0 & 0 & g & i & k & l
\end{bmatrix}
\]
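The rule behind these matrices (the covariance of two tips is the total variance of the branches they share on the way up to the root) can be checked with a short sketch. The `parent` table below encodes the 7-species tree as read off the covariance matrix; the numeric branch variances (v_k = k) and the function name are hypothetical, chosen only so that each entry is easy to recognize.

```python
def tip_covariances(parent, tips):
    """parent[node] = (parent node, variance of the branch above node)."""
    def path(node):
        # Branches from node up to the root, keyed by the node below each branch.
        branches = {}
        while node in parent:
            up, variance = parent[node]
            branches[node] = variance
            node = up
        return branches

    paths = {t: path(t) for t in tips}
    return [[sum(v for b, v in paths[s].items() if b in paths[t]) for t in tips]
            for s in tips]

# Hypothetical variances: branch k has variance v_k = k. Node 0 is the root.
parent = {
    1: (8, 1), 2: (8, 2), 8: (9, 8), 3: (9, 3), 9: (0, 9),
    4: (12, 4), 5: (11, 5), 6: (10, 6), 7: (10, 7),
    10: (11, 10), 11: (12, 11), 12: (0, 12),
}
cov = tip_covariances(parent, tips=[1, 2, 3, 4, 5, 6, 7])
for row in cov:
    print(row)
```

With these values the first row should read v1 + v8 + v9, v8 + v9, v9 followed by zeros, which is the block structure shown above.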
Likelihood under Brownian motion with two species
\[
f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]
\[
L = \prod_{i=1}^{p} \frac{1}{(2\pi)\sqrt{v_1 v_2}} \exp\left(-\frac{1}{2}\left[\frac{(x_{1i}-x_{0i})^2}{v_1} + \frac{(x_{2i}-x_{0i})^2}{v_2}\right]\right)
\]
Minimizing for each character i
\[
Q = \frac{(x_{1i}-x_{0i})^2}{v_1} + \frac{(x_{2i}-x_{0i})^2}{v_2}
\]
so:
\[
\frac{dQ}{dx_{0i}} = -\frac{2(x_{1i}-x_{0i})}{v_1} - \frac{2(x_{2i}-x_{0i})}{v_2} = 0
\]
and then:
\[
x_{0i} = \frac{\frac{1}{v_1}x_{1i} + \frac{1}{v_2}x_{2i}}{\frac{1}{v_1} + \frac{1}{v_2}}
\]
So that we have a maximum likelihood estimate of the starting value x0i for each character.
The result is that
\[
Q = \frac{(x_{1i}-x_{2i})^2}{v_1+v_2}
\]
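The step from the estimate of x0i to the final form of Q is not spelled out on the slide. A short derivation, using only the symbols already defined (the hat on the estimate is added here for clarity), might run as follows:

```latex
\begin{align*}
\hat{x}_{0i} &= \frac{v_2 x_{1i} + v_1 x_{2i}}{v_1 + v_2}, \\
x_{1i} - \hat{x}_{0i} &= \frac{v_1 (x_{1i} - x_{2i})}{v_1 + v_2}, \qquad
x_{2i} - \hat{x}_{0i} = -\frac{v_2 (x_{1i} - x_{2i})}{v_1 + v_2}, \\
Q &= \frac{v_1 (x_{1i}-x_{2i})^2}{(v_1+v_2)^2} + \frac{v_2 (x_{1i}-x_{2i})^2}{(v_1+v_2)^2}
   = \frac{(x_{1i}-x_{2i})^2}{v_1+v_2}.
\end{align*}
```

The first line is the weighted average for x0i with numerator and denominator multiplied by v1 v2.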
Likelihood after estimating initial coordinates
Substituting in our estimates of x0i, we end up with
\[
L = \frac{1}{(2\pi)^p (v_1 v_2)^{\frac{1}{2}p}} \exp\left(-\frac{1}{2}\sum_{i=1}^{p}\frac{(x_{1i}-x_{2i})^2}{v_1+v_2}\right)
\]
and this finally turns into:
\[
\ln L = -p\ln(2\pi) - \frac{1}{2}p\ln(v_1 v_2) - \frac{1}{2}\sum_{i=1}^{p}\frac{(x_{1i}-x_{2i})^2}{v_1+v_2}
\]
This actually goes to infinity as either v1 or v2 goes to zero! This is related to the problem that Edwards and Cavalli-Sforza had with their maximum likelihood method in 1964.
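The runaway behavior of this log-likelihood is easy to see numerically. The sketch below evaluates ln L for a shrinking v1 with v2 held fixed; the function name and the three difference values are hypothetical.

```python
import math

def log_likelihood(diffs, v1, v2):
    """ln L after substituting the estimated starting values x0i."""
    p = len(diffs)
    ss = sum(d * d for d in diffs)               # sum of squared differences
    return (-p * math.log(2 * math.pi)
            - 0.5 * p * math.log(v1 * v2)
            - 0.5 * ss / (v1 + v2))

diffs = [0.5, -1.2, 0.3]      # hypothetical values of x1i - x2i
values = [log_likelihood(diffs, v1, 1.0) for v1 in (1.0, 1e-3, 1e-6, 1e-9)]
for v1, ll in zip((1.0, 1e-3, 1e-6, 1e-9), values):
    print(v1, ll)
```

The sum-of-squares term stays bounded as v1 shrinks, while the ln(v1 v2) term grows without limit, so the printed values keep increasing.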
If there is a clock ...
If instead we constrain v1 = v2 because we assume a clock:
\[
\ln L = K' - p\ln(v_1+v_2) - \frac{1}{2}\frac{D^2}{(v_1+v_2)}
\]
where D^2 is the sum over characters of (x1i − x2i)^2, which leads to
\[
v_1 = v_2 = D^2/(4p)
\]
(which is half as big as it should be!)
The number of parameters being estimated is p + 1, which rises as we consider more characters. The fact that the ratio of data to parameters does not rise without limit is the reason why likelihood misbehaves in this case.
The difference between ML and REML
Information we use for ML inference:
[Figure: species 1, species 2, species 3, and species 4 placed along a phenotype scale marked 1.0, 2.0, 3.0, 4.0]
Information we use for REML inference:
[Figure: the same four species placed along a scale marked 1.0+x, 2.0+x, 3.0+x, 4.0+x]
Does it matter that we don’t know x? It makes it unnecessary to estimate the starting value x0, and that eliminates p parameters. It means that the ratio of data to parameters does then rise as we add characters.
Using only differences between populations (REML)
We assume that we have observed only the differences x1i − x2i, and not the actual locations on the phenotype scale. Then
\[
L = \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}\sqrt{v_1+v_2}} \exp\left(-\frac{1}{2}\frac{(x_{1i}-x_{2i})^2}{v_1+v_2}\right)
\]
\[
\ln L = K - \frac{p}{2}\ln(v_1+v_2) - \frac{1}{2(v_1+v_2)}\sum_{i=1}^{p}(x_{1i}-x_{2i})^2
\]
Likelihood with two species using REML
\[
\ln L = K - \frac{p}{2}\ln(v_1+v_2) - \frac{D^2}{2(v_1+v_2)}
\]
Writing vT = v1 + v2:
\[
\ln L = K - \frac{p}{2}\ln(v_T) - \frac{D^2}{2 v_T}
\]
which is maximized at
\[
v_T = D^2/p
\]
The number of parameters being estimated is 1 (it is the sum v1 + v2). The number of parameters does not rise as we consider more characters.
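A small simulation can illustrate the factor of two between the clock-constrained ML estimate, v1 = v2 = D^2/(4p), and the REML estimate, v1 + v2 = D^2/p. This is a sketch under simple assumptions: the true branch variance, the number of characters, and the seed are hypothetical, and the differences x1i − x2i are drawn directly from a normal distribution with variance v1 + v2.

```python
import random

def clock_ml_and_reml(diffs):
    """Return (ML estimate of v1 = v2 under a clock, REML estimate of v1 + v2)."""
    p = len(diffs)
    d2 = sum(d * d for d in diffs)     # D^2, the sum of squared differences
    return d2 / (4 * p), d2 / p

rng = random.Random(570)
true_v = 1.0      # hypothetical true variance of each of the two branches
p = 5000          # hypothetical number of characters
diffs = [rng.gauss(0.0, (2 * true_v) ** 0.5) for _ in range(p)]

v_ml, v_total_reml = clock_ml_and_reml(diffs)
print("ML estimate of each branch:  ", v_ml)
print("REML estimate of each branch:", v_total_reml / 2)
```

The REML value per branch should land near the true variance, while the ML value is exactly half of it, whatever the sample.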
“Pruning” a tree in the Brownian motion case
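[Figure: a four-species tree with tip values x1, x2, x3, x4 and branch variances v1 to v6. Tips 1 and 2 are pruned and replaced by a single tip with value x12, and the branch below it is lengthened by δ.]
\[
x_{12} = \frac{\frac{x_1}{v_1} + \frac{x_2}{v_2}}{\frac{1}{v_1} + \frac{1}{v_2}},
\qquad
\delta = \frac{v_1 v_2}{v_1 + v_2}
\]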
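One pruning step is small enough to write out directly. In this sketch the function name and the example numbers are hypothetical; the two formulas are the weighted average for x12 and the extra branch length δ.

```python
def prune_pair(x1, v1, x2, v2):
    """Replace two sister tips by one tip.

    Returns (x12, delta): the value assigned to the merged tip and the
    amount by which the branch below it is lengthened.
    """
    w1, w2 = 1.0 / v1, 1.0 / v2            # weights are inverse variances
    x12 = (w1 * x1 + w2 * x2) / (w1 + w2)
    delta = v1 * v2 / (v1 + v2)
    return x12, delta

# Hypothetical tip values and branch variances.
x12, delta = prune_pair(x1=1.0, v1=0.3, x2=2.0, v2=0.1)
print(x12, delta)
```

The tip on the shorter branch gets the larger weight, so x12 lies closer to x2 in this example.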
What about quantitative characters?
For neutral mutation and genetic drift, one can show that for a quantitative character with additive genetic variance VA and population size N, the variance of the change Δg in the genetic (additive) value of the population mean is:
\[
\mathrm{Var}(\Delta g) = V_A/N
\]
If mutation and drift are at equilibrium:
\[
E\left[V_A^{(t+1)}\right] = V_A^{(t)}\left(1 - \frac{1}{2N}\right) + V_M
\]
In neutral traits additive genetic variance rules
so that
\[
E[V_A] = 2 N V_M
\]
whereby
\[
\mathrm{Var}[\Delta g] = (2 N V_M)/N = 2 V_M
\]
an analogue of Kimura’s result for neutral mutation. There is a precise analogue of this for multiple characters.
Thus to transform characters to independent Brownian motions of equal evolutionary variance, we could use the additive genetic variance VA.
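The equilibrium value can be checked by iterating the recursion for the expected additive genetic variance until it stops changing. The population size, mutational variance, and function name below are hypothetical.

```python
def equilibrium_additive_variance(n, v_m, v_start=0.0, generations=5000):
    """Iterate E[V_A(t+1)] = V_A(t) * (1 - 1/(2N)) + V_M."""
    v_a = v_start
    for _ in range(generations):
        v_a = v_a * (1.0 - 1.0 / (2.0 * n)) + v_m
    return v_a

N = 100        # hypothetical population size
V_M = 0.01     # hypothetical mutational variance added per generation
v_a = equilibrium_additive_variance(N, V_M)
print(v_a)         # approaches 2 * N * V_M
print(v_a / N)     # approaches 2 * V_M, the variance of change in the mean
```

The second number does not depend on N, which is the point of the analogy with neutral mutation.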
With selection ... life is harder
There is the quantitative genetics formula of Wright and Fisher (1920s)
\[
\Delta z = h^2 S
\]
and Russ Lande’s (1976) recasting of that in terms of slopes of mean fitness surfaces:
\[
S = V_P \frac{d \log(\bar{w})}{dx}
\]
\[
\Delta z = (V_A/V_P)\, V_P \frac{d \log(\bar{w})}{dx} = V_A \frac{d \log(\bar{w})}{dx}
\]
Selection towards an optimum
[Figure: fitness plotted against phenotype, a curve peaked at the optimum P with width determined by Vs]
If fitness as a function of phenotype is:
\[
w(x) = \exp\left[-\frac{(x-p)^2}{2V_s}\right]
\]
Then the change of mean phenotype “chases” the optimum:
\[
m' - m = \frac{V_A}{V_s + V_P}(p - m)
\]
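Iterating this update shows how the mean closes the gap to the optimum by a constant fraction each generation. The sketch below holds the optimum fixed for simplicity; all numeric values and the function name are hypothetical.

```python
def chase_optimum(m0, optimum, v_a, v_p, v_s, generations):
    """Return the trajectory of the population mean m, one value per generation."""
    rate = v_a / (v_s + v_p)               # fraction of the gap closed each generation
    means = [m0]
    for _ in range(generations):
        m = means[-1]
        means.append(m + rate * (optimum - m))
    return means

# Hypothetical values: the mean starts 10 units below a fixed optimum.
traj = chase_optimum(m0=0.0, optimum=10.0, v_a=0.5, v_p=1.0, v_s=9.0, generations=50)
print(traj[1], traj[10], traj[-1])
```

With these numbers 5% of the remaining gap is closed per generation, so the mean lags the peak rather than jumping to it.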
A character changing by “chasing” an adaptive peak
[Figure: the position of the adaptive peak and of the population mean, plotted against time]
The course of change of the population mean is expected to be somewhat smoother than the changes of the peak of the fitness surface.
Sources of evolutionary correlation among characters
Variation (and covariation) in change of characters occurs for two reasons:
1. Genetic drift, with the covariances being proportional to the additive genetic covariances
2. Selection, with the covariances being affected by both the additive genetic covariances and the covariation of the selection pressures.
A simple example of selective covariance
[Figure: a tree of five species living in (temperate), (arctic), (arctic), (temperate), and (temperate) environments, with the characters size, color, and limb length shown for each]
Covariation due not to genetic correlation but to covariation of the selection pressure.
These are Bergmann’s, Allen’s and Gloger’s Rules.
They are presumably not the result of genetic correlations but result from patterns of selection.
G. L. Stebbins. 1950. Variation and evolution in plants. Columbia Univ. Press, New York. page 121.
A simulated example with two characters
After 100 generations:
[Figure: scatter plot of the two characters, both axes running from −30 to 30]
After 1000 generations:
[Figure: scatter plot of the two characters, both axes running from −30 to 30]
After 10,000 generations:
[Figure: scatter plot of the two characters, both axes running from −30 to 30]
Genetic covariances are negative, but the wanderings of the adaptive peak in the two characters are positively correlated.
Correcting for correlations among characters
Can we transform the set of characters to remove their correlations and thus end up with independent Brownian motions of equal variance?
We might hope to infer additive genetic covariances by doing quantitative genetics breeding experiments to infer them from covariances among relatives.
There is little or no hope of inferring “selective correlations” without a complete understanding of the functional ecology.
If we are given the tree from molecular data (and are willing to assume that the branch lengths are proportional to those that apply to the morphological characters), we can hope to use the tree to infer the covariation of the characters.
Correlation of states in a discrete-state model
[Figure: a tree shown twice, once with the states of character #1 and once with the states of character #2 marked on it]
Species states (character 1 by character 2):
0   6
4   0
Branch changes (change in character 1, Y or N, by change in character 2, Y or N):
      Y    N
Y     1    0
N     0   18
A simple case to show effects of phylogeny
Two uncorrelated characters evolving on that tree
Identifying the two clades
A tree on which we are to observe two characters
[Figure: a five-species tree with tips a, b, c, d, and e. Branch lengths shown on the figure: 0.3, 0.1, 0.25, 0.65, 0.1, 0.1, (0.7), (0.2), 0.9]
Decomposing it into two-species contrasts ...
[Figure: the same five-species tree, decomposed step by step. Pruning adds extra branch length below each merged node: (ab) 0.075, (de) 0.05, (abc) 0.1666]
Contrasts on that tree
Contrast                                                 Variance proportional to
y1 = xa − xb                                             0.4
y2 = 1/4 xa + 3/4 xb − xc                                0.975
y3 = xd − xe                                             0.2
y4 = 1/6 xa + 1/2 xb + 1/3 xc − 1/2 xd − 1/2 xe          1.11666
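The coefficients in this table can be reproduced by repeated pruning. The sketch below uses exact fractions and tracks each node as a vector of coefficients over the tip values. The assignment of branch lengths (a = 0.3, b = 0.1, branch below (ab) = 0.25, c = 0.65, d = e = 0.1) is an assumption inferred from the weights and variances in the table, since the figure itself did not survive extraction; the lengths of the two branches at the root are not assumed, so only the coefficients of y4 are computed, not its variance.

```python
from fractions import Fraction as F

TIPS = ["a", "b", "c", "d", "e"]

def tip(name):
    """Coefficient vector that picks out one tip value."""
    return [F(1) if t == name else F(0) for t in TIPS]

def prune(x1, v1, x2, v2):
    """Prune two sister nodes given as coefficient vectors over the tips.

    Returns (contrast, contrast variance, merged node, extra branch length).
    """
    contrast = [p - q for p, q in zip(x1, x2)]
    w1 = (1 / v1) / (1 / v1 + 1 / v2)          # weight of node 1 in the merged value
    merged = [w1 * p + (1 - w1) * q for p, q in zip(x1, x2)]
    extra = v1 * v2 / (v1 + v2)
    return contrast, v1 + v2, merged, extra

# Assumed branch lengths (see the note above).
y1, var1, x_ab, extra_ab = prune(tip("a"), F(3, 10), tip("b"), F(1, 10))
y2, var2, x_abc, _ = prune(x_ab, F(1, 4) + extra_ab, tip("c"), F(13, 20))
y3, var3, x_de, extra_de = prune(tip("d"), F(1, 10), tip("e"), F(1, 10))
y4 = [p - q for p, q in zip(x_abc, x_de)]      # coefficients of the root contrast

for name, coefs, var in (("y1", y1, var1), ("y2", y2, var2), ("y3", y3, var3)):
    print(name, [str(c) for c in coefs], float(var))
print("y4", [str(c) for c in y4])
```

Under these assumed lengths the first three variances and all four sets of coefficients agree with the table.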
Contrasts for the 20-species two-clade example
[Figure: scatter plot of the contrasts in the two characters, both axes running from −3 to 3]
An example: Smith and Cheverud 2002
Smith, R. J. and J. M. Cheverud. 2002. Scaling of sexual dimorphism in body mass: A phylogenetic analysis of Rensch’s Rule in primates. International Journal of Primatology 23(5): 1095-1135.
Fig. 1. The interspecific allometric equation (specific regression, identified as IA) and the independent contrasts equation (identified as IC) plotted for 105 primate species in raw data space, transformed to natural logarithms. The interspecific allometric equation is ln y = 0.139 + 0.080 (ln x), with r = 0.53. The phylogenetically corrected form of this equation, taken from the independent contrasts analysis, is ln y = 0.160 + 0.056 (ln x), with r = 0.26. The two equations are not significantly different from each other. The identified species are Mandrillus sphinx (M), Pongo pygmaeus (O), Gorilla gorilla (G), Pan troglodytes (P), and Homo sapiens (H).
A tree with punctuated equilibrium
[Figure: the tree, with 40 tips labeled A to Z and BA to OA]
The punctuated tree when we sample 10 species
[Figure: the tree for the 10 sampled species, with tips in the order I, G, E, B, D, C, J, A, F, H]
Two-species paired comparisons
[Figure: a tree with tips A, B, C, D, E, F, G, and H]
Pagel’s (1994) test for correlation with discrete 0/1 traits
When character 1 has state    Rates of change in character 2 are:
0                             0 → 1 at rate α0, 1 → 0 at rate β0
1                             0 → 1 at rate α1, 1 → 0 at rate β1
When character 2 has state    Rates of change in character 1 are:
0                             0 → 1 at rate γ0, 1 → 0 at rate δ0
1                             0 → 1 at rate γ1, 1 → 0 at rate δ1
From \ To    00    01    10    11
00           −−    α0    γ0    0
01           β0    −−    0     γ1
10           δ0    0     −−    α1
11           0     δ1    β1    −−
This can be set up as a 4 × 4 model of change with four states, 00, 01, 10, and 11, and likelihood ratio tests used.
Complete independence of the changes in the two characters involves restricting the parameters so that α1 = α0, β1 = β0, γ1 = γ0, and δ1 = δ0.
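The four-state rate matrix and its restricted, independent version can be built in a few lines. In this sketch the function names and numeric rates are hypothetical; states are ordered 00, 01, 10, 11 with character 1 written first, and each diagonal entry is set so that its row sums to zero.

```python
def pagel_rate_matrix(a0, a1, b0, b1, g0, g1, d0, d1):
    """4 x 4 rate matrix over the states 00, 01, 10, 11.

    a = alpha, b = beta (changes in character 2); g = gamma, d = delta
    (changes in character 1). The subscript is the state of the other character.
    """
    q = [
        [0.0, a0, g0, 0.0],    # from 00
        [b0, 0.0, 0.0, g1],    # from 01
        [d0, 0.0, 0.0, a1],    # from 10
        [0.0, d1, b1, 0.0],    # from 11
    ]
    for i in range(4):
        q[i][i] = -sum(q[i])   # rows of a rate matrix sum to zero
    return q

def independent_rate_matrix(alpha, beta, gamma, delta):
    """Restricted model: rates do not depend on the other character's state."""
    return pagel_rate_matrix(alpha, alpha, beta, beta, gamma, gamma, delta, delta)

# Hypothetical rates, chosen to be easy to tell apart.
q_full = pagel_rate_matrix(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
q_indep = independent_rate_matrix(1.0, 3.0, 5.0, 7.0)
for row in q_full:
    print(row)
```

The restricted model has four free rates where the full model has eight; a likelihood ratio test would compare the maximized likelihoods of the two on the tree.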