guenomu software -- model and algorithm in 2013

Post on 03-Jul-2015

description

This is a progress report presented to the Phylogenomics Group at UVigo in May 2013 on the then-current status of the guenomu software and the Bayesian model it implements. At that time I was experimenting with a mixture model, which has since been abandoned, and with the Hdist, which is still experimental. The presentation also describes the exchange algorithm for doubly-intractable distributions, the generalized Multiple-Try Metropolis, and the parallel PRNG used to minimize communication between jobs.

transcript

guenomu

Software and Model

Leonardo de O. Martins

University of Vigo

May 16th, 2013

Leo Martins (U Vigo) guenomu software 2013/5/16 1 / 15

Outline

1 The Model

2 The Sampling

3 The Code

Hierarchical Bayesian model

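$$
P(S, \Theta \mid D) \propto P(\theta_0)\, P(\vec{\lambda}_0)\, P(\alpha_0)\, P(S) \times \prod_{i=1}^{N} P(D_i \mid G_i, \vec{\theta}_i)\, P(\vec{\theta}_i \mid \theta_0)\, P(G_i \mid \vec{\lambda}_i, \vec{w}_i, S)\, P(\vec{\lambda}_i \mid \vec{\lambda}_0)\, P(\vec{w}_i \mid \alpha_i)\, P(\alpha_i \mid \alpha_0)
$$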

The mixture of distance distributions

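$$
P(G \mid \vec{\lambda}, \vec{w}, S) = \frac{w_1 e^{-(d_{DUPS}(G,S)/\lambda_{DUPS} + d_{LOSS}(G,S)/\lambda_{LOSS})} + w_2 e^{-d_{ILS}(G,S)/\lambda_{ILS}} + w_3 e^{-d_{RF}(G,S)/\lambda_{RF}}}{Z(\vec{\lambda}, \vec{w}, S)}
$$

$w_i \sim \mathrm{Gamma}(\alpha_{gene}, 1)$

$\lambda_x \sim \mathrm{Exp}(\Lambda_x)$

each gene has its own set of $w_i$ and $\lambda_i$

the distances $d_x(G, S)$ are scaled to account for different gene family sizes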
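As a rough illustration of the mixture above, the numerator can be evaluated directly once the four distances are known. The sketch below is not guenomu's code: the function name, the dictionary keys and the example values are all hypothetical, the distances are assumed to be already scaled, and the normalising constant Z is deliberately left out.

```python
import math

def unnormalised_mixture(d, lam, w):
    """Numerator of P(G | lambda, w, S) for one gene family.

    d and lam are dicts keyed by distance name (hypothetical keys);
    w is a sequence of the three mixture weights (w1, w2, w3).
    """
    # Duplications and losses share one mixture component.
    dup_loss = math.exp(-(d["dups"] / lam["dups"] + d["loss"] / lam["loss"]))
    ils = math.exp(-d["ils"] / lam["ils"])
    rf = math.exp(-d["rf"] / lam["rf"])
    return w[0] * dup_loss + w[1] * ils + w[2] * rf

# Illustrative values only.
lam = {"dups": 1.0, "loss": 2.0, "ils": 1.0, "rf": 0.5}
d = {"dups": 1.0, "loss": 0.0, "ils": 2.0, "rf": 1.0}
value = unnormalised_mixture(d, lam, (1.0, 1.0, 1.0))
```

When every distance is zero (gene tree identical to the species tree under all measures) the numerator reduces to the sum of the weights.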

Outline

1 The Model

2 The Sampling

3 The Code

Doubly-intractable distributions

$$
\pi(y \mid \theta) = \frac{q_\theta(y)}{Z(\theta)} = \frac{e^{\theta^t s(y)}}{Z(\theta)}; \qquad Z(\theta) = \sum_y e^{\theta^t s(y)} \tag{1}
$$

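augmented distribution: $\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\, \pi(\theta)\, h(\theta' \mid \theta)\, \pi(y' \mid \theta')$

Gibbs update of the auxiliary variables $\theta'$, $y'$:

I. draw $\theta' \sim h(\cdot \mid \theta)$

II. draw $y' \sim \pi(\cdot \mid \theta')$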

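exchange ratio from $\theta$ to $\theta'$

$$
\min\left\{1, \frac{q_\theta(y')\, \pi(\theta')\, h(\theta \mid \theta')\, q_{\theta'}(y)}{q_\theta(y)\, \pi(\theta)\, h(\theta' \mid \theta)\, q_{\theta'}(y')}\right\} \tag{2}
$$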

We draw $y'$ (the gene tree) through a secondary MCMC starting at its current value
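The exchange update can be sketched on a toy model where everything is cheap to compute. This is an assumption-laden illustration, not the guenomu implementation: the statistic s(y) = y, the standard normal prior, the random-walk proposal h and all function names are hypothetical, and y' is drawn exactly from a tiny discrete support instead of through a secondary MCMC. The point is only that Z(θ) never appears in the acceptance ratio (2).

```python
import math
import random

def log_q(theta, y):
    """Log of the unnormalised likelihood q_theta(y) = exp(theta * s(y)), with s(y) = y."""
    return theta * y

def log_prior(theta):
    """Standard normal prior on theta, up to an additive constant (assumption)."""
    return -0.5 * theta * theta

def sample_y(theta, support, rng):
    """Exact draw y' ~ pi(. | theta); feasible only because the toy support is tiny."""
    weights = [math.exp(log_q(theta, y)) for y in support]
    return rng.choices(support, weights=weights, k=1)[0]

def exchange_step(theta, y, support, rng, step=0.5):
    """One exchange update of theta given the observed y."""
    theta_new = rng.gauss(theta, step)          # I. symmetric h, so h cancels in (2)
    y_aux = sample_y(theta_new, support, rng)   # II. auxiliary data from the proposed theta
    log_r = (log_q(theta, y_aux) + log_prior(theta_new) + log_q(theta_new, y)
             - log_q(theta, y) - log_prior(theta) - log_q(theta_new, y_aux))
    if rng.random() < math.exp(min(0.0, log_r)):
        return theta_new, True
    return theta, False

rng = random.Random(1)
theta, accepted = 0.0, 0
for _ in range(200):
    theta, ok = exchange_step(theta, 3, list(range(6)), rng)
    accepted += ok
```

Replacing `sample_y` with a short MCMC run started at the current value gives the approximate variant described on the slide.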

Species tree proposal with the exchange algorithm

Generalized Multiple-Try Metropolis

MH: sample $y$, decide whether to accept it with probability $r$

$$
r = \frac{\pi(y)}{\pi(x)} \frac{q(y, x)}{q(x, y)} = \frac{\pi(y)}{\pi(x)} \frac{p(x \mid y)}{p(y \mid x)}
$$

MTM: choose $y$ among several samples, according to their relative weights

$$
r = \frac{w(y_1, x) + \cdots + w(y_k, x)}{w(x^*_1, y) + \cdots + w(x^*_k, y)}
$$

where $w(x, y) = \pi(x)\, q(x, y)\, \lambda(x, y) = \pi(x)\, p(y \mid x)\, \lambda(x, y)$

GMTM: weights $w(\cdot)$ do not need to represent probability distributions.

$$
r = \frac{\pi(y)\, p_k(x \mid y)}{\pi(x)\, p_k(y \mid x)} \frac{W_x}{W_y}
$$

where $W_y = \dfrac{w_i(y_i, x)}{\sum_{j=1}^{k} w_j(y_j, x)}$ for the chosen element $i$
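A minimal sketch of the plain MTM step may help fix the notation. It makes several assumptions that are not in the slides: a one-dimensional standard normal target standing in for π, a symmetric Gaussian proposal, and the choice λ(x, y) = 1/q(x, y), under which the weight w(y_j, x) reduces to π(y_j). The names `target` and `mtm_step` are hypothetical, and this is not the gene tree proposal used in guenomu.

```python
import math
import random

def target(x):
    """Unnormalised standard normal density, standing in for pi(x)."""
    return math.exp(-0.5 * x * x)

def mtm_step(x, k, rng, step=2.0):
    """One Multiple-Try Metropolis update with k trial points."""
    # Draw k trials from the symmetric proposal around x and weight them.
    trials = [rng.gauss(x, step) for _ in range(k)]
    w_trials = [target(y) for y in trials]
    # Choose y among the trials according to their relative weights.
    y = rng.choices(trials, weights=w_trials, k=1)[0]
    # Reference set: k - 1 draws around y, plus the current state x.
    refs = [rng.gauss(y, step) for _ in range(k - 1)] + [x]
    w_refs = [target(z) for z in refs]
    r = sum(w_trials) / sum(w_refs)
    if rng.random() < min(1.0, r):
        return y, True
    return x, False

rng = random.Random(2013)
x, chain = 0.0, []
for _ in range(5000):
    x, _accepted = mtm_step(x, 5, rng)
    chain.append(x)
mean = sum(chain) / len(chain)
```

With k = 1 the reference set is just the current state and the step reduces to ordinary Metropolis with a symmetric proposal.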

gene tree proposal with GMTM or MTM

Outline

1 The Model

2 The Sampling

3 The Code

RF distance, Assignment cost (Hdist)

A parallel pseudo-random number generator (PRNG)

Given a seed and an algorithm, we have a stream of PRNs.

[Figure: diagram with the labels seed, PRNG1, four PRNG2 generators, streams x1, x2, x3, x4, and elements x11, x12]

Using a second algorithm, the first stream will give us a sequence of seeds. We use the 150 parameter sets for the Tausworthe (LFSR) generators (L'Ecuyer, Math. Comput. 1999, p. 261). Therefore, given the seed, we can predict all states of all streams.
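The seeding scheme can be imitated with any pair of generators. The sketch below assumes Python's built-in Mersenne Twister in both roles, purely for illustration; it does not use the Tausworthe parameter sets mentioned above, and `job_generators` is a hypothetical name.

```python
import random

def job_generators(seed, job_id, n_jobs):
    """Return (common, private) generators for one job, derived from a single seed."""
    master = random.Random(seed)  # plays the role of PRNG1, producing the first stream
    # The first stream supplies one seed per job.
    job_seeds = [master.getrandbits(64) for _ in range(n_jobs)]
    private = random.Random(job_seeds[job_id])  # plays the role of one PRNG2 stream
    # After the seeds are drawn, the master is in the same state on every job.
    return master, private

common0, private0 = job_generators(42, 0, 4)
common1, private1 = job_generators(42, 1, 4)
```

Because every generator is a deterministic function of the seed, rerunning with the same seed reproduces the state of every stream.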

A parallel pseudo-random number generator (PRNG)

In our gene/species model:

we split gene families among jobs

all jobs receive the seed (broadcast) and can therefore reproduce the same x1. That's cheaper than communicating the states.

each job uses its own x(i+1) for sampling new gene trees etc. and can work in parallel. They use the common x1 for sampling e.g. a new species tree, which needs synchronization.

the only thing that must be shared is thus the proposal values (AllReduce) when updating "global" parameters, so that all jobs can make the same acceptance/rejection decision.
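The synchronised accept/reject decision can be simulated in a single process. This is only a sketch under stated assumptions: the jobs are simulated by a loop, a plain `sum` stands in for the AllReduce, the per-job log-ratio contributions are made-up numbers, and `synchronised_decision` is a hypothetical name, not a guenomu or MPI function.

```python
import math
import random

def synchronised_decision(seed, partial_log_ratios):
    """Simulate every job taking the accept/reject decision for a global parameter."""
    # Stand-in for an AllReduce (sum): each job ends up with the same total.
    total = sum(partial_log_ratios)
    decisions = []
    for _job in range(len(partial_log_ratios)):
        common = random.Random(seed)  # each job rebuilds the same common stream
        u = common.random()           # identical uniform draw on every job
        decisions.append(u < math.exp(min(0.0, total)))
    return decisions

# Three simulated jobs, each contributing the log-ratio of its own gene families.
decisions = synchronised_decision(2013, [1.0, 2.0, 0.5])
```

Since both the summed ratio and the uniform draw agree across jobs, no job needs to be told the outcome.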

Each job looks like an independent analysis

https://bitbucket.org/leomrtns/guenomu
