Statistical binning enables an accurate coalescent-based estimation of the avian tree Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, and Tandy Warnow. Science (2014)
Page 1: Statistical binning enables an accurate coalescent-based ...eceweb.ucsd.edu/~smirarab/assets/presentation-recomb.pdf• Concatenate “combinable” sets of genes into “supergenes”


Avian whole genomes phylogenies [Jarvis, Mirarab, et al., Science, 2014]

[Figure: phylogeny of 48 representative birds, with a plot of species tree error against the amount of data (i.e., # of genes), annotated "Hope!"]

Gene tree discordance

[Figure: gene trees on Eagle, Owl, Falcon, and Finch for gene 1, gene 2, ..., gene 999, gene 1000, shown next to the species tree on the same four taxa]

gene: recombination-free orthologous regions in genomes

Causes of gene tree discordance:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)

• Modeled by multi-species coalescent

• Highly probable for radiations (e.g., short branches) such as the bird radiation; 60 mya

• The species tree is identifiable from the gene tree distribution [Degnan and Salter, 2005]
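To make the identifiability point concrete, the three-species case has a well-known closed form under the multi-species coalescent: if the internal branch of the species tree has length T in coalescent units, the gene tree that matches the species tree has probability 1 − (2/3)e^(−T), and each of the two discordant topologies has probability (1/3)e^(−T). The sketch below only evaluates that formula; the function name and the taxon labels A, B, C are illustrative assumptions, not part of the talk.

```python
import math


def rooted_triple_probabilities(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the length of the internal branch in coalescent units.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


# Shorter internal branches give more discordance.
for t in (0.1, 0.5, 1.0, 2.0):
    probs = rooted_triple_probabilities(t)
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

For any T > 0 the matching topology is the most probable one, and as T approaches 0 (a rapid radiation) all three topologies approach probability 1/3.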

Species tree estimation from phylogenomic data (approach 1: concatenation)

[Figure: the alignments of gene 1, gene 2, ..., gene 999, gene 1000 (four sequences each, e.g., ACTGCACACCG, ACTGC-CCCCG, AATGC-CCCCG, -CTGCACACGG) are concatenated into a single alignment; maximum likelihood (ML) on the concatenated alignment yields one tree on Eagle, Owl, Falcon, and Finch with a branch support of 81%. Inset plot: error against amount of data.]

- Statistically inconsistent and positively misleading [Roch and Steel, Theo. Pop. Gen., 2014]

- Mixed accuracy in simulations [Kubatko and Degnan, Systematic Biology, 2007] [Mirarab, et al., Systematic Biology, 2014]
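The concatenation step itself is simple bookkeeping, and a minimal sketch may help. It assumes each gene alignment is a mapping from taxon name to aligned sequence; the function name `concatenate` and the assignment of the slide's example sequences to Eagle, Owl, Falcon, and Finch are assumptions made for illustration. The ML tree search on the resulting supermatrix is not shown.

```python
def concatenate(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa absent from a gene are padded with the `missing` character.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Taxon labels for these example sequences are assumed, not given on the slide.
gene1 = {"Eagle": "ACTGCACACCG", "Owl": "ACTGC-CCCCG",
         "Falcon": "AATGC-CCCCG", "Finch": "-CTGCACACGG"}
gene2 = {"Eagle": "CTGAGCATCG", "Owl": "CTGAGC-TCG",
         "Falcon": "ATGAGC-TC-", "Finch": "CTGA-CAC-G"}
supermatrix = concatenate([gene1, gene2])
for taxon, sequence in supermatrix.items():
    print(taxon, sequence)
```

Every site in the supermatrix is then analyzed as if it evolved on one shared tree, which is the assumption that ILS violates.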

Species tree estimation from phylogenomic data (approach 2: summary methods)

[Figure: a gene tree on Eagle, Owl, Falcon, and Finch is estimated from each gene alignment (gene 1, gene 2, ..., gene 999, gene 1000); a summary method combines the gene trees into one species tree with a branch support of 78%. Inset plot: error against amount of data, annotated "True gene trees".]

Can be statistically consistent

• MP-EST (maximum pseudo-likelihood) [Liu, Yu, Edwards, BMC Evol. Bio., 2010]
• BUCKy-pop., NJst, STAR, ASTRAL, …
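As a toy illustration of what a summary method does, the sketch below handles only three taxa and simply returns the most frequent rooted gene tree topology. This is not MP-EST, ASTRAL, or any other method named on the slide; it relies on the fact that, for three species, the most probable gene tree under the multi-species coalescent matches the species tree. The function names and the example counts are hypothetical.

```python
from collections import Counter


def cherry(newick_triple):
    """Return the two sister taxa of a rooted triple such as '((A,B),C);'."""
    text = newick_triple.strip().rstrip(";")
    start = text.index("(", 1)   # opening bracket of the inner pair
    end = text.index(")", start)
    return frozenset(name.strip() for name in text[start + 1:end].split(","))


def majority_triple(gene_trees):
    """Pick the sister pair supported by the largest number of gene trees."""
    counts = Counter(cherry(tree) for tree in gene_trees)
    pair, _ = counts.most_common(1)[0]
    return pair, counts


# Hypothetical gene tree counts for three of the taxa used on the slides.
gene_trees = (["((Eagle,Owl),Falcon);"] * 5
              + ["((Eagle,Falcon),Owl);"] * 3
              + ["(Eagle,(Owl,Falcon));"] * 2)
pair, counts = majority_triple(gene_trees)
print(sorted(pair), dict(counts))
```

With true gene trees the majority converges to the species tree as the number of genes grows; with estimated gene trees, estimation error can distort the counts.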

Gene trees on the avian dataset

14,000 “genes”: 8,000 exons, 2,500 introns, and 3,500 Ultra-Conserved Elements

[Figure: histogram of branch bootstrap support (0% to 100%), a measure of confidence in estimated gene tree branches, against the percentage of branches (0 to 20%), with the median and mean marked]

14,000 noisy gene trees

[Figure: species trees on the (A) avian and (B) metazoan data sets estimated with binned and unbinned MP-EST, with branch support values; clades that conflict with other lines of strong evidence are marked]

Simulation studies

[Figure: simulation pipeline from the true (model) species tree to true gene trees, sequence data, estimated gene trees, and the estimated species tree. The true and estimated species trees, both on Finch, Falcon, Owl, Eagle, and Pigeon, are compared.]

Error metric: percentage of branches in true tree that are missing from the estimated tree
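The error metric can be computed by comparing the bipartitions (internal branches) of the two trees. The sketch below is one possible implementation under stated assumptions: trees are written as nested Python tuples of taxon names, and the two five-taxon topologies are hypothetical, since the slide only lists the leaf order of each tree.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of taxon names."""
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        cluster = frozenset().union(*(visit(child) for child in node))
        clusters.append(cluster)
        return cluster

    clusters = []
    taxa = visit(tree)
    anchor = min(taxa)  # each split is stored as the side without the anchor taxon
    for cluster in clusters:
        if 2 <= len(cluster) <= len(taxa) - 2:
            splits.add(taxa - cluster if anchor in cluster else cluster)
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree branches that are missing from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


# Hypothetical topologies on the five taxa shown on the slide.
true_tree = ((("Finch", "Falcon"), "Owl"), ("Eagle", "Pigeon"))
estimated_tree = ((("Finch", "Owl"), "Falcon"), ("Eagle", "Pigeon"))
print(fn_rate(true_tree, estimated_tree))
```

In this example the true tree has two internal branches and the estimated tree recovers one of them, so the FN rate is 0.5.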

Gene trees on the avian dataset

Avian-like simulations (1000 genes) [Mirarab, et al., Science, 2014]

[Figure: species tree topological error (FN, axis from 5% to 20%) against gene sequence length (1,500, 1,000, 500, 250), with gene tree error increasing as sequences get shorter, for MP-EST (a statistically consistent summary method) and for concatenation (ML)]

Gene tree error matters

[Ané, et al, MBE, 2007] [Patel, et al, MBE, 2013] [Gatesy, Springer, MPE, 2014] [Mirarab, et al., Systematic Biology, 2014]

Statistical binning: idea

• Concatenation has good accuracy with low levels of ILS

• Some pairs of genes are concordant (at least in topology)

• Concatenate “combinable” sets of genes into “supergenes” to increase the phylogenetic signal

• How can combinable genes be found when gene tree estimation is hard?

[Figure: a spectrum with summary methods (all “genes” independent) at one end, concatenation (all “genes” put together) at the other, and binning in between]

Statistical tests of combinability

[Figure: two gene trees, g1 and g2, on taxa A to G. The branches of g1 have support 40%, 70%, 85%, and 20%; those of g2 have 65%, 25%, 90%, and 70%. Branches with support <50% are removed, leaving the 70% and 85% branches of g1 and the 65%, 90%, and 70% branches of g2. The two restricted trees are labeled "Compatible", and a single tree on A to G is shown for them.]

• Restrict genes to parts that have a minimum support

• Test combinability based on the supported parts of gene trees
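The two steps on this slide can be sketched as a filter followed by a pairwise compatibility check. The sketch assumes each gene tree is given as a list of (bipartition side, bootstrap support) pairs, and it uses the standard criterion that two bipartitions are compatible when at least one of the four intersections of their sides is empty. The topologies of g1 and g2 below are hypothetical; only the support values are taken from the slide, and the talk does not specify the exact test the authors implemented.

```python
from itertools import product


def supported(splits, threshold):
    """Keep only the bipartitions whose support is at least `threshold`."""
    return [side for side, support in splits if support >= threshold]


def pair_compatible(a, b, taxa):
    """Two bipartitions are compatible iff one of the four side intersections is empty."""
    return any(not (x & y) for x, y in product((a, taxa - a), (b, taxa - b)))


def combinable(splits1, splits2, taxa, threshold=50):
    """Compare two gene trees using only their well-supported branches."""
    kept1 = supported(splits1, threshold)
    kept2 = supported(splits2, threshold)
    return all(pair_compatible(a, b, taxa) for a in kept1 for b in kept2)


taxa = frozenset("ABCDEFG")
# Hypothetical topologies carrying the support values shown on the slide.
g1 = [(frozenset("AB"), 70), (frozenset("FG"), 85),
      (frozenset("CD"), 40), (frozenset("ABCD"), 20)]
g2 = [(frozenset("AB"), 65), (frozenset("CE"), 25),
      (frozenset("FG"), 90), (frozenset("ABD"), 70)]

print(combinable(g1, g2, taxa, threshold=50))  # only supported branches compared
print(combinable(g1, g2, taxa, threshold=0))   # every branch compared
```

In this example the fully resolved trees conflict (the CD branch of g1 against the CE branch of g2), but the conflict involves only branches below the 50% threshold, so the two genes count as combinable.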

Incompatibility graph

RESEARCH ARTICLE SUMMARY, AVIAN GENOMICS. Science, sciencemag.org, 12 December 2014, Vol 346, Issue 6215, p. 1337

Statistical binning enables an accurate coalescent-based estimation of the avian tree

Siavash Mirarab, Md. Shamsuzzoha Bayzid, Bastien Boussau, Tandy Warnow*

INTRODUCTION: Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods, based on the multispecies coalescent model and that combine gene trees, can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the species tree when estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

RATIONALE: In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated using a coalescent-based species tree method from these supergene trees.

RESULTS: Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.

CONCLUSIONS: Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.

[Figure: The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees. Panel labels: statistical binning technique; traditional pipeline (unbinned); statistical binning pipeline; sequence data; gene alignments; estimated gene trees; incompatibility graph; binned supergene alignments; supergene trees; species tree. Insets: a gene tree; incompatibility between two gene trees.]

The list of author affiliations is available in the full article online.

*Corresponding author. Cite this article as S. Mirarab et al., Science 346, 1250463 (2014). DOI: 10.1126/science.1250463

Read the full article at http://dx.doi.org/10.1126/science.1250463

Published by AAAS. Downloaded from www.sciencemag.org on December 11, 2014.

• Find independent sets: sets with no edges between any pairs of nodes

• Genes in each “bin” are all pairwise compatible

• Minimum vertex coloring (NP-hard)

• Brélaz heuristic

• Modified the heuristic to produce balanced bins where possible
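Binning as graph coloring can be sketched with a greedy heuristic in the style of Brélaz (DSATUR): repeatedly pick the uncolored vertex whose neighbors already use the most distinct colors, and give it a color none of its neighbors has. The code below is an illustrative sketch, not the authors' implementation. In particular, the balancing rule (placing a gene in the smallest allowed bin rather than the lowest-numbered one) is an assumed stand-in for the modification mentioned on the slide, and the gene names are hypothetical.

```python
def dsatur_bins(genes, incompatible):
    """Greedy DSATUR-style coloring of the incompatibility graph.

    genes: iterable of gene identifiers (the vertices).
    incompatible: iterable of (gene, gene) pairs (the edges).
    Returns a list of bins; no bin contains two incompatible genes.
    """
    neighbours = {g: set() for g in genes}
    for u, v in incompatible:
        neighbours[u].add(v)
        neighbours[v].add(u)

    colour = {}
    bins = []

    def used_colours(g):
        return {colour[n] for n in neighbours[g] if n in colour}

    while len(colour) < len(neighbours):
        # Most saturated uncolored vertex, ties broken by degree.
        g = max((v for v in neighbours if v not in colour),
                key=lambda v: (len(used_colours(v)), len(neighbours[v])))
        forbidden = used_colours(g)
        allowed = [c for c in range(len(bins)) if c not in forbidden]
        if allowed:
            # Assumed balancing rule: smallest allowed bin first.
            c = min(allowed, key=lambda c: len(bins[c]))
        else:
            bins.append([])
            c = len(bins) - 1
        bins[c].append(g)
        colour[g] = c
    return bins


genes = ["g1", "g2", "g3", "g4", "g5"]
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g4", "g5")]
bins = dsatur_bins(genes, edges)
print(bins)
```

Each bin is an independent set of the incompatibility graph, so all genes in a bin are pairwise compatible. Because minimum vertex coloring is NP-hard, the heuristic gives no guarantee that the number of bins is minimal.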

Statistical binning: overview

Original version: unweighted [Mirarab, et al., Science, 2014]

New version: weighted [Bayzid, Mirarab, Warnow, arXiv, 2015]

[Figure: pipeline from gene sequence data (g1, g2, g3, ..., gk) to estimated initial gene trees, the incompatibility graph (using a support threshold), binned supergene alignments (partitioned), supergene trees (weighted), and the species tree (MP-EST)]
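The slides do not spell out what "weighted" means. One reading, assumed here, is that each supergene tree is passed to the summary method once per gene in its bin, so that large bins are not reduced to a single vote. The sketch below shows only that bookkeeping step; the function name and the example bins and trees are hypothetical.

```python
def weighted_supergene_trees(bins, supergene_trees):
    """Repeat each supergene tree once per gene in its bin.

    bins: list of lists of gene identifiers.
    supergene_trees: one tree (any representation) per bin, in the same order.
    """
    if len(bins) != len(supergene_trees):
        raise ValueError("need exactly one supergene tree per bin")
    weighted = []
    for genes, tree in zip(bins, supergene_trees):
        weighted.extend([tree] * len(genes))
    return weighted


# Hypothetical bins and supergene trees in Newick form.
example_bins = [["g1", "g4"], ["g2", "g5"], ["g3"]]
example_trees = ["((Eagle,Owl),Falcon,Finch);",
                 "((Eagle,Falcon),Owl,Finch);",
                 "((Eagle,Finch),Owl,Falcon);"]
summary_input = weighted_supergene_trees(example_bins, example_trees)
print(len(summary_input), "trees passed to the summary method")
```

Under this reading, the unweighted version would pass each supergene tree exactly once, regardless of bin size.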

Avian-like simulation results

48 avian-like species, 1000 genes

[Figure: species tree topological error (FN, axis from 5% to 20%) against gene sequence length (1,500, 1,000, 500, 250), with gene tree error increasing as sequences get shorter, for MP-EST, MP-EST with binning (MP-EST − Binned), and CA-ML]

Binning also improves other measures of accuracy

• More accurate gene tree distributions

• Better species tree bootstrap support (i.e., fewer highly supported false positives)

• More accurate species tree branch lengths

Page 38: Statistical binning enables an accurate coalescent-based ...eceweb.ucsd.edu/~smirarab/assets/presentation-recomb.pdf• Concatenate “combinable” sets of genes into “supergenes”

Binning on the avian dataset

The binned tree was highly supported and was largely congruent with concatenation

and binning reduces this noise, which suggests that the overall impact of binning is beneficial. These results are also consistent with the observation that coalescent-based summary methods can be robust to recombination (49).

Our study explored gene tree–estimation error arising from insufficient phylogenetic signal in the gene sequences; however, gene tree–estimation error can also come from poorly estimated alignments (50) or errors introduced during the tree inference (51, 52). Because our studies focused on insufficient phylogenetic signal, we have no evidence that binning could reduce phylogenetic error due to alignment error or misspecification for the sequence evolution model. Consequently, appropriate care should be devoted to obtaining good alignments and choosing an adequate model of sequence evolution to reconstruct both gene and supergene trees.

In our simulation, we only allowed ILS as a source of discord between true gene trees and true species trees; hence, these model conditions favor MP-EST (which is based on the same model used for simulations) over concatenation (which assumes no ILS is present). Given this, the fact that unbinned MP-EST is less accurate than concatenation in many conditions is noteworthy. Future studies based on model conditions in which other sources of gene tree discord (e.g., duplication and loss, incorrect orthology assessments, recombination, introgression, horizontal gene transfer, and hybridization) are included would enable a better understanding of the relative accuracy of concatenation and coalescent-based species tree estimation and the impact of using binning under those conditions.

Fig. 5. Results on the (A) avian and (B) metazoan biological data sets using binned and unbinned MP-EST. Branches without designation represent 100% support.

ACKNOWLEDGMENTS

Genome assemblies and annotations of avian genomes in this study are available on the avian phylogenomics website (http://phybirds.genomics.org.cn), GigaDB (http://dx.doi.org/10.5524/101000), National Center for Biotechnology Information (NCBI), and ENSEMBL (NCBI and Ensembl accession numbers are provided in table S2). The majority of this study was supported by an internal funding from BGI. In addition, G.Z. was supported by a Marie Curie International Incoming Fellowship grant (300837); M.T.P.G. was supported by a Danish National Research Foundation grant (DNRF94) and a Lundbeck Foundation grant (R52-A5062); C.L. and Q.L. were partially supported by a Danish Council for Independent Research Grant (10-081390); and E.D.J. was supported by the Howard Hughes Medical Institute and NIH Directors Pioneer Award DP1OD000448.

The Avian Genome Consortium

Chen Ye,1 Shaoguang Liang,1 Zengli Yan,1 M. Lisandra Zepeda,2

Paula F. Campos,2 Amhed Missael Vargas Velazquez,2

José Alfredo Samaniego,2 María Avila-Arcos,2 Michael D. Martin,2

Ross Barnett,2 Angela M. Ribeiro,3 Claudio V. Mello,4 Peter V. Lovell,4

Daniela Almeida,3,5 Emanuel Maldonado,3 Joana Pereira,3

Kartik Sunagar,3,5 Siby Philip,3,5 Maria Gloria Dominguez-Bello,6

Michael Bunce,7 David Lambert,8 Robb T. Brumfield,9

Frederick H. Sheldon,9 Edward C. Holmes,10 Paul P. Gardner,11

Tammy E. Steeves,11 Peter F. Stadler,12 Sarah W. Burge,13

Eric Lyons,14 Jacqueline Smith,15 Fiona McCarthy,16

Frederique Pitel,17 Douglas Rhoads,18 David P. Froman19

1 China National GeneBank, BGI-Shenzhen, Shenzhen 518083, China.
2 Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5-7, 1350 Copenhagen, Denmark.
3 CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, 4050-123 Porto, Portugal.
4 Department of Behavioral Neuroscience Oregon Health & Science University Portland, OR 97239, USA.
5 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal.
6 Department of Biology, University of Puerto Rico, Av Ponce de Leon, Rio Piedras Campus, JGD 224, San Juan, PR 009431-3360, USA.
7 Trace and Environmental DNA laboratory, Department of Environment and Agriculture, Curtin University, Perth, Western Australia 6102, Australia.
8 Environmental Futures Research Institute, Griffith University, Nathan, Queensland 4121, Australia.
9 Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA.
10 Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Biological Sciences and Sydney Medical School, The University of Sydney, Sydney NSW 2006, Australia.
11 School of Biological Sciences, University of Canterbury, Christchurch 8140, New Zealand.
12 Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstrasse 16-18, D-04107 Leipzig, Germany.
13 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK.
14 School of Plant Sciences, BIO5 Institute, University of Arizona, Tucson, AZ 85721, USA.
15 Division of Genetics and Genomics, The Roslin Institute and Royal (Dick) School of Veterinary Studies, The Roslin Institute Building, University of Edinburgh, Easter Bush Campus, Midlothian EH25 9RG, UK.
16 Department of Veterinary Science and Microbiology, University of Arizona, 1117 E Lowell Street, Post Office Box 210090-0090, Tucson, AZ 85721, USA.
17 Laboratoire de Génétique Cellulaire, INRA Chemin de Borde-Rouge, Auzeville, BP 52627, 31326 CASTANET-TOLOSAN CEDEX, France.
18 Department of Biological Sciences, Science and Engineering 601, University of Arkansas, Fayetteville, AR 72701, USA.
19 Department of Animal Sciences, Oregon State University, Corvallis, OR 97331, USA.

SUPPLEMENTARY MATERIALS

www.sciencemag.org/content/346/6215/1311/suppl/DC1
Supplementary Text
Figs. S1 to S42
Tables S1 to S51
References (92–192)

27 January 2014; accepted 6 November 2014
10.1126/science.1251385

RESEARCH ARTICLE

Whole-genome analyses resolve early branches in the tree of life of modern birds

Erich D. Jarvis,1*† Siavash Mirarab,2* Andre J. Aberer,3 Bo Li,4,5,6 Peter Houde,7

Cai Li,4,6 Simon Y. W. Ho,8 Brant C. Faircloth,9,10 Benoit Nabholz,11

Jason T. Howard,1 Alexander Suh,12 Claudia C. Weber,12 Rute R. da Fonseca,6

Jianwen Li,4 Fang Zhang,4 Hui Li,4 Long Zhou,4 Nitish Narula,7,13 Liang Liu,14

Ganesh Ganapathy,1 Bastien Boussau,15 Md. Shamsuzzoha Bayzid,2

Volodymyr Zavidovych,1 Sankar Subramanian,16 Toni Gabaldón,17,18,19

Salvador Capella-Gutiérrez,17,18 Jaime Huerta-Cepas,17,18 Bhanu Rekepalli,20

Kasper Munch,21 Mikkel Schierup,21 Bent Lindow,6 Wesley C. Warren,22

David Ray,23,24,25 Richard E. Green,26 Michael W. Bruford,27 Xiangjiang Zhan,27,28

Andrew Dixon,29 Shengbin Li,30 Ning Li,31 Yinhua Huang,31

Elizabeth P. Derryberry,32,33 Mads Frost Bertelsen,34 Frederick H. Sheldon,33

Robb T. Brumfield,33 Claudio V. Mello,35,36 Peter V. Lovell,35 Morgan Wirthlin,35

Maria Paula Cruz Schneider,36,37 Francisco Prosdocimi,36,38 José Alfredo Samaniego,6

Amhed Missael Vargas Velazquez,6 Alonzo Alfaro-Núñez,6 Paula F. Campos,6

Bent Petersen,39 Thomas Sicheritz-Ponten,39 An Pas,40 Tom Bailey,41 Paul Scofield,42

Michael Bunce,43 David M. Lambert,16 Qi Zhou,44 Polina Perelman,45,46

Amy C. Driskell,47 Beth Shapiro,26 Zijun Xiong,4 Yongli Zeng,4 Shiping Liu,4

Zhenyu Li,4 Binghang Liu,4 Kui Wu,4 Jin Xiao,4 Xiong Yinqi,4 Qiuemei Zheng,4

Yong Zhang,4 Huanming Yang,48 Jian Wang,48 Linnea Smeds,12 Frank E. Rheindt,49

Michael Braun,50 Jon Fjeldsa,51 Ludovic Orlando,6 F. Keith Barker,52

Knud Andreas Jønsson,51,53,54 Warren Johnson,55 Klaus-Peter Koepfli,56

Stephen O’Brien,57,58 David Haussler,59 Oliver A. Ryder,60 Carsten Rahbek,51,54

Eske Willerslev,6 Gary R. Graves,51,61 Travis C. Glenn,62 John McCormack,63

Dave Burt,64 Hans Ellegren,12 Per Alström,65,66 Scott V. Edwards,67

Alexandros Stamatakis,3,68 David P. Mindell,69 Joel Cracraft,70 Edward L. Braun,71

Tandy Warnow,2,72† Wang Jun,48,73,74,75,76† M. Thomas P. Gilbert,6,43† Guojie Zhang4,77†

To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.

The diversification of species is not always gradual but can occur in rapid radiations, especially after major environmental changes (1, 2). Paleobiological (3–7) and molecular (8) evidence suggests that such “big bang” radiations occurred for neoavian birds (e.g., songbirds, parrots, pigeons, and others) and placental mammals, representing 95% of extant avian and mammalian species, after the Cretaceous to Paleogene (K-Pg) mass extinction event about 66 million years ago (Ma). However, other nuclear (9–12) and mitochondrial (13, 14) DNA studies propose an earlier, more gradual diversification, beginning within the Cretaceous 80 to 125 Ma. This debate is confounded by findings that different data sets (15–19) and analytical methods (20, 21) often yield contrasting species trees. Resolving such timing and phylogenetic relationships is important for comparative genomics, which can inform about human traits and diseases (22).

Recent avian studies based on fragments of 5 [~5000 base pairs (bp) (8)] and 19 [31,000 bp (17)] genes recovered some relationships inferred from morphological data (15, 23) and DNA-DNA hybridization (24), postulated new relationships, and contradicted many others. Consistent with most previous molecular and contemporary morphological studies (15), they divided modern birds (Neornithes) into Palaeognathae (tinamous and flightless ratites), Galloanseres [Galliformes (landfowl) and Anseriformes (waterfowl)], and Neoaves (all other extant birds). Within Neoaves,

1320 12 DECEMBER 2014 • VOL 346 ISSUE 6215 sciencemag.org SCIENCE

A FLOCK OF GENOMES

Downloaded from www.sciencemag.org on December 11, 2014

[Jarvis, Mirarab, et al., Science, 2014]

Page 39:

Summary

• Low phylogenetic signal per gene prevented accurate coalescent-based analyses of the avian dataset

• Statistical binning groups sets of genes based on statistical measures of combinability

• Statistical binning improves accuracy compared to both unbinned summary methods and concatenation

• Statistical binning enabled a coalescent-based analysis of the avian dataset; results were largely congruent with concatenation
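To make the grouping step concrete, here is a rough Python sketch of binning followed by concatenation into supergenes. It assumes the pairwise combinability test has already been run and its outcome is supplied as a set of incompatible gene pairs; the `incompatible` input, the greedy first-fit grouping, and all function names are hypothetical stand-ins for illustration and should not be read as the authors' exact procedure.

```python
def bin_genes(genes, incompatible):
    """Greedily place each gene in the first bin that holds no incompatible gene.

    `incompatible` is an assumed input: a set of frozenset pairs of gene names
    that some statistical test judged not combinable.
    """
    bins = []
    for gene in genes:
        for members in bins:
            if all(frozenset((gene, other)) not in incompatible for other in members):
                members.append(gene)
                break
        else:
            # No existing bin is compatible with this gene, so open a new one.
            bins.append([gene])
    return bins


def concatenate(bins, alignments):
    """Join per-gene alignments (taxon -> sequence) into one supergene per bin.

    Assumes every gene alignment covers the same set of taxa.
    """
    supergenes = []
    for members in bins:
        taxa = sorted(alignments[members[0]])
        supergenes.append(
            {taxon: "".join(alignments[g][taxon] for g in members) for taxon in taxa}
        )
    return supergenes


# Toy data: three genes, two taxa, with g1 and g2 judged not combinable.
genes = ["g1", "g2", "g3"]
incompatible = {frozenset(("g1", "g2"))}
alignments = {
    "g1": {"A": "AC", "B": "AG"},
    "g2": {"A": "TT", "B": "TA"},
    "g3": {"A": "GG", "B": "GC"},
}
bins = bin_genes(genes, incompatible)
supergenes = concatenate(bins, alignments)
print(bins)
print(supergenes)
```

Each supergene alignment would then be used to estimate one supergene tree, and those trees passed to a summary method such as MP-EST.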

Page 40:

More generally …

• Genome-scale data provides a wealth of information

• Yet, reconstruction of species phylogenies remains challenging
  • Limited data per gene
  • Scalability to many species: ASTRAL-II (ISMB 2015)
  • Impact of model violations, missing data, etc.
  • Multiple sources of gene tree discordance

• Many interesting statistical and computational questions and a need for method development

Page 41:

Acknowledgments

Jim Leebens-Mack (UGA)
Norman Wickett (U Chicago)
Gane Wong (U of Alberta)
Keshav Pingali
S.M. Bayzid, Nam Nguyen (now at UIUC)
Tandy Warnow
Théo Zimmermann
Bastien Boussau (Université Lyon)
Erich Jarvis (Duke, HHMI)
Tom Gilbert (U Copenhagen)
Guojie Zhang (BGI, China)
Ed Braun (U Florida)
……

HHMI international student fellowship

Page 42:

Lack of phylogenetic signal

1. Limited sequence length for each gene

2. Insufficient variation in each gene

[Figure: average branch bootstrap support (0% to 100%) versus sequence length (log scale; 100, 1,000, 10,000, 100,000)]

Page 43:

Increasing the number of genes

[Mirarab, et al., Science, 2014]

Page 45:

Incomplete Lineage Sorting (ILS)

• A population-level process related to inheritance and maintenance of alleles

• Omnipresent; most likely for short times between speciation events and/or large population size

• We have statistical models of ILS (multi-species coalescent)

• The species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees [Degnan and Salter, Int. J. Org. Evolution, 2005]

[Figure: tracing alleles through generations]
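The simplest instance of "the species tree defines a probability distribution on the gene trees" is the rooted three-taxon case with one sampled allele per species. For species tree ((A,B),C) with internal branch length t in coalescent units, the multi-species coalescent gives probability 1 − (2/3)e^(−t) to the matching gene tree topology and (1/3)e^(−t) to each of the two discordant ones. The Python sketch below evaluates that standard result; the function name and the string labels for topologies are illustrative choices.

```python
import math


def triplet_probabilities(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    `t` is the length of the internal branch in coalescent units.
    """
    # Probability of each discordant topology: the A and B lineages fail to
    # coalesce on the internal branch (e^-t), then one of three equally
    # likely pairs coalesces first in the ancestral population.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


for t in (0.0, 0.5, 2.0):
    print(t, triplet_probabilities(t))
```

A short internal branch (small t) pushes the three topologies toward equal frequency, which matches the slide's point that ILS is most likely for short times between speciation events.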

Page 46:

Avian-like simulation results

• Avian-like simulation; 1000 genes, 48 taxa, high levels of ILS

[Figure: MP-EST results, including a branch length accuracy panel; the x-axes carry an arrow labeled "More information per gene"]

[Mirarab, et al., Science, 2014]

Page 47:

Gene tree distribution error

• We can quantify gene tree distribution error using triplet frequency:

• We can compare triplet frequencies obtained from true gene trees and from the estimated gene trees (for all triplets of taxa)

[Figure: triplet topology frequencies to compare]
true distribution: A B C 70%, B A C 15%, C A B 15%
estimated distribution: A B C 65%, B A C 25%, C A B 15%
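The comparison described on this slide can be sketched in Python as follows. The sketch assumes rooted gene trees written as nested tuples of taxon names, and it measures the gap between the two distributions with the total variation distance averaged over all triplets. The slide does not name the distance that was actually used, so that choice, like the function names, is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Return (leaves of tree, leaf sets of all internal nodes) for a rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def triplet_topology(cluster_list, a, b, c):
    """Return the sister pair among a, b, c, or None if the triplet is unresolved."""
    for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
        # x and y are sisters if some clade contains both but excludes z.
        if any(x in cl and y in cl and z not in cl for cl in cluster_list):
            return (x, y)
    return None


def triplet_frequencies(trees, taxa):
    """Map each taxon triplet to the observed frequency of each topology."""
    cluster_lists = [clusters(tree)[1] for tree in trees]
    freqs = {}
    for a, b, c in combinations(sorted(taxa), 3):
        counts = Counter(triplet_topology(cl, a, b, c) for cl in cluster_lists)
        total = sum(counts.values())
        freqs[(a, b, c)] = {topo: n / total for topo, n in counts.items()}
    return freqs


def mean_distance(true_freqs, est_freqs):
    """Average total variation distance over all triplets (an assumed metric)."""
    dists = []
    for triple, p in true_freqs.items():
        q = est_freqs[triple]
        keys = set(p) | set(q)
        dists.append(0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys))
    return sum(dists) / len(dists)


# Toy gene tree sets on three taxa (placeholder data, not from the study).
true_trees = [(("A", "B"), "C")] * 3 + [(("A", "C"), "B")]
est_trees = [(("A", "B"), "C")] * 2 + [(("B", "C"), "A")] * 2
taxa = ["A", "B", "C"]
true_freqs = triplet_frequencies(true_trees, taxa)
est_freqs = triplet_frequencies(est_trees, taxa)
print(mean_distance(true_freqs, est_freqs))
```

A distance of 0 would mean the estimated gene trees reproduce the true triplet frequencies exactly; larger values indicate more gene tree distribution error.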

Page 50:

Binning improves gene tree distribution

Supergene trees represent the true gene tree distribution much better than the estimated gene trees without binning.

[Figure labels: "Empirical cumulative distribution"; "More information per gene"]

