Page 1: The Incompatible Desiderata of Gene Cluster Properties Rose Hoberman Carnegie Mellon University joint work with Dannie Durand.

The Incompatible Desiderata of Gene Cluster Properties

Rose Hoberman Carnegie Mellon University

joint work with Dannie Durand

Page 2:

How to detect segmental homology?

Intuitive notions of what gene clusters look like:
• Enriched for homologous gene pairs
• Neither gene content nor order is perfectly preserved

How can we define a gene cluster formally?

Page 3:

Definitions will be application-dependent

If the goal is to estimate the number of inversions, then gene order should be preserved

If the goal is to find duplicated segments, allow some disorder

Page 4:

Gene Cluster Definitions

Large-Scale Duplications
• Vandepoele et al 02
• McLysaght et al 02
• Hampson et al 03
• Panopoulou et al 03
• Guyot & Keller, 04
• Kellis et al, 04
• ...

Genome rearrangements
• Bourque et al, 05
• Pevzner & Tesler 03
• Coghlan and Wolfe 02
• ...

Functional Associations between Genes
• Tamames 01
• Wolf et al 01
• Chen et al 04
• Westover et al 05
• ...

Algorithmic and Statistical Communities
• Bergeron et al 02
• Calabrese et al 03
• Heber & Stoye 01
• ...

Page 5:

Groups find very different clusters when analyzing the same data

[Figure: percent coverage of the rice genome (axis 0 to 80) reported by Vandepoele et al, 03; Simillion et al, 04; Wang et al, 05; Guyot et al, 04; Paterson et al, 04; Yu et al, 05]

Page 6:

Cluster locations differ from study to study

Inference of duplication mechanism for individual genes varies greatly

The Genomes of Oryza sativa: A History of Duplications. Yu et al, PLoS Biology 2005

Page 7:

Goals:

Characterizing existing definitions

Formal properties form a basis for comparison

Gene cluster desiderata

Page 8:

Outline
• Introduction
• Brief overview of gene cluster identification
• Proposed properties for comparison
• Analysis of data: nested property

Page 9:

Detecting Homologous Chromosomal Segments (a marker-based approach)

1. Find homologous genes
2. Formally define a “gene cluster”
3. Devise an algorithm to identify clusters
4. Statistically verify that clusters indicate common ancestry

Page 10:

Cluster definitions in the literature

Descriptive:
• r-windows
• connected components (Pevzner & Tesler 03)
• common intervals (Uno and Yagiura 00)
• max-gap
• …

Constructive:
• LineUp (Hampson et al 03)
• CloseUp (Hampson et al 05)
• FISH (Calabrese et al 03)
• ADHoRe (Vandepoele et al 02)
• Gene teams (Bergeron et al 02)
• greedy max-gap (Hokamp 01)
• …

• Require search algorithms
• Harder to reason about formally

Page 11:

Cluster definitions in the literature

I illustrate properties with a few definitions

Page 12:

r-windows

Two windows of size r that share at least m homologous gene pairs

(Cavalcanti et al 03, Durand and Sankoff 03, Friedman & Hughes 01, Raghupathy and Durand 05)

[Figure: example with r = 4, m ≥ 2]
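
The r-window definition can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited studies: it assumes each chromosome is a list of gene labels, that homologous genes carry the same label, and that each label occurs at most once per chromosome. The function name `r_window_hits` and the example chromosomes are hypothetical.

```python
from typing import List, Set, Tuple


def r_window_hits(chrom_a: List[str], chrom_b: List[str],
                  r: int, m: int) -> List[Tuple[int, int, Set[str]]]:
    """Return (start_a, start_b, shared_labels) for each window pair sharing >= m labels."""
    hits = []
    # max(1, ...) keeps one (short) window when a chromosome has fewer than r genes.
    for i in range(max(1, len(chrom_a) - r + 1)):
        window_a = set(chrom_a[i:i + r])
        for j in range(max(1, len(chrom_b) - r + 1)):
            shared = window_a & set(chrom_b[j:j + r])
            if len(shared) >= m:
                hits.append((i, j, shared))
    return hits


# Hypothetical chromosomes: only "a" and "b" occur on both.
chrom_a = ["a", "x1", "b", "x2", "c"]
chrom_b = ["y1", "b", "a", "y2", "y3"]
for start_a, start_b, shared in r_window_hits(chrom_a, chrom_b, r=4, m=2):
    print(start_a, start_b, sorted(shared))
```

Because the window length is fixed at r, the sketch never reports more than r shared genes per window pair, however long the homologous region actually is.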

Page 13:

max-gap cluster

A set of genes forms a max-gap cluster if the gap between adjacent genes is never greater than g on either genome

Widely used definition in genomic studies

[Figure: example clusters labeled "g 2" and "g 3"]
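
A direct check of the max-gap condition might look like the sketch below. It assumes the gap is measured as the number of genes lying between two adjacent cluster members, and that each gene has exactly one position on each genome; the helper names and example positions are hypothetical.

```python
from typing import Dict, Iterable


def max_gap_ok(positions: Iterable[int], g: int) -> bool:
    """True if no two adjacent selected genes have more than g genes between them."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def is_max_gap_cluster(genes: Iterable[str], pos_a: Dict[str, int],
                       pos_b: Dict[str, int], g: int) -> bool:
    """The gap condition must hold on both genomes for the same set of genes."""
    genes = list(genes)
    return (max_gap_ok((pos_a[x] for x in genes), g)
            and max_gap_ok((pos_b[x] for x in genes), g))


# Hypothetical positions of three homologous genes on each genome.
pos_a = {"a": 0, "b": 2, "c": 5}
pos_b = {"a": 3, "b": 0, "c": 4}
print(is_max_gap_cluster("abc", pos_a, pos_b, g=2))  # gaps of at most 2 on both genomes
print(is_max_gap_cluster("abc", pos_a, pos_b, g=1))  # genome A has a gap of 2
```

Note that gene order plays no role in the check: the genes appear as a, b, c on one genome and b, a, c on the other.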

Page 14:

Outline
• Introduction
• Brief overview of existing approaches
• Proposed properties for comparison
• Analysis of data: nested property

Page 15:

Proposed Cluster Properties
• Symmetry
• Size
• Density
• Order
• Orientation
• Nestedness
• Disjointness
• Isolation
• Temporal Coherence

Page 16:

Symmetry

Many existing cluster algorithms are not symmetric with respect to chromosome

[Figure: clusters found with one chromosome order =? clusters found with the chromosomes swapped]

Page 17:

Asymmetry: an example

FISH (Calabrese et al, 2003)
• Constructive cluster definition: clusters correspond to paths through a dot-plot
• Publicly available software
• Statistical model

Page 18:

Asymmetry: an example

FISH
• Euclidean distance between gene pairs is constrained
• Paths in the dot-plot must always move to the right

[Figure: dot-plot; axis labels as extracted: "1 2 3 6 5 99 4 7 8 9" and "123456789"]

Page 19:

Switching the axes yields different clusters

FISH
• Euclidean distance between markers is constrained
• Paths in the dot-plot must always move to the right

[Figure: the same dot-plot with the axes switched]
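
The effect of a "move to the right" rule can be reproduced with a toy model. The sketch below is not the FISH algorithm and ignores its statistical model; it only assumes that a path may step from one dot to another when the x coordinate strictly increases and the Euclidean distance is at most d, and reports the longest such path. The function names, the three dots, and the value of d are hypothetical.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


def longest_rightward_path(points: List[Point], d: float) -> int:
    """Number of dots on the longest path whose steps increase x and span at most d."""
    pts = sorted(points)  # visit dots in order of increasing x
    best = [1] * len(pts)
    for j, q in enumerate(pts):
        for i in range(j):
            p = pts[i]
            if p[0] < q[0] and dist(p, q) <= d:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)


def swap_axes(points: List[Point]) -> List[Point]:
    """Exchange the roles of the two chromosomes."""
    return [(y, x) for x, y in points]


# Three homologous pairs; the last two appear in opposite order on the two chromosomes.
dots = [(0, 0), (2, 2), (4, 1)]
print(longest_rightward_path(dots, d=3))             # path through all three dots
print(longest_rightward_path(swap_axes(dots), d=3))  # the same dots, axes switched
```

With these dots the first call finds a path of three dots and the second a path of only two, because after the swap the rightward order forces a step that exceeds d.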

Page 20:

Ways to regain symmetry

1. Paths in the dot-plot must always move down and to the right
   • miss the inversion
2. Paths can move in any direction
   • statistics become difficult

Regaining symmetry entails some tradeoffs

[Figure: dot-plot with the axes switched]

Page 21:

Proposed Cluster Properties

Symmetry, Size, Density, Order, Orientation, Nestedness, Disjointness, Isolation, Temporal Coherence

Page 22:

Cluster Parameters

• size: number of homologous pairs in the cluster
• length: total number of genes in the cluster
• density: proportion of homologous pairs (size/length)

[Figure: example cluster with size = 5, length = 12, density = 5/12, gap ≤ 3 genes]
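
These parameters are simple to compute once the cluster genes are placed on a chromosome. The sketch below works on a single chromosome and assumes integer gene positions; the positions are hypothetical, chosen only so that the result matches the numbers on the slide.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def cluster_stats(positions: Iterable[int]) -> Tuple[int, int, Fraction, int]:
    """Return (size, length, density, largest gap) for cluster genes on one chromosome."""
    ordered = sorted(positions)
    if not ordered:
        raise ValueError("a cluster needs at least one gene")
    size = len(ordered)                      # homologous pairs in the cluster
    length = ordered[-1] - ordered[0] + 1    # all genes from first to last member
    largest_gap = max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)
    return size, length, Fraction(size, length), largest_gap


# Hypothetical positions of five cluster genes among twelve consecutive genes.
size, length, density, largest_gap = cluster_stats([0, 2, 6, 8, 11])
print(size, length, density, largest_gap)
```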

Page 23:

max-gap clusters
• cluster grows to its natural size
• cluster of size m may be of length m to g(m-1) + m
• maximal length grows as size grows

[Figure: cluster with each gap labeled g]

r-windows
• cluster size is constrained
• cluster of size m may be of length m to r
• maximal length is fixed, regardless of cluster size

[Figure: window labeled length r]

Page 24:

A tradeoff: local vs global density

max-gap
• constrains local density
• only weakly constrains global density (≥ 1/(g+1))

r-window
• constrains global density
• only weakly constrains local density (maximum possible gap ≤ r-m)
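
Both weak bounds follow from counting the genes that can sit between cluster members. The short derivation below assumes, as elsewhere in these slides, that a gap is the number of non-cluster genes between adjacent cluster genes, and that density is size divided by length.

```latex
% max-gap cluster of size m with gap parameter g:
% at most g non-cluster genes fit in each of the m-1 gaps.
\[
  m \;\le\; \text{length} \;\le\; m + g(m-1)
\]
\[
  \text{density} = \frac{m}{\text{length}}
  \;\ge\; \frac{m}{m + g(m-1)}
  \;\ge\; \frac{m}{m + gm} = \frac{1}{g+1}
\]
% r-window containing m cluster genes:
% the r-m remaining genes may all fall into a single gap.
\[
  \text{density} \;\ge\; \frac{m}{r},
  \qquad
  \text{largest gap} \;\le\; r - m
\]
```

For instance, with g = 3 the global density of a max-gap cluster can fall no lower than 1/4, while a window with r = 4 and m = 2 can contain a gap of at most 2 genes.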

Page 25:

Even when global density is high, a region may not be locally dense

[Figure: example region with density = 12/18]

Page 26:

Size vs Density: An example

Application: all-against-all comparison of human chromosomes to find duplicated blocks

                         Maximum Gap      Cluster Size     Post-Processing
McLysaght et al, 2002    constrained      test statistic
Panopoulou et al, 2003   test statistic   constrained      merged nearby clusters

Page 27:

A Tradeoff in Parameter Space

• McLysaght et al, 2002: Gap ≤ 30, Size ≥ 6
• Panopoulou et al, 2003: Size ≥ 2, Gap ≤ 10

[Figure: parameter space with axes Gap (0 to 30) and Size (1 to 30); regions labeled "Large and dense", "Small but dense", and "Large but less dense"]
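
The tradeoff can be made concrete by applying both parameter settings to the same candidate clusters. The sketch below uses only the gap and size thresholds quoted on this slide; it does not model the test statistics or post-processing of either study, and the `Cluster` type and the three example clusters are hypothetical.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    size: int      # number of homologous pairs
    max_gap: int   # largest gap between adjacent pairs


def passes(cluster: Cluster, min_size: int, max_gap: int) -> bool:
    """True if the cluster meets both the size and the gap threshold."""
    return cluster.size >= min_size and cluster.max_gap <= max_gap


# Thresholds as quoted on the slide.
MCLYSAGHT = {"min_size": 6, "max_gap": 30}
PANOPOULOU = {"min_size": 2, "max_gap": 10}

examples = {
    "large and dense": Cluster(size=20, max_gap=5),
    "small but dense": Cluster(size=3, max_gap=5),
    "large but less dense": Cluster(size=20, max_gap=25),
}
for name, cluster in examples.items():
    print(name, passes(cluster, **MCLYSAGHT), passes(cluster, **PANOPOULOU))
```

Only the large, dense cluster satisfies both settings; each of the other two is accepted by one setting and rejected by the other.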

Page 28:

Proposed Cluster Properties

Symmetry, Size, Density, Order, Orientation, Disjointness, Isolation, Nestedness, Temporal Coherence

Page 29:

Order and Orientation

• Local rearrangements will cause both gene order and orientation to diverge
• Overly stringent order constraints could lead to false negatives
• Partial conservation of order and orientation provides additional evidence of regional homology

[Figure: two example regions, each with density = 6/8]

Page 30:

Wide Variation in Order Constraints

• None (r-windows, max-gap, ...)
• Explicit constraints:
   • Limited number of order violations (Hampson et al, 03)
   • Near-diagonals in the dot-plot (Calabrese et al 03, ...)
   • Test statistic (Sankoff and Haque, 05)
• Implicit constraints: via the search algorithm (Hampson et al 05, ...)

Page 31:

Proposed Cluster Properties

Symmetry, Size, Density, Disjointness, Isolation, Order, Orientation, Nestedness, Temporal Coherence

Page 32:

Nestedness

In particular, implicit ordering constraints are imposed by many greedy, agglomerative search algorithms

Formally, such search algorithms will find only nested clusters

A cluster of size m is nested if it contains sub-clusters of size m-1,...,1

Page 33:

Greedy Algorithms Impose Order Constraints

A greedy, agglomerative algorithm
• initializes a cluster as a single homologous pair
• searches for a gene in proximity on both chromosomes
• either extends the cluster and repeats, or terminates

[Figure: example with g = 2]
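
The three steps above can be sketched as follows. This is one possible reading, not the procedure of any particular published tool: "in proximity" is taken to mean that at most g genes lie between the candidate and some current cluster member, on both chromosomes. The function names and positions are hypothetical, and both position tables are assumed to cover the same genes.

```python
from typing import Dict, Set


def within_gap(gene: str, cluster: Set[str], pos: Dict[str, int], g: int) -> bool:
    """True if at most g genes lie between `gene` and some cluster member."""
    return any(abs(pos[gene] - pos[member]) - 1 <= g for member in cluster)


def greedy_cluster(seed: str, pos_a: Dict[str, int],
                   pos_b: Dict[str, int], g: int) -> Set[str]:
    """Grow a cluster from a single homologous pair, one gene at a time."""
    cluster = {seed}
    extended = True
    while extended:                  # repeat until no gene can be added
        extended = False
        for gene in pos_a:
            if gene in cluster:
                continue
            if within_gap(gene, cluster, pos_a, g) and within_gap(gene, cluster, pos_b, g):
                cluster.add(gene)
                extended = True
    return cluster


# Hypothetical positions of four homologous genes on two chromosomes.
pos_a = {"a": 0, "b": 1, "c": 3, "d": 9}
pos_b = {"a": 2, "b": 0, "c": 4, "d": 5}
print(sorted(greedy_cluster("a", pos_a, pos_b, g=2)))
```

Every intermediate set built this way already satisfies the max-gap condition, which is why a search of this kind can only return nested clusters.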

Page 34:

Greediness: an example (Bergeron et al, 02)

[Figure: a max-gap cluster of size four, g = 2]

• No greedy, agglomerative algorithm will find this cluster
• There is no max-gap cluster of size 2 (or 3)

In other words, the cluster is not nested
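
A small brute-force check shows how such a cluster can arise. The gene positions below are a hypothetical construction, not the figure from Bergeron et al; the gap is assumed to count the genes between adjacent cluster members, and a cluster of size m is treated as nested when sub-clusters of every size from m-1 down to 1 exist.

```python
from itertools import combinations
from typing import Dict, Iterable, Set


def max_gap_ok(positions: Iterable[int], g: int) -> bool:
    """True if adjacent selected genes never have more than g genes between them."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def is_cluster(genes: Iterable[str], pos_a: Dict[str, int],
               pos_b: Dict[str, int], g: int) -> bool:
    genes = list(genes)
    return (max_gap_ok((pos_a[x] for x in genes), g)
            and max_gap_ok((pos_b[x] for x in genes), g))


def sub_cluster_sizes(genes: str, pos_a: Dict[str, int],
                      pos_b: Dict[str, int], g: int) -> Set[int]:
    """Sizes k < len(genes) for which some k-subset is itself a max-gap cluster."""
    return {k for k in range(1, len(genes))
            if any(is_cluster(s, pos_a, pos_b, g) for s in combinations(genes, k))}


# Order a b c d on one genome, b d a c on the other, two other genes in every gap.
pos_a = {"a": 0, "b": 3, "c": 6, "d": 9}
pos_b = {"b": 0, "d": 3, "a": 6, "c": 9}
print(is_cluster("abcd", pos_a, pos_b, g=2))         # the four genes form a cluster
print(sub_cluster_sizes("abcd", pos_a, pos_b, g=2))  # sizes 2 and 3 are missing
```

Genes that are neighbors on one genome are far apart on the other, so no pair or triple qualifies even though the full set of four does.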

Page 35:

Thus: different results when searching for max-gap clusters

Greedy algorithms
• agglomerative
• find nested max-gap clusters

Gene Teams algorithm (Bergeron et al 02; Beal et al 03, ...)
• divide-and-conquer
• finds all max-gap clusters, nested or not

Page 36:

An example of a greedy search: CloseUp (Hampson et al, Bioinformatics, 2005)

Software tool to find clusters

Goal: statistical detection of chromosomal homology using density alone

Method:
• greedy search for nearby matches
• terminates when density is low
• randomization to statistically verify clusters

Page 37:

A comparative study (Hampson et al, 05)

Is order information necessary or even helpful for cluster detection?

Empirical comparison:
• CloseUp: “density alone”, but greedy
• LineUp and ADHoRe: density + order information
• evaluated accuracy on synthetic data

Page 38:

A comparative study (Hampson et al, 05)

Is order information necessary or even helpful for cluster detection?

Result: CloseUp had comparable performance

Their conclusion: order is not particularly helpful

My conclusion: results are actually inconclusive, since CloseUp implicitly constrains order

Page 39:

Proposed Cluster Properties

Symmetry, Size, Density, Order, Orientation, Nestedness, Disjointness, Isolation, Temporal Coherence

Page 40:

Gene clusters: islands of homology in a sea of interlopers

How can we formally describe this intuitive notion?

Page 41:

Islands of Homology

Disjoint: A homologous gene pair should be a member of at most one cluster

Isolated: The minimum distance between clusters should be larger than the maximum distance between homologous gene pairs within the cluster
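
Both conditions can be tested mechanically. The sketch below is one possible reading of the two definitions, restricted to a single chromosome: each cluster is a list of integer positions, "distance within the cluster" is taken to mean the distance between adjacent members, and the function names and example clusters are hypothetical.

```python
from itertools import combinations
from typing import List


def are_disjoint(clusters: List[List[int]]) -> bool:
    """True if no position belongs to more than one cluster."""
    seen = set()
    for cluster in clusters:
        if seen & set(cluster):
            return False
        seen |= set(cluster)
    return True


def widest_step(cluster: List[int]) -> int:
    """Largest distance between adjacent members of one cluster."""
    ordered = sorted(cluster)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


def are_isolated(clusters: List[List[int]]) -> bool:
    """True if every two clusters are farther apart than the widest step inside either."""
    for c1, c2 in combinations(clusters, 2):
        between = min(abs(p - q) for p in c1 for q in c2)
        if between <= max(widest_step(c1), widest_step(c2)):
            return False
    return True


print(are_disjoint([[1, 2, 4], [10, 11]]), are_isolated([[1, 2, 4], [10, 11]]))
print(are_disjoint([[1, 2], [2, 3]]), are_isolated([[1, 2, 4], [6, 9]]))
```

In the second isolation example the clusters are only 2 positions apart while one of them contains a step of 3, so they fail the test even though they are disjoint.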

Page 42:

Various types of constraints lead to overlapping (or nearby) clusters that cannot be merged

If we search for clusters with density ≥ ½:

If we search for nested max-gap clusters, g=1:

Page 43:

Our Proposed Cluster Properties

Symmetry, Size, Density, Disjointness, Isolation, Order, Orientation, Nestedness, Temporal Coherence

Page 44:

Temporal coherence

Divergence times of homologous pairs within a block should agree

[Figure: time axis running from "before" to "now"]

Page 45:

Outline
• Introduction
• Brief overview of existing approaches
• Proposed properties for comparison
• My analysis of data: nested property

Many groups use a greedy, agglomerative search to find gene clusters

Does a greedy search have a large effect on the set of clusters identified in real data?

Page 46:

Data

Gene orthology data:
• bacterial: GOLDIE database
  http://www.intellibiosoft.com/academic.html
• eukaryotes: InParanoid database
  http://inparanoid.cgb.ki.se

Pairwise genome comparisons:

                         genes (1)   genes (2)   orthologs
E. coli & B. subtilis        4,108       4,245       1,315
Human & Mouse               22,216      25,383      14,768
Human & Chicken             22,216      17,709      10,338

Page 47:

Methods

For each genome comparison and gap size:
• Maximal max-gap clusters: Gene Teams software
  http://www-igm.univ-mlv.fr/~raffinot/geneteam.html
• Maximal nested max-gap clusters: simple greedy heuristic (no merging)

Page 48:

Percent of gene teams that are nested

[Figure: percentage nested (axis 98 to 100)]

Page 49:

Number of genes in some gene team of size 7 or greater that are not in any nested cluster of size 7 or greater

[Figure: Chicken/Human]

Page 50:

Results

For the datasets analyzed, a nestedness constraint does not appear too conservative

However, we didn’t survey a wide range of evolutionary distances
• expect nestedness to decrease with evolutionary distance
• open question: are there more rearranged datasets for which the proportion of nested clusters is much smaller?

Page 51:

Is nestedness desirable?

A nestedness constraint:
• offers a middle ground between no order constraints and strict order

However, nestedness
• provides no formal description of order constraints
• is restrictive rather than descriptive

We may instead prefer methods that
• allow for parameterization of degree of disorder
• consider order conservation in the statistical tests

Page 52:

Conclusion

• Proposed 9 properties to compare and evaluate methods for identifying gene clusters
• Illustrated cluster differences due to
   • cluster definition
   • search algorithm
   • statistics
• Incompatible Desiderata: these properties are intuitively natural, yet many are surprisingly difficult to satisfy with the same definition

Page 53:

Acknowledgements

• David Sankoff
• The Durand Lab
• Barbara Lazarus Women@IT Fellowship
• Sloan Foundation
• NHGRI, Packard Foundation

Page 54:

Discussion

• Are our intuitions about clusters reasonable?
• Which cluster properties are important or desirable?
• How can we quantitatively evaluate cluster definitions?
• What are the tradeoffs between methods?
• How can better definitions be designed?

