
Post on 18-Jan-2018


Gene Family Size Distributions

Brought to You By Your Neighborhood Durand Lab

Narayanan Raghupathy

Nan Song

Rose Hoberman

Why Are We Interested in Gene Family Size Distributions?

Want to find homologous chromosomal regions
• Genes as markers
• Matches between genes indicate possible regional homology

Cluster statistics depend on
• The total number of matches
• The distribution of matches

What do we mean by a “Family”?

Ideally: A group of sequences that have arisen from a common ancestor

In practice: families are most often defined based on
• Similar structure
• Similar sequence

Families can be defined at many levels

Either domains or whole proteins can be grouped

Protein families and their evolution--a structural perspective. Orengo CA, Thornton JM.

Why are other people interested in gene family sizes?

To understand protein family evolution
• Fit birth/death model to the data

To predict how many more genes there are in certain families

How Can Genes Be Grouped Into Families?

Construct and analyze gene trees:
• Slow, requires manual supervision
• Tree construction is error-prone

Group based on structural similarity
• Structure may be similar even if not homologous
• Structure is generally not known

Cluster genes based on sequence similarity
• Heuristic
• Fast and comprehensive, even for large datasets

Clustering

Group together genes with similar E-values (or other sequence-based score)
• Many heuristics have been proposed
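Before any clustering heuristic can run, the pairwise matches have to be turned into a graph. The sketch below is one possible way to do that, not the method used in these slides: the `(gene_a, gene_b, evalue)` tuple format, the function name, the default threshold, and the cap of 200 for E-values reported as 0 are all assumptions made for illustration. Using -log10(E-value) as the edge weight is a common convention.

```python
import math

def build_similarity_graph(hits, max_evalue=1e-10, cap=200.0):
    """Turn pairwise hits into a symmetric weighted graph.

    hits: iterable of (gene_a, gene_b, evalue) tuples (hypothetical format).
    Returns {frozenset({a, b}): weight} with weight = -log10(evalue).
    """
    best = {}
    for a, b, evalue in hits:
        if a == b or evalue > max_evalue:
            continue  # drop self-hits and matches weaker than the threshold
        pair = frozenset((a, b))
        # A pair may be reported in both directions; keep the smaller E-value.
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    graph = {}
    for pair, evalue in best.items():
        # An E-value of 0 would give an infinite score, so cap the weight.
        graph[pair] = cap if evalue <= 0 else min(cap, -math.log10(evalue))
    return graph
```

Raising or lowering `max_evalue` changes which matches survive, which is the knob the later slides refer to as the E-value threshold.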

Why bother with clustering heuristics?

May not find true “gene families”
• May be throwing away true matches
• May be including extra noise

However, may still be preferable to allowing only 1-to-1 matches

[Figure: Chromosome 5 and Chromosome 3]

Existing Gene Family Data

Data for individual species
• Recent data is only for bacteria

Data from multiple species
• Large sets of species: eukaryotes + prokaryotes

The properties of protein family space depend on experimental design. Kunin et al., Bioinformatics 2005

Our Questions

What does the gene family size distribution (GFSD) look like?
• How much does the clustering method affect the GFSD?
• How much does the cluster E-value threshold affect the GFSD?
• How much does the GFSD vary across species?
• Can we fit the GFSD to a particular function?
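As a rough illustration of the last question, the sketch below tabulates a GFSD from a set of clusters and fits a straight line on log-log axes. The power-law form is an assumption chosen only for illustration, since the slides do not say which function was tried, and an ordinary least-squares fit on logged counts is a crude estimator. The function names are hypothetical.

```python
import math
from collections import Counter

def size_distribution(clusters):
    """Map family size -> number of families of that size."""
    return Counter(len(members) for members in clusters)

def fit_power_law(distribution):
    """Least-squares line through (log size, log count).

    Assumes count ~ prefactor * size ** exponent (an illustrative choice).
    Returns (exponent, prefactor).
    """
    xs = [math.log(size) for size in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct family sizes")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)
```

Running the same two functions on clusters from different methods, thresholds, or species gives one simple way to compare the resulting distributions.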

Our Analysis

Species:
• Yeast vs Yeast (5131 Genes)
• Mouse vs Mouse (7343 Genes)
• Human vs Human (10610 Genes) IN PROGRESS

Clustering Methods
• Hierarchical Clustering
  • Multiple variants
  • 5 E-value thresholds
• TribeMCL
  • 5 inflation parameters

Hierarchical Clustering

Methods, each applied at a given threshold:
• Complete linkage
• Average linkage
• Single linkage
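The three linkage variants differ only in how the distance between two clusters is computed from the pairwise distances of their members: minimum (single), maximum (complete), or mean (average). A minimal, deliberately naive sketch follows. It assumes a user-supplied `dist` function where larger means less similar (an E-value could serve), and it stops merging once the closest pair of clusters exceeds the threshold. The function name and signature are hypothetical, and the repeated pairwise scan is far too slow for whole genomes.

```python
def agglomerative(genes, dist, threshold, linkage="single"):
    """Naive agglomerative clustering stopped at a distance threshold.

    genes: list of gene identifiers.
    dist: function (a, b) -> distance; larger means less similar.
    linkage: "single" (min), "complete" (max) or "average" (mean).
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[g] for g in genes]
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under the chosen linkage rule.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [dist(a, b) for a in clusters[i] for b in clusters[j]]
                d = combine(pairwise)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With the same threshold, single linkage tends to chain genes into larger families, while complete linkage requires every pair in a family to pass the threshold and so tends to give smaller ones.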

TribeMCL

Inflation parameter controls cluster granularity (but is difficult to interpret)
• 4-5: small clusters
• 1.1-3: larger clusters

However, clusters do not strictly increase in size when inflation value is reduced
• i.e., clusters are not hierarchical

http://micans.org/mcl

Markov clustering
• More flow across higher weight edges
• How much total flow between each pair of genes?

Handles multi-domain proteins?

Very Efficient
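The flow idea can be sketched in a few lines. What follows is a toy version of the Markov clustering loop (alternate expansion and inflation until the matrix stops changing), not the TribeMCL implementation: it uses dense Python lists, adds unit self-loops, and reads out clusters with a simplified rule that assigns each node to the row holding most of its flow. All names and defaults here are assumptions for illustration.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Markov clustering sketch on a small dense matrix.

    weights: symmetric n x n list of lists of non-negative edge weights.
    Returns a list of clusters, each a sorted list of node indices.
    """
    n = len(weights)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: two steps of the random walk (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: raise entries to a power, which favors strong flow.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Simplified read-out: each node joins the row receiving most of its flow.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())
```

Because inflation is applied to the whole matrix at every round rather than to a fixed tree, rerunning with a different `inflation` value can regroup genes instead of merely splitting or merging existing clusters.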

[Figure: Mouse, complete linkage, E-value threshold 10^-10. Axis labels: Log (gene family size); Gene family size]

[Figure: Yeast, complete linkage]