Post on 18-Jan-2018
Gene Family Size Distributions
Brought to You By Your Neighborhood Durand Lab
Narayanan Raghupathy
Nan Song
Rose Hoberman
Why Are We Interested in Gene Family Size Distributions?
Want to find homologous chromosomal regions
• Genes as markers
• Matches between genes indicate possible regional homology
Cluster statistics depend on
• The total number of matches
• The distribution of matches
What do we mean by a “Family”?
Ideally: a group of sequences that have arisen from a common ancestor
In practice: families are most often defined based on
• Similar structure
• Similar sequence
Families can be defined at many levels
Either domains or whole proteins can be grouped
Protein families and their evolution--a structural perspective. Orengo CA, Thornton JM.
Why are other people interested in gene family sizes?
To understand protein family evolution
• Fit birth/death model to the data
To predict how many more genes there are in certain families
How Can Genes Be Grouped Into Families?
Construct and analyze gene trees
• Slow, requires manual supervision
• Tree construction is error-prone
Group based on structural similarity
• Structure may be similar even if not homologous
• Structure is generally not known
Cluster genes based on sequence similarity
• Heuristic
• Fast and comprehensive, even for large datasets
Clustering
Group together genes with similar E-values (or other sequence-based score)
• Many heuristics have been proposed
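One of the simplest such heuristics can be sketched as follows: treat every pairwise hit whose E-value is at or below a threshold as an edge, and call each connected component a family. This is only an illustration; the function name `cluster_by_evalue`, the `(gene_a, gene_b, evalue)` hit format, and the toy hits are assumptions, not the lab's pipeline.

```python
from collections import defaultdict

def cluster_by_evalue(hits, threshold):
    """Group genes into connected components of the graph whose edges
    are hits with E-value <= threshold (hypothetical helper)."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen genes.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        ra, rb = find(a), find(b)  # both genes are registered even if not merged
        if evalue <= threshold and ra != rb:
            parent[ra] = rb

    families = defaultdict(set)
    for gene in parent:
        families[find(gene)].add(gene)
    # Largest families first, ties broken alphabetically.
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))

# Toy hits (made-up gene names and E-values).
hits = [
    ("g1", "g2", 1e-40),
    ("g2", "g3", 1e-12),
    ("g3", "g4", 1e-3),
    ("g5", "g6", 1e-25),
]
strict = cluster_by_evalue(hits, 1e-10)
loose = cluster_by_evalue(hits, 1e-2)
```

With the stricter threshold, g4 stays a singleton; loosening the threshold pulls it into the larger family, which is how the threshold choice changes family sizes.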
Why bother with clustering heuristics?
May not find true “gene families”
• May be throwing away true matches
• May be including extra noise
However, may still be preferable to allowing only 1-to-1 matches
[Figure: Chromosome 5 and Chromosome 3]
Existing Gene Family Data
Data for individual species
• Recent data is only for bacteria
Data from multiple species
• Large sets of species: eukaryotes + prokaryotes
The properties of protein family space depend on experimental design. Kunin et al., Bioinformatics 2005
Our Questions
What does the gene family size distribution (GFSD) look like?
• How much does the clustering method affect the GFSD?
• How much does the cluster E-value threshold affect the GFSD?
• How much does the GFSD vary across species?
• Can we fit the GFSD to a particular function?
Our Analysis
Species:
• Yeast vs Yeast (5131 Genes)
• Mouse vs Mouse (7343 Genes)
• Human vs Human (10610 Genes) IN PROGRESS
Clustering Methods
• Hierarchical Clustering
  • Multiple variants
  • 5 E-value thresholds
• TribeMCL
  • 5 inflation parameters
Hierarchical Clustering
[Figure labels: Method, Threshold; Complete linkage, Average linkage, Single linkage]
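The three linkage variants differ only in how the distance between two clusters is derived from the pairwise distances between their members: the minimum (single), the maximum (complete), or the mean (average). A minimal sketch, assuming a generic pairwise dissimilarity (the slides use an E-value threshold; the distances, item names, and function `agglomerate` below are made up for illustration):

```python
from itertools import combinations

# How each variant combines member-to-member distances.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(items, dist, threshold, linkage="complete"):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until the closest pair is farther apart than the threshold."""
    combine = LINKAGE[linkage]
    clusters = [[x] for x in items]

    def cluster_distance(c1, c2):
        return combine([dist[frozenset((a, b))] for a in c1 for b in c2])

    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)

# Toy dissimilarities (hypothetical): a-b close, b-c fairly close, a-c far.
pairs = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0,
         ("a", "d"): 9.0, ("b", "d"): 9.0, ("c", "d"): 9.0}
dist = {frozenset(k): v for k, v in pairs.items()}
items = ["a", "b", "c", "d"]

single = agglomerate(items, dist, 4.0, "single")
complete = agglomerate(items, dist, 4.0, "complete")
average = agglomerate(items, dist, 4.0, "average")
```

At the same threshold, single linkage chains c onto {a, b} through the b-c match, while complete linkage refuses because a and c are too far apart; average linkage sits in between.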
TribeMCL
Inflation parameter (but its effect is difficult to understand)
• 4-5: small clusters
• 1.1-3: larger clusters
However, clusters do not strictly increase in size when the inflation value is reduced
• i.e., clusters are not hierarchical
http://micans.org/mcl
Markov clustering
• More flow across higher weight edges
• How much total flow between each pair of genes?
Handles multi-domain proteins?
Very efficient
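The flow idea behind Markov clustering can be sketched in a few lines: alternate expansion (squaring a column-stochastic matrix, which spreads flow) with inflation (raising entries to a power and renormalizing, which strengthens strong flows and starves weak ones). This is a toy, pure-Python illustration and not the TribeMCL implementation; the helper names, the self-loop weight of 1.0, and the rule of assigning each node to the row holding most of its flow are assumptions.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion step: matrix square (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(weights)
    # Add self-loops (assumed weight 1.0), then make columns stochastic.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        # Inflation step: elementwise power, then renormalize columns.
        new = normalize_columns(
            [[v ** inflation for v in row] for row in expand(m)])
        change = max(abs(new[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = new
        if change < tol:
            break
    # Assign each node (column) to the row that holds most of its flow.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())

def toy_graph():
    """Two triangles (edge weight 1.0) joined by one weak edge (0.1)."""
    n = 6
    w = [[0.0] * n for _ in range(n)]
    edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]
    for a, b, wt in edges:
        w[a][b] = w[b][a] = wt
    return w

clusters = mcl(toy_graph(), inflation=2.0)
```

On this toy graph the weak bridge carries too little flow to survive inflation, so the two triangles come out as separate clusters.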
[Figure: Mouse, complete linkage, E-value threshold 10^-10; axis labels: gene family size, log (gene family size)]
[Figure: Yeast, complete linkage]
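Plots like these start from a tally of how many families have each size. The sketch below computes that tally and, as one candidate for "a particular function", fits a straight line in log-log space, which corresponds to a power law. The power-law form is an assumption chosen for illustration, not a result from these slides, and the function names and toy numbers are hypothetical.

```python
import math
from collections import Counter

def family_size_distribution(families):
    """Map family size -> number of families of that size."""
    return dict(sorted(Counter(len(f) for f in families).items()))

def fit_power_law(distribution):
    """Least-squares line through (log10 size, log10 count).
    Returns (slope, intercept), i.e. count ~ 10**intercept * size**slope."""
    xs = [math.log10(size) for size in distribution]
    ys = [math.log10(count) for count in distribution.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

# Tallying a toy set of families (made-up gene names).
toy_families = [["a"], ["b"], ["c", "d"], ["e", "f", "g"]]
toy_distribution = family_size_distribution(toy_families)

# Synthetic distribution built to follow count = 1000 * size**-2 exactly.
synthetic = {1: 1000, 2: 250, 5: 40, 10: 10}
slope, intercept = fit_power_law(synthetic)
```

A least-squares fit on log-transformed counts is the quickest check, though it weights the sparse large-family tail heavily; it is a rough first look rather than a rigorous fit.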