Functional Classification of PSI Proteins to Support High Throughput Biochemical Characterization: Classes of Reciprocal Sequence Homologs (CRSH)
Samuel Handelman, Nelson Tong, Jon D. Luff, David P. Lee, André Lazar, Paul Smith, Prasanna Gogate, Rohan Mallelwar and John Hunt
Bacterial physiology in the post-genome era
• Exponential growth in sequence information.
• Structural information is more difficult to obtain. Evolution is key to leveraging what we do know.
• Direct functional information is scarcer still: evolution and comparative studies are even more critical.
genome images from BacMap (UAlberta) and VirtualLaboratory; protein structure images from NESG (Columbia/Rutgers).
Even today, most proteins are of unknown biochemical function
• E. coli (~4,200 proteins): 53% annotated as “hypothetical”, “putative”, “uncharacterized” or “unknown” (01/23/08); the remainder are “known”.
• H. sapiens (~27,000 proteins): 54% neither identical nor similar to any experimentally validated protein*; the remainder are “known”.
• Closing this gap lays the groundwork for systems biology.
*Genome Information Integration Project And H-Invitational 2 (2007) Nucleic Acids Research 36:D793-799
CRSH Goal: Group Functionally Equivalent Homologs.
CRSH Approach:
• Homology clusters contain multiple distinct protein functions.
• Identify sub-clusters such that all members have equivalent function (in bacteria only).
Topic Overview
• CRSH: what they are, why they’re useful
• CRSH web interface; merits of mapping TargetDB to protein functional groups
• Using CRSH and Gene Neighborhood to predict stable tertiary interactions.
Classes of Reciprocal Sequence Homologs (CRSHs)
• Input: predicted proteins from 474 fully sequenced bacterial genomes.
• Cluster based on BLAST scores; verify clusters on profile scores.
• Split into sub-clusters when multiple members come from a single organism (likely paralogs); verify sub-clusters on profile scores.
• Merge sub-clusters into classes if more similar than expected after accounting for inter-organism distances; verify final classes on profile scores.
• Result: ~75,000 CRSHs, each likely sharing the same function.
• Main application: gene neighborhood method. Calculate “co-localization” counts for all CRSH pairs (the number of times their genes are within 15 kb on chromosomes of fully diverged organisms).
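The co-localization count can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the data layout (`genomes` mapping a genome name to a list of `(crsh_id, position_bp)` tuples), the function name `colocalization_counts`, and counting a pair at most once per genome are all assumptions. Circular chromosomes and the "fully diverged organisms" filter are not modeled here.

```python
from collections import Counter
from itertools import combinations

WINDOW_BP = 15_000  # "within 15 kb", as stated on the slide


def colocalization_counts(genomes, window_bp=WINDOW_BP):
    """Count, for each CRSH pair, the genomes in which the pair is co-localized.

    genomes: dict mapping genome name -> list of (crsh_id, position_bp).
    Assumption: a pair is counted at most once per genome, and distance is
    measured between single gene coordinates on a linear chromosome.
    """
    counts = Counter()
    for genes in genomes.values():
        pairs_here = set()
        for (crsh_a, pos_a), (crsh_b, pos_b) in combinations(genes, 2):
            if crsh_a != crsh_b and abs(pos_a - pos_b) <= window_bp:
                pairs_here.add(tuple(sorted((crsh_a, crsh_b))))
        counts.update(pairs_here)
    return counts


# Hypothetical toy input: three genomes, CRSH ids "A", "B", "C".
toy_genomes = {
    "g1": [("A", 1_000), ("B", 9_000), ("C", 50_000)],
    "g2": [("A", 0), ("B", 14_000)],
    "g3": [("A", 0), ("B", 20_000)],
}
toy_counts = colocalization_counts(toy_genomes)
```

In the toy input, "A" and "B" fall within the window in two of the three genomes, so their count is 2; pairs that never co-localize are simply absent from the result.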
Split into sub-clusters when multiple members come from a single organism
[Figure: M. tuberculosis RV0859, E. coli PaaJ, A. tumefaciens ATU0502 and A. tumefaciens PcaF, grouped under the labels beta-ketoadipyl CoA thiolases and acetyl-CoA acetyltransferases. Legend: a marked pair is a pair of reciprocal closest homologs in their respective organisms.]
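The idea of reciprocal closest homologs can be illustrated with a short sketch. It assumes similarity scores are already available as nested dictionaries (query protein to subject protein to score); the function names and the toy protein ids `a1`, `a2`, `b1`, `b2` are hypothetical, and the real CRSH pipeline additionally verifies clusters on profile scores, which is not shown.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def reciprocal_closest_homologs(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b AND b's best hit is a.

    a_vs_b: dict protein in organism A -> {protein in organism B: score}
    b_vs_a: the same, searched in the opposite direction.
    """
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: a1 and a2 are paralogs in organism A.
a_vs_b = {"a1": {"b1": 300, "b2": 120}, "a2": {"b1": 150, "b2": 110}}
b_vs_a = {"b1": {"a1": 310, "a2": 140}, "b2": {"a1": 125, "a2": 100}}
pairs = reciprocal_closest_homologs(a_vs_b, b_vs_a)
```

Here only `a1` and `b1` choose each other, so `a2` is left out of that pair, which is the situation in which a cluster with two members from one organism would be split.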
Gene Neighborhood Preview (courtesy Marco Punta)
[Figure: Genomes 1, 2 and 3, each drawn as a row of numbered octagons. Each octagon represents a CRSH.]
“Co-localized” = within 15 kb
• Stronger neighborhood conservation => better function predictions.
• Insight into function of unknown proteins.
A Fixed Homology Threshold Fails to Reliably Segregate Functionally Equivalent Proteins
[Figure: frequency distribution of mean %ID in CRSH. x-axis: mean %ID (0 to 100); y-axis: frequency (0 to 0.2).]
• Tremendous range in sequence conservation with more or less equivalent conservation of function.
[Figure: P(each gene neighbor is conserved), 0.0 to 0.4, versus length-normalized BLAST bit-score, 0.00 to 1.50, plotted for orthologs and for paralogs.]
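The two quantities on these plots can be computed as below. This is a sketch under stated assumptions: the slides do not say how the bit score is normalized, so dividing by the aligned length is an assumption, as are the function names. The mean %ID of a CRSH is taken here as the plain average over its pairwise identities.

```python
def length_normalized_bit_score(bit_score, aligned_length):
    """Bit score per aligned position (assumed normalization, see lead-in)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return bit_score / aligned_length


def mean_percent_identity(pairwise_ids):
    """Average of the pairwise %ID values among members of one CRSH."""
    if not pairwise_ids:
        raise ValueError("need at least one pairwise identity")
    return sum(pairwise_ids) / len(pairwise_ids)


# Hypothetical values for illustration only.
score = length_normalized_bit_score(300.0, 200)
mean_id = mean_percent_identity([40.0, 60.0])
```

A normalized score makes short and long proteins comparable, which a raw bit score does not.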
Like Rost clusters, but for function
• Based on sequence information, you can conclude that two proteins have the same structure, even if you don’t know the structure.
• We’re working towards an analogous scheme for protein function, but each functional group needs its own cutoff.
• We propose to do this especially for proteins whose function we do not yet know.
[Figure, courtesy Burkhard Rost: pairwise sequence identity (0 to 100) versus number of residues aligned. Regions labeled “Sequence identity implies structural similarity!” and “Don't know region”.]
• We have developed a web interface for these CRSHs, intended for use by experimentalists.
• It is presently hosted in India (at http://61.8.141.68:8080/Columbia/) and will be hosted at the NESG (at www.orthology.org), where CRSH pages will be available for each entry in TargetDB.
• The CRSH pages that follow have been mapped to TargetDB, so that biologists working in the centers can access them directly.
• Within two months we hope to have a direct link from the PSI TargetDB gateway to the CRSHs.
• CRSHs already have links to BioCyc, a leading bacterial physiology database; links to other functional genomics databases are coming.
• A consensus domain architecture schematic will appear shortly.
• The applet on the left provides a graphical display of the phylogenetic distribution. In the near future, we’ll add the information from TargetDB to this applet and to the table below.
• Known complexes in BioCyc are targets for structural genomics efforts to solve multi-protein structures.
• The genetically co-localizing CRSHs are promising secondary targets, as I will explain…
Gene Neighborhood Hypothesis Generation
With suggested applications in structural genomics and functional genomics
OR
Rational ideas have consequences for action; reason necessarily has a constructive function.
Known Stable Complexes Strongly Correlate with Gene Neighborhood
• For every pair of CRSHs for which complex-membership data is available in BioCyc, we count the instances where the two CRSHs appear together in a putative operon.
• These counts correlate strongly with well-established, well-studied, stable and definitive physical complexes (drawn in this case from BioCyc).
• These probabilities are overestimated due to the methods used.
[Figure: P(CRSH together in stable complex), 0 to 1, versus co-localization counts (logarithmic bins, 0 to 100). Series: all hetero-complexes; heterodimers only.]
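A curve of this kind can be produced by grouping CRSH pairs into logarithmic bins of co-localization count and taking the fraction of pairs in each bin that share a known complex. The sketch below is illustrative only: the power-of-two bin edges, the input layout and the function name are assumptions, and the toy data are invented.

```python
from collections import defaultdict


def complex_fraction_by_log_bin(pairs):
    """Fraction of CRSH pairs in a known complex, per power-of-two count bin.

    pairs: iterable of (colocalization_count, in_known_complex).
    Bin k covers counts in [2**k, 2**(k + 1)); pairs with count 0 are skipped.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for count, in_complex in pairs:
        if count < 1:
            continue
        k = int(count).bit_length() - 1  # exact integer log2 bin index
        totals[k] += 1
        hits[k] += bool(in_complex)
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Invented toy data: (co-localization count, pair is in a known complex).
toy_pairs = [(1, False), (2, False), (3, True), (9, True), (12, True), (0, True)]
fractions = complex_fraction_by_log_bin(toy_pairs)
```

Logarithmic bins keep the sparse high-count pairs from being spread over many nearly empty bins.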
Gene Neighborhood has some Correlation with Small Molecule Interaction Partners
• For each CRSH, we extract from BioCyc a set of known small molecule interaction partners (ligands, substrates, products, etc.), excluding very common partners (water, phosphate, ATP, etc.).
• Because proteins together in operons are often part of the same metabolic pathways or respond to similar chemical signals, it is reasonable to extrapolate small molecule interactions to the conserved gene neighbors.
• There is a definite correlation. This graph is preliminary – it is likely an underestimate.
[Figure: P(known interaction between CRSH member and small molecule), 0 to 1, versus aggregate co-localization counts for CRSH/small molecule.]
• This view, which is still in beta, gives the known small-molecule interactions of all of the gene neighbors for a given CRSH, weighted to reflect the strength of gene neighborhood conservation.
• As well as providing a starting point for interaction screening, this can make the functional insights provided by the gene neighborhood method more accessible.
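One way to picture the weighted view is to let each gene neighbor vote for its known small molecules in proportion to how often it co-localizes with the query CRSH. The sketch below is a guess at the general shape of such a weighting, not the scheme used on the CRSH pages: the proportional weighting, the function name and the toy ids are all assumptions.

```python
from collections import defaultdict


def neighbor_ligand_scores(neighbor_counts, known_ligands):
    """Score small molecules for a query CRSH from its gene neighbors.

    neighbor_counts: dict neighbor CRSH id -> co-localization count with the query.
    known_ligands: dict CRSH id -> set of known small-molecule partners.
    Each neighbor contributes its share of the total co-localization count
    to every molecule it is known to interact with (assumed weighting).
    """
    total = sum(neighbor_counts.values())
    if total == 0:
        return {}
    scores = defaultdict(float)
    for neighbor, count in neighbor_counts.items():
        for molecule in known_ligands.get(neighbor, ()):
            scores[molecule] += count / total
    return dict(scores)


# Hypothetical neighbors and molecules.
toy_scores = neighbor_ligand_scores(
    {"n1": 30, "n2": 10},
    {"n1": {"molecule_x"}, "n2": {"molecule_x", "molecule_y"}},
)
```

Molecules supported by strongly conserved neighbors rise to the top, which gives a ranked starting list for interaction screening.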
Salvage Pipeline
• For structural genomics targets which have been cloned and are soluble, but which have failed to crystallize, we introduce a parallel pipeline to salvage them by adding “known” or predicted protein or small molecule binding partners.
• Bonus biology: whole greater than sum of parts.
[Diagram: Crystallize without Partner; Crystallize with Partner]
Concluding Remarks
• We are eager to add links to PSI resources to our CRSH pages – they are intended to facilitate collaboration between structural and functional genomics, in particular.
• Functional information can improve the impact of structural genomics efforts, and may provide new salvage pathways for difficult targets.
Thank you
John “The Jersey Eliminator” Hunt; Paul “Schmitty” Smith; Greg “Cassis” Boel; Sai “Full Nelson” Tong; Marco “The Shark” Punta; Burkhard “Wrecking Ball” Rost; Prasanna “Crackerjack” Gogate; Rohan “The Punisher” Mallelwar; Jon “JD” Luff; Liang “Red, White and Thunder” Tong; Howard “Hurricane” Shuman; Dana “Steel Toe” Pe’er; Harmen “H-Bomb” Bussemacher; Larry “The Tank” Chasin; Dre “Enter the Dragon” Lazar; David “Intravenous” Lee; Girish “Bone Breaker” Rao; Stephanie “Bronx” Wong; Diana “1-2-3” Flynn; George “El Pato Loco” Oldan; Allison “Grid Iron” Fay; Jordi “El Chupacabra” Banach; John “Steel” Dworkin; Etay “Aces” Ziv; Chris “Fireball” Wiggins; Gerwald “Sunshine” Jogl; Cal “Howitzer” Lobel; Yongzhao “Downtown” Shao; David “Finger of Death” Draper; Gae “Knuckles” Monteleone; Mike “The Red Baron” Baran; John “Mountain Man” Everett
The Hunt Lab, The NESG
American Heart Association, CF Foundation, NSF.
Consistency in CRSH sequence divergence levels between remote phyla
[Figure: two scatter plots; each dot is a CRSH.
Panel 1: a.a. %ID with binomial standard error. x-axis: D. radiodurans with B. subtilis (25 to 85); y-axis: E. coli with S. elongatus (25 to 85).
Panel 2: length-normalized BLAST bit score. x-axis: D. radiodurans with B. subtilis (0.5 to 2.0); y-axis: E. coli with S. elongatus (0.5 to 2.0).]
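The binomial standard error on a percent identity follows from treating each aligned position as an independent match/mismatch trial, giving SE = sqrt(p(1 - p)/n). The sketch below applies that textbook formula; taking n to be the number of aligned positions, and the function name, are assumptions not stated on the slide.

```python
import math


def percent_identity_with_se(identical, aligned):
    """Return (%ID, binomial standard error of %ID) for one alignment.

    identical: number of identical aligned positions.
    aligned: total number of aligned positions (assumed to be the n in the
    binomial formula).
    """
    if aligned <= 0 or not 0 <= identical <= aligned:
        raise ValueError("need 0 <= identical <= aligned and aligned > 0")
    p = identical / aligned
    se = math.sqrt(p * (1.0 - p) / aligned)
    return 100.0 * p, 100.0 * se


# Hypothetical alignment: 50 identical positions out of 100 aligned.
pid, pid_se = percent_identity_with_se(50, 100)
```

Short alignments get wide error bars under this formula, so a dot far from the diagonal matters less when the underlying proteins are short.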
Deviation from Evolutionary Consensus in Protein Complexes
[Figure: frequency (0 to 0.4) versus Spearman's rho on deviation from consensus distance (-1 to 1). Series: interaction pairs from BioCyc; random pairs from the BioCyc interaction set, with two S.D. against hypothesis.]
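Spearman's rho is the Pearson correlation of ranks, so it can be computed without external libraries. The sketch below is a generic implementation with average ranks for ties. The inputs `dev_a` and `dev_b` are hypothetical per-organism deviation values for two CRSHs; how deviation from consensus distance was computed is not specified on the slide.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical deviations for two CRSHs across the same four organisms.
dev_a = [0.1, 0.2, 0.3, 0.4]
dev_b = [1.0, 2.0, 3.0, 4.0]
rho_same = spearman_rho(dev_a, dev_b)
rho_opposite = spearman_rho(dev_a, dev_b[::-1])
```

A rank-based statistic is a natural choice here because it only asks whether two CRSHs deviate in the same organisms, not whether the deviations are linearly related.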