Post on 30-Dec-2015
Identification of Protein Domains
Orthologs and Paralogs
Describing evolutionary relationships among genes (proteins):
Two major ways of creating homologous genes are gene duplication and speciation.
Homology alone is not sufficiently well-defined; therefore additional terms are used:
Orthologs are two genes from two different species that derive from a single gene in the last common ancestor of the species.
[Figure: diagram labelled ortho / para]
Paralogs are genes that derive from a single gene that was duplicated within a genome.
Co-orthologs are paralogs produced by duplications of orthologs subsequent to a given speciation event.
[Figure: diagram labelled co-ortho]
Inparalogs are paralogs in a given lineage that all evolved by gene duplications that happened after the speciation event.
[Figure: diagram labelled in-para / out-para]
Outparalogs are paralogs in the given lineage that evolved by gene duplications that happened before the speciation event.
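The definitions above can be sketched as a rule on a gene tree whose internal nodes are marked as speciation or duplication events. This is only a minimal illustration: the tree, the gene names and the ages are hypothetical, and real orthology pipelines work from reconciled gene and species trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """One node of a toy gene tree (all names and ages are hypothetical)."""
    name: str
    event: Optional[str] = None      # "speciation" or "duplication"; None for a leaf gene
    age: float = 0.0                 # time before present, arbitrary units
    parent: Optional["Node"] = None


def lineage(node):
    """Return the node followed by all of its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def last_common_ancestor(a, b):
    """Return the most recent node shared by the lineages of a and b."""
    seen = {id(n) for n in lineage(a)}
    for n in lineage(b):
        if id(n) in seen:
            return n
    raise ValueError("genes are not in the same tree")


def classify(a, b, speciation_age):
    """Classify two distinct genes relative to a reference speciation event."""
    if a is b:
        raise ValueError("need two distinct genes")
    lca = last_common_ancestor(a, b)
    if lca.event == "speciation":
        return "orthologs"
    # The genes diverged by duplication: compare its age with the speciation.
    if lca.age < speciation_age:
        return "inparalogs"      # duplication after the speciation event
    return "outparalogs"         # duplication before the speciation event


# Toy tree: an old duplication, a speciation in each copy,
# and a recent duplication in the "human" lineage of copy A.
root = Node("dup_old", "duplication", age=10)
spec_a = Node("spec_a", "speciation", age=5, parent=root)
spec_b = Node("spec_b", "speciation", age=5, parent=root)
dup_recent = Node("dup_recent", "duplication", age=2, parent=spec_a)
human_a1 = Node("human_a1", parent=dup_recent)
human_a2 = Node("human_a2", parent=dup_recent)
mouse_a = Node("mouse_a", parent=spec_a)
human_b = Node("human_b", parent=spec_b)

print(classify(human_a1, mouse_a, speciation_age=5))    # orthologs
print(classify(human_a1, human_a2, speciation_age=5))   # inparalogs
print(classify(human_a1, human_b, speciation_age=5))    # outparalogs
```

In this toy tree human_a1 and human_a2 are both orthologous to mouse_a, which makes them co-orthologs of mouse_a in the sense defined above.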
Orthologs and Paralogs
• Orthologs - evolutionary functional counterparts in different species
• Inparalogs – important for detecting lineage-specific adaptations
Proteins:
• Rapidly growing databases of protein sequences due to genome sequencing projects.
• Many new proteins belong to protein families with known functions (significant sequence similarity).
• Only a small fraction of known proteins have functions determined by experiment.
• Databases providing computational sequence analysis allow us to assign new proteins to known families, and thus predict their function.
Protein Domains
• A domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats.
• Module = mobile domain.
• Different domains have distinct functions.
• Many eukaryotic proteins have multiple domains.
Protein Domains
PX domain with ligand
SH3 domain with ligand
Identifying Protein Domains:
Problems:
– Defining the members of each family.
– Building multiple alignments of the members.
– Finding the boundaries of the domain.
Identifying Protein Domains
• Little structural data is available, so domains are identified by sequence analysis.
• Sequence characterization of families helps to determine 3D structure and molecular functions.
• Even when the structure of the domain is not known it may be possible to define its boundaries from sequence alone.
Identifying Protein Domains:
Motif matches are often useful to indicate functional sites, however:
• They do not give a clear picture of the domain boundaries.
• They lack sensitivity.
Identifying Protein Domains:
Automatic methods:
• Fast, effective, deal with a lot of information.
• Might fragment domain families.
• Might cause fusion of domain families.
Manual methods:
• Knowledge of protein experts is put to use.
• Slow, require a lot of manpower.
SMART : (Simple Modular Architecture Research Tool)
Web-based resource used for:
– rapid annotation of protein domains.
– analysis of domain architectures.
Domain Architecture
Protein: PA-3427CG
Species: Drosophila melanogaster
Protein: ENSMUSP00000023109
Species: Mus musculus
Protein: ENSANGP00000009529
Species: Anopheles gambiae
SMART (Simple Modular Architecture Research Tool)
• There are over 600 domain families.
• Provides information about:
– function.
– subcellular localization.
– phyletic distribution.
– tertiary structure.
• Based on HMMs (Hidden Markov Models).
SMART (Simple Modular Architecture Research Tool)
HMM – based on seed alignment.
Threshold values used to determine homology of domains.
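The idea of building a model from a seed alignment and calling a domain only above a score threshold can be sketched with a much simpler stand-in for an HMM: an ungapped position-specific scoring profile. This is not how SMART is implemented (a profile HMM also models insertions and deletions); the seed sequences, the pseudocount and the threshold below are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped seed alignment."""
    length = len(seed_alignment[0])
    if any(len(seq) != length for seq in seed_alignment):
        raise ValueError("seed sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)          # uniform background (assumption)
    profile = []
    for col in range(length):
        column = [seq[col] for seq in seed_alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (score, start) of the best window."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        # Residues outside the 20 standard amino acids score 0.
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def has_domain(profile, sequence, threshold):
    """Call the domain present only if the best window reaches the threshold."""
    hit = best_hit(profile, sequence)
    return hit is not None and hit[0] >= threshold


seed = ["GKSTL", "GKTTL", "GKSSL"]               # hypothetical seed alignment
profile = build_profile(seed)

print(best_hit(profile, "MMAGKSTLQQ"))
print(has_domain(profile, "MMAGKSTLQQ", threshold=5.0))
print(has_domain(profile, "MMAQQQQLQQ", threshold=5.0))
```

Raising the threshold trades sensitivity for fewer false assignments, which is the role the threshold values play in deciding whether a match counts as homologous.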
SMART (Simple Modular Architecture Research Tool)
• Alignments of proteins by:
– Minimizing insertions/deletions in conserved alignment blocks.
– Optimizing amino acid property conservation.
– Closing unnecessary gaps.
• Gapped alignments preferred over ungapped ones:
– prediction of domain boundaries.
– greater information content.
• Alignment of entire structural domains.
PROSITE - database of protein families and domains
• Database of biologically significant sites and patterns. Contains 1,609 profiles.
• Pattern – conserved sequence of a few amino acids.
• Identifies to which known family of proteins (if any) the new sequence belongs.
• Used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.
PROSITE - database of protein families and domains
• A protein too distant from any other for its resemblance to be detected by overall sequence alignment can still be classified according to a pattern.
• Patterns arise because the requirements of binding sites impose very tight constraints on the evolution of portions of the protein.
PROSITE – how is a pattern developed ?
• As short as possible.
• Detects all/most sequences it describes.
• As few false results as possible.
That is, high sensitivity and high specificity.
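The two goals can be made concrete with the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below uses those textbook definitions on a made-up set of protein identifiers; it is not taken from PROSITE's own evaluation procedure.

```python
def sensitivity_specificity(predicted, actual, universe):
    """Compare the proteins a pattern matches with the true family members.

    predicted: set of ids matched by the pattern
    actual:    set of ids that really belong to the family
    universe:  set of all ids searched
    """
    tp = len(predicted & actual)              # true members that were matched
    fn = len(actual - predicted)              # true members that were missed
    fp = len(predicted - actual)              # non-members matched by mistake
    tn = len(universe - predicted - actual)   # non-members correctly ignored
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one member and one non-member")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical database of ten proteins, four of which are family members.
universe = {f"p{i}" for i in range(1, 11)}
actual = {"p1", "p2", "p3", "p4"}
predicted = {"p1", "p2", "p3", "p9"}          # misses p4, wrongly hits p9

sens, spec = sensitivity_specificity(predicted, actual, universe)
print(sens)   # 0.75
print(spec)
```

A very short pattern tends to raise sensitivity while lowering specificity, which is why the pattern is kept as short as possible only as long as false results stay rare.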
PROSITE – how is a pattern developed ?
First – study reviews on a protein family.
Then build an alignment table with particular attention to residues and regions important to the biological function of that family.
- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.
PROSITE steps in the development of a pattern:
• Finding a core pattern: 4-5 biologically significant residues.
• Test the pattern on a large database.
• If lucky – there is correlation in this region which indicates a good pattern.
• Mostly, there is no correlation:
– Gradually increase the size of the pattern.
– Search over other patterns.
PROSITE – An example
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
This pattern is small and would probably pick up too many false positive results.
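To see why such a short pattern is risky, the three example sequences can be searched with a core pattern read off their conserved columns. The pattern A-T-H-[DE] used below is an inference from the alignment, not something stated on the slide, and the converter handles only a subset of the PROSITE pattern syntax (x, [..], {..}, repeat counts, and < > anchors).

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a (subset of) PROSITE pattern syntax into a Python regex."""
    pattern = pattern.rstrip(".")
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            token = "."                             # any amino acid
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"      # any amino acid except these
        else:
            token = residue                         # one residue or an [..] set
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


sequences = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
core = prosite_to_regex("A-T-H-[DE]")               # inferred core pattern
print(core)                                         # ATH[DE]

for seq in sequences:
    print(seq, re.search(core, seq).start())        # each matches at index 5

# A hypothetical unrelated sequence also matches: four residues are too few.
print(re.search(core, "MKATHEL") is not None)       # True
```

Extending the pattern to neighbouring positions, as described in the development steps, is what reduces such chance matches.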
Profiles – characterize a protein family or domain over its entire length.
Patterns - small regions, high sequence similarity.
Research: Finding new domain families
Automatic methods
• The team started with 107 nuclear domains.
• Using SMART - get all proteins with at least one of these domains, characterize their complete domain structure.
• Regions not annotated using known SMART domain models were extracted with their domain context.
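Extracting the regions that no known domain model covers amounts to taking the complement of the annotated intervals along the protein. A minimal sketch follows; the coordinates are invented, and the `min_length` cutoff is a hypothetical parameter, since the slides do not say whether short gaps were discarded.

```python
def unannotated_regions(protein_length, domains, min_length=1):
    """Return the stretches of a protein not covered by any annotated domain.

    domains: list of (start, end) pairs, 1-based and inclusive; may overlap.
    min_length: hypothetical cutoff for the shortest region worth keeping.
    """
    regions = []
    cursor = 1                                   # first position not yet covered
    for start, end in sorted(domains):
        if start - cursor >= min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if protein_length - cursor + 1 >= min_length:
        regions.append((cursor, protein_length))
    return regions


# Hypothetical 500-residue protein with three annotated domains, two overlapping.
domains = [(50, 120), (100, 180), (300, 350)]

print(unannotated_regions(500, domains))
# [(1, 49), (181, 299), (351, 500)]
print(unannotated_regions(500, domains, min_length=50))
# [(181, 299), (351, 500)]
```

Keeping the flanking domain coordinates alongside each extracted region would preserve the domain context mentioned above.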
Finding new domain families: Automatic methods
• Grouping proteins by region similarity.
• Finding homologs using PSI-BLAST on the longest region of every group (threshold: E-value < 0.001).
• Finding domain organization via SMART.
• Homologous regions – candidates for a novel domain family.
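The grouping and query-selection steps can be sketched as single-linkage clustering followed by picking the longest member of each group. The similarity measure here (difflib's ratio), the 0.6 cutoff and the example sequences are stand-ins chosen for illustration; the slides do not say how region similarity was measured, and the hit list is made up rather than real PSI-BLAST output.

```python
from difflib import SequenceMatcher


def group_regions(regions, min_similarity=0.6):
    """Single-linkage grouping of region sequences by pairwise similarity."""
    parent = list(range(len(regions)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            ratio = SequenceMatcher(None, regions[i], regions[j], autojunk=False).ratio()
            if ratio >= min_similarity:
                parent[find(i)] = find(j)        # merge the two groups

    groups = {}
    for i, region in enumerate(regions):
        groups.setdefault(find(i), []).append(region)
    return list(groups.values())


def pick_queries(groups):
    """Choose the longest region of every group as the search query."""
    return [max(group, key=len) for group in groups]


def significant_hits(hits, max_evalue=0.001):
    """Keep only hits whose E-value is below the threshold."""
    return [name for name, evalue in hits if evalue < max_evalue]


regions = ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQISFVKSHF", "GGSPLWWDDENNQ"]
groups = group_regions(regions)
print(groups)
print(pick_queries(groups))

hits = [("hitA", 1e-10), ("hitB", 0.5), ("hitC", 0.001)]   # hypothetical (name, E-value)
print(significant_hits(hits))                               # ['hitA']
```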
Finding new domain families:
[Flowchart: 107 nuclear domains → finding proteins (SMART) → regions not known by SMART → group regions → PSI-BLAST: finding homologs → domain architecture (SMART) → manual inspection, more searches]
Finding new domain families: Manual confirmation
• Different context – novel module family.
• Proteins with nuclear AND extracellular domains excluded.
• Multiple alignments and known locations of domains – definition of domains’ borders.
• Automatic searches to find more members, E-value < 0.1, and manual checks.
• Marginal similarity to domain family – possible divergent family.
Prediction of Function: Chromatin-Binding Domains
• The protein SPT6, which contains a CSZ domain, regulates transcription through a histone-binding capability.
• It also contains two other types of domains, which are unlikely to bind histones.
• Therefore the CSZ domain was predicted to be the one that binds histones.
Research:
• Search of C-terminal by PSI-BLAST (E-value < 10^-5) found UBX-containing proteins and metazoan homologs of PNGases.
• PNGases – proteins involved in the UPR.
• UPR – unfolded protein response.
• PUG – the homologous regions.
• PUG domains found in proteins with domains central to ubiquitin-mediated proteolysis (UBA and UBX).
• Arabidopsis protein – UBA in N-terminal.
Conclusion:
PUG-containing proteins might link the UPR to ubiquitin-mediated protein degradation.
[Figure: domain architectures of PUG-containing proteins. Domain labels: PUG, UBA, UBX, UBCc. Annotations: PNGases, believed to have a role in the UPR; domains central to ubiquitin-mediated proteolysis. Structures shown: UBX domain from human FAF1 (apoptosis); C-terminal UBA domain of the human homologue of RAD23A (hHR23A) (DNA-binding protein).]
• Orthologs of PNGases in metazoans are present singly (not as multiple paralogs) – likely to have similar cellular localization.
• The ortholog in Saccharomyces cerevisiae is known to be localized mainly in the nucleus. It is therefore likely that PNGases are localized in the nucleus too.
• HMM from the PUG – marginal similarity to IRE1p-like kinases, which are known to initiate the UPR as well.
• They suggest the presence of divergent PUG domains in the C termini of these proteins.
• Analysis revealed a conserved region in metazoan PNGases. It was named PAW and added to SMART.
• The team found 28 novel nuclear domain families.
• Most of them with representatives in diverse molecular context in different species.
• Some specific to single species.
• Others divergent members of previously recognized families.
The End