Identification of Protein Domains. Orthologs and Paralogs Describing evolutionary relationships...

Identification of Protein Domains

Orthologs and Paralogs

Describing evolutionary relationships among genes (proteins):

The two major ways of creating homologous genes are gene duplication and speciation.

Homology on its own is not sufficiently well-defined. Therefore additional terms are used:

Orthologs are two genes from two different species that derive from a single gene in the last common ancestor of the species.

Paralogs are genes that derive from a single gene that was duplicated within a genome.

Co-orthologs are paralogs produced by duplications of orthologs subsequent to a given speciation event.

Inparalogs are paralogs in a given lineage that all evolved by gene duplications that happened after the speciation event.

Outparalogs are paralogs in the given lineage that evolved by gene duplications that happened before the speciation event.
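The definitions above can be read as a small decision rule. The sketch below is only an illustration: the function name `classify_pair` and its arguments are hypothetical, and it assumes the event that separated the two genes (speciation or duplication) is already known, for example from a reconciled gene tree.

```python
from typing import Optional


def classify_pair(divergence_event: str,
                  duplication_after_speciation: Optional[bool] = None) -> str:
    """Name the relationship between two homologous genes.

    divergence_event: "speciation" or "duplication", the event at which
        the two genes last shared a single ancestral gene.
    duplication_after_speciation: for paralogs only, whether the duplication
        happened after (True) or before (False) the speciation event of
        interest. Leave as None if no reference speciation is considered.
    """
    if divergence_event == "speciation":
        return "orthologs"
    if divergence_event == "duplication":
        if duplication_after_speciation is None:
            return "paralogs"
        return "inparalogs" if duplication_after_speciation else "outparalogs"
    raise ValueError("divergence_event must be 'speciation' or 'duplication'")


print(classify_pair("speciation"))
print(classify_pair("duplication", duplication_after_speciation=True))
print(classify_pair("duplication", duplication_after_speciation=False))
```

Co-orthologs are not given a separate branch here: under the definition above they are the inparalogs of one lineage taken together, viewed relative to their ortholog in the other lineage.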

Orthologs and Paralogs

• Orthologs - evolutionary functional counterparts in different species

• Inparalogs – important for detecting lineage-specific adaptations

Proteins:

• Rapidly growing databases of protein sequences due to genome sequencing projects.

• Many new proteins belong to protein families with known functions (significant sequence similarity).

• Only a small fraction of known proteins have functions determined by experiment.

• Databases providing computational sequence analysis allow us to classify new proteins into known families, and thus predict their function.

Protein Domains

• A domain is an independent structural unit which can be found alone or in conjunction with other domains or repeats.

• Module = mobile domain.

• Different domains have distinct functions.

• Many eukaryotic proteins have multiple domains.

Protein Domains

PX domain with ligand

SH3 domain with ligand

Identifying Protein Domains:

Problems:

– Defining the members of each family.

– Building multiple alignments of the members.

– Finding the boundaries of the domain.

Identifying Protein Domains

• Little structural data is available, so domains are identified by sequence analysis.

• Sequence characterization of families helps to infer 3D structure and molecular functions.

• Even when the structure of the domain is not known it may be possible to define its boundaries from sequence alone.

Motif matches are often useful to indicate functional sites, however:

• They do not give a clear picture of the domain boundaries.

• They lack sensitivity.

Automatic methods:

• Fast, effective, deal with a lot of information.

• Might fragment domain families.

• Might cause fusion of domain families.

Manual methods:

• Knowledge of protein experts is put to use.

• Slow, require a lot of manpower.

SMART (Simple Modular Architecture Research Tool)

Web-based resource used for:

– rapid annotation of protein domains.

– analysis of domain architectures.

Domain Architecture

Protein: PA-3427CG
Species: Drosophila melanogaster

Protein: ENSMUSP00000023109
Species: Mus musculus

Protein: ENSANGP00000009529
Species: Anopheles gambiae

SMART (Simple Modular Architecture Research Tool)

• There are over 600 domain families.

• Provides information about:

– function.

– subcellular localization.

– phyletic distribution.

– tertiary structure.

• Based on HMMs (Hidden Markov Models).

SMART (Simple Modular Architecture Research Tool)

HMM – based on seed alignment.

Threshold values used to determine homology of domains.
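The idea of "model built from a seed alignment, plus a score threshold" can be illustrated with something much simpler than an HMM. The sketch below is not SMART's method: it swaps in an ungapped position-specific scoring matrix (log-odds against a uniform background), and the seed alignment, the pseudocount and the threshold value are all hypothetical choices made for the illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seed_alignment, pseudocount=1.0):
    """Turn equal-length aligned sequences into per-column log-odds scores."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*seed_alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window_score(profile, sequence):
    """Score every ungapped window of the sequence and return the best one."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        # Residues outside the 20-letter alphabet contribute a neutral 0.0.
        sum(column.get(aa, 0.0)
            for column, aa in zip(profile, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    )


SEED = ["ATHD", "ATHD", "ATHE"]  # toy seed alignment (hypothetical)
THRESHOLD = 3.0                  # hypothetical cut-off, in bits

profile = build_profile(SEED)
for candidate in ("ALRDFATHDDF", "GGGGGGGG"):
    score = best_window_score(profile, candidate)
    print(candidate, round(score, 2), score >= THRESHOLD)
```

A real profile HMM additionally models insertions and deletions, which is what allows gapped alignments and domain boundary prediction.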

SMART (Simple Modular Architecture Research Tool)

• Alignments of proteins are built by:

– Minimizing insertions/deletions in conserved alignment blocks.

– Optimizing amino acid property conservation.

– Closing unnecessary gaps.

• Gapped alignments are preferred over ungapped ones:

– prediction of domain boundaries.

– greater information content.

• Alignment of entire structural domains.

PROSITE - database of protein families and domains

• Database of biologically significant sites and patterns. Contains 1,609 profiles.

• Pattern – conserved sequence of a few amino acids.

• Identifies to which known family of proteins (if any) the new sequence belongs.

• Used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.

PROSITE - database of protein families and domains

• A protein that is too distant from any other for its resemblance to be detected by overall sequence alignment can still be classified according to a pattern.

• Patterns arise because of requirements of binding sites that impose very tight constraint on the evolution of portions of the protein.

PROSITE – how is a pattern developed ?

• As short as possible.

• Detects all/most sequences it describes.

• As few false results as possible.

That is, high sensitivity and high specificity.
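The two criteria can be made concrete with the standard definitions of sensitivity and specificity. The sketch below uses these textbook formulas; the counts in the usage lines are hypothetical and do not come from any real PROSITE entry.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of true family members that the pattern detects."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of non-members that the pattern correctly leaves out."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical database scan: 50 family members, 1000 unrelated sequences.
print(sensitivity(true_positives=45, false_negatives=5))
print(specificity(true_negatives=990, false_positives=10))
```

A very short pattern tends to score well on the first measure and badly on the second, which is why the pattern is extended only as far as needed.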

PROSITE – how is a pattern developed?

First – study reviews on a protein family.

Then build an alignment table with particular attention to residues and regions important to the biological function of that family:

- Enzyme catalytic sites.

- Prosthetic group attachment sites (heme).

- Amino acids involved in binding a metal ion.

- Cysteines involved in disulfide bonds.

- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

PROSITE steps in the development of a pattern:

• Finding a core pattern: 4-5 biologically significant residues.

• Test the pattern on a large database.

• If lucky – there is correlation in this region which indicates a good pattern.

• Mostly, there is no correlation:

– Gradually increase the size of the pattern.

– Search over other patterns.

PROSITE – An example

ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS

The residues conserved across these three aligned sequences give the core pattern A-T-H-[DE]. This pattern is small and would probably pick up too many false positive results.
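The three aligned sequences share the residues A, T, H and then D or E, which in PROSITE notation is written A-T-H-[DE]. As a rough sketch, such a pattern can be translated into a regular expression and tested against sequences. The helper `prosite_to_regex` is a hypothetical name, and it handles only a subset of the PROSITE syntax: `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for the sequence ends.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition count such as "(2)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"empty element in pattern {pattern!r}")
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]")
for sequence in ("ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"):
    hit = re.search(regex, sequence)
    print(sequence, hit.group(0), hit.start())
```

Because the expression is only four residues long, it will also match many unrelated proteins by chance, which is the false positive problem described above.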

Profiles – characterize a protein family or domain over its entire length.

Patterns - small regions, high sequence similarity.

Research: Finding new domain families

Automatic methods

• The team started with 107 nuclear domains.

• Using SMART - get all proteins with at least one of these domains, characterize their complete domain structure.

• Regions not annotated using known SMART domain models were extracted with their domain context.
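Extracting the regions that no known domain model covers amounts to taking the gaps between annotated intervals. A minimal sketch follows, assuming 1-based inclusive domain coordinates. The function name and the `min_length` cut-off are hypothetical (the slides do not state a minimum region length), and the flanking domain context is not recorded here.

```python
def unannotated_regions(protein_length, domains, min_length=50):
    """Return (start, end) stretches not covered by any annotated domain.

    domains: list of (start, end) pairs, 1-based and inclusive.
    min_length: hypothetical cut-off; shorter gaps are treated as linkers.
    """
    regions = []
    position = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - position >= min_length:
            regions.append((position, start - 1))
        position = max(position, end + 1)
    if protein_length - position + 1 >= min_length:
        regions.append((position, protein_length))
    return regions


# Hypothetical 500-residue protein with two annotated domains.
print(unannotated_regions(500, [(100, 180), (200, 260)]))
```

In this example the 19-residue gap between the two domains is skipped, while the N-terminal and C-terminal stretches are kept as candidate regions.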

Finding new domain families: Automatic methods

• Grouping proteins by region similarity.

• Finding homologs using PSI-BLAST, with the longest region of every group as the query (threshold: E-value < 0.001).

• Finding domain organization via SMART.

• Homologous regions – candidates for a novel domain family.
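The grouping and query selection steps above can be sketched as follows. The slides do not say how regions were grouped, so single-linkage clustering with a union-find structure is an assumption made here; the region names, lengths, similarity pairs and hits are hypothetical. Only the 0.001 E-value threshold comes from the text.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 1e-3  # from the slide: E-value < 0.001


def group_regions(region_lengths, similar_pairs):
    """Single-linkage grouping: regions linked by any similarity share a group."""
    parent = {name: name for name in region_lengths}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in region_lengths:
        groups[find(name)].append(name)
    return [sorted(members) for members in groups.values()]


def longest_member(group, region_lengths):
    """Pick the longest region of a group as the search query."""
    return max(group, key=lambda name: region_lengths[name])


def significant_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Keep the subjects whose E-value is below the threshold."""
    return [subject for subject, e_value in hits if e_value < threshold]


lengths = {"r1": 120, "r2": 200, "r3": 90, "r4": 150}
groups = group_regions(lengths, [("r1", "r2"), ("r2", "r3")])
print(groups)
print([longest_member(group, lengths) for group in groups])
print(significant_hits([("p1", 1e-10), ("p2", 0.5), ("p3", 0.001)]))
```

The search itself (PSI-BLAST) and the SMART lookup are external tools and are not reproduced here; `significant_hits` only shows where the threshold would be applied to their output.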

Finding new domain families:

107 nuclear domains
→ finding proteins - SMART
→ regions not known by SMART
→ group regions
→ PSI-BLAST finding homologs
→ domain architecture - SMART
→ manual inspection, more searches

Finding new domain families: Manual confirmation

• Different context – novel module family.

• Proteins with nuclear AND extracellular domains excluded.

• Multiple alignments and known locations of domains – definition of domains' borders.

• Automatic searches to find more members, E-value < 0.1, and manual checks.

• Marginal similarity to domain family – possible divergent family.

Prediction of Function: Chromatin-Binding Domains

• The protein SPT6, which contains a CSZ domain, regulates transcription through a histone-binding capability.

• It also contains two other types of domains, which are unlikely to bind histones.

• Therefore it was predicted that the CSZ domain has that function.

Research:

• Search of C-terminal by PSI-BLAST (E-value < 10^-5) found UBX-containing proteins and metazoan homologs of PNGases.

• PNGases – proteins involved in UPR.

• UPR – unfolded protein response.

• PUG – the homologous regions.

• PUG domains found in proteins with domains central to ubiquitin-mediated proteolysis (UBA and UBX).

• Arabidopsis protein – UBA in N-terminal.

Conclusion :

PUG-containing proteins might link the UPR to ubiquitin-mediated protein degradation.

Figure labels: PUG UBA; PUG UBCc; PNGases (believed to have a role in the UPR); domains central to ubiquitin-mediated proteolysis.

Structure captions: UBX domain from human FAF1 (apoptosis); C-terminal UBA domain of the human homologue of Rad23A (hHR23A) (DNA-binding protein).

• Orthologs of PNGases in metazoans are present singly (not as multiple paralogs) – likely to have similar cellular localization.

• The ortholog in Saccharomyces cerevisiae is known to be localized mainly in the nucleus. It is therefore likely that metazoan PNGases are localized in the nucleus too.

• HMM from the PUG – marginal similarity to IRE1p-like kinases, which are known to initiate the UPR as well.

• This suggests the presence of divergent PUG domains in the C termini of these proteins.

• Analysis revealed a conserved region in metazoan PNGases. It was named PAW and added to SMART.

• The team found 28 novel nuclear domain families.

• Most of them with representatives in diverse molecular context in different species.

• Some specific to single species.

• Others divergent members of previously recognized families.

The End
