Bioinformatics and Computational Molecular Biology
Geoff Barton
http://www.compbio.dundee.ac.uk
Practical Tutorial
• Dr David Martin: practical tutorial on the use of the PyMOL molecular graphics software.
• In this lecture I will show lots of protein structures. Use www.ebi.ac.uk/msd to find them, and/or the SCOP domains database (find it with Google).
Similarities in Proteins
• Lecture 1
– Overview of data in molecular biology
– Protein modelling
– Similarities of Protein Sequence, Structure, Function
Introduction to Sequence Comparison
• Lecture 2
– Why compare sequences?
– Methods for sequence comparison/alignment.
– Multiple alignment
– Database searching - FASTA/BLAST
– Iterative searching - PSI-BLAST
Practical/WWW references
• Organised by Drs Martin
– Good preparation would be to look at: http://www.ebi.ac.uk/Tools and http://www.ncbi.nlm.nih.gov
– Look at BLAST and FASTA on these sites as well as database access facilities.
Traditional biological research
• Private Data: past experiments, lab notebooks, group discussions.
• Analysis: reading, talking, thinking.
• Hypothesis!
• Experiment: design, execution.
• Publish!
• Public Data: journals, conferences.
Bioinformatics/Computational Biology and biological research
• Private Data: past experiments, lab notebooks, group discussions, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
• Analysis: reading, talking, thinking, computational analysis, software development.
• Hypothesis! Computer aided.
• Experiment: design, execution, computational experiments, simulation.
• Publish! Database submission, database management.
• Public Data: journals, conferences, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
EMBL Nucleotide Sequence Database Growth (to 2nd Oct 2006)
Taken from: www.ebi.ac.uk
Protein Sequences
Approx 3,500,000 known for all species (Oct. 2006)
25,000 for Human
(not counting splice variants and post-translational modifications)
Protein 3D Structures
Approx 39,000 known (much duplication)
Biological data in context
DNA
RNA
Protein Sequence
Protein 3D structure
Molecular function
Overview of Biological Hierarchy...
• Ecosystem: many different organisms
• Population: group of the same type of organism
• Family: group with known common lineage
• Whole organism: animal, plant, etc.
• Tissue/organ: brain, heart, lungs, blood, ...
• Cell: nerve, muscle, etc.
• Organelle: nucleus, mitochondria, etc.
• Nucleus
• Chromosome
• Gene
Molecular Levels
DNA
RNA
Protein Sequence
Protein 3D structure
Molecular function
Technology and data in biology
Expression Data (Transcriptomics)
Which of the genes are switched on in which cells/tissues, and when?
What are the effects of drugs and disease on expression patterns?
DNA ‘chip’ technology.
Protein Expression Data (Proteomics)
Which proteins are being produced in which cells/tissues, and when? Which modified forms are present?
What are the effects of drugs and disease on these patterns?
2D Gels + Mass Spectrometry.
Protein 3D Structure - the bridge to chemistry (Structural Genomics)
What is the atomic level structure of the protein?
What other molecules does it interact with?
What small molecules - potential drugs - does it interact with?
What are the effects of point mutations on the structure?
X-ray crystallography, NMR spectroscopy, single particle, cryo-electron microscopy.
Overview of Biological Hierarchy...
Macroscopic Levels: ecosystem down to gene.
Molecular Levels: DNA down to molecular function.
Biology is now a data-intensive science
To do good science, you need to know how to use (and not abuse) computational tools.
Protein Structure Prediction
• ‘Homology’ modelling– Relies on the fact that similarity of sequence
implies similarity of 3D structure.
Lysozyme (1lz1) and α-lactalbumin (1alc): 37.7% Identity, Z=17.6
Imagine we don’t know the 3D structure of α-lactalbumin, but we do know its amino acid sequence and that of lysozyme.
Protein structure prediction (Homology Modelling)
• Align sequence of protein of unknown structure to sequence of protein of known structure.
• In ‘conserved core’ of protein, substitute the amino acid types into the known structure.
• Deal with ‘loops’ between the core elements of structure.
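A figure such as the 37.7% identity quoted for lysozyme and α-lactalbumin comes out of the alignment step. The sketch below shows one way such a number could be computed from two already-aligned sequences. It assumes identity is counted only over columns where neither sequence has a gap (other definitions exist), and the two toy fragments are hypothetical, not the real lysozyme or α-lactalbumin sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    This is one common definition; others divide by the full alignment
    length or by the length of the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned fragments, for illustration only.
toy_a = "KVFERCEL-ARTLK"
toy_b = "KQFTKCELSQ-LLK"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

Because the result depends on where the gaps are placed, a different alignment of the same two sequences can give a different percentage.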
Protein structure prediction (Homology modelling)
• Problems:
– Need protein of known structure that is similar in sequence.
– Building loops where there are deletions.
– Verifying model.
• Key is getting a good alignment in the first place
– Bad alignment => bad model.
Good alignment on its own can:
• Identify key residues (absolutely conserved)
• Identify likely protein core (conserved hydrophobic residues)
• Help predict protein secondary structure (not this lecture).
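The first two points can be sketched in a few lines of code. The example below is a minimal illustration, assuming the multiple alignment is given as equal-length strings with "-" for gaps. The hydrophobic residue set is one common choice and is an assumption here, as is the toy alignment itself.

```python
from typing import List

# One common choice of hydrophobic residues (an assumption, definitions vary).
HYDROPHOBIC = set("AVLIMFWC")


def absolutely_conserved_columns(alignment: List[str], gap: str = "-") -> List[int]:
    """0-based columns where every sequence has the same (non-gap) residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    conserved = []
    for col in range(length):
        residues = {seq[col] for seq in alignment}
        if len(residues) == 1 and gap not in residues:
            conserved.append(col)
    return conserved


def conserved_hydrophobic_columns(alignment: List[str]) -> List[int]:
    """0-based columns where every residue is hydrophobic (not necessarily identical)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    return [
        col
        for col in range(length)
        if all(seq[col] in HYDROPHOBIC for seq in alignment)
    ]


# Hypothetical toy alignment, for illustration only.
toy_alignment = [
    "GWLCA-D",
    "GWIVAKD",
    "GWVCA-D",
]
print(absolutely_conserved_columns(toy_alignment))   # candidate key residues
print(conserved_hydrophobic_columns(toy_alignment))  # candidate core positions
```

Real analyses would normally use many more sequences, since a column that is identical across only three sequences says little about functional importance.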
Sequence alignment is a fundamental technique in molecular biology.
• May predict proteins of common function even when no 3D structure is known.
• May be used to predict 3D structure and so help understanding of mutants.
• Some examples of where this is right and wrong...
Prediction of structure and function by similarity to known sequences and structures
Assumption is that similar sequence implies similar structure and function.
But what do we mean by “similar”?
Does similarity of sequence really imply similarity of function?
Protein Sequence/Structure/Function Network
Sequence     3D Structure   Function
Similar      Similar        Similar
Different    Different      Different
Similar Sequence, Similar Structure, Similar Function.
e.g. Trypsin-like Serine Proteinases
Same fold, same catalytic mechanism.
But DIFFERENT specificity.
e.g. Immunoglobulin variable domains.
Same fold, similar binding function.
But DIFFERENT specificity.
True of all examples. Similarities only give clues to function; differences in specificity can be regarded as differences of function.
Immunoglobulin Variable Domains
e.g. see: 1a2y
Tryptophan at core of Ig variable domain
Similar Sequence, Similar Structure, Different Function.
Lysozyme (1lz1) and α-lactalbumin (1alc): 37.7% Identity, Z=17.6
ε-crystallin/L-Lactate Dehydrogenase
Different Sequence, Different Structure, Similar Function.
Trypsin (3ptn): His-57, Asp-102, Ser-195
Subtilisin (2sec): Asp-32, His-64, Ser-221
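The residue numbers above show that trypsin and subtilisin use the same three residue types, but in a different order along the chain. The short sketch below makes that comparison explicit, using only the residue numbers given on the slide. The function name is hypothetical.

```python
# Catalytic residues and their sequence positions, as listed on the slide.
TRYPSIN = {"His": 57, "Asp": 102, "Ser": 195}     # PDB 3ptn
SUBTILISIN = {"Asp": 32, "His": 64, "Ser": 221}   # PDB 2sec


def order_along_chain(triad: dict) -> list:
    """Residue types sorted by their position in the sequence."""
    return [name for name, _pos in sorted(triad.items(), key=lambda item: item[1])]


print(order_along_chain(TRYPSIN))             # ['His', 'Asp', 'Ser']
print(order_along_chain(SUBTILISIN))          # ['Asp', 'His', 'Ser']
print(set(TRYPSIN) == set(SUBTILISIN))        # True: same residue types
```

The same residue types in a different sequence order is consistent with the two enzymes sharing a catalytic mechanism without sharing a common sequence or fold.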
Different Sequence, Similar Structure, Similar Function.
PDB: 1b47 (Nature 398, 84-90, 1999)
11% sequence ID
rmsd 1.47 Å over 70 residues
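A figure such as "rmsd 1.47 Å over 70 residues" is a root-mean-square deviation between equivalent atoms of two structures. The sketch below shows the calculation under the assumption that the two coordinate sets have already been optimally superposed and paired residue by residue. The superposition step itself is not shown, and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superposed and that the i-th point
    in each list refers to an equivalent atom (e.g. paired C-alpha atoms).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical toy coordinates in Å: every atom is displaced by 1 Å along z.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
toy_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(f"rmsd = {rmsd(toy_a, toy_b):.2f} Å over {len(toy_a)} atoms")
```

The number of residues compared matters as much as the RMSD itself, which is why the two are quoted together.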
Different Sequence, Similar Structure, Different Function.
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
PDB: 1bia and PDB: 2ptk
PDB: 2aai and PDB: 1bas
Matthews, S., et al. (1994), "The p17 Matrix Protein from HIV-1 is Structurally Similar to Interferon-gamma", Nature, 370, 666-668.
Similar Sequence, Different Structure.
Does this ever happen?
HIV Reverse Transcriptase (RT)
HIV Reverse Transcriptase (RT) - domain linkers
Protein Sequence and Structural Similarity
• Homologous (scop family)
– Similarity: similar structure, similar sequence, similar function.
– Find by: pair-wise sequence comparison (BLAST/FASTA/Smith-Waterman).
• ‘Remote Homologue’ (scop superfamily)
– Similarity: similar structure, weakly similar sequence, similar function.
– Find by: profile / iterative search (e.g. PSI-BLAST). Threading/fold recognition?
• Analogue (scop fold)
– Similarity: similar structure, no sequence similarity, often no functional similarity.
– Find by: solve BOTH structures by X-ray/NMR methods. Mapping?
Barton, G. J. et al. (1992), "Human Platelet Derived Endothelial Cell Growth Factor is Homologous to E. coli Thymidine Phosphorylase", Prot. Sci., 1, 688-690.
Barton, G. J., Cohen, P. T. C. and Barford, D. (1994), "Conservation Analysis and Structure Prediction of the Protein Serine/Threonine Phosphatases: Sequence Similarity with Diadenosine Tetra-phosphatase from E. coli Suggests Homology to the Protein Phosphatases", Eur. J. Biochem., 220, 225-237.
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
Reading material for this lecture:
This lecture itself. PDFs for “Barton” papers: www.compbio.dundee.ac.uk/ftp/pdf/
Database statistics: http://www.ebi.ac.uk/embl/
Meng, W., Sawasdikosol, S., Burakoff, S. J. and Eck, M. J. (1999), "Structure of the amino-terminal domain of Cbl complexed to its binding site on ZAP-70 kinase", Nature, 398, 84-90. (Available on-line at www.nature.com; search for ZAP-70 kinase; republished on-line in December.)
Kuriyan, J. and Darnell, J. E. (1999), "Protein recognition: An SH2 domain in disguise", Nature, 398, 22-25. (News and Views article for the above paper.)
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
Matthews, S., et al. (1994), "The p17 Matrix Protein from HIV-1 is Structurally Similar to Interferon-gamma", Nature, 370, 666-668.
Barton, G. J., Cohen, P. T. C. and Barford, D. (1994), "Conservation Analysis and Structure Prediction of the Protein Serine/Threonine Phosphatases: Sequence Similarity with Diadenosine Tetra-phosphatase from E. coli Suggests Homology to the Protein Phosphatases", Eur. J. Biochem., 220, 225-237.
The end of Lecture 1
Lecture 2 will be on sequence comparison methods.