Michael Schroeder BioTechnological Center TU Dresden Biotec Introduction based on Chapter 1 Lesk,...

Post on 28-Dec-2015


Michael Schroeder, BioTechnological Center, TU Dresden (Biotec)

Introduction

based on Chapter 1 of Lesk, Introduction to Bioinformatics

By Michael Schroeder, Biotec, 2

Contents

Molecular biology primer
The role of computer science
Phylogeny
Sequence searching


26 June 2000: Draft of the human genome announced!

1953: Watson and Crick discover the structure of DNA
2000: Draft of human genome is published

“The most wondrous map ever produced by human kind”
“One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”


High-throughput biomedicine

Microarrays: measure the activity of thousands of genes at the same time
Example: cancer. Compare gene activity with and without drug treatment
Result: hundreds of candidate drug targets

RNAi (Nobel Prize 2006, Fire and Mello): knock down genes and observe the effect
Example: infectious diseases. Which proteins orchestrate entry into the cell?
Result: hundreds of candidate proteins

Atomic force microscopes (co-invented by Nobel laureate Binnig): pull a protein out of the membrane and measure the force
Example: eye diseases resulting from misfolding
Result: hundreds of candidate residues


Drug Discovery

Challenge: Longer time to market, fewer drugs, exploding costs

Approach: Use of compound libraries and high-throughput screening


HTS and Bioinformatics

High-throughput technologies have completely changed the work of biomedical researchers

Challenge: Interpret (often large) results of screens

Approach: Before running secondary assays use bioinformatics and IT to assemble all possible information


Good News

Tens of thousands of 3D structures
Millions of sequences
Millions of articles
Hundreds of DBs/tools


Bad News: Data != Knowledge

How to analyse data, how to integrate data?

Computer science to the rescue…


Example: computer science is key for sequencing

The human genome is a string of length 3,200,000,000
Shotgun sequencing: break multiple copies of the string into shorter substrings
Example:

shotgunsequencing shotgunsequencing shotgunsequencing

cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un

Computing problem: Assemble strings


Computer science key for sequencing

sh sho shot otgu tg gun un ns seq sequ equ uenc encing en cing ing
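The assembly problem above can be sketched as a greedy overlap merge: repeatedly join the two fragments with the longest suffix/prefix overlap. This is only an illustration, not the method used by real genome assemblers; the function names `overlap` and `greedy_assemble` and the four example fragments are assumptions made for this sketch.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedy assembly sketch: merge the best-overlapping pair until one string is left."""
    # Drop duplicates, then fragments that are fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0] if frags else ""


# Hypothetical reads, given in shuffled order
reads = ["uencing", "shotgun", "sequenc", "gunsequ"]
print(greedy_assemble(reads))  # shotgunsequencing
```

With very short fragments or long repeats, several overlaps tie or mislead the greedy choice, so the result can be wrong. That is the difficulty the repeat question on this slide points at.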

QUESTION: How can you handle long repetitive sequences?

Heeeeelllllllllllooooooo

QUESTION: Why was a draft announced? When was the final version ready?


Arabidopsis thaliana
mouse
rat
Caenorhabditis elegans
Drosophila melanogaster
Mycobacterium leprae
Vibrio cholerae
Plasmodium falciparum
Mycobacterium tuberculosis
Neisseria meningitidis Z2491
Helicobacter pylori
Xylella fastidiosa
Borrelia burgdorferi
Rickettsia prowazekii
Bacillus subtilis
Archaeoglobus fulgidus
Campylobacter jejuni
Aquifex aeolicus
Thermotoga maritima
Chlamydia pneumoniae
Pseudomonas aeruginosa
Ureaplasma urealyticum
Buchnera sp. APS
Escherichia coli
Saccharomyces cerevisiae
Yersinia pestis
Salmonella enterica
Thermoplasma acidophilum


DNA – the molecule of life

http://www.ornl.gov/hgmis


The genetic code


Protein Structure

DNA: nucleotides are very similar, and hence the structure of DNA is very uniform

Proteins: great variety in three-dimensional conformation to support diverse structures and functions

If heated, a protein “unfolds” to a biologically inactive structure; under normal conditions the protein folds


Paradox

Translation from DNA sequence to amino acid sequence is very simple to describe, but requires immensely complicated machinery (ribosome, tRNA)

The folding of the protein sequence into its three-dimensional structure is very difficult to describe, but occurs spontaneously
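The "simple to describe" half of the paradox fits in a few lines of code: a lookup table from codons to amino acids. The sketch below assumes the standard genetic code, a coding-strand DNA string and a reading frame that starts at the first base; the names `CODON_TABLE` and `translate` are chosen for this illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate coding-strand DNA codon by codon until a stop codon ('*')."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

No comparably short function exists for the second half of the paradox, predicting the folded three-dimensional structure from the amino acid sequence.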


Central Dogma

DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function


Sequence vs. structure similarity

Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt


High sequence similarity = high structure similarity


Low sequence similarity: usually low structure similarity


Low sequence similarity: possibly still high structure similarity

Example: 11% sequence identity, yet the structures match perfectly


Sequence similarity is key concept

Similar sequences are a hint for common ancestry and possibly similar function


Example: v-sis vs. PDGF

Example from the early 1980s:
v-sis in simian sarcoma virus leads to cancer in infected cells
PDGF in humans is a normal growth factor for cells
v-sis and PDGF are 85% similar

Alignment from: http://pdf.aminer.org/000/244/500/design_and_implementation_of_a_dna_sequence_processor.pdf


If an unknown sequence is found, deduce its function/structure indirectly by finding similar sequences, whose function/structure is known

Assumption: Evolution changes sequences “slowly”, often maintaining the main features of a sequence’s function/structure


Sequence similarity is a hint for an evolutionary relationship


How similar are sequences?

>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).

KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST

>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).

RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQKNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHFDNSV

>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).

ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDAYV


Multiple Alignment with ClustalW (www.ebi.ac.uk/clustalw)

CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                     *:** **:*****: :......*** ** *.**.* ***:***:**. *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::******* ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     * *


Example: Number of Aligned Residues

Horse and Minke whale: 95
Minke whale and Red kangaroo: 82
Horse and Red kangaroo: 75

Conclusion: Horse and whale share the most identical residues

Horse and whale are placental, kangaroo is marsupial
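Counts like the ones on this slide can be taken directly from the rows of a multiple alignment. The sketch below counts, for each pair of rows, the columns where both carry the same residue and neither has a gap. The function names are assumptions made for this illustration, and the example uses only the short last block of the ClustalW output above, not the full alignment.

```python
from itertools import combinations


def count_identical(row_a, row_b, gap="-"):
    """Number of alignment columns with the same residue in both rows (gaps never count)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)


def pairwise_identities(rows):
    """Identical-residue counts for every pair of named alignment rows."""
    return {
        (name_a, name_b): count_identical(rows[name_a], rows[name_b])
        for name_a, name_b in combinations(rows, 2)
    }


# Last block of the ribonuclease alignment (8 columns only)
block = {
    "horse": "DASVEVST",
    "whale": "DNSV----",
    "kangaroo": "DAYV----",
}
print(pairwise_identities(block))
# {('horse', 'whale'): 3, ('horse', 'kangaroo'): 3, ('whale', 'kangaroo'): 2}
```

Summing these counts over all blocks of the alignment gives one figure per species pair, comparable to the numbers listed above.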


Example: Elephant and Mammoth

Mitochondrial cytochrome b from Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost

African elephant (Loxodonta africana)
Indian elephant (Elephas maximus)


African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
           *** ** ***:**:**********************************************

CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
           ************************************************************

CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
           **************************************:*********************

CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
           :********:***********************************************:**

CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
           ******************************************************:*****

CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
           **:*************************: *** **********:***************

CYB_LOXAF  IILAFLPIAGVIENYLIK 378
CYB_MAMPR  IILAFLPIAGMIENYLIK 378
CYB_ELEMA  IILAFLPIAGMIENYLIK 378
           **********:*******


Example: Elephant and Mammoth

Mammoth and African elephant have 10 mismatches, mammoth and Indian elephant 14.

Significant?


Similarity and Homology

Important difference:
Similarity is the measurement of resemblance of sequences
Homology means descent from a common ancestor

Similarity is gradual, homology is either true or false
Similarity = now, homology = past events
Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection)

Homology is inferred from sequence similarity


Homology = derived from common ancestor

Characteristics derived from a common ancestor are called homologous
E.g. eagle’s wing and human’s arm

Other apparently similar characteristics may have arisen independently by convergent evolution
E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings

Homologous characters may diverge functionally
E.g. bones in the human middle ear and jaws of primitive fish


Example: Homology/Similarity

The assertion that the cytochrome b sequences are homologues means that there is a common ancestor

BUT:
1. Maybe cytochrome b functionally requires so many conserved residues that it will occur in this form in many species (in fact, this is not the case here)
2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)

Similarity vs. Homology

Any two sequences can be similar
Sequences are homologous if they evolved from a common ancestor
Homologous sequences:
Orthologs: similar biological function
Paralogs: different biological function (after gene duplication), e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein

Assumption: similarity is an indicator of homology
Note: the altered function of the expressed protein will determine whether the organism survives to reproduce, and hence passes on the altered gene

Sequence similarity is key concept

How similar are two sequences?
How to align the sequences?
How to align multiple sequences?
How to find motifs?


Sequence alignment

Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)

And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-

Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)

  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
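The slide does not name an algorithm. One standard way to compute such alignments is dynamic programming: Needleman-Wunsch for the global case and Smith-Waterman for the local case. The sketch below returns only the best score, not the alignment itself. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a with all of b (Needleman-Wunsch, score only)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned against gaps only
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]                          # column 0: a aligned against gaps only
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings of a and b (Smith-Waterman, score only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)  # 0 = start afresh here
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(global_score("AC", "AGC"))                  # 1  (A-C against AGC)
print(local_score("xxGATTACAyy", "zzGATTACA"))    # 7  (the shared GATTACA)
```

The only difference between the two is the floor at 0 in the local version: a local alignment may drop a poorly matching prefix at no cost, which is what "ends can be ignored" means.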


Sequence alignment

Motif search: find matches of a short sequence in a long sequence
Options: perfect match, 1 mismatch, mismatches + gaps + insertions + deletions

        match
         ||||
for the watch to babble and to talk is most tolerable
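The first two options (perfect match, or a fixed number of mismatches) can be sketched by sliding the motif along the text and counting differing characters in each window. Gaps, insertions and deletions are not handled here; they need an alignment algorithm. The function name `find_motif` and its parameters are assumptions made for this illustration.

```python
def find_motif(text, motif, max_mismatches=0):
    """Start positions where motif matches text with at most max_mismatches substitutions."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


text = "for the watch to babble and to talk is most tolerable"
print(find_motif(text, "match"))                    # []  (no perfect match)
print(find_motif(text, "match", max_mismatches=1))  # [8] ("watch")
```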


Sequence alignment

Multiple sequence alignment

No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.


Quick check

By now you should
Know the main data sources (sequence and structure)
Know the role that bioinformatics plays
Understand the difference between homology and similarity
Understand what sequence comparison and alignment are