Lecture 2.4 1 Sequencing & Sequence Alignment. Lecture 2.4 2 Objectives Understand how DNA sequence...

Date post: 19-Dec-2015
1 Lecture 2.4

Sequencing & Sequence Alignment

     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

2 Lecture 2.4

Objectives

• Understand how DNA sequence data is collected and prepared

• Be aware of the importance of sequence searching and sequence alignment in biology and medicine

• Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

3 Lecture 2.4

High Throughput DNA Sequencing

4 Lecture 2.4

30,000

5 Lecture 2.4

Shotgun Sequencing

Isolate Chromosome

Shear DNA into Fragments

Clone into Seq. Vectors

Sequence

6 Lecture 2.4

Principles of DNA Sequencing

Primer

pBR322

Amp

Tet

Ori

DNA fragment

Denature with heat to produce ssDNA

Klenow + ddNTP + dNTP + primers

7 Lecture 2.4

The Secret to Sanger Sequencing

8 Lecture 2.4

Principles of DNA Sequencing

5’

5’ Primer

3’ Template   G C A T G C

Reaction mix                      Chain-terminated fragments
dATP dCTP dGTP dTTP + ddATP       GCddA
dATP dCTP dGTP dTTP + ddCTP       GddC, GCATGddC
dATP dCTP dGTP dTTP + ddTTP       GCAddT
dATP dCTP dGTP dTTP + ddGTP       ddG, GCATddG
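The chain-termination idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the strand GCATGC is the slide's example, and every possible stop position is assumed to be produced in each reaction.

```python
def terminated_fragments(new_strand, dd_base):
    """List every fragment that ends where dd_base was incorporated."""
    return [new_strand[:i] + "dd" + base
            for i, base in enumerate(new_strand)
            if base == dd_base]


def read_ladder(new_strand):
    """Sort fragments from all four reactions by length and read their last base."""
    fragments = []
    for dd_base in "ACGT":
        for i, base in enumerate(new_strand):
            if base == dd_base:
                # (fragment length, terminating base)
                fragments.append((i + 1, dd_base))
    fragments.sort()
    return "".join(base for _, base in fragments)


strand = "GCATGC"
for dd_base in "ACGT":
    print(dd_base, terminated_fragments(strand, dd_base))
print(read_ladder(strand))
```

Reading the fragments from shortest to longest recovers the sequence, which is what the gel or capillary separation does physically.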

9 Lecture 2.4

Principles of DNA Sequencing

[Figure: sequencing gel with lanes G, C, T and A and the + and - electrodes marked; the bands read G C A T G C]

10 Lecture 2.4

Capillary Electrophoresis

Separation by Electro-osmotic Flow

11 Lecture 2.4

Multiplexed CE with Fluorescent detection

ABI 3700 96x700 bases

12 Lecture 2.4

Shotgun Sequencing

Sequence Chromatogram

Send to Computer

Assembled Sequence

13 Lecture 2.4

Shotgun Sequencing

• Very efficient process for small-scale (~10 kb) sequencing (preferred method)

• First applied to whole genome sequencing in 1995 (H. influenzae)

• Now standard for all prokaryotic genome sequencing projects

• Successfully applied to D. melanogaster

• Moderately successful for H. sapiens

14 Lecture 2.4

The Finished Product

GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTAGAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGAT

15 Lecture 2.4

Sequencing Successes

T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins

Escherichia coli: completed in 1998; 4,639,221 bp, 4293 ORFs

Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes

16 Lecture 2.4

Sequencing Successes

Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes

Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes

Homo sapiens: 1st draft completed in 2001; 3,160,079,000 bp, 31,780 genes

17 Lecture 2.4

So what do we do with all this sequence data?

18 Lecture 2.4

Sequence Alignment

     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10
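The matrix above scores GENETICS against GENESIS. A minimal dynamic programming sketch is given below, assuming match = 10, mismatch = 0 and gap = 0. These values are an assumption chosen to sit on the same scale as the matrix; the slide does not state its scoring scheme. The sketch fills the table forward from the top-left corner, so its intermediate cells are not claimed to match the slide's cells.

```python
def global_alignment_score(a, b, match=10, mismatch=0, gap=0):
    """Best global alignment score of a and b (Needleman-Wunsch style fill)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]


print(global_alignment_score("GENESIS", "GENETICS"))  # prints 60
```

Under these assumed parameters the best score is 60, the same value as the top-left cell of the matrix. Setting a negative gap value penalizes gaps and generally lowers the score.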

19 Lecture 2.4

Alignments tell us about...

• Function or activity of a new gene/protein

• Structure or shape of a new protein

• Location or preferred location of a protein

• Stability of a gene or protein

• Origin of a gene or protein

• Origin or phylogeny of an organelle

• Origin or phylogeny of an organism

20 Lecture 2.4

Factoid:

Sequence comparisons lie at the heart of all bioinformatics

21 Lecture 2.4

Similarity versus Homology

• Similarity refers to the likeness or % identity between 2 sequences

• Similarity means sharing a statistically significant number of bases or amino acids

• Similarity does not imply homology

• Homology refers to shared ancestry

• Two sequences are homologous if they are derived from a common ancestral sequence

• Homology usually implies similarity

22 Lecture 2.4

Similarity versus Homology

• Similarity can be quantified

• It is correct to say that two sequences are X% identical

• It is correct to say that two sequences have a similarity score of Z

• It is generally incorrect to say that two sequences are X% similar

23 Lecture 2.4

Similarity versus Homology

• Homology cannot be quantified

• If two sequences have a high % identity it is OK to say they are homologous

• It is incorrect to say two sequences have a homology score of Z

• It is incorrect to say two sequences are X% homologous

24 Lecture 2.4

Sequence Complexity

MCDEFGHIKLAN…. High Complexity

ACTGTCACTGAT…. Mid Complexity

NNNNTTTTTNNN…. Low Complexity

Translate those DNA sequences!!!
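One common way to put a number on sequence complexity is Shannon entropy over the residue composition. The sketch below assumes that measure (the slide does not say which measure it has in mind) and applies it to the three example strings above; the function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy in bits per residue of the composition of seq."""
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


for label, seq in [("High", "MCDEFGHIKLAN"),
                   ("Mid", "ACTGTCACTGAT"),
                   ("Low", "NNNNTTTTTNNN")]:
    print(label, round(shannon_entropy(seq), 3))
```

A 20-letter protein alphabet can carry more bits per position than a 4-letter DNA alphabet, which is one reading of the advice to translate DNA sequences before comparing them.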

25 Lecture 2.4

Assessing Sequence Similarity

Two Character Strings

THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison

THESTORYOFGENESI-S
THISBOOKONGENETICS
** * *  * **** * *

Context Comparison

THE STORY OF GENESIS
THIS BOOK ON GENETICS
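The character comparison on this slide can be checked with a short sketch that counts identical columns in an existing alignment. The function name and the choice to divide by the full alignment length (gap columns included) are assumptions; other denominators are also used in practice.

```python
def identity(aligned_a, aligned_b, gap="-"):
    """Return (identical columns, percent identity) for two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have the same length")
    matches = sum(a == b and a != gap for a, b in zip(aligned_a, aligned_b))
    return matches, 100.0 * matches / len(aligned_a)


matches, percent = identity("THESTORYOFGENESI-S", "THISBOOKONGENETICS")
print(matches, round(percent, 1))  # prints 11 61.1
```

The 11 identical columns correspond to the 11 asterisks under the character comparison.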

26 Lecture 2.4

Assessing Sequence Similarity

Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT

Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL

Is this alignment significant?

27 Lecture 2.4

Is This Alignment Significant?

Gelsolin 89 L G N E L S Q D E S G A A A I F T V Q L 108

Annexin 82 L P S A L K S A L S G H L E T V I L G L 101

154 L E K D I I S D T S G D F R K L M V A L 173

240 L E – S I K K E V K G D L E N A F L N L 258

314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333

Consensus L x P x x x P D x S G x h x x h x V L L

28 Lecture 2.4

Some Simple Rules

• If two sequences are > 100 residues and > 25% identical, they are likely related

• If two sequences are 15-25% identical they may be related, but more tests are needed

• If two sequences are < 15% identical they are probably not related

• If you need more than 1 gap for every 20 residues the alignment is suspicious
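The four rules above can be written as a small decision function. This is only a sketch of the slide's rules of thumb: the function name and return strings are hypothetical, and the case the slide leaves open (more than 25% identity over 100 residues or fewer) is reported as not covered rather than guessed.

```python
def assess_relatedness(length, percent_identity, gaps=0):
    """Apply the slide's rules; return (verdict, alignment_is_suspicious)."""
    if percent_identity < 15:
        verdict = "probably not related"
    elif percent_identity <= 25:
        verdict = "may be related, more tests needed"
    elif length > 100:
        verdict = "likely related"
    else:
        # Assumption: the slide gives no rule for short, high-identity alignments.
        verdict = "not covered by the rules"
    # More than 1 gap per 20 residues makes the alignment suspicious.
    suspicious = gaps * 20 > length
    return verdict, suspicious


print(assess_relatedness(104, 86.5))
print(assess_relatedness(120, 20, gaps=10))
```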

29 Lecture 2.4

Doolittle’s Rules of Thumb

[Figure: Evolutionary Distance vs. Percent Sequence Identity. x-axis: Number of Residues (0 to 400); y-axis: Sequence Identity (%); one region of the plot is labeled "Twilight Zone"]

30 Lecture 2.4

Sequence Alignment - Methods

• Dot Plots

• Dynamic Programming

• Heuristic (Fast) Local Alignment

• Multiple Sequence Alignment

• Contig Assembly
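The first method in the list, the dot plot, is simple enough to sketch directly: put one sequence down the side, the other across the top, and mark every cell where the residues are identical. The function name is hypothetical and no window or threshold filtering is applied, which real dot plot programs usually add.

```python
def dot_plot(seq_a, seq_b):
    """Rows of '*' (identical) and '.' (different); seq_a down, seq_b across."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


rows = dot_plot("GENESIS", "GENETICS")
print("  GENETICS")
for residue, row in zip("GENESIS", rows):
    print(residue + " " + row)
```

Runs of marks along a diagonal indicate stretches where the two sequences agree, for example the shared GENE at the start of both words.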

31 Lecture 2.4

PAM Matrices

• Developed by M.O. Dayhoff (1978)

• PAM = Point Accepted Mutation

• Matrix assembled by looking at patterns of substitutions in closely related proteins

• 1 PAM corresponds to 1 amino acid change per 100 residues

• 1 PAM = 1% divergence or 1 million years in evolutionary history
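In Dayhoff's scheme, matrices for larger distances (PAM250, for example) are obtained by multiplying the 1 PAM mutation probability matrix by itself. The sketch below shows that extrapolation on a toy two-letter alphabet. The matrix values are hypothetical and chosen only so that each letter changes with probability 1%, matching the "1 change per 100 residues" definition; the real matrix is 20 x 20.

```python
def mat_mul(x, y):
    """Plain matrix product of two square lists of lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1 PAM matrix: row = original letter, column = replacement letter.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0])
```

Because a site can change more than once, 250 PAM does not mean 250% of positions differ; the probability of seeing the original letter falls with distance but never reaches zero.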

32 Lecture 2.4

Fast Local Alignment Methods

• Developed by Lipman & Pearson (1985/88)

• Refined by Altschul et al. (1990/97)

• Ideal for large database comparisons

• Uses heuristics & statistical simplification

• Fast N-type algorithm (similar to Dot Plot)

• Cuts sequences into short words (k-tuples)

• Uses “Hash Tables” to speed comparison
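The k-tuple and hash table idea can be sketched as follows. This is not the FASTA or BLAST implementation, only an illustration of the lookup step: one sequence is cut into overlapping words and stored in a dictionary, the other is scanned word by word, and hits are grouped by diagonal (query position minus subject position). All function names are hypothetical.

```python
from collections import Counter, defaultdict


def index_ktuples(seq, k):
    """Map every k-letter word in seq to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, subject, k=2):
    """Count shared k-tuples on each diagonal (query start minus subject start)."""
    table = index_ktuples(subject, k)
    hits = Counter()
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], ()):
            hits[i - j] += 1
    return hits


print(dict(diagonal_hits("GENETICS", "GENESIS", k=2)))  # prints {0: 3}
```

A diagonal with many hits marks a region worth aligning carefully, so the expensive alignment step only runs where word matches cluster.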

33 Lecture 2.4

FASTA

• Developed in 1985 and 1988 (W. Pearson)

• Looks for clusters of nearby or locally dense “identical” k-tuples

• init1 score = score for first set of k-tuples

• initn score = score for gapped k-tuples

• opt score = optimized alignment score

• Z-score = number of S.D. above random

• expect = expected # of random matches

34 Lecture 2.4

FASTA

gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
 initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)

gi|135 2- 105: --------------------------------------------------------------------:

               10        20        30        40        50        60        70        80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
        :::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135  VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
                10        20        30        40        50        60        70

               90       100
thiore KKGQKVGEFSGANKEKLEATINELV
       ::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
      80        90       100

35 Lecture 2.4

Multiple Sequence Alignment

Multiple alignment of Calcitonins

36 Lecture 2.4

Multiple Alignment Algorithm

• Take all “n” sequences and perform all possible pairwise (n(n-1)/2) alignments

• Identify highest scoring pair, perform an alignment & create a consensus sequence

• Select next most similar sequence and align it to the initial consensus, regenerate a second consensus

• Repeat step 3 until finished
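The order in which sequences get joined can be sketched as below. Two simplifications are assumed and should be read as such: an ungapped position-by-position identity count stands in for a real pairwise alignment score, and the next sequence is chosen by its best score against any sequence already joined rather than against a rebuilt consensus. Function names are hypothetical.

```python
from itertools import combinations


def ungapped_identity(a, b):
    """Stand-in pairwise score: identical positions without any gaps."""
    return sum(x == y for x, y in zip(a, b))


def progressive_order(seqs):
    """Return (joining order as indices, number of pairwise comparisons)."""
    pair_scores = {(i, j): ungapped_identity(seqs[i], seqs[j])
                   for i, j in combinations(range(len(seqs)), 2)}
    # Start from the highest scoring pair.
    order = list(max(pair_scores, key=pair_scores.get))
    remaining = set(range(len(seqs))) - set(order)
    while remaining:
        # Pick the sequence most similar to anything already joined.
        nxt = max(sorted(remaining),
                  key=lambda k: max(pair_scores[tuple(sorted((k, m)))]
                                    for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order, len(pair_scores)


seqs = ["CCCCCCC", "GATTACA", "GATCACA", "GATTACA"]
print(progressive_order(seqs))  # prints ([1, 3, 2, 0], 6)
```

With 4 sequences there are 4 x 3 / 2 = 6 pairwise comparisons; the count grows quadratically with the number of sequences.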

37 Lecture 2.4

Multiple Sequence Alignment

• Developed and refined by many (Doolittle, Barton, Corpet) through the 1980’s

• Used extensively for extracting hidden phylogenetic relationships and identifying sequence families

• Powerful tool for extracting new sequence motifs and signature sequences

38 Lecture 2.4

Multiple Alignment

• Most commercial vendors offer good multiple alignment programs including:

• GCG (Accelrys)

• PepTool/GeneTool (BioTools Inc.)

• LaserGene (DNAStar)

• Popular web servers include T-COFFEE, MULTALIN and CLUSTALW

• Popular freeware includes PHYLIP & PAUP

39 Lecture 2.4

Multi-Align Websites

• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml

• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html

• T-Coffee http://www.ch.embnet.org/software/TCoffee.html

• MULTALIN http://www.toulouse.inra.fr/multalin.html

• CLUSTALW http://www.ebi.ac.uk/clustalw/

40 Lecture 2.4

Multi-alignment & Contig Assembly

ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..

ATCGAT

GCGTAG

CTAGCAGACTACCGTT

GTTACGATGCCTT

TAGCTACGCATCGT

41 Lecture 2.4

Contig Assembly

• Read, edit & trim DNA chromatograms

• Remove overlaps & ambiguous calls

• Read in all sequence files (10-10,000)

• Reverse complement all sequences (doubles # of sequences to align)

• Remove vector sequences (vector trim)

• Remove regions of low complexity

• Perform multiple sequence alignment
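Two of the preparation steps above are easy to sketch: trimming ambiguous calls from read ends and adding the reverse complement of every read. The sketch assumes ambiguous calls are written as N and only occur at the ends; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_ambiguous_ends(seq):
    """Drop runs of N from both ends of a read."""
    return seq.strip("N")


def both_strands(reads):
    """Each read plus its reverse complement (doubles the number of sequences)."""
    result = []
    for read in reads:
        result.append(read)
        result.append(reverse_complement(read))
    return result


print(reverse_complement("CGATGCGTAGCA"))  # prints TGCTACGCATCG
print(both_strands([trim_ambiguous_ends("NNGTTACGATGCCTTN")]))
```

Both orientations are kept because a shotgun read can come from either strand, and the assembler cannot know which in advance.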

42 Lecture 2.4

Chromatogram Editing

43 Lecture 2.4

Sequence Loading

44 Lecture 2.4

Sequence Alignment

45 Lecture 2.4

Contig Alignment - Process

ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…

ATCGATGCGTAGCTAGCAGACTACCGTT

GTTACGATGCCTT

CGATGCGTAGCA

ATCGATGCGTAGCTAGCAGACTACCGTT
GTTACGATGCCTT
TGCTACGCATCG
CGATGCGTAGCA
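The merging step of contig assembly can be sketched with a greedy exact-overlap strategy: repeatedly find the two reads whose suffix and prefix overlap the most and join them. This is a teaching sketch, not how Phrap or the other programs work internally. It assumes error-free reads on one strand, and the three reads below are hypothetical fragments cut from the target sequence shown on the earlier slide, not the slide's own reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps of min_len remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to join
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


# Hypothetical error-free reads covering ATCGATGCGTAGCAGACTACCGTTACGATGCCTT
reads = ["CCGTTACGATGCCTT", "ATCGATGCGTAGCAG", "GTAGCAGACTACCGTT"]
print(greedy_assemble(reads))
```

Real reads contain base-calling errors and repeats, which is why assemblers use alignment scores and quality values instead of exact string matches.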

46 Lecture 2.4

Sequence Assembly Programs

• Phred - base calling program that does detailed statistical analysis (UNIX) http://www.phrap.org/

• Phrap - sequence assembly program (UNIX) http://www.phrap.org/

• TIGR Assembler - microbial genomes (UNIX) http://www.tigr.org/softlab/assembler/

• The Staden Package (UNIX) http://www.mrc-lmb.cam.ac.uk/pubseq/

• GeneTool/ChromaTool/Sequencher (PC/Mac)

47 Lecture 2.4

Conclusions

• Sequence alignments and database searching are key to all of bioinformatics

• There are four different methods for doing sequence comparisons: 1) Dot Plots; 2) Dynamic Programming; 3) Fast Alignment; and 4) Multiple Alignment

• Understanding the significance of alignments requires an understanding of statistics and distributions

