Sequence Alignment Lakshmanan Iyer, Ph. D.. The Building Blocks… ATGC VLMFNQEDHKRCSTPYW.

Date post: 11-Jan-2016
Page 1: Sequence Alignment

Lakshmanan Iyer, Ph. D.

Page 2: The Building Blocks…

ATGC

VLMFNQEDHKRCSTPYW

Page 3: Why Align Sequences?

Discover functional, structural, and evolutionary information

Similar sequences may have similar function
– Gene regulation
– Biochemical function
– Similar structure

Homology
– Similar sequences may have a common ancestor

Page 4: What is Sequence Alignment?

Global Alignment

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local Alignment

-------TGKGS-------
        |||
-------AGKGA-------

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

Page 5: Example Sequence Alignment?

[Figure: an example alignment, with conserved and similar positions marked, shown with an evolutionary tree]

Page 6: Methods of Sequence Alignment

Pair-wise Sequence Alignment

Multiple Sequence Alignment

– Dot Matrix Analysis
– Dynamic Programming Algorithm
– Word or k-tuple methods (FASTA, BLAST, BLAT)

Page 7: Dot Matrix Alignment

Place the sequences on the X and Y axes and put a dot wherever the two residues match

Especially useful to detect repetitive structure
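The dot matrix idea can be sketched in a few lines of Python. This is only an illustration, assuming exact residue matches and no window or threshold filtering; the function name `dot_matrix` is made up for this example.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y; '*' marks a match with seq_x."""
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


if __name__ == "__main__":
    # Comparing a sequence with itself: the main diagonal is the identity,
    # and the extra parallel diagonals reveal the repeated unit "ATGC".
    seq = "ATGCATGC"
    for residue, row in zip(seq, dot_matrix(seq, seq)):
        print(residue, row)
```

Off-diagonal runs of dots are what make repeats visible, which is why the method is handy for spotting repetitive structure.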

Page 8: Dynamic Programming

The problem at hand is divided into a series of sub-problems

The sub-problems are solved in steps

The results are compiled to find the final solution.

Page 9: Scoring Systems

• Position Independent Matrices
  • Nucleic Acids – identity matrix
  • Proteins
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM numbers all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used

• Position Specific Score Matrices (PSSMs)
  • PSI and RPS BLAST

Page 10: BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Page 11: Gapped Alignments

• Gapping provides more biologically realistic alignments
• Gapped BLAST parameters must be simulated
• Affine gap costs = -(a+bk)
  a = gap open penalty
  b = gap extend penalty
  k = gap length
  A gap of length 1 receives the score -(a+b)

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

-------TGKGS-------
        |||
-------AGKGA-------
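The affine gap cost can be written as a one-line function. The sketch below is illustrative only: the function name is invented, and the default penalties a = 11 and b = 1 are an assumption chosen so that a gap of length 1 scores -12, the BLOSUM62 gap score that appears on the "Scores" slide.

```python
def affine_gap_score(k: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Score of a gap of length k under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * k)


if __name__ == "__main__":
    # Opening a gap is expensive; each extra gapped position adds only b.
    for length in (1, 2, 5):
        print(length, affine_gap_score(length))
```

Because only the extension term grows with k, one long gap is penalised less than several short gaps of the same total length.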

Page 12: Scores

            V    D    S    –    C    Y
            V    E    T    L    C    F   Total
BLOSUM62   +4   +2   +1  -12   +9   +3       7
PAM30      +7   +2    0  -10  +10   +2      11
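The BLOSUM62 total can be reproduced by summing substitution scores and charging each run of gap characters an affine cost. This is a minimal sketch: the dictionary holds only the five BLOSUM62 entries this alignment needs, the gap is written as "-", and the penalties a = 11, b = 1 are assumed because they give the -12 shown for the single gap.

```python
# Only the BLOSUM62 entries needed for this one alignment.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3,
}


def score_alignment(top, bottom, matrix, gap_open=11, gap_extend=1):
    """Sum substitution scores; each run of gapped columns costs -(a + b*k)."""
    total, gap_len = 0, 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1          # still inside a gap run
            continue
        if gap_len:               # a gap run just ended: charge it once
            total -= gap_open + gap_extend * gap_len
            gap_len = 0
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    if gap_len:                   # alignment ended inside a gap run
        total -= gap_open + gap_extend * gap_len
    return total


if __name__ == "__main__":
    print(score_alignment("VDS-CY", "VETLCF", BLOSUM62_SUBSET))
```

Swapping in a PAM30 table and its own gap penalties would give the second row in the same way.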

Page 13: Dynamic Programming. Global Alignment: Needleman-Wunsch

Calculate scores for site pairs (BLOSUM62)

      H    E    A
P    -2   -1   -1
A    -2   -1    4
W    -2   -3   -3

Score matrix for the first three residues of each sequence

           H    E    A
      0   -8  -16  -24
P    -8   -2   -9  -17
A   -16  -10   -3   -5
W   -24  -18  -11   -6

Page 14: Trace Back

           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A   -16  -10   -3   -5  -13  -21  -29  -37  -45  -53  -61
W   -24  -18  -11   -6   -7  -15  -10  -18  -26  -34  -42
H   -32  -16  -18  -13   -8   -9  -17  -12  -10  -18  -26
E   -40  -24  -11  -19  -15   -9  -12  -19  -12   -5  -13
A   -48  -32  -19   -7  -15  -11  -12  -12  -20  -13   -6
E   -56  -40  -27  -15   -9  -16  -14  -14  -12  -15   -8

Trace back path, from the final cell to the origin: -8, -13, -12, -12, -10, -21, -25, -17, -16, -8, 0

H E A G A W G H E E
- - P - A W H E A E
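The fill and trace back steps can be sketched as below. This is an illustrative implementation, not the presenter's code: it assumes a linear gap penalty of -8 (the step between cells in the first row and column) and hard-codes only the BLOSUM62 entries for the six residue types in the example.

```python
# BLOSUM62 entries for the residues A, E, G, H, P, W only.
_PAIRS = {
    "AA": 4, "EA": -1, "EE": 5, "GA": 0, "GE": -2, "GG": 6,
    "HA": -2, "HE": 0, "HG": -2, "HH": 8, "PA": -1, "PE": -1,
    "PG": -2, "PH": -2, "PP": 7, "WA": -3, "WE": -3, "WG": -2,
    "WH": -2, "WP": -4, "WW": 11,
}


def sub(a: str, b: str) -> int:
    """Symmetric substitution score lookup."""
    return _PAIRS[a + b] if a + b in _PAIRS else _PAIRS[b + a]


def needleman_wunsch(cols: str, rows: str, gap: int = -8):
    """Global alignment; returns (score matrix, aligned cols, aligned rows)."""
    n, m = len(rows), len(cols)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(rows[i - 1], cols[j - 1]),  # match/mismatch
                F[i - 1][j] + gap,                                # gap in cols
                F[i][j - 1] + gap,                                # gap in rows
            )
    # Trace back from the bottom-right cell, preferring the diagonal move.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(rows[i - 1], cols[j - 1]):
            top.append(cols[j - 1]); bottom.append(rows[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            top.append(cols[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(rows[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
    print(F[-1][-1])
    print(top)
    print(bottom)
```

Ties between moves are broken in favour of the diagonal here; a different tie-breaking rule could return another alignment with the same score.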

Page 15: BLAST…

NCBI Presentation …

Page 16: NCBI Molecular Biology Resources

January 2006 Peter Cooper

Using NCBI BLAST

Page 17: Sequence Similarity Searching

Basic Local Alignment Search Tool

Page 18: What BLAST tells you

BLAST reports surprising alignments
– Different than chance

Assumptions
– Random sequences
– Constant composition

Conclusions
– Surprising similarities imply evolutionary homology

Evolutionary Homology: descent from a common ancestor

Does not always imply similar function

Page 19: Basic Local Alignment Search Tool

Widely used similarity search tool

Heuristic approach based on the Smith-Waterman algorithm

Finds best local alignments

Provides statistical significance

All combinations (DNA/Protein) of query and database
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation

www, standalone, and network clients

Page 20: BLAST and BLAST-like programs

Traditional BLAST (blastall): nucleotide, protein, translations
– blastn: nucleotide query vs. nucleotide database
– blastp: protein query vs. protein database
– blastx: nucleotide query vs. protein database
– tblastn: protein query vs. translated nucleotide database
– tblastx: translated query vs. translated database

Megablast: nucleotide only
– Contiguous megablast: nearly identical sequences
– Discontiguous megablast: cross-species comparison

Position Specific BLAST Programs: protein only
– Position Specific Iterative BLAST (PSI-BLAST): automatically generates a position specific score matrix (PSSM)
– Reverse PSI-BLAST (RPS-BLAST): searches a database of PSI-BLAST PSSMs

Page 21: Nucleotide Words

Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG

Make a lookup table of words

11-mer
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
. . .

28-mer
GTACTGGACATGGACCCTACAGGAACGT
TGGACATGGACCCTACAGGAACGTATAC
CATGGACCCTACAGGAACGTATACGTAA
. . .

WORD SIZE    Def.   Min.
blastn        11      7
megablast     28     12
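A lookup table of words is essentially a dictionary from each word to the positions where it starts in the query. The sketch below assumes overlapping words taken at every offset and the blastn default word size of 11; `word_lookup` is a name invented for this illustration.

```python
from collections import defaultdict


def word_lookup(query: str, word_size: int = 11) -> dict[str, list[int]]:
    """Map each word of length word_size to the 0-based offsets where it starts."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


if __name__ == "__main__":
    query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
    table = word_lookup(query)
    for word in list(table)[:3]:
        print(word, table[word])
```

Scanning a database sequence then reduces to looking up each of its words in this table, which is what makes the word-based search fast.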

Page 22: Protein Words

Query: GTQITVEDLFYNIATRRKALKN

Make a lookup table of words

GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...

Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.

Word size = 3 (default)

Word size can only be 2 or 3
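Neighborhood words are candidate words scored against a query word with the substitution matrix and kept when the score reaches a threshold. The sketch below is an assumption-laden illustration: the slide gives no threshold, so `threshold` is a hypothetical parameter, and the dictionary holds only the BLOSUM62 entries needed for the four listed neighbors of ITV.

```python
# BLOSUM62 entries needed to score the listed neighbors of "ITV".
SCORES = {
    ("I", "I"): 4, ("L", "I"): 2, ("M", "I"): 1,
    ("T", "T"): 5, ("S", "T"): 1, ("V", "V"): 4,
}


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup in the small score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def word_score(word: str, candidate: str) -> int:
    """Sum of position-by-position substitution scores."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def words(query: str, size: int = 3) -> list[str]:
    """All overlapping words of the given size."""
    return [query[i:i + size] for i in range(len(query) - size + 1)]


def neighborhood(word: str, candidates: list[str], threshold: int) -> list[str]:
    """Candidates whose score against word is at least the (assumed) threshold."""
    return [c for c in candidates if word_score(word, c) >= threshold]


if __name__ == "__main__":
    print(words("GTQITVEDLFYNIATRRKALKN")[:8])
    for cand in ["LTV", "MTV", "ISV", "LSV"]:
        print(cand, word_score("ITV", cand))
```

A full implementation would enumerate all possible three-letter words rather than a hand-picked candidate list.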

Page 23: Minimum Requirements for a Hit

• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

Protein: two matches (neighborhood words)
GTQITVEDLFYNI
SEI, YYN

Nucleotide: one match (exact word match)
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT

Page 24: An alignment that BLAST can’t find

1    GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
     || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
1    GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61   GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
     | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
61   GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121  GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
     |||| || ||||| ||  ||    |  | ||||  || |||
121  GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
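One way to see why a word-based search misses this alignment is to measure the longest run of consecutive identical positions. The sketch below assumes the ungapped, position-by-position pairing shown in the slide; the function names are invented for this illustration.

```python
QUERY = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
SUBJECT = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def longest_identical_run(a: str, b: str) -> int:
    """Length of the longest stretch of consecutive matching positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def identities(a: str, b: str) -> int:
    """Number of positions where the two sequences agree."""
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    print(longest_identical_run(QUERY, SUBJECT), identities(QUERY, SUBJECT), len(QUERY))
```

The longest identical run along this alignment is 6 bases, below the minimum blastn word size of 7 from the word-size table, so no exact word seeds the alignment even though well over half of the positions match.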

Page 25: Megablast: NCBI’s Genome Annotator

Long alignments for similar DNA sequences

Concatenation of query sequences

Faster than blastn

Contiguous Megablast
– exact word match
– Word size 28

Discontiguous Megablast
– initial word hit with mismatches
– cross-species comparison

Page 26: Templates for Discontiguous Words

W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

W = word size; # matches in template

t = template length (window size within which the word match is evaluated)

Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
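A template can be read as a mask over a window of t bases: positions marked 1 must match, positions marked 0 are ignored. The sketch below only illustrates that reading; `discontiguous_word` is an invented name, and the dictionary copies three of the templates listed above.

```python
# (W, t, type) -> template, copied from the list of templates.
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (11, 18, "coding"): "101101100101101101",
}


def discontiguous_word(window: str, template: str) -> str:
    """Keep only the bases at positions where the template has a '1'."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, bit in zip(window, template) if bit == "1")


if __name__ == "__main__":
    template = TEMPLATES[(11, 16, "coding")]
    window = "GTACTGGACATGGACC"  # first 16 bases of the nucleotide query example
    print(discontiguous_word(window, template))
```

In the coding templates every third position is mostly skipped, which tolerates the third-codon-position changes that are common between species.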

Page 27: Local Alignment Statistics

High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments)

[Figure: distribution of alignments (vertical axis) against score (horizontal axis)]

Expect Value

E = number of database hits you expect to find by chance

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$

E = expected number of random hits
$mn$ = size of database
$S$ = your score
$K$ = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bitscore = $(\lambda S - \ln K)/\ln 2$
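The two forms of the Expect value can be checked against each other numerically. The sketch below assumes the standard Karlin-Altschul form in which the raw score S is scaled by λ. The values λ = 0.267 and K = 0.041 are assumptions (typical of gapped BLOSUM62 searches), and the search-space sizes m and n are made up.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score: float, m: float, n: float, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041      # assumed statistical parameters
    m, n = 131, 1.0e9          # hypothetical query and database lengths
    bits = bit_score(97, lam, k)
    print(round(bits, 1))
    print(expect_from_raw(97, m, n, lam, k), expect_from_bits(bits, m, n))
```

With these assumed parameters a raw score of 97 converts to roughly 42 bits, in line with the "Score = 42.0 bits (97)" line in the BLAST output later in the deck. Both formulas return the same E because the bit score absorbs K and λ.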

Page 28: Scoring Systems

• Position Independent Matrices
  • Nucleic Acids – identity matrix
  • Proteins
    • PAM Matrices (Percent Accepted Mutation)
      • Implicit model of evolution
      • Higher PAM numbers all calculated from PAM1
      • PAM250 widely used
    • BLOSUM Matrices (BLOck SUbstitution Matrices)
      • Empirically determined from alignment of conserved blocks
      • Each includes information up to a certain level of identity
      • BLOSUM62 widely used

• Position Specific Score Matrices (PSSMs)
  • PSI and RPS BLAST

Page 29: BLOSUM62

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

Common amino acids have low weights

Rare amino acids have high weights

Negative for less likely substitutions

Positive for more likely substitutions

Page 30: Position Specific Substitution Rates

Active site serine

Typical serine

Page 31: Position Specific Score Matrix (PSSM)

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Serine scored differently in these two positions

Active site nucleophile
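Looking up a score in a PSSM needs both the position and the residue, unlike a position independent matrix. The sketch below copies only rows 211 and 216 from the matrix and compares the serine score at each. Treating 216 as the active site serine is an assumption based on its place in the G-D-S-G-G-P run at positions 214 to 219.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Two rows copied from the PSSM: position -> scores in COLUMNS order.
PSSM = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position: int, residue: str) -> int:
    """Score for placing `residue` at `position` of the profile."""
    return PSSM[position][COLUMNS.index(residue)]


if __name__ == "__main__":
    for pos in (211, 216):
        print(pos, "S:", pssm_score(pos, "S"), "A:", pssm_score(pos, "A"))
```

At 211 an alanine scores as well as a serine, while at 216 serine is the only residue with a positive score, which is the kind of position-specific constraint a single substitution matrix cannot express.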

Page 32: Gapped Alignments

• Gapping provides more biologically realistic alignments
• Gapped BLAST parameters must be simulated
• Affine gap costs = -(a+bk)
  a = gap open penalty
  b = gap extend penalty
  k = gap length
  A gap of length 1 receives the score -(a+b)

Page 33: Scores

            V    D    S    –    C    Y
            V    E    T    L    C    F   Total
BLOSUM62   +4   +2   +1  -12   +9   +3       7
PAM30      +7   +2    0  -10  +10   +2      11

Page 34: WWW BLAST

Page 35: The BLAST homepage

Specialized Databases

Standard databases

Page 36: BLAST Databases: Non-redundant protein

nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– Outside Protein: PIR, Swiss-Prot, PRF
– PDB (sequences from structures)

pat: protein patents

env_nr: environmental samples

Page 37: Nucleotide Databases: Genomic

Human and mouse genomes and reference transcripts now available

Page 38: Nucleotide Databases: Standard

Page 39: Nucleotide Databases: Traditional

nr (nt)
– Traditional GenBank
– NM_ and XM_ RefSeqs

refseq_rna

refseq_genomic
– NC_ RefSeqs

dbest – EST Division

est_human, mouse, others

htgs – HTG division

gss – GSS division

wgs – whole genome shotgun

env_nt – environmental samples

Page 40: BLAST and Molecular Evolution

[Figure: evolutionary tree. Organisms: Bacteria, Yeast, Worm, Fly, Human. Divergence times: 3000 Myr, 1000 Myr, 540 Myr. Human disease labels: Alzheimer’s Disease, Ataxia telangiectasia, Colon cancer, Pancreatic carcinoma. Proteins: MLH1, MutL]

Page 41: Protein BLAST Page

>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS

Protein database

Page 42: Advanced Options: Entrez limit

all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]

Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]

Page 43: Advanced Options: Filters

Hides low complexity for initial word hits only

Masks regions of query in lower case (pre-masked)

Masks Human or Mouse Interspersed repeats. Default for genome searches.

Protein

Nucleotide

Masks Low Complexity Sequence with X or n

Page 44: Advanced Options: Composition based stats

Histone H1

Amino acid composition:

Ala (A)   42   19.6%
Arg (R)    4    1.9%
Asn (N)    4    1.9%
Asp (D)    1    0.5%
Cys (C)    0    0.0%
Gln (Q)    2    0.9%
Glu (E)    6    2.8%
Gly (G)   13    6.1%
His (H)    0    0.0%
Ile (I)    3    1.4%
Leu (L)   10    4.7%
Lys (K)   57   26.6%
Met (M)    0    0.0%
Phe (F)    1    0.5%
Pro (P)   19    8.9%
Ser (S)   23   10.7%
Thr (T)   14    6.5%
Trp (W)    0    0.0%
Tyr (Y)    1    0.5%
Val (V)   14    6.5%

Negatively charged residues (Asp + Glu): 7
Positively charged residues (Arg + Lys): 61

Page 45: BLAST Formatting Page

Conserved Domain

Page 46: BLAST Output: Graphical Overview

mouse over

Sort by taxonomy

Page 47: BLAST Output: Descriptions

Link to entrez

Sorted by e values

3 × 10^-12

Default e value cutoff 10

Gene Linkout

Page 48: TaxBLAST: Taxonomy Reports

Page 49: BLAST Output: Alignments

>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615

Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)

Query  9    LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL  58
            L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280  LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL  338

Identical match

positive score (conservative)

negative substitution

gap
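The Identities and Gaps counts in the header can be recomputed from the two aligned strings. The sketch below is illustrative: the function name is invented, and the percentages use integer division because that reproduces the figures shown. Counting Positives would also need the BLOSUM62 scores and is left out.

```python
QUERY = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
SBJCT = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"


def alignment_stats(query: str, subject: str) -> tuple[int, int, int]:
    """Return (identities, gaps, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)


if __name__ == "__main__":
    ident, gaps, length = alignment_stats(QUERY, SBJCT)
    print(f"Identities = {ident}/{length} ({ident * 100 // length}%)")
    print(f"Gaps = {gaps}/{length} ({gaps * 100 // length}%)")
```

The alignment length of 59 counts gapped columns, so the percentages are relative to the alignment rather than to either sequence.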

Page 50: Low Complexity Filter

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length=756

Score = 231 bits (589), Expect = 1e-62
Identities = 131/131 (100%), Positives = 131/131 (100%), Gaps = 0/131 (0%)

Query  1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct  276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  335

Query  61   GSNSSRMYFTQTLLPGLAGPSGEMVKsttsltssstsgssDKVYAHQMVRTDSREQKLDA  120
            GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA
Sbjct  336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA  395

Query  121  FLQPLSKPLSS  131
            FLQPLSKPLSS
Sbjct  396  FLQPLSKPLSS  406

low complexity sequence filtered (shown in lower case in the query)

Page 51: Nucleotide: Human Repeats

Human Albumin Genomic Region

Page 52: Nucleotide: Human Repeat Filter

Alb mRNAs

Page 53: Nucleotide BLAST: New Output

Crab-eating macaque CDC20 mRNA

Default human database

New output display

Page 54: Sortable Results

Pseudogene on Chromosome 9

Functional Gene on Chromosome 1

Separate Sections for Transcript and Genome

Page 55: Total Score: All Segments

Functional Gene Now First

Page 56: Sorting in Exon Order

Default Sorting Order: Score
Longest exon usually first

Query start position
Exon order

Page 57: Links to Map Viewer

Chromosome 1

Chromosome 9

Page 58: Service Addresses

• General Help [email protected]
• BLAST [email protected]

Telephone support: 301-496-2475

Page 59: Back to Multiple Sequence Alignment

Page 60: Multiple Sequence Alignment

An extension of the pair-wise alignment…
– We will learn by example
– We will use Jalview to learn it

Page 61: Jalview

Viewing
– Reads and writes alignments
– Saves alignments and associated trees

Editing
– Insert/delete gaps
– Insert/delete gaps in groups of sequences
– Removal of gapped columns

Analysis
– Align sequences using Web Services
– Amino acid conservation analysis
– Alignment sorting options (by name, tree order, percent identity, group)
– UPGMA and NJ trees calculated and drawn
– Sequence clustering using principal component analysis
– Removal of redundant sequences
– Smith-Waterman pairwise alignment of selected sequences

Page 62: Acknowledgement

Dr. Peter Cooper at NCBI for permission to use the BLAST PowerPoint presentation

Dr. Kurt Wollenberg for slides on Dynamic Programming

