Date post: | 31-Dec-2015 |
Category: | Documents |
Upload: | andrea-johnson |
NCBI Field Guide
Genome Resources and Sequence Similarity
MapViewer
LocusLink
UniGene
Homologene
Basic Local Alignment Search Tool
Gene database
Topics
• Why use sequence similarity?
• BLAST algorithm
– blastn, blastp, megablast
• BLAST statistics
• BLAST output
• Examples
Why Do We Need Sequence Similarity Searching?
• To identify and annotate sequences
• To evaluate evolutionary relationships
• Other:
– model genomic structure (e.g., Spidey)
– check primer specificity in silico
: NCBI’s tool
Global vs Local Alignment
[Figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment]
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)

Global
Seq1:  1 W--HEREISWALTERNOW 16
         W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21

Local
Seq1:  1 W--HERE  5      Seq1:  1 W--HERE  5
         W  HERE                  W  HERE
Seq2:  3 WASHERE  9      Seq2: 15 WISHERE 21
Global programming algorithm
Global Dynamic Programming
• Full sequence must be aligned
• Gaps at ends are penalized as much as internal ones
• F(n,m) is the best score for alignment
• Traceback can give >1 correct alignment
• Used to examine closely related sequences
• http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
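The global dynamic programming fill can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `global_score` and the scores (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example, and only the score F(n, m) is returned, without traceback.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return F(n, m), the best global alignment score of a and b."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # End gaps are penalized like internal ones, so the borders grow by `gap`.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    return F[n][m]


print(global_score("GATTACA", "GATTACA"))  # 7
print(global_score("AC", "AGC"))           # 1
```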
Local Alignment – Smith-Waterman
Local alignments - How
• Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one)
• The alignment can end anywhere in the matrix
• Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0
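A minimal sketch of the local (Smith-Waterman) fill described above, under assumed scores (match +1, mismatch -1, linear gap -1). The function name `local_score` is hypothetical. It returns only the highest value of F(i, j), which is where a traceback would start.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the highest cell value of the local alignment matrix."""
    n, m = len(a), len(b)
    # Top row and left column stay 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                     # start a new alignment
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            best = max(best, F[i][j])
    return best


# The two sequences from the Global vs Local Alignment slide.
print(local_score("WHEREISWALTERNOW", "HEWASHEREBUTNOWISHERE"))  # 4 ("HERE")
```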
Heuristic alignment algorithms
• Shortcuts are important
– Searching a sequence of length 1000 against a database with 10^8 residues requires approximately 10^11 matrix cells. At ten million matrix cells a second, it would take about 3 hours.
• BLAST – the heuristic is based on the observation that true-match alignments are very likely to contain somewhere within them a short stretch of identities. Look for short stretches to serve as seeds to extend.
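The estimate above can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
query_length = 1000            # residues in the query
database_residues = 10**8      # residues in the database
cells = query_length * database_residues   # one matrix cell per residue pair
cells_per_second = 10**7       # "ten million matrix cells a second"

hours = cells / cells_per_second / 3600
print(cells)            # 100000000000, i.e. 10^11
print(round(hours, 1))  # 2.8, i.e. about 3 hours
```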
Seeding
• BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide)
• It scans through a database looking for a word from the query set with some minimum score T. When it finds one, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.
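Breaking a query into fixed-length words can be sketched as below. The helper name `query_words` is hypothetical; the two query strings are the ones used on the Nucleotide Words and Protein Words slides.

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size, in query order."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


nucleotide_words = query_words("GTACTGGACATGGACCCTACAGGAA", 11)
protein_words = query_words("GTQITVEDLFYNIATRRKALKN", 3)

print(nucleotide_words[:3])  # ['GTACTGGACAT', 'TACTGGACATG', 'ACTGGACATGG']
print(protein_words[:4])     # ['GTQ', 'TQI', 'QIT', 'ITV']
```

A real implementation would store these words in a lookup table keyed by word, so that each database position can be checked in constant time.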
Extension
• The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming.
• These locally optimal alignments are called high-scoring segment pairs, or HSPs
• Sometimes only a portion of a sequence is returned – this is the reason you need to look carefully at your BLAST alignments
Alignment example
• The quick brown fox jumps over the lazy dog.
• The quiet brown cat purrs when she sees him.
• Matches = +1; Mismatches = -1; ignore spaces and do not allow gaps.
• Assume the seed is the capital T; extend the alignment
• You’ll hit a mismatch (c/e): should you continue, and how far?
• Generate a variable X to measure how far the score drops off.
• Set X = 5 and try the alignment…
• Set X = 2 and try again …
• A small X value will increase the speed, because extensions stop sooner; however, speed is often modulated by word size and other parameters…
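The exercise above can be tried in code. This is a sketch under one assumed stopping convention: extension ends as soon as the running score has fallen X or more below the best score seen so far. The function name `extend_right` is hypothetical.

```python
def extend_right(a, b, x, match=1, mismatch=-1):
    """Ungapped extension from the first character with an X drop-off."""
    score = best = best_len = 0
    for i, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, i      # new maximum: remember where
        elif best - score >= x:            # dropped off by X: stop extending
            break
    return best, a[:best_len]


s1 = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
s2 = "The quiet brown cat purrs when she sees him.".replace(" ", "")

print(extend_right(s1, s2, 5))  # (9, 'Thequickbrown')
print(extend_right(s1, s2, 2))  # (6, 'Thequi')
```

With X = 5 the extension survives the c/e and k/t mismatches and picks up "brown"; with X = 2 it gives up at the second mismatch and reports only the first six letters.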
Gapped BLAST – a time saver
• Extension is costly; gapped BLAST now uses a two-hit method, which requires two hits within a distance (A)
• A gapped extension takes much longer to execute than an ungapped one, but overall fewer extensions are run – a time saver
• Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of the second hit
• T is adjustable: the higher the T, the smaller the search space
Evaluation
• Once seeds are extended to generate alignments, these alignments are tested for statistical significance.
• Can establish thresholds for reporting
The Flavors of BLAST
• Standard BLAST
– traditional “contiguous” word hit
– position independent scoring
– nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
• Megablast
– optimized for large batch searches
– can use discontiguous words
• PSI-BLAST
– constructs PSSMs automatically; uses as query
– very sensitive protein search
• RPS BLAST
– searches a database of PSSMs
– tool for conserved domain searches
BLASTN variations
• BLASTN seeds are always identical words; T is never used
• To make BLASTN faster, increase word size; to make it more sensitive, decrease word size
• MegaBLAST increases word size to 28
• The minimum word size is 7
• http://monod.uwaterloo.ca/papers/02ph.pdf
BLASTP implementation
• To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default)
• Affine gaps
– Decreased penalty for gap extension relative to gap introduction
Also, FASTA
• Similar to Gapped BLAST – except bigger neighborhood
• Generates a lookup table to locate all identically matching words of length ktup (protein 1-2, DNA 4-6)
• Once identified, looks for diagonals with many mutually supporting word matches
• Extensions similar to BLAST
Scoring Matrices
• Scoring matrix specifies a score, sij, for aligning residue i of sequence I with residue j of sequence II.
• Choice of matrix depends on the divergence level of desired/expected hits.
• Examples: PAM, BLOSUM
• Both can be modified for different divergence levels (e.g., BLOSUM40, BLOSUM62)
• Advice: try several matrices when possible.
Dayhoff Family of Matrices
• Dayhoff model measures sequence evolution in units of “PAMs”
– One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed.
• Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change)
– Mutability of alanine was defined to be 100.
Dayhoff Family of Matrices
Problems with the original Dayhoff scheme
• It does not consider the genetic code.
– Not all amino acid substitutions can occur by a single nucleotide substitution event.
• Parameters were estimated from a small sample of closely related proteins.
• Evolution at the “average site” of the “average protein” is being modeled.
BLOSUM Scoring Matrices
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)
Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on Smith Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) query and database.
– DNA vs DNA
– DNA translation vs Protein
– Protein vs Protein
– Protein vs DNA translation
– DNA translation vs DNA translation
• www, standalone, and network clients
How BLAST Works
• Make lookup table of “words” for query
• Scan database for hits
• Ungapped extensions of hits (initial HSPs): X dropoff (X1)
• Gapped extensions (no traceback): X dropoff (X2)
• Gapped extensions (traceback; alignment details): X dropoff (X3)
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Make a lookup table of words (11-mer):
GTACTGGACAT
 TACTGGACATG
  ACTGGACATGG
   CTGGACATGGA
    TGGACATGGAC
     GGACATGGACC
      GACATGGACCC
       ACATGGACCCT
. . .

WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words:
GTQ
 TQI
  QIT
   ITV
    TVE
     VED
      EDL
       DLF
...
Neighborhood Words: LTV, MTV, ISV, LSV, etc.
Word size = 3 (default)
Word size can only be 2 or 3
[ -f 11 = blastp default ]
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
• Protein BLAST requires two neighboring matches within 40 aa

one exact match:
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT

two matches (neighborhood words):
GTQITVEDLFYNI
 SEI     YYN
[ -A 40 = blastp default ]
BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
example query words: YLS, HFL

Neighborhood words (neighborhood score threshold T (-f) = 11)
YLS 15, YLT 12, YVS 12, YIT 10, etc …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc …

High-scoring pair (HSP)
Query 1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
           +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287  LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333

Gapped extension with trace back: Final HSP
Query 1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
           +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287  LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
BI
Fie
ldG
uid
eScoring Systems - Nucleotides
A G C T
A +1 –3 –3 -3
G –3 +1 –3 -3
C –3 –3 +1 -3
T –3 –3 –3 +1
Identity matrix
CAGGTAGCAAGCTTGCATGTCA
|| |||||||||||| ||||| raw score = 19-9 = 10
CACGTAGCAAGCTTG-GTGTCA
[ -r 1 -q -3 ]
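The raw score on this slide can be recomputed directly. This sketch follows the slide's own arithmetic (19 - 9), which charges the single gap column the same -3 as a mismatch; that is an assumption of the example, not how gapped BLAST costs gaps in general. The function name is hypothetical.

```python
def raw_score(top, bottom, reward=1, penalty=-3):
    """Score an aligned pair column by column: reward for identity, penalty otherwise."""
    matches = sum(1 for a, b in zip(top, bottom) if a == b)
    others = len(top) - matches  # mismatches and gap columns
    return matches * reward + others * penalty


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))  # 10
```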
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly conserved blocks
• Each matrix derived separately from blocks with a defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X

Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions (e.g., D/F)
Positive for more likely substitutions (e.g., Y/F)
Position-Specific Score Matrix
[Figure: DAF-1, serine/threonine protein kinases catalytic loop, with PSSM scores]
Position-Specific Score Matrix
         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K   -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E    0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S    0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N   -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K   -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P   -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A    3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M   -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A    4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H   -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R   -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D   -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I   -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K    0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S    0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K    0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N   -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I   -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M   -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V   -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K   -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N    1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D   -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L   -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3
catalytic loop
[ >./blastpgp -i NP_499868.2 -d nr -j 3 -Q NP_499868.pssm ]
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution
[Figure: number of alignments plotted against score (S), marking your score and the expected number of random hits]
(applies to ungapped alignments)
Expect Value
E = number of database hits you expect to find by chance, ≥ S
$E = K m n\, e^{-\lambda S}$ or $E = m n\, 2^{-S'}$
K = scale for search space
λ = scale for scoring system
S' = bitscore = $(\lambda S - \ln K)/\ln 2$
More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
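The relation between raw score, bit score and E value can be sketched numerically. The symbols follow the Karlin-Altschul form E = K·m·n·e^(-λS) with bit score S' = (λS - ln K)/ln 2. The values λ = 0.267 and K = 0.041 are assumptions (figures commonly quoted for gapped BLOSUM62 searches), and the search-space sizes m and n are purely illustrative.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw score S into a bit score S'."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041   # assumed parameters, see lead-in
m, n = 59, 615             # illustrative lengths only

bits = bit_score(103, LAMBDA, K)
print(round(bits, 1))  # 44.3, matching "Score = 44.3 bits (103)" in the alignment slide
# Both forms of the E value formula give the same number.
print(math.isclose(expect_from_raw(103, m, n, LAMBDA, K),
                   expect_from_bits(bits, m, n)))  # True
```

The E values reported by BLAST use an effective search space rather than the plain product m·n, so this sketch is not expected to reproduce a reported Expect value.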
Gapped Alignments
Gapping provides more biologically realistic alignments
Gapped BLAST parameters are simulated for each scoring matrix
Affine gap costs = -(a+bk)
a = gap open penalty
b = gap extend penalty
A gap of length 1 receives the score -(a+b)
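The affine gap cost above as a one-line function. The example values a = 11 and b = 1 are an assumption used only for illustration.

```python
def affine_gap_score(k, a, b):
    """Score of a gap of length k: -(a + b*k), a = open penalty, b = extend penalty."""
    return -(a + b * k)


print(affine_gap_score(1, a=11, b=1))  # -12, i.e. -(a+b)
print(affine_gap_score(3, a=11, b=1))  # -14
```

Each additional gap position costs only b, so one long gap is cheaper than several short ones of the same total length.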
An Alignment BLAST Cannot Make
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.
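The stated reason can be checked by measuring the longest run of identical columns in this ungapped alignment. The helper name `longest_identical_run` is hypothetical; the two sequences are the ones shown on the slide, joined across the three rows.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical columns."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


query = ("GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
         "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
         "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC")
subject = ("GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
           "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
           "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC")

print(longest_identical_run(query, subject))  # 6, below the minimum word size of 7
```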
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Other BLAST Algorithms
• Megablast
• Discontiguous Megablast
• PSI-BLAST
Megablast: NCBI’s Genome Annotator
• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive
Discontiguous Megablast
• Uses discontiguous word matches
• Better for cross-species comparisons
Discontiguous (Cross-species) MegaBLAST
Discontiguous Word Options
Templates for Discontiguous Words
W = word size; # matches in template
t = template length
W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
BI
Fie
ldG
uid
eBLAST Databases: Nucleic Acid
nr (nt)• traditional GenBank
divisions• NM_ and XM_ RefSeqs
dbest • EST division
htgs • HTG division
gss • GSS division
chromosome • NC_ RefSeqs
env_nr•environmental sample[filter]•e.g., 16S rRNA
BLAST Databases: Protein
nr (non-redundant protein sequences)
• GenBank CDS translations
• NP_ RefSeqs
• Outside databases: PIR, Swiss-Prot, PRF
• PDB (sequences from structures)
env_nr (environmental sample[filter])
Web BLAST: BLASTP
1. Paste in the query sequence
>Mutated in Colon Cancer
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEVQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSDKVYAHQMVRTDSREQKLDAFLQPLSKPLSS
2. Select the appropriate db
3. BLAST
Format Options
BLAST Formatting Page
102347584-927-19372.BLASTQ3
RPS-BLAST (CD search) Results Summary
partial sequence
partial domain
RPS-BLAST Results (CDD)
DNA_mis_repair
complete sequence
BLAST Output: Graphic Overview
Sort results by taxonomy
same database sequence
BLAST Output: Descriptions
sorted by e values
8 x 10^-58
Bacterial mismatch repair proteins
Linkouts
E value cutoff
GEO
UniGene
Structure
BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 44.3 bits (103), Expect = 5e-05
Identities = 25/59 (42%), Positives = 33/59 (55%), Gaps = 8/59 (13%)

Query:   9 LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILERVQQHIESKL 59
           L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L  +QQ +E+ L
Sbjct: 280 LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL 338

positive (conservative) substitution
negative substitution
gap
BLAST Output: Alignments & Filter
low complexity sequence filtered
Advanced Options
Limit to Organism
protein all[filter] A
Example Entrez Queries
proteins all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]
Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-e 10000 expect value
-v 2000 descriptions
-b 2000 alignments
Filter options
-e 10000 -v 2000
PSI-BLAST
Position-specific Iterated BLAST
Example: Confirming relationships of purine nucleotide metabolism proteins
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFVIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDEQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYRTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAVRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
0.005 E value cutoff for PSSM
RESULTS: Initial BLASTP
Same results as protein-protein BLAST; different format
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Tenth PSSM Search: Convergence
Just below threshold, another nucleotide metabolism enzyme
Check to add to PSSM
Reverse PSI-BLAST (RPS)-BLAST
Adenosine/AMP Deaminase Domain
AMP Deaminases
PHI-BLAST
>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASELIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHVIKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDILKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEIASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
[GA]xxxxGK[ST]
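The pattern on this slide can be located in the sequence with an ordinary regular expression. This sketch only illustrates the pattern match itself, not what PHI-BLAST does afterwards; translating each `x` to `.` is an assumption about the pattern syntax, and the fragment is a short excerpt of the CED4 sequence above.

```python
import re


def pattern_to_regex(pattern):
    """Translate the slide's pattern notation: 'x' means any residue."""
    return re.compile(pattern.replace("x", "."))


fragment = "DSFFLFLHGRAGSGKSVIASQALSKS"  # excerpt of CED4_CAEEL
regex = pattern_to_regex("[GA]xxxxGK[ST]")
hit = regex.search(fragment)

print(regex.pattern)             # [GA]....GK[ST]
print(hit.group(), hit.start())  # GRAGSGKS 8
```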
Genome BLAST
What is an HMM?
• Hidden Markov Model
• Important to know: it's a generalization of the profile in terms of statistical weights, rather than scores.
• At each position, the profile HMM gives the probability of finding a particular amino acid, an insertion, or a deletion
• HMMs are very popular in molecular data analysis but are not specific to this field
A Characterization Example
How could we characterize this (hypothetical) family of nucleotide sequences?
– Keep the Multiple Alignment
– Try a regular expression: [AT] [CG] [AC] [ACTG]* A [TG] [GC]
• But what about?
– T G C T - - A G G vs.
– A C A C - - A T C
– Try a consensus sequence: A C A - - - A T C
• Depends on distance measure
Example borrowed from Salzberg, 1998
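The weakness of the regular expression is easy to demonstrate: it accepts both of the sequences in question and gives no way to tell a typical member from an unlikely one. A short check, with a hypothetical helper name:

```python
import re

FAMILY = re.compile(r"[AT][CG][AC][ACTG]*A[TG][GC]")


def is_member(aligned):
    """Strip spaces and gap characters, then test against the family pattern."""
    sequence = aligned.replace(" ", "").replace("-", "")
    return FAMILY.fullmatch(sequence) is not None


print(is_member("T G C T - - A G G"))  # True
print(is_member("A C A C - - A T C"))  # True
```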
HMMs to the rescue!
Transition probabilities
Emission Probabilities
Insert (Loop) States
Scoring our simple HMM
• #1 - “T G C T - - A G G” vs. #2 - “A C A C - - A T C”
– Regular Expression ([AT] [CG] [AC] [ACTG]* A [TG] [GC]):
• #1 = Member; #2 = Member
– HMM:
• #1 = Score of 0.0023%; #2 = Score of 4.7% (Probability)
• #1 = Score of -0.97; #2 = Score of 6.7 (Log odds)
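The log-odds figures can be related to the probabilities with a short calculation. The sketch assumes a null model in which each of the four bases has probability 0.25 and natural logarithms, so the log odds of a sequence of length L is ln(P / 0.25^L). That null model is an assumption, but it reproduces the slide's numbers to within rounding.

```python
import math


def log_odds(probability, length, background=0.25):
    """Natural-log odds of the model probability against a uniform null model."""
    return math.log(probability / background ** length)


# Both sequences have 7 bases once the gap characters are removed.
print(round(log_odds(0.000023, 7), 2))  # -0.98 (slide: -0.97)
print(round(log_odds(0.047, 7), 2))     # 6.65  (slide: 6.7)
```

The small differences come from the probabilities on the slide being rounded to two significant figures.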
Standard Profile HMM Architecture
• Three types of states:
– Match
– Insert
– Delete
• One delete and one match per position in model
• One insert per transition in model
• Start and end “dummy” states
Example borrowed from Cline, 1999
Aligning and Training HMMs
• Training from a Multiple Alignment
• Aligning a sequence to a model
– Can be used to create an alignment
– Can be used to score a sequence
– Can be used to interpret a sequence
• Training from unaligned sequences (not included in current HMMer package)
Training from an existing alignment
• This process is what we’ve been seeing up to this point.
– Start with a predetermined number of states in your HMM.
– For each position in the model, assign a column in the multiple alignment that is relatively conserved.
– Emission probabilities are set according to amino acid counts in columns.
– Transition probabilities are set according to how many sequences make use of a given delete or insert state.
Remember the simple example
• Chose six positions in model.
• Highlighted area was selected to be modeled by an insert due to variability.
Aligning sequences to a model
• Now that we have a profile model, let’s use it!
• Try every possible path through the model that would produce the target sequence – Keep the best one and its probability.
Profile HMMs
A T C T C - C G A
A G C T - - T G G
T G T T C T C T A
A A C T C - C G A
A G C T C - C G A

Probability
A 0.8 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.8
T 0.2 0.2 0.2 1.0 0.0 0.2 0.2 0.2 0.0
G 0.0 0.6 0.0 0.0 0.0 0.0 0.0 0.8 0.2
C 0.0 0.0 0.8 0.0 0.8 0.0 0.8 0.0 0.0
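The probability table can be derived from the five aligned sequences by counting each base per column and dividing by the number of sequences (gap characters count toward the total but get no row). A minimal sketch with a hypothetical function name:

```python
ALIGNMENT = [
    "ATCTC-CGA",
    "AGCT--TGG",
    "TGTTCTCTA",
    "AACTC-CGA",
    "AGCTC-CGA",
]


def column_frequencies(rows, alphabet="ATGC"):
    """Per-column base frequencies, relative to the number of sequences."""
    n = len(rows)
    return [{base: column.count(base) / n for base in alphabet}
            for column in zip(*rows)]


table = column_frequencies(ALIGNMENT)
print(table[0])  # {'A': 0.8, 'T': 0.2, 'G': 0.0, 'C': 0.0}
print(table[1])  # {'A': 0.2, 'T': 0.2, 'G': 0.6, 'C': 0.0}
```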
[Figure: profile HMM for the alignment. Emission probabilities per state: (A .8, C 0, G 0, T .2); (A .2, C 0, G .6, T .2); (A 0, C .8, G 0, T .2); (A 0, C 0, G 0, T 1); (A 0, C .8, G 0, T .2); (A 0, C 0, G .8, T .2); (A .8, C 0, G .2, T 0); (A 0, C .8, G 0, T .2). Transition probabilities: 1.0, 0.8, 0.2.]
Sequence scored: T T T T - T T T G
Score = 8.2 x 10^-6
Consensus score = 0.1
Scores generally calculated with base e logarithms
The HMM must first be “trained” using a database of known signals.
Consensus sequences for all signals are needed.
Compositional rules (i.e., emission probabilities) and length distributions are necessary for content sensors.
Transition probabilities between all connected states must be estimated.
Pseudocounts prevent the “regular expression” problem of non-matching or zero probability of a given amino acid…
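One simple way to apply pseudocounts is add-one (Laplace) smoothing: add 1 to every count before normalizing. This is an assumed scheme chosen for illustration, and the nucleotide alphabet is used to keep the example short; the same idea applies to the 20 amino acids.

```python
def smoothed_frequencies(column, alphabet="ACGT", pseudocount=1):
    """Emission probabilities for one alignment column with add-k smoothing."""
    residues = [c for c in column if c in alphabet]  # ignore gap characters
    total = len(residues) + pseudocount * len(alphabet)
    return {s: (residues.count(s) + pseudocount) / total for s in alphabet}


probs = smoothed_frequencies("AATAA")  # a column with four A's and one T
print(probs["A"], probs["T"], probs["C"])
# A = 5/9, T = 2/9, C = 1/9: no residue has zero probability any more
```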
Gene Finding Software
• GENSCAN (HMM)
• HMMGENE (HMM)
• GENMARK (HMM)
• GRAIL (Neural Net)
HMM resources
• UC Santa Cruz (David Haussler group)
– SAM-02 server. Returns alignments, secondary structure predictions, HMM parameters, etc.
– SAM HMM building program (requires free academic license)
• Washington U. St. Louis (Sean Eddy group)
– Pfam. Large database of precomputed HMM-based alignments of proteins
– HMMer, program for building HMMs
• Gene finders and other HMMs (more later)
http://www.cse.ucsc.edu/research/compbio/HMM-apps/HMM-applications.html
http://hmmer.janelia.org/
http://pfam.janelia.org/