Date post: | 19-Dec-2015 |
Category: |
Documents |
View: | 223 times |
Download: | 0 times |
Sequence based searching
Lesson 7Based on a presentation by Irit Gat-ViksBased on presentation by Amir Mitchel,Introduction to bioinformatics course,
Bioinformatics unit, Tel Aviv University.
Reminder – Importance of Homology
Use a sequence as a search query in order to find homologous sequences in a database.
Homology – similarity between sequences that results from a common ancestor.
Basic Assumption: Sequence homology → similar structure/function
Why ?• Characterizing an ORF.• Finding duplicate genes in the same organism
(known function, variants)• Finding homologues genes in other organisms (phylogeny, known
function)
Study a sequence through homologs
Identity Similarity HomologyIdentity Similarity HomologyQuery= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 (TfR2).
>gi|20140567|sp|Q07891|TFR1_CRIGR Transferrin receptor protein 1 (TfR1) (TR) (TfR) (Trfr)
Length = 757 Score = 540 bits (1392), Expect = e-152 Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136 + ++ F F++GY+ + R + C G+S ++ E++ RLYSbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123
Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196 W+DL+ + + L DTI+Q S R AGS L I KL VW D HSbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183
Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256 YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240
Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316 +DL+ I+FA+KV NAQ F A GVLIY + F P + ++Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291
Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376 +++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++ Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351
• For Proteins, finding distant relatives is a difficult task.For Proteins, finding distant relatives is a difficult task.• Distant protein family members, may share <20% amino acid identity(!).Distant protein family members, may share <20% amino acid identity(!).
>gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa] Length = 509
Score = 405 bits (1043), Expect = e-111 Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)
Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER 120 NL+ G + H + ++YGP+ + G+V + PE + K++ Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN 99 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK 347 D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N +Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 330
Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406 L+ EV L+ +P LKASIKE+LRLH P+ + + R D + Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390
Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464 Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C GSbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Query typesDNA vs. Protein• The sequence query can be a nucleotide sequence or an amino acid
sequence.• The search is preformed against a nucleotide or amino acid database
Which search is preferable?1. The genetic code is redundant. Some amino acids are coded by more
than one codon. Therefore, the DNA sequence can change while the amino acid sequence will remain the same.
2. Nucleotides: a four letter alphabet. Amino acids: a twenty letter alphabet. Two random DNA sequences will share on average 25% of identity. Two random protein sequences will share on average 5% of identity.
3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between two amino acids (PAM/Blosum).
4. DNA databases are much larger, meaning more random hits.
Using the amino acid sequence is
preferable for homology search.
1. Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons.
2. Evolutionary distant proteins will exhibit a high similarity rather than a high identity.
3. Hits can exhibit a long alignment (homology) or a short alignment (conserved domains).
Why use a nucleotide sequence after all?
Query type• The sequence query can be a nucleotide sequence or an amino acid sequence.
But … we can translate the query sequence!• The search is performed against a nucleotide or amino acid database.
But … we can use translated databases! (e.g., trEMBL)
All types of searches are possible.
• Nucleotide query can be translated and searched against protein databases:1. Translate all reading frames (3 + 3)2. Find long ORF.
• Amino acid query can be back-translated to and searched against nucleotide databases?
1. During translation we lose information. 2. A single amino acid sequence can be back-translated to many possible nucleotide
sequences .
Query: DNA Protein
Database: DNA Protein
Query types
1. amino acid query against protein database (blastp)– identifying a protein sequence – finding similar sequences in protein databases.
2. nucleotide query against nucleotide database (blastn)– In non-coding regions (no ORF found)- Identify the query sequence or find similar
sequences.– Find primer binding sites or map short contiguous motifs
3. compares translated nucleotide query against protein database. (blastx) – Useful when the query include a coding region, and we try to find homologous
proteins. – Used extensively in analyzing EST sequences. This search is more sensitive than
nucleotide blast since the comparison is performed at the protein level.4. protein query against translated nucleotide database (tblastn)
– useful for finding protein homologs in unnannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)).
5. translated nucleotide query against translated nucleotide database. (tblastz)– Useful for identifying novel genes in error prone query sequences. – Used for identifying potential proteins encoded by single pass read ESTs.
* six-frame in all translations!!!
CG © Ron Shamir, 06 9
Why Heuristics ?• Motivation:
– Dynamic programming guarantees an optimal solution & is efficient, but
– Not fast enough when searching a database of size ~109, with a query of length 200-500bp
• Solutions:– Implement on hardware. (COMPUGEN)– Parallel hardware. (MASSPAR) – Use faster heuristic algorithms.
• Common Heuristics: FASTA, BLAST
CG © Ron Shamir, 06 10
Key observations
• Even O(m+n) time would be problematic when db size is huge
• Numerous queries are run on the same db• Preprocessing of the db is desirable• Substitutions are much more likely than indels• Homologous sequences contain many matches
CG © Ron Shamir, 06 11Alignment Dot-Plot Matrix
a a g t c c c g t g
a * *
g * * *
g * *
t * *
c * * *
c * * *
g * * *
t * *
t * *
c * * *
BLAST
• Preprocess – Low complexity regions are removed– A dictionary for K-tuple words is prepared for the query
sequence and the database. Protein 3 letter words, DNA 4-6 or even 11 letter words.
• Searches for K-tuple words and find database records with common words. Words can be similar, not only identical. – Identity - CAT : CAT– Similarity – CAT : CAT, CAR, HAT …– But even CAT : ZTX can be similar– For each three letter word there are at most 203 similar
words.– Similar words are only the ones that have a minimum
cut-off score (T).
BLAST Stage I
• Find matching word pairs
• Extend word pairs as much as possible,i.e., as long as the total weight increases
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
BLAST Stage II
• Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
Versions of the programVersions of the program[t]BLAST[x/n/p]
t : Translate a DNA database in all 6 reading frames for comparison with a Protein query.
x : Translate a nucleotide query in all 6 reading frames for comparison with a Protein database.
p : Comparison is against a Protein database.
n : Comparison is against a Nucleotide database.
BLAST
Masking low complexity
• There is one frequent case where the random models and therefore the statistics discussed here break down: regions with highly biased amino acid composition ("low complexity" regions). • Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due to segment composition. • Usually generated by slippage and thus not interesting. The BLAST programs employ the SEG (protein) DUST(DNA) algorithm to filter low complexity regions from proteins before executing a database search.•Masking is practiced on the query sequence only, not on the database sequences!
BLAST
http://www.ncbi.nlm.nih.gov/http://www.ncbi.nlm.nih.gov/
BLAST http://www.ncbi.nlm.nih.gov/BLAST/http://www.ncbi.nlm.nih.gov/BLAST/
BLAST
BLAST I can limit my search to a selected organism
I can limit my search to a selected organism
…or even construct my own searchable database by an
Entrez query
…or even construct my own searchable database by an
Entrez query
Filter Low Complexity regions by SEG or DUST
Mask for lookup table hit search stage, but NOT for the
hit extension stage.
Mask According to the case within the query sequence.
BLAST
Lineage Report
root. Bilateria [animals]. . Coelomata [animals]. . . Euteleostomi [vertebrates]. . . . Tetrapoda [vertebrates]. . . . . Eutheria [mammals]. . . . . . Homo sapiens (man) ------------ 571 18 hits [mammals] retinoic acid induced 3; retinoic acid responsive gene [Hom. . . . . . Mus musculus (mouse) .......... 432 15 hits [mammals] retinoic acid inducible protein 3 [Mus musculus]. . . . . . Rattus norvegicus (brown rat) . 411 5 hits [mammals] similar to retinoic acid inducible protein 3 [Rattus norveg. . . . . Xenopus laevis (clawed frog) ---- 216 1 hit [amphibians] MGC68729 protein [Xenopus laevis]. . . . Takifugu rubripes (torafugu) ------ 40 4 hits [bony fishes] pheromone receptor [Takifugu rubripes]. . . Drosophila melanogaster ------------- 48 4 hits [flies] CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815|B. . . Drosophila virilis .................. 39 1 hit [flies] Bride of sevenless protein precursor >gi|1079166|pir||A4755. . . Anopheles gambiae str. PEST ......... 38 1 hit [flies] ENSANGP00000013404 [Anopheles gambiae] >gi|21296536|gb|EAA0. . Caenorhabditis elegans ---------------- 41 2 hits [nematodes] calcium-sensing receptor, similar to human metabotropic glu. environmental sequence ------------------ 40 2 hits [unclassified] unknown [environmental sequence]
Lineage Report
root. Bilateria [animals]. . Coelomata [animals]. . . Euteleostomi [vertebrates]. . . . Tetrapoda [vertebrates]. . . . . Eutheria [mammals]. . . . . . Homo sapiens (man) ------------ 571 18 hits [mammals] retinoic acid induced 3; retinoic acid responsive gene [Hom. . . . . . Mus musculus (mouse) .......... 432 15 hits [mammals] retinoic acid inducible protein 3 [Mus musculus]. . . . . . Rattus norvegicus (brown rat) . 411 5 hits [mammals] similar to retinoic acid inducible protein 3 [Rattus norveg. . . . . Xenopus laevis (clawed frog) ---- 216 1 hit [amphibians] MGC68729 protein [Xenopus laevis]. . . . Takifugu rubripes (torafugu) ------ 40 4 hits [bony fishes] pheromone receptor [Takifugu rubripes]. . . Drosophila melanogaster ------------- 48 4 hits [flies] CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815|B. . . Drosophila virilis .................. 39 1 hit [flies] Bride of sevenless protein precursor >gi|1079166|pir||A4755. . . Anopheles gambiae str. PEST ......... 38 1 hit [flies] ENSANGP00000013404 [Anopheles gambiae] >gi|21296536|gb|EAA0. . Caenorhabditis elegans ---------------- 41 2 hits [nematodes] calcium-sensing receptor, similar to human metabotropic glu. environmental sequence ------------------ 40 2 hits [unclassified] unknown [environmental sequence]
BLAST
BLAST
BLAST
CG © Ron Shamir, 06 26
BLAST statistics• Theory of Karlin, Altschul, and Dembo on the
distribution of the MSP of score at random• Define parameters K, (depending on AA
distribution)• Pr (finding a pair of score >S in comparing two
random seqs of length m, n) = 1 – e-y where Y=Kmn e-s
• Extreme value dist (or Gumbell dist)• Allow the calculated choice of smallest C
CG © Ron Shamir, 06 27
Improved BLAST - Altschul et al. 97
Two Hit Version: • Need two non-overlapping w-long words with:
– score t, each– on same diagonal– within distance AA
• Allow local alignment with indels (merging local alignments from different diagonals)
• Extends less starting points!
The statistics of sequence similarity scores
• Bits score – A score for the alignments according to the number of similarities, identities, etc.
• Expected-score (E-value) (of an alignment having a score S):
The number of times one expects to find alignments with a score >= S of a random sequence Vs. a random database. (having the same lengths and compositions). The closer the e-value approaches zero, the greater the confidence that the match is real (from zero to one).
BLAST
What about:•Short sequences?•large sequences and queries?
Short sequences:
Parameter settings for standard blastp and "Search for short and nearly exact matches"
ProgramWord Size
SEG Filter
Expect Value
Score Matrix
Standard Blastp 3 On 10 BLOSUM62
Search for short and nearly exact matches
2 Off 20000 PAM30
Parameter settings for standard blastn and"Search for short and nearly exact matches"
ProgramWord Size
DUST Filter
Expect Value
Standard blastn 11 On 10
Search for short nearly exact matches
7 Off 1000
PAM vs. BLUSOM- reminder– Different BLOSUM matrices are derived from blocks with different identity
percentage. (e.g., blosum62 is derived from an alignment of sequences that share at least 62% identity.) Larger n smaller evolutionary distance.
– Single PAM was constructed from at least 85% identity dataset. Different PAM matrices were computationally derived from it. Larger n larger evolutionary distance
Observed % Difference
Evolutionary distance (PAM)
BLOSUM
1 1 9910 11 9020 23 8030 38 7040 56 6050 80 5060 120 4070 159 3080 250 20
62
120
250
How to filter results from large sequences and queries?
1. Some sequences contain large regions of ALU repeats. In this case you can select the "Human Repeat" filtering option on the main BLAST search page. This will mask repeat regions which generate a large number of biologically uninteresting hits to the databases.
2. Increase the Word Size to 20 - 25. With a default Word Size of 7, limiting the number small initial fragments to be extended to HSPs.
3. Decrease the Expect value to 1.0 or lower eliminates many hits and concentrate on results which are more likely to contain large coding regions and genomic fragments.
4. Processing multiple query sequences in one run can be much faster than processing them with separate runs because the database is scanned only 1 time for the entire set of queries.
• The most sensitive BLAST program, making it useful for finding very distantly related proteins.
• Use PSI-BLAST when your standard protein-protein BLAST search failed to find significant hits.
Algorithm:• The first round of PSI-BLAST is a standard protein-protein
BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from an alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (default=0.005).
• The PSSM will be used to evaluate the alignment in the next iteration of search. Any new database hits below the inclusion finding very distantly related proteins.
PSI-BLAST (Position-Specific Iterated (PSI)-BLAST )Sensitive protein-protein similarity searches.