Post on 07-Apr-2018
8/6/2019 Sequence Alignment and Searching
1/54
Sequence Alignment and Searching
By: Sarika goyal
What is the purpose of sequence alignment?
Identification of homology and homologous sites in related sequences
Inference of the evolutionary history that led to the differences in observed sequences
The Problem
Biological problem
Finding a way to compare and represent similarity or dissimilarity between biomolecular sequences (DNA, RNA or amino acid)
The Problem
Computational problem
Finding a way to perform inexact or approximate matching of subsequences within strings of characters
Sequence comparison and alignment is a central problem in computational biology: high sequence similarity usually implies structural or functional similarity
Substring and subsequence
Example:
xyz is a subsequence within axayaz, but NOT a substring
Characters in a substring must be contiguous; characters in a subsequence only need to appear in the same order
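The distinction can be checked mechanically. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and not part of the slides.

```python
def is_substring(pattern, text):
    """True if the characters of pattern occur contiguously in text."""
    return pattern in text


def is_subsequence(pattern, text):
    """True if the characters of pattern occur in text in order,
    not necessarily next to each other."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so each pattern character must be found after the previous one.
    return all(ch in remaining for ch in pattern)


print(is_subsequence("xyz", "axayaz"))  # True
print(is_substring("xyz", "axayaz"))    # False
```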
Types of comparisons and alignment methods
According to sequence coverage: LOCAL or GLOBAL
According to number of sequences: TWO SEQUENCES (pairwise alignment) or THREE OR MORE SEQUENCES (multiple alignment)

Pairwise, local: database search against query sequences; BLAST algorithm
Pairwise, global: comparison of two sequences; first step in multiple sequence alignment
Multiple, local: defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.
Multiple, global: determination of conserved residues and domains; introductory step in molecular phylogenetic analysis
Introduction to sequence alignment
Given two text strings:
First string = a b c d e
Second string = a c d e f
a reasonable alignment would be
a b c d e -
a - c d e f
We must choose criteria so that an algorithm can choose the best alignment.
For the sequences gctgaacg and ctataatc:
An uninformative alignment:
-------gctgaacg
ctataatc-------
An alignment without gaps
gctgaacg
ctataatc
An alignment with gaps
gctga-a--cg
--ct-ataatc
And another
gctg-aa-cg
-ctataatc-
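One possible criterion is a column-by-column score. The sketch below rates the four alignments above under an assumed scheme (match +1, mismatch -1, gap -1); these weights are illustrative only and do not come from the slides.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum a per-column score over two equal-length alignment rows.
    The default weights are hypothetical, chosen only for illustration."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


alignments = [
    ("-------gctgaacg", "ctataatc-------"),  # uninformative
    ("gctgaacg", "ctataatc"),                # without gaps
    ("gctga-a--cg", "--ct-ataatc"),          # with gaps
    ("gctg-aa-cg", "-ctataatc-"),            # another with gaps
]
scores = [score_alignment(a, b) for a, b in alignments]
print(scores)
```

Under these assumed weights the last alignment scores highest; a different choice of weights can change the ranking, which is why the scoring criteria matter.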
The dotplot (1)
A simple picture that gives an overview of the similarities between two sequences
Dotplot showing identities between sequences (DOROTHYHODGKIN) and (DOROTHYCROWFOOTHODGKIN):
Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and the RO of doROthy and cROwfoot, are noise.
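A dotplot of identities is simple to compute. The sketch below builds the grid for the two names and prints it as text; the helper names `dotplot` and `render` are illustrative, and the text rendering cannot reproduce the bold and non-bold distinction of the original figure.

```python
def dotplot(rows_seq, cols_seq):
    """grid[i][j] is True when rows_seq[i] equals cols_seq[j]."""
    return [[r == c for c in cols_seq] for r in rows_seq]


def render(rows_seq, cols_seq):
    """Draw the dotplot as text: the matching letter for a filled cell,
    '.' for an empty one."""
    grid = dotplot(rows_seq, cols_seq)
    lines = ["  " + cols_seq]
    for r, row in zip(rows_seq, grid):
        lines.append(r + " " + "".join(r if filled else "." for filled in row))
    return "\n".join(lines)


print(render("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"))
```

In the printed grid, DOROTHY and HODGKIN appear as two diagonal runs, with the second run shifted right by the length of CROWFOOT.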
The dotplot (2)
Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.
The dotplot (3)
Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
Dotplots and sequence alignments
Any path through the dotplot from upper left to lower right passes through a succession of cells, each of which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment, or indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, the more filled-in points on the diagonal segments of the path, the more matching residues in the alignment.
Corresponding alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Measures of sequence similarity
Two measures of distance between two character strings:
The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters.
The Levenshtein, or edit, distance between two strings of not necessarily equal length is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence.
agtc
cgta Hamming distance = 2
ag-tcc
cgctca Levenshtein distance = 3
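Both distances can be computed in a few lines. The sketch below uses the standard dynamic-programming recurrence for the Levenshtein distance; the function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimal number of single-character insertions, deletions and
    substitutions that turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        cur = [i]                        # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2
print(levenshtein("agtcc", "cgctca"))  # 3
```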
The Edit Distance between two strings
Definition: The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. For emphasis, note that matches are not counted.
Example: AATT and AATG
Distance = 1 (edit operation of substitution)
String alignment
An edit transcript is a way to represent a particular transformation of one string into another
Emphasizes point mutations in the model
An alignment displays a relationship between two strings
Global alignment means that, for each string, the entire string is involved in the alignment
Example:
A A G C A
A A _ C _
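An alignment and an edit transcript carry the same information. The sketch below reads a transcript off the alignment above, assuming a common letter convention that the slides do not spell out: M for a match, R for a replacement, D for a deletion from the first string and I for an insertion into it.

```python
def edit_transcript(row_a, row_b, gap="_"):
    """Derive an edit transcript (first string -> second string) from two
    aligned rows. The letters M/R/I/D are an assumed convention."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    ops = []
    for a, b in zip(row_a, row_b):
        if a == gap:
            ops.append("I")   # character present only in the second string
        elif b == gap:
            ops.append("D")   # character present only in the first string
        elif a == b:
            ops.append("M")
        else:
            ops.append("R")
    return "".join(ops)


transcript = edit_transcript("AAGCA", "AA_C_")
print(transcript)                           # MMDMD
print(sum(op != "M" for op in transcript))  # 2 edit operations
```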
Sequence divergence
Sequences may have diverged from a common ancestor through mutations:
Substitution (AAGC -> AAGT)
Insertion (AAG -> AAGT)
Deletion (AAGC -> AAG)
Scoring schemes
In molecular biology, certain changes are more likely to occur than others
Amino acid substitutions tend to be conservative
In nucleotide sequences, transitions are more frequent than transversions
-> We want to give different weights to different edit operations
Example: a DNA substitution matrix:
     a   g   c   t
a   20  10   5   5
g   10  20   5   5
c    5   5  20  10
t    5   5  10  20
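As a sketch of how such a matrix is used, the code below scores an ungapped alignment by summing the matrix entry for each aligned pair. The matrix values are the ones in the example above; the function and variable names are illustrative.

```python
BASES = "agct"
MATRIX_ROWS = [
    [20, 10, 5, 5],   # a
    [10, 20, 5, 5],   # g
    [5, 5, 20, 10],   # c
    [5, 5, 10, 20],   # t
]
# Look-up table keyed by (row base, column base).
SCORE = {(r, c): MATRIX_ROWS[i][j]
         for i, r in enumerate(BASES)
         for j, c in enumerate(BASES)}


def ungapped_score(seq_a, seq_b):
    """Sum substitution scores over two equal-length, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(SCORE[x, y] for x, y in zip(seq_a, seq_b))


print(ungapped_score("gctgaacg", "ctataatc"))  # 80
```

Transitions (a/g and c/t) earn 10 and transversions earn 5, so a mismatch that is biologically more likely costs less than an unlikely one.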
BLAST: the workhorse of bioinformatics
http://www.ncbi.nlm.nih.gov/BLAST
BLAST = Basic Local Alignment Search Tool
Use it when you have a nucleotide or protein sequence that you want to search against sequence databases
to determine what the sequence is
to find related sequences (homologs)
Different BLAST programs
DNA can be translated into six potential proteins
5' CAT CAA
5' ATC AAC
5' TCA ACT
5' GTG GGT
5' TGG GTA
5' GGG TAG
DNA potentially encodes six proteins
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
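The six frames are the three codon offsets on the given strand plus the three offsets on its reverse complement. The sketch below splits the sequence above into codons for each frame without translating them; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse, giving the other strand 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Complete codons of dna starting at the given offset (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    strands = (dna, reverse_complement(dna))
    return [codons(strand, offset) for strand in strands for offset in range(3)]


top = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(top):
    print("5'", " ".join(frame[:2]))
```

Each printed line shows the first two codons of one frame, which can be compared with the six lines listed above.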
Four components to a BLAST search
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click BLAST
Step 1: Choose your sequence
Sequence can be input in FASTA format or as an accession number
Example of the FASTA format for a BLAST query
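The slide's own example did not survive in this transcript. As a reminder, a FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below parses such text; the record shown is a made-up placeholder (its header is hypothetical, and its sequence is borrowed from the six-frame slide), not the query used in the original slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)          # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, for illustration only.
example = """>example_query hypothetical description
CATCAACTACAACTCCAAAGACACC
CTTACACATCAACAAACCTACCCAC
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```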
Step 2: Choose the BLAST program
Program   Input     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
Step 2: Choose the BLAST program
Step 3: choose the database
nr = non-redundant (most general database)
dbEST = database of expressed sequence tags
dbSTS = database of sequence tagged sites
pdb = sequences derived from 3D structures of proteins
Patents = nucleotide sequences derived from the patent division of GenBank
Step 4a: Select optional search parameters
You can...
choose the organism to search
turn filtering on/off
change the substitution matrix
change the expect (e) value
change the word size
change the output format
Step 4a: Select optional search parameters
CD search
Step 4a: Select optional search parameters
Entrez!
Filter
Scoring matrix
Word size
Expect
organism
filtering
Step 4b: optional formatting parameters
Alignment view
Descriptions
Alignments
BLAST format options
Alignment views
Pairwise
Standard BLAST alignment in pairs of query sequence and database match.
Query: 251 tgaccggtaacgaccgcaccctggacgtcatggcgctggatgtggtgtggacggcgga 3
|||||||||| ||||||| |||||||| |||||| ||||||||||||||||||||
Sbjct: 248575 tgaccggtaaagaccgcagcttggacgtgatggcgatggatgtggtgtggacagcgga 248634
database
query
program
High scores
low e values
Algorithm for BLAST
How a BLAST search works
The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T.
Altschul et al. (1990)
BLAST algorithm principles
(Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
query
DB
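The two steps of the main idea can be sketched as follows. This is a simplified illustration using exact word matches only, not the actual BLAST implementation; the function names and the two short DNA strings are illustrative.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Dictionary mapping every word of length k in the query to the
    list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_word_hits(query, db, k):
    """Scan the database sequence once and report (query_pos, db_pos)
    for every exact word match; each hit could seed a local alignment."""
    index = build_word_index(query, k)
    hits = []
    for j in range(len(db) - k + 1):
        for i in index.get(db[j:j + k], []):
            hits.append((i, j))
    return hits


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"
print(sorted(find_word_hits(query, db, 4)))
```

Because the query is indexed once, the database is scanned in a single pass with one dictionary look-up per position, which is where most of the speed comes from.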
BLAST Original Version
Dictionary:
All words of length k
Alignment initiated between words with an alignment score of at least T
Alignment:
Ungapped extensions until score below statistical threshold
Output:
All local alignments with score > statistical threshold
Background
The BLAST Algorithm
3 Stages
Preprocessing of the query
Generation of hits
Extension of the hits
Step 1: Preprocessing of the query
Speed gained by minimizing search space
Alignments require word hits (word size = W)
[Figure: Sequence 1 plotted against Sequence 2, with word hits marked]
Step 1: Preprocessing of the query (Contd.)
Threshold score = T
W and T modulate speed and sensitivity
Neighborhood words of RGD:
RGD 17
KGD 14
QGD 13
RGE 13
EGD 12
HGD 12
NGD 12
RGN 12
---- T = 12 ----
AGD 11
MGD 11
RAD 11
RGQ 11
RGS 11
RND 11
RSD 11
SGD 11
TGD 11
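The list above can be regenerated by scoring every three-letter word against RGD and keeping those at or above T. The slides do not name the scoring matrix; the sketch below assumes BLOSUM62, whose rows for R, G and D reproduce the scores shown, and hard-codes only those three rows.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
# Rows for R, G and D, assumed to be BLOSUM62 (columns in AMINO_ACIDS order).
ROWS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "D": [-2, -2, 1, 6, -3, 0, 2, -1, -1, -3, -4, -1, -3, -3, -1, 0, -1, -4, -3, -3],
}
SCORE = {q: dict(zip(AMINO_ACIDS, row)) for q, row in ROWS.items()}


def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against word,
    best first. Only works for words made of the residues in ROWS."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORE[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


for neighbor, score in neighborhood("RGD", 12):
    print(neighbor, score)
```

Lowering T to 11 in this sketch enlarges the neighborhood from 8 to 17 words, which illustrates the trade-off: a lower T finds more hits (sensitivity) at the cost of more work (speed).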
Step 2: Generation of hits
A hit is made with one or several successive pairs of similar words.
All the possible hits between the query sequence and
sequences from databases are calculated in this way.
query
Step 3: Extension
Each hit is extended in both directions.
Extension is terminated when the score drops more than X below the maximum score reached so far.
DB
query
scan
The BLAST Algorithm: Summary
BLAST Original Version
Sequence 1 (horizontal axis): A C G A A G T A A G G T C C A G T
Sequence 2 (vertical axis, read bottom to top): A G C G T T A G G T C C T T C C C
Example:
k = 4,
T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps
Output:
GTAAGGTCC
GTTAGGTCC
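The extension in this example can be sketched as below. The scoring (match +1, mismatch -1) and the drop-off value X = 2 are assumptions made for illustration; the slide states only k and T. The second sequence is the vertical axis of the figure read bottom to top, which is the reading that contains the word GGTC.

```python
def extend_hit(query, db, q_start, d_start, k, x_drop, match=1, mismatch=-1):
    """Ungapped extension of a word hit of length k in both directions.
    Stops once the running score falls more than x_drop below its best.
    Returns (score, query segment, db segment). Scoring is hypothetical."""
    def walk(qi, di, step):
        best = cur = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            cur += match if query[qi] == db[di] else mismatch
            length += 1
            if cur > best:
                best, best_len = cur, length   # remember the best extent
            elif best - cur > x_drop:
                break                          # dropped too far: give up
            qi += step
            di += step
        return best, best_len

    seed = sum(match if query[q_start + i] == db[d_start + i] else mismatch
               for i in range(k))
    right_score, right_len = walk(q_start + k, d_start + k, +1)
    left_score, left_len = walk(q_start - 1, d_start - 1, -1)
    q_lo, q_hi = q_start - left_len, q_start + k + right_len
    d_lo, d_hi = d_start - left_len, d_start + k + right_len
    return seed + left_score + right_score, query[q_lo:q_hi], db[d_lo:d_hi]


query = "ACGAAGTAAGGTCCAGT"
db = "AGCGTTAGGTCCTTCCC"
seed_q, seed_d = query.find("GGTC"), db.find("GGTC")
print(extend_hit(query, db, seed_q, seed_d, k=4, x_drop=2))
```

With these assumed values the sketch returns the same pair of segments as the Output lines above, with a score of 7 (eight matches and one mismatch).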
BLAST results: List of hits
E-value: number of unrelated databank sequences expected to yield the same or a higher score by pure chance
BLAST results: High scoring pairs (HSPs)
Fundamental unit of the BLAST algorithm output
HSP (high scoring pair): aligned fragments of query and detected sequence with similarity score exceeding a set cutoff value
Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65
Identities = 10/17 (58%), Positives = 16/17 (94%)
Query:  81 SGDLSMLVLLPDEVSDL 97
           +GD+SM +LLPDE++D+
Sbjct: 259 AGDVSMFLLLPDEIADV 275
[Figure labels: E-value (Expectation); HSP]
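The Identities and Positives counts in the HSP above can be read off the middle line of the alignment: a letter marks an identical pair, a "+" marks a conservative substitution, and a blank marks a non-positive pair. The sketch below counts them; the function name is illustrative.

```python
def hsp_stats(midline):
    """Return (identities, positives, alignment length) from the middle
    line of a BLAST protein alignment."""
    identities = sum(ch.isalpha() for ch in midline)
    positives = identities + midline.count("+")
    return identities, positives, len(midline)


print(hsp_stats("+GD+SM +LLPDE++D+"))  # (10, 16, 17)
```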
BLAST Confidence measures
Score and bit score: depend on scoring method
E-value (Expect value): number of unrelated database sequences expected to yield the same or a higher score by pure chance
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (E-value approaching zero => significant alignment)
An E value of 1 assigned to a hit can be interpreted as follows: in a database of the current size, one might expect to see 1 match with a similar score simply by chance.
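The slides do not give a formula, but the E-value is commonly computed with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. In the sketch below the default values of `lam` and `k` are placeholders, not parameters taken from the slides; real values depend on the scoring system.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.318, k=0.13):
    """Karlin-Altschul E-value. lam and k are placeholder parameters."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.318, k=0.13):
    """Normalized score in bits, so that E = m * n * 2 ** (-bits)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


small_db = expect_value(50, query_len=100, db_len=1_000_000)
large_db = expect_value(50, query_len=100, db_len=2_000_000)
print(small_db, large_db)
```

Doubling the database size doubles the E-value for the same score, which is why the slide ties the interpretation to "a database of the current size".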
Statistical Terminology
True positive: A hit returned from a database search which is homologous with the query sequence. GOOD
False positive: A hit returned from a database search which is not homologous with the query sequence. BAD
True negative: A sequence which is not homologous with the query sequence is not returned from the database search. GOOD
False negative: A sequence which is homologous with the query sequence is not returned from the database search. BAD
Sensitivity: A program which is sensitive picks up most true positives.
Selectivity: A program which is selective does not include false positives.
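These terms can be turned into numbers once hits are counted. Sensitivity is TP / (TP + FN). The slides give no formula for selectivity; the sketch below takes it as TP / (TP + FP), which is one common reading and is an assumption here. The counts in the example are made up.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of all homologs in the database that the search returned."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos, false_pos):
    """Fraction of returned hits that are real homologs (assumed definition)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical search: 8 homologs found, 2 missed, 8 unrelated hits returned.
print(sensitivity(8, 2))   # 0.8
print(selectivity(8, 8))   # 0.5
```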
Conclusion
Treat BLAST searches as scientific experiments
Don't just use the default parameters; defaults change from time to time
BLAST is quite complicated, but it's very useful.
Thanks