Date post: | 20-Dec-2015 |
DNA Sequence Evolution
[Figure: evolutionary tree of a DNA sequence. Time axis: -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences from root to leaves: AAGACTT; AAGGCTT, T_GACTT; _GGGCTT, TAGACCTT, A_CACTT; today's sequences _G_GCTT (Mouse), TAGGCCTT (Human), TAGCCCTTA (Monkey), A_C_CTT (Cat), A_CACTTC (Lion).]
Sequence alignments
They tell us about
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
• And more…
Pairwise alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the alphabet {A, C, G, T}
Dynamic programming
• Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
• Time and space complexity is O(mn)
Animation slides by Elizabeth Thomas at Cold Spring Harbor Labs (CSHL)
http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf
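As a rough illustration of how the table V(i,j) is filled, here is a minimal Python sketch of the global alignment score. The function name and the scoring values (match, mismatch, gap) are assumptions made for this example; the slides do not specify them.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the (m+1) x (n+1) table V and return (V[m][n], V).

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # First column and row: a prefix aligned against nothing costs only gaps.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,  # align s[i] with t[j]
                          V[i - 1][j] + gap,      # s[i] against a gap
                          V[i][j - 1] + gap)      # t[j] against a gap
    return V[m][n], V
```

The two nested loops visit each of the (m+1)(n+1) cells once, which is where the O(mn) time and space bound comes from.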
Structural alignment - example 1
Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.
PDB ids are 3TRX and 1XWC.
Structural alignment - example 2
Computer-generated aligned proteins
Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.
Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html
Structural alignments
• We can produce high-quality alignments by hand if the structure is available.
• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.
Benchmark alignments
• Protein alignment benchmarks
– BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in studies of protein alignment.
– Protein benchmarks are generally large and have been in the research community for some time now.
– BAliBASE 3.0
Biologically realistic scoring matrices
• PAM and BLOSUM are the most popular
• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins
• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity
PAM
• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j
• Examine a set of closely related sequences which are easy to align; for PAM, 1572 mutations between 71 families
• Compute probabilities of change and background probabilities by simple counting
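The counting step can be sketched in Python as follows. This is a simplified illustration only: the function name, the "_" gap symbol, and the symmetric counting of each aligned column are assumptions of this sketch, and it is not the full procedure used to build the published PAM matrices.

```python
from collections import Counter

def estimate_transition_matrix(aligned_pairs):
    """Estimate M[i][j] and background frequencies by counting.

    aligned_pairs: list of (x, y) strings of equal length, '_' marks a gap.
    Each aligned column (a, b) is counted in both directions (assumption).
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in aligned_pairs:
        for a, b in zip(x, y):
            if a == "_" or b == "_":
                continue  # gapped columns carry no substitution information
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total = sum(residue_counts.values())
    background = {a: c / total for a, c in residue_counts.items()}
    # Row-normalise so that each row of M sums to 1.
    M = {a: {b: pair_counts[(a, b)] / residue_counts[a] for b in residue_counts}
         for a in residue_counts}
    return M, background
```

For example, `estimate_transition_matrix([("AAAA", "AAAC")])` gives M["A"]["C"] = 1/7 and a background frequency of 7/8 for A.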
Local alignment
• Global alignment recursions:
\[
V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}
\]
• Local alignment recursions:
\[
V(i,j) = \max \left\{ \begin{array}{l} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}
\]
Local alignment traceback
• Let T(i,j) be the traceback matrix and let m and n be the lengths of the input sequences.
• Global alignment traceback:
– Begin from T(m,n) and stop at T(0,0).
• Local alignment traceback:
– Find i*,j* such that the score V(i*,j*) is the maximum over all V(i,j).
– Begin traceback from T(i*,j*) and stop when the score V(i,j) reaches 0.
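A minimal Python sketch of local alignment with traceback is given below. The scoring values and the function name are assumptions for illustration. Instead of storing a separate traceback matrix, this sketch re-derives each move from the score table, which is equivalent for the purpose of recovering one optimal local alignment.

```python
def local_alignment(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of s, aligned part of t)."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            if V[i][j] > best:  # remember the maximum cell (i*, j*)
                best, bi, bj = V[i][j], i, j
    # Traceback from (i*, j*) until a cell with score 0 is reached.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("_"); i -= 1
        else:
            a.append("_"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

With these assumed scores, `local_alignment("TTTACGTTT", "GGGACGGGG")` returns `(6, "ACG", "ACG")`: only the shared substring is reported, and the unrelated flanks are ignored.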
BLAST
• Local pairwise alignment heuristic
• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.
• Online server: http://www.ncbi.nlm.nih.gov/blast
BLAST
1. Given a query q and a target sequence, find substrings of length k (k-mers) of score at least t, also called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.
2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold.
3. Report maximal segments above score S.
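Step 2 can be illustrated with a small ungapped extension routine. This is a sketch under stated assumptions, not BLAST's actual implementation: the function name, the match/mismatch scores, and the `max_drop` parameter (standing in for the pre-defined threshold) are all hypothetical.

```python
def extend_hit(query, target, qpos, tpos, k, match=1, mismatch=-1, max_drop=2):
    """Extend a k-mer hit without gaps in both directions.

    Returns (score, start, end) where [start, end) is the segment in the query.
    """
    def score(a, b):
        return match if a == b else mismatch

    def extend(q, t, step):
        best = run = best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            run += score(query[q], target[t])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > max_drop:
                break  # score fell too far below the best seen so far
            q += step
            t += step
        return best, best_len

    seed = sum(score(query[qpos + d], target[tpos + d]) for d in range(k))
    right, right_len = extend(qpos + k, tpos + k, +1)
    left, left_len = extend(qpos - 1, tpos - 1, -1)
    return seed + left + right, qpos - left_len, qpos + k + right_len
```

A caller would then keep only the segments whose score exceeds S, as in step 3.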
Finding k-mers quickly
• Preprocess the database of sequences:
– For each sequence in the database store all k-mers in a hash table.
– This takes linear time
• Query sequence:
– For each k-mer in the query sequence look up the hash table of the target to see if it exists
– Also takes linear time
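The preprocessing and lookup steps might look like this in Python, using a dictionary as the hash table. The function names and the dictionary-of-sequences input format are assumptions for this sketch, and it finds exact k-mer matches only, without the score threshold t.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_hits(query, index, k):
    """Return (query position, sequence name, database position) for each match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits
```

For example, with `index = build_kmer_index({"human": "TAGGCCTT", "mouse": "GGCTT"}, 4)`, the call `find_hits("AGGC", index, 4)` returns `[(0, "human", 1)]`.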
Profile-sequence alignment
• Given a family alignment, how can we align it to a sequence?
• First, we compute a profile of the alignment.
• We then align the profile to the sequence using standard dynamic programming.
• However, we need to describe how to align a profile vector to a nucleotide or residue.
Profile
• A profile can be described by a set of vectors of nucleotide/residue frequencies.
• For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
Aligning a profile vector to a nucleotide
• ClustalW/MUSCLE
– Let f be the profile vector
– $\mathrm{Score}(f,j) = \sum_{i \in \{A,C,G,T\}} f_i \, S(i,j)$
– where S(i,j) is the substitution scoring matrix
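The profile and the score of a profile vector against a nucleotide can be sketched as follows. The function names are hypothetical, and ignoring gap characters when normalizing frequencies is an assumption of this example rather than something the slides state.

```python
def compute_profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (a dict) per alignment column.

    Gap characters are ignored when normalising (assumption).
    """
    profile = []
    for column in zip(*alignment):
        counts = {a: column.count(a) for a in alphabet}
        total = sum(counts.values())
        profile.append({a: (counts[a] / total if total else 0.0)
                        for a in alphabet})
    return profile

def profile_score(f, j, S):
    """Score(f, j): sum over i of f[i] * S[i][j]."""
    return sum(f[i] * S[i][j] for i in f)
```

With a substitution matrix that scores +1 for a match and -1 for a mismatch, a column that is entirely A scores 1.0 against A, while a column split evenly between C, G, and T scores -1/3 against C.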
Multiple sequence alignment
• “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk
• Computationally very hard---NP-hard
Multiple sequence alignment
Unaligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACCTT
Aligned sequences
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us to identify functionality
Iterative alignment (heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to closest one t in S
• Remove t from S and compute profile of alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to closest one t
– Remove t from S
Iterative alignment
• Once the alignment is computed, randomly divide it into two parts
• Compute the profile of each sub-alignment and realign the profiles
• If the sum-of-pairs score of the new alignment is better than the previous one, keep it; otherwise continue with a different division until the specified iteration limit is reached
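The sum-of-pairs score used to compare the old and new alignments can be sketched as below. The scoring values and the "_" gap symbol are assumptions for this example; columns where both sequences have a gap are skipped, which is one common convention.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise alignment scores induced by a multiple alignment."""
    total = 0
    for x, y in combinations(alignment, 2):
        for a, b in zip(x, y):
            if a == "_" and b == "_":
                continue  # gap against gap contributes nothing
            if a == "_" or b == "_":
                total += gap
            else:
                total += match if a == b else mismatch
    return total
```

An iterative refinement loop would call this on the realigned result and keep it only when the returned value increases.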
Progressive alignment
• Idea: perform profile alignments in the order dictated by a tree
• Given a guide tree, do a post-order traversal and align sequences in that order
• Widely used heuristic
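To make the ordering concrete, the sketch below walks a guide tree in post-order and lists which groups of sequences would be aligned at each step. The nested-tuple tree representation, the function name, and the example tree are hypothetical; the profile alignment itself is left out.

```python
def merge_order(tree):
    """List the (left group, right group) merges in post-order.

    A tree is either a leaf name (str) or a pair (left subtree, right subtree).
    """
    merges = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left = visit(node[0])   # finish the left subtree first
        right = visit(node[1])  # then the right subtree
        merges.append((left, right))  # then align the two resulting profiles
        return left + right

    visit(tree)
    return merges
```

For the made-up guide tree `((("Human", "Monkey"), "Mouse"), ("Cat", "Lion"))`, the merges are Human with Monkey, then that pair with Mouse, then Cat with Lion, and finally the two groups with each other.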
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: fast and accurate, progressive and iterative combination
• T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment)
• PROBCONS: slow but highly accurate, probabilistic consistency progressive based scheme
• DIALIGN: very good for local alignments
Evaluation of multiple sequence alignments
• Compare to benchmark “true” alignments
• Use simulation
• Measure conservation of an alignment
• Measure accuracy of phylogenetic trees
• How well does it align motifs?
• More…
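One way to compare an alignment to a benchmark "true" alignment is to measure the fraction of residue pairs aligned in the benchmark that are also aligned in the test alignment. The sketch below illustrates that idea; the function names and the "_" gap symbol are assumptions, and both alignments are assumed to list the same sequences in the same order.

```python
from itertools import combinations

def aligned_pairs(alignment, gap="_"):
    """Return the set of residue pairs placed in the same column.

    A residue is identified by (row index, index within the ungapped sequence).
    """
    pairs = set()
    next_index = [0] * len(alignment)
    for c in range(len(alignment[0])):
        present = []
        for r, row in enumerate(alignment):
            if row[c] != gap:
                present.append((r, next_index[r]))
                next_index[r] += 1
        pairs.update(combinations(present, 2))
    return pairs

def sp_accuracy(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(ref & aligned_pairs(test)) / len(ref)
```

For instance, `sp_accuracy(["ACG", "AG_"], ["ACG", "A_G"])` is 0.5, because the test alignment recovers the A/A pair but not the G/G pair.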