BNFO 602 Lecture 2 Usman Roshan. Sequence Alignment Widely used in bioinformatics Proteins and genes...

BNFO 602 Lecture 2

Usman Roshan

DNA Sequence Evolution

[Figure: evolutionary tree of a DNA sequence; gaps are shown as _]

-3 mil yrs: AAGACTT

-2 mil yrs: T_GACTT, AAGGCTT

-1 mil yrs: _GGGCTT, TAGACCTT, A_CACTT

today: ACCTT (Cat), ACACTTC (Lion), TAGCCCTTA (Monkey), TAGGCCTT (Human), GGCTT (Mouse)

[Figure: the same tree with the present-day sequences aligned; gaps are shown as _]

TAGGCCTT (Human)

TAGCCCTTA (Monkey)

A_C_CTT (Cat)

A_CACTTC (Lion)

_G_GCTT (Mouse)

Sequence alignments

They tell us about

• Function or activity of a new gene/protein

• Structure or shape of a new protein

• Location or preferred location of a protein

• Stability of a gene or protein

• Origin of a gene or protein

• Origin or phylogeny of an organelle

• Origin or phylogeny of an organism

• And more…

Pairwise sequence alignment

• How to align two sequences?

Pairwise alignment

• How to align two sequences?

• We use dynamic programming

• Treat DNA sequences as strings over the alphabet {A, C, G, T}

Pairwise alignment

Dynamic programming

• Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S| = m, |T| = n)

• Time and space complexity is O(mn)
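The table V can be filled in row by row. The following is a minimal sketch, assuming a linear gap penalty and a simple match/mismatch substitution score; the function name and the default parameter values are illustrative choices, not taken from the slides.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return V(m, n), the optimal global alignment score of s and t.

    Assumed scoring (not from the slides): +1 match, -1 mismatch,
    -2 per gap character.
    """
    m, n = len(s), len(t)
    # V[i][j] holds the best score for aligning s[:i] with t[:j].
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        V[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,   # align s[i] with t[j]
                V[i - 1][j] + gap,       # s[i] against a gap
                V[i][j - 1] + gap,       # t[j] against a gap
            )
    return V[m][n]
```

Both loops together visit each of the (m+1)(n+1) cells once, which is where the O(mn) time and space bound comes from.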

Dynamic programming

Animation slides by Elizabeth Thomas at Cold Spring Harbor Labs (CSHL)

http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf






How do we pick gap parameters?

Structural alignments

• Recall that proteins have 3-D structure.

Structural alignment - example 1

Alignment of thioredoxins from human and fly, taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals.

PDB ids are 3TRX and 1XWC.

Structural alignment - example 2

Computer generated aligned proteins

Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively.

Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

Structural alignments

• We can produce high-quality alignments by hand if the structure is available.

• These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments

• Protein alignment benchmarks

– BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in protein alignment studies.

– Protein benchmarks are generally large and have been used in the research community for some time.

– BAliBASE 3.0: http://www-bio3d-igbmc.u-strasbg.fr/balibase/

Biologically realistic scoring matrices

• PAM and BLOSUM are most popular

• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins

• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

PAM

• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j

• Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families

• Compute probabilities of change and background probabilities by simple counting
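The counting step can be illustrated on toy data. This is a rough sketch only: it assumes ungapped-column counting over pre-aligned sequence pairs and treats each observed change as undirected. The function name and the toy input are hypothetical, and the actual PAM construction involves further steps not covered on this slide.

```python
from collections import Counter

def transition_and_background(aligned_pairs, gap="_"):
    """Estimate M[x][y] (probability of x changing to y) and
    background residue frequencies by counting aligned columns.

    aligned_pairs: list of (seq_a, seq_b) strings of equal length.
    Columns containing a gap are skipped (an assumption of this sketch).
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            if x == gap or y == gap:
                continue
            # Count the change in both directions (undirected change).
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
            residue_counts[x] += 1
            residue_counts[y] += 1
    total = sum(residue_counts.values())
    background = {r: c / total for r, c in residue_counts.items()}
    M = {}
    for (x, y), c in pair_counts.items():
        # Each row of M is normalised by how often x was observed.
        M.setdefault(x, {})[y] = c / residue_counts[x]
    return M, background
```

With this normalisation each row of M sums to 1, so M[x] can be read as a probability distribution over what x is aligned with.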

Local alignment

• Global alignment recursions:

$$
V(i,j) = \max \left\{ \begin{array}{l} V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}
$$

• Local alignment recursions:

$$
V(i,j) = \max \left\{ \begin{array}{l} 0 \\ V(i-1,j-1) + S(x_i, y_j) \\ V(i-1,j) + g \\ V(i,j-1) + g \end{array} \right\}
$$

Local alignment traceback

• Let T(i,j) be the traceback matrix, V(i,j) the score matrix, and m and n the lengths of the input sequences.

• Global alignment traceback:

– Begin from T(m,n) and stop at T(0,0).

• Local alignment traceback:

– Find i*, j* such that V(i*,j*) is the maximum over all V(i,j).

– Begin traceback from T(i*,j*) and stop when V(i,j) <= 0.
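The local recursion and its traceback can be sketched together. This is a minimal illustration, assuming a match/mismatch substitution score and a linear gap penalty; the function name, the move labels 'D', 'U', 'L', and the default scores are assumptions made for this example.

```python
def local_align(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned piece of s, aligned piece of t)."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]       # scores
    T = [[None] * (n + 1) for _ in range(m + 1)]    # traceback moves
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                    # start a new local alignment
                (V[i - 1][j - 1] + sub, "D"),  # diagonal: align s[i], t[j]
                (V[i - 1][j] + gap, "U"),      # up: s[i] against a gap
                (V[i][j - 1] + gap, "L"),      # left: t[j] against a gap
            ]
            V[i][j], T[i][j] = max(candidates, key=lambda c: c[0])
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j
    # Trace back from the highest-scoring cell until the score reaches 0.
    a, b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and V[i][j] > 0:
        move = T[i][j]
        if move == "D":
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif move == "U":
            a.append(s[i - 1])
            b.append("_")
            i -= 1
        else:
            a.append("_")
            b.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

For example, `local_align("TTACGTTT", "GGACGGG")` picks out the shared substring ACG and ignores the unrelated flanks, which is the behaviour the extra 0 in the recursion is meant to give.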

BLAST

• Local pairwise alignment heuristic

• Faster than standard pairwise alignment programs such as SSEARCH, but less sensitive.

• Online server: http://www.ncbi.nlm.nih.gov/blast


BLAST

1. Given a query q and a target sequence, find substrings of length k (k-mers) that score at least t; these are called hits. k is normally 3 to 5 for amino acids and 12 for nucleotides.

2. Extend each hit to a locally maximal segment. Terminate the extension when the reduction in score exceeds a pre-defined threshold.

3. Report maximal segments above score S.
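Step 2 can be illustrated with an ungapped extension in one direction. This is a simplified sketch, not BLAST's actual implementation: it assumes the hit is an exact k-mer match, scores +1/-1 per position, and extends only to the right (the left extension is symmetric). The names `extend_right` and `x_drop` are illustrative.

```python
def extend_right(q, s, qi, si, k, x_drop=3, match=1, mismatch=-1):
    """Extend an exact k-mer hit q[qi:qi+k] == s[si:si+k] to the right.

    Returns (best score, length of the best segment). The extension
    stops once the running score falls more than x_drop below the best.
    """
    score = k * match
    best, best_len = score, k
    i, j = qi + k, si + k
    while i < len(q) and j < len(s):
        score += match if q[i] == s[j] else mismatch
        if score > best:
            best, best_len = score, i - qi + 1
        elif best - score > x_drop:
            break                      # score dropped too far; give up
        i += 1
        j += 1
    return best, best_len
```

The segment of length `best_len` is locally maximal in the sense that extending it further to the right never improved the score before the drop threshold was hit.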

Finding k-mers quickly

• Preprocess the database of sequences:

– For each sequence in the database store all k-mers in a hash table.

– This takes linear time

• Query sequence:

– For each k-mer in the query sequence look up the hash table of the target to see if it exists

– Also takes linear time
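The two phases above can be sketched with a dictionary as the hash table. This is a minimal illustration assuming exact k-mer matches; the function names and the toy database are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map each k-mer to the list of (sequence name, position) where it occurs.

    database: dict of sequence name -> sequence string.
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_hits(query, index, k):
    """Return (query position, sequence name, target position) for every
    k-mer of the query that also occurs in the indexed database."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, pos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, pos))
    return hits
```

For instance, indexing `{"s1": "ACGTAC"}` with k = 3 and querying `"GTACG"` yields three hits, one per query 3-mer. Each lookup is a constant-time hash access on average, so the scan is linear in the query length plus the number of hits reported.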

Profile-sequence alignment

• Given a family alignment, how can we align it to a sequence?

• First, we compute a profile of the alignment.

• We then align the profile to the sequence using standard dynamic programming.

• However, we need to describe how to align a profile vector to a nucleotide or residue.

Profile

• A profile can be described by a set of vectors of nucleotide/residue frequencies.

• For each position i of the alignment, we compute the normalized frequency of nucleotides A, C, G, and T
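A profile can be computed column by column. The sketch below assumes that gap characters (written _) are ignored when normalising, which is one possible convention and not stated on the slide; the function name is illustrative.

```python
def compute_profile(alignment, alphabet="ACGT"):
    """Return one frequency vector (a dict) per alignment column.

    alignment: list of equal-length strings; characters outside the
    alphabet (such as the gap symbol _) are not counted.
    """
    profile = []
    for column in zip(*alignment):
        counts = {a: column.count(a) for a in alphabet}
        total = sum(counts.values())
        profile.append(
            {a: (counts[a] / total if total else 0.0) for a in alphabet}
        )
    return profile
```

For the three-row toy alignment `["AC", "AG", "A_"]`, the first column gives frequency 1.0 for A and the second gives 0.5 each for C and G.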

Aligning a profile vector to a nucleotide

• ClustalW/MUSCLE

– Let f be the profile vector

– $\mathrm{Score}(f, j) = \sum_{i \in \{A, C, G, T\}} f_i \, S(i, j)$

– where S(i,j) is the substitution scoring matrix
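The score of a profile vector against a single nucleotide is a frequency-weighted average of substitution scores. A small sketch follows, assuming the profile vector and the substitution matrix are stored as dictionaries; the +1/-1 matrix in the usage note is a made-up example.

```python
def profile_score(f, j, S):
    """Score(f, j): sum over i of f[i] * S[i][j].

    f: dict nucleotide -> frequency; j: a nucleotide;
    S: nested dict, S[i][j] is the substitution score of i against j.
    """
    return sum(f[i] * S[i][j] for i in f)
```

With a toy matrix `S = {a: {b: (1 if a == b else -1) for b in "ACGT"} for a in "ACGT"}` and `f = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}`, scoring against A gives 0.5 - 0.5 = 0.0 and scoring against G gives -1.0. This value takes the place of S(x_i, y_j) in the dynamic programming recursion when one side is a profile.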

Multiple sequence alignment

• “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk

• Computationally very hard---NP-hard

Formally…

Multiple sequence alignment

Unaligned sequences

GGCTT

TAGGCCTT

TAGCCCTTA

ACACTTC

ACCTT

Aligned sequences

_G__GCTT_

TAGGCCTT_

TAGCCCTTA

A__CACTTC

A__C_CTT_

Conserved regions help us to identify functionality

Sum of pairs score


• What is the sum of pairs score of this alignment?
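One common way to define the sum-of-pairs score is to add up, over every pair of rows, the score of the pairwise alignment those two rows induce. The sketch below assumes that definition with a simple per-column scheme (+1 match, -1 mismatch, -2 residue against gap, 0 for gap against gap); these values and the function name are assumptions for illustration, since the slides do not fix them here.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Sum the column-wise scores over all pairs of rows.

    alignment: list of equal-length strings using gap_char for gaps.
    """
    total = 0
    for a, b in combinations(alignment, 2):
        for x, y in zip(a, b):
            if x == gap_char and y == gap_char:
                continue               # gap against gap scores 0
            elif x == gap_char or y == gap_char:
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

It can be applied directly to the five aligned rows from the multiple sequence alignment slide, written without spaces, e.g. `sum_of_pairs(["_G__GCTT_", "TAGGCCTT_", "TAGCCCTTA", "A__CACTTC", "A__C_CTT_"])`. The result depends on the scoring parameters chosen.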

Iterative alignment (heuristic for sum-of-pairs)

• Pick a random sequence from input set S

• Do (n-1) pairwise alignments and align to closest one t in S

• Remove t from S and compute profile of alignment

• While sequences remaining in S

– Do |S| pairwise alignments and align to closest one t

– Remove t from S

Iterative alignment

• Once alignment is computed randomly divide it into two parts

• Compute profile of each sub-alignment and realign the profiles

• If sum-of-pairs of the new alignment is better than the previous then keep, otherwise continue with a different division until specified iteration limit

Progressive alignment

• Idea: perform profile alignments in the order dictated by a tree

• Given a guide-tree do a post-order search and align sequences in that order

• Widely used heuristic
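The order of profile alignments implied by a guide tree can be listed with a post-order traversal. This is a sketch under assumptions: the tree is given as nested 2-tuples with sequence names at the leaves, and the example tree in the usage note is hypothetical, not one from the slides. The actual profile alignment at each step is left out.

```python
def progressive_order(tree):
    """Return the list of merges, in post-order, for a binary guide tree.

    tree: a leaf is a string (sequence name); an internal node is a
    (left, right) tuple. Each merge is a pair of tuples naming the
    sequences in the two groups whose profiles would be aligned.
    """
    merges = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left = visit(node[0])          # finish the left subtree first
        right = visit(node[1])         # then the right subtree
        merges.append((left, right))   # then align the two groups
        return left + right

    visit(tree)
    return merges
```

For the hypothetical tree `(("Human", "Monkey"), ("Cat", "Lion"))`, the merges are Human with Monkey, then Cat with Lion, then the two resulting groups with each other. Children are always aligned before their parent, which is what post-order guarantees.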

Popular alignment programs

• ClustalW: most popular, progressive alignment

• MUSCLE: fast and accurate, progressive and iterative combination

• T-COFFEE: slow but accurate, consistency-based alignment (align sequences in the multiple alignment to be close to the optimal pairwise alignments)

• PROBCONS: slow but highly accurate, progressive scheme based on probabilistic consistency

• DIALIGN: very good for local alignments

MUSCLE

MUSCLE

Evaluation of multiple sequence alignments

• Compare to benchmark “true” alignments

• Use simulation

• Measure conservation of an alignment

• Measure accuracy of phylogenetic trees

• How well does it align motifs?

• More…

Comparison of alignments on BAliBASE

Date post:	20-Dec-2015