Contents: First week: algorithms for exact string matching: One pattern: The...

Post on 19-Jan-2016

Contents

•First week: algorithms for exact string matching:

–One pattern: The algorithm depends on |p| and |Σ|

–k patterns: The algorithm depends on k, |p| and |Σ|

•Second week: Alignment of sequences.

–Edit distance between two strings: dynamic programming

–Alignment of sequences:

– 2 sequences

– 3 or more sequences

•Third week: dealing with long sequences.

Distance between words

What is the distance between the words:

– table, maple

– able, table

– announce, pronounce

– ACCTG, ACTT

… and between

– ACGG, ACTGTGG

– AATCTACTAGCGTACTACTC, ACTACTACGTACTACG

Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT

2. Insertion: ACCGTGAT → ACCGATGAT

3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)=0  d(ACT,AC)=1  d(ACT,C)=2  d(ACT,)=3  d(AC,ATC)=1  d(ACTTG,ATCTG)=2

Edit distance and alignments

The alignment that gives the distance can be represented:

ACCGTGAT
ACCG-GAT
**** ***

ACCG-TGAT
ACCGATGAT
**** ****

ACCGTGAT
ACCGAGAT
**** ***

ACCGTGTTATGTGTATG--TGA--AT
ACCG-GAT--GTGT-TGTTTGAGTAT
**** * *  **** **  ***  **

And the score of the alignment is the sum of the scores of the columns:

– 0 if both chars are the same

– 1 otherwise
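The column scoring rule can be sketched in a few lines of Python. The function name `alignment_score` is made up for this illustration; the two rows are assumed to be already aligned, with `-` standing for a gap.

```python
def alignment_score(top: str, bottom: str) -> int:
    """Add up the column scores: 0 if both chars are the same, 1 otherwise."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(0 if a == b else 1 for a, b in zip(top, bottom))


# The three single-error alignments of ACCGTGAT each score 1.
print(alignment_score("ACCGTGAT", "ACCG-GAT"))    # deletion
print(alignment_score("ACCG-TGAT", "ACCGATGAT"))  # insertion
print(alignment_score("ACCGTGAT", "ACCGAGAT"))    # mismatch
```

A gap column always scores 1 here, because `-` never equals a letter.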

Edit distance and alignments

But there are many alignments between two sequences. Given ACCG and ACT:

ACCG-
-AC-T

ACCG
AC-T
**

ACCG
ACT-
**

ACCG---
----ACT

Then the edit distance is the score of the best alignment, so we can find the distance by generating all alignments and picking the one with the smallest score.

Edit distance and Pairwise alignment

Given two DNA sequences $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$ over the alphabet {a,c,t,g}, we say that $A^*$ and $B^*$ over {a,c,t,g,-} are aligned iff

i) $A^*$ and $B^*$ become $A$ and $B$ if gaps ( - ) are removed.

ii) $|A^*| = |B^*|$

iii) For all $i$, it is not possible that $a^*_i = b^*_i = -$, where $a^*_i$ and $b^*_i$ are the $i$-th characters of $A^*$ and $B^*$.

Write all alignments between AA and AC ...
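As a hedged sketch of this exercise, the generator below enumerates every pair that satisfies conditions i) to iii) and then picks the one with the smallest score. The names `all_alignments`, `column_score` and `brute_force_distance` are invented for this illustration. The number of alignments grows exponentially, so this is only usable on very short strings.

```python
from typing import Iterator, Tuple


def all_alignments(a: str, b: str) -> Iterator[Tuple[str, str]]:
    """Yield every aligned pair (A*, B*) of a and b, each exactly once."""
    if not a and not b:
        yield "", ""
        return
    if a and b:  # first column pairs a character of a with a character of b
        for x, y in all_alignments(a[1:], b[1:]):
            yield a[0] + x, b[0] + y
    if a:        # first column pairs a character of a with a gap
        for x, y in all_alignments(a[1:], b):
            yield a[0] + x, "-" + y
    if b:        # first column pairs a gap with a character of b
        for x, y in all_alignments(a, b[1:]):
            yield "-" + x, b[0] + y


def column_score(top: str, bottom: str) -> int:
    """Edit-distance score: 0 per equal column, 1 otherwise."""
    return sum(x != y for x, y in zip(top, bottom))


def brute_force_distance(a: str, b: str) -> int:
    """Smallest score over all alignments (exponential time)."""
    return min(column_score(x, y) for x, y in all_alignments(a, b))


for top, bottom in all_alignments("AA", "AC"):
    print(top, bottom, column_score(top, bottom))
```

A column with two gaps is never generated, which is exactly condition iii). For AA and AC the loop lists 13 alignments.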

Edit distance and Pairwise alignment

To blackboard

Edit distance and alignment of strings

The distances are arranged in a table with one column per prefix of CTACTACTACGT and one row per prefix of ACTGA:

      C  T  A  C  T  A  C  T  A  C  G  T
   0  1  2  3  4  5  6  7  8  …
A  1
C  2
T  3
G  …
A

Each cell contains the distance between two prefixes. For example, the cell in row C and the fifth column contains the distance between AC and CTACT.

The first row aligns the empty string with each prefix of CTACTACTACGT, so every column is a gap:

-
C
(distance 1)

--
CT
(distance 2)

------
CTACTA
(distance 6)

The first column aligns each prefix of ACTGA with the empty string:

ACT
---
(distance 3)

Edit distance and alignment of strings

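BA(AC,CTAC) = best of

– BA(AC,CTA) followed by the column -/C

– BA(A,CTA) followed by the column C/C

– BA(A,CTAC) followed by the column C/-

$$d(AC,CTAC) = \min \{\, d(AC,CTA)+1,\ d(A,CTA),\ d(A,CTAC)+1 \,\}$$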
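The recurrence above, applied to every pair of prefixes, fills the whole table. A minimal sketch follows; the name `edit_distance` is chosen for this illustration, and rows are prefixes of the first string while columns are prefixes of the second, as in the tables.

```python
def edit_distance(a: str, b: str) -> int:
    """Fill the table d[i][j] = distance between a[:i] and b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):   # first row: empty string against prefixes of b
        d[0][j] = j
    for i in range(rows):   # first column: prefixes of a against empty string
        d[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i][j - 1] + 1,                           # gap in a
                d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match (+0) or mismatch (+1)
                d[i - 1][j] + 1,                           # gap in b
            )
    return d[-1][-1]


print(edit_distance("AC", "CTAC"))
```

The cost is quadratic in time and space, one table cell per pair of prefixes.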

Bioinformatics

Pairwise alignment

Best alignment

How can an alignment be scored?

Catcactactgacgactatcgtagcgcggctat acatctacgccaa- ctac-t-gtgtagatcgccgg

c-tgactgc-- acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----

* * *** * ************* ********* **** ******* * **** ** * ***

• Match: favorable

• Mismatch: unfavorable

• Gap: worst case

Then we assign a score to each case, for example 1, -1 and -2 respectively.

Pairwise alignment

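Edit distance: match=0, mismatch=1, indel=1

$$d(AC,CTAC) = \min \{\, d(A,CTAC)+1,\ d(A,CTA)+0,\ d(AC,CTA)+1 \,\}$$

The middle term adds 0 because both strings end in C; it would add 1 for a mismatch.

Similarity: match=1, mismatch=-1, indel=-2

$$s(AC,CTAC) = \max \{\, s(A,CTAC)-2,\ s(A,CTA)+1,\ s(AC,CTA)-2 \,\}$$

The middle term adds +1 because both strings end in C; it would add -1 for a mismatch.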
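The similarity version only swaps the minimum for a maximum and changes the three scores. A sketch, assuming the example values match=1, mismatch=-1, indel=-2; the function name `similarity` and its keyword parameters are chosen for this illustration.

```python
def similarity(a: str, b: str, match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Score of the best global alignment of a and b (higher is better)."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for j in range(cols):   # empty prefix of a against b[:j]: j gaps
        s[0][j] = j * indel
    for i in range(rows):   # a[:i] against empty prefix of b: i gaps
        s[i][0] = i * indel
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(
                s[i - 1][j] + indel,     # gap in b
                s[i - 1][j - 1] + pair,  # match or mismatch
                s[i][j - 1] + indel,     # gap in a
            )
    return s[-1][-1]


print(similarity("AC", "CTAC"))
```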

Pairwise alignment

Connect to alggen tool

Best alignment

[Figure: score table with accaccacaccacaacgagcata … acctgagcgatat along the columns and a c c … t along the rows]

Given the maximum score, how can the best alignment be found?

• Quadratic cost in space and time

• Practical for sequences up to 10,000 bp in length
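One common way to recover the alignment, sketched here under the same example scores (match=1, mismatch=-1, indel=-2), is to keep the whole table and walk back from the last cell, at each step choosing a neighbour that explains the current score. The name `best_alignment` is invented for this illustration. When several alignments are optimal, this sketch returns only one of them, preferring the diagonal.

```python
from typing import Tuple


def best_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                   indel: int = -2) -> Tuple[int, str, str]:
    """Return (score, A*, B*) for one best global alignment."""
    def pair(x: str, y: str) -> int:
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        s[0][j] = j * indel
    for i in range(rows):
        s[i][0] = i * indel
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(s[i - 1][j] + indel,
                          s[i - 1][j - 1] + pair(a[i - 1], b[j - 1]),
                          s[i][j - 1] + indel)

    # Traceback: rebuild the columns from the last cell to the first.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + pair(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return s[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = best_alignment("ACCGTGAT", "ACCGGAT")
print(score)
print(top)
print(bottom)
```

Keeping the full table is what makes the space cost quadratic.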

Download alggen tool

Some preconceived ideas

We have developed the theory according to the following principles:

1) Both sequences have a similar length (global).

2) The gap model is linear: k consecutive gaps are penalized with a score of k·(-2).

Semiglobal pairwise alignment

Assume that we have two sequences, S1 and S2, of different lengths.

It is meaningless to introduce gaps until both sequences have a similar length.

The most probable alignment places the shorter sequence between a run of initial gaps and a run of final gaps.

How can these alignments be found?

Semiglobal pairwise alignment

      C  T  A  C  T  A  C  T  A  C  G  T
   0  0  0  0  0  0  0  0  0  0  0  0  0
A  1
C  2
T  3

Each cell of the first row contains the score of the best alignment of a prefix, such as CTA, with the empty sequence. The contribution of the initial gaps is disregarded, so the whole first row is 0.

But what happens with the final gaps? They are disregarded by checking the last row for the best score.
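A sketch of this idea with the edit-distance costs of the table (first column 1, 2, 3): the first row is all zeros and the answer is the best value in the last row. The name `semiglobal_distance` and the returned end column are choices made for this illustration. Only gaps before and after the row sequence are free here; the row sequence itself is aligned completely.

```python
from typing import Tuple


def semiglobal_distance(short: str, long_seq: str) -> Tuple[int, int]:
    """Best distance of `short` against any stretch of `long_seq`.

    Returns (distance, end_column), where end_column is the first column
    of the last row that holds the best score.
    """
    rows, cols = len(short) + 1, len(long_seq) + 1
    d = [[0] * cols for _ in range(rows)]  # first row stays 0: initial gaps are free
    for i in range(rows):
        d[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + (short[i - 1] != long_seq[j - 1]),
                d[i - 1][j] + 1,
            )
    last_row = d[-1]
    best = min(last_row)                   # final gaps are free: scan the last row
    return best, last_row.index(best)


print(semiglobal_distance("ACT", "CTACTACTACGT"))
```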

How does the algorithm search for the best alignment?

Affine-gap model score

Given the following alignments

that have the same score …

a g t a c c c c g t a g

a g t - c c - - g t a -

a g t a c c c c g t a g

a g t - c - c - g t a -

a g t a c c c c g t a g

a g t - c - - c g t a -

a g t a c c c c g t a g

a g t - - c c - g t a -

a g t a c c c c g t a g

a g t - - c - c g t a -

a g t a c c c c g t a g

a g t - - - c c g t a -

Which case is the most plausible from a biological point of view?

Affine-gap model score

Then, how can we distinguish between

consecutive gaps and separated gaps?

a g t a c c c c g t a g

a g t - - c - c g t a -

a g t a c c c c g t a g

a g t - - - c c g t a -

By penalizing the opening of a gap (OG) more heavily than its extension (EG), for instance with -10 and -0.5.

Then, the penalty of k consecutive gaps becomes

OG + (k-1) EG

which is an affine-gap function.
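To see how the affine penalty separates alignments that the linear model scores equally, the sketch below adds up OG + (k-1) EG over every run of consecutive gaps in one row. The function names are invented for this illustration, and the defaults are the example values -10 and -0.5.

```python
from itertools import groupby


def affine_gap_penalty(k: int, opening: float = -10.0, extension: float = -0.5) -> float:
    """Penalty of k consecutive gaps: OG + (k-1) EG."""
    return opening + (k - 1) * extension if k > 0 else 0.0


def total_gap_penalty(row: str, opening: float = -10.0, extension: float = -0.5) -> float:
    """Sum the affine penalty over each maximal run of '-' in an aligned row."""
    return sum(
        affine_gap_penalty(len(list(run)), opening, extension)
        for char, run in groupby(row)
        if char == "-"
    )


print(total_gap_penalty("agt--c-cgta-"))  # three separate runs of gaps
print(total_gap_penalty("agt---ccgta-"))  # two runs of gaps
```

Both rows contain four gaps, but the row with fewer, longer runs is penalized less.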

How is the best alignment found?

C T A C T A C T A C G T

A

C

T

G

A

Affine-gap model score

Smallest arrows: refer to the introduction of an opening gap.

Largest arrows: refer to the introduction of an extension gap.

But from which cell do the largest arrows originate?

Local alignment

Given two sequences, we can consider the alignments of all

their substrings…

…how can the best of them be found?

Two questions arise:

- how can the alignments be compared?

- how can the best one be selected?
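One standard answer, not spelled out in these slides, is the Smith-Waterman recurrence: use a similarity score, never let a cell drop below 0 (an alignment may start anywhere), and take the largest cell of the table (it may end anywhere). A sketch, assuming the example scores match=1, mismatch=-1, indel=-2; the name `local_alignment_score` is invented for this illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                          indel: int = -2) -> int:
    """Best similarity score over all pairs of substrings of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new alignment here
                previous[j - 1] + pair,   # match or mismatch
                previous[j] + indel,      # gap in b
                current[j - 1] + indel,   # gap in a
            )
            best = max(best, current[j])  # the alignment may end in any cell
        previous = current
    return best


print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

Only two rows are kept because this sketch returns the score alone; recovering the aligned substrings would need the full table.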

Bioinformatics

Multiple alignment


Pairwise to multiple alignment

What happens with three strings S1, S2 and S3?

Let n be their length; then the cost becomes

$O(n^3)$ · “$O(2^3)$” · “$O(3^2)$”

And with k strings? $O(n^k \, 2^k \, k^2)$
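One way to read the three factors, assuming the k-dimensional table is filled cell by cell and each column is scored by comparing all pairs of its characters (the slides do not name the column score, so this is an assumption):

```latex
\underbrace{n^k}_{\text{cells of the table}}
\times
\underbrace{(2^k - 1)}_{\text{predecessors of each cell}}
\times
\underbrace{\binom{k}{2}}_{\text{pairs in each column}}
= O\!\left(n^k \, 2^k \, k^2\right)
```

Each predecessor corresponds to a choice, for every sequence, between consuming a character and placing a gap, excluding the column made only of gaps.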

Multiple alignment

Multiple-alignment programs use different heuristics:

Clustal (progressive alignment)

http://www.ebi.ac.uk/clustalw

TCoffee (progressive alignment + databases)

http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi

HMM (Hidden Markov Models)

Multiple alignment

Connect to alggen tool

Advanced Data Structure: Bioinformatics

•First week: Algorithms for exact string matching.

•Second week: Alignment of sequences.

•Third week: Dealing with long sequences.