
Contents First week First week: algorithms for exact string matching: One pattern One pattern: The...

Date post: 19-Jan-2016
Upload: martha-atkinson
Transcript
Page 1: Contents First week First week: algorithms for exact string matching: One pattern One pattern: The algorithm depends on |p| and |  k patterns k patterns:

Contents

• First week: algorithms for exact string matching:

– One pattern: the algorithm depends on |p| and |Σ|

– k patterns: the algorithm depends on k, |p| and |Σ|

• Second week: alignment of sequences.

– Edit distance between two strings: dynamic programming

– Alignment of sequences:

– 2 sequences

– 3 or more sequences

• Third week: dealing with long sequences.

Page 2

Distance between words

What is the distance between the words:

– table, maple
– able, table
– announce, pronounce
– ACCTG, ACTT

… and between

– ACGG, ACTGTGG
– AATCTACTAGCGTACTACTC, ACTACTACGTACTACG

Page 3

Edit distance

We accept three types of errors:

1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT

Insertions and deletions are jointly called indels.

The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.

d(ACT,ACT)=  d(ACT,AC)=  d(ACT,C)=  d(ACT,)=  d(AC,ATC)=  d(ACTTG,ATCTG)=

Page 4

Edit distance

d(ACT,ACT) = 0
d(ACT,AC) = 1
d(ACT,C) = 2
d(ACT,) = 3
d(AC,ATC) = 1
d(ACTTG,ATCTG) = 2

Page 5

Edit distance and alignments

The alignment that gives the distance can be represented:

ACCGTGAT
ACCG-GAT
**** ***

ACCG-TGAT
ACCGATGAT
**** ****

ACCGTGAT
ACCGAGAT
**** ***

And the score of the alignment is the sum of the scores of the columns:

– 0 if both chars are the same
– 1 otherwise

ACCGTGTTATGTGTATG--TGA--AT
ACCG-GAT--GTGT-TGTTTGAGTAT
**** * *  **** **  ***  **
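The column scoring rule above can be sketched in a few lines of Python. The function name `alignment_score` is hypothetical (it does not come from the slides), and gaps are assumed to be written as `-`.

```python
def alignment_score(top: str, bottom: str) -> int:
    """Add up the column scores: 0 for equal characters, 1 otherwise."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(0 if x == y else 1 for x, y in zip(top, bottom))


# The three single-error examples (deletion, insertion, mismatch).
for top, bottom in [("ACCGTGAT", "ACCG-GAT"),
                    ("ACCG-TGAT", "ACCGATGAT"),
                    ("ACCGTGAT", "ACCGAGAT")]:
    print(top, bottom, alignment_score(top, bottom))
```

A gap never equals a letter, so every gap column costs 1, which matches the indel cost of the edit distance.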

Page 6

Edit distance and alignments

But there are many alignments between two sequences. Given ACCG and ACT:

ACCG-
-AC-T

ACCG
AC-T
**

ACCG
ACT-
**

ACCG---
----ACT

Then the edit distance is the score of the best alignment, so we can find the distance by generating all alignments and picking the one with the smallest score.
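As a rough illustration of this brute-force idea, the sketch below enumerates every alignment recursively by choosing the last column, then keeps the smallest score. The names `all_alignments` and `brute_force_distance` are hypothetical. The number of alignments grows very quickly with the lengths, so this is only usable for very short strings.

```python
from typing import Iterator, Tuple


def all_alignments(a: str, b: str) -> Iterator[Tuple[str, str]]:
    """Yield every alignment of a and b; no column ever holds two gaps."""
    if not a and not b:
        yield "", ""
        return
    if a and b:  # last column pairs the two final characters
        for x, y in all_alignments(a[:-1], b[:-1]):
            yield x + a[-1], y + b[-1]
    if a:        # last column pairs the final character of a with a gap
        for x, y in all_alignments(a[:-1], b):
            yield x + a[-1], y + "-"
    if b:        # last column pairs a gap with the final character of b
        for x, y in all_alignments(a, b[:-1]):
            yield x + "-", y + b[-1]


def column_score(x: str, y: str) -> int:
    """0 for each column with equal characters, 1 for every other column."""
    return sum(p != q for p, q in zip(x, y))


def brute_force_distance(a: str, b: str) -> int:
    """Edit distance as the score of the best of all alignments."""
    return min(column_score(x, y) for x, y in all_alignments(a, b))


print(brute_force_distance("ACCG", "ACT"))
```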

Page 7

Edit distance and Pairwise alignment

Given two DNA sequences $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$ from the alphabet {a,c,t,g}, we say that $A^*$ and $B^*$ from {a,c,t,g,-} are aligned iff

i) $A^*$ and $B^*$ become $A$ and $B$ if gaps (-) are removed.

ii) $|A^*| = |B^*|$

iii) For all i, it is not possible that $a^*_i = b^*_i = -$ (no column consists of two gaps).

Write all alignments between AA and AC ...
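The three conditions translate directly into a small check. This is a minimal sketch; the function name `is_alignment` and the choice of `-` as the gap symbol are assumptions made for illustration.

```python
def is_alignment(a_star: str, b_star: str, a: str, b: str, gap: str = "-") -> bool:
    """Check conditions i), ii) and iii) for a candidate pair (A*, B*)."""
    # i) removing the gaps gives back the original sequences
    restores = a_star.replace(gap, "") == a and b_star.replace(gap, "") == b
    # ii) both gapped sequences have the same length
    same_length = len(a_star) == len(b_star)
    # iii) no column consists of two gaps
    no_double_gap = all(not (x == gap and y == gap)
                        for x, y in zip(a_star, b_star))
    return restores and same_length and no_double_gap


print(is_alignment("A-A", "AC-", "AA", "AC"))  # a valid candidate
print(is_alignment("A-A", "A-C", "AA", "AC"))  # second column is gap/gap
```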

Page 8

Edit distance and Pairwise alignment

To blackboard

Page 9

Edit distance and alignment of strings

    C T A C T A C T A C G T
A
C
T
G
A

Page 11

Edit distance and alignment of strings

    C T A C T A C T A C G T
A
C
T
G
A

The cell in row C and in the fifth column (the second T) contains the distance between AC and CTACT.

Page 12

Edit distance and alignment of strings

    C T A C T A C T A C G T
  ?
A
C
T
G
A

Page 13

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 ?
A
C
T
G
A

Page 14

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 ?
A
C
T
G
A

-
C

Page 15

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 ?
A
C
T
G
A

--
CT

Page 16

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A
C
T
G
A

------
CTACTA

Page 17

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A ?
C ?
T ?
G
A

Page 18

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A 1
C 2
T 3
G …
A

ACT
---

Page 19

Edit distance and alignment of strings

    C T A C T A C T A C G T
  0 1 2 3 4 5 6 7 8 …
A 1
C 2
T 3
G
A

The best alignment BA(AC,CTAC) is the best of three candidates:

– BA(AC,CTA) followed by the column (-, C)
– BA(A,CTA) followed by the column (C, C)
– BA(A,CTAC) followed by the column (C, -)

$$d(AC, CTAC) = \min \{\, d(AC, CTA) + 1,\; d(A, CTA),\; d(A, CTAC) + 1 \,\}$$
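Applying the same three-way minimum to every cell fills the whole table. The sketch below is one way to write it in Python. The names `edit_distance_table` and `edit_distance` are hypothetical, and rows index prefixes of the first string while columns index prefixes of the second, as in the grids above.

```python
from typing import List


def edit_distance_table(a: str, b: str) -> List[List[int]]:
    """d[i][j] is the edit distance between a[:i] and b[:j]."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        d[0][j] = j          # empty prefix of a against b[:j]: j gaps
    for i in range(1, n + 1):
        d[i][0] = i          # a[:i] against the empty prefix of b: i gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i][j - 1] + 1,        # column (-, b[j-1])
                          d[i - 1][j - 1] + sub,  # column (a[i-1], b[j-1])
                          d[i - 1][j] + 1)        # column (a[i-1], -)
    return d


def edit_distance(a: str, b: str) -> int:
    return edit_distance_table(a, b)[len(a)][len(b)]


table = edit_distance_table("ACTGA", "CTACTACTACGT")
print(table[0][:9])   # first row of the grid
print(table[2][4])    # the cell d(AC, CTAC)
```

Each cell is computed once from three neighbours, so time and space are proportional to the product of the two lengths.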

Page 20

Bioinformatics

Pairwise alignment

Page 21

Best alignment

How can an alignment be scored?

Catcactactgacgactatcgtagcgcggctat acatctacgccaa- ctac-t-gtgtagatcgccgg

c-tgactgc-- acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----

* * *** * ************* ********* **** ******* * **** ** * ***

• Match: favorable

• Mismatch: unfavorable

• Gap: worst case

Then we assign a score to each case, for example 1, -1 and -2 respectively.

Page 22

Pairwise alignment

Edit distance: match = 0, mismatch = 1, indel = 1

$$d(AC, CTAC) = \min \{\, d(A, CTAC) + 1,\; d(A, CTA) + 0,\; d(AC, CTA) + 1 \,\}$$

Similarity: match = 1, mismatch = -1, indel = -2

$$s(AC, CTAC) = \max \{\, s(A, CTAC) - 2,\; s(A, CTA) + 1,\; s(AC, CTA) - 2 \,\}$$

The middle term adds the match value because both prefixes end in C; if the final characters differed it would add the mismatch value instead (+1 for the distance, -1 for the similarity).
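The similarity version differs from the distance version only in the scores and in taking a maximum. A minimal sketch, with the hypothetical function name `similarity` and the example scores 1, -1 and -2 as default parameters:

```python
def similarity(a: str, b: str, match: int = 1, mismatch: int = -1,
               indel: int = -2) -> int:
    """Score of the best global alignment of a and b (higher is better)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel      # a[:i] aligned only with gaps
    for j in range(1, m + 1):
        s[0][j] = j * indel      # b[:j] aligned only with gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            col = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + indel,    # column (a[i-1], -)
                          s[i - 1][j - 1] + col,  # column (a[i-1], b[j-1])
                          s[i][j - 1] + indel)    # column (-, b[j-1])
    return s[n][m]


print(similarity("ACT", "AC"))
# With scores 0, -1, -1 the result is the negated edit distance.
print(similarity("ACTTG", "ATCTG", match=0, mismatch=-1, indel=-1))
```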

Page 23

Pairwise alignment

Connect to alggen tool

Page 24

Best alignment

  accaccacaccacaacgagcata … acctgagcgatat
a
c
c
.
.
t

Given the maximum score, how can the best alignment be found?

• Quadratic cost in space and time

• Feasible for sequences of up to 10,000 bp in length

Download alggen tool
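One standard way to recover an alignment from the filled table is a traceback: start at the last cell and repeatedly step to a neighbour that explains the current score. The slides do not spell this procedure out, so the sketch below is an assumption about the intended answer. The name `best_global_alignment` is hypothetical and the default scores are the example values 1, -1 and -2.

```python
from typing import Tuple


def best_global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                          indel: int = -2) -> Tuple[int, str, str]:
    """Return (score, gapped a, gapped b) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            col = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + col,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)

    # Traceback: walk from (n, m) to (0, 0), building the columns backwards.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        from_diagonal = (
            i > 0 and j > 0 and
            s[i][j] == s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        )
        if from_diagonal:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:  # the score came from the left neighbour
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = best_global_alignment("ACCGTGAT", "ACCGGAT")
print(score)
print(top)
print(bottom)
```

The whole table is kept in memory for the traceback, which is where the quadratic space cost comes from. When several neighbours explain a score, this sketch prefers the diagonal, so it returns only one of the possibly many best alignments.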

Page 25

Some preconceived ideas

We have developed the theory according to the following principles:

1) Both sequences have a similar length (global).

2) The gap model is linear: k consecutive gaps are penalized with k·(-2).

Page 26

Semiglobal pairwise alignment

Assume that we have two sequences, S1 and S2, of different lengths.

It is meaningless to introduce gaps until both sequences have similar length ….

The most probable alignment should be one in which the shorter sequence is preceded by initial gaps and followed by final gaps.

How can these alignments be found?

Page 27

Semiglobal pairwise alignment

    C T A C T A C T A C G T
A
C
T

[Figure: the grid is annotated with the initial gaps and the final gaps.]

Page 28

Semiglobal pairwise alignment

    C T A C T A C T A C G T
  0 0 0 0 0 0 0 0 0 0 0 0 0
A
C
T

Given a cell of the first row: the cell contains the score of the best alignment of a prefix such as CTA with the empty sequence.

Page 29

Semiglobal pairwise alignment

    C T A C T A C T A C G T
  0 0 0 0 0 0 0 …
A
C
T

The contribution of the initial gaps is disregarded, then

    C T A C T A C T A C G T
  0 0 0 0 0 0 0 …
A 1
C 2
T 3

but what happens with the final gaps?

Page 30

Semiglobal pairwise alignment

    C T A C T A C T A C G T
  0 0 0 0 0 0 0 …
A 1
C 2
T 3

… they are handled by checking the last row for the best score.

How does the algorithm search for the best alignment?
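The two changes (a first row of zeros and a scan of the last row) can be sketched as follows. This version follows the grids above in using distance-style scores, where the first column is 1, 2, 3 and the best score is the smallest one; a similarity version would use the match, mismatch and indel scores and take maxima instead. The function name `semiglobal_distance` and the returned end position are assumptions for illustration.

```python
from typing import Tuple


def semiglobal_distance(short: str, long: str) -> Tuple[int, int]:
    """Best distance between `short` and any substring of `long`.

    Returns (distance, end), where `end` is the first column of the last
    row that holds the best value, i.e. the alignment ends at long[:end].
    """
    n, m = len(short), len(long)
    prev = [0] * (m + 1)              # first row: initial gaps cost nothing
    for i in range(1, n + 1):
        cur = [i] + [0] * m           # first column: i gaps, cost i
        for j in range(1, m + 1):
            sub = 0 if short[i - 1] == long[j - 1] else 1
            cur[j] = min(cur[j - 1] + 1, prev[j - 1] + sub, prev[j] + 1)
        prev = cur
    best = min(prev)                  # final gaps are free: scan the last row
    return best, prev.index(best)


print(semiglobal_distance("ACT", "CTACTACTACGT"))
print(semiglobal_distance("ACTGA", "CTACTACTACGT"))
```

Only two rows are kept here because the sketch reports a score and an end position; recovering the alignment itself would need the full table and a traceback that stops when it reaches the first row.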

Page 31

Affine-gap model score

Given the following alignments that have the same score …

a g t a c c c c g t a g

a g t - c c - - g t a -

a g t a c c c c g t a g

a g t - c - c - g t a -

a g t a c c c c g t a g

a g t - c - - c g t a -

a g t a c c c c g t a g

a g t - - c c - g t a -

a g t a c c c c g t a g

a g t - - c - c g t a -

a g t a c c c c g t a g

a g t - - - c c g t a -

Which one is the most reliable case from a biological point of view?

Page 32

Affine-gap model score

Then, how can we distinguish between consecutive gaps and separated gaps?

a g t a c c c c g t a g

a g t - - c - c g t a -

a g t a c c c c g t a g

a g t - - - c c g t a -

By penalizing the opening of a gap (OG) more heavily than its extension (EG), for instance with -10 and -0.5.

Then, the penalty of k consecutive gaps becomes

OG + (k-1) EG

which is an affine-gap function.

How is the best alignment found?
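To see the effect of the affine penalty, the sketch below scores two given alignments, charging OG for the first gap of each run and EG for each further gap. The function names are hypothetical; match = 1 and mismatch = -1 are carried over from the earlier similarity example, and OG = -10, EG = -0.5 are the example values above.

```python
def gap_penalty(k: int, og: float = -10.0, eg: float = -0.5) -> float:
    """Penalty of k consecutive gaps: OG + (k - 1) * EG."""
    return og + (k - 1) * eg


def affine_alignment_score(top: str, bottom: str, match: float = 1.0,
                           mismatch: float = -1.0, og: float = -10.0,
                           eg: float = -0.5) -> float:
    """Score a given alignment under the affine-gap model."""
    score = 0.0
    gap_row = None                  # row holding the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            score += eg if gap_row == row else og   # extend or open a run
            gap_row = row
        else:
            score += match if x == y else mismatch
            gap_row = None
    return score


separated = affine_alignment_score("agtaccccgtag", "agt--c-cgta-")
consecutive = affine_alignment_score("agtaccccgtag", "agt---ccgta-")
print(separated, consecutive)
```

Both alignments have eight matches and four gaps, so the linear model cannot tell them apart; under the affine model the one with fewer gap runs scores higher.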

Page 33

Affine-gap model score

    C T A C T A C T A C G T
A
C
T
G
A

Smallest arrows: refer to the introduction of an opening gap.

Largest arrows: refer to the introduction of an extension gap.

But from which cell do the largest arrows originate?
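One common way to answer this question is Gotoh's three-table formulation, which the slides do not name, so treat it as an assumption about the intended method. Each cell keeps three values, depending on whether the last column is a pair of characters or a gap in one of the rows. An extension then comes from the gap table of the neighbouring cell, while an opening comes from one of the other two tables. The function name `affine_similarity` and the default scores are illustrative.

```python
NEG_INF = float("-inf")


def affine_similarity(a: str, b: str, match: float = 1.0, mismatch: float = -1.0,
                      og: float = -10.0, eg: float = -0.5) -> float:
    """Best global alignment score under the affine-gap model."""
    n, m = len(a), len(b)
    # M: last column pairs a[i-1] with b[j-1]
    # X: last column pairs a[i-1] with a gap
    # Y: last column pairs a gap with b[j-1]
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = og + (i - 1) * eg
    for j in range(1, m + 1):
        Y[0][j] = og + (j - 1) * eg
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            col = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + col
            X[i][j] = max(M[i - 1][j] + og,   # open a gap after a pair
                          Y[i - 1][j] + og,   # open a gap in the other row
                          X[i - 1][j] + eg)   # extend the current gap run
            Y[i][j] = max(M[i][j - 1] + og,
                          X[i][j - 1] + og,
                          Y[i][j - 1] + eg)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_similarity("AAAA", "AA"))
```

Setting OG equal to EG makes the three tables collapse into the linear gap model, which gives a quick consistency check against the earlier recurrence.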

Page 34

Local alignment

Given two sequences, we can consider the alignments of all their substrings…

…how can the best of them be found?

Two questions arise:

- how can the alignments be compared?

- how can the best one be selected?
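A widely used answer is the Smith-Waterman scheme, which the slides do not name, so it is offered here as an assumption. Alignments are compared by their similarity score, a cell is never allowed to drop below 0 (a local alignment may start anywhere), and the best one is selected by taking the maximum over all cells. The function name `local_similarity` and the default scores 1, -1 and -2 are illustrative.

```python
from typing import Tuple


def local_similarity(a: str, b: str, match: int = 1, mismatch: int = -1,
                     indel: int = -2) -> Tuple[int, Tuple[int, int]]:
    """Best score over all pairs of substrings, and the cell where it ends."""
    n, m = len(a), len(b)
    best, best_end = 0, (0, 0)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            col = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                   # start a new local alignment here
                         prev[j - 1] + col,
                         prev[j] + indel,
                         cur[j - 1] + indel)
            if cur[j] > best:
                best, best_end = cur[j], (i, j)
        prev = cur
    return best, best_end


print(local_similarity("TTTACGTTT", "GGACGGG"))
```

The returned pair (i, j) says that the best local alignment ends at a[:i] and b[:j]; with a full table, a traceback from that cell back to the first 0 would recover the aligned substrings.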

Page 35

Bioinformatics

Multiple alignment

Page 36

Pairwise to multiple alignment

What happens with three strings S1, S2 and S3?

Let n be their length; then the cost becomes

$$O(n^3) \cdot O(2^3) \cdot O(3^2)$$

And with k strings?

$$O(n^k \, 2^k \, k^2)$$
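The three factors can be read off a direct generalization of the pairwise recurrence. The sketch below assumes sum-of-pairs column scoring, which the slides do not state: the table has one cell per tuple of prefixes (n^k cells), each cell looks at every non-empty choice of which strings advance (2^k - 1 predecessors), and scoring one column compares all pairs of rows (about k^2 pairs). All names are hypothetical and the scores are the example values 1, -1 and -2.

```python
from itertools import product
from typing import Sequence


def sp_column(col: Sequence[str], match: int = 1, mismatch: int = -1,
              indel: int = -2) -> int:
    """Sum-of-pairs score of one column; a gap/gap pair scores 0."""
    total = 0
    for p in range(len(col)):
        for q in range(p + 1, len(col)):
            x, y = col[p], col[q]
            if x == "-" and y == "-":
                continue
            total += indel if "-" in (x, y) else (match if x == y else mismatch)
    return total


def multiple_similarity(seqs: Sequence[str]) -> int:
    """Best sum-of-pairs score of a global alignment of all sequences."""
    k = len(seqs)
    moves = [mv for mv in product((0, 1), repeat=k) if any(mv)]  # 2^k - 1 moves
    origin = (0,) * k
    table = {origin: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been computed when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        best = None
        for mv in moves:
            prev = tuple(c - d for c, d in zip(cell, mv))
            if min(prev) < 0:
                continue
            col = [s[c - 1] if d else "-" for s, c, d in zip(seqs, cell, mv)]
            cand = table[prev] + sp_column(col)
            if best is None or cand > best:
                best = cand
        table[cell] = best
    return table[tuple(len(s) for s in seqs)]


print(multiple_similarity(["ACT", "ACT", "ACT"]))
```

With two sequences this reduces to the pairwise similarity; with more, the n^k factor quickly makes the exact computation impractical, which is why heuristics are used in practice.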

Page 37

Multiple alignment

Multiple alignment programs use different heuristics:

Clustal (progressive alignment)

http://www.ebi.ac.uk/clustalw

TCoffee (progressive alignment + databases)

http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi

HMM (Hidden Markov Models)

Page 38

Multiple alignment

Connect to alggen tool

Page 39

Advanced Data Structure: Bioinformatics

• First week: Algorithms for exact string matching.

• Second week: Alignment of sequences.

• Third week: Dealing with long sequences.

