Bioinformatics PhD course
Bioinformatics
Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
LSI Dep. de Llenguatges i Sistemes Informàtics
BSC Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Contents
1. Biological introduction
2. Comparison of short sequences (up to 10,000 bps)
   Dot matrix, pairwise alignment, multiple alignment, hash algorithms
3. Comparison of large sequences (more than 10,000 bps)
   Data structures, suffix trees, MUMs
4. String matching
   Exact, extended, approximate
5. Sequence assembly
6. Projects: PROMO, MREPATT, …
String matching
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Extended string matching
4. Approximate string matching (Dynamic programming)
• Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002
• Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997
String matching
Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.
On-line algorithms: the patterns are known in advance.
• Only one pattern (exact and approximate)
• Five, ten, a hundred, a thousand, … patterns (exact)
Off-line algorithms: the text is known in advance.
• Suffix trees
Master Course
First part:
(Exact) string matching
String matching: one pattern
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA
and for the pattern TACTACGGTATGACTAA.
How do string matching algorithms perform the search?
String Matching: Brute force algorithm
Given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
String Matching: Brute force algorithm
Example: connect to
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
and open the Brute Force algorithm.
String Matching of one pattern
The cost of the Brute Force algorithm is O(nm). And the expected number of comparisons?
Can the search be made with lower cost?
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
TACTACGGTATGACTAA
• Prefix search
• Suffix search
• Factor search
String matching of one pattern
How do string matching algorithms perform the search?
There is a sliding window along the text against which the pattern is compared:
Text :
Pattern :
At each step the comparison is made and the window is shifted to the right.
Which facts differentiate the algorithms?
1. How the comparison is made.
2. The length of the shift.
String Matching: Brute force algorithm
Text :
Pattern :
• How is the comparison made? From left to right: prefix search.
• Which is the next position of the window? The window is shifted only one cell.
The cost is O(mn).
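The brute force scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the function name `brute_force_search` is an assumption made for the example.

```python
def brute_force_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    for start in range(n - m + 1):      # the window is shifted one cell at a time
        j = 0
        # compare from left to right (prefix search)
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                      # the whole pattern matched
            occurrences.append(start)
    return occurrences


# Text and pattern of the slide example (text cut at the "...").
print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

Each of the n - m + 1 windows may need up to m comparisons, which gives the O(mn) worst case.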
String Matching: one pattern
Most efficient algorithms (Navarro & Raffinot)
[Figure: map of the most efficient algorithms (Horspool, BNDM, BOM) as a function of the length of the pattern (x axis, 2 to 256) and |Σ| (y axis, 2 to 64); the mark w appears on the map.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
String Matching: Horspool algorithm
Text :
Pattern :
• How is the comparison made? From right to left: suffix search.
• Which is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern.
A preprocessing step is therefore needed to determine the length of the shift.
String Matching: Horspool algorithm
Given the pattern ATGTA, the shift table is
A 4
C 5
G 2
T 1
And the search:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
…
Example: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
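The Horspool search of the example can be sketched as follows. This is a minimal sketch under the usual definition of the shift table (distance from the last occurrence of each character, ignoring the final cell of the pattern, to the end of the pattern); the function names are assumptions made for the illustration.

```python
def horspool_shift_table(pattern):
    """Shift of each character that occurs in pattern[:-1]; others shift by m."""
    m = len(pattern)
    # later positions overwrite earlier ones, so the last occurrence wins
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return (occurrences, window start positions visited)."""
    m, n = len(pattern), len(text)
    shift = horspool_shift_table(pattern)
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # compare from right to left (suffix search)
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            occurrences.append(pos)
        # the shift depends only on the last character of the window
        pos += shift.get(text[pos + m - 1], m)
    return occurrences, windows


print(horspool_shift_table("ATGTA"))  # {'A': 4, 'T': 1, 'G': 2}; C is absent, so it shifts by 5
print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
# ([14], [0, 1, 5, 7, 12, 14])
```

The text is cut at the "..." of the slide, so the sketch stops after the window at position 14.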
What happens with many patterns?
String matching: many patterns
Given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
Search for the patterns
ACTGACTGTCTAATT
ACTGATCTTTGTAGCAATACTACATGCACTGA.
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Build the trie of the inverted patterns
[Figure: trie of the inverted patterns GTATGTA, GTAT, TAATA, GTGTA]
2. lmin = 4
3. Table of shifts
A 1
C 4 (lmin)
G 2
T 1
4. Start the search
Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG
[Figure: trie of the inverted patterns, traversed step by step during the search]
The text: ACATGCTATGTGACA…
Table of shifts:
A 1
C 4 (lmin)
G 2
T 1
…
Short shifts!
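The four steps (trie of the inverted patterns, lmin, table of shifts, search) can be sketched in Python as below. This is an illustrative sketch, not the course implementation: the function names are hypothetical, the trie is a plain nested dictionary, and the key "$" is an assumed end-of-pattern marker that must not occur in the text.

```python
def set_horspool_shifts(patterns, alphabet="ACGT"):
    """Return (lmin, shift table) for a set of patterns."""
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in alphabet}
    for p in patterns:
        for i, c in enumerate(p[:-1]):          # the final cell is ignored
            shift[c] = min(shift[c], len(p) - 1 - i)
    return lmin, shift


def build_reversed_trie(patterns):
    """Trie of the inverted patterns as nested dicts; '$' stores a finished pattern."""
    root = {}
    for p in patterns:
        node = root
        for c in reversed(p):
            node = node.setdefault(c, {})
        node["$"] = p
    return root


def set_horspool_search(text, patterns, alphabet="ACGT"):
    """Return (start, pattern) for every occurrence, in order of discovery."""
    lmin, shift = set_horspool_shifts(patterns, alphabet)
    trie = build_reversed_trie(patterns)
    found = []
    end = lmin - 1                               # last cell of the window
    while end < len(text):
        node, j = trie, end
        while j >= 0 and text[j] in node:        # read the text backwards in the trie
            node = node[text[j]]
            j -= 1
            if "$" in node:
                found.append((j + 1, node["$"]))
        end += shift.get(text[end], lmin)
    return found


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(set_horspool_shifts(patterns))   # (4, {'A': 1, 'C': 4, 'G': 2, 'T': 1})
print(set_horspool_search("ACATGCTATGTGACA", patterns))
# [(6, 'TATG'), (7, 'ATGTG')]
```

With one-character shifts bounded by lmin and many patterns, most characters get a shift of 1 or 2, which is the "short shifts" problem noted on the slide.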
Horspool to Wu-Manber
How can we increase the length of the shifts?
With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG:
1 symbol:
A 1
C 4 (lmin)
G 2
T 1
2 symbols:
AA 1
AT 1
GT 1
TA 2
TG 2
AC, AG, CA, CC, CG, … 3 (lmin-l+1)
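The two tables of this slide can be reproduced with the sketch below. It only builds the shift table of l-mers in the way the slide's numbers suggest (every l-mer not ending a pattern gets its distance to the end of the pattern, and all other l-mers get the default lmin-l+1); it is not a full Wu-Manber implementation, and `block_shift_table` is a hypothetical name.

```python
def block_shift_table(patterns, l):
    """Return (default shift, {l-mer: shift}) for blocks of l symbols."""
    lmin = min(len(p) for p in patterns)
    default = lmin - l + 1                 # shift of any l-mer absent from the table
    shift = {}
    for p in patterns:
        m = len(p)
        for i in range(m - l):             # every block except the one ending the pattern
            block = p[i:i + l]
            dist = m - l - i               # distance from the block end to the pattern end
            shift[block] = min(shift.get(block, default), dist)
    return default, shift


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(block_shift_table(patterns, 1))  # (4, {'A': 1, 'T': 1, 'G': 2})
print(block_shift_table(patterns, 2))
# (3, {'AT': 1, 'TG': 2, 'GT': 1, 'TA': 2, 'AA': 1})
```

With blocks of two symbols, 11 of the 16 possible 2-mers keep the default shift of 3, whereas with single symbols only C keeps the maximum shift.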
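Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
[Figure: trie of the inverted patterns]
in the text: ACATGCTATGTGACATAATA
Table of shifts (2 symbols):
AA 1
AT 1
GT 1
TA 2
TG 2
…
Experimental length of the blocks: log_|Σ|(2 · lmin · r), where r is the number of patterns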
String matching of many patterns
[Figure: maps of the most efficient algorithms (Wu-Manber, SBOM) as a function of Lmin (x axis, 5 to 45) and |Σ| (y axis, 2 to 8), in three panels: 5 patterns, 10 patterns and 100 patterns.]
BNDM algorithm
Text :
Pattern :
• How is the comparison made?
• How is the shift determined?
It searches for suffixes of the text window that are factors of the pattern P.
This state is expressed with an array D of bits:
D2 = 1 0 0 0 1 0 0
How can the next state be obtained?
Given the mask B(x) of x, which marks the cells where character x appears in the pattern:
D = (D << 1) & B(x)
If B(x) = ( 0 0 1 1 0 0 0 ) then
D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
BNDM algorithm: example
Given the pattern ATGTA, the mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
BNDM algorithm: example
Given the pattern ATGTA, the mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
Given the text:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A G G):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window (A C G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example
The pattern is ATGTA, the masks are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
and the text:
G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A
Fourth window (A T G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Pattern found!
…
BNDM algorithm
Text :
Pattern :
It searches for suffixes of the text window that are factors of the pattern P.
• How is the comparison made?
This state is expressed with an array D of bits:
D = 1 0 0 0 1 0 0
• How is the shift determined?
If the left bit is set to one in step i, a prefix of P of length i is equal to a suffix of the window. The window is then shifted m-i cells, taking the largest such i smaller than m; otherwise it is shifted m cells.
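The bit-parallel update D = (D << 1) & B(x) and the shift rule can be sketched with Python integers as bit arrays. This is a minimal sketch under the assumption that the leftmost bit of the slides is the most significant bit of the integer; the name `bndm_search` is chosen for the illustration.

```python
def bndm_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    # B[c]: mask with a 1 in every cell where c appears (leftmost cell = highest bit)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1          # m ones
    top = 1 << (m - 1)           # the left bit
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m           # last: shift to apply when the window is abandoned
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)   # read the window from right to left
            j -= 1
            if D & top:          # a prefix of the pattern equals the suffix read so far
                if j > 0:
                    last = j     # remember the shift m - i for this prefix length i
                else:
                    occurrences.append(pos)    # the whole window was read: match
            D = (D << 1) & full
        pos += last
    return occurrences


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # [14]
```

On the slide example the windows start at positions 0, 5, 10 and 14, matching the four D traces shown in the example slides. The sketch relies on Python's unbounded integers; in a language with fixed-size words the pattern length is limited by the word size.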
BOM (Backward Oracle Matching)
Text :
Pattern :
Automaton: Factor Oracle (1999)
• How is the comparison made? It checks whether the suffix of the window is a factor of the pattern.
• How is the shift determined?
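Automaton Factor Oracle: properties
Factor Oracle of the word G T A T G T A
[Figure: factor oracle automaton of G T A T G T A]
It recognizes the factors of the word, for instance G T A T G,
but the automaton also recognizes other strings, such as G T G.
It is therefore useful only for discarding words that are not factors!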
A T G
G T G
T A T G
Suffixes found before.
Suffixes that have not been found before.
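The factor oracle of a word can be built on-line, one character at a time. The sketch below follows the standard construction with a supply (failure) function; the names `build_factor_oracle` and `oracle_accepts` are assumptions made for the illustration.

```python
def build_factor_oracle(word):
    """Return the transitions of the factor oracle: trans[state][char] -> state."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)                 # supply function; -1 for the initial state
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i                 # transition along the spine of the word
        k = supply[i - 1]
        while k != -1 and c not in trans[k]:
            trans[k][c] = i                 # external transition
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def oracle_accepts(trans, s):
    """True if s can be read from the initial state (every state is accepting)."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


oracle = build_factor_oracle("GTATGTA")
print(oracle_accepts(oracle, "GTATG"))  # True: a factor of the word
print(oracle_accepts(oracle, "GTG"))    # True, although GTG is not a factor
print(oracle_accepts(oracle, "GG"))     # False: GG is certainly not a factor
```

A rejected string is guaranteed not to be a factor, while an accepted one may or may not be, which is exactly the property the search exploits.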
BOM: example
• The Factor Oracle of the inverted pattern is built. Given the pattern ATGTATG
[Figure: factor oracle of the inverted pattern G T A T G T A]
• How is the comparison made?
• Search:
G T A C T A G A A T G T G T A G A C A T G T A T G G T G A...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G
…
BOM (Backward Oracle Matching)
Text :
Pattern :
Automaton: Factor Oracle
• How is the comparison made? It checks whether the suffix of the window is a factor of the pattern.
• How is the shift determined? If a is the first mismatch, the window is shifted just past a.
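Putting the two answers together, a BOM search can be sketched as follows: the window is read backwards through the factor oracle of the inverted pattern, and on the first mismatch the window restarts just after the mismatching character. This is an illustrative sketch with hypothetical function names, and it repeats the oracle construction so that it runs on its own.

```python
def build_factor_oracle(word):
    """Transitions of the factor oracle of word: trans[state][char] -> state."""
    m = len(word)
    trans = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k != -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text, pattern):
    """Return (occurrences, window start positions visited)."""
    m, n = len(pattern), len(text)
    trans = build_factor_oracle(pattern[::-1])   # oracle of the inverted pattern
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        while j > 0 and state is not None:       # read the window from right to left
            state = trans[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            # m characters were read: the only word of length m that the oracle
            # accepts is the inverted pattern itself, so this is an occurrence
            occurrences.append(pos)
        pos += j + 1                             # restart just after the mismatch
    return occurrences, windows


print(bom_search("GTACTAGAATGTGTAGACATGTATGGTG", "ATGTATG"))
# ([18], [0, 6, 8, 11, 18, 19])
```

The text is the one of the BOM example, cut before its trailing "...".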
But what happens with many patterns?
SBOM
Text :
Pattern :
Automaton: Factor Oracle
• How is the comparison made? It checks whether the suffix is a factor of any pattern.
• How is the shift determined?
Factor Oracle of many patterns
The AFO of GTATGTA, GTAA, TAATA and GTGTA
[Figure: factor oracle automaton of the four words; the labels 1,4 and 3, 2 mark states of the automaton]
SBOM algorithm
Text :
Patterns:
Automaton ………… of length lmin
• How is the comparison made?
• How is the shift determined?
  • If the character a does not appear in the AFO
  • If lmin characters have been read
SBOM algorithm: example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: factor oracle automaton used for the search; the labels 1 to 4 mark states of the automaton]
in the text: ACATGCTAGCTATAATAATGTATG
…
Exact string matching of many patterns
[Figure: maps of the most efficient algorithms (Wu-Manber, SBOM, Ad AC) as a function of the minimum length (x axis, 5 to 45) and |Σ| (y axis, 2 to 8), in four panels: 5 words, 10 words, 100 words and 1000 words.]
PhD. Course
Second part:
Extended string matching
Extended string matching
1. Classes of characters in the text.
There are characters in the text that represent sets of symbols.
2. Classes of characters in the pattern.
There are characters in the pattern that represent sets of symbols.
There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:
R = {G,A}  Y = {T,C}  K = {G,T}  M = {A,C}  S = {G,C}  W = {A,T}
B = {G,T,C}  D = {G,A,T}  H = {A,C,T}  V = {G,C,A}  N = {A,G,C,T} (any)
Classes in the text
Classes in the text: Horspool example
Given the pattern ATGTA
• the shift table is:
A 4
C 5
G 2
T 1
R 2
…
N 1
The shift of a class is the minimum of the shifts of its symbols: R = {G,A} gives min(2,4) = 2 and N = {A,G,C,T} gives 1.
Given the text:
G T A R T R N A A G G A ...
A T G T A
  A T G T A
      A T G T A
              A T G T A
…
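A sketch of Horspool with IUPAC classes in the text is given below. Two assumptions are made for the illustration: the text contains only the IUPAC codes listed in the table, and a text character matches a pattern character when the pattern character belongs to its class. The function names are hypothetical.

```python
# IUPAC codes for the DNA alphabet; plain symbols are classes of size one.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def class_shift_table(pattern):
    """Horspool shifts for every IUPAC code: the minimum over the symbols of the class."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    for i, c in enumerate(pattern[:-1]):
        base[c] = m - 1 - i
    return {code: min(base[s] for s in symbols) for code, symbols in IUPAC.items()}


def horspool_classes_search(text, pattern):
    """Return (occurrences, window start positions visited)."""
    m, n = len(pattern), len(text)
    shift = class_shift_table(pattern)
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # a text character matches if the pattern character belongs to its class
        while j >= 0 and pattern[j] in IUPAC[text[pos + j]]:
            j -= 1
        if j < 0:
            occurrences.append(pos)
        pos += shift[text[pos + m - 1]]
    return occurrences, windows


table = class_shift_table("ATGTA")
print(table["R"], table["N"])                              # 2 1
print(horspool_classes_search("GTARTRNAAGGA", "ATGTA"))    # ([3], [0, 1, 3, 7])
```

Because a class takes the smallest shift of its members, ambiguous characters such as N reduce the shift to the minimum of the whole table.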
Classes in the text: BOM
Text :
Pattern :
Automaton: Factor Oracle
• How is the next position of the window determined?
• How is the comparison made? It checks whether the suffix is a factor of the pattern.
But first let us analyse how the comparison is made…
Classes in the text: BOM example
• The Factor Oracle of the inverted pattern is built. Given the pattern ATGTATG
[Figure: factor oracle of the inverted pattern G T A T G T A]
• And the search over the text:
G T A R T R N A A T G…
A T G T A T G
• How is the comparison made?
No improvement is possible!
Classes in the text: Set Horspool
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG
[Figure: trie of the inverted patterns]
In the text: ARTGNCTATGTGACA…
No improvement is possible!
Master Course
Third part:
Regular expressions matching
Regular expressions
A regular expression ℛ is a string over Σ ∪ { ε, |, ·, *, (, ) } defined recursively as:
• ε is a regular expression
• A character of Σ is a regular expression
• (ℛ) is a regular expression
• ℛ1 · ℛ2 is a regular expression
• ℛ1 | ℛ2 is a regular expression
• ℛ* is a regular expression
Regular language
The language represented by a regular expression is the set of words that can be built from the regular expression.
The problem of searching for a regular expression in a text is to find all the factors that belong to the corresponding regular language.
Searching for a regular expression
regular expression → parse tree → NFA → search for the occurrences
From the NFA, the search can proceed in two ways:
• DFA: search with a deterministic automaton
• Search with the bit-parallel Thompson algorithm
PhD. Course
Fourth part:
Approximate string matching
Approximate string matching
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA allowing one error…
… but what is the meaning of “one error”?
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT → ACCGAGAT
2. Insertion: ACCGTGAT → ACCGATGAT
3. Deletion: ACCGTGAT → ACCGGAT
Insertions and deletions together are called indels.
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.
d(ACT,ACT)=0  d(ACT,AC)=1  d(ACT,C)=2
d(ACT,)=3  d(AC,ATC)=1  d(ACTTG,ATCTG)=2
Edit distance
The edit distance is related to the best alignment of the strings.
Given d(ACT,ACT)=0, d(ACT,AC)=1 and d(ACTTG,ATCTG)=2, which is the best alignment in each case?
• ACT and ACT:
  ACT
  ACT
• ACT and AC:
  ACT
  AC-
• ACTTG and ATCTG:
  ACTTG        ACT-TG
  ATCTG   or   A-TCTG
Edit distance
But what is the distance between the strings
ACGCTATGCTATACG and ACGGTAGTGACGC?
… and the best alignment between them?
The problem was first discussed in 1966…
and the algorithm was proposed in 1968, 1970, …
using the technique called “Dynamic programming”
Edit distance
[Matrix: the text C T A C T A C T A C G T labels the columns and the pattern A C T G A labels the rows.]
The highlighted cell contains the distance between AC and CTACT.
Edit distance and alignment of strings
The first row and the first column of the matrix hold the cost of aligning a prefix against the empty string:
       C  T  A  C  T  A  C  T  A  C  G  T
    0  1  2  3  4  5  6  7  8  …
 A  1
 C  2
 T  3
 G  …
 A
For instance, the cell of the first row under CTACTA holds 6, the cost of the alignment
 ------
 CTACTA
and the cell of the first column beside ACT holds 3, the cost of the alignment
 ACT
 ---
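The dynamic programming matrix can be filled row by row, keeping only the previous row. The sketch below is a minimal illustration with unit costs for mismatches, insertions and deletions; the function name `edit_distance` is an assumption made for the example.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # first row: distance between the empty prefix of a and each prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]                              # first column: prefix of a against ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                   # delete ca
                cur[j - 1] + 1,                # insert cb
                prev[j - 1] + (ca != cb),      # match (cost 0) or mismatch (cost 1)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("ACT", "ACT"))      # 0
print(edit_distance("ACT", "AC"))       # 1
print(edit_distance("ACTTG", "ATCTG"))  # 2
```

Each cell depends only on its left, upper and upper-left neighbours, so the cost is proportional to the product of the two lengths.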
Edit distance and alignment of strings
Connect to
http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm
and use the global method.
K-approximate string searching
How can this algorithm be applied to approximate search,
that is, to K-approximate string searching?
K-approximate string searching
[Matrix: the text C T A C T A C T A C G T A C T G G T G A A … labels the columns and the pattern A C T G A labels the rows.]
This cell gives the distance between (ACTGA, CT…GTA)…
…but we are only interested in the last characters…
…no matter where they appear in the text:
 * * * * * * C T A C G T A C T G G T G A A …
Then the first row is filled with zeros:
     C T A C T A C T A C G T A C T G G T G A A …
   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 A
 C
 T
 G
 A
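The change from the global to the semi-global matrix is a single line: the first row is filled with zeros, so that an occurrence may start at any position of the text. The sketch below computes the matrix column by column and reports every text position where an occurrence with at most k errors ends; the name `approximate_search` is an assumption made for the illustration.

```python
def approximate_search(text, pattern, k):
    """Return (end position, distance) for every text position where an
    occurrence of pattern with at most k errors ends."""
    m = len(pattern)
    col = list(range(m + 1))        # column of the empty text: pattern prefix against ""
    hits = []
    for pos, ct in enumerate(text):
        new = [0]                   # first row is zero: a match may start anywhere
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                         # skip the text character
                new[i - 1] + 1,                     # skip the pattern character
                col[i - 1] + (pattern[i - 1] != ct) # match or mismatch
            ))
        col = new
        if col[m] <= k:             # last row: the whole pattern has been aligned
            hits.append((pos, col[m]))
    return hits


text = "CTACTACTACGTCTATACTGATCGTAGCTACTACATGC"
print(approximate_search(text, "ACTGA", 0))  # [(20, 0)]
```

With k = 0 the sketch reports only the exact occurrence, which ends at position 20; larger values of k add the positions where ACTGA ends with up to k errors.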
K-approximate string searching
Connect to
http://alggen.lsi.upc.es/docencia/ember/leed/Tfc1.htm
and use the semi-global method.