Master Course
MSc Bioinformatics for Health Sciences
H15: Algorithms on strings and sequences
Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Dep. de Llenguatges i Sistemes Informàtics
CEPBA-IBM Research Institute
Universitat Politècnica de Catalunya
Fourth lecture:
Sequence assembly
Sequence assembly
It is applied to the following topics:
• EST assembly
• DNA sequencing
• Hybridization: provides information about the l-tuples present in DNA.
DNA sequencing
There are two techniques:
• Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.
• Hybridization: provides information about the l-mers present in DNA.
Hybridization
Let xxxxxxxxxxxxx be the sequence we want to know,
and the hybridization technique gives us the set of 3-mers that belong to it:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
How can the sequence be reconstructed?
Hybridization
Given the 3-mers of the sequence:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
As AAC and ACG belong to the sequence,
then AACG belongs to the sequence,
because the longest proper suffix of AAC (AC) matches the longest proper prefix of ACG (AC).
This relation can be represented with a directed graph: AAC → ACG
Hybridization
Construction of the complete suffix-prefix graph
AAC → ACG → CGG → GGA → GAT → ATT → TTG → TGC → GCC
that gives us the unknown sequence:
AACGGATTGCC
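The construction can be sketched in a few lines of Python. This is only an illustration under the assumption that the suffix-prefix graph is a single chain, as it is for the nine 3-mers of this example; the function names `overlap_graph` and `spell_chain` are ours, not part of the lecture.

```python
def overlap_graph(kmers):
    """Map each l-mer to the l-mers whose (l-1)-prefix equals its (l-1)-suffix."""
    return {u: [v for v in kmers if v != u and u[1:] == v[:-1]] for u in kmers}


def spell_chain(kmers):
    """Spell the sequence when the suffix-prefix graph is a single chain."""
    graph = overlap_graph(kmers)
    has_predecessor = {v for successors in graph.values() for v in successors}
    starts = [u for u in kmers if u not in has_predecessor]
    if len(starts) != 1 or any(len(s) > 1 for s in graph.values()):
        raise ValueError("the graph is not a simple chain")
    node, visited = starts[0], {starts[0]}
    sequence = node
    while graph[node]:
        node = graph[node][0]
        if node in visited:
            raise ValueError("the graph contains a cycle")
        visited.add(node)
        sequence += node[-1]  # each step contributes one new letter
    if len(visited) != len(kmers):
        raise ValueError("some l-mers are not on the chain")
    return sequence


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(spell_chain(kmers))  # AACGGATTGCC
```

As soon as a node has two successors the function refuses to answer, which is exactly the situation that calls for a Hamiltonian path search.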
But is this a realistic case?
Hybridization
Let us introduce a more realistic case:
AAC CAA GAT TGC
ACG CGG GCC TTG
GGC GGA CCG ATT
and the sequence is given by the Hamiltonian path,
that is, the path that traverses all nodes exactly once,
and finding it is an NP-complete problem!
What is the cost of the hybridization method?
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...:
There are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches:
If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path:
NP-complete
Excursion: cost
Cost                        m          10m                    1000m
Linear cost: $O(m)$         t = 1 ms   10t = 10 ms            1000t = 1 s
Quadratic cost: $O(m^2)$    t = 1 ms   100t = 100 ms          1000000t ≈ 16 min
Exponential cost: $O(2^m)$  t = 1 ms   $2^{10}$t ≈ 1 s        $2^{1000}$t ≈ $10^{301}$t ≈ $10^{290}$ years
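The arithmetic behind this comparison can be reproduced with a short Python sketch. It assumes, as the slide does, a running time of t = 1 ms at size m and the simplification that multiplying the input by a factor k multiplies the time by k, k², or 2^k; the names `BASE_MS` and `scaled_times_ms` are ours.

```python
BASE_MS = 1  # assumed running time t (in ms) at input size m


def scaled_times_ms(factor):
    """Running times in ms when the input grows by `factor` (slide's simplification)."""
    return {
        "linear": BASE_MS * factor,
        "quadratic": BASE_MS * factor ** 2,
        "exponential": BASE_MS * 2 ** factor,
    }


for factor in (10, 1000):
    times = scaled_times_ms(factor)
    # The exponential value is printed by its number of digits, since it is huge.
    print(factor, times["linear"], times["quadratic"], len(str(times["exponential"])))
```

Python integers have arbitrary precision, so `2 ** 1000` is computed exactly; printing only its digit count keeps the output readable.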
How can the NP-completeness be avoided?
Hybridization:
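Search for the Hamiltonian path (NP-complete) in the graph whose nodes are the 3-mers:
AAC CAA GAT TGC
ACG CGG GCC TTG
GGC GGA CCG ATT
or search for the Eulerian path (linear) in the graph whose nodes are the 2-mers and whose edges are the 3-mers:
AA AC GG CG GA CC GC TG TT AT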
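As a rough illustration of how the 2-mer graph can be derived from a set of 3-mers, the sketch below turns every l-mer into an edge from its (l-1)-prefix to its (l-1)-suffix. It uses the nine 3-mers of the first example, which is an assumption about which set the 2-mer nodes above were drawn from; the function name `de_bruijn_graph` is ours.

```python
def de_bruijn_graph(kmers):
    """Each l-mer becomes an edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])  # register nodes that have no outgoing edge
    return graph


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
graph = de_bruijn_graph(kmers)
for node in sorted(graph):
    print(node, "->", graph[node])
```

The number of edges equals the number of l-mers, so building this graph takes time linear in the input.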
Hybridization: Eulerian path
Search for the Eulerian path of the graph:
Balanced nodes: indegree = outdegree (traversed nodes)
Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes)
Hybridization: Eulerian path
Algorithm:
1. Construct a random path between the starting and ending nodes.
2. Add cycles from balanced nodes while possible.
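One way to realise these two steps is the stack-based form of Hierholzer's algorithm, sketched below in Python. It is not necessarily the implementation used in the course: the stack walks a path until it gets stuck and then splices in the remaining cycles while backtracking, which has the same effect as the path-plus-cycles description. The names `eulerian_path` and `assemble` are ours, and each l-mer is assumed to appear in the input once per occurrence in the sequence.

```python
def eulerian_path(edges):
    """Return the node list of an Eulerian path for a list of (u, v) edges."""
    out_edges, balance = {}, {}
    for u, v in edges:
        out_edges.setdefault(u, []).append(v)
        out_edges.setdefault(v, [])
        balance[u] = balance.get(u, 0) + 1  # outdegree minus indegree
        balance[v] = balance.get(v, 0) - 1
    starts = [n for n, b in balance.items() if b > 0]
    ends = [n for n, b in balance.items() if b < 0]
    if len(starts) > 1 or len(ends) > 1 or any(abs(b) > 1 for b in balance.values()):
        raise ValueError("no Eulerian path: too many unbalanced nodes")
    start = starts[0] if starts else next(iter(out_edges))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # extend the current walk
        else:
            path.append(stack.pop())  # dead end: emit node, splice cycles on the way back
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("no Eulerian path: the graph is not connected")
    return path


def assemble(kmers):
    """Spell the sequence given by the Eulerian path over (l-1)-mer nodes."""
    path = eulerian_path([(k[:-1], k[1:]) for k in kmers])
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(assemble(kmers))  # AACGGATTGCC
```

Every edge is pushed and popped once, so the running time is linear in the number of l-mers.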
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...:
There are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches:
If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Eulerian path:
Linear cost
Now, what is the limiting factor?
Hybridization: limiting factor
Repeated l-mers:
Given the graph:
AAC CAA GAT TGC
ACG CGG GCC TTG
GGA ATT GAC
How many sequences can be assembled?
CAACGGATTGCC
CAACGGACGGATTGCC
What is the probability of a repeat?
Hybridization: statistical model
Model: random sequence of length N with independent, identically distributed bases (probability 1/4 each).
How can the probability of a repeat be computed?
Given 2 l-mers, the probability that they match is $4^{-L}$.
Given 3 l-mers, the expected number of 2-matches is $\binom{3}{2} 4^{-L}$.
Given m l-mers, the expected number of 2-matches is $\binom{m}{2} 4^{-L}$.
If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^L}$; for L = 8 this gives m ≈ 362!
Conclusion: this technique can be applied only to short sequences.
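The bound can be checked numerically. The sketch below evaluates the expected number of matching pairs and the approximate limit on m under the random-sequence model stated above; the function names `expected_repeats` and `max_lmers` are ours.

```python
from math import comb, sqrt


def expected_repeats(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return comb(m, 2) / 4 ** L


def max_lmers(L):
    """Approximate largest m for which fewer than one repeat is expected."""
    return sqrt(2 * 4 ** L)


for L in (8, 12, 16):
    print(L, round(max_lmers(L)))
```

For L = 8 the bound evaluates to about 362 l-mers, and it only doubles each time L grows by one.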
Hybridization:
Connect to
http://alggen.lsi.upc.edu
and follow the links RESEARCH → SEARCH → MREPATT
Are genome sequences close to random sequences?
Shotgun
With the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
it is possible:
• to make some copies
• to break it into random and unsorted short segments
What can we do?
Shotgun: algorithm
Assume the copies are broken as follows:
xxxxx|xxxxxxx|xxxxxxx|xxxx
xxxxxxxx|xxxxxx|xxxxxx|xxx
xxxx|xxxxxx|xxxxxx|xxxxxxx
The algorithm is:
1st. Compare all pairs searching for approximate suffix-prefix matches.
2nd. Construct the suffix-prefix graph.
3rd. Find the path.
Shotgun
Given the three copies
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
the shotgun process breaks them into the following segments:
accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, tacaggt, taacga, accg, tacctt
Shotgun
The pairwise comparison that searches for approximate suffix-prefix matches can be done with:
• Dynamic programming (quadratic cost)
• Two steps:
  • Find the pairs suspected to be assembled (linear cost with the hash algorithm).
  • Assemble them with dynamic programming.
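The two-step strategy might look as follows in Python. This is a simplified sketch with two loudly stated assumptions: the hash filter indexes every k-mer of every fragment (k = 3 is an arbitrary choice), and the second step checks only exact suffix-prefix overlaps, standing in for the dynamic-programming alignment that approximate matching would need. The names `candidate_pairs` and `overlap` are ours, and the fragments are a few of the lecture's example segments.

```python
from collections import defaultdict
from itertools import permutations


def candidate_pairs(fragments, k=3):
    """Step 1: hash every k-mer; fragments sharing a k-mer become candidate pairs."""
    index = defaultdict(set)
    for fragment in fragments:
        for i in range(len(fragment) - k + 1):
            index[fragment[i:i + k]].add(fragment)
    pairs = set()
    for group in index.values():
        pairs.update(permutations(group, 2))  # ordered pairs (a, b) with a != b
    return pairs


def overlap(a, b, min_len=3):
    """Step 2 (exact stand-in): longest suffix of a equal to a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


fragments = ["accgtacc", "accttta", "tacctt", "tttaac", "taacga",
             "acgatac", "accg", "accgt", "tacaggt", "gataca"]
overlaps = {}
for a, b in candidate_pairs(fragments):
    length = overlap(a, b)
    if length:
        overlaps[(a, b)] = length
print(overlaps[("tttaac", "taacga")])  # 4
```

Only the candidate pairs reach the more expensive second step, which is the point of the filter.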
Shotgun
Given the graph with nodes
accgtacc, accttta, tacctt, tttaac, taacga, acgatac, accg, accgt, tacaggt, gataca
the path gives
accgtacctttaacgatacaggt
but finding the Hamiltonian path has exponential cost!
Shotgun:
New problems arise:
• Consecutive repeats
• Lack of coverage
• …
[Figure: layout of segments along the unknown sequence, including accgaccgt]
Shotgun: properties of the coverage
Given the coverage:
Some questions arise:
• What is the mean length of the contigs?
• How many contigs should we expect?
• What is the percentage of coverage?
Shotgun: percentage of coverage
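Given the model: a sequence of length L and N segments of length d.
Degree of coverage: $N d / L$
We assume that segments are randomly distributed.
The probability that a base is covered by k segments is given by the binomial distribution $(N, d/L)$:
$\mathrm{Prob}\{X=k\} = \binom{N}{k} (d/L)^k (1-d/L)^{N-k}$
Then the probability that at least one segment covers a base is
$\mathrm{Prob}\{X>0\} = 1 - \mathrm{Prob}\{X=0\} = 1 - (1-d/L)^N \approx 1 - e^{-\lambda}$, with $\lambda = N d / L$.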
Shotgun: percentage of coverage
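What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, having $np = \lambda$?
Poisson distribution $P(\lambda)$:
$\mathrm{Prob}\{X=k\} = e^{-\lambda} \lambda^k / k!$
Hence $\mathrm{Prob}\{X>0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$
Then, with $N d / L = 4.6$ we obtain 99% coverage
and with $N d / L = 6.9$ we obtain 99.9% coverage.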
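These figures can be verified with a short Python sketch of the Poisson approximation. The function names `covered_fraction` and `required_depth` are ours, and the segment counts in the usage lines are made-up illustrative values chosen so that N d / L = 4.6.

```python
from math import exp, log


def covered_fraction(n_segments, segment_len, sequence_len):
    """Expected fraction of bases covered by at least one segment (Poisson approximation)."""
    depth = n_segments * segment_len / sequence_len  # the degree of coverage N d / L
    return 1 - exp(-depth)


def required_depth(target):
    """Degree of coverage N d / L needed to cover a fraction `target` of the bases."""
    return -log(1 - target)


print(round(required_depth(0.99), 1))   # 4.6
print(round(required_depth(0.999), 1))  # 6.9
print(round(covered_fraction(46, 100, 1000), 4))
```

Each extra factor of ten in the uncovered fraction costs about 2.3 additional units of coverage, since ln 10 ≈ 2.3.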
Assembly of ESTs
It is the same procedure as shotgun sequencing…
…but with one great advantage:
there are many graphs with a small number of nodes!
Connect to
http://alggen.lsi.upc.es
and follow the links RESEARCH → ESSEM