DNA Fragment Assembly
CIS 667, Spring 2004, February 18
Objectives
• The problem: DNA Fragment Assembly
  – The ideal case
  – The complications
• Models
  – The Shortest Common Superstring
  – Reconstruction
  – Multicontig
• A greedy algorithm
• Heuristics
The Problem
• Assumption: We know the approximate length of the target sequence
• The problem: Given a set of fragments from a DNA molecule, we want to deduce the whole sequence of the DNA. We determine only one of the strands of the original molecule
The ideal case
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT
2. Total length: approximately 10 bp
Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _
T T A C C G T G C (consensus by majority of votes)
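The majority vote over the columns of a layout can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name `consensus` and the convention that `_` marks an uncovered position are assumptions made here, and ties are broken arbitrarily by first occurrence.

```python
from collections import Counter

def consensus(layout, gap="_"):
    """Majority-vote consensus of equal-length layout rows (hypothetical helper)."""
    result = []
    for column in zip(*layout):
        # Count only the bases that actually cover this column.
        votes = Counter(c for c in column if c != gap)
        result.append(votes.most_common(1)[0][0] if votes else gap)
    return "".join(result)

ideal = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
print(consensus(ideal))
```

The same function can be applied to a layout containing a substituted base: the wrong base is outvoted as long as enough correct fragments cover that column.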
Complications
1. real problem instance is very large
2. errors
  • substitutions
  • insertions
  • deletions
  • chimeras
3. unknown orientation of the fragments
4. repeated regions
  • causes ambiguity in sequencing
5. lack of coverage
  • causes gaps
Errors: Substitution
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TGCCGT (substitution)
2. Total length: approximately 10 bp
Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T G C C G T _ _
T T A C C G T G C (consensus by majority of votes)
Errors: Insertion
Input:
1. The set of fragments: ACCGT, CAGTGC (insertion), TTAC, TACCGT
2. Total length: approximately 10 bp
Output:
_ _ A C C * G T _ _
_ _ _ _ C A G T G C
T T A C _ * _ _ _ _
_ T A C C * G T _ _
T T A C C * G T G C (consensus by majority of votes)
Errors: Chimeras
Input:
1. The set of fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
2. Total length: approximately 10 bp
Output:
_ _ A C C G T _ _
_ _ _ _ C G T G C
T T A C _ _ _ _ _
_ T A C C G T _ _
T T A C C G T G C (consensus)
T T A _ _ _ T G C (chimera)
A chimera arises when two regular fragments from distinct parts of the target molecule join end to end.
Remedy: recognize them before use!
Repeated Regions
• Unknown orientation with no errors
• Unknown orientation with errors
• Repeated regions cause ambiguity
P X Q X R X S
P X R X Q X S
Direct repeat
• Direct repeats
• Inverted repeats are more complex: repeated regions in opposite strands
P X Q Y R X S Y
P X S Y R X Q Y
Lack of coverage
• causes formation of gaps
• compute the mean coverage: add up the lengths of all the fragments and divide by the target length
• insufficient coverage is remedied by sampling more fragments
• How many fragments do I need?
• Assume
  – all fragments have the same length l
  – t is the safe overlap: two fragments are joined only if they overlap by at least t bases
  – n is the number of fragments
  – T is the target length
Apparent contigs: $p = n e^{-n(l-t)/T}$
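The two quantities on this slide are easy to evaluate numerically. The sketch below uses the symbols n, l, t, T from the slide; the function names and the sample numbers are assumptions chosen only for illustration.

```python
import math

def mean_coverage(n, l, T):
    """Total fragment length divided by the target length."""
    return n * l / T

def apparent_contigs(n, l, t, T):
    """Expected number of apparent contigs, p = n * exp(-n (l - t) / T)."""
    return n * math.exp(-n * (l - t) / T)

# Hypothetical numbers: 50 kb target, 500-base fragments, safe overlap of 25.
for n in (100, 300, 1000):
    print(n, mean_coverage(n, 500, 50_000), apparent_contigs(n, 500, 25, 50_000))
```

Under these assumed values the expected number of contigs first rises with n and then falls below one as the coverage grows, which is one way to answer "how many fragments do I need?".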
Shortest Common Superstring
Input: A collection F of strings
Output: A shortest possible string S such that, for every $f \in F$, S is a superstring of f
Example: F = {ATG, TGC, GCC}, S = ATGCC
Question: Is it the shortest?
Observe: u = ATG and v = GCC overlap in G, giving ATGCC, and TGC is a substring of ATGCC
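For a collection this small the question "is it the shortest?" can be settled by exhaustive search. The following is a brute-force sketch, assuming exact matches and known orientation; it tries every ordering and is only usable for a handful of fragments. The function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_common_superstring(fragments):
    # Fragments contained in another fragment never lengthen the superstring.
    frags = sorted(f for f in set(fragments)
                   if not any(f != g and f in g for g in fragments))
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, cur in zip(order, order[1:]):
            s += cur[overlap(prev, cur):]  # merge on the maximal overlap
        if best is None or (len(s), s) < (len(best), best):
            best = s
    return best

print(shortest_common_superstring(["ATG", "TGC", "GCC"]))
```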
Shortest Common Superstring
Is it a good problem?
Advantages:
• The problem finds the PERFECT superstring
• Good for most ideal cases
Disadvantages:
• The problem does not deal with errors
• Good only in some ideal cases
  – even with no errors and known orientation, it fails in the presence of repeats
  – repeated identical copies get absorbed in the search for the SHORTEST superstring, which produces an assembly of uneven coverage
• It does not consider lack of coverage or the size of the target
• NP-hard
Reconstruction
Objective: We want to consider errors and unknown orientation
Substring Edit Distance
• $d_s(a,b) = \min_{s \in S(b)} d(a,s)$, where $S(b)$ is the set of substrings of b and d is the edit distance
  – one unit is charged for each insertion, deletion, or substitution
  – no charge for deletions at the extremities of the 2nd sequence
Example: u = CGATGT, v = AACTAATGTGC
_ _ C G A * T G T _ _
A A C T A A T G T G C
$d_s(u,v) = 2$
• A string f is an approximate substring of S at error level $\epsilon$ (between 0 and 1) when $d_s(f,S) \le \epsilon |f|$
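The substring edit distance can be computed with the usual edit-distance table in which the first row is all zeros, so that skipping a prefix of the second sequence costs nothing, and the answer is the minimum of the last row, so that skipping a suffix is free as well. This is a sketch of that standard dynamic program; the function names are assumptions.

```python
def substring_edit_distance(a, b):
    """d_s(a, b): minimum edit distance between a and any substring of b."""
    prev = [0] * (len(b) + 1)          # empty prefix of a matches anywhere for free
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)      # a[:i] against nothing: i deletions
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            curr[j] = min(substitute, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)                   # a may end anywhere inside b

def is_approximate_substring(f, s, eps):
    """True when d_s(f, s) <= eps * |f|."""
    return substring_edit_distance(f, s) <= eps * len(f)

print(substring_edit_distance("CGATGT", "AACTAATGTGC"))
```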
Reconstruction
Input: A collection F of strings, an error tolerance $\epsilon$ with $0 \le \epsilon \le 1$
Output: A shortest possible string S such that, for every $f \in F$, we have
$\min(d_s(f,S), d_s(\bar{f},S)) \le \epsilon |f|$, where $\bar{f}$ is the reverse complement of f
Advantage: takes into account errors and unknown orientation
Disadvantages:
• It is an NP-hard problem
• It does not model repeats
• It does not consider lack of coverage or the size of the target
Multicontig
Objective: We want to consider internal linkage
No special assumptions except:
• for known orientation, a fragment and its reverse complement are not both present in the collection.
• We want to have good linkage (overlap between fragments)
  – An overlap is a link if it is not (properly) contained in a bigger fragment
  – The smallest size of a link in a layout is called the weakest link
  – A layout is a t-contig if its weakest link is at least of size t
  – We partition F into the minimum number of collections which admit a t-contig
Multicontig
Idea: Let's partition F into the minimum number of t-contigs!
Example: F = {GTAG, TAATG, TGTAA}
• for t=3: F1 = {TAATG, TGTAA} and F2 = {GTAG}
• for t=2 we have two solutions, both with F1 = {TAATG, TGTAA} and F2 = {GTAG}:
  1. TGTAA followed by TAATG (overlap TAA)
  2. TAATG followed by TGTAA (overlap TG)
• for t=1 we have the desired solution (the minimum): F1 = {TAATG, TGTAA, GTAG}
For errors, we use the consensus of the multi-alignment and insist that the edit distance between each fragment and the consensus be small
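The example can be checked with a few lines of code. The sketch below makes a simplifying assumption that is not on the slide: no fragment is contained in another and there are no errors, so every overlap between consecutive fragments of an ordered layout counts as a link. The helper names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def weakest_link(layout):
    """Smallest overlap between consecutive fragments of an ordered layout."""
    return min(overlap(a, b) for a, b in zip(layout, layout[1:]))

def is_t_contig(layout, t):
    # A single fragment has no links to violate the threshold.
    return len(layout) == 1 or weakest_link(layout) >= t

print(weakest_link(["TGTAA", "TAATG"]))          # layout used for t = 3
print(weakest_link(["TAATG", "TGTAA"]))          # alternative layout for t = 2
print(weakest_link(["TGTAA", "TAATG", "GTAG"]))  # single contig for t = 1
```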
Multicontig
Input: A collection F of strings, an integer $t \ge 0$, and an error tolerance $\epsilon$ with $0 \le \epsilon \le 1$
Output: A partition of F into the minimum number of subcollections $C_i$, $1 \le i \le k$, such that every $C_i$ admits a t-contig with an $\epsilon$-consensus
Advantages:
• takes into account errors and unknown orientation
• takes into account internal linkage of the fragments
  – the answer is formed by several contigs
Disadvantages:
• It is an NP-hard problem even in the simplest case of no errors and known orientation
  – it contains as a special case finding a Hamiltonian path in a restricted class of graphs
• It has no provision to use information on the approximate size of the target
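Overlap Multigraph
t-Overlap: a and b have a t-overlap when the last t characters of a equal the first t characters of b:
suffix(a,t) = prefix(b,t)
Equivalently, a without its last t characters followed by b equals a followed by b without its first t characters; or, a without its first |a| - t characters equals b without its last |b| - t characters.
Idea: Why don't we search for a path in the overlap multigraph that covers all the vertices?
Figure: overlap multigraph with vertices CTAAAG, TACGG, GGACAG, GCCC; edges are labelled with overlap lengths (2 and 1).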
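The edges of the overlap multigraph can be enumerated directly from the definition suffix(a,t) = prefix(b,t). The sketch below lists one edge per pair of distinct fragments and per overlap length t >= 1; leaving out zero-weight edges and self-loops is a choice made here for brevity, and the function names are assumptions.

```python
def t_overlaps(a, b):
    """All t >= 1 such that the last t characters of a equal the first t of b."""
    return [t for t in range(1, min(len(a), len(b)) + 1) if a[-t:] == b[:t]]

def overlap_multigraph(fragments):
    """One directed edge (a, b, t) per t-overlap between distinct fragments."""
    return [(a, b, t)
            for a in fragments
            for b in fragments if a != b
            for t in t_overlaps(a, b)]

for edge in overlap_multigraph(["CTAAAG", "TACGG", "GGACAG", "GCCC"]):
    print(edge)
```

Two edges run from TACGG to GGACAG (t = 1 and t = 2), which is why the structure is a multigraph rather than a simple graph.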
Theoretical results
Theorem: the total length of A (the set of fragments) is
$\|A\| = w(P) + |S(P)|$
where
• $\|A\| = \sum_{a \in A} |a|$
• w(P) is the weight of the path P
• |S(P)| is the length of the superstring derived from P
To convince yourself, read the proof in the book.
Other theoretical results: Looking for the shortest common superstring is the same as looking for the Hamiltonian path of maximum weight in a directed multigraph.
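The identity can be checked numerically on a small path. The sketch below assumes exact overlaps and uses the maximal overlap between consecutive fragments as the edge weight; the fragments ATG, TGC, GCC and the helper names are chosen here only for illustration.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def path_summary(path):
    """Return (||A||, w(P), |S(P)|) for fragments visited in the given order."""
    weight = sum(overlap(a, b) for a, b in zip(path, path[1:]))
    superstring = path[0]
    for prev, cur in zip(path, path[1:]):
        superstring += cur[overlap(prev, cur):]  # append only the new characters
    total = sum(len(a) for a in path)
    return total, weight, len(superstring)

total, weight, length = path_summary(["ATG", "TGC", "GCC"])
print(total, weight, length)  # total equals weight + length
```

Since the total length is fixed by the input, maximizing the path weight and minimizing the superstring length are the same goal.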
The Greedy Methodology
No polynomial-time algorithm is known for NP-hard problems, but we can look for approximate solutions in reasonable time
To apply a greedy methodology:
1. the problem must show optimal substructure
  – A problem exhibits optimal substructure if an optimal solution to the problem contains within it optimal solutions to subproblems
2. the optimal solution is reached by taking the best "local" choice
Overlap Graph
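An overlap graph keeps, for each pair of vertices, only the edge of maximum weight
Figure: overlap graph with vertices CTAAAG, TACGA, GACA, ACCC; edges are labelled with overlap lengths (2 and 1).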
The Greedy Algorithm
input: weighted di-graph OG(F) with n vertices
output: Hamiltonian path in OG(F)
// Initialize
for i ← 1 to n do
    in[i] ← 0     // how many selected edges enter i
    out[i] ← 0    // how many selected edges exit i
    MakeSet(i)
// Process
Sort the edges by weight, heaviest first
for each edge (f,g) in this order do
    // test for acceptance
    if in[g] = 0 and out[f] = 0 and FindSet(f) ≠ FindSet(g)
        select (f,g)
        in[g] ← 1
        out[f] ← 1
        Union(FindSet(f), FindSet(g))
    if there is only one component
        break
return selected edges
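A runnable sketch of the pseudocode follows, under some assumptions made here: overlaps are exact, the overlap graph is complete (pairs with no overlap get a zero-weight edge, so a Hamiltonian path always exists), ties between equal weights are broken by input order, and the union-find is a plain parent array. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_path(fragments):
    """Select edges heaviest first; returns a list of (f, g, weight) index triples."""
    n = len(fragments)
    parent = list(range(n))                      # MakeSet for every vertex

    def find(i):                                 # FindSet with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = [(overlap(fragments[f], fragments[g]), f, g)
             for f in range(n) for g in range(n) if f != g]
    edges.sort(key=lambda e: -e[0])              # stable sort, heaviest first
    has_in, has_out, selected = [False] * n, [False] * n, []
    for w, f, g in edges:
        if not has_in[g] and not has_out[f] and find(f) != find(g):
            selected.append((f, g, w))
            has_in[g] = has_out[f] = True
            parent[find(f)] = find(g)            # Union
            if len(selected) == n - 1:           # only one component left
                break
    return selected

def spell(fragments, selected):
    """Superstring obtained by walking the selected path from its start vertex."""
    nxt = {f: (g, w) for f, g, w in selected}
    entered = {g for _, g, _ in selected}
    cur = next(i for i in range(len(fragments)) if i not in entered)
    s = fragments[cur]
    while cur in nxt:
        cur, w = nxt[cur]
        s += fragments[cur][w:]
    return s

frags = ["CTAAAG", "TACGA", "GACA", "ACCC"]
path = greedy_path(frags)
print(path, spell(frags, path))
```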
A graph where "greedy" fails
• F = {GCAAAG, AGTA, TACGA}
Figure: overlap graph with vertices GCAAAG, TACGA, AGTA (edge label 2).
We order the edges by weight:
(AGAT, GCAAAG) = 3
(GCAAAG, AGTA) = 2
(AGTA, TACGA) = 2
The algorithm will first choose (AGAT, GCAAAG) = 3 and is then forced to select an edge of weight 0 to complete the path.
Instead, the solution should be
(GCAAAG, AGTA) = 2
(AGTA, TACGA) = 2
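The effect can be reproduced by exhaustive search on a tiny instance. The sketch below uses a separate three-fragment set, {ATGC, TGCAT, GCC}, assumed here purely for illustration and not taken from the slide. It compares the best Hamiltonian path overall with the best path that contains the heaviest edge, which is the edge a greedy selection commits to first.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def path_weight(order):
    return sum(overlap(a, b) for a, b in zip(order, order[1:]))

fragments = ["ATGC", "TGCAT", "GCC"]  # illustrative instance (assumption)

# Optimal Hamiltonian path by brute force.
best = max(permutations(fragments), key=path_weight)

# Heaviest single edge, i.e. the first greedy choice.
heaviest = max(((a, b) for a in fragments for b in fragments if a != b),
               key=lambda e: overlap(*e))

# Best path among those that keep the heaviest edge.
forced = max((p for p in permutations(fragments) if heaviest in zip(p, p[1:])),
             key=path_weight)

print(best, path_weight(best))
print(heaviest, forced, path_weight(forced))
```

In this instance, committing to the weight-3 edge leaves only zero-weight edges to finish the path, while two weight-2 edges together do better.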
Observations
Local optimal decisions do not always work. Can we do any better?
Use some heuristics.
Issues:
• Scoring
• Coverage
• Linkage
Heuristics
• Scoring
  – Uniformity is good, variability is bad.
  – Compute the entropy of a column: the entropy is the measure of the chaos in a column. There are 5 possible characters: A, T, C, G, space
    $E = -\sum_c p_c \log p_c$
  – E = 0 if $p_c = 1$ for a character; E = log 5 if each $p_c = 1/5$
  – To measure the uniformity we want a low entropy per column
• Coverage
  – minimum, maximum, or average coverage
  – if the coverage reaches 0 for a column i, we do not have a connected layout
  – if we have several columns with zero coverage, any permutation of the intervening regions (the contigs) is acceptable
  – Coverage gives confidence to the consensus
• Linkage
  – High coverage with no links is not good. Overlap is required.
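The scoring and coverage measures can be sketched for a layout stored as equal-length rows. Assumptions made here: the natural logarithm is used (the slide does not fix the base), `_` marks positions a fragment does not cover, and the function names are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """E = -sum_c p_c log p_c over the characters present in the column."""
    total = len(column)
    return -sum((k / total) * math.log(k / total)
                for k in Counter(column).values())

def coverage(layout, gap="_"):
    """Number of fragments covering each column of the layout."""
    return [sum(c != gap for c in column) for column in zip(*layout)]

print(column_entropy("AAAA"))     # uniform column: zero entropy
print(column_entropy("ATCG "))    # all five characters equally likely: log 5
layout = ["__ACCGT__", "____CGTGC", "TTAC_____", "_TACCGT__"]
cov = coverage(layout)
print(cov, min(cov), max(cov), sum(cov) / len(cov))
```

A zero anywhere in the coverage list would signal a disconnected layout.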
More Observations
Local optimal decisions do not always work. Can we do any better?
Use some heuristics.
• Assembly in practice consists of:
  1. Finding overlaps
  2. Building the layout
  3. Computing the consensus
Advantages:
• We treat each problem separately.
Disadvantages:
• It becomes difficult to understand the relationship between the input and the final output
Heuristics
• Finding overlaps
  – use a dynamic programming approach with a scoring system such as 1 for matches, -1 for mismatches, -2 for spaces
  – do not charge for spaces after the first sequence and before the second one
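The scoring scheme above can be sketched as a semiglobal alignment in which a suffix of the first sequence is aligned with a prefix of the second. Starting the second sequence late is free (first column of the table is zero) and so is whatever remains of the second sequence after the first one ends (the answer is the maximum of the last row). The function name and default parameters are assumptions; only the scores 1, -1, -2 come from the slide.

```python
def overlap_score(a, b, match=1, mismatch=-1, space=-2):
    """Best score of aligning a suffix of a with a prefix of b."""
    # Row 0: characters of b placed before a starts are charged as spaces.
    prev = [j * space for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)   # curr[0] = 0: spaces before b are free
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diagonal, prev[j] + space, curr[j - 1] + space)
        prev = curr
    return max(prev)                # spaces after the end of a are free

print(overlap_score("ACCGT", "CGTGC"))
```

A score of 0 means the best option is no overlap at all; a threshold on this score can then decide which pairs become edges of the overlap graph.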
Heuristics
• Ordering fragments
  – there is no algorithm simple and general enough
Considerations:
• Use the set $DF = F \cup \bar{F}$, where $\bar{F}$ contains the reverse complements of the fragments in F
• If f = uv and g = wx, then $\bar{g} = \bar{x}\bar{w}$ and $\bar{f} = \bar{v}\bar{u}$
• If the end of f is approximately the same as the beginning of g, we can expect that, whatever criterion is used to assess the similarity between f and g, the same criterion will apply to their reverse complements
Finding overlaps
• Finding a good ordering of overlapping fragments means finding a directed path in the overlap graph
• Both strands are constructed simultaneously
• Contained fragments are not essential in the path
• A disconnected graph indicates lack of coverage
• The presence of cycles indicates repeats
• Unusually high coverage indicates possible repeats
• The presence of reverse complement cycles indicates inverted repeats
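The symmetry between a fragment pair and their reverse complements can be checked directly for exact overlaps. In this sketch the fragments ACCGT and CGTGC and the helper names are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every base, then reverse the string."""
    return s.translate(COMPLEMENT)[::-1]

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def doubled_set(fragments):
    """DF: the fragments together with their reverse complements."""
    return sorted(set(fragments) | {reverse_complement(f) for f in fragments})

f, g = "ACCGT", "CGTGC"
# If f overlaps g, the reverse complement of g overlaps that of f equally.
print(overlap(f, g), overlap(reverse_complement(g), reverse_complement(f)))
print(doubled_set([f, g]))
```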
Alignment and Consensus
• Use the minimal sum of the distances
Suppose we have
f = CATAGTC
g = TAACTAT
h = AGACTATCC
Two semiglobal alignments for f and g are:
First alignment:
C A T A G T C _ _ _
_ _ T A A _ C T A T
Second alignment:
C A T A G T C _ _ _
_ _ T A _ A C T A T
Multiple alignment of f, g, h (with the second alignment) and consensus S:
C A T A G T C _ _ _ _ _
_ _ T A _ A C T A T _ _
_ _ _ A G A C T A T C C
C A T A G A C T A T C C (S)
• $d_s(f,S) = 1$, $d_s(g,S) = 1$, $d_s(h,S) = 0$ if we use the second alignment, and $d_s(f,S) + d_s(g,S) + d_s(h,S) = 2$
• $d_s(f,S) = 1$, $d_s(g,S) = 2$, $d_s(h,S) = 0$ if we use the first alignment and A is chosen for column 6, and $d_s(f,S) + d_s(g,S) + d_s(h,S) = 3$
A Linked List of Bases
• Sometimes we know what is best only later. Is there a structure that helps us? Use a linked list of bases
  – matched bases are unified in one node
  – unmatched bases are left separate
Technique: Traverse this graph in topological order.
Figure: linked list of bases, with node labels G T C A T A C T A T and A.
Conclusions
• The models fail to address all the issues involved in the problem
• The effective real problem is NP-hard
• Approximation gives us some help but fails in some cases
• Heuristics help, and the problem is broken into 3 smaller problems:
  1. finding overlaps
  2. building the layout
  3. computing the consensus
• Are we sure there is nothing else to do?
  – We will look next week at the smaller problem of comparing only two sequences instead of many. Will we find something better?