Date post: | 22-Dec-2015 |
Engineering a Scalable Placement Heuristic for DNA Probe Arrays
A.B. Kahng, I.I. Mandoiu, P. Pevzner, S. Reda (all UCSD), A. Zelikovsky (GSU)
Outline
• DNA probe arrays and unwanted illumination
• Synchronous array design (2-D placement)
• Asynchronous array design (3-D placement)
• Experimental results
• Extensions
• Conclusions
DNA Probe Arrays
• Used in a wide range of genomic analyses
– Gene expression monitoring, SNP mapping, sequencing by hybridization, …
• Arrays with up to 1000x1000 probes are in commercial use; 10^8 probes are envisioned for next-generation arrays
– Highly scalable algorithms are required for array design
Simplified DNA Array Flow
• Input: gene sequences, position of SNPs, etc.
• Probe Selection
• Probe Placement
• Probe Alignment (Mask Design)
• Mask Manufacturing
• Array Manufacturing
• Hybridization Experiment
• Analysis of Hybridization Intensities
[Figure: flow diagram of the steps above, with the labels "This talk", "Soft/Computational Domain", and "Hard/Biochemistry Domain"]
Array Manufacturing Process
Very Large-Scale Immobilized Polymer Synthesis:
1. Treat substrate with chemically protected “linker” molecules, creating a rectangular array
– Site size = approx. 10x10 microns
2. Selectively expose array sites to light
– Light deprotects exposed molecules, activating further synthesis
3. Flush chip surface with solution of protected A,C,G,T
– Binding occurs at previously deprotected sites
4. Repeat steps 2 and 3 until the desired probes are synthesized
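The four steps above can be sketched as a small simulation. This is only an illustration: the function name `synthesize` and the representation of masks as a per-site set of exposed step indices are assumptions made here, not something the slides specify.

```python
def synthesize(deposition_seq, exposure):
    """Simulate light-directed synthesis on a chip.

    deposition_seq: nucleotides flushed over the chip, one per step.
    exposure: dict mapping site -> set of step indices at which the mask
              exposes (deprotects) that site (hypothetical representation).
    Returns a dict mapping site -> synthesized probe string.
    """
    probes = {site: "" for site in exposure}
    for step, nucleotide in enumerate(deposition_seq):
        for site, steps in exposure.items():
            if step in steps:               # site was deprotected by light
                probes[site] += nucleotide  # the flushed nucleotide binds here
    return probes


chip = synthesize(
    "ACG",
    {(0, 0): {1, 2}, (0, 1): {0, 1}, (1, 0): {0, 1, 2}, (1, 1): {2}},
)
```

With deposition sequence ACG, the four sites end up holding the probes CG, AC, ACG, and G.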
Probe Synthesis
[Figure: nucleotide deposition sequence ACG with masks M1 (A), M2 (C), M3 (G), and the nine placed probes CG, AC, CG, AC, ACG, AG, G, AG, C]
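Measuring Unwanted Illumination
• Unwanted illumination is measured by border length: the total length of the borders between exposed and masked sites over all masks
[Figure: masks M1 (A), M2 (C), M3 (G) for nucleotide deposition sequence ACG with their borders marked, next to the nine placed probes CG, AC, CG, AC, ACG, AG, G, AG, C]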
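A minimal sketch of the border-length measure, under stated assumptions: the nine probes are arranged row by row in a 3x3 array (the arrangement is a guess from the listing order), each nucleotide occurs once in the deposition sequence so a site is exposed exactly when its probe contains that nucleotide, and the outer chip boundary is not counted. The function names are hypothetical.

```python
def mask_border_length(mask):
    """Count adjacent site pairs where exactly one of the two sites is exposed."""
    rows, cols = len(mask), len(mask[0])
    border = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and mask[r][c] != mask[r][c + 1]:
                border += 1
            if r + 1 < rows and mask[r][c] != mask[r + 1][c]:
                border += 1
    return border


def total_border_length(placed, deposition_seq):
    """Sum the border lengths of the masks, one mask per deposition step."""
    total = 0
    for nucleotide in deposition_seq:
        # Valid only because each nucleotide appears once in deposition_seq.
        mask = [[nucleotide in probe for probe in row] for row in placed]
        total += mask_border_length(mask)
    return total


placed = [["CG", "AC", "CG"],
          ["AC", "ACG", "AG"],
          ["G", "AG", "C"]]
total = total_border_length(placed, "ACG")
```

For this assumed arrangement the masks for A, C, and G contribute 8, 6, and 8 border units, 22 in total.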
Synchronous vs. Asynchronous Synthesis
(a) periodic deposition sequence
(b) Synchronous embedding of CTG
(c) Asynchronous leftmost embedding of CTG
(d) Another asynchronous embedding
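[Figure: panels (a) to (d); the periodic deposition sequence in (a) is divided into periods of four nucleotides, each labeled "4-group", and panels (b) to (d) show three embeddings of the probe CTG]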
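The difference between the synchronous and the leftmost asynchronous embedding can be sketched as follows. The period order ACGT and both function names are assumptions for illustration; the code also assumes every nucleotide of the probe occurs in the period.

```python
def synchronous_embedding(probe, period="ACGT"):
    """The i-th nucleotide is deposited inside the i-th period (4-group)."""
    return [i * len(period) + period.index(nuc) for i, nuc in enumerate(probe)]


def leftmost_embedding(probe, period="ACGT"):
    """Each nucleotide takes the earliest matching step after the previous one."""
    steps, step = [], 0
    for nuc in probe:
        while period[step % len(period)] != nuc:
            step += 1
        steps.append(step)
        step += 1
    return steps


sync_steps = synchronous_embedding("CTG")   # [1, 7, 10]
left_steps = leftmost_embedding("CTG")      # [1, 3, 6]
```

The synchronous embedding of CTG uses three full periods (12 steps), while the leftmost embedding finishes within the first two.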
Problem Formulation (Synchronous Case)
Synchronous Array Design (2-D Placement) Problem:
• Minimize placement cost of Hamming graph H (vertices = probes, distance = Hamming)
• On 2-dimensional grid graph G2 (N x N array, edges between distance-1 neighbors)
[Figure: probes of H mapped to sites of G2]
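As a sketch of the objective, the placement cost can be computed by summing Hamming distances over all pairs of horizontally or vertically adjacent sites. The function names and the toy probes are illustrative assumptions.

```python
def hamming(p, q):
    """Number of positions in which two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))


def placement_cost(grid):
    """Sum of Hamming distances over all grid edges (adjacent site pairs)."""
    rows, cols = len(grid), len(grid[0])
    cost = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                cost += hamming(grid[r][c], grid[r][c + 1])
            if r + 1 < rows:
                cost += hamming(grid[r][c], grid[r + 1][c])
    return cost


good = placement_cost([["ACG", "ACT"], ["TCG", "TCT"]])  # similar probes adjacent
bad = placement_cost([["ACG", "TCT"], ["TCG", "ACT"]])   # same probes, worse layout
```

The same four probes cost 4 in the first layout and 6 in the second, which is the kind of difference a placement algorithm tries to exploit.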
2-D Placement Lower Bound
• Sum, over all probes, of Hamming distances to the 4 closest neighbors, minus the weight of the 4N heaviest arcs
[Figure: Hamming graph H (probes) and grid graph G2]
TSP+1-Threading Placement
Hubbell, 1990s
• Find TSP tour/path over given probes w.r.t. Hamming distance
• Thread TSP path in the grid row by row
Hannenhalli, Hubbell, Lipshutz, Pevzner '02
• Place the probes according to 1-Threading
• Further decreases total border by 20%
Lexicographical Sorting + 1-Threading
• Radix-sort the probes in lexicographical order
• Thread on the chip
[Figure: example of probes sorted lexicographically and then threaded on the chip]
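A simplified sketch of sort-then-thread follows. Two substitutions are made and should be read as assumptions: Python's built-in sort stands in for radix sort, and a plain row-by-row snake path stands in for 1-threading, whose exact path is not given on the slides. The function names are hypothetical.

```python
def thread_rows(ordered_probes, n):
    """Lay a 1-D probe order onto an n x n grid, reversing every other row
    so that consecutive probes in the order stay adjacent on the chip."""
    grid = []
    for r in range(n):
        row = ordered_probes[r * n:(r + 1) * n]
        grid.append(row if r % 2 == 0 else row[::-1])
    return grid


def sort_and_thread(probes, n):
    """Sort probes lexicographically, then thread them onto the grid."""
    return thread_rows(sorted(probes), n)


grid = sort_and_thread(["TG", "AC", "CA", "AA", "GT", "CC", "AT", "GG", "TA"], 3)
```

Sorting places probes with long common prefixes next to each other in the order, and the snake path keeps those neighbors adjacent on the chip.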
Matching-Based Probe Placement
• Select an independent (mutually nonadjacent) set of placed probes
• Re-embed using optimal perfect matching
• Total cost can only decrease or remain the same
• Runtime: roughly proportional to square of independent set size
[Figure: numbered probes of an independent set before and after reassignment]
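The reassignment step can be sketched as below. Because the selected sites are mutually nonadjacent, the cost of putting a probe on one of them depends only on fixed neighbors, so the best reassignment is a minimum-cost assignment. As a stand-in for a real matching solver, this sketch simply tries all permutations, which is only practical for tiny sets; the function names are hypothetical.

```python
from itertools import permutations


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def site_cost(grid, r, c, probe):
    """Border cost of putting `probe` at (r, c), given its current neighbors."""
    cost = 0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]):
            cost += hamming(probe, grid[rr][cc])
    return cost


def rematch(grid, sites):
    """Optimally reassign the probes sitting at mutually nonadjacent `sites`.

    Brute force over permutations (an assumption made for brevity); a
    min-cost perfect matching algorithm would replace this at scale.
    """
    probes = [grid[r][c] for r, c in sites]
    best = min(
        permutations(probes),
        key=lambda perm: sum(site_cost(grid, r, c, p)
                             for (r, c), p in zip(sites, perm)),
    )
    for (r, c), p in zip(sites, best):
        grid[r][c] = p
    return grid


row = rematch([["AA", "CC", "AA", "AA", "CC"]], [(0, 1), (0, 3)])
```

In this toy row, swapping the probes at the two selected sites lowers the total border cost from 6 to 2; the identity permutation is always a candidate, so the cost never increases.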
Sliding Window Matching
There is a trade-off between solution quality and size/overlap of windows
Iterate SlidingWindowMatching over the chip until improvement drops below 0.1%
Effect of Window Size on Solution Quality
Increased window size/overlap decreases number of conflicts, but increases runtime
Epitaxial Placement Algorithm
• Simulates crystal growth
• Start with an arbitrary probe placed at the center
• Maintain a best probe candidate (i.e., a probe with the minimum number of conflicts with the already placed neighbors) for each border site
• Iteratively fill the border site with minimum increase in border length
– Give priority to sites with more neighbors filled
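A deliberately naive sketch of the crystal-growth idea is given below. It re-evaluates every (border site, unplaced probe) pair at each step instead of maintaining cached best candidates, and it realizes the priority rule by ranking pairs by conflicts per filled neighbor and breaking ties toward sites with more filled neighbors. Both choices, and the function names, are assumptions for illustration; the code expects exactly n*n probes.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def epitaxial(probes, n):
    """Greedy crystal-growth placement of n*n probes on an n x n grid."""
    grid = [[None] * n for _ in range(n)]
    unplaced = list(probes)
    grid[n // 2][n // 2] = unplaced.pop(0)   # arbitrary seed at the center

    def filled_neighbors(r, c):
        out = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and grid[rr][cc] is not None:
                out.append(grid[rr][cc])
        return out

    while unplaced:
        best = None
        for r in range(n):
            for c in range(n):
                if grid[r][c] is not None:
                    continue
                nbrs = filled_neighbors(r, c)
                if not nbrs:
                    continue                 # not a border site yet
                for probe in unplaced:
                    cost = sum(hamming(probe, q) for q in nbrs)
                    # Assumed priority rule: fewer conflicts per filled
                    # neighbor first, then more filled neighbors.
                    key = (cost / len(nbrs), -len(nbrs))
                    if best is None or key < best[0]:
                        best = (key, r, c, probe)
        _, r, c, probe = best
        grid[r][c] = probe
        unplaced.remove(probe)
    return grid


probes = ["AA", "AC", "CA", "CC", "GG", "GT", "TG", "TT", "AG"]
grid = epitaxial(probes, 3)
```

The quadratic scan per step is what makes the plain version slow on large chips, which motivates the tile and row variants.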
Tile- and Row-Epitaxial
• Tile-epitaxial
– Divide array into 100x100 tiles
– Run Epitaxial within each tile
– Take into account border of already placed tiles
• Row-epitaxial
– Place probes by a fast method, e.g., sort + 1-thread
– Re-place probes row by row, sequentially filling sites within a row
– Assign to each site a probe with min number of conflicts among the unplaced probes from following K rows
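The row-epitaxial pass can be sketched as follows, starting from any initial grid. One detail is an assumption of this sketch: the candidate pool for a site consists of the not-yet-finalized probes in the rest of the current row plus the next K rows. Conflicts are counted against the already finalized left and upper neighbors, and the function name is hypothetical.

```python
def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def row_epitaxial(grid, k):
    """Re-place probes row by row with a lookahead of k rows (in place)."""
    n_rows, n_cols = len(grid), len(grid[0])
    for r in range(n_rows):
        for c in range(n_cols):
            placed = []                      # finalized neighbors of (r, c)
            if c > 0:
                placed.append(grid[r][c - 1])
            if r > 0:
                placed.append(grid[r - 1][c])
            # Candidates: this site onward in the row, plus the next k rows.
            candidates = [(r, cc) for cc in range(c, n_cols)]
            candidates += [(rr, cc)
                           for rr in range(r + 1, min(r + k + 1, n_rows))
                           for cc in range(n_cols)]
            br, bc = min(
                candidates,
                key=lambda s: sum(hamming(grid[s[0]][s[1]], q) for q in placed),
            )
            # Swap the chosen probe into the current site.
            grid[r][c], grid[br][bc] = grid[br][bc], grid[r][c]
    return grid


result = row_epitaxial([["AA", "CC", "AA"], ["CC", "AA", "CC"]], 1)
```

On this checkerboard-like toy input the pass gathers equal probes into rows, reducing the border cost from 14 to 6. Larger K widens the candidate pool at the price of more comparisons per site.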
2-D Placement Algorithm Comparison: Border Conflict
[Figure: border conflicts (axis from 0 to 14,000,000) versus chip size (100, 200, 300, 500) for LB, Row-EPTX, EPTX, Tile-EPTX, TSP+1Thr, and SWM 6x6]
2-D Placement Algorithm Comparison: Runtime
[Figure: runtime (logarithmic axis from 1 to 1,000,000) versus chip size (100, 200, 300, 500, 1000) for TSP+1Thr, Row-EPTX, EPTX, Tile-EPTX, and SWM]
Problem Formulation (Asynchronous Case)
• Asynchronous synthesis:
– Periodic nucleotide deposition sequence, e.g., (ACTG)^p
– Every probe grows asynchronously
• Border length = Hamming distance between embedded probes
• Asynchronous Array Design (3-D Placement) Problem:
– Minimize placement cost of embedded-probe Hamming graph H (vertices = probes, distance = Hamming distance between embedded probes)
– On 2-dimensional grid graph G2 (N x N array, edges between neighbors)
[Figure: probes of H mapped to sites of G2]
Lower Bound
• Sum of distances to 4 closest neighbors minus weight of 4N heaviest arcs
– Distance between two probes of length p = 2(p − |LCS|), where LCS is their longest common subsequence
• The bound is not tight: example with LB = 8 and best placement cost = 10
[Figure: probes AC, CT, TG, GA on a 2x2 array; optimum placement under nucleotide deposition sequence S = ACTGA with masks M1 to M5]
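The pairwise distance used in this bound can be sketched with a standard LCS computation. When two probes are embedded as compatibly as possible, each one still has its nucleotides outside the common subsequence exposed alone, so the two probes together contribute len(p) + len(q) − 2|LCS| conflicts. The function names are hypothetical.

```python
def lcs_length(p, q):
    """Length of the longest common subsequence of p and q (row-by-row DP)."""
    prev = [0] * (len(q) + 1)
    for a in p:
        cur = [0]
        for j, b in enumerate(q, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def probe_distance(p, q):
    """Minimum number of conflicts between two probes over all embeddings."""
    return len(p) + len(q) - 2 * lcs_length(p, q)


cycle = ["AC", "CT", "TG", "GA"]
bound = sum(probe_distance(cycle[i], cycle[(i + 1) % 4]) for i in range(4))
```

For the probes AC, CT, TG, GA, each cyclically adjacent pair shares one nucleotide and is at distance 2, so the four grid edges of a 2x2 array give 8, in line with LB = 8 on the slide.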
Optimal Probe Alignment
[Figure: dynamic programming graph for probe ACT against deposition sequence ACGTACGT, with Source and Sink nodes]
• Find best alignment of probe w.r.t. embedded neighbors
• Dynamic Programming:
– Source-sink paths correspond to feasible embeddings
– Runtime O((probe length) x (deposition sequence length))
• Can be extended to simultaneous alignment of two adjacent probes (2x1) at the cost of an O(probe length) increase in runtime
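A minimal sketch of the single-probe alignment DP, under these assumptions: each neighbor is given as the set of deposition steps at which it is exposed, a conflict is counted at a step whenever the probe and a neighbor differ in exposure, and the probe is embeddable in the deposition sequence. The name `best_embedding` and the table layout are illustrative, not taken from the slides.

```python
def best_embedding(probe, deposition_seq, neighbors):
    """Return (min conflicts, chosen steps) for embedding `probe`.

    cost[i][j] = fewest conflicts over steps 0..j-1 once the first i
    nucleotides of the probe have been deposited.
    """
    m, n, k = len(probe), len(deposition_seq), len(neighbors)
    inf = float("inf")
    # exposed[j] = number of neighbors exposed at step j
    exposed = [sum(j in nb for nb in neighbors) for j in range(n)]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0
    for j in range(n):
        for i in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Skip step j: conflict with every neighbor exposed at j.
            cost[i][j + 1] = min(cost[i][j + 1], cost[i][j] + exposed[j])
            # Use step j: conflict with every neighbor masked at j.
            if i < m and probe[i] == deposition_seq[j]:
                cost[i + 1][j + 1] = min(cost[i + 1][j + 1],
                                         cost[i][j] + k - exposed[j])
    # Trace one optimal source-sink path backwards.
    steps, i = [], m
    for j in range(n, 0, -1):
        if (i > 0 and probe[i - 1] == deposition_seq[j - 1]
                and cost[i][j] == cost[i - 1][j - 1] + k - exposed[j - 1]):
            steps.append(j - 1)
            i -= 1
    return cost[m][n], steps[::-1]


dep = "ACGT" * 3
one = best_embedding("CTG", dep, [{1, 7, 10}])
two = best_embedding("CTG", dep, [{1, 3, 6}, {1, 7, 10}])
```

With a single neighbor that is itself an embedding of CTG, the probe aligns to it exactly and the cost is 0. With two neighbors that disagree with each other, the best achievable cost here is 4.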
3-D Placement Flows
• Simultaneous placement and alignment
– Asynchronous epitaxial (slow and low quality)
• Synchronous placement followed by in-place probe alignment (analogous to the standard partition of VLSI flows)
– Using the previous DP to do in-place probe alignment
• Synchronous placement followed by probe alignment with reshuffle (analogous to feedback loops in VLSI flows)
– Asynchronous sliding window matching
Algorithms for In-Place Probe Alignment
• Asynchronous re-embedding after 2-dim placement
– Greedy Algorithm
  • While there exist probes to re-embed with gain
    – Optimally re-embed the probe with the largest gain
– Batched greedy: speed-up by avoiding recalculations
– Chessboard Algorithm
  • While there is gain
    – Re-embed probes in green sites
    – Re-embed probes in red sites
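The chessboard idea can be sketched as follows. Sites of one color are mutually nonadjacent, so each can be re-embedded against fixed neighbors without affecting the others. For brevity this sketch finds the best embedding of a probe by enumerating all of its embeddings rather than by dynamic programming, which is only feasible for toy sizes; the function names and the set-of-steps representation are assumptions.

```python
from itertools import combinations


def embeddings(probe, dep):
    """All embeddings of `probe` into deposition sequence `dep`, as step sets."""
    return [set(steps)
            for steps in combinations(range(len(dep)), len(probe))
            if all(dep[j] == nuc for j, nuc in zip(steps, probe))]


def chessboard(probes, emb, dep):
    """Alternately re-embed probes on the two site colors while there is gain."""
    rows, cols = len(probes), len(probes[0])

    def local_cost(r, c, e):
        # Conflicts between embedding e at (r, c) and its grid neighbors.
        total = 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += len(e ^ emb[rr][cc])
        return total

    improved = True
    while improved:
        improved = False
        for color in (0, 1):                 # "green" sites, then "red" sites
            for r in range(rows):
                for c in range(cols):
                    if (r + c) % 2 != color:
                        continue
                    best = min(embeddings(probes[r][c], dep),
                               key=lambda e: local_cost(r, c, e))
                    if local_cost(r, c, best) < local_cost(r, c, emb[r][c]):
                        emb[r][c] = best
                        improved = True
    return emb


result = chessboard([["CTG", "CTG"]], [[{1, 3, 6}, {1, 7, 10}]], "ACGT" * 3)
```

In this two-site example the first probe is re-embedded to match its neighbor, removing all 4 conflicts. Every accepted change strictly lowers the total cost, so the loop terminates.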
Comparison of In-Place Probe Alignments
Chip size | LB (%LB) | TSP+1Thr (%LB) | Greedy (%LB) | Greedy (CPU) | Chessboard (%LB) | Chessboard (CPU) | 2x1 Chessboard (%LB) | 2x1 Chessboard (CPU)
100 | 100 | 152.0 | 125.7 | 40 | 120.5 | 54 | 119.4 | 480
200 | 100 | 150.2 | 126.3 | 154 | 120.9 | 221 | 119.7 | 1915
300 | 100 | 149.1 | 126.7 | 357 | 121.5 | 522 | 121.6 | 4349
500 | 100 | 147.9 | 127.1 | 943 | 121.4 | 1423 | 120.2 | 15990
• Post-placement LB = sum of distances to adjacent probes
– Distance between two probes of length p = 2(p − |LCS|)
– Useful for assessing quality of algorithms that change probe embeddings but do not change probe placement
3-D vs. 2-D Placement Results
Chip size | TSP+1Thr (Cost) | TSP+1Thr+Chessboard (Cost) | TSP+1Thr+Chessboard (CPU) | Epitaxial+Chessboard (Cost) | Epitaxial+Chessboard (CPU) | SyncSWM+Chessboard (Cost) | SyncSWM+Chessboard (CPU) | AsyncSWM (Cost) | AsyncSWM (CPU)
100 | 554849 | 439829 | 113 | 419069 | 274 | 433274 | 1 | 417890 | 875
200 | 2140903 | 1723352 | 1901 | 1624988 | 4441 | 1693658 | 46 | 1636658 | 3676
300 | 4667882 | 3801765 | 12028 | --- | --- | 3746722 | 112 | 3615282 | 8406
500 | 12702474 | 10426237 | 109648 | --- | --- | 10049442 | 302 | 9686918 | 22351
1000 | --- | --- | --- | --- | --- | 38898792 | 1307 | 38005039 | 54501
3-D Placement Algorithm Comparison: Border Conflict
[Figure: border conflicts (axis from 0 to 40,000,000) versus chip size (100, 200, 300, 500, 1000) for TSP+1Thr, TSP+1Thr+Chess, RowEPTX+Chess, EPTX+Chess, TileEPTX+Chess, SyncSWM+Chess, and AsyncSWM+Chess]
3-D Placement Algorithm Comparison: Runtime
[Figure: runtime (logarithmic axis from 1 to 1,000,000) versus chip size (100, 200, 300, 500, 1000) for TSP+1Thr+Chess, RowEPTX+Chess, EPTX+Chess, TileEPTX+Chess, SyncSWM+Chess, and AsyncSWM+Chess]
Practical Extensions
• Distance-dependent border conflict weights
– Take into account conflicts between 2- and 3-hop neighbors rather than only immediate neighbors
• Position-dependent border conflict weights
– In the alignment DP for two sequences, take into account the importance of conflicts in the middle of probes: the alignment cost has weights on conflicts that depend on conflict position
• Polymorphic probes
– Chip contains SNPs, e.g., pairs of probes that differ in a single position; they should be placed together, and the alignment DP should align them simultaneously
Summary
• Contributions:
– Epitaxial placement reduces border length by an extra 10% over the previously best known method
– Asynchronous placement problem formulation
– Post-placement improvement by an extra 15.5-21.8%
– Lower bounds
– Scalable placements (1000x1000 in 20 min)
• Ongoing work:
– Comparison on industrial benchmarks
– Experiments with algorithms for extended formulations (SNPs, distance-dependent weights, etc.)