Date post: | 30-Dec-2015 |
Quality and Error Control Coding for DNA Microarrays
Olgica Milenkovic
ECE Department
University of Colorado, Boulder
IEEE Denver ComSoc
Outline
• DNA Microarrays
• VLSIPS (Very Large Scale Immobilized Polymer Synthesis)
• Production of DNA Microarrays (http://www.affymetrix.com/)
– Base Scheduling
– Mask Design
– Quality-Control Coding
• Error-Correcting DNA Microarrays (Multiplexed Arrays)
• Production of Multiplexed DNA Microarrays
– Base/Color Scheduling
– Mask Design
– Quality-Control Coding
DNA microarrays I
[Figure: Protein Coding Sequence → Transcription → Translation → Protein]
Control of Transcription & Translation
Gene expression and co-regulation
Goal: Determining which genes are expressed (active) and which are unexpressed (inactive)
Comparative gene expression study of multiple cells
[Figure, repeated: Protein Coding Sequence → Transcription → Translation → Protein]
DNA microarrays II
Creating the `cell cultures’ to be compared…
`Green’ Cell Culture:
DNA Subsequence: 3’-AATTTCGC…-5’
mRNA: 5’-UUAAAGCG…-3’
cDNA: 3’-AATTTCGC…-5’
Tagged cDNA (“Color Coding”): 3’-AATTTCGC…-5’
Creation of tagged cDNA sequences from first cell type
`Red’ Cell Culture:
DNA Subsequence: 3’-AATTTCGC…-5’
mRNA: 5’-UUAAAGCG…-3’
cDNA: 3’-AATTTCGC…-5’
Tagged cDNA (“Color Coding”): 3’-AATTTCGC…-5’
Creation of tagged cDNA sequences from second cell type
DNA microarrays III
Hybridization:
Complementary sequences hybridize with each other, forming stable double helices
3’-AAGCT-5’
5’-TTCGA-3’
The DNA microarray is scanned by laser light of different wavelengths
[Figure labels: Gene Probes; Spots]
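The hybridization example above can be checked mechanically. The following is a minimal sketch, assuming perfect Watson-Crick pairing only (A with T, G with C) and two strands written in opposite orientations so that they align position by position; the function name `hybridizes` is hypothetical.

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def hybridizes(strand_3to5, strand_5to3):
    """Return True when the two strands are perfectly complementary.

    The first strand is written 3'->5' and the second 5'->3', so the
    bases that pair up sit at the same index in both strings.
    """
    if len(strand_3to5) != len(strand_5to3):
        return False
    return all(COMPLEMENT[a] == b for a, b in zip(strand_3to5, strand_5to3))

# The pair shown on the slide: 3'-AAGCT-5' against 5'-TTCGA-3'.
print(hybridizes("AAGCT", "TTCGA"))
```

Real hybridization also tolerates partial mismatches; this sketch deliberately ignores that.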
Probe synthesis in microarrays I
VLSIPS (Gene Chip, AFFYMETRIX, Array Manufacturing Manual)
[Figure labels: Quartz Wafer; Linkers; Linker Activation; Mask]
Probe synthesis in microarrays II
VLSIPS (Gene Chip, AFFYMETRIX, Array Manufacturing Manual)
[Figure labels: Solution of one DNA base (A or T or G or C); Solution of one DNA base (A)]
Base scheduling I
Fixed probe length: N
Synchronous schedule (length 4N)
Production steps: A T G C A T G C A T G C A T G C
Spots: 1 2 3 4 5
Probes shown: CTGA, ACAA
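A minimal sketch of the synchronous schedule, assuming the usual reading that the i-th base of every probe is deposited during the i-th repetition of the period A T G C; the function name `synchronous_steps` is hypothetical.

```python
PERIOD = "ATGC"  # one period of the schedule shown on the slide

def synchronous_steps(probe, period=PERIOD):
    """1-based production step at which each base of `probe` is deposited.

    Assumption: base i of the probe is synthesized in period i, so a
    probe of length N always needs the full 4N steps of the schedule.
    """
    return [i * len(period) + period.index(base) + 1
            for i, base in enumerate(probe)]

for probe in ("CTGA", "ACAA"):
    print(probe, synchronous_steps(probe))
```

For the two probes on the slide this gives steps 4, 6, 11, 13 for CTGA and 1, 8, 9, 13 for ACAA, all within the 4N = 16 steps of the schedule.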
Base scheduling II
Asynchronous schedule
Production steps: A G G C T T G C T T G C C C G C
Spots: 1 2 3 4 5
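In an asynchronous schedule a probe may pick up its next base at the first step that offers it, rather than waiting for its own period. The sketch below illustrates this with a greedy leftmost embedding; the function name `embed` is hypothetical, and the periodic schedule is used only as a convenient test input.

```python
def embed(probe, schedule):
    """Greedy leftmost embedding of `probe` into `schedule`.

    Returns the 1-based steps at which each base is deposited, or None
    if the schedule cannot produce the probe (the probe is not a
    subsequence of the schedule).
    """
    steps, pos = [], 0
    for base in probe:
        pos = schedule.find(base, pos)
        if pos == -1:
            return None
        pos += 1            # next search starts after the step just used
        steps.append(pos)
    return steps

print(embed("CTGA", "ATGC" * 4))
```

On the periodic schedule, CTGA finishes at step 9 instead of step 13 under the synchronous rule, which is why shorter asynchronous schedules are possible in principle.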
Base Scheduling III
• Shortest asynchronous base schedule
– Shortest common super-sequence of a set of M sequences (NP-hard)
$ES_N(M,k)$ – expected length of a longest common subsequence of M randomly chosen sequences of length N over an alphabet of size k
$\gamma_k^{(M)} = \lim_{N \to \infty} \frac{ES_N(M,k)}{N}$
[Equation garbled in extraction: a bound involving M, k, an auxiliary variable z, and logarithmic terms]
No significant gain for N≈20-30
Periodic schedule used instead (length 4N)
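For just two sequences the shortest common supersequence is easy: its length equals the sum of the two lengths minus the length of a longest common subsequence. The sketch below shows this two-sequence case only (the M-sequence problem named above is NP-hard); the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def scs_length(a, b):
    """Length of a shortest common supersequence of two sequences."""
    return len(a) + len(b) - lcs_length(a, b)

# Two probes of length N = 4; the periodic schedule would need 4N = 16 steps.
print(scs_length("CTGA", "ACAA"))
```

For CTGA and ACAA the shortest joint schedule has 6 steps (for instance A C T G A A). The saving shrinks quickly as the number of probes grows.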
Mask Design
Border-length minimization
Feldman and Pevzner, 1994
Hannenhalli et al., 2002
Kahng et al., 2003, 2004
Key idea:
Arrange the probes on the array so that the total border length of all masks is minimal
Border-length graph: complete graph on M vertices, with each edge weighted by the Hamming distance between the two probes
Greedy traveling salesman algorithm + threading (discrete space-filling curve)
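A minimal sketch of the idea above, under two simplifying assumptions that are not taken from the cited papers: the tour is built with a plain nearest-neighbour heuristic, and the threading is a row-by-row snake. Border length is measured as the sum of Hamming distances between adjacent spots. All function names are hypothetical.

```python
def hamming(p, q):
    """Number of positions where two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))

def greedy_tour(probes):
    """Nearest-neighbour TSP heuristic on the Hamming-distance graph."""
    remaining = list(probes)
    tour = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda p: hamming(tour[-1], p))
        remaining.remove(nxt)
        tour.append(nxt)
    return tour

def thread(tour, width):
    """Lay the tour on a grid row by row, reversing every other row."""
    rows = [tour[i:i + width] for i in range(0, len(tour), width)]
    return [row if r % 2 == 0 else row[::-1] for r, row in enumerate(rows)]

def border_length(grid):
    """Sum of Hamming distances over horizontally and vertically adjacent spots."""
    total = 0
    for r, row in enumerate(grid):
        for c, probe in enumerate(row):
            if c + 1 < len(row):
                total += hamming(probe, row[c + 1])
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                total += hamming(probe, grid[r + 1][c])
    return total

probes = ["AAAA", "TTTT", "AAAT", "TTTA"]
naive = [probes[:2], probes[2:]]
placed = thread(greedy_tour(probes), width=2)
print(border_length(naive), border_length(placed))
```

On this toy input the naive placement has border length 10 and the greedy placement 8.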
Quality Control
Quality control (fidelity) spots
Hubbell and Pevzner, 1999
Sengupta and Tompa, 2002
Colbourn et al., 2002
Manufacture identical probes at several quality-control spots in order to test the precision of the production steps
Relevant coding-theoretic ideas
Balanced code (Sengupta and Tompa, 2002):
A b×v binary matrix such that
• each row has weight k;
• each column has weight bounded between l and b-l, for some constant l;
• any pair of columns is at Hamming distance at least d.
Superimposed designs in Rényi’s search model (Kautz and Singleton, 1964; Dyachkov and Rykov, 1983):
A b×v binary matrix such that
• all Boolean sums composed of no more than s columns are distinct;
• each row has weight exactly t.
Additional constraint: the Boolean sums form an error-correcting code with prescribed minimum distance d.
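The superimposed-design condition can be tested by brute force on small matrices. The sketch below checks only the first property (distinct Boolean sums of at most s columns) and takes "no more than s columns" to mean non-empty subsets of size 1 to s, which is an assumption; the function names are hypothetical.

```python
from itertools import combinations

def boolean_sum(columns):
    """Component-wise OR of a collection of 0/1 columns."""
    return tuple(int(any(bits)) for bits in zip(*columns))

def is_superimposed(matrix, s):
    """True if all Boolean sums of 1..s columns of `matrix` are distinct.

    `matrix` is a list of rows; columns are recovered by transposition.
    """
    columns = list(zip(*matrix))
    seen = set()
    for size in range(1, s + 1):
        for subset in combinations(columns, size):
            total = boolean_sum(subset)
            if total in seen:
                return False
            seen.add(total)
    return True

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
bad = [[1, 1, 0], [1, 0, 1]]  # first column equals the OR of the other two
print(is_superimposed(identity, 2), is_superimposed(bad, 2))
```

The check is exponential in s, so it is meant only for verifying small designs by hand.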
Error-correcting microarray design
• Probe multiplexing (Khan et al., 2003)
G (rows: spots, columns: probes):
0011
1001
1010
1100
0101
0110
X – vector of RNA levels corresponding to N genes
Y – total concentration of RNA at all spots
$Y = T G S X$
S - hybridization affinity matrix, T - spot quality matrix
Decoding algorithm: numerical optimization
Excluding hybridization effects and spot formation quality, and under iid measurement noise:
$G^* = (G^T G)^{-1} G^T$
$\min_G \operatorname{tr}(G^{*T} G^*)$, where G is an n×k matrix with $n \ge k$, $G(i,j) \in \{0,1\}$, $\operatorname{rank}(G) = k$, and $\sum_j G(i,j) = c = \text{const.}$
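A minimal sketch of decoding in the idealized case described above (no hybridization effects, perfect spots, noiseless readings), taking the model to be Y = GX and the decoder to be the least-squares pseudo-inverse $(G^T G)^{-1} G^T$, which is an assumption about the intended formula. It uses exact rational arithmetic from Python's `fractions` module rather than a numerical library, and all helper names are hypothetical.

```python
from fractions import Fraction

# Multiplexing matrix from the slide: 6 spots (rows) x 4 probes (columns).
G = [[0, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 0],
     [1, 1, 0, 0], [0, 1, 0, 1], [0, 1, 1, 0]]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Gauss-Jordan inverse over exact fractions; assumes A is invertible."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [x / pv for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def pseudo_inverse(G):
    """G* = (G^T G)^{-1} G^T, valid when G has full column rank."""
    Gt = transpose(G)
    return matmul(inverse(matmul(Gt, G)), Gt)

def decode(G, Y):
    """Least-squares estimate of X from spot readings Y."""
    return [sum(g * y for g, y in zip(row, Y)) for row in pseudo_inverse(G)]

def design_cost(G):
    """tr(G*^T G*), which equals the sum of squared entries of G*."""
    return sum(x * x for row in pseudo_inverse(G) for x in row)

X = [1, 2, 3, 4]
Y = [sum(g * x for g, x in zip(row, X)) for row in G]  # noiseless readings
print(decode(G, Y), design_cost(G))
```

With noiseless readings the decoder returns X exactly, and the design cost of this G evaluates to 5/3. The general model with S and T requires numerical optimization, as the slide notes, and is not covered here.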
VLSIPS/analysis for multiplexed arrays
Features:
• Multiple polymer synthesis at one given spot (for simplicity, we consider only two probes per spot)
• Can use two different classes of linkers, sensitive to different wavelengths, so as to select probes for extension (say, `blue’ and `green’, with `cyan’ selecting both)
Production steps (bases): A T G C A T G C A T G C A T G C
Production steps (colors): g b g b c c b c g g c b b g b g
Spots: 1 2 3 4 5 6
VLSIPS/analysis for multiplexed arrays
Scheduling: shortest schedule of bases/colors
(Using results from V. Dančík, Expected Length of Longest Common Subsequences, 1994)
Set-up: two identical sets of M `blue’ and M `green’ randomly and uniformly chosen sequences of length N over the alphabet of size four
Length of shortest schedule: [Equation garbled in extraction: a limit as N → ∞ expressed through the Chvátal-Sankoff constants $\gamma_4^{(\cdot)}$]
Synchronous schedule, no `cyan’ colored steps: 8N
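Example: four spots, each with a (blue, green) probe pair
S1: AT, CA
S2: AC, CC
S3: GT, GA
S4: TT, TA
Production steps (bases): A C G T A C G T
Production steps (colors): b g c c g c c b
Placement 1:
s1 s4
s3 s2
L(M)=4, L(M)=4, L(M)=2, L(M)=2, L(M)=2, L(M)=2, L(M)=2
Placement 2:
s1 s3
s2 s4
L(M)=2, L(M)=2, L(M)=2, L(M)=2, L(M)=2, L(M)=2, L(M)=2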
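The four-spot example can be replayed in code. The sketch below rests on several assumptions that the slide does not state explicitly: the first probe of each pair is the blue one, a spot is exposed at a blue (green) step when its blue (green) probe needs that base next, a spot is exposed at a cyan step only when both of its probes need that base next, and border length counts only the internal boundaries of the 2×2 array. The function names are hypothetical.

```python
BASES = "ACGTACGT"
COLORS = "bgccgccb"
SPOTS = {"s1": ("AT", "CA"), "s2": ("AC", "CC"),
         "s3": ("GT", "GA"), "s4": ("TT", "TA")}  # (blue, green), assumed order

def masks(bases, colors, spots):
    """Set of exposed spots for every production step."""
    progress = {name: [0, 0] for name in spots}  # bases built so far (blue, green)
    result = []
    for base, color in zip(bases, colors):
        exposed = set()
        for name, (blue, green) in spots.items():
            i, j = progress[name]
            wants_blue = i < len(blue) and blue[i] == base
            wants_green = j < len(green) and green[j] == base
            if color == "b" and wants_blue:
                progress[name][0] += 1
                exposed.add(name)
            elif color == "g" and wants_green:
                progress[name][1] += 1
                exposed.add(name)
            elif color == "c" and wants_blue and wants_green:
                progress[name][0] += 1
                progress[name][1] += 1
                exposed.add(name)
        result.append(exposed)
    return result

def border_length(mask, grid):
    """Number of adjacent spot pairs with exactly one spot exposed."""
    total = 0
    for r, row in enumerate(grid):
        for c, spot in enumerate(row):
            if c + 1 < len(row) and (spot in mask) != (row[c + 1] in mask):
                total += 1
            if r + 1 < len(grid) and (spot in mask) != (grid[r + 1][c] in mask):
                total += 1
    return total

steps = [m for m in masks(BASES, COLORS, SPOTS) if m]  # drop the empty mask
for grid in ([["s1", "s4"], ["s3", "s2"]], [["s1", "s3"], ["s2", "s4"]]):
    print([border_length(m, grid) for m in steps])
```

Under these assumptions one step (the cyan G at step 7) exposes nothing, leaving seven masks. Their border lengths come out as 4, 4, 2, 2, 2, 2, 2 for the first placement and 2 for every mask in the second, which agrees with the L(M) values listed on the slide.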
Mask Design / Scheduling
Mask design:
Neighborhood graph: complete graph with M vertices, each labeled by two distinct sequences $(p_1, p_2)$
No `cyan’ steps: the weight of the edge between two vertices $(p_{11}, p_{12})$ and $(p_{21}, p_{22})$ is the sum of the Hamming distances between corresponding probes, $d_H(p_{11}, p_{21}) + d_H(p_{12}, p_{22})$
Issues: For reasons of controlled hybridization, different probes (blue and green) at the same spot should have a fairly large Hamming distance
(Milenkovic and Kashyap, 2005)
Border-length minimization becomes less effective
With `cyan’ colored steps involved, the distance measure also depends on the longest common subsequence of the probes at the same spot
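A minimal sketch of the edge weight in the case without `cyan’ steps, assuming that "sums of Hamming distances" pairs the blue probe of one spot with the blue probe of the other, and likewise for green. That pairing and the function names are assumptions, not taken from the slide.

```python
def hamming(p, q):
    """Number of positions where two equal-length probes differ."""
    return sum(a != b for a, b in zip(p, q))

def edge_weight(spot_u, spot_v):
    """Weight between two spots, each given as a (blue, green) probe pair."""
    (blue_u, green_u), (blue_v, green_v) = spot_u, spot_v
    return hamming(blue_u, blue_v) + hamming(green_u, green_v)

# Two hypothetical spots with two probes each.
print(edge_weight(("AT", "CA"), ("AC", "CC")))
```

With `cyan’ steps the weight would also depend on how much the two probes at a spot have in common, which this sketch does not model.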
Quality Control Coding
Theorem: Assume that there exists a linear error-control code with parameters [n,k,d] containing the all-ones codeword. Then one can construct a quality control array for a multiplexed DNA chip with $2(2^k-2)$ disjoint blue and green production steps and M probes such that the length of each quality control probe is $2^{k-1}-1$, and the weights w of the columns in the quality control array satisfy
$d \le w \le n-d$
Furthermore, with such an array any collection of fewer than n/(n-d) failed blue or green steps, respectively, can be uniquely identified.
Open question: how does one extend this result to schedules involving `cyan’ colored production steps, and to `spot’ failures?
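The counts in the theorem can be sanity-checked on a small code. The sketch below uses the [7,4,3] Hamming code, which contains the all-ones codeword, and reads the theorem's counts as $2(2^k-2)$ and $2^{k-1}-1$ (an assumption about superscripts flattened in the slide text). It further assumes that one color's steps correspond to the codewords other than all-zeros and all-ones, and that spots correspond to coordinates; this is an illustration, not the construction from the talk.

```python
from itertools import product

# Systematic generator matrix of the [7,4,3] Hamming code.
GEN = [(1, 0, 0, 0, 1, 1, 0),
       (0, 1, 0, 0, 1, 0, 1),
       (0, 0, 1, 0, 0, 1, 1),
       (0, 0, 0, 1, 1, 1, 1)]
N, K, D = 7, 4, 3

def codewords(gen):
    """All codewords of the binary linear code generated by `gen`."""
    n = len(gen[0])
    return [tuple(sum(m * row[i] for m, row in zip(msg, gen)) % 2
                  for i in range(n))
            for msg in product((0, 1), repeat=len(gen))]

words = codewords(GEN)
# Candidate steps for one color: drop the all-zeros and all-ones codewords.
steps = [w for w in words if 0 < sum(w) < N]

steps_per_color = len(steps)                              # 2^k - 2
probe_lengths = [sum(col) for col in zip(*steps)]         # ones per coordinate
step_weights = sorted({sum(w) for w in steps})            # codeword weights
print(steps_per_color, probe_lengths, step_weights)
```

Here each color gets 14 steps (28 in total), every coordinate is used in exactly 7 steps, and every codeword weight lies between d = 3 and n - d = 4. Since n/(n-d) = 7/4, the theorem would guarantee identification of a single failed step per color for this code.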