Date post: | 21-Dec-2015 |
Category: |
Documents |
View: | 218 times |
Download: | 2 times |
Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for
DNA Microarrays
Nazif Cihan Tas
ctas@csCMSC 838 Presentation
CMSC 838T – Presentation
Motivation
DNA microarrays techniques are used intensely for identification of biological agents Gene Expression Studies Diagnostic Purposes
Identification of Microorganisms in samples Item Extraction
Complex Problem Find the necessary probes and the temperature Probe sets should be reliably detect and differentiate target
sequences Large Databases NEW!! Homologous Genes (how to find specific probes)
CMSC 838T – Presentation
Talk Overview
Overview of talk Motivation Problem Statement Algorithm Mathematical Aspects Experimentation Discussion
CMSC 838T – Presentation
Problem Statement
Positive Probes Database set S0
Target S1
For each sequence in S1, find at least one probe
For S0 - S1 try to avoid it (but do not care if happens)
High Specificity: # of non-target matches are minimized High Sensitivity: # of covered target seq. is maximized
S0S1
CMSC 838T – Presentation
Problem Statement
Negative Probes Determine as few as possible probes which together hybridizes
with all sequences in S0 - S1 but with NONE in S1.
High Specificity: No seq. in S1 may hybridize
High Sensitivity: Max # of seq. in S0 - S1 be covered
S1S0
CMSC 838T – Presentation
Probe Design Constraints
Sequence Related Length of probes Deviation of melting temperature of probe-target hybrids must
be low (for physical reasons) No self complementary regions longer than four nucleotides
(not descriptive enough) Melting temperatures of target and non-target seq. must be
larger than a predefined (too close, too hard to identify) Ensuring a minimum number of mismatches is enough
(homologous sequences)
System Related Execution Time Usability
CMSC 838T – Presentation
Algorithm
Overview Probe Generation Hybridization Prediction Probe Selection
CMSC 838T – Presentation
AlgorithmProbe Generation
Subproblem: Generate probe candidates for the sequences Keep the set as small as possible without losing any optimal
candidate (exclude infeasible ones)
Suffix Tree Why?
Allows fast recognition of repetitive subsequences Identifies non-unique probes (i.e. with more than one target) Efficient for memory and for T computation (reduce time)
How? Tree is constructed from the sequences Traversed (Watson-Crick complement)
CMSC 838T – Presentation
Suffix Tree
Input: TACTACA TACTACA ACTACA CTACA TACA ACA CA A
$ denotes end of string
Constructed in linear time
CMSC 838T – Presentation
Probe Generation
Further Improvements Filters applied for cut off
Probe length (predefined) G-C content (for temperature) Self-complementarity
Probes should not contain complements as subsequences
Finally, remove highly conserved (non-specific) regions Insert into hashtables according to their lengths
CMSC 838T – Presentation
Algorithm
CMSC 838T – Presentation
AlgorithmHybridization Prediction
Subproblem: Search for the right probe Search is expensive, Intelligent Hashing used
Design A frame is moved over target and nontarget seqs. with several lengths
Previous algorithm (Kaderali 2002): Use the suffix tree At each step, hash values are calculated. If hit, predict melting
temperature, store in hybridization matrix. If there are too many hits for a probe, then it is not unique, remove it Why intelligent?
Hash time is linear Allows inexact matching because of hashing (No analysis)
Parallelization Several threads are searching for probe targets.
Tree and hashtables are fixed. One thread writes to the final matrix
CMSC 838T – Presentation
Hybridization Prediction
Empirical Simulation: One million random probe-target pairings generated Four mismatches or one insertion or deletion plus one strong
central mismatch chosen T<20 C for 93%
Complexity is O ( |S0| |S1| ) Possible probe candidates is |S1| (linear) Each position of database S0 must be checked
CMSC 838T – Presentation
Algorithm
CMSC 838T – Presentation
AlgorithmProbe Selection
Subproblem: Use the hybridization matrix to finalize the probe selection
We have positive probes and negative probes to proceed Algorithm Analysis:
For each probe candidate g: #of matches in S1
b: #of matches in S0 - S1
t: highest melting point in S1
Probes for which g or b values is too large, are removed Sort according to g,b and t. Apply Depth First Search
Advantages Performs well (No comparison though) Guarantees to choose all specific probes if any were found.
Disadvantages can NOT guarantee optimal selection in terms of coverage
CMSC 838T – Presentation
Negative Probe Selection
Let S2 =S0 - S1 and B subset of S2 . The probes that
detect S1 also detects some of B elements.
Algorithm for Negative Probes Apply probe generating and preselection for B.
Conduct hybridization on B U S1 .
Remove the probes which hybridizes with S1 .
Sort the remaining probes according to their hit number. Successively select the probes which covers most target seq.
Guarantees optimal solution for coverage and probe number usage
CMSC 838T – Presentation
AlgorithmProbe Selection
CMSC 838T – Presentation
Mathematical Aspects
CMSC 838T – Presentation
Experimentation
Parallelized on SMP platform Classic worker-producer Intel Dual Pentium III 933 MHz, 1 GB memory
Test data ssu rRNA of ARB project 20.282 ssu rRNA sequences 1.401 < lengths < 4.179 %97 of them are shorter than 2.000 bases
CMSC 838T – Presentation
Discussion
High Performance Execution is linear with size of database, decreases if longer
probes are used
Low Memory Consumption Depends on the size of the sequence selection, NOT database
size
Automatic Design of Group Probes and negative probes
High Quality Probe Design
CMSC 838T – Presentation
Discussion
Comparison with previous work vs. ARB
Not suited for large scale probe design vs. LCF
Does not consider highly conserved data Memory consumption is high Works well with short probes only
vs. others Mostly can not deal with insertion and deletions Execution is slow