Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif...

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for

DNA Microarrays

Nazif Cihan Tas

ctas@csCMSC 838 Presentation

http://www.cs.umd.edu/~tseng/chauwenben.jpg

CMSC 838T – Presentation

Motivation

DNA microarrays techniques are used intensely for identification of biological agents Gene Expression Studies Diagnostic Purposes

Identification of Microorganisms in samples Item Extraction

Complex Problem Find the necessary probes and the temperature Probe sets should be reliably detect and differentiate target

sequences Large Databases NEW!! Homologous Genes (how to find specific probes)


Talk Overview

Overview of talk Motivation Problem Statement Algorithm Mathematical Aspects Experimentation Discussion


Problem Statement

Positive Probes Database set S0

Target S1

For each sequence in S1, find at least one probe

For S0 - S1 try to avoid it (but do not care if happens)

High Specificity: # of non-target matches are minimized High Sensitivity: # of covered target seq. is maximized

S0S1


Problem Statement

Negative Probes Determine as few as possible probes which together hybridizes

with all sequences in S0 - S1 but with NONE in S1.

High Specificity: No seq. in S1 may hybridize

High Sensitivity: Max # of seq. in S0 - S1 be covered

S1S0


Probe Design Constraints

Sequence Related Length of probes Deviation of melting temperature of probe-target hybrids must

be low (for physical reasons) No self complementary regions longer than four nucleotides

(not descriptive enough) Melting temperatures of target and non-target seq. must be

larger than a predefined (too close, too hard to identify) Ensuring a minimum number of mismatches is enough

(homologous sequences)

System Related Execution Time Usability


Algorithm

Overview Probe Generation Hybridization Prediction Probe Selection


AlgorithmProbe Generation

Subproblem: Generate probe candidates for the sequences Keep the set as small as possible without losing any optimal

candidate (exclude infeasible ones)

Suffix Tree Why?

Allows fast recognition of repetitive subsequences Identifies non-unique probes (i.e. with more than one target) Efficient for memory and for T computation (reduce time)

How? Tree is constructed from the sequences Traversed (Watson-Crick complement)


Suffix Tree

Input: TACTACA TACTACA ACTACA CTACA TACA ACA CA A

$ denotes end of string

Constructed in linear time


Probe Generation

Further Improvements Filters applied for cut off

Probe length (predefined) G-C content (for temperature) Self-complementarity

Probes should not contain complements as subsequences

Finally, remove highly conserved (non-specific) regions Insert into hashtables according to their lengths


Algorithm


AlgorithmHybridization Prediction

Subproblem: Search for the right probe Search is expensive, Intelligent Hashing used

Design A frame is moved over target and nontarget seqs. with several lengths

Previous algorithm (Kaderali 2002): Use the suffix tree At each step, hash values are calculated. If hit, predict melting

temperature, store in hybridization matrix. If there are too many hits for a probe, then it is not unique, remove it Why intelligent?

Hash time is linear Allows inexact matching because of hashing (No analysis)

Parallelization Several threads are searching for probe targets.

Tree and hashtables are fixed. One thread writes to the final matrix


Hybridization Prediction

Empirical Simulation: One million random probe-target pairings generated Four mismatches or one insertion or deletion plus one strong

central mismatch chosen T<20 C for 93%

Complexity is O ( |S0| |S1| ) Possible probe candidates is |S1| (linear) Each position of database S0 must be checked


Algorithm


AlgorithmProbe Selection

Subproblem: Use the hybridization matrix to finalize the probe selection

We have positive probes and negative probes to proceed Algorithm Analysis:

For each probe candidate g: #of matches in S1

b: #of matches in S0 - S1

t: highest melting point in S1

Probes for which g or b values is too large, are removed Sort according to g,b and t. Apply Depth First Search

Advantages Performs well (No comparison though) Guarantees to choose all specific probes if any were found.

Disadvantages can NOT guarantee optimal selection in terms of coverage


Negative Probe Selection

Let S2 =S0 - S1 and B subset of S2 . The probes that

detect S1 also detects some of B elements.

Algorithm for Negative Probes Apply probe generating and preselection for B.

Conduct hybridization on B U S1 .

Remove the probes which hybridizes with S1 .

Sort the remaining probes according to their hit number. Successively select the probes which covers most target seq.

Guarantees optimal solution for coverage and probe number usage


AlgorithmProbe Selection


Mathematical Aspects


Experimentation

Parallelized on SMP platform Classic worker-producer Intel Dual Pentium III 933 MHz, 1 GB memory

Test data ssu rRNA of ARB project 20.282 ssu rRNA sequences 1.401 < lengths < 4.179 %97 of them are shorter than 2.000 bases


Discussion

High Performance Execution is linear with size of database, decreases if longer

probes are used

Low Memory Consumption Depends on the size of the sequence selection, NOT database

size

Automatic Design of Group Probes and negative probes

High Quality Probe Design


Discussion

Comparison with previous work vs. ARB

Not suited for large scale probe design vs. LCF

Does not consider highly conserved data Memory consumption is high Works well with short probes only

vs. others Mostly can not deal with insertion and deletions Execution is slow

Date post:	21-Dec-2015
Category:	Documents
View:	218 times
Download:	2 times

Accurate Method for Fast Design of Diagnostic Oligonucleotide Probe Sets for DNA Microarrays Nazif...

Documents