Identification of Distinguishing Motifs

Zhanyong WANG (Master Degree Student)

Dept. of Computer Science, City University of Hong Kong
E-mail: zhyong@cs.cityu.edu.hk

Joint work with WangSen FENG and Lusheng WANG


Outline

• The Definitions of Problems
• Applications
• Previous work
• Our work
• Algorithm for Single Group
• Algorithm for Two Groups
• Simulation Results for Single Group
• Simulation Results for Two Groups


Motif Identification

• Two versions

1. Single Group

2. Two Groups


Single Group

• Instance: a group of n sequences.

• Objective: find a length-L motif that occurs in each of the given sequences, such that the occurrences of the motif are similar to one another


Two Groups

• Instance: two groups of sequences:

B (Bad) and G (Good)

• Objective: find a motif of length L that appears in every sequence in group B and does not appear anywhere in the sequences of group G

The occurrences of the motif may contain errors
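A minimal sketch of how the two-group objective could be checked for a candidate motif. The slide does not say how "appears with errors" is measured, so this assumes Hamming distance with two hypothetical thresholds, `d_bad` and `d_good`; the function names and the example strings are illustrative only.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(motif: str, seq: str) -> int:
    """Smallest Hamming distance between the motif and any L-mer of seq."""
    L = len(motif)
    return min(hamming(motif, seq[j:j + L]) for j in range(len(seq) - L + 1))


def distinguishes(motif, bad, good, d_bad, d_good):
    """Assumed criterion: the motif occurs within d_bad errors in every bad
    sequence and is at least d_good away from every L-mer of every good one."""
    in_all_bad = all(min_distance(motif, s) <= d_bad for s in bad)
    far_from_good = all(min_distance(motif, s) >= d_good for s in good)
    return in_all_bad and far_from_good


bad = ["ttcaacccatt", "gcatcccagg"]   # made-up "bad" sequences
good = ["ggggggggggg", "tttttttttt"]  # made-up "good" sequences
print(distinguishes("caaccca", bad, good, d_bad=1, d_good=4))  # True
```

With `d_bad=0` the same motif is rejected, because the second bad sequence only contains the one-error variant `catccca`.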


Applications

1. Finding Targets for Potential Drugs

(T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)

-- bad strings in B are from bacteria

-- good strings in G are from humans

-- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings

-- use s to screen chemicals

-- the selected chemicals can then be tested as potential broad-range antibiotics


Applications

2. Creating Diagnostic Probes for Bacterial Infection

(T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990)

-- a group of closely related pathogenic bacteria

-- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences


Applications

3. Locating binding sites and regulatory signals

4. Creating Universal PCR Primers

5. Creating Unbiased Consensus Sequences

6. Anti-sense Drug Design


Previous work

• The closest substring problem was proved to be NP-hard. So are the single-group and two-group problems

(K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)

• Polynomial time approximation schemes

-- theoretical results

-- too slow to solve practical instances


Previous Programs

• Bailey and Elkan: MEME (1994) uses a modified EM algorithm and allows the motif to be absent in some of the given sequences

• Waterman: extended sample-driven approach (1984)

• Keich and Pevzner: two programs (2002)

• Buhler and Tompa: Projection (2002) combines EM and random projection

• Price, Ramabhadran and Pevzner: PatternBranching (2003) uses branching from sample strings and is faster than the previously best known program, Projection


Previous Programs (continued)

• Do not allow indels

• Only for the one group problem

• Some algorithms can handle one gap


Our work

• An extension of the EM approach

• A randomized algorithm for the single group problem which can handle indels

• We give an algorithm for the two groups problem


Representation of motifs

• Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a)

• Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is a number indicating the occurrence rate of letter i in column j (Figure b)

• Use the profile representation in the early stage of the EM algorithm

• Use the consensus pattern representation to improve the accuracy

(a) Aligned occurrences:

caaccca
caacccc
catcccg
catccct
cacccca
-------
caaccca   consensus pattern
catccca   another consensus pattern

(b) Profile:

A 0 1 0.4 0 0 0 0.4
C 1 0 0.2 1 1 1 0.2
G 0 0 0.0 0 0 0 0.2
T 0 0 0.4 0 0 0 0.2
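The two representations can be reproduced from the five occurrences in the figure. This is an illustrative sketch; the function names and the dictionary-of-rows layout for W are choices made here, not taken from the slides.

```python
from collections import Counter

ALPHABET = "acgt"


def profile(occurrences):
    """Build the 4 x L profile: W[b][x] is the rate of letter b in column x."""
    L = len(occurrences[0])
    n = len(occurrences)
    W = {b: [0.0] * L for b in ALPHABET}
    for x in range(L):
        counts = Counter(s[x] for s in occurrences)
        for b in ALPHABET:
            W[b][x] = counts[b] / n
    return W


def consensus(W):
    """Pick the most frequent letter per column (ties go to the first of a, c, g, t)."""
    L = len(W["a"])
    return "".join(max(ALPHABET, key=lambda b: W[b][x]) for x in range(L))


occurrences = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occurrences)
print(W["a"])        # [0.0, 1.0, 0.4, 0.0, 0.0, 0.0, 0.4]
print(consensus(W))  # caaccca
```

Column 3 is a tie between a and t (0.4 each), which is why the figure lists `catccca` as another valid consensus pattern.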


Computing the single group problem

The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H., 2004)

Input:

– n sequences S1, S2, ..., Sn

– a 4×L matrix W (the initial guess of the motif)

Output:

– a new matrix W that is a locally maximal solution

A 0.25 0.0 1.0

C 0.25 1.0 0.0

G 0.25 0.0 0.0

T 0.25 0.0 0.0


Step 1: An L-mer Sij is a length-L substring (the one starting at position j of sequence Si). For each L-mer Sij, calculate the likelihood that Sij is the occurrence of the motif:

$P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$

To avoid zero weights, a fixed small number (0.1) is added to the entries of W.

Step 2: Normalize the likelihood:

$P'(i,j) = P(i,j) \,/\, \sum_{x=1}^{m-L+1} P(i,x)$

so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$.

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025
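Steps 1 and 2 can be sketched as follows. One assumption is made to match the worked example: the small number 0.1 is used only in place of zero entries (the example multiplies 0.25 × 0.1 × 1, leaving the nonzero entries unchanged). The names `PSEUDO`, `lmer_likelihood` and `normalized_likelihoods` are illustrative.

```python
PSEUDO = 0.1  # the fixed small number from the slide


def lmer_likelihood(lmer, W):
    """Step 1: product over columns of W(letter, column).
    Assumption: zero entries are replaced by PSEUDO, as in the slide's example."""
    p = 1.0
    for x, b in enumerate(lmer):
        w = W[b][x]
        p *= w if w > 0 else PSEUDO
    return p


def normalized_likelihoods(seq, W):
    """Step 2: likelihood of every L-mer of one sequence, scaled to sum to 1."""
    L = len(W["a"])
    P = [lmer_likelihood(seq[j:j + L], W) for j in range(len(seq) - L + 1)]
    total = sum(P)
    return [p / total for p in P]


W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(lmer_likelihood("caa", W))            # 0.025 (up to float rounding), as on the slide
print(normalized_likelihoods("gcaat", W))   # three values that sum to 1
```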


Step 3: Re-estimate the motif matrix W.

$W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$

where Wij is constructed from Sij.

Example: Sij = c a a

W =
a 0.25 0 1
c 0.25 1 0
g 0.25 0 0
t 0.25 0 0

P(i,j) = 0.25 × 0.1 × 1 = 0.025

      Sij(1) Sij(2) Sij(3)
Sij =   c      a      a

Wij =
a 0     0.025 0.025
c 0.025 0     0
g 0     0     0
t 0     0     0


Step 4

Normalize W

$W'(b,x) = W(b,x) \,/\, \sum_{b' \in \{A,C,G,T\}} W(b',x)$

Replace W with W'
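A sketch of Steps 3 and 4 together. Each L-mer contributes its weight to the cell of its letter in every column, and the columns are then rescaled to sum to 1. Whether the weight is P(i,j) or the normalized P'(i,j) is not spelled out on the slide, so the weights are simply passed in; the function name and argument layout are illustrative.

```python
def reestimate(seqs, weights, L, alphabet="acgt"):
    """weights[i][j] is the weight of the L-mer starting at position j of seqs[i]."""
    W = {b: [0.0] * L for b in alphabet}
    # Step 3: sum the per-L-mer matrices W_ij.
    for seq, row in zip(seqs, weights):
        for j, p in enumerate(row):
            for x in range(L):
                W[seq[j + x]][x] += p
    # Step 4: normalize each column so that it sums to 1
    # (assumes every column received some positive weight).
    for x in range(L):
        col = sum(W[b][x] for b in alphabet)
        for b in alphabet:
            W[b][x] /= col
    return W


# Two one-L-mer sequences with equal weight.
W_new = reestimate(["caa", "cat"], [[0.5], [0.5]], L=3)
print(W_new["c"])  # [1.0, 0.0, 0.0]
print(W_new["a"])  # [0.0, 1.0, 0.5]
print(W_new["t"])  # [0.0, 0.0, 0.5]
```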


Step 5

Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends. Otherwise, go to Step 1 and start the next cycle.

Determine the amount of change:

$\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \epsilon$

Set $\epsilon = 0.05$ so that the algorithm stops within a few cycles.
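The stopping rule can be sketched as a small driver loop. The `cycle` argument stands for one pass of Steps 1 to 4; the `max_cycles` cap and the toy update used to exercise the loop are assumptions added here for illustration.

```python
def max_change(W_new, W_old):
    """Largest absolute entry-wise difference between two matrices."""
    return max(
        abs(W_new[b][x] - W_old[b][x])
        for b in W_new
        for x in range(len(W_new[b]))
    )


def run_em(W, cycle, eps=0.05, max_cycles=100):
    """Repeat cycle(W) until the matrix changes by less than eps."""
    for q in range(1, max_cycles + 1):
        W_next = cycle(W)
        if max_change(W_next, W) < eps:
            return W_next, q
        W = W_next
    return W, max_cycles


# Toy stand-in for Steps 1-4: move every entry halfway towards a fixed target.
target = {"a": [1.0], "c": [0.0]}


def toy_cycle(W):
    return {b: [(w + t) / 2 for w, t in zip(W[b], target[b])] for b in W}


W_final, cycles = run_em({"a": [0.5], "c": [0.5]}, toy_cycle)
print(cycles)        # 4
print(W_final["a"])  # [0.96875]
```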


Our Algorithm For Single Group (with indels)

The general framework is the same as in the previous algorithm

1. We get an initial guess of the motif W

2. With W as initial value, use the new EM algorithm to update W

3. Repeat 1–2 several (Maxtrials) times and choose the best result.
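The three steps above amount to a restart loop. The slides do not say how the initial guess is produced or how "best" is measured, so the sketch below takes those as caller-supplied functions; the toy stand-ins (a random L-mer as the guess, no refinement, and "number of sequences containing it" as the score) are purely hypothetical placeholders.

```python
import random


def best_of_trials(seqs, L, initial_guess, refine, score, max_trials=10, seed=0):
    """Run guess -> refine max_trials times and keep the highest-scoring result."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(max_trials):
        W0 = initial_guess(seqs, L, rng)   # step 1: initial guess
        W = refine(W0, seqs)               # step 2: EM refinement
        s = score(W, seqs)
        if s > best_score:                 # step 3: keep the best trial
            best, best_score = W, s
    return best, best_score


# Hypothetical stand-ins, only to make the loop executable.
def guess_random_lmer(seqs, L, rng):
    seq = rng.choice(seqs)
    j = rng.randrange(len(seq) - L + 1)
    return seq[j:j + L]


def no_refinement(W, seqs):
    return W


def count_containing(W, seqs):
    return sum(W in s for s in seqs)


seqs = ["ttacgtt", "gacgtgg", "ccacgta"]
best, best_score = best_of_trials(
    seqs, 4, guess_random_lmer, no_refinement, count_containing, max_trials=50
)
print(best, best_score)
```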


Incorporating Indels

• We add the "space" as a letter, so the matrix for the EM algorithm becomes 5×L

• k: the maximum total number of indels

• For each starting position, consider all substrings of length L+h, where h = 0, 1, -1, …, k, -k is the number of indels

• For each length-(L+h) substring, align it with the matrix
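Enumerating the candidate substrings for one starting position might look like this. The function name is illustrative, positions are 0-based, and candidates that would run past the end of the sequence are skipped (a boundary rule assumed here, not stated on the slide).

```python
def candidate_substrings(seq, start, L, k):
    """All substrings of length L+h starting at `start`, for h = -k..k."""
    out = []
    for h in range(-k, k + 1):
        end = start + L + h
        if L + h > 0 and end <= len(seq):
            out.append((h, seq[start:end]))
    return out


print(candidate_substrings("acgtacgt", start=1, L=4, k=1))
# [(-1, 'cgt'), (0, 'cgta'), (1, 'cgtac')]
```

Each starting position yields up to 2k+1 candidates, which is one reason the running time grows with k.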


Align a length L+h string with a 5×L matrix

• Dynamic programming

• Similar to pairwise string alignment

• d[i, j] is the score of aligning the first i columns in the matrix with the first j letters in the string

$d[i,j] = \max\{\, d[i-1,j-1] \times W[x,i],\; d[i-1,j] \times W[\triangle,i],\; d[i,j-1] \times e \,\}$

where x is the j-th letter of the string and △ is the space letter

Bottom-up order; d[L, L+h] is the score of the best alignment (with indels)
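A sketch of this recurrence. Several details are not given on the slide and are assumed here: `e` is treated as a constant weight for a string letter that matches no column, the boundary values are d[0][0] = 1 with the first row and column filled by the same rules, and the space letter is written as `"-"`. The matrix values are made up for the demonstration.

```python
SPACE = "-"  # stands for the "space" letter (the fifth row of W)


def align_score(W, s, e=0.05):
    """Best multiplicative score of aligning string s with the 5 x L matrix W."""
    L = len(W[SPACE])
    m = len(s)
    d = [[0.0] * (m + 1) for _ in range(L + 1)]
    d[0][0] = 1.0
    for i in range(1, L + 1):            # columns matched to spaces only
        d[i][0] = d[i - 1][0] * W[SPACE][i - 1]
    for j in range(1, m + 1):            # letters matched to no column
        d[0][j] = d[0][j - 1] * e
    for i in range(1, L + 1):
        for j in range(1, m + 1):
            d[i][j] = max(
                d[i - 1][j - 1] * W[s[j - 1]][i - 1],  # letter j in column i
                d[i - 1][j] * W[SPACE][i - 1],         # space in column i
                d[i][j - 1] * e,                       # extra letter in the string
            )
    return d[L][m]


W = {
    "a": [0.0, 0.9, 0.9],
    "c": [0.9, 0.0, 0.0],
    "g": [0.0, 0.0, 0.0],
    "t": [0.0, 0.0, 0.0],
    SPACE: [0.1, 0.1, 0.1],
}
print(align_score(W, "caa"))   # about 0.729: no indel
print(align_score(W, "ca"))    # about 0.081: one column aligned to a space
print(align_score(W, "cgaa"))  # about 0.03645: one extra letter
```

The table has (L+1) × (L+h+1) cells, so one alignment costs O(L · (L+h)) time.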


Continued

After calculating the motif W (profile representation: a matrix), we use the matrix W to find the occurrence of the motif in each sequence


Find the motif occurrences

• Find the occurrence of the motif in each string, scoring each L-mer by

$\sum_{i=1}^{L} W(a_i, i)$

where a1a2a3…aL is a length-L substring (L-mer) and W is the matrix for the motif
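A sketch of locating the occurrence in one sequence, assuming the occurrence is taken to be the L-mer with the highest score under the sum above (the slide gives the score but not the selection rule). The function name and the example sequence are illustrative.

```python
def best_occurrence(seq, W):
    """Return (position, L-mer, score) of the highest-scoring L-mer in seq."""
    L = len(W["a"])

    def score(j):
        return sum(W[seq[j + x]][x] for x in range(L))

    j = max(range(len(seq) - L + 1), key=score)  # first maximum on ties
    return j, seq[j:j + L], score(j)


W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
print(best_occurrence("ggacatt", W))  # (2, 'aca', 2.25)
```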


Algorithm for the two Groups (no indels)

• We follow the basic steps of EM method

• Modify the formula to re-construct W

• Re-estimate the matrix W from both group B and G


Main idea

When the motif represented by the matrix W is too close to some L-mers from group G (P(i,j) > ave), we scoop the pattern out of the matrix by subtracting the corresponding matrix Wij
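One way to read this rule as code: accumulate the per-L-mer matrices from group B as in the single-group case, then subtract those of any group-G L-mer whose likelihood exceeds `ave`. How `ave` is computed and how negative entries are handled are not stated on the slide; here `ave` is a parameter and negative entries are clamped to zero, both of which are assumptions.

```python
def reestimate_two_groups(bad, good, P_bad, P_good, L, ave, alphabet="acgt"):
    """P_bad[i][j] / P_good[i][j]: likelihood of the L-mer at position j of sequence i."""
    W = {b: [0.0] * L for b in alphabet}
    for seq, row in zip(bad, P_bad):          # add contributions from group B
        for j, p in enumerate(row):
            for x in range(L):
                W[seq[j + x]][x] += p
    for seq, row in zip(good, P_good):        # scoop out close L-mers of group G
        for j, p in enumerate(row):
            if p > ave:
                for x in range(L):
                    W[seq[j + x]][x] -= p
    for b in alphabet:                        # assumption: no negative weights
        W[b] = [max(w, 0.0) for w in W[b]]
    return W


W_sub = reestimate_two_groups(["caa"], ["cat"], [[0.5]], [[0.25]], L=3, ave=0.2)
print(W_sub["c"], W_sub["a"], W_sub["t"])
# [0.25, 0.0, 0.0] [0.0, 0.25, 0.5] [0.0, 0.0, 0.0]
```

Columns would still need to be normalized afterwards, as in Step 4 of the EM algorithm.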


Experiment Results (Single Group)

• Input:

(1) Randomly generate sequences: n = 20, m = 600

(2) Insert the motif into the sequences:

-- center string s (length L)

-- mutate d positions (insertion, deletion, mutation)

-- implant the mutated copy into the sequences

• Output: use our program to find the implanted pattern
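A simplified sketch of this test-data generator. It covers substitutions only (insertions and deletions are left out), forces each mutated position to change to a different letter, and overwrites part of the background so that every sequence keeps length m; all three are simplifying assumptions. The center string and function names are made up.

```python
import random


def random_dna(length, rng):
    return "".join(rng.choice("acgt") for _ in range(length))


def mutate(center, d, rng):
    """Substitute d distinct positions, each with a different letter."""
    s = list(center)
    for pos in rng.sample(range(len(s)), d):
        s[pos] = rng.choice([b for b in "acgt" if b != s[pos]])
    return "".join(s)


def implant(n, m, center, d, rng):
    """n random length-m sequences, each with one mutated copy of the center."""
    seqs, positions = [], []
    for _ in range(n):
        background = random_dna(m, rng)
        copy = mutate(center, d, rng)
        j = rng.randrange(m - len(copy) + 1)
        seqs.append(background[:j] + copy + background[j + len(copy):])
        positions.append(j)
    return seqs, positions


rng = random.Random(0)
center = "acgtacgtacgtacg"  # hypothetical center string, L = 15
seqs, positions = implant(n=20, m=600, center=center, d=2, rng=rng)
print(len(seqs), len(seqs[0]))  # 20 600
```

Keeping `positions` makes it possible to check afterwards whether a program recovered the implanted occurrences.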


Experiment Results (Single Group)

Table 1: 15 sequences: no indel; 5 sequences: one deletion

Table 2: 10 sequences: no indel; 5 sequences: one deletion; 5 sequences: one insertion

In Table 2, the running time increases significantly and the accuracy in many cases is slightly worse than in Table 1.


Experiment Results (Single Group)

• Table 3: 5 sequences: one deletion; 5 sequences: two deletions; 10 sequences: no indel

• Table 4: 5 sequences: one insertion; 5 sequences: two insertions; 10 sequences: no indel

The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences.


Experiment Results (Single Group)

•Table 5, the mixed case:

Probability:

one insertion : 1/8 one deletion : 1/8

two insertions : 1/8 two deletions: 1/8

one insertion and one deletion: 1/8

no indel: 3/8


Experiment Results (Two Groups)

• Center (m=600):

c1: the center for group B, random sequence

c2: the center for group G, obtained by randomly mutating 200 positions of c1

• Generate two groups

n = 10

Each sequence is obtained by randomly mutating 200 positions of the center
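A sketch of this two-group setup. It assumes that "mutate" means redrawing the letter uniformly from a, c, g, t, so a mutated position may keep its original letter and the Hamming distance between c1 and c2 is at most 200; it also reads n = 10 as the size of each group. Both readings are assumptions, and the seed is arbitrary.

```python
import random


def random_dna(length, rng):
    return "".join(rng.choice("acgt") for _ in range(length))


def mutate_positions(center, count, rng):
    """Redraw `count` distinct positions uniformly (a letter may stay the same)."""
    s = list(center)
    for pos in rng.sample(range(len(s)), count):
        s[pos] = rng.choice("acgt")
    return "".join(s)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(1)
m, n, changes = 600, 10, 200
c1 = random_dna(m, rng)                     # center of group B
c2 = mutate_positions(c1, changes, rng)     # center of group G
group_B = [mutate_positions(c1, changes, rng) for _ in range(n)]
group_G = [mutate_positions(c2, changes, rng) for _ in range(n)]
print(hamming(c1, c2))                      # at most 200
```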


Experiment Results (Two Groups)

From Table 6, we can see that it is easy to find a motif that can distinguish the two groups when L is large

Comparing Table 7 with Table 6, we can see that it is easy to find a distinguishing motif when the distance between the two centers is large

Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175

Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128


Summary

• An algorithm for the single group problem that can handle indels

• An algorithm for the two groups problem

