Fast Searching in Biological Sequences Using Multiple Hash Functions

Algorithms & Complexity Evaluation

A T A C G T T C A G A T T G C C A G C A C G T T

Presentation by Simone Tino - All rights reserved. Authored from November 2012 to December 2012 - University of Catania - Faculty of Computer Science - Algoritmi e Complessità

Grasping the problem

String matching? What's this?

We are going to deal with a very tiny alphabet representing the nucleotides in a genetic sequence.

The alphabet: A = ADENINE, T = THYMINE, G = GUANINE, C = CYTOSINE

DNA sequence: A T G A C G A C T

Patterns to search: T G T C G, A G G C A, T G A G C

Searching a sequence for multiple patterns.

The search window slides over the DNA sequence and is compared against the patterns to search.

After verifying matches, advance the window: pos++ (shift the window by 1 position).
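The sliding-window idea above can be sketched in a few lines of Python. This is only an illustrative baseline, not code from the presentation: the function name `naive_multi_search` and the extra example patterns are assumptions made for the sketch.

```python
def naive_multi_search(text, patterns):
    """Slide a window over text one position at a time and test every pattern.

    Hypothetical helper written for illustration only.
    Returns a list of (position, pattern) pairs.
    """
    matches = []
    pos = 0
    while pos < len(text):
        for p in patterns:
            # Verify the pattern against the text starting at the window position.
            if text.startswith(p, pos):
                matches.append((pos, p))
        pos += 1  # after verifying matches, advance the window: pos++
    return matches


# The sequence from the slide with three short, made-up patterns.
print(naive_multi_search("ATGACGACT", ["GAC", "ACT", "TGT"]))
```

Every position of the text is compared against every pattern, which is exactly the cost that the shift-based approach of the next slides tries to avoid.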


Let's talk about Wu & Manber

Don't worry! It's not a magic spell... it's just an algorithm.

First we have the pre-processing stage...

A pattern, NOT A TEXT!!!

T G A G C A C T G

q-gram size: q = 3

Extracting the first q-gram: T G A

Feeding the hash function with the extracted q-gram, a hash is returned: HASH(T G A) = #@!*$%£&?, with 0 <= hash <= MAX

The calculated hash is used as an index into the shift array, and the stored value is used to shift the window: sh[#@!*$%£&?] = shift

F[HASH('CTG')] = patterns[cur]

Then we can move to the real search...

Now... a text!

C T G A C C G C T C C
T G A G T A G T A G A
G T A G C G T G A G C
A C A A C T G G C G A

Window: A C A A C T G G C (window size = pattern size = m)

Extracting the last q-gram only: G G C

The hash function gets the q-gram and a hash is returned: HASH(G G C) = ^@!*%£$?#, with 0 <= hash <= MAX

The hash is the shift index: sh[^@!*%£$?#] = shift

shift == 0? true -> NAIVE CHECK
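The two stages above can be put together in a minimal Python sketch. It rests on several assumptions that the slides do not state: the hash packs 2 bits per nucleotide (so MAX = 4^q - 1), the window size m is the length of the shortest pattern, the default shift is m - q + 1, and each q-gram at position i stores the shift m - q - i. The names `qhash`, `preprocess` and `search` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding of the alphabet


def qhash(gram):
    """Assumed hash: pack a q-gram into an integer, 2 bits per nucleotide."""
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h


def preprocess(patterns, q=3):
    """Build the shift array sh and the bucket table F from the patterns."""
    m = min(len(p) for p in patterns)      # window size
    sh = [m - q + 1] * (1 << (2 * q))      # default: q-gram occurs in no pattern
    F = {}
    for p in patterns:
        for i in range(m - q + 1):
            h = qhash(p[i:i + q])
            sh[h] = min(sh[h], m - q - i)  # smallest safe shift for this q-gram
        # Bucket the pattern under the hash of the last q-gram of its m-prefix.
        F.setdefault(qhash(p[m - q:m]), []).append(p)
    return m, sh, F


def search(text, patterns, q=3):
    """Return (position, pattern) pairs for every occurrence in text."""
    m, sh, F = preprocess(patterns, q)
    found = []
    pos = 0                                # start of the window
    while pos + m <= len(text):
        h = qhash(text[pos + m - q:pos + m])   # last q-gram of the window only
        shift = sh[h]
        if shift == 0:                     # possible match: naive check
            for p in F.get(h, []):
                if text.startswith(p, pos):
                    found.append((pos, p))
            shift = 1
        pos += shift
    return found


text = "CTGACCGCTCCTGAGTAGTAGAGTAGCGTGAGCACAACTGGCGA"
print(search(text, ["ACAACTGGC", "TGAGTAGTA", "TGAGCACTG"]))
```

The sketch only accepts the characters A, C, G and T; any other symbol would need an extended encoding.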


W-M limit

A q-gram such as T G A is hashed into w bits, k bits per character:

k = Math.floor(w/q);

Increase q: more text to analyze

Increase k: more bits per char

Both decrease the number of false positives, but we cannot increase them both...
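The trade-off can be made concrete with a small Python sketch. The hash used here (keeping the low k bits of each character code) is an assumption chosen for illustration, as are the names `bits_per_char` and `qgram_hash` and the value w = 16.

```python
def bits_per_char(w, q):
    """k = floor(w / q): bits available to each character of a q-gram."""
    return w // q


def qgram_hash(gram, w=16):
    """Assumed hash: pack the low k bits of each character code into w bits."""
    k = bits_per_char(w, len(gram))
    mask = (1 << k) - 1
    h = 0
    for ch in gram:
        h = (h << k) | (ord(ch) & mask)
    return h


# With w fixed, a larger q leaves fewer bits per character.
for q in (3, 4, 8):
    print("q =", q, "k =", bits_per_char(16, q))
```

With w = 16 and q = 8 only k = 2 bits remain per character, and the codes of 'C' (67) and 'G' (71) share the same low two bits, so `qgram_hash("CCCCCCCC")` and `qgram_hash("GGGGGGGG")` collide. With q = 3 and k = 5 the two letters stay distinct.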

to be continued...


Enhancing W-M...

pre-processing

T G A G C A C T G

γ = 1

HASH(T G A) = #@!*$%

sh1[#@!*$%] = m-q-i

HASH('CTG') = h1

γ = 2

HASH(T G A) = #@!*$%

sh2[#@!*$%] = m-2q-i

HASH('GCA') = h2

h = (h1 << 1) + h2

F[h] = patterns[cur]

...now you can't go back
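As a worked illustration of the two shift formulas, the following Python sketch builds one table per hash function for a single pattern. To keep the values readable it indexes the tables by the q-gram itself rather than by a numeric hash, which is a simplification; the function name `shift_tables` and the generalisation to table j storing m - j*q - i are assumptions extrapolated from the two cases shown on the slide.

```python
def shift_tables(pattern, q, gamma):
    """Return [sh_1, ..., sh_gamma] for one pattern.

    Table j maps the q-gram starting at position i to m - j*q - i,
    keeping the minimum when a q-gram occurs more than once.
    A q-gram missing from table j would get the default m - j*q + 1.
    """
    m = len(pattern)
    tables = []
    for j in range(1, gamma + 1):
        sh = {}
        for i in range(m - j * q + 1):
            gram = pattern[i:i + q]
            shift = m - j * q - i
            sh[gram] = min(sh.get(gram, shift), shift)
        tables.append(sh)
    return tables


sh1, sh2 = shift_tables("TGAGCACTG", q=3, gamma=2)
print(sh1["TGA"], sh1["CTG"])  # first q-gram and last q-gram of the pattern
print(sh2["TGA"], sh2["GCA"])  # first q-gram and the q-gram before the last one
```

For the pattern on the slide (m = 9, q = 3), 'CTG' gets shift 0 in the first table and 'GCA' gets shift 0 in the second, which matches the two q-grams hashed into h1 and h2.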


...Enhancing W-M

search

In the end...

A text:

C T G A C C G C T C C
T G A G T A G T A G A
G T A G C G T G A G C
A C A A C T G G C G A

Window: A C A A C T G G C

HASH(G G C) = §+!#*£$?% = h1

sh1[§+!#*£$?%] = shift1

HASH(A C T) = ^@!*%£$?# = h2

sh2[^@!*%£$?#] = shift2

h = (h1 << 1) + h2

if (shift1 == 0 && shift2 == 0) foreach (p in F[h]) checkOccurrInWin(p);
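A minimal Python sketch of the search with two hash functions (γ = 2) follows. Several details are assumptions not given on the slides: the 2-bit-per-nucleotide hash, the default shifts m - q + 1 and m - 2q + 1, the requirement m >= 2q, and above all the rule for advancing the window when at least one shift is non-zero, which here takes the larger of shift1 and shift2. All function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def qhash(gram):
    """Assumed hash: 2 bits per nucleotide."""
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h


def preprocess(patterns, q):
    """Build sh1, sh2 and the bucket table F (requires m >= 2*q)."""
    m = min(len(p) for p in patterns)
    size = 1 << (2 * q)
    sh1 = [m - q + 1] * size
    sh2 = [m - 2 * q + 1] * size
    F = {}
    for p in patterns:
        for i in range(m - q + 1):
            h = qhash(p[i:i + q])
            sh1[h] = min(sh1[h], m - q - i)
        for i in range(m - 2 * q + 1):
            h = qhash(p[i:i + q])
            sh2[h] = min(sh2[h], m - 2 * q - i)
        h1 = qhash(p[m - q:m])              # last q-gram
        h2 = qhash(p[m - 2 * q:m - q])      # q-gram just before it
        F.setdefault((h1 << 1) + h2, []).append(p)
    return m, sh1, sh2, F


def search(text, patterns, q=3):
    """Return (position, pattern) pairs for every occurrence in text."""
    m, sh1, sh2, F = preprocess(patterns, q)
    found = []
    pos = 0
    while pos + m <= len(text):
        end = pos + m
        h1 = qhash(text[end - q:end])
        h2 = qhash(text[end - 2 * q:end - q])
        shift1, shift2 = sh1[h1], sh2[h2]
        if shift1 == 0 and shift2 == 0:
            for p in F.get((h1 << 1) + h2, []):
                if text.startswith(p, pos):  # plays the role of checkOccurrInWin(p)
                    found.append((pos, p))
            pos += 1
        else:
            pos += max(shift1, shift2)       # assumption: use the larger shift
    return found


text = "CTGACCGCTCCTGAGTAGTAGAGTAGCGTGAGCACAACTGGCGA"
print(search(text, ["ACAACTGGC", "TGAGTAGTA", "TGAGCACTG"]))
```

Compared with a single shift table, the naive check runs only when both q-grams agree with the end of some pattern, so fewer windows should need full verification.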


Complexities

Pre-processing

Space requirement: $O((MAX + 1) + r)$

Time requirement: $O(MAX + r \cdot m \cdot q)$

Search phase

Time requirement: $O(m(1) \cdot n)$, where $m(1) = \sum_{i=1}^{r} \mathrm{len}(p_i)$


Experimental results

Comparison of execution times between the best WM(q,γ) and MBNDM, one of the fastest algorithms currently in the literature.

[Figure: three plots of time against w (w = 8, 16, 32, 64, 128), comparing the best WM(q,γ) with MBNDM.]

|P| = 100: best variants labelled WM(6,1), WM(8,1), WM(8,1), WM(8,1), WM(8,1)

|P| = 1000: best variants labelled WM(4,2), WM(8,1), WM(8,2), WM(8,3), WM(8,3)

|P| = 10000: best variants labelled WM(4,2), WM(8,2), WM(8,2), WM(8,2), WM(8,2)

The End

A T A C G T T C A G A T T G C C A G C A C G T T

Date post:	29-Jul-2015