Aho-Corasick algorithm parallelization


Parallelization of a string-matching algorithm
Advanced Algorithms

Alessandro Liparoti


2. String-matching: AC algorithm

String-matching algorithms aim to find occurrences of given words (patterns) within a larger string (the text)

The Aho-Corasick algorithm (AC) is a classic solution to the exact set-matching problem.

Given

a pattern set P = {P1, …, Pk}

a text T[1…m]

the total length of the patterns n = Σ_{i=1..k} |Pi|

the AC algorithm's complexity is O(n + m + z), where z is the number of occurrences of the patterns in T
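As a quick worked example with the pattern set used in the FSA example below, P = {he, she, his, hers}: n = |he| + |she| + |his| + |hers| = 2 + 3 + 3 + 4 = 12.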


3. AC algorithm: finite-state machine

The AC algorithm builds a finite-state machine that efficiently stores the pattern set

The FSA is stored along with three functions

the goto function g(q, a) gives the state entered from the current state q by matching the target character a

the failure function f(q), defined for q ≠ 0, gives the state entered at a mismatch

the output function out(q) gives the set of patterns recognized when entering state q
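As a minimal sketch (not the author's code), these three functions could be stored in C as a dense transition table plus two arrays, assuming the 256-symbol byte alphabet described on slide 10; the names ac_fsa, gotofn, fail and out are illustrative, and the bitmask output assumes at most 64 patterns:

#include <stdlib.h>

#define ALPHABET 256                 /* one column per byte value, see slide 10 */

typedef struct {
    int  num_states;                 /* states created so far (state 0 = root)      */
    int (*gotofn)[ALPHABET];         /* gotofn[q][a] = next state, -1 if undefined  */
    int  *fail;                      /* fail[q] = state entered at a mismatch       */
    unsigned long *out;              /* out[q] = bitmask of patterns ending in q    */
} ac_fsa;

/* Allocate an empty automaton with room for max_states states. */
static ac_fsa *ac_new(int max_states)
{
    ac_fsa *m = malloc(sizeof *m);
    m->num_states = 1;
    m->gotofn = malloc(max_states * sizeof *m->gotofn);
    m->fail   = calloc(max_states, sizeof *m->fail);
    m->out    = calloc(max_states, sizeof *m->out);
    for (int q = 0; q < max_states; q++)
        for (int a = 0; a < ALPHABET; a++)
            m->gotofn[q][a] = -1;    /* no goto defined yet */
    return m;
}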


4. AC algorithm: FSA example

𝑃 = β„Žπ‘’, π‘ β„Žπ‘’, β„Žπ‘–π‘ , β„Žπ‘’π‘Ÿπ‘ 

Dashed arrows are fail transitions
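Since the original diagram is not reproduced here, the automaton for this pattern set under the usual construction can be summarized as follows:

goto paths: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9, 0 -s-> 3 -h-> 4 -e-> 5, 1 -i-> 6 -s-> 7

fail links: f(4) = 1, f(5) = 2, f(7) = 3, f(9) = 3, and f(q) = 0 for the remaining states

output: out(2) = {he}, out(5) = {she, he}, out(7) = {his}, out(9) = {hers}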


5. AC algorithm: matching phase

The AC algorithm uses the FSA to match the text against the keywords

𝐴𝐢_π‘šπ‘Žπ‘‘π‘β„Žπ‘–π‘›π‘” 𝑇 1β€¦π‘š

π‘ž ≔ 0; // initial state (root)

𝒇𝒐𝒓 𝑖 ≔ 1 𝒕𝒐 π‘š 𝒅𝒐

π’˜π’‰π’Šπ’π’† 𝑔 π‘ž, 𝑇 𝑖 = 0 𝒅𝒐

π‘ž ≔ 𝑓 π‘ž ; // follow a fail

π‘ž ≔ 𝑔 π‘ž, 𝑇 𝑖 ; // follow a goto

π’Šπ’‡ π‘œπ‘’π‘‘ π‘ž β‰  0 𝒕𝒉𝒆𝒏 π’‘π’“π’Šπ’π’• 𝑖, π‘œπ‘’π‘‘ π‘ž ;

𝒆𝒏𝒅𝒇𝒐𝒓

The number of iterations of the outer loop equals the length of the text
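A C sketch of this matching loop, written over the illustrative ac_fsa layout above (a hypothetical helper, not the author's implementation; an undefined goto is encoded as -1 and the root falls back to itself):

long ac_match(const ac_fsa *m, const unsigned char *text, long len)
{
    long hits = 0;
    int  q = 0;                                  /* start at the root */
    for (long i = 0; i < len; i++) {
        while (q != 0 && m->gotofn[q][text[i]] == -1)
            q = m->fail[q];                      /* follow a fail */
        if (m->gotofn[q][text[i]] != -1)
            q = m->gotofn[q][text[i]];           /* follow a goto */
        for (unsigned long b = m->out[q]; b; b &= b - 1)
            hits++;                              /* one hit per pattern ending at i */
    }
    return hits;
}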


6. Parallelization step

Idea: parallelize the matching phase of the AC algorithm (the FSA can be built once for each pattern data set)

The π‘š steps of the loop can be split in π‘˜ chunks, each one of length 𝑙 = π‘š π‘˜ and then each chunk can be processed by a thread

Feasible because a chunk can be independently analyzed

π‘š = 19 π‘˜ = 3 𝑙 = 7


7. Parallelization: problems

The splitting described above can lead to missing occurrences

Assume P = {adv, orit, ed}

Each thread would run AC on its own chunk

Thread 1: T = "advance"

Thread 2: T = "d algor"

Thread 3: T = "ithms"

None of the threads would find the occurrences of the second and third keywords, because "orit" and "ed" each straddle a chunk boundary

Some redundancy is needed for text that overlaps two chunks


8. Parallelization: solutions

The maximum overlap o that is needed is the length of the longest word in the pattern set minus 1

Each chunk will contain the last o characters of the previous one

However: "orit" is now correctly found by thread 3, but "ed" is incorrectly matched twice (by threads 1 and 2)

Correction: a thread starts counting matches only after the o overlap characters at the start of its chunk have been read (the first chunk has no overlap prefix, so it counts from the beginning)
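As a short worked check of these numbers: the longest keyword is "orit" (length 4), so o = 4 − 1 = 3; any occurrence that straddles a chunk boundary has at most 3 of its characters in the earlier chunk, so a 3-character overlap is always enough for one thread to see it in full. With the correction, thread 2 discards its copy of "ed" because it ends within the first o characters of its chunk, and the occurrence is counted exactly once.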


9. Implementation

AC has been implemented in C using OpenMP; the matching phase has been split among threads using the OpenMP for pragma

Input: text, keywords, number of threads

Output: number of occurrences

The chunk size l is computed with the following formula, where o is the overlap (so that k chunks of length l, overlapping by o, cover the whole text)

l = (m + o(k − 1)) / k, rounded up

The output variable is aggregated after the end of the loop (OpenMP reduction clause)
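A minimal OpenMP sketch of this parallel matching phase (illustrative only, reusing the hypothetical ac_fsa layout from the earlier sketches; the author's actual code is not reproduced in the slides):

#include <omp.h>

/* Count all pattern occurrences in text[0..m-1] using k threads.
 * o is the overlap (length of the longest pattern minus 1). */
long parallel_match(const ac_fsa *fsa, const unsigned char *text,
                    long m, int k, long o)
{
    long l = (m + o * (k - 1) + k - 1) / k;     /* l = (m + o(k-1)) / k, rounded up */
    long occurrences = 0;

    #pragma omp parallel for num_threads(k) reduction(+:occurrences)
    for (int t = 0; t < k; t++) {
        long start = t * (l - o);               /* consecutive chunks overlap by o  */
        long end   = (start + l < m) ? start + l : m;
        int  q     = 0;                         /* each thread starts at the root   */
        for (long i = start; i < end; i++) {
            while (q != 0 && fsa->gotofn[q][text[i]] == -1)
                q = fsa->fail[q];               /* follow a fail */
            if (fsa->gotofn[q][text[i]] != -1)
                q = fsa->gotofn[q][text[i]];    /* follow a goto */
            /* correction: a chunk other than the first does not count matches
             * ending within its first o (overlap) characters */
            if (fsa->out[q] && (t == 0 || i - start >= o))
                for (unsigned long b = fsa->out[q]; b; b &= b - 1)
                    occurrences++;              /* one per pattern ending at i */
        }
    }
    return occurrences;
}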


10. Implementation

Each read character is converted to its ASCII code

Therefore, the FSA allows 256 different transitions from each state

This makes it possible to use the AC algorithm even with non-textual files

Binary files must be read bytewise
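A sketch of how the input could be loaded bytewise with standard C I/O (read_bytes is a hypothetical helper, not taken from the slides); each byte value 0-255 then indexes a 256-entry goto row directly:

#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into memory as raw bytes, so binary input
 * is handled exactly like text. Returns NULL on failure. */
unsigned char *read_bytes(const char *path, long *len)
{
    FILE *fp = fopen(path, "rb");              /* "rb": read bytewise */
    if (!fp) return NULL;
    fseek(fp, 0, SEEK_END);
    *len = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    unsigned char *buf = malloc(*len);
    if (buf && fread(buf, 1, *len, fp) != (size_t)*len) {
        free(buf);
        buf = NULL;
    }
    fclose(fp);
    return buf;
}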


11. Test

Very large input files have been used to test the algorithm's performance

a text file containing the English version of the Bible

a dictionary containing the 10000 most common English words

A single test aggregates measurements from 10 runs of the algorithm on these inputs with the same number of threads


12. Test

[Chart: execution time (sec) vs. number of threads (1 to 10) on an i7 4700MQ, 4 cores/8 threads; mean and minimum curves]


13. Test

[Chart: execution time (sec) vs. number of threads (1 to 30) on a 12-core/24-thread machine; mean and minimum curves]


14. Conclusion

This work has shown a parallelization procedure for an algorithm designed as a serial one

The more threads are used, the faster the algorithm runs, up to a point beyond which no further improvement is obtained

Parallelization improves performance, but it requires modifications that are not always obvious from the start and that often introduce overhead

