Date posted: 14-Aug-2015
Category: Software
Uploaded by: alessandro-liparoti
<Name Surname>
2. String matching: the AC algorithm
String-matching algorithms are a class of algorithms that aim to find occurrences of words (patterns) within a larger string (text)
The Aho-Corasick (AC) algorithm is a classic solution to the exact set-matching problem.
Given
a pattern set P = {p1, ..., pk}
a text T[1…m]
the total length of the patterns n = |p1| + ... + |pk|
the AC algorithm's complexity is O(n + m + z), where z is the number of pattern occurrences in T
3. AC algorithm: finite-state machine
The AC algorithm builds a finite-state machine that compactly encodes the pattern set
The FSA is stored along with three functions:
the goto function g(q, a) gives the state entered from the current state q by matching the target character a
the failure function f(q), q ≠ 0, gives the state entered at a mismatch
the output function out(q) gives the set of patterns recognized when entering state q
4. AC algorithm: FSA example
P = {he, she, his, hers}
Dashed arrows are fail transitions
5. AC algorithm: matching phase
The AC algorithm uses the FSA to match the text against the keywords
AC_matching(T[1…m])
  q ← 0; // initial state (root)
  for i ← 1 to m do
    while g(q, T[i]) = fail do
      q ← f(q); // follow a fail
    q ← g(q, T[i]); // follow a goto
    if out(q) ≠ ∅ then print(i, out(q));
  end for
The for loop iterates exactly once per text character; failure transitions are amortized, so the matching phase is linear in the length of the text
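The matching loop, together with the FSA construction it relies on, can be sketched in C. This is an illustrative sketch, not the implementation from the slides: the state limit MAXS, the table layout, and the demo pattern set {he, she, his, hers} are our own assumptions.

```c
#include <string.h>

#define MAXS 64      /* max states: enough for a toy pattern set */
#define ALPHA 256    /* one column per byte value */

static int g[MAXS][ALPHA];  /* goto function; -1 means "fail" */
static int f[MAXS];         /* failure function */
static int out[MAXS];       /* #patterns recognized entering a state */
static int nstates;

/* Build the trie (goto), then the failure links by BFS. */
static void ac_build(const char **pat, int k) {
    memset(g, -1, sizeof g);
    memset(out, 0, sizeof out);
    nstates = 1;                               /* state 0 = root */
    for (int i = 0; i < k; i++) {
        int q = 0;
        for (const char *s = pat[i]; *s; s++) {
            unsigned char c = (unsigned char)*s;
            if (g[q][c] == -1) g[q][c] = nstates++;
            q = g[q][c];
        }
        out[q]++;                              /* pattern ends here */
    }
    int queue[MAXS], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++)            /* depth-1 states fail to root */
        if (g[0][c] != -1) { f[g[0][c]] = 0; queue[tail++] = g[0][c]; }
    while (head < tail) {
        int q = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            int u = g[q][c];
            if (u == -1) continue;
            int v = f[q];                      /* follow the parent's fails */
            while (v != 0 && g[v][c] == -1) v = f[v];
            f[u] = (g[v][c] != -1) ? g[v][c] : 0;
            out[u] += out[f[u]];               /* inherit outputs via fail */
            queue[tail++] = u;
        }
    }
}

/* The matching phase from the pseudocode: returns z, the number of
   pattern occurrences in the text. */
static int ac_count(const char *T) {
    int q = 0, z = 0;
    for (const char *s = T; *s; s++) {
        unsigned char c = (unsigned char)*s;
        while (q != 0 && g[q][c] == -1) q = f[q]; /* follow fails */
        if (g[q][c] != -1) q = g[q][c];           /* follow a goto */
        z += out[q];
    }
    return z;
}
```

For example, on the pattern set of slide 4 and the text "ushers", the automaton reports three occurrences: she, he and hers.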
6. Parallelization step
Idea: parallelize the matching phase of the AC algorithm (the FSA only needs to be built once per pattern set)
The m iterations of the loop can be split into t chunks, each of length c = ⌈m/t⌉, and each chunk can be processed by a separate thread
Feasible because each chunk can be analyzed independently
Example: m = 19, t = 3, c = 7
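The chunk arithmetic above can be sketched in a couple of helpers; the function names are illustrative, not from the slides.

```c
/* Split m loop iterations into t chunks of size c = ceil(m / t);
   the last chunk may be shorter. */
static int chunk_size(int m, int t) {
    return (m + t - 1) / t;                 /* ceiling division */
}

/* Length of the chunk assigned to 0-based thread `id`. */
static int chunk_len(int m, int t, int id) {
    int c = chunk_size(m, t);
    int start = id * c;
    int end = (start + c < m) ? start + c : m;
    return (end > start) ? end - start : 0;
}
```

With m = 19 and t = 3 this reproduces the example above: chunks of 7, 7 and 5 iterations.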
7. Parallelization: problems
The naive split described above can lead to missed occurrences
Assume P = {fav, orit, ed} and T = favoredbyalgorithms
Each thread would run AC on its own chunk
Thread 1: favored
Thread 2: byalgor
Thread 3: ithms
No thread would find the occurrence of the second keyword, orit, which straddles the boundary between the second and third chunks
Redundancy is needed for text that overlaps two chunks
8. Parallelization: solutions
The maximum needed overlap o is the length of the longest word in the pattern data set minus 1
Each chunk is extended with the last o characters of the previous one
However: orit is now correctly found by thread 3, but ed is incorrectly matched twice (by threads 1 and 2, since it lies entirely inside an overlap region)
Correction: each thread except the first starts counting matches only after the first o characters of its extended chunk
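The overlap-plus-correction scheme can be sketched as follows. A naive substring scan stands in for the AC automaton here, and the toy data P = {fav, orit, ed}, T = favoredbyalgorithms is illustrative: each thread may read up to o characters before its chunk, but only counts matches that end inside the chunk, so boundary-straddling matches are found exactly once.

```c
#include <string.h>

/* Count matches of pat[0..k-1] whose LAST character lies in
   T[lo..hi-1] (0-based, half-open); the scan never reads before
   `begin`, the start of the thread's extended chunk. */
static int scan_window(const char *T, int begin, int lo, int hi,
                       const char **pat, int k) {
    int z = 0;
    for (int i = 0; i < k; i++) {
        int L = (int)strlen(pat[i]);
        for (int end = begin + L - 1; end < hi; end++)
            if (end >= lo && strncmp(T + end - L + 1, pat[i], L) == 0)
                z++;   /* match ends inside this thread's own chunk */
    }
    return z;
}

/* Simulate t threads, each scanning its chunk extended by o = longest
   pattern length - 1 characters to the left. */
static int parallel_count(const char *T, const char **pat, int k, int t) {
    int m = (int)strlen(T), o = 0;
    for (int i = 0; i < k; i++)
        if ((int)strlen(pat[i]) - 1 > o) o = (int)strlen(pat[i]) - 1;
    int c = (m + t - 1) / t, total = 0;
    for (int id = 0; id < t; id++) {        /* one iteration per thread */
        int lo = id * c;
        int hi = (lo + c < m) ? lo + c : m;
        int begin = (lo - o > 0) ? lo - o : 0;  /* read o chars early */
        total += scan_window(T, begin, lo, hi, pat, k);
    }
    return total;
}
```

On the toy data, both the serial scan and the 3-thread split report the same three occurrences (fav, ed, orit), with no duplicates.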
9. Implementation
AC has been implemented in C using OpenMP; the matching phase is split among threads with the parallel-for pragma
Input: text, keywords, number of threads
Output: number of occurrences
The chunk size c is computed with the following formula
c = ⌈(m + ov(t - 1)) / t⌉, where ov is the overlap length
The per-thread occurrence counts are aggregated after the end of the loop (OpenMP reduction clause)
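The reduction described above can be sketched as below; counting single bytes stands in for running the automaton. Without -fopenmp the pragma is ignored and the loop runs serially, producing the same result.

```c
/* Each thread accumulates its own private copy of `count`; OpenMP
   sums the copies when the parallel loop ends (reduction clause). */
static long count_byte(const unsigned char *buf, long n, unsigned char b) {
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++)
        if (buf[i] == b) count++;
    return count;
}
```

The same pattern applies to the occurrence counter of the matching phase: each thread runs the automaton on its chunk and the counts are summed at the end.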
10. Implementation
Each character read is converted to its ASCII code
Therefore, the FSA allows 256 different transitions from each state
This makes the AC algorithm usable even with non-textual files: binary files are simply read bytewise
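A small C detail hides behind the bytewise reading: with 256 columns per state, each input byte must be cast to unsigned char before indexing, because plain char may be signed and bytes ≥ 0x80 would otherwise give a negative index. The helper name below is illustrative.

```c
/* Look up the next state in a 256-column transition table,
   mapping the raw byte to the range 0..255 first. */
static int next_state(int table[][256], int q, char raw) {
    unsigned char c = (unsigned char)raw;   /* safe index, 0..255 */
    return table[q][c];
}
```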
11. Test
Very large input files have been used to test the algorithm's performance
a text file containing the English version of the Bible
a dictionary of the 10,000 most common English words
A single test aggregates the measurements of 10 runs of the algorithm on these inputs with the same number of threads
12. Test
[Figure: execution time (sec) vs. number of threads (1 to 10) on an i7 4700MQ (4 cores / 8 threads); mean and minimum curves]
13. Test
[Figure: execution time (sec) vs. number of threads (1 to 30) on a 12-core / 24-thread machine; mean and minimum curves]
14. Conclusion
This work presented a parallelization procedure for an algorithm designed as serial
Adding threads makes the algorithm run faster up to a certain point, beyond which no further improvement is obtained
Parallelization improves performance, but it requires modifications that are not always obvious from the start and that often introduce overheads