PARALLEL CONSTRUCTION OF SIMULTANEOUS DETERMINISTIC
FINITE AUTOMATA ON SHARED-MEMORY MULTICORES
Minyoung Jung (1), Jinwoo Park (1), Johann Blieberger (2) and Bernd Burgstaller (1)
(1) Yonsei University, Korea
(2) Vienna University of Technology, Austria
46th International Conference on Parallel Processing
Bristol, United Kingdom, August 14-17, 2017
Motivation
String pattern matching with finite automata (FAs) is a
well-established method across many areas.
Text editors
Compiler front-ends
Internet search engines
Security and DNA sequence analysis
The sequential FA algorithm has linear complexity in
the size of the input.
Significant research effort has been spent on parallelizing
FA matching to improve on the sequential performance.
FA matching is hard to parallelize because each state transition
depends on the state produced by the previous one.
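The linear-time loop and its dependency can be sketched as follows. This is a minimal illustration, not the authors' code: the 3-state table over {a, b, c} and the name `run_dfa` are assumptions chosen for the example.

```python
# Hypothetical 3-state DFA over {a, b, c}: DELTA[state][symbol] -> next state.
DELTA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 0, "c": 2},
    2: {"a": 2, "b": 2, "c": 2},
}

def run_dfa(delta, start, text):
    """Sequential matching: one table lookup per input symbol, O(len(text))."""
    state = start
    for symbol in text:
        # Each lookup needs the state produced by the previous lookup.
        # This loop-carried dependency is what makes parallelization hard.
        state = delta[state][symbol]
    return state

# Matching the second half of the input requires the state reached
# after the first half, so the two halves cannot simply run side by side.
middle = run_dfa(DELTA, 0, "ab")
final_state = run_dfa(DELTA, middle, "ac")
```

A thread assigned the second half of the input does not know `middle` in advance, which is the limitation the following slides illustrate.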
Motivation (cont.)
Limitation of parallel FA matching
[Figure sequence: example DFA; callout "What is the start state?"]
Motivation (cont.)
Simultaneous Finite Automata (SFAs)
Accumulated state transition information
Simulates the parallel execution of |Q| DFAs on a single DFA
[Figure: SFA construction, from DFA to SFA]
[Figure sequence: parallel FA matching compared with parallel SFA matching]
[Figure: automata labeled "3 states" and "6 states"]
Our contributions
1. Introduce fingerprint-based hashing of SFA-states to speed up state comparisons.
2. Provide x86 SIMD-based transposition kernels for SFA-state construction to leverage
data-parallelism and cache-locality.
3. Perform in-memory compression of SFA-states to mitigate the space constraints of large problems.
4. Parallelize SFA construction for shared-memory multicores with lock-free synchronization on all
data structures, including thread-local queues that support work-stealing.
Sequential SFA construction
[Figure sequence: an example DFA and the SFA built from it, one algorithm step highlighted per slide; the math symbols in the step captions were lost in extraction]
Start with the initial state.
Until no more states to process:
  Choose an unprocessed state.
  Insert it into the processed set.
  Iterate with every symbol:
    Generate a next state with the symbol (find new states).
    Update the SFA transition function.
    Check existence and add the new state to the set (set membership test).
Set the initial and the final state.
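The construction steps can be sketched as a worklist loop. This is an illustration under stated assumptions rather than the authors' implementation: an SFA-state is modeled as a tuple `s` where `s[q]` is the DFA state reached from `q`, the initial SFA-state is the identity mapping, and an SFA-state is treated as final when its entry for the DFA start state is a DFA final state. The table `DELTA`, the start state 0, and the final set {2} are hypothetical.

```python
from collections import deque

# Hypothetical 3-state DFA over {a, b, c}: DELTA[q][symbol] -> next DFA state.
DELTA = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 0, "c": 2},
    {"a": 2, "b": 2, "c": 2},
]

def build_sfa(delta, alphabet, dfa_start, dfa_finals):
    initial = tuple(range(len(delta)))      # identity mapping (assumption)
    known = {initial}                       # set used for the membership test
    worklist = deque([initial])             # states still to process
    transitions = {}                        # (sfa_state, symbol) -> sfa_state
    while worklist:                         # until no more states to process
        current = worklist.popleft()        # choose an unprocessed state
        for symbol in alphabet:             # iterate with every symbol
            # Next SFA-state: apply the DFA transition to every component.
            nxt = tuple(delta[q][symbol] for q in current)
            transitions[(current, symbol)] = nxt
            if nxt not in known:            # set membership test
                known.add(nxt)
                worklist.append(nxt)
    # Final SFA-states (assumed definition, see lead-in).
    finals = {s for s in known if s[dfa_start] in dfa_finals}
    return initial, known, transitions, finals

initial, states, transitions, finals = build_sfa(DELTA, "abc", 0, {2})
```

For this particular table the loop discovers six SFA-states from the three DFA states. The membership test on `known` is the step that the fingerprint-based hashing optimization targets.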
Optimizing SFA construction
Two optimizations: fingerprint-based hashing and parameterized transposition
Fingerprint-based hashing
Fingerprints: short bit-strings for larger objects (SFA-states)
CityHash, FarmHash, Rabin’s method, etc. create fingerprints
Speed up comparisons of SFA-states: fingerprint comparisons instead of exhaustive SFA-state comparisons
Fingerprint-collisions
It follows from the properties of the hash function that if fingerprints are
different, SFA-states are different.
No exhaustive comparison necessary.
With small probability, different SFA-states generate the same fingerprint (fingerprint-collision).
If fingerprints are the same, SFA-states may be the same.
Exhaustive comparisons are required.
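The comparison rule above can be sketched as follows. The slides name CityHash, FarmHash and Rabin's method; none of these ship with Python, so `hashlib.blake2b` truncated to 64 bits is used here purely as a stand-in. The function names and the 4-byte encoding of DFA states are assumptions made for the example.

```python
import hashlib

def fingerprint(sfa_state):
    """64-bit fingerprint of an SFA-state given as a tuple of small ints.
    blake2b is a stand-in for the hash functions named in the slides."""
    data = b"".join(q.to_bytes(4, "little") for q in sfa_state)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")

def same_state(state_a, fp_a, state_b, fp_b):
    """Compare two SFA-states, using fingerprints as a cheap filter."""
    if fp_a != fp_b:
        return False            # different fingerprints: states must differ
    return state_a == state_b   # equal fingerprints: confirm exhaustively

s1, s2 = (0, 1, 2), (1, 1, 2)
equal_to_itself = same_state(s1, fingerprint(s1), s1, fingerprint(s1))
equal_to_other = same_state(s1, fingerprint(s1), s2, fingerprint(s2))
```

Fingerprints are computed once per SFA-state and stored with it, so most comparisons reduce to a single integer comparison.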
Fingerprint-based hashing (cont.)
Hashing of SFA-states: speeds up lookups, reduces the number of SFA-state comparisons
Hash key: fingerprint % size of the hash-table
Value: fingerprint, SFA-state
Hash-collisions: different SFA-states may map to the same hash-key due to the modulo operation.
Resolved by closed addressing with chaining
[Figure: hash-table (size=3) with buckets 0, 1, 2; a hash-collision is resolved by chaining entries in one bucket]
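A minimal sketch of such a table, assuming fingerprints are supplied by the caller as integers. The class name `StateTable` and its method are hypothetical; the fingerprints 4 and 7 are picked by hand so that both fall into bucket 1 of a size-3 table.

```python
class StateTable:
    """Hash-table with chaining; each entry stores (fingerprint, SFA-state)."""

    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]

    def insert_if_absent(self, state, fp):
        """Return True if the state was new and has been added."""
        chain = self.buckets[fp % len(self.buckets)]   # hash key
        for stored_fp, stored_state in chain:
            # Compare fingerprints first; compare full states only on a match.
            if stored_fp == fp and stored_state == state:
                return False
        chain.append((fp, state))
        return True

table = StateTable(3)
added_first = table.insert_if_absent((0, 1, 2), 4)    # 4 % 3 == 1
added_second = table.insert_if_absent((1, 1, 2), 7)   # 7 % 3 == 1, hash-collision
added_again = table.insert_if_absent((0, 1, 2), 4)    # already present
```

Both states end up chained in bucket 1, and the lookup tells them apart by fingerprint without comparing the full SFA-states.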
Parameterized transposition
Speed up creating next SFA-states of each SFA-state
DFA transition table (rows: DFA states, columns: symbols):
     a  b  c
  0  1  0  0
  1  1  0  2
  2  2  2  2
Non-optimized: compute next states one by one
Optimized: transpose the DFA transition table into the table of next SFA-states
according to the DFA-states of the source SFA-state
Transposed table (one next SFA-state per symbol):
  a  1  2  1
  b  0  2  0
  c  0  2  0
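The two variants can be sketched as below. The transposed rows shown on the slide are what one obtains for a source SFA-state of (0, 2, 0); that source state is inferred from the numbers and is not stated in the slides. Function names are hypothetical, and plain Python stands in for the SIMD kernels.

```python
# DFA transition table from the slide: TABLE[state][symbol index], symbols a, b, c.
TABLE = [
    [1, 0, 0],
    [1, 0, 2],
    [2, 2, 2],
]

def next_states_one_by_one(table, source):
    """Non-optimized: build each next SFA-state with its own pass over source."""
    result = []
    for sym in range(len(table[0])):
        result.append(tuple(table[q][sym] for q in source))
    return result

def next_states_transposed(table, source):
    """Optimized idea: gather the rows named by the source SFA-state, then
    transpose, which yields all next SFA-states (one per symbol) at once."""
    gathered = [table[q] for q in source]
    return [tuple(column) for column in zip(*gathered)]

source = (0, 2, 0)   # inferred source SFA-state (assumption)
slow = next_states_one_by_one(TABLE, source)
fast = next_states_transposed(TABLE, source)
```

Both functions return the same three next SFA-states; the transposed variant reads whole table rows, which is the access pattern that benefits from cache locality.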
Parameterized transposition (cont.)
Example transposed transition table
# DFA-states: 17, # symbols: 20
DFA transition table (17x20) is transposed into 20 next SFA-states (20x17)
x86 SIMD-intrinsics-based transposition kernels
[Figure sequence: the table partitioned into tiles handled by 8x8, 4x8 and 1x1 kernels]
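How a 17x20 table decomposes into tiles can be sketched in plain Python. The greedy split into pieces of 8, 4 and 1 is an assumption suggested by the kernel sizes on the slide, and the inner loops merely stand in for one SIMD kernel call per tile; the actual kernels and their tile shapes may differ.

```python
def split(n, sizes=(8, 4, 1)):
    """Greedily cut a dimension of length n into (offset, length) pieces."""
    pieces, offset = [], 0
    for size in sizes:
        while n - offset >= size:
            pieces.append((offset, size))
            offset += size
    return pieces

def transpose_tiled(table):
    """Transpose a rows x cols table tile by tile."""
    rows, cols = len(table), len(table[0])
    out = [[0] * rows for _ in range(cols)]
    for r0, height in split(rows):
        for c0, width in split(cols):
            # Plain loops stand in for one kernel call on a height x width tile.
            for r in range(r0, r0 + height):
                for c in range(c0, c0 + width):
                    out[c][r] = table[r][c]
    return out

# 17 DFA-states and 20 symbols, filled with distinct placeholder values.
example = [[r * 20 + c for c in range(20)] for r in range(17)]
transposed = transpose_tiled(example)
```

With these sizes the 17 rows split into 8 + 8 + 1 and the 20 columns into 8 + 8 + 4, so each tile fits one fixed-size kernel.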
Work (SFA-state) distribution
Observations:
1) The amount of work changes dynamically.
Few states are available at the beginning, but soon all cores are saturated.
2) Switching the work distribution scheme dynamically adapts to the
changing load and reduces the cache-coherence overhead.
Scheme 1: static distribution via a global queue
New SFA-states are pushed to the global queue.
Advantage: avoids coherence overhead at the front of the queue from
work-stealing attempts of idle threads.
The back of the queue is not contended because initially little work is
available.
[Figure: global queue with front and back, accessed by Thread 1 and Thread 2; label "Highly contended"]
Work (SFA-state) distribution (cont.)
Scheme 2: dynamic distribution via thread-local queues
Work-stealing: steal work from another thread's queue once the local queue
is empty.
Each work item is popped exactly once, which is guaranteed by lock-free
synchronization using compare-and-swap (CAS) operations.
Advantage: avoids coherence overhead from the highly contended back
of the global queue.
Dequeuing SFA-states from other thread-local queues (work-stealing)
makes the front of the queue highly contended (cache-coherence overhead)
when little work is available.
[Figure: thread-local queues with Thread 1 (owner) and Threads 0 and 2 (thieves); of two competing CAS attempts, one succeeds and one fails]
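The "popped exactly once" guarantee can be illustrated with a small simulation. Python exposes no hardware CAS, so the `AtomicInt` class below emulates one with a lock; it shows the claim-once semantics only and is not lock-free itself. All class and method names are hypothetical and not taken from the authors' C11 code.

```python
import threading

class AtomicInt:
    """Simulated atomic integer; a lock emulates the atomicity of hardware CAS."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to new only if it still equals expected."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True

class WorkQueue:
    def __init__(self, items):
        self.items = list(items)
        self.head = AtomicInt(0)

    def try_claim(self, observed_head):
        """Claim the item at observed_head; fails if head has moved since."""
        if observed_head >= len(self.items):
            return None
        if self.head.compare_and_swap(observed_head, observed_head + 1):
            return self.items[observed_head]
        return None

queue = WorkQueue(["state0", "state1"])
seen_by_a = queue.head.load()          # both thieves observe head == 0
seen_by_b = queue.head.load()
got_a = queue.try_claim(seen_by_a)     # CAS succeeds
got_b = queue.try_claim(seen_by_b)     # CAS fails, head is already 1
retry_b = queue.try_claim(queue.head.load())
```

The losing thread simply re-reads the head and retries, so no item is handed out twice and no thread blocks while holding a queue lock in the real lock-free setting.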
In-memory compression
SFA-state compression mitigates the state explosion problem
Dictionary-based compression shows high compression
ratios due to structural properties of FAs
FA-states tend to repeat in SFA-states
Compression requires additional costly computation
Initiate once a critical memory threshold is reached
[Figure: compression of an SFA-state; label "27 KB per SFA-state"]
In-memory compression (cont.)
Mitigate intractable problem sizes
Conduct SFA construction in three phases
First phase: construct an SFA with uncompressed SFA-states
Second phase: compress all generated SFA-states once a critical
memory threshold is reached (dictionary-based lossless compression)
Third phase: resume SFA construction with compressed SFA-states
[Figure: compress and decompress steps around the set membership test]
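The three phases can be sketched with Python's `zlib`, whose DEFLATE format is an LZ77-style dictionary coder; it stands in for whichever compressor the authors used. The class `StateStore`, the byte-count threshold, and the choice to run the membership test on compressed bytes (instead of decompressing stored states) are simplifying assumptions.

```python
import zlib

class StateStore:
    """Stores SFA-states as bytes; switches to compressed storage at a threshold."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.compressed = False
        self.states = set()      # raw bytes in phase 1, compressed bytes in phase 3
        self.raw_bytes = 0       # total uncompressed size of all stored states

    def _encode(self, raw):
        return zlib.compress(raw) if self.compressed else raw

    def add_if_absent(self, raw):
        """Set membership test plus insertion; returns True for a new state."""
        key = self._encode(raw)
        if key in self.states:
            return False
        self.states.add(key)
        self.raw_bytes += len(raw)
        if not self.compressed and self.raw_bytes > self.threshold:
            # Phase 2: compress every state generated so far.
            self.states = {zlib.compress(s) for s in self.states}
            self.compressed = True
        return True

store = StateStore(threshold_bytes=100)
s1 = bytes([0, 1, 2] * 20)                # 60 bytes, repetitive like SFA-states
s2 = bytes([2, 2, 2] * 20)
first = store.add_if_absent(s1)           # phase 1, stored uncompressed
second = store.add_if_absent(s2)          # crosses the threshold, triggers phase 2
duplicate = store.add_if_absent(s1)       # phase 3, found among compressed states
```

Because the compression is lossless and deterministic for a given input, equal states map to equal compressed bytes, so the membership test stays exact after the switch.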
Experimental evaluation
Benchmarks: 1250 patterns from PROSITE protein database
Their minimal DFAs are generated by Grail+.
Patterns that take several days to convert to minimal DFAs are excluded.
The proposed algorithm is implemented in C11 using POSIX threads.
Performance results are obtained with PAPI, which provides access to
hardware performance counters.
Evaluation platforms:
4-CPU (64 cores) AMD Opteron system
2-CPU (44 cores, 2 hyperthreads per core) Intel Xeon Broadwell E5-2699 v4 system
Linux CentOS version 7
Experimental evaluation (cont.)
Speedups of the optimized sequential algorithm over the previous algorithm
Hashing: max 4.1x on AMD, 3.1x on Intel
Combination of hashing and transposition: max 6.8x on AMD, 5.2x on Intel
[Figure: two panels, "On the AMD system" and "On the Intel system"]
Experimental evaluation (cont.)
Speedups of parallelization
Based on our fastest sequential algorithm using hashing and
parameterized transposition
On the AMD system: max. 108.9x
On the Intel system: max. 46.1x
Experimental evaluation (cont.)
Performance and size comparison with and w/o compression
Six benchmarks on the Intel system (four benchmarks are intractable
w/o compression; two tractable benchmarks are added for comparison)
Set our memory manager’s threshold to 200 GB to force compression
of the two tractable benchmarks
[Figure label: "Intractable w/o compression"]
Conclusion
Introduced fingerprints and hashing to reduce state
comparisons and set membership tests.
Parameterized transposition of the transition table ensures
cache locality of memory accesses.
Dynamic switch from a global work queue to thread-local
queues with work-stealing avoids contention on cache lines at
the front and back of the queue.
In-memory compression of SFA-states is switched on dynamically
once the states no longer fit into main memory.
Overall speedups including fingerprint-based hashing,
parameterized transposition and parallelization without
compression are up to 312x on AMD and 193x on Intel.
Compression ratios are up to 30 on the Intel system.
Acknowledgments
This research was supported by:
the Austrian Science Fund (FWF) project I 1035N23
the Next-Generation Information Computing
Development Program through the National
Research Foundation of Korea (NRF), funded by
the Ministry of Science, ICT & Future Planning
under grant NRF2015M3C4A7065522
Thank you!
Q&A