PARALLEL CONSTRUCTION OF SIMULTANEOUS DETERMINISTIC FINITE AUTOMATA ON SHARED-MEMORY MULTICORES

Minyoung Jung (1), Jinwoo Park (1), Johann Blieberger (2) and Bernd Burgstaller (1)

(1) Yonsei University, Korea

(2) Vienna University of Technology, Austria

46th International Conference on Parallel Processing

Bristol, United Kingdom, August 14-17, 2017

Motivation

String pattern matching with finite automata (FAs) is a well-established method across many areas:

Text editors

Compiler front-ends

Internet search engines

Security and DNA sequence analysis

The sequential FA matching algorithm has linear complexity in the size of the input.

Significant research effort has been spent on parallelizing FA matching to improve on sequential performance.

FA matching is hard to parallelize because of the dependency between state transitions: each transition needs the state produced by the previous one.

Limitation of parallel FA matching
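To make the dependency concrete, here is a minimal Python sketch of sequential DFA matching. The transition table `DELTA` and the helper name `run_dfa` are assumptions chosen for illustration. Every step needs the state produced by the previous step, so a thread that starts in the middle of the input does not know its start state.

```python
# Transition table of a small DFA: DELTA[state][symbol] -> next state.
# The concrete table is an assumption chosen for illustration.
DELTA = [
    {"a": 1, "b": 0, "c": 0},  # state 0
    {"a": 1, "b": 0, "c": 2},  # state 1
    {"a": 2, "b": 2, "c": 2},  # state 2
]


def run_dfa(text, start=0):
    """Return the state reached after reading text, one symbol at a time."""
    state = start
    for symbol in text:               # one table lookup per symbol: linear time
        state = DELTA[state][symbol]  # needs the state from the previous step
    return state


# Splitting the input "aac" into two chunks: the second chunk can only be
# matched correctly once the end state of the first chunk is known.
mid = run_dfa("aa")                    # state after the first chunk
end = run_dfa("c", start=mid)          # correct continuation
wrong_guess = run_dfa("c", start=0)    # guessing start state 0 gives another result
```

In this example the correct continuation and the guessed one end in different states, which is why chunk-wise matching needs extra information about possible start states.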

Motivation (cont.)

[Figure slides: DFA example]

What is the start state?

SFA construction

Simultaneous Finite Automata (SFAs)

Accumulated state transition information

Simulates the parallel execution of |Q| DFAs (one per possible start state) on a single DFA
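One way to read "accumulated state transition information" is as a mapping from every DFA state to the state reached from it. The Python sketch below follows that reading under stated assumptions: `DELTA`, `chunk_mapping`, and `match_in_chunks` are hypothetical names, and the mappings are computed on the fly here, whereas an SFA would have them precomputed as its states.

```python
# DELTA[state][symbol] -> next state; the table is an illustrative assumption.
DELTA = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 0, "c": 2},
    {"a": 2, "b": 2, "c": 2},
]


def chunk_mapping(chunk):
    """Return f with f[q] = state reached from DFA state q after reading chunk.

    This follows all |Q| possible start states at once.
    """
    mapping = list(range(len(DELTA)))  # identity: nothing has been read yet
    for symbol in chunk:
        mapping = [DELTA[q][symbol] for q in mapping]
    return tuple(mapping)


def match_in_chunks(chunks, start=0):
    """Match a chunked input without knowing each chunk's start state up front."""
    # Each call is independent of the others and could run on its own thread.
    mappings = [chunk_mapping(chunk) for chunk in chunks]
    state = start
    for mapping in mappings:  # short sequential pass: one lookup per chunk
        state = mapping[state]
    return state


result = match_in_chunks(["aa", "c"])
```

The final pass touches one mapping per chunk rather than one symbol per step, so the sequential part shrinks with the number of chunks.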

[Figure: a DFA and its SFA]

[Figure slides: parallel FA matching vs. parallel SFA matching; figure labels: 3 states, 6 states]

Our contributions

1. Introduce fingerprint-based hashing of SFA-states to speed up state comparisons.

2. Provide x86 SIMD-based transposition kernels for SFA-state construction to leverage data-parallelism and cache-locality.

3. Perform in-memory compression of SFA-states to mitigate the space constraints of large problems.

4. Parallelize SFA construction for shared-memory multicores with lock-free synchronization on all data-structures, including thread-local queues supporting work-stealing.

Sequential SFA construction

[Figure slides: step-by-step construction of the SFA from the example DFA; the slide captions follow]

Start with the initial state.

Until no more states to process

Insert into the processed set

Iterate with every symbol

Find new states

Update the SFA transition function

Check existence & add new state to the set (set membership test)

Generate the next state for a symbol

Choose the unprocessed state

Until no more states to process

Set the initial and the final state
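The captions above describe a worklist algorithm, which can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: the table `DELTA`, the names `build_sfa`, `dfa_start`, and `dfa_finals`, the choice of state 2 as the DFA final state, and the rule that an SFA-state is final when it maps the DFA start state to a DFA final state are all illustrative.

```python
from collections import deque

# DELTA[q][s]: next DFA state from state q on symbol index s (a=0, b=1, c=2).
# The table is an illustrative assumption.
DELTA = [
    [1, 0, 0],
    [1, 0, 2],
    [2, 2, 2],
]
SYMBOLS = "abc"


def build_sfa(delta, symbols, dfa_start, dfa_finals):
    initial = tuple(range(len(delta)))  # start with the identity mapping
    known = {initial}                   # every SFA-state discovered so far
    processed = set()
    worklist = deque([initial])
    transitions = {}
    while worklist:                     # until no more states to process
        state = worklist.popleft()      # choose an unprocessed state
        processed.add(state)            # insert into the processed set
        for s, symbol in enumerate(symbols):            # iterate over every symbol
            nxt = tuple(delta[q][s] for q in state)     # generate the next state
            transitions[(state, symbol)] = nxt          # update the transition function
            if nxt not in known:                        # set membership test
                known.add(nxt)
                worklist.append(nxt)
    # Assumed rule: final if the DFA start state is mapped to a DFA final state.
    finals = {st for st in processed if st[dfa_start] in dfa_finals}
    return initial, transitions, finals, processed


initial, transitions, finals, states = build_sfa(DELTA, SYMBOLS, dfa_start=0, dfa_finals={2})
```

For this three-state table the construction visits six SFA-states. The membership test and the next-state generation inside the loop are the two hot spots.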

Optimizing SFA construction

Parameterized transposition

Fingerprint-based hashing


Fingerprints: short bit-strings that stand in for larger objects (here, SFA-states)

CityHash, FarmHash, Rabin’s method, etc. create fingerprints

Speed up comparisons of SFA-states

[Figure slides: exhaustive SFA-state comparisons vs. fingerprint comparisons]

Fingerprint-collisions

A hash function is deterministic, so if two fingerprints are different, the SFA-states are different. No exhaustive comparison is necessary.

With small probability, different SFA-states generate the same fingerprint (a fingerprint-collision).

If two fingerprints are the same, the SFA-states may be the same, so an exhaustive comparison is required.
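The comparison rule can be sketched in a few lines of Python. As an assumption for illustration, BLAKE2b from the standard library with an 8-byte digest stands in for CityHash or FarmHash, and the names `fingerprint` and `same_state` are hypothetical.

```python
import hashlib


def fingerprint(sfa_state):
    """Return an 8-byte fingerprint of an SFA-state (a tuple of DFA-state ids).

    BLAKE2b is a stand-in here; it is not the hash function named on the slide.
    """
    data = b"".join(q.to_bytes(2, "little") for q in sfa_state)
    return hashlib.blake2b(data, digest_size=8).digest()


def same_state(state_a, fp_a, state_b, fp_b):
    """Compare two SFA-states, using their fingerprints as a cheap filter."""
    if fp_a != fp_b:
        return False            # different fingerprints: the states must differ
    return state_a == state_b   # equal fingerprints: confirm exhaustively


s1, s2 = (0, 1, 2), (1, 1, 2)
equal_to_itself = same_state(s1, fingerprint(s1), s1, fingerprint(s1))
equal_to_other = same_state(s1, fingerprint(s1), s2, fingerprint(s2))
```

The exhaustive comparison runs only when the fingerprints match, so most unequal pairs are rejected after comparing 8 bytes.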

Fingerprint-based hashing (cont.)

Hashing of SFA-states: speeds up lookups and reduces the number of SFA-state comparisons

Hash key: fingerprint % size of the hash-table

Value: (fingerprint, SFA-state)

[Figure: hash-table of size 3 with buckets 0, 1, 2]

Hash-collisions: different SFA-states may map to the same hash key due to the modulo operation.

Resolved by closed addressing with chaining
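A hash-table with this key and value layout might look like the following Python sketch. It is single-threaded and makes several assumptions: BLAKE2b replaces the fingerprint functions named earlier, and the class name `SfaStateTable` and method `insert_if_absent` are hypothetical.

```python
import hashlib


def fingerprint(sfa_state):
    """64-bit integer fingerprint; BLAKE2b is a stand-in hash function."""
    data = b"".join(q.to_bytes(2, "little") for q in sfa_state)
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class SfaStateTable:
    def __init__(self, size=3):
        self.buckets = [[] for _ in range(size)]  # one chain per hash key

    def insert_if_absent(self, sfa_state):
        """Return True if the state was new and has been added."""
        fp = fingerprint(sfa_state)
        chain = self.buckets[fp % len(self.buckets)]  # hash key: fingerprint % size
        for other_fp, other_state in chain:
            # Compare fingerprints first; compare full states only on a match.
            if other_fp == fp and other_state == sfa_state:
                return False
        chain.append((fp, sfa_state))  # value: (fingerprint, SFA-state)
        return True


table = SfaStateTable(size=3)
states = [(0, 1, 2), (1, 1, 2), (0, 0, 2), (0, 2, 2), (2, 2, 2), (1, 2, 2)]
first_pass = [table.insert_if_absent(s) for s in states]
second_pass = [table.insert_if_absent(s) for s in states]
```

With six states and three buckets at least one chain holds several entries, which is the hash-collision case that chaining resolves.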


Speed up creating next SFA-states of each SFA-state

DFA transition table:

      a  b  c
  0   1  0  0
  1   1  0  2
  2   2  2  2

Non-optimized: compute next states one by one

Optimized: transpose the DFA transition table according to the DFA-states of the source SFA-state, so that each row of the transposed table is the next SFA-state for one symbol

Transposed table:

  a   1  2  1
  b   0  2  0
  c   0  2  0
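The two variants can be compared in a short Python sketch. The source SFA-state `(0, 2, 0)` is inferred from the two tables on the slide (it is the only row selection that yields the transposed table), and the function names are hypothetical. Plain Python stands in for the SIMD kernels.

```python
# DFA transition table from the slide: DELTA[state][symbol index], a=0, b=1, c=2.
DELTA = [
    [1, 0, 0],
    [1, 0, 2],
    [2, 2, 2],
]


def next_states_one_by_one(delta, source):
    """Non-optimized: build the next SFA-state of each symbol separately."""
    n_symbols = len(delta[0])
    return [tuple(delta[q][s] for q in source) for s in range(n_symbols)]


def next_states_transposed(delta, source):
    """Optimized idea: gather whole rows, then transpose once."""
    gathered = [delta[q] for q in source]          # row q for each DFA-state of the source
    return [tuple(col) for col in zip(*gathered)]  # row s = next SFA-state for symbol s


source = (0, 2, 0)  # inferred from the slide's tables
slow = next_states_one_by_one(DELTA, source)
fast = next_states_transposed(DELTA, source)
```

Both functions return the rows `(1, 2, 1)`, `(0, 2, 0)`, `(0, 2, 0)` for a, b, c. The gathered rows are read contiguously, which is where the cache-locality benefit would come from.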

Parameterized transposition (cont.)

Example transposed transition table

# DFA-states: 17, # symbols: 20

DFA transition table (17x20) -> 20 next SFA-states (20x17)

x86 SIMD-intrinsics-based transposition kernels

[Figure: the table partitioned into 8x8, 4x8 and 1x1 tiles]
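How a 17x20 table could be covered by tiles is sketched below in Python. The greedy splitting policy, the names `tile_sizes` and `transpose_tiled`, and the test table are assumptions; the edge tiles produced here may differ from those in the figure, and the inner loops only stand in for a SIMD kernel specialized for each tile shape.

```python
def tile_sizes(n, sizes=(8, 4, 1)):
    """Greedily split a dimension of length n into blocks of 8, 4, or 1."""
    blocks = []
    for size in sizes:
        while n >= size:
            blocks.append(size)
            n -= size
    return blocks


def transpose_tiled(table):
    """Transpose a rows x cols table tile by tile."""
    rows, cols = len(table), len(table[0])
    out = [[None] * rows for _ in range(cols)]
    r0 = 0
    for h in tile_sizes(rows):
        c0 = 0
        for w in tile_sizes(cols):
            # Stand-in for a kernel specialized for an h x w tile.
            for r in range(r0, r0 + h):
                for c in range(c0, c0 + w):
                    out[c][r] = table[r][c]
            c0 += w
        r0 += h
    return out


# Hypothetical 17x20 table (17 DFA-states, 20 symbols) with arbitrary entries.
table = [[(r * 20 + c) % 17 for c in range(20)] for r in range(17)]
transposed = transpose_tiled(table)
```

With this policy the 17 rows split into 8 + 8 + 1 and the 20 columns into 8 + 8 + 4, so most of the table is covered by 8x8 tiles.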



Work (SFA-state) distribution

Observations:

1) The amount of work changes dynamically. Few states are available at the beginning, but soon all cores are saturated.

2) Switching the work distribution scheme dynamically adapts to the changing load and reduces the cache-coherence overhead.

Scheme 1: static distribution via a global queue

Advantage: avoids the coherence overhead at the front of the queue that work-stealing attempts of idle threads would cause.

The back of the queue is not contended because initially little work is available.

[Figure: global queue with front and back; Thread 1 and Thread 2 push new SFA-states to the back, which becomes highly contended]

Work (SFA-state) distribution (cont.)

Scheme 2: dynamic distribution via thread-local queues

Work-stealing: a thread steals work from another thread's queue once its local queue is empty.

Each work item is popped exactly once, by a single thread, because of lock-free synchronization using compare-and-swap (CAS) operations.

Advantage: avoids the coherence overhead from the highly contended back of the global queue.

Disadvantage: dequeuing SFA-states from other thread-local queues (work-stealing) makes the front of the queue highly contended (cache-coherence overhead) when little work is available.

[Figure: thread-local queues with Thread 1 (owner) and Threads 0 and 2 (thieves); of the competing CAS operations, one succeeds and the other fails]
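The claim protocol behind "popped exactly once" can be illustrated with a Python sketch. Python has no hardware CAS, so the class `AtomicInt` emulates one with a lock; this shows the protocol only and is not lock-free. All names are hypothetical, and the queues are filled up front instead of growing during the run.

```python
import threading


class AtomicInt:
    """Emulated atomic integer; a lock stands in for the CAS instruction."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False  # CAS fails: another thread changed the value
            self._value = new
            return True       # CAS succeeds


class WorkQueue:
    def __init__(self, items):
        self.items = list(items)
        self.head = AtomicInt(0)  # index of the next unclaimed item (front)

    def try_pop(self):
        while True:
            i = self.head.load()
            if i >= len(self.items):
                return None  # queue is empty
            if self.head.compare_and_swap(i, i + 1):
                return self.items[i]  # this thread alone claimed slot i
            # CAS failed: slot i went to another thread, so retry.


def worker(my_id, queues, claimed):
    # Drain the local queue first, then steal from the other queues.
    order = [my_id] + [q for q in range(len(queues)) if q != my_id]
    for q in order:
        while (item := queues[q].try_pop()) is not None:
            claimed[my_id].append(item)


def run(n_items, n_threads):
    # All work starts in queue 0, so the other threads must steal.
    queues = [WorkQueue(range(n_items))] + [WorkQueue([]) for _ in range(n_threads - 1)]
    claimed = [[] for _ in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(t, queues, claimed)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Calling `run(1000, 3)` returns one list of claimed items per thread. However the threads interleave, each item appears in exactly one list because only one CAS can advance `head` past a given slot.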

In-memory compression

SFA-state compression mitigates the state explosion problem.

Dictionary-based compression achieves high compression ratios due to structural properties of FAs: FA-states tend to repeat in SFA-states.

Compression requires additional, costly computation, so it is initiated only once a critical memory threshold is reached.

[Figure: compressing an SFA-state of 27 KB]

In-memory compression (cont.)

Mitigate intractable problem sizes

Conduct SFA construction in three phases:

First phase: construct an SFA with uncompressed SFA-states.

Second phase: compress all generated SFA-states (dictionary-based lossless compression) once a critical memory threshold is reached.

Third phase: resume SFA construction with compressed SFA-states.

[Figure labels: Decompress, Compress, Set membership test]
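The three phases might be organized as in the Python sketch below. Several things are assumptions: zlib's DEFLATE (whose LZ77 stage is dictionary-based) stands in for the compressor, the class `StateStore` and its methods are hypothetical, and the membership test compresses the candidate and compares compressed bytes, which may differ from how the slides combine decompression and comparison.

```python
import zlib


class StateStore:
    """Set of SFA-states that switches to compressed storage at a threshold."""

    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.compressed = False
        self.states = set()  # raw bytes in phase 1, compressed bytes in phase 3
        self.bytes_used = 0

    def _encode(self, raw):
        # zlib.compress is deterministic, so equal states give equal keys.
        return zlib.compress(raw) if self.compressed else raw

    def add_if_absent(self, raw):
        """Set membership test plus insertion; returns True if the state was new."""
        key = self._encode(raw)
        if key in self.states:
            return False
        self.states.add(key)
        self.bytes_used += len(key)
        if not self.compressed and self.bytes_used > self.threshold:
            self._compress_all()  # phase 2: triggered once by the threshold
        return True

    def _compress_all(self):
        self.states = {zlib.compress(s) for s in self.states}
        self.bytes_used = sum(len(s) for s in self.states)
        self.compressed = True

    def decompressed(self):
        """Return all stored states in raw form."""
        return {zlib.decompress(s) if self.compressed else s for s in self.states}


# Three hypothetical 27 KB states made of repeating 2-byte DFA-state ids.
raw_states = [bytes([i, 0]) * 13500 for i in range(3)]
store = StateStore(threshold_bytes=50_000)
added = [store.add_if_absent(s) for s in raw_states]
```

The second insertion crosses the threshold and triggers compression, so the third state is stored compressed from the start. Repetitive states like these shrink to a small fraction of their raw size.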

Experimental evaluation

Benchmarks: 1250 patterns from the PROSITE protein database

Their minimal DFAs are generated by Grail+.

Patterns that take several days to convert to minimal DFAs are excluded.

The proposed algorithm is implemented in C11 using POSIX threads.

Performance results are obtained with PAPI, which provides access to hardware performance counters.

Evaluation platforms:

4-CPU (64 cores) AMD Opteron system

2-CPU (44 cores, 2 hyperthreads per core) Intel Xeon Broadwell E5-2699 v4 system

Linux CentOS version 7

Experimental evaluation (cont.)

Speedups of the optimized sequential algorithm over the previous algorithm

Hashing: max. 4.1x on AMD, 3.1x on Intel

Combination of hashing and transposition: max. 6.8x on AMD, 5.2x on Intel

[Figure: speedup plots, one panel for the AMD system and one for the Intel system]


Speedups of parallelization

Based on our fastest sequential algorithm using hashing and parameterized transposition

On the AMD system: max. 108.9x

On the Intel system: max. 46.1x


Performance and size comparison with and without compression

Six benchmarks on the Intel system: four benchmarks are intractable without compression, and two tractable benchmarks are added for comparison.

Our memory manager’s threshold is set to 200 GB to force compression of the two tractable benchmarks.

[Figure: results per benchmark, with the label "Intractable w/o compression"]

Conclusion

Introduced fingerprints and hashing to reduce state comparisons and set membership tests.

Parameterized transposition of the transition table ensures cache locality of memory accesses.

A dynamic switch from a global work queue to thread-local queues with work-stealing avoids contention on cache lines at the front and back of the queue.

A dynamic switch to in-memory compression of SFA-states is made once they no longer fit into main memory.

Overall speedups including fingerprint-based hashing, parameterized transposition and parallelization, without compression, are up to 312x on AMD and 193x on Intel.

Compression ratios are up to 30 on the Intel system.

Acknowledgments

This research was supported by:

the Austrian Science Fund (FWF) project I 1035N23

the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning under grant NRF2015M3C4A7065522

Thank you!

Q&A
