picking pesky parameters · Logic based solution Automaton Caches Processing units ... Multi-stride...

Department of Electrical and Computer Engineering

September 14, 2016

Picking Pesky Parameters: Optimizing Regular Expression Matching in Practice


Outline

§ Introduction to regular expressions
§ Design space exploration
§ Results
§ Optimal Regular Expression Matching Configuration
§ Conclusion


What is regular expression matching?

§ A regular expression (abbreviated regex) is a pattern that describes a set of strings; matching tests whether an input string fits the pattern.
  • E.g., this regex matches a valid IPv4 address:
  • (([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])

§ Applications of regular expression matching:
  • Bibliographic search
  • Intrusion detection systems
  • Protocol identification
  • Content filtering

§ Many network security tools, such as Snort and Bro, use rule sets of regular expressions that match attacks.

§ These tools need to operate at link rates ranging from several to tens of gigabits per second to meet the performance requirements of the network.
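
To make the IP-address pattern on this slide concrete, here is a minimal sketch using Python's standard `re` module. The helper name `is_ipv4` and the sample strings are illustrative assumptions and do not come from the slides.

```python
import re

# Octet pattern from the slide: 0-9, 10-99, 100-199, 200-249, 250-255
OCTET = r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
# Three "octet." groups followed by a final octet
IPV4 = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

def is_ipv4(text):
    """Return True if the whole string is a dotted-decimal IPv4 address."""
    return IPV4.fullmatch(text) is not None

if __name__ == "__main__":
    # Hypothetical sample inputs
    for sample in ["192.168.1.1", "255.255.255.255", "256.1.1.1", "1.2.3"]:
        print(sample, is_ipv4(sample))
```

`fullmatch` is used so that the entire string has to fit the pattern; a network scanner would instead search for the pattern anywhere in a payload.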


How to implement a regex lookup engine?

1. Transform the rule set into a state machine (finite automaton).
2. Scan packet payloads by traversing the state machine.

§ The automaton can be non-deterministic (NFA) or deterministic (DFA).
§ Example: NFA and DFA of .*ab+[cd]e

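[Figure: NFA and DFA for .*ab+[cd]e over states 0 to 4, with edge labels a, b, [cd], e (plus a wildcard self-loop in the NFA and extra a-transitions in the DFA); the accepting state is marked.]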
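
The two-step procedure above can be sketched for the example pattern .*ab+[cd]e. This is a hand-written illustration, not the authors' engine: the state numbering (0 to 4, with 4 accepting) follows the figure, while the function names and the hard-coded transitions are assumptions made for this sketch.

```python
def nfa_step(active, ch):
    """Advance every active NFA state on one input character."""
    nxt = set()
    for s in active:
        if s == 0:
            nxt.add(0)              # wildcard self-loop from the leading .*
            if ch == "a":
                nxt.add(1)
        elif s == 1 and ch == "b":
            nxt.add(2)
        elif s == 2:
            if ch == "b":
                nxt.add(2)          # b+ self-loop
            if ch in "cd":
                nxt.add(3)
        elif s == 3 and ch == "e":
            nxt.add(4)
    return nxt

def dfa_step(state, ch):
    """Equivalent DFA: exactly one active state at any time."""
    if ch == "a":
        return 1                    # an 'a' restarts a potential match from any state
    if state in (1, 2) and ch == "b":
        return 2
    if state == 2 and ch in "cd":
        return 3
    if state == 3 and ch == "e":
        return 4
    return 0

def nfa_match_ends(text):
    """Positions where the NFA reaches the accepting state 4."""
    active, ends = {0}, []
    for i, ch in enumerate(text):
        active = nfa_step(active, ch)
        if 4 in active:
            ends.append(i)
    return ends

def dfa_match_ends(text):
    """Positions where the DFA reaches the accepting state 4."""
    state, ends = 0, []
    for i, ch in enumerate(text):
        state = dfa_step(state, ch)
        if state == 4:
            ends.append(i)
    return ends
```

The NFA loop does work proportional to the number of active states per character, whereas the DFA loop performs a single transition per character.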


What is the problem?

§ Too many algorithms have been proposed to tune regex matching.
§ There are too many different system implementations for regex matching:
  • Different hardware
  • Different types of processors
  • Different memory configurations

§ The performance metrics used in previous publications differ:
  • Reducing memory requirements
  • Improving average-case and worst-case throughput
  • Reducing power and energy consumption

§ It is very difficult to determine which technique or system implementation to use.


What does our work do?

§ Our work addresses the problem of choosing which regular expression technique to use for a given system, rule set, and traffic configuration.

§ We present a systematic evaluation of many widely used regular expression techniques using real-world rule sets.

§ We evaluate the throughput, memory size, energy consumption, and estimated chip area of each configuration.

§ We provide a method for choosing the right configuration based on the results from our experiments.






Two types of solutions

§ Memory based solution

§ Logic based solution

[Figure: block diagrams of the two solution types. Labels: Automaton, Caches, Processing units, Memory, Bus, Input / Match.]


Design space

[Figure: design space overview, organized by domain.]
§ Inputs: regular expression ruleset (partitioned ruleset), traffic traces
§ Automaton: NFA, 2-NFA, DFA, 2-DFA, A-DFA, 2-A-DFA
§ Implementation:
  • Memory-based (9 configurations): non-compressed layout, linear encoding, bitmapped encoding
  • Logic-based (4 configurations): stride-1, SW-based multi-stride, HW-based multi-stride
§ System:
  • Memory-based: cache size, memory bandwidth, number of cores
  • Logic-based: FPGA clock rate
§ Evaluation: synthesis tool, processor simulator, real processor
§ Result: throughput speed, memory and area cost, power consumption


Automaton domain

§ NFA (Nondeterministic Finite Automaton)
  • Generated from the regex ruleset.
  • The number of states is small, but multiple states can be active at the same time.
§ DFA (Deterministic Finite Automaton)
  • Generated from the NFA.
  • Allows only one active state at a time: stable performance.
  • Size can grow exponentially if some complex patterns exist (called state explosion).
  • Large rulesets need to be partitioned into several parts, generating multiple DFAs.
  • A-DFA: a compression technique that lets a DFA state store fewer than 256 transitions. It should be used with a compressed memory layout.
§ Multi-stride NFA/DFA (or k-NFA/k-DFA)
  • Processes k input characters at a time.
  • If the initial alphabet is Σ, a k-NFA/k-DFA is equivalent to an FA defined on the alphabet Σ^k.
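
The multi-stride idea can be illustrated by composing a stride-1 DFA with itself. The sketch below assumes a toy five-letter alphabet in place of the 256 byte values and reuses the example pattern .*ab+[cd]e; all names are hypothetical. It also records whether an accepting state was visited in the middle of a character pair, since a stride-2 automaton would otherwise miss such matches.

```python
from itertools import product

ALPHABET = "abcde"  # toy alphabet (assumption); real engines use 256 byte values

def dfa_step(state, ch):
    """Stride-1 DFA for .*ab+[cd]e; state 4 is accepting."""
    if ch == "a":
        return 1
    if state in (1, 2) and ch == "b":
        return 2
    if state == 2 and ch in "cd":
        return 3
    if state == 3 and ch == "e":
        return 4
    return 0

def build_stride2(num_states=5):
    """Build a 2-DFA table: one entry per (state, character pair)."""
    table = {}
    for state, (c1, c2) in product(range(num_states), product(ALPHABET, repeat=2)):
        mid = dfa_step(state, c1)
        end = dfa_step(mid, c2)
        # Keep a flag so that a match ending on the first character is not lost
        table[state, c1 + c2] = (end, mid == 4 or end == 4)
    return table

def stride2_matches(text):
    """Scan two characters per step; handle an odd trailing character with stride 1."""
    table = build_stride2()
    state, matched = 0, False
    for i in range(0, len(text) - 1, 2):
        state, hit = table[state, text[i:i + 2]]
        matched = matched or hit
    if len(text) % 2:
        state = dfa_step(state, text[-1])
        matched = matched or state == 4
    return matched
```

With this toy alphabet the stride-2 table holds 5 × 25 = 125 entries instead of 5 × 5 = 25, which shows why the alphabet Σ^k makes memory cost grow quickly with k.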


Implementation domain -- Memory based solution

§ Three memory layouts:
  1. Non-compressed layout
     • Stores all |Σ| transitions of each state.
  2. Linear encoding
     • Encodes only the existing transitions in an NFA, or the default transition and the other (non-default) transitions in an A-DFA.
     • A linear search is performed until a transition matching the input character is found or its absence is verified.
  3. Bitmapped encoding
     • Similar to linear encoding, but uses a bitmap to avoid the linear search.
     • Applies only to stride-1 DFAs.

§ 9 configurations in total
  • Non-compressed – NFA, DFA, 2-NFA, 2-DFA
  • Linear encoding – NFA, A-DFA, 2-NFA, 2-A-DFA
  • Bitmapped encoding – A-DFA

[Figure: three DFA memory layouts, each built from 32-bit words.]
  • DFA non-compressed layout: each state holds 256 words, one transition (Tx) per input byte from 0x00 to 0xFF.
  • DFA linear encoding: a Tx address map (addr of state 0 ... addr of state n) points to each state, which holds a default Tx followed by only the stored transitions (e.g. Tx for 0x00, 0x01, 0x03, 0xFF).
  • DFA bitmapped encoding: each state holds a Level1 bitmap (1 word), a Level2 bitmap (8 words), a default Tx, and only the stored transitions.
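
The three layouts can be compared with a small lookup sketch for a single state. This is a simplified illustration under stated assumptions: the default entry is treated as the direct next state (in an A-DFA a default transition normally leads to another state where the lookup is retried), and a single 256-bit bitmap stands in for the two-level bitmap of the figure. All function and variable names are hypothetical.

```python
def lookup_noncompressed(table, state, byte):
    """Non-compressed layout: one entry per input byte, direct indexing."""
    return table[state][byte]

def lookup_linear(stored, default, byte):
    """Linear encoding: scan the stored (byte, target) pairs, else use the default."""
    for b, target in stored:
        if b == byte:
            return target
    return default

def build_bitmap(stored):
    """Bitmapped encoding: bit b is set when a transition for byte b is stored."""
    ordered = sorted(stored)
    bitmap = 0
    for b, _ in ordered:
        bitmap |= 1 << b
    return bitmap, [target for _, target in ordered]

def lookup_bitmap(bitmap, targets, default, byte):
    """Test one bit, then count the set bits below it to index the stored targets."""
    if not (bitmap >> byte) & 1:
        return default
    rank = bin(bitmap & ((1 << byte) - 1)).count("1")
    return targets[rank]

# State 2 of the DFA for .*ab+[cd]e: a->1, b->2, c->3, d->3, anything else ->0
STORED = [(ord("a"), 1), (ord("b"), 2), (ord("c"), 3), (ord("d"), 3)]
DEFAULT = 0
FULL_ROW = [DEFAULT] * 256
for b, target in STORED:
    FULL_ROW[b] = target
TABLE = {2: FULL_ROW}
BITMAP, TARGETS = build_bitmap(STORED)
```

In this example the non-compressed row stores 256 entries, while the two compressed forms store four transitions plus a default; the bitmap replaces the linear scan with one bit test and one population count.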


Implementation domain -- Logic based solution

§ Logic based solutions only use NFA
  • Stride-1 implementation
  • Software-based multi-stride approach
    • First generate a k-NFA, then encode it in logic.
    • Resource costly; can only support stride-2.
  • Hardware-based multi-stride approach
    • Uses a stride-1 NFA and the corresponding alphabet translation table.
    • Resource efficient; can support up to stride-4.

§ 4 configurations in total
  • Stride-1 implementation
  • Software-based – 2-NFA
  • Hardware-based – 2-NFA, 4-NFA


System domain

§ Memory based solution
  • Different cache sizes for level-1 and level-2 cache
  • Memory bandwidth
  • Different number of cores

§ Logic based solution
  • Different FPGA clock rates








Evaluation Methodology

§ Real hardware
  • TI OMAP 4460 ARM processor
  • Xilinx Virtex 5 FPGA (XC5VLX50)
  • Speed, memory usage/slice usage, and power are measured.

§ Simulator
  • SimpleScalar simulator, calibrated with real hardware.
  • Used to study the parameters that cannot be changed on real hardware:
    • Cache size
    • Memory bandwidth

§ Inputs
  • We use both real rulesets (from Snort, L7-filter, and Bro) and some synthetic rulesets with different characteristics.
  • Traffic traces are generated by the traffic generator written by Becchi et al.


Results from real hardware – Memory based solutions

§ TI OMAP 4460 ARM processor
§ Rulesets with very high mNFA and very low mDFA should use DFA, and rulesets with very high mDFA and very low mNFA should use NFA.
  • mNFA: the average number of active states in the NFA
  • mDFA: the number of DFAs

[Figure: three bar charts per ruleset (snort, l7-filter, bro, exact-match, dotstar 0.1, dotstar 0.2, dotstar 0.3, dotstar 0.6): Speed (Mbps), Memory (MB, logarithmic axis), and Power (mW). Legend: NFA NC, NFA LE, 2-NFA NC, 2-NFA LE, DFA NC, A-DFA LE, A-DFA BM, 2-DFA NC, 2-A-DFA LE, where NC = non-compressed, LE = linear encoding, BM = bitmapped.]

Ruleset       #reg-ex   Length min   Length max   Length avg   mDFA   mNFA
snort             462           10          202         44.1     12    2.76
l7-filter         111            6          438         63.2      7    6.02
bro               782            5          211         34.8      8   20.34
exact-match       500           10          256         49.2      2    1.76
dotstar 0.1       500           10          243         49.6     11    8.42
dotstar 0.2       500           11          212         49.0     24   15.64
dotstar 0.3       500           11          251         47.1     33   12.76
dotstar 0.6       500           11          274         50.3     49   26.76


Results from real hardware – Logic based solutions

§ Xilinx Virtex 5 (XC5VLX50)
§ $\mathit{speed} = \mathit{clk}_{\mathit{freq}} \times \mathit{stride} \times 8\ \text{bits}$
§ Smaller circuits can operate at higher frequencies.
§ The hardware-based stride-4 implementation leads to the best results.

[Figure: three bar charts per ruleset (snort, l7-filter, bro, exactmatch, dotstar0.1, dotstar0.2, dotstar0.3, dotstar0.6): Speed (Mbps), Slice Usage, and Power (mW). Legend: stride-1, SW stride-2, HW stride-2, HW stride-4. Three bars are labeled "missing".]
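
The throughput relation for logic-based engines (clock frequency times stride times 8 bits per character) can be checked with a few lines of arithmetic. The clock rates used below are hypothetical examples, not measurements from the slides.

```python
def throughput_mbps(clk_freq_mhz, stride, bits_per_char=8):
    """speed = clk_freq x stride x 8 bits; MHz in, Mbps out."""
    return clk_freq_mhz * stride * bits_per_char

if __name__ == "__main__":
    # Hypothetical clock rates: a larger stride usually means a bigger, slower circuit
    for stride, clk in [(1, 150), (2, 125), (4, 100)]:
        print(f"stride-{stride} at {clk} MHz: {throughput_mbps(clk, stride)} Mbps")
```

A higher stride only pays off while the clock rate drops by less than the factor by which the stride increases.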


Results from real hardware – Logic based solutions

§ Different frequencies: power vs. speed trade-off.
§ $P = P_{\mathit{static}} + P_{\mathit{dynamic}} = P_{\mathit{static}} + \alpha C V^{2}\,\mathit{clk}_{\mathit{freq}}$
§ Choose the highest achievable $\mathit{clk}_{\mathit{freq}}$ to get the highest speed/power ratio.

[Figure: Power (mW, 400 to 1000) versus Speed (Mbps, 0 to 4000). Legend: stride-1, SW stride-2, HW stride-2, HW stride-4.]
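
Why the highest achievable clock gives the best speed/power ratio follows from the power model: static power is paid regardless of frequency, so it is amortized over more throughput as the clock rises. The sketch below lumps the factor αCV² into one coefficient; the coefficient, the static power, and the clock rates are all hypothetical values chosen for illustration.

```python
def total_power_mw(p_static_mw, k_dyn_mw_per_mhz, clk_freq_mhz):
    """P = P_static + (alpha * C * V^2) * clk_freq, with alpha*C*V^2 lumped into k_dyn."""
    return p_static_mw + k_dyn_mw_per_mhz * clk_freq_mhz

def speed_per_power(p_static_mw, k_dyn_mw_per_mhz, clk_freq_mhz, stride):
    """Throughput (Mbps) divided by total power (mW)."""
    speed_mbps = clk_freq_mhz * stride * 8
    return speed_mbps / total_power_mw(p_static_mw, k_dyn_mw_per_mhz, clk_freq_mhz)

if __name__ == "__main__":
    # Hypothetical parameters: 400 mW static power, 1 mW per MHz dynamic power
    for clk in (50, 100, 125):
        ratio = speed_per_power(400.0, 1.0, clk, stride=4)
        print(f"{clk} MHz: {ratio:.2f} Mbps/mW")
```

Because the ratio has the form f / (a + b·f) with a > 0, it increases monotonically with f under this model.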


Results from Processor Simulation – Cache

§ SimpleScalar simulator
§ We select the best cache size based on speed/area.

[Figure: Speed/area (Mbps/mm²) as a function of L1 data cache size (1 to 512 KB) and L2 data cache size (32 to 16384 KB).]

Best cache size for different configurations (selected by maximum speed/area)

Configuration     L1 size (KB)   L2 size (KB)
NFA                         16             64
NFA linear                  16             32
2-NFA                       64           1024
2-NFA linear                64            512
DFA                         64            128
D2FA linear                 64             64
D2FA bitmap                 32             64
2-DFA                      128           4096
2-D2FA linear              128           4096


Results from Processor Simulation – Memory bandwidth

§ Most cache miss rates are below 1%.
§ Low memory bandwidth utilization
§ High parallelism is possible.

Configuration     Memory bandwidth utilization (%)   Max threads supported
NFA                                           0.25                      81
NFA linear                                    0.17                     120
2-NFA                                         0.38                      52
2-NFA linear                                  0.23                      88
DFA                                           0.17                     118
D2FA linear                                   0.04                     454
D2FA bitmap                                   0.04                     480
2-DFA                                         0.26                      76
2-D2FA linear                                 0.20                     101

Demonstration of scalability on Intel x86 CPU.






Optimal Memory-Based Configurations

§ Select the optimal configuration by speed/area.
§ Parallel processing is allowed.
§ When mNFA/mDFA < 0.35, an NFA-based implementation is preferable;
§ otherwise, DFA-based implementations are preferable.
§ For some simple rulesets, 2-DFA is faster than DFA.
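
The mNFA/mDFA threshold rule can be applied mechanically to the ruleset statistics reported earlier in these slides. The sketch below does only that; the function name is hypothetical, and the slides do not list a per-ruleset recommendation, so the printed choices are simply the rule's output and not a reported result.

```python
# (mDFA, mNFA) per ruleset, copied from the ruleset statistics table in the slides
RULESETS = {
    "snort": (12, 2.76),
    "l7-filter": (7, 6.02),
    "bro": (8, 20.34),
    "exact-match": (2, 1.76),
    "dotstar 0.1": (11, 8.42),
    "dotstar 0.2": (24, 15.64),
    "dotstar 0.3": (33, 12.76),
    "dotstar 0.6": (49, 26.76),
}

THRESHOLD = 0.35  # mNFA/mDFA cut-off stated on the slide

def preferred_automaton(m_dfa, m_nfa, threshold=THRESHOLD):
    """Return 'NFA' when mNFA/mDFA is below the threshold, otherwise 'DFA'."""
    return "NFA" if m_nfa / m_dfa < threshold else "DFA"

if __name__ == "__main__":
    for name, (m_dfa, m_nfa) in RULESETS.items():
        ratio = m_nfa / m_dfa
        print(f"{name}: ratio {ratio:.2f}, prefer {preferred_automaton(m_dfa, m_nfa)}")
```

The rule only picks the automaton family; the memory layout and stride still have to be selected by speed/area as described above.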


Optimal Logic-Based Configurations

§ Hardware-based multi-stride is the best.
§ There seems to be a peak speed/slice value at a higher stride, but validating this is beyond the chip's resources.

[Figure: speed/slice for different hardware-based strides. x-axis: stride (1 to 8); y-axis: Mbps/Kslices (8 to 20).]


Conclusion

§ The key problem in regular expression matching is not the lack of innovative techniques, but the difficulty of deciding which technique actually works best in a given system setting.

§ In this work, we:
  • define the regular expression matching design space;
  • propose a benchmark of configurations that evaluates the design space both on a simulator and on real hardware;
  • present an analysis of rulesets to obtain the optimal configuration.


Thank you!


[Figure: Calibration. Real speed (Mbps) versus simulator speed (Mbps), both axes from 0 to 120.]
