Department of Electrical and Computer Engineering
September 14, 2016
Picking Pesky Parameters: Optimizing Regular Expression
Matching in Practice
Outline
§ Introduction to regular expressions
§ Design space exploration
§ Results
§ Optimal Regular Expression Matching Configuration
§ Conclusion
What is regular expression matching?
§ A regular expression (abbreviated regex) is a pattern that is matched against a string.
  • E.g., this regex matches a valid IPv4 address:
  • (([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])
§ Applications of regular expression matching:
  • Bibliographic search
  • Intrusion detection systems
  • Protocol identification
  • Content filtering
§ Many network security tools, such as Snort and Bro, use rule sets of regular expressions that match attacks.
§ These tools must operate at link rates from several to tens of gigabits per second to meet the performance requirements of the network.
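As a quick check of the IPv4 pattern above, here is a minimal sketch using Python's standard `re` module. The helper name `is_ipv4` and the choice of `fullmatch` (so that the whole string must match, not just a substring) are assumptions made for illustration; they are not part of the talk.

```python
import re

# One octet, exactly as in the slide: 0-9, 10-99, 100-199, 200-249, 250-255.
OCTET = r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"

# Three "octet." groups followed by a final octet.
IPV4 = re.compile(r"(" + OCTET + r"\.){3}" + OCTET)

def is_ipv4(text: str) -> bool:
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

An unanchored search would also accept substrings of longer inputs, which is why the sketch anchors the match to the full string.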
How to implement a regex lookup engine?
1. Transform the rule set into a state machine (finite automaton).
2. Scan packet payloads by traversing the state machine.
§ The automaton can be non-deterministic (NFA) or deterministic (DFA).
§ Example: NFA and DFA of .*ab+[cd]e
[Figure: NFA and DFA for .*ab+[cd]e, each drawn with states 0 to 4 and a marked accepting state.]
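The two-step lookup described above can be sketched in Python for the example pattern .*ab+[cd]e. This is an illustrative sketch, not the engine evaluated in the talk: the dictionary encoding of the NFA, the function names, and the reduced alphabet passed to `build_dfa` are all assumptions.

```python
# NFA for .*ab+[cd]e: state -> list of (accepted characters, next state).
# None stands for "any character" (the .* self-loop on state 0).
NFA = {
    0: [(None, 0), ("a", 1)],
    1: [("b", 2)],
    2: [("b", 2), ("cd", 3)],
    3: [("e", 4)],
    4: [],
}
ACCEPT = 4

def nfa_step(active, ch):
    """Advance every active NFA state on one input character."""
    nxt = set()
    for state in active:
        for chars, target in NFA[state]:
            if chars is None or ch in chars:
                nxt.add(target)
    return frozenset(nxt)

def nfa_scan(payload):
    """Return (match end offsets, average number of active states)."""
    active, matches, total = frozenset({0}), [], 0
    for i, ch in enumerate(payload):
        active = nfa_step(active, ch)
        total += len(active)
        if ACCEPT in active:
            matches.append(i)
    return matches, total / max(len(payload), 1)

def build_dfa(alphabet):
    """Subset construction: each DFA state is a set of NFA states."""
    start = frozenset({0})
    table, todo = {}, [start]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue
        table[subset] = {ch: nfa_step(subset, ch) for ch in alphabet}
        todo.extend(table[subset].values())
    return start, table

def dfa_scan(payload, start, table):
    """Exactly one active DFA state: one table lookup per character."""
    state, matches = start, []
    for i, ch in enumerate(payload):
        state = table[state][ch]
        if ACCEPT in state:
            matches.append(i)
    return matches
```

With the reduced alphabet "abcdex" (where "x" stands in for every other byte), the subset construction yields five DFA states, the same count as in the figure. The NFA scan tracks several active states per character, while the DFA scan does a single lookup per character.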
What is the problem?
§ Too many algorithms have been proposed to tune regex matching.
§ There are too many different system implementations for regex matching:
  • Different hardware
  • Different types of processors
  • Different memory configurations
§ The performance metrics used in previous publications differ:
  • Reduce memory requirements
  • Improve the average and worst-case throughput
  • Reduce power and energy consumption
§ It is very difficult to determine which technique or system implementation to use.
What does our work do?
§ Our work addresses the problem of choosing which regular expression technique to use for a given system, rule set, and traffic configuration.
§ We present a systematic evaluation of many widely used regular expression techniques using real-world rule sets.
§ We evaluate the throughput, memory size, energy consumption, and estimated chip area of each configuration.
§ We provide a method for choosing the right configuration based on the results from our experiments.
Two types of solutions
§ Memory based solution
§ Logic based solution
[Figure: block diagrams of the two solution types. Labels: automaton, caches, processing units, memory, bus, input / match.]
Design space
[Figure: design space, from inputs to results.]
§ Inputs: regular expression ruleset (partitioned ruleset), traffic traces
§ Automaton: NFA, DFA, 2-NFA, 2-DFA, A-DFA, 2-A-DFA
§ Implementation:
  • Memory-based (9 configurations): non-compressed layout, linear encoding, bitmapped encoding
  • Logic-based (4 configurations): stride-1, SW-based multi-stride, HW-based multi-stride
§ System:
  • Memory-based: cache size, memory bandwidth, number of cores
  • Logic-based: FPGA clock rate
§ Evaluation: synthesis tool, processor simulator, real processor
§ Result: throughput speed, memory & area cost, power consumption
Automaton domain
§ NFA (Nondeterministic Finite Automaton)
  • Generated from the regex ruleset.
  • The number of states is small, but multiple states can be active at the same time.
§ DFA (Deterministic Finite Automaton)
  • Generated from the NFA.
  • Only one state is active at a time: stable performance.
  • Size can grow exponentially if some complex patterns exist (called state explosion).
  • Large rulesets need to be partitioned into several parts, generating multiple DFAs.
  • A-DFA: a compression technique that allows a DFA state to use fewer than 256 transitions. It should be used with a compressed memory layout.
§ Multi-stride NFA/DFA (or k-NFA/k-DFA)
  • Processes k input characters at a time.
  • If the initial alphabet is Σ, a k-NFA/k-DFA is equivalent to an FA defined on the alphabet Σ^k.
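To make the Σ^k view concrete, the sketch below derives a k-stride table from a stride-1 DFA by composing k transitions. The toy DFA (for .*ab over the alphabet "abx"), the function name `make_k_stride`, and the per-entry match flag are assumptions for illustration; they are not the construction used in the talk.

```python
from itertools import product

# Hypothetical stride-1 DFA for .*ab over the alphabet "abx"; state 2 accepts.
DFA = {
    0: {"a": 1, "b": 0, "x": 0},
    1: {"a": 1, "b": 2, "x": 0},
    2: {"a": 1, "b": 0, "x": 0},
}
ACCEPTING = {2}

def make_k_stride(table, accepting, alphabet, k):
    """Build a k-stride table whose symbols are strings from alphabet^k.

    Each entry holds (next state, match flag). The flag records whether
    any state reached while consuming the k characters was accepting,
    so matches that end in the middle of a stride are not lost.
    """
    k_table = {}
    for state in table:
        row = {}
        for symbols in product(alphabet, repeat=k):
            cur, hit = state, False
            for ch in symbols:
                cur = table[cur][ch]
                hit = hit or cur in accepting
            row["".join(symbols)] = (cur, hit)
        k_table[state] = row
    return k_table

TWO_STRIDE = make_k_stride(DFA, ACCEPTING, "abx", 2)
```

Each state of the 2-stride table has |Σ|^2 = 9 entries here, which illustrates why the per-state transition count grows quickly with the stride.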
Implementation domain -- Memory based solution
§ Three memory layouts:
  1. Non-compressed layout
     • Uses all |Σ| transitions in a state.
  2. Linear encoding
     • Only encodes the existing transitions in an NFA, or the default transition and other transitions in an A-DFA.
     • Linear search is performed until a transition matching the input character is found or its absence is verified.
  3. Bitmapped encoding
     • Similar to linear encoding, but uses a bitmap to avoid linear search.
     • Applies only to stride-1 DFA.
§ 9 configurations in total
  • Non-compressed: NFA, DFA, 2-NFA, 2-DFA
  • Linear encoding: NFA, A-DFA, 2-NFA, 2-A-DFA
  • Bitmapped encoding: A-DFA
[Figure: three DFA memory layouts built from 32-bit words. Non-compressed layout: each state stores 256 words, one transition (Tx) per input byte 0x00 to 0xFF. Linear encoding: a Tx address map (address of state 0 to state n) points to states that store a default Tx followed by only the existing transitions. Bitmapped encoding: each state stores a level-1 bitmap (1 word), a level-2 bitmap (8 words), a default Tx, and the existing transitions.]
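The difference between the three lookups can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the two-level bitmap is collapsed into a single 256-bit bitmap, and a return value of `None` means the caller must follow the state's default transition.

```python
def lookup_noncompressed(row, byte):
    """row holds all 256 next-state entries: one direct index."""
    return row[byte]

def lookup_linear(transitions, byte):
    """transitions is a list of (byte, next_state) pairs, scanned in order.

    Returns None when the byte is absent (take the default transition).
    """
    for label, target in transitions:
        if label == byte:
            return target
    return None

def encode_bitmap(transitions):
    """Pack (byte, next_state) pairs into a bitmap and a target list sorted by byte."""
    bitmap, targets = 0, []
    for label, target in sorted(transitions):
        bitmap |= 1 << label
        targets.append(target)
    return bitmap, targets

def lookup_bitmap(bitmap, targets, byte):
    """Test one bit, then count the set bits below it to index the target list."""
    if not (bitmap >> byte) & 1:
        return None
    below = bitmap & ((1 << byte) - 1)
    return targets[bin(below).count("1")]
```

The bitmap variant replaces the loop of the linear lookup with a bit test and a population count, which is the reason it avoids the linear search.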
Implementation domain -- Logic based solution
§ Logic-based solutions only use NFAs
  • Stride-1 implementation
  • Software-based multi-stride approach
    • First generate a k-NFA, then encode it in logic.
    • Resource-costly; can only support stride-2.
  • Hardware-based multi-stride approach
    • Uses a stride-1 NFA and the corresponding alphabet translation table.
    • Resource-efficient; can support up to stride-4.
§ 4 configurations in total
  • Stride-1 implementation
  • Software-based: 2-NFA
  • Hardware-based: 2-NFA, 4-NFA
System domain
§ Memory-based solution
  • Different cache sizes for level-1 and level-2 cache
  • Memory bandwidth
  • Different number of cores
§ Logic-based solution
  • Different FPGA clock rates
Evaluation Methodology
§ Real hardware
  • TI OMAP 4460 ARM processor
  • Xilinx Virtex 5 FPGA (XC5VLX50)
  • Speed, memory usage/slice usage, and power are measured
§ Simulator
  • SimpleScalar simulator, calibrated with real hardware.
  • Used to study the parameters that cannot be changed on real hardware:
    • Cache size
    • Memory bandwidth
§ Inputs
  • We use both real rulesets (from Snort, L7-filter, and Bro) and some synthetic rulesets with different characteristics.
  • Traffic traces are generated by the traffic generator (written by Becchi et al.)
Results from real hardware – Memory based solutions
§ TI OMAP 4460 ARM processor
§ Rulesets with very high mNFA and very low mDFA should use DFA, and rulesets with very high mDFA and very low mNFA should use NFA.
  • mNFA: the average number of active states in the NFA
  • mDFA: the number of DFAs
[Figure: speed (Mbps), memory (MB, log scale), and power (mW) for each ruleset (snort, l7-filter, bro, exact-match, dotstar 0.1, dotstar 0.2, dotstar 0.3, dotstar 0.6). Configurations: NFA NC, NFA LE, 2-NFA NC, 2-NFA LE, DFA NC, A-DFA LE, A-DFA BM, 2-DFA NC, 2-A-DFA LE (NC: non-compressed, LE: linear encoding, BM: bitmapped encoding).]
Ruleset       #reg-ex   Length (min)   Length (max)   Length (avg)   mDFA   mNFA
snort         462       10             202            44.1           12     2.76
l7-filter     111       6              438            63.2           7      6.02
bro           782       5              211            34.8           8      20.34
exact-match   500       10             256            49.2           2      1.76
dotstar 0.1   500       10             243            49.6           11     8.42
dotstar 0.2   500       11             212            49.0           24     15.64
dotstar 0.3   500       11             251            47.1           33     12.76
dotstar 0.6   500       11             274            50.3           49     26.76
Results from real hardware – Logic based solutions
§ Xilinx Virtex 5 (XC5VLX50)
§ $\mathit{speed} = \mathit{clk}_{\mathit{freq}} \times \mathit{stride} \times 8\ \text{bits}$
§ A smaller circuit can operate at a higher frequency.
§ The hardware-based stride-4 implementation leads to the best results.
[Figure: speed (Mbps), slice usage, and power (mW) for each ruleset under stride-1, SW stride-2, HW stride-2, and HW stride-4; some bars are marked "missing".]
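The throughput relation on this slide (clock frequency times stride times 8 bits) is simple enough to evaluate directly. The function name and the clock rates in the usage note are hypothetical illustrations, not measurements from the talk.

```python
def logic_throughput_mbps(clk_freq_mhz: float, stride: int) -> float:
    """Throughput in Mbps: one stride of 8-bit characters is consumed per clock cycle."""
    return clk_freq_mhz * stride * 8
```

For instance, a hypothetical stride-4 circuit clocked at 100 MHz would process 3200 Mbps, while a stride-1 circuit would need 400 MHz to reach the same rate. That is why a larger stride can win even when the larger circuit forces a lower clock.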
Results from real hardware – Logic based solutions
§ Different frequency: power vs. speed trade-off.
§ $P = P_{\mathit{static}} + P_{\mathit{dynamic}} = P_{\mathit{static}} + \alpha C V^{2}\, \mathit{clk}_{\mathit{freq}}$
§ Choose the highest achievable $\mathit{clk}_{\mathit{freq}}$ to get the highest speed/power ratio.
[Figure: power (mW) versus speed (Mbps) for stride-1, SW stride-2, HW stride-2, and HW stride-4.]
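The recommendation to run at the highest achievable clock follows from combining the throughput relation (clock frequency times stride times 8 bits) with the power model on this slide. The short derivation below is a sketch that assumes the supply voltage V stays fixed as the clock changes and that the static power is positive; neither assumption is stated explicitly in the talk.

```latex
\frac{\mathit{speed}}{P}
  = \frac{8 \cdot \mathit{stride} \cdot \mathit{clk}_{\mathit{freq}}}
         {P_{\mathit{static}} + \alpha C V^{2}\, \mathit{clk}_{\mathit{freq}}},
\qquad
\frac{\partial}{\partial\, \mathit{clk}_{\mathit{freq}}}
  \left( \frac{\mathit{speed}}{P} \right)
  = \frac{8 \cdot \mathit{stride} \cdot P_{\mathit{static}}}
         {\left( P_{\mathit{static}} + \alpha C V^{2}\, \mathit{clk}_{\mathit{freq}} \right)^{2}}
  > 0
```

Because the derivative is positive, the speed/power ratio keeps rising with the clock frequency under these assumptions: the fixed static power is spread over more processed bits.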
Results from Processor Simulation – Cache
§ SimpleScalar simulator
§ We select the best cache size based on speed/area.
[Figure: speed/area (Mbps/mm²) as a function of L1 data cache size (KB) and L2 data cache size (KB).]
Best cache size for different configurations, selected by maximum speed/area:

Configuration    L1 size (KB)   L2 size (KB)
NFA              16             64
NFA linear       16             32
2-NFA            64             1024
2-NFA linear     64             512
DFA              64             128
D2FA linear      64             64
D2FA bitmap      32             64
2-DFA            128            4096
2-D2FA linear    128            4096
Results from Processor Simulation – Memory bandwidth
§ Most cache miss rates are below 1%.
§ Low memory bandwidth utilization.
§ High parallelism is possible.
Configuration    Utilization of memory bandwidth (%)   Max threads supported
NFA              0.25                                  81
NFA linear       0.17                                  120
2-NFA            0.38                                  52
2-NFA linear     0.23                                  88
DFA              0.17                                  118
D2FA linear      0.04                                  454
D2FA bitmap      0.04                                  480
2-DFA            0.26                                  76
2-D2FA linear    0.20                                  101

Demonstration of scalability on Intel x86 CPU.
Optimal Memory-Based Configurations
§ Select the optimal configuration by speed/area.
§ Parallel processing is allowed.
§ When mNFA/mDFA < 0.35, an NFA-based implementation is preferable.
§ Otherwise, DFA-based implementations are preferable.
§ For some simple rulesets, 2-DFA is faster than DFA.
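The threshold rule above can be applied mechanically to the mDFA and mNFA values reported in the ruleset table of the results section. The sketch below only encodes that rule; the function and constant names are hypothetical, and it does not reproduce the speed/area measurements behind the threshold.

```python
NFA_THRESHOLD = 0.35  # from the slide: below this mNFA/mDFA ratio, prefer NFA

# (mDFA, mNFA) per ruleset, copied from the ruleset table in the results section.
RULESETS = {
    "snort": (12, 2.76),
    "l7-filter": (7, 6.02),
    "bro": (8, 20.34),
    "exact-match": (2, 1.76),
    "dotstar 0.1": (11, 8.42),
    "dotstar 0.2": (24, 15.64),
    "dotstar 0.3": (33, 12.76),
    "dotstar 0.6": (49, 26.76),
}

def preferred_automaton(m_nfa: float, m_dfa: float) -> str:
    """Apply the slide's rule: NFA if mNFA/mDFA < 0.35, otherwise DFA."""
    return "NFA" if m_nfa / m_dfa < NFA_THRESHOLD else "DFA"

CHOICES = {
    name: preferred_automaton(m_nfa, m_dfa)
    for name, (m_dfa, m_nfa) in RULESETS.items()
}
```

Under this rule alone, only the snort ruleset (ratio 0.23) falls below the threshold; the other seven rulesets have ratios above 0.35 and map to DFA-based implementations.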
Optimal Logic-Based Configurations
§ Hardware-based multi-stride is the best.
§ There seems to be a peak speed/slice value at a higher stride, but validating it would require more resources than the chip provides.
[Figure: speed/slice (Mbps/K slices) for hardware-based strides 1 to 8.]
Conclusion
§ The key problem in regular expression matching is not the lack of innovative techniques, but the difficulty of deciding which technique actually works best in a given system setting.
§ In this work, we:
  • define the regular expression matching design space;
  • propose a benchmark of configurations that evaluates the design space both on a simulator and on real hardware;
  • present an analysis of rulesets to obtain the optimal configuration.
Thank you!
[Figure: Calibration. Real speed (Mbps) versus simulator speed (Mbps).]