Efficient Regular Expressions that produce Parse Trees
Aaron Karper Niko Schwarz
University of Bern
January 7, 2014
Aaron Karper, Niko Schwarz (UniBe) Regex Parse Trees January 7, 2014 1 / 38
Regular expressions so far
Regular expressions
https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)
domain: (([a-z]+\.)+([a-z]+))
path segments: ((/[a-z0-9]+)/?)
http://www.reddit.com/r/computerscience/comments/1sg69d/
domain: www, reddit, com
path: /r, /computerscience, /comments, /1sg69d
Regular expressions are greedy by default: (a+)(a?) on "aaa" puts "aaa" in group 0 and "" in group 1.
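The greedy behaviour can be checked with a minimal sketch in Python's re module (a stand-in, not the authors' library). Note that Python numbers groups from 1, so the slide's groups 0 and 1 are the first and second entries of groups() here.

```python
import re

# Greedy: (a+) takes as many "a" as it can, leaving nothing for (a?).
greedy = re.fullmatch(r"(a+)(a?)", "aaa")
print(greedy.groups())  # ('aaa', '')

# Lazy: (a+?) takes as few "a" as the overall match allows.
lazy = re.fullmatch(r"(a+?)(a?)", "aaa")
print(lazy.groups())  # ('aa', 'a')
```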
Posix gives only one match.
Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3).
Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.
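To illustrate what "only one match" means in practice: a conventional engine reports a single substring per capture group, even when the group sits inside a repetition. The sketch below uses Python's re module as a stand-in (it is a backtracking engine, not a Posix one, but it shows the same limitation). The simplified domain pattern is an assumption made for this example.

```python
import re

# The inner group ([a-z]+) matches twice ("www" and "reddit"),
# but only the last iteration survives in the result.
m = re.fullmatch(r"(([a-z]+)\.)+([a-z]+)", "www.reddit.com")
print(m.group(2))  # 'reddit' ("www" is lost)
print(m.group(3))  # 'com'
```

A parse tree, by contrast, would keep every iteration of the repeated group.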
Benchmarks
Parsing with https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)
[Figure: Posix. Capture groups 0 to 4 marked on http://www.reddit.com/r/computerscience/comments/1sg69d, each group once.]
[Figure: Our approach. Capture groups 0 to 4 marked on the same URL, with groups 2 and 4 marked several times.]
Matching ((a+b)+c)+ against (a^200 bc)^2000.

Tool              Time
JParsec           4,498
java.util.regex   1,992
Ours              5,332

Extract all class names from our project with a complex regular expression [1].

Tool              Time
java.util.regex   11,319
Ours              8,047

[1] (.*?([a-z]+\.)*([A-Z][a-zA-Z]*))*.*?
Benchmarks – Optimizations of the algorithm
Typically, most time is spent in long repetitions. We optimize for that case by:
Lazily compiling a deterministic FA.
Avoiding re-creating a state if a similar state has already been seen.
Using a compressed representation while in a static repetition.
Benchmarks – NFA interpretation
Example: (a?(a)b)+
Parse (a?(a)b)+ over "aabab" (indexed a0 a1 b2 a3 b4).
[Figure: the input a a b a b at positions 0 to 4, with the matches of groups 1 and 2 marked.]
Reading "aabab" (indexed a0 a1 b2 a3 b4). The NFA has states q1 to q9. Threads are listed as state: histories and are created one at a time:
q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q4: [[0], [], [0], []]
Threads
[Figure: a thread holds a state q and an array of histories h1 to h6.]
A copy of the thread is modified.
Copying the array of histories makes reading a character O(m^2).
A faster persistent data structure is needed to get O(m log m).
Optimized thread forking
Set entry 2 to 20:
[Figure: a tree with nodes 1 to 13. After the update, a new node 20 and a new copy of node 1 appear next to the original tree.]
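One common way to get cheap copies on a fork is path copying in a balanced tree: an update creates new nodes only along the path from the root to the changed entry and shares everything else with the old version. The sketch below illustrates that general technique under assumed names (Node, build, get, assoc); it is not taken from the authors' code.

```python
class Node:
    """Tree node, treated as immutable: a leaf holds a value, an inner node two children."""
    __slots__ = ("left", "right", "value")

    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value


def build(values, lo=0, hi=None):
    """Build a balanced tree over values[lo:hi]; values must be non-empty."""
    if hi is None:
        if not values:
            raise ValueError("values must be non-empty")
        hi = len(values)
    if hi - lo == 1:
        return Node(value=values[lo])
    mid = (lo + hi) // 2
    return Node(build(values, lo, mid), build(values, mid, hi))


def get(node, size, index):
    """Read entry `index` by walking from the root to a leaf."""
    lo, hi = 0, size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if index < mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid
    return node.value


def assoc(node, size, index, value, lo=0, hi=None):
    """Return a new root with entry `index` replaced; the old root stays valid."""
    if hi is None:
        hi = size
    if hi - lo == 1:
        return Node(value=value)
    mid = (lo + hi) // 2
    if index < mid:
        # Copy the left path, share the whole right subtree.
        return Node(assoc(node.left, size, index, value, lo, mid), node.right)
    # Copy the right path, share the whole left subtree.
    return Node(node.left, assoc(node.right, size, index, value, mid, hi))


old = build(list(range(8)))
new = assoc(old, 8, 2, 20)           # "set entry 2 to 20"
print(get(old, 8, 2), get(new, 8, 2))  # 2 20
print(new.right is old.right)          # True: untouched subtree is shared
```

Each update allocates one node per tree level, so a fork followed by one change costs O(log n) instead of copying the whole array.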
For each character read, threads start hungry and must eat immediately.
Only a hungry thread can eat.

Further snapshots while reading "aabab" (indexed a0 a1 b2 a3 b4). Each snapshot lists only the states that hold a thread, as state: histories.

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q5: [[0], [], [0], []]

q1: [[], [], [], []]
q2: [[0], [], [], []]
q3: [[0], [], [], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]

q1: [[], [], [], []]
q2: [[0], [], [], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]

q3: [[0], [], [], []]
q4: [[0], [], [1], []]
q5: [[0], [], [0], []]
q6: [[0], [], [0], [0]]

q4: [[0], [], [1], []]
q6: [[0], [], [0], [0]]

q5: [[0], [], [1], []]
q6: [[0], [], [1], [1]]

q6: [[0], [], [1], [1]]

q1: [[0], [2], [1], [1]]
q2: [[0,2], [2], [1], [1]]
q3: [[0,2], [2], [1], [1]]
q4: [[0,2], [2], [1,3], [1]]
q7: [[0], [], [1], [1]]
q8: [[0], [2], [1], [1]]
q9: [[0], [2], [1], [1]]

q3: [[0,2], [2], [1], [1]]
q4: [[0,2], [2], [1,4], [1]]
q5: [[0,2], [2], [1,3], [1]]
q6: [[0,2], [2], [1,3], [1,3]]

q1: [[0,2], [2,4], [1,3], [1,3]]
q2: [[0,2,5], [2,4], [1,3], [1,3]]
q3: [[0,2,5], [2,4], [1,3], [1,3]]
q4: [[0,2,5], [2,4,5], [1,3], [1,3]]
q7: [[0,2], [2], [1,3], [1,3]]
q8: [[0,2], [2,4], [1,3], [1,3]]
q9: [[0,2], [2,4], [1,3], [1,3]]

Result:
q9: [[0,2], [2,4], [1,3], [1,3]]
[Figure: the input a a b a b at positions 0 to 4, with the matches of groups 1 and 2 marked.]
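The walkthrough can be approximated with a small thread-based simulation. The sketch below is not the authors' implementation: it hand-compiles (a?(a)b)+ into a hypothetical instruction list (PROG), runs prioritized threads in the style of a Pike VM, and copies history tuples on every fork instead of using a persistent tree. The history layout (group 1 opens, group 1 closes, group 2 opens, group 2 closes) and the position convention (a close is recorded one past the group's last character) are assumptions, so the numbers differ from those on the slides.

```python
# Hypothetical program for (a?(a)b)+ ; groups are numbered 0 and 1 internally.
PROG = [
    ("open", 0),      # 0: start of outer group
    ("split", 2, 3),  # 1: a? prefers to consume an "a"
    ("char", "a"),    # 2
    ("open", 1),      # 3: start of inner group
    ("char", "a"),    # 4
    ("close", 1),     # 5
    ("char", "b"),    # 6
    ("close", 0),     # 7
    ("split", 0, 9),  # 8: + prefers another iteration
    ("match",),       # 9
]


def record(hist, slot, pos):
    """Return a copy of hist with pos appended to one history."""
    return hist[:slot] + (hist[slot] + (pos,),) + hist[slot + 1:]


def add_thread(threads, seen, pc, hist, pos):
    """Follow non-consuming instructions; queue threads that wait for input."""
    if pc in seen:  # a higher-priority thread already reached this state
        return
    seen.add(pc)
    op = PROG[pc]
    if op[0] == "split":
        add_thread(threads, seen, op[1], hist, pos)
        add_thread(threads, seen, op[2], hist, pos)
    elif op[0] == "open":
        add_thread(threads, seen, pc + 1, record(hist, 2 * op[1], pos), pos)
    elif op[0] == "close":
        add_thread(threads, seen, pc + 1, record(hist, 2 * op[1] + 1, pos), pos)
    else:  # "char" or "match"
        threads.append((pc, hist))


def parse(text):
    """Return the four histories of the winning thread, or None if no match."""
    threads = []
    add_thread(threads, set(), 0, ((), (), (), ()), 0)
    for pos, ch in enumerate(text):
        nxt, seen = [], set()
        for pc, hist in threads:
            op = PROG[pc]
            if op[0] == "char" and op[1] == ch:  # the thread eats the character
                add_thread(nxt, seen, pc + 1, hist, pos + 1)
        threads = nxt
    for pc, hist in threads:
        if PROG[pc][0] == "match":
            return hist
    return None


print(parse("aabab"))  # ((0, 3), (3, 5), (1, 3), (2, 4))
```

Read as half-open spans, the result says the outer group matched "aab" and "ab" and the inner group matched the "a" at positions 1 and 3, so every iteration is kept.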
Download
https://github.com/nes1983/tree-regex
NFA construction
[Figures: NFA fragments for alternation S1|S2, optional S?, capture group (S), and star operation S*?.]
Backtracking’s nightmare
(a+a+)+b against "a^n b" will backtrack Θ(2^n) times.
Extract the first cell in a CSV that starts with "P" [1]:
^(.*?,)+(P.*?),
failing against "1,2,3,4,5,6,7,8,9,10,11,12,13" is exponential.
[1] From http://www.regular-expressions.info/catastrophic.html
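The CSV pattern can be tried in any backtracking engine. The sketch below uses Python's re module as one such engine; the second input row is made up for illustration. With only 13 cells the failing search still returns quickly, but each additional cell roughly doubles the number of ways the engine can split the row before it gives up.

```python
import re

pattern = re.compile(r"^(.*?,)+(P.*?),")

# No cell starts with "P": the engine tries every way of grouping
# the commas into iterations of (.*?,) before reporting failure.
row = ",".join(str(i) for i in range(1, 14))
print(pattern.search(row))  # None

# A made-up row with a matching cell succeeds.
hit = pattern.search("1,2,Paris,4,")
print(hit.group(2))  # 'Paris'
```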
Thread execution order matters
.*(a?)
[Figure: NFA for .*(a?) with start state q1 and states q2 to q5; transitions are labeled any, τ1↑, a, τ1↓.]
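The effect of execution order can be observed in an ordinary engine as well. In this sketch (Python's re module, used only as an illustration), the same input matches either way, but whether the loop or the group gets the final "a" depends on which alternative is tried first.

```python
import re

# Greedy .* is tried first and swallows the "a"; the group stays empty.
greedy = re.fullmatch(r".*(a?)", "xa")
print(repr(greedy.group(1)))  # ''

# Lazy .*? lets the group go first, so the group captures the "a".
lazy = re.fullmatch(r".*?(a?)", "xa")
print(repr(lazy.group(1)))  # 'a'
```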
Priority matters
(a)|(a)
[Figure: NFA for (a)|(a) with start state q1 and states q2 to q6; one branch is labeled τ1↑, a, τ1↓ and the other τ2↑, a, τ2↓.]
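Both branches of (a)|(a) accept the same input, so only priority decides which group is reported. A minimal check with Python's re module (an illustration, not the authors' engine), which prefers the leftmost alternative:

```python
import re

m = re.fullmatch(r"(a)|(a)", "a")
print(m.groups())  # ('a', None): the first alternative wins
```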
Optimization Pipeline
1. Convert to nondeterministic FA.
2. Interpret nondeterministic FA, building deterministic FA lazily.
3. Find similar/mappable states to avoid creating infinite DFA.
4. Run on DFA if possible.
5. Compactify DFA if creation of new states wasn't necessary for a while.
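Step 2 of the pipeline can be sketched for plain recognition. The code below is a simplified illustration under stated assumptions: it ignores capture groups and histories entirely, the NFA for (a+b)+ is written by hand, and the names NFA, closure and LazyDFA are made up for this example. DFA states are sets of NFA states, and a transition is computed only the first time it is needed.

```python
EPS = None  # label for epsilon transitions

# Hand-written NFA for (a+b)+ : state -> {label: set of target states}
NFA = {
    0: {"a": {1}},
    1: {"a": {1}, "b": {2}},
    2: {EPS: {0, 3}},  # loop back for another iteration, or accept
    3: {},
}
START, ACCEPT = 0, 3


def closure(states):
    """All NFA states reachable from `states` through epsilon transitions."""
    stack, out = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in out:
                out.add(t)
                stack.append(t)
    return frozenset(out)


class LazyDFA:
    def __init__(self):
        self.start = closure({START})
        self.table = {}  # (dfa_state, char) -> dfa_state, filled on demand

    def step(self, state, ch):
        key = (state, ch)
        if key not in self.table:  # interpret the NFA once, then cache
            targets = set()
            for s in state:
                targets |= NFA[s].get(ch, set())
            self.table[key] = closure(targets)
        return self.table[key]

    def matches(self, text):
        state = self.start
        for ch in text:
            state = self.step(state, ch)
            if not state:  # dead state: no thread survived
                return False
        return ACCEPT in state


dfa = LazyDFA()
print(dfa.matches("aabab"))  # True
print(len(dfa.table))        # 4 transitions were built, the rest never needed
```

On long repetitive inputs almost every step is a dictionary lookup, which is the case the pipeline optimizes for.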