PPDP 2010

Typed and Unambiguous Pattern Matching on Strings using Regular Expressions

Claus Brabrand (ITU Copenhagen) & Jakob G. Thomsen (Aarhus University)

[http://xkcd.com/208/]

Introduction & Motivation

Parsing dynamic input is a ubiquitous problem

URLs:

http://www.cs.au.dk/index.php?id=141&view=details

(protocol, host, path, query-string; the query-string is a list of key-value pairs)

Log Files:

13/02/2010 66.249.65.107 get /support.html
20/02/2010 42.116.32.64 post /search.html

The solution is pattern matching

Motivating example

Example:

<day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >

Matching against string:

"26/06/1992"

yields:

day = 26
month = 06
year = 1992
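For comparison, the same match can be approximated with named groups in the standard java.util.regex library. This is only a sketch of the idea, not the authors' tool: the class name DateRecordings and the choice to return null on a failed match are assumptions made for illustration, and every recording comes back as an untyped String.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRecordings {
    // java.util.regex analogue of
    // <day = [0-9]{2}> "/" <month = [0-9]{2}> "/" <year = [0-9]{4}>
    private static final Pattern DATE =
        Pattern.compile("(?<day>[0-9]{2})/(?<month>[0-9]{2})/(?<year>[0-9]{4})");

    // Returns the recorded substrings by name, or null (an assumption of this
    // sketch) if the whole input does not match.
    static Map<String, String> match(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> recordings = new LinkedHashMap<>();
        for (String name : new String[] {"day", "month", "year"}) {
            recordings.put(name, m.group(name));
        }
        return recordings;
    }

    public static void main(String[] args) {
        System.out.println(match("26/06/1992")); // {day=26, month=06, year=1992}
    }
}
```

Here the group names live in string literals, so a misspelled name is only detected at run time.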

Our setup

url.rex:
  <URL = [a-z]*>;
  ...

Compile (our tool): url.rex => URL.java ...

Foo.java:
  import URL;
  class Foo { ... }

Compile (javac): URL.java Foo.java ... => URL.class Foo.class ...

Outline

The Chomsky Hierarchy (1956)

Regular Expressions: The Recording Construction

Ambiguity: Disambiguation

Type Mapping

Conclusion

The Chomsky Hierarchy (1956)

Oldie but goodie

Language classes (+formalisms):

Not widely used.

No static guarantees. Example: java.net.URL has had 88 bugs spanning a decade, and its source code still contains a //fixme.

Conceptually harder than regular expressions (regular expressions plus recursion).

Simple, declarative, and decidable properties (containment, ambiguity, etc.).

Type-3 regular expressions are "enough" for: URLs, log files, ...

"Trade" (excess) expressivity for: declarativity, simplicity, and static safety!

Regular Expressions

Syntax:

Semantics:

where:
$L_1 \cdot L_2$ is concatenation (i.e., $\{\omega_1\omega_2 \mid \omega_1 \in L_1, \omega_2 \in L_2\}$)
$L^* = \bigcup_{i \ge 0} L^i$, where $L^0 = \{\epsilon\}$ and $L^i = L \cdot L^{i-1}$

Usual extensions:
Any character "." as c1|c2|...|cn, for ci ∈ Σ
Character ranges "[a-z]" as a|b|...|z
Repetitions "R{2,3}" as RR|RRR
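For reference, the textbook syntax and semantics of regular expressions over an alphabet $\Sigma$ can be written as below. This is the standard definition and may differ in notation from what the slide showed; the name $\mathcal{L}(R)$ for the language of $R$ is an assumption of this sketch.

```latex
\begin{align*}
R \;&::=\; \emptyset \;\mid\; \epsilon \;\mid\; a \;\mid\; R_1 \,|\, R_2 \;\mid\; R_1 \cdot R_2 \;\mid\; R^*
  \qquad (a \in \Sigma) \\[1ex]
\mathcal{L}(\emptyset) &= \emptyset \\
\mathcal{L}(\epsilon) &= \{\epsilon\} \\
\mathcal{L}(a) &= \{a\} \\
\mathcal{L}(R_1 \,|\, R_2) &= \mathcal{L}(R_1) \cup \mathcal{L}(R_2) \\
\mathcal{L}(R_1 \cdot R_2) &= \mathcal{L}(R_1) \cdot \mathcal{L}(R_2) \\
\mathcal{L}(R^*) &= \mathcal{L}(R)^*
\end{align*}
```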

Recording

Syntax:

<x = R>, where "x" is a recording identifier (it "remembers" the substring it matches)

Semantics:

Example (simplified emails):

<user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >

Matching against string:

"obama@whitehouse.gov"

yields:

user = "obama" & domain = "whitehouse.gov"

Related: "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP

Recording (lists)

Another example (yielding lists):

<name = [a-z]+ > " & " <name = [a-z]+ >

Matching against string:

"obama & bush"

yields a list structure:

name = [obama,bush]

Further examples:

( <name = [a-z]+ > "\n" )*

<name = [a-z]+ > (" & " <name = [a-z]+ > )*
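As a point of contrast, a repeated capture group in java.util.regex keeps only its last match, so collecting a list takes an extra pass. The sketch below handles the last expression above by first checking the overall shape and then scanning for each name; the class name NameList and the null result on failure are assumptions of the example, not part of the tool described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameList {
    // Whole-input shape of <name = [a-z]+> (" & " <name = [a-z]+>)*
    private static final Pattern SHAPE = Pattern.compile("[a-z]+( & [a-z]+)*");
    // A single occurrence of the repeated recording.
    private static final Pattern NAME = Pattern.compile("[a-z]+");

    // Returns every recorded name in order, or null (an assumption of this
    // sketch) if the input does not have the expected shape.
    static List<String> match(String input) {
        if (!SHAPE.matcher(input).matches()) {
            return null;
        }
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(match("obama & bush")); // [obama, bush]
    }
}
```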

Recording (structured)

Yet another example:

<person = <name = [a-z]+ > ", " <age = [0-9]+ >>

Matching against string:

"obama, 48"

yields:

person = obama, 48
person.name = obama
person.age = 48

Ambiguity

Some regular expressions are ambiguous:

<day = [0-9]{1,2} > <month = [0-9]{1,2} >

matched on the string "101" gives rise to:

day = 1 and month = 01 (i.e., 1st of January)
day = 10 and month = 1 (i.e., 10th of January)

Multiple ways of matching => ambiguous
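The two readings of "101" can be reproduced by brute force: try every position at which the day could end and keep the splits where both halves are one or two digits. This is a small illustration of what "multiple ways of matching" means for this one expression, not the ambiguity analysis itself; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class DayMonthParses {
    // True if s is in the language of [0-9]{1,2}.
    static boolean oneOrTwoDigits(String s) {
        return s.matches("[0-9]{1,2}");
    }

    // Lists every way to split input into a day followed by a month.
    static List<String> parses(String input) {
        List<String> result = new ArrayList<>();
        for (int cut = 0; cut <= input.length(); cut++) {
            String day = input.substring(0, cut);
            String month = input.substring(cut);
            if (oneOrTwoDigits(day) && oneOrTwoDigits(month)) {
                result.add("day = " + day + " and month = " + month);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints two lines: "day = 1 and month = 01" and "day = 10 and month = 1".
        for (String parse : parses("101")) {
            System.out.println(parse);
        }
    }
}
```

An input such as "1010" has only one split, so the expression is ambiguous even though not every string exposes it.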

Characterization of Ambiguity

Theorem: R unambiguous iff

NB: sound & complete!

Choice:

<foo = a > | <bar = a* >

For the string "a", 2 ways: foo = "a" or bar = "a"

Concatenation:

<foo = a* > <bar = a* >

For the string "a", 2 ways: foo = "a" or bar = "a"

Star (R* = ε | RR*):

<foo = a|aa >*

For the string "aa", 2 ways: foo = [a,a] or foo = [aa]

Related work: [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce, but indirectly via NFAs, not directly (syntax-directed).
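The star example can be checked by counting how many ways an input splits into consecutive pieces that each match a|aa; more than one split means more than one list value for foo. The following is a brute-force count for this single expression under that reading, not the syntax-directed characterization from the theorem, and the class name StarParses is hypothetical.

```java
class StarParses {
    // Counts the ways to split input into consecutive pieces, each matching a|aa.
    static long countSplits(String input) {
        int n = input.length();
        long[] ways = new long[n + 1];
        ways[n] = 1; // the empty suffix has exactly one split: no pieces at all
        for (int i = n - 1; i >= 0; i--) {
            // A piece starting at i has length 1 ("a") or 2 ("aa").
            for (int len = 1; len <= 2 && i + len <= n; len++) {
                if (input.substring(i, i + len).matches("a|aa")) {
                    ways[i] += ways[i + len];
                }
            }
        }
        return ways[0];
    }

    public static void main(String[] args) {
        System.out.println(countSplits("a"));  // 1
        System.out.println(countSplits("aa")); // 2
    }
}
```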

What to do about it?

1) Manual rewriting: Always possible :-) Tedious :-( Error-prone :-( Not structure-preserving :-(

<foo = a > | <bar = a* >
is rewritten to
<foo = a > | <bar = ε|aaa* >

2) Restriction: R1 - R2

<foo = a > | <bar = a* >
using restriction
<foo = a > | <bar = a*-a >

And then encode...:
R^C as: Σ* - R
R1 & R2 as: (R1^C|R2^C)^C

3) Disambiguators: Three basic operators
choice: '|L', '|R'
concat: '·L', '·R'
star: '*L', '*R'

<foo = a > | <bar = a* >
using a disambiguator we get
<foo = a > |L <bar = a* >

4) Default disambiguation: concat, choice, and star are all left-biased (by default)! (Our tool does this)

<foo = a > | <bar = a* >
no need to rewrite

Related work: [Vansummeren'06] but with global, not local disambiguation
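A loosely similar left bias can be observed in java.util.regex, whose backtracking matcher tries alternatives from left to right. The sketch below uses that library only as an analogy for a left-biased choice; it says nothing about how the tool described here implements its disambiguators, and the class name LeftBiasedChoice is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeftBiasedChoice {
    // Analogue of <foo = a> | <bar = a*>; the left alternative is tried first.
    private static final Pattern CHOICE = Pattern.compile("(?<foo>a)|(?<bar>a*)");

    // Reports which recording captured the whole input, or "no match".
    static String match(String input) {
        Matcher m = CHOICE.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        if (m.group("foo") != null) {
            return "foo = \"" + m.group("foo") + "\"";
        }
        return "bar = \"" + m.group("bar") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(match("a"));  // foo = "a"
        System.out.println(match("aa")); // bar = "aa"
    }
}
```

For the ambiguous input "a" the left alternative wins, while "aa" can only be matched by the right one.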

Type Mapping

Our date example:

<date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} >>

Type of the recordings date, day, month, and year?
Strings (=> many type casts)
Infer the type

Type Mapping

A recording has three type components:

a linguistic type (language of the recording - maps to String, int, float, etc.).

a structural type (nested recordings - maps to (nested) classes).

a type modifier (maps to lists).

Related work: Exact type inference in XDuce & CDuce (soundness+completeness proof in [Vansummeren'06]), but not for stand-alone and non-intrusive usage (Java)

Type Mapping

Example:

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"

compile (our tool)

    class Person { // auto-generated
      String name;
      int age;
      static Person match(String s) { ... }
      public String toString() { ... }
    }

Usage:

    String s = "obama (48)";
    Person p = Person.match(s);
    print(p.name + " is " + p.age + "y old");
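To make the mapping concrete, here is a hand-written stand-in for what such a class could look like, built on java.util.regex. It is a guess at the shape only: the real generated code is elided on the slide, and the class name PersonSketch, the null result on a failed match, and the use of Integer.parseInt are all assumptions of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonSketch {
    // Stand-in for Person = <name = [a-z]+> " (" <age = [0-9]+> ")"
    private static final Pattern PERSON =
        Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name; // language [a-z]+ is mapped to String
    final int age;     // language [0-9]+ is mapped to int

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Returns the typed result, or null (an assumption) if s does not match.
    // Integer.parseInt throws NumberFormatException for digit strings that
    // do not fit in an int.
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    @Override
    public String toString() {
        return name + " (" + age + ")";
    }

    public static void main(String[] args) {
        PersonSketch p = match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // obama is 48y old
    }
}
```

The caller reads p.age as an int directly, with no cast or parse at the use site.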

Type Mapping

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( $Person "\n" )*

compile (our tool)

    class People { // auto-generated
      String[] name;
      int[] age;
      static People match(String s) { ... }
      public String toString() { ... }
    }

Usage:

    String s = "obama (48)\nbush (63)\n";
    People p = People.match(s);
    println("Second name is " + p.name[1]);

Type Mapping

Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
People = ( <person = $Person > "\n" )* ;

compile (our tool)

    class People { // auto-generated
      Person[] person;
      class Person { // nested class
        String name;
        int age;
      }
      ...
    }

Usage:

    String s = "obama (48)\nbush (63)\n";
    People people = People.match(s);
    for (Person p : people.person) println(p.name);

Conclusion

Regular expressions are alive and well.

This paper:

Precise ambiguity analysis

Type mapping

Future work: improve performance, subtyping of recordings

"trade (excess) expressivity for safety+simplicity"

Thank you. Questions?

Abstract Syntax Trees (ASTs)

Ambiguity

Definition:

R ambiguous iff $\exists T, T' \in AST_R : T \neq T' \wedge \|T\| = \|T'\|$

where $\|\cdot\| : AST \to \Sigma^*$ is the flattening.

[Figure: two ASTs, T and T', with equal flattenings]

Type Inference

Type Inference:

R : (L,S)

