Building a Regular Expression Parser, Compiler and Processor

Building a Regular Expression Parser, Compiler and Processor

By Steve Horsfield, revision 2, version 1.0, 4 August 2009

Copyright This document contains material originally published at http://stevehorsfield.wordpress.com in July and August of

2009.

The material contained in this document is copyright the author, except where otherwise indicated, and is subject to

a Creative Commons license which can be found here: http://creativecommons.org/licenses/by-nc-

sa/3.0/deed.en_GB.

Information is provided as-is without warranty or guarantee of any kind.

Revision information This is revision 2. Revision 1 was an incomplete draft.

Related articles F#: A Complete Regular Expression Processor (http://stevehorsfield.wordpress.com/2009/08/04/f-a-

complete-regular-expression-processor/)

F#: Compiling a Regular Expression Syntax (http://stevehorsfield.wordpress.com/2009/08/03/f-compiling-a-

regular-expression-syntax/)

F#: Building a Regular Expression Pattern Parser (http://stevehorsfield.wordpress.com/2009/07/25/f-

building-a-regular-expression-pattern-parser/)

F#: A Data Structure For Modelling Directional Graphs (http://stevehorsfield.wordpress.com/2009/07/27/f-

a-data-structure-for-modelling-directional-graphs/)

F#: Graphing with GLEE (http://stevehorsfield.wordpress.com/2009/07/30/f-graphing-with-glee/)

F#: Sequence Comprehensions and Iteration (http://stevehorsfield.wordpress.com/2009/08/01/f-sequence-

comprehensions-and-iteration/)

Sources Regular Expression Matching Can Be Simple And Fast, Russ Cox, January 2007,

http://swtch.com/~rsc/regexp/regexp1.html

Microsoft F# Developer Centre, http://msdn.microsoft.com/en-us/fsharp/default.aspx

Microsoft Advanced Graph Layout (and samples), http://research.microsoft.com/en-

us/downloads/f1303e46-965f-401a-87c3-34e1331d32c5/default.aspx

http://stevehorsfield.wordpress.com/

http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_GB

http://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_GB

http://stevehorsfield.wordpress.com/2009/08/04/f-a-complete-regular-expression-processor/

http://stevehorsfield.wordpress.com/2009/08/04/f-a-complete-regular-expression-processor/

http://stevehorsfield.wordpress.com/2009/08/03/f-compiling-a-regular-expression-syntax/

http://stevehorsfield.wordpress.com/2009/08/03/f-compiling-a-regular-expression-syntax/

http://stevehorsfield.wordpress.com/2009/07/25/f-building-a-regular-expression-pattern-parser/

http://stevehorsfield.wordpress.com/2009/07/25/f-building-a-regular-expression-pattern-parser/

http://stevehorsfield.wordpress.com/2009/07/27/f-a-data-structure-for-modelling-directional-graphs/

http://stevehorsfield.wordpress.com/2009/07/27/f-a-data-structure-for-modelling-directional-graphs/

http://stevehorsfield.wordpress.com/2009/07/30/f-graphing-with-glee/

http://stevehorsfield.wordpress.com/2009/08/01/f-sequence-comprehensions-and-iteration/

http://stevehorsfield.wordpress.com/2009/08/01/f-sequence-comprehensions-and-iteration/

http://swtch.com/~rsc/regexp/regexp1.html

http://msdn.microsoft.com/en-us/fsharp/default.aspx

http://research.microsoft.com/en-us/downloads/f1303e46-965f-401a-87c3-34e1331d32c5/default.aspx

http://research.microsoft.com/en-us/downloads/f1303e46-965f-401a-87c3-34e1331d32c5/default.aspx

The code

Project configuration I am using a pre-release version of F# (1.9.6.16) with Visual Studio 2008.

Create a new F# Console Application (I use both console and windowed output).

You will need to install MS GLEE (a predecessor of Microsoft Advanced Graph Layout) available from the link above.

Your project will need references to the following .NET assemblies:

Microsoft.GLEE

Microsoft.GLEE.Drawing

Microsoft.Glee.GraphViewerGDI

System.Drawing

System.Windows.Forms

As well as the standard assemblies.

Compile order in F# is explicit. The files should be listed in the project in the order presented below, and Program.fs

should always be last as it depends on the predecessors. Files with a “.fsi” extension should always come before the

corresponding files with “.fs” extensions.

Hex module

Hex.fsi

#light module Hex = val hex4 : char -> int val hex8 : char -> char -> int val hex16 : char -> char -> char -> char -> int

Hex.fs

#light module Hex = let hex4 A = match A with | '0' -> 0 | '1' -> 1 | '2' -> 2 | '3' -> 3 | '4' -> 4 | '5' -> 5 | '6' -> 6 | '7' -> 7 | '8' -> 8 | '9' -> 9 | 'A' -> 10 | 'B' -> 11 | 'C' -> 12 | 'D' -> 13 | 'E' -> 14 | 'F' -> 15 | 'a' -> 10 | 'b' -> 11 | 'c' -> 12 | 'd' -> 13 | 'e' -> 14 | 'f' -> 15 | _ -> failwith "invalid hexadecimal digit" let hex8 A B = (16 * (hex4 A)) + (hex4 B) let hex16 A B C D = (256 * (hex8 A B)) + (hex8 C D)

Text module

Text.fsi

#light module Text = val asciiChar : int -> char val unicodeChar : int -> char

Text.fs

#light module Text = let asciiChar i = System.Convert.ToChar(byte i) let unicodeChar i = System.Convert.ToChar(int16 i)

Graph module

Graph.fsi

#light module Graph (* In this module, a graph is represented by an adjacency list. This module is not intended to be a high performance implementation. This module can represent a directed or undirected graph depending on the method of handling edges. The default behaviour results in a directed graph. *) type VertexData<'V> = int (* identifier *) * 'V (* vertex data *) type EdgeData<'E> = int (* identifier *) * int (* priority *) * int (* vertex target *) * 'E (* edge data *) (* The graph uses adjacency list notation *) type Adjacency<'E> = EdgeData<'E> list (* The Vertex type represents the internal structure of the graph *) type Vertex<'V, 'E> = VertexData<'V> * Adjacency<'E> (* A Graph is a Vertex list. The nextNode allows for consistent addressing of nodes *) type Graph<'V, 'E> = (int (* nextNode identifier *) * int (* nextEdge identifier *)) * Vertex<'V, 'E> list (* Empty graph construction *) val empty: Graph<_,_> (* Helper methods for getting the data from a Vertex *) val vertexId : Vertex<_,_> -> int val vertexData : Vertex<'V,_> -> 'V (* Helper methods for getting the data from an Edge *) val edgeId : EdgeData<_> -> int val edgePriority : EdgeData<_> -> int val edgeTarget : EdgeData<_> -> int val edgeData : EdgeData<'E> -> 'E (* Getting a vertex from a graph by id *) val getVertex : int -> Graph<'V,'E> -> Vertex<'V,'E> (* Getting all edges from a graph by a vertex id *) val getEdges : int -> Graph<'V,'E> -> Adjacency<'E> (* Try getting a vertex from a graph by id *) val tryGetVertex : int -> Graph<'V,'E> -> Vertex<'V,'E> option (* Add a new vertex *) val addVertex : 'V -> Graph<'V,'E> -> int (*new id*) * Graph<'V,'E> (* Add a new edge. Edges include a priority value *) val addEdge : int (*priority*) -> int (*source vertex*) -> int (*target vertex*) -> 'E (*edge data*) -> Graph<'V,'E> -> int (*new id*) * Graph<'V,'E> (* The edges aren't sorted by default so this function sorts them by priority *) val sortEdges : Adjacency<'E> -> Adjacency<'E> (* Removes an edge from a graph by id *) val removeEdge : int -> Graph<'V,'E> -> Graph<'V,'E> (* Removes a vertex from a graph by id and removes any related edges *) val removeVertex : int -> Graph<'V,'E> -> Graph<'V,'E> (* Applies a function to each vertex data element *) val mapVertices : (int -> 'V -> 'V) -> Graph<'V,'E> -> Graph<'V,'E>

Graph.fs

#light module Graph (* In this module, a graph is represented by an adjacency list. This module is not intended to be a high performance implementation. This module can represent a directed or undirected graph depending on the method of handling edges. The default behaviour results in a directed graph. *) type VertexData<'V> = int (* identifier *) * 'V (* vertex data *) type EdgeData<'E> = int (* identifier *) * int (* priority *) * int (* vertex target *) * 'E (* edge data *) (* The graph uses adjacency list notation *) type Adjacency<'E> = EdgeData<'E> list (* The Vertex type represents the internal structure of the graph *) type Vertex<'V, 'E> = VertexData<'V> * Adjacency<'E> (* A Graph is a Vertex list. The nextNode allows for consistent addressing of nodes *) type Graph<'V, 'E> = (int (* nextNode identifier *) * int (* nextEdge identifier *)) * Vertex<'V, 'E> list (* Empty graph construction *) let empty: Graph<_,_> = ((0,0), []) (* Helper methods for getting the data from a Vertex *) let vertexId (v:Vertex<_,_>) = v |> fst |> fst let vertexData (v:Vertex<_,_>) = v |> fst |> snd (* Helper methods for getting the data from an Edge *) let edgeId ((x,_,_,_):EdgeData<_>) = x let edgePriority ((_,x,_,_):EdgeData<_>) = x let edgeTarget ((_,_,x,_):EdgeData<_>) = x let edgeData ((_,_,_,x):EdgeData<_>) = x (* Getting a vertex from a graph by id *) let getVertex v (g:Graph<_, _>) : Vertex<_,_> = snd g |> List.find (fun V -> vertexId V = v) (* Getting all edges from a graph by a vertex id *) let getEdges v (g:Graph<_, _>) = g |> getVertex v |> snd (* Try getting a vertex from a graph by id *) let tryGetVertex v (g:Graph<_, _>) : Vertex<_,_> option = snd g |> List.tryFind (fun V -> vertexId V = v) (* Add a new vertex *) let addVertex (v:'V) (g:Graph<'V, _>) : (int*Graph<'V,_>) = let (nid,eid) = fst g let s = snd g let newVD : VertexData<_> = (nid, v) let newA : Adjacency<_> = [] let newV = (newVD, newA) (nid, ((nid + 1,eid), newV::s)) (* Add a new edge. Edges include a priority value *) let addEdge priority (v:int) (v':int) (e:'E) (g:Graph<'V, 'E>) : (int*Graph<'V,'E>) = let (nid,eid) = fst g let s = snd g let newE : EdgeData<_> = (eid, priority, v', e) (eid, ((nid,eid + 1), s |> List.map (fun V -> if (vertexId V) = v then (fst V, newE::(snd V)) else V))) (* The edges aren't sorted by default so this function sorts them by priority *) let sortEdges (a:Adjacency<_>) = a |> List.sortBy edgePriority (* Removes an edge from a graph by id *) let removeEdge (id:int) (g:Graph<_,_>) : Graph<_,_> = let next = fst g let s = snd g (next, s |> List.map ( fun (v, a) -> (v, a |> List.filter (fun x -> (edgeId x) <> id))))

(* Removes a vertex from a graph by id and removes any related edges *) let removeVertex (id:int) (g:Graph<_,_>) : Graph<_,_> = let next = fst g let s = snd g (next, s |> ([] |> List.fold (fun s' (v, a) -> if (fst v) = id then s' else let f = fun x -> ((edgeTarget x) <> id) let newA = a |> List.filter f let newV = (v, newA) newV::s'))) (* Applies a function to each vertex data element *) let mapVertices (f:int -> 'V -> 'V) (g:Graph<'V,'E>) : Graph<'V,'E> = let applyVertex (v:Vertex<'V,'E>) : Vertex<'V,'E> = let a = snd v let (id, vd) = fst v let vd' = f id vd ((id, vd'), a) (fst g, (snd g) |> List.map applyVertex)

GleeGraph module

GleeGraph.fsi

#light open Graph open Microsoft.Glee.Drawing module Graph = type NodeAdapter<'V,'E> = Vertex<'V,'E> -> Node -> unit type EdgeAdapter<'E> = EdgeData<'E> -> Edge -> unit val toGlee<'V,'E> : string -> NodeAdapter<'V,'E> -> EdgeAdapter<'E> -> Graph<'V,'E> -> Graph val graphForm : string -> NodeAdapter<'V,'E> -> EdgeAdapter<'E> -> Graph<'V,'E> -> System.Windows.Forms.Form

GleeGraph.fs

#light open Graph open Microsoft.Glee.Drawing open System.Windows.Forms module Graph = type NodeAdapter<'V,'E> = Vertex<'V,'E> -> Node -> unit type EdgeAdapter<'E> = EdgeData<'E> -> Edge -> unit (* function to create an empty GLEE graph *) let newGlee = fun name -> new Microsoft.Glee.Drawing.Graph(name) (* add a GLEE node to the graph *) let newNode = fun (id:string) -> fun (nodeAdapter:Node->unit) -> fun (g:Microsoft.Glee.Drawing.Graph) -> let n = g.AddNode(id) nodeAdapter n (* add a GLEE edge to the graph *) let newEdge = fun (v1:string) -> fun (v2:string) -> fun (edgeAdapter:Edge->unit) -> fun (g:Microsoft.Glee.Drawing.Graph) -> let e = g.AddEdge(v1, v2) edgeAdapter e (* create a GLEE representation of a Graph<_,_> *) let toGlee<'V,'E> name (nodeAdapter:NodeAdapter<'V,'E>) (edgeAdapter:EdgeAdapter<'E>) (g:Graph<'V,'E>) = let x = newGlee name let l = snd g let processNode y = x |> newNode ((y |> vertexId).ToString()) (nodeAdapter y) let processEdge start (y:EdgeData<_>) : unit = match y with | (a,b,c,d) -> x |> newEdge (start.ToString()) (c.ToString()) (edgeAdapter y) l |> List.rev |> List.iter processNode; l |> List.rev |> List.iter ( fun z -> (snd z) |> List.rev |> List.iter (processEdge (fst z |> fst))); x (* Create a Windows Form to display a GLEE graph *) let graphForm name (nodeAdapter:NodeAdapter<'V,'E>) (edgeAdapter:EdgeAdapter<'E>) (g:Graph<'V,'E>) = let frm = new System.Windows.Forms.Form() frm.Text <- name let gc = new Microsoft.Glee.GraphViewerGdi.GViewer() gc.Graph <- toGlee name nodeAdapter edgeAdapter g gc.Dock <- System.Windows.Forms.DockStyle.Fill; frm.Controls.Add gc frm

RegExParsing module

RegExParsing.fsi

#light namespace RegExParsing (* RegExpCharacterRange represents either a single character or a contiguous range of characters within a regular expression chacter range *) type RegExpCharacterRange = | Character of char | Range of char*char (* RegExpCharacterRangeUnit represents the content of a regular expression chacter range *) type RegExpCharacterRangeUnit = | Inversion of RegExpCharacterRange list | Matching of RegExpCharacterRange list (* RegExpParseToken represents an element of a (partially) parsed regular expression. Note that a RegExpParseToken list may be valid or invalid depending on its content *) type RegExpParseToken = | StartSubPattern | SubPattern of RegExpParseToken list | Alternation of RegExpParseToken list list | Option of RegExpParseToken | AnyNumber of RegExpParseToken | AtLeastOne of RegExpParseToken | CharacterRange of RegExpCharacterRangeUnit | AnyCharacter | StartMarker | EndMarker | Literal of char (* Shorthand used to simplify code *) type RegExpParseSyntax = RegExpParseToken list module RegExParsing = val parseRegExp : string -> RegExpParseSyntax

RegExParsing.fs

#light namespace RegExParsing open Hex open Text (* RegExpCharacterRange represents either a single character or a contiguous range of characters within a regular expression chacter range *) type RegExpCharacterRange = | Character of char | Range of char*char (* RegExpCharacterRangeUnit represents the content of a regular expression chacter range *) type RegExpCharacterRangeUnit = | Inversion of RegExpCharacterRange list | Matching of RegExpCharacterRange list (* RegExpParseToken represents an element of a (partially) parsed regular expression. Note that a RegExpParseToken list may be valid or invalid depending on its content *) type RegExpParseToken = | StartSubPattern (* used during the build of a RegExpParseToken list *) | SubPattern of RegExpParseToken list | Alternation of RegExpParseToken list list | Option of RegExpParseToken | AnyNumber of RegExpParseToken | AtLeastOne of RegExpParseToken | CharacterRange of RegExpCharacterRangeUnit | AnyCharacter | StartMarker | EndMarker | Literal of char (* Shorthand used to simplify code *) type RegExpParseSyntax = RegExpParseToken list module RegExParsing = (* processes an escaped character in the body of the regular expression *) let parseRegExpEscape ps : RegExpParseToken * char list = match ps with | '\\'::ps' -> (Literal '\\', ps') | '['::ps' -> (Literal '[', ps') | ']'::ps' -> (Literal ']', ps') | '^'::ps' -> (Literal '^', ps') | '$'::ps' -> (Literal '$', ps') | '*'::ps' -> (Literal '*', ps') | '+'::ps' -> (Literal '+', ps') | '?'::ps' -> (Literal '?', ps') | '|'::ps' -> (Literal '|', ps') | '('::ps' -> (Literal '(', ps') | ')'::ps' -> (Literal ')', ps') | '.'::ps' -> (Literal '.', ps') | 't'::ps' -> (Literal '\t', ps') | 'n'::ps' -> (Literal '\n', ps') | 'r'::ps' -> (Literal '\r', ps') | 'x'::A::B::ps' -> (Literal (Text.asciiChar (Hex.hex8 A B)), ps') | 'u'::A::B::C::D::ps' -> (Literal (Text.unicodeChar (Hex.hex16 A B C D)), ps') | 'X'::A::B::ps' -> (Literal (Text.asciiChar (Hex.hex8 A B)), ps') | 'U'::A::B::C::D::ps' -> (Literal (Text.unicodeChar (Hex.hex16 A B C D)), ps') | _ -> failwith "invalid escape sequence" (* processes an escaped character within a character range block *) let processRangeEscape ps = match ps with | '\\'::ps' -> ('\\', ps') | ']'::ps' -> (']', ps') | '^'::ps' -> ('^', ps') | 't'::ps' -> ('\t', ps') | 'n'::ps' -> ('\n', ps') | 'r'::ps' -> ('\r', ps') | '-'::ps' -> ('-', ps') | 'x'::A::B::ps' -> (Text.asciiChar (Hex.hex8 A B), ps') | 'u'::A::B::C::D::ps' -> (Text.unicodeChar (Hex.hex16 A B C D), ps') | 'X'::A::B::ps' -> (Text.asciiChar (Hex.hex8 A B), ps') | 'U'::A::B::C::D::ps' -> (Text.unicodeChar (Hex.hex16 A B C D), ps') | _ -> failwith "invalid escape sequence" (* main processing for a character range with tail recursion finisher -> turns a list of terms into a RegExpParseToken current -> a character that has already been subject to escape processing and other special handling ps -> the remainder of the pattern cases of 'ps': 1) the range hasn't been closed

2) ]... end of the range 3) -]... end of the range with a literal '-' 4) -\... range of current to an escaped character 5) -X]... range of current to X and end of the range 6) -X\... range of current to X followed by an escaped character 7) -XY... range of current to X followed by normal processing of Y 8) \... literal of current followed by an escaped character 9) X... literal of current followed by normal processing of X *) let rec parseRegExpRangeBody1 finisher terms current ps = let unclosed () = failwith "unclosed character range expression" match ps with | [] -> unclosed () | ']'::rs -> (finisher (Character current::terms), rs) | '-'::']'::rs -> (finisher (Character current::Character '-'::terms), rs) | '-'::'\\'::rs -> // Range of current to escaped character c and remainder cs let (c, cs) = processRangeEscape rs match cs with | [] -> unclosed () | ']'::cs' -> (finisher (Range (current, c)::terms), cs') | '\\'::cs' -> parseRegExpRangeBody1 finisher (Character current::terms) c cs' | c'::cs' -> parseRegExpRangeBody1 finisher (Character current::terms) c' cs' | '-'::r::']'::rs -> // Range of current to normal character r followed by closure (finisher (Range (current, r)::terms), rs) | '-'::r::'\\'::rs -> // Range of current to normal character r followed by escaped character let (c, cs) = processRangeEscape rs parseRegExpRangeBody1 finisher (Range (current, r)::terms) c cs | '-'::r::r'::rs -> // Range of current to normal character r, next character r' parseRegExpRangeBody1 finisher (Range (current, r)::terms) r' rs | '\\'::rs -> let (c, cs) = processRangeEscape rs parseRegExpRangeBody1 finisher (Character current::terms) c cs | r::rs -> parseRegExpRangeBody1 finisher (Character current::terms) r rs (* starts the processing of a character range block *) let parseRegExpRangeBody finisher ps = match ps with | [] -> failwith "unclosed character range expression" | ']'::']'::rs -> (finisher [Character ']'], rs) | ']'::rs -> parseRegExpRangeBody1 finisher [] ']' rs | '-'::rs -> parseRegExpRangeBody1 finisher [] '-' rs | '\\'::rs -> let (c, cs) = processRangeEscape rs parseRegExpRangeBody1 finisher [] c cs | r::rs -> parseRegExpRangeBody1 finisher [] r rs (* a finisher function for an inverted character range *) let buildInvertedCharacterRange terms = CharacterRange (Inversion terms) (* a finisher function for an matching character range *) let buildMatchingCharacterRange terms = CharacterRange (Matching terms) (* processes a character range expression in the body of a regular expression pattern *) let parseRegExpRange ps = match ps with | '^'::rs -> parseRegExpRangeBody buildInvertedCharacterRange rs | _ -> parseRegExpRangeBody buildMatchingCharacterRange ps (* handles common functionality for multiplicity constructs *) let parseRegExpMultiplicity (context : RegExpParseToken list) = let error () = failwith "invalid position for multiplicity character (?, +, *)" let testMultiplicityValid (t:RegExpParseToken) = match t with | SubPattern _ -> () | Literal _ -> () | CharacterRange _ -> () | AnyCharacter -> () | _ -> error () match context with | [] -> error () | c::cs -> testMultiplicityValid c (c, cs)

(* progressively builds a parsed pattern with tail recursion context contains the parsed content so far as a stack *) let rec parseRegExpFragment (pattern : char list) (context : RegExpParseToken list list) = match pattern with | [] -> match context with | c :: (Alternation alts::ccs)::ds -> let newAlt = Alternation (List.rev (List.rev c::alts)) [newAlt] :: ds | _ -> context | p :: ps -> match context with | [] -> failwith "invalid input" | c :: cs -> match p with | '[' -> let (parseToken:RegExpParseToken, pattern2: char list) = parseRegExpRange ps parseRegExpFragment pattern2 ((parseToken::c)::cs) | '(' -> parseRegExpFragment ps ([]::((StartSubPattern::c)::cs)) | ')' -> match cs with | [] -> failwith "not in a sub expression" | (StartSubPattern::ccs)::ds -> let previous = (SubPattern (List.rev c))::ccs let newContext = previous::ds parseRegExpFragment ps newContext | [Alternation alts]::(StartSubPattern::ccs)::ds -> let newAlt = Alternation (List.rev ((List.rev c)::alts)) let previous = (SubPattern [newAlt])::ccs let newContext = previous::ds parseRegExpFragment ps newContext | _ -> failwith "algorithm error" | '\\' -> let (parseToken:RegExpParseToken, pattern2: char list) = parseRegExpEscape ps parseRegExpFragment pattern2 ((parseToken::c)::cs) | '|' -> match cs with | [] -> let newContext = []::([Alternation [List.rev c]])::cs parseRegExpFragment ps newContext | (Alternation alts::ccs)::ds -> let newAlt = Alternation ((List.rev c)::alts) let newContext = []::(newAlt::ccs)::ds parseRegExpFragment ps newContext | ccs::ds -> let previous = Alternation [List.rev c] let newContext = []::[previous]::(ccs::ds) parseRegExpFragment ps newContext | '?' -> let (d, ds) = parseRegExpMultiplicity c parseRegExpFragment ps ((Option d::ds)::cs) | '*' -> let (d, ds) = parseRegExpMultiplicity c parseRegExpFragment ps ((AnyNumber d::ds)::cs) | '+' -> let (d, ds) = parseRegExpMultiplicity c parseRegExpFragment ps ((AtLeastOne d::ds)::cs) | '.' -> parseRegExpFragment ps ((AnyCharacter::c)::cs) | '^' -> parseRegExpFragment ps ((StartMarker::c)::cs) | '$' -> parseRegExpFragment ps ((EndMarker::c)::cs) | _ -> parseRegExpFragment ps ((Literal p::c)::cs) (* main function for converting a pattern into a parsed regular expression syntax *) let parseRegExp (pattern : string) : RegExpParseSyntax = let ps = Seq.to_list pattern (parseRegExpFragment ps [ [] ]) |> List.hd |> List.rev

RegExCompiling module

RegExCompiling.fsi

#light namespace RegExCompiling open RegExParsing open Graph (* RegExMatchMode determines matching or searching semantics *) type RegExMatchMode = FullMatch | FirstMatch (* Vertices represent states in the state machine. Start indicates the starting state. Normal is a non-finishing states Closure is a state that can complete the state machine *) type NdfaNodeType = Normal | Start | Closure (* This is used for additional semantic information to attach to states *) type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern (* This is the type used for vertex data *) type NdfaNode = NdfaNodeType * NdfaNodeSubPatternData option (* This is the type used for edge data *) type NdfaEdge = | Auto (* used to represent free transitions *) | Simple of char | AnyChar | StartAssertion | EndAssertion | CharacterTest of RegExpCharacterRangeUnit | NonGreedyAnyCharacter (* used for partial matching *) (* shorthand for the state machine graph type *) type NdfaGraph = Graph<NdfaNode, NdfaEdge> module RegExCompiling = val compile : RegExMatchMode -> RegExpParseSyntax -> NdfaGraph val parseAndCompile : RegExMatchMode -> string -> NdfaGraph

RegExCompiling.fs

#light namespace RegExCompiling open RegExParsing open Graph (* RegExMatchMode determines matching or searching semantics *) type RegExMatchMode = FullMatch | FirstMatch (* Vertices represent states in the state machine. Start indicates the starting state. Normal is a non-finishing states Closure is a state that can complete the state machine *) type NdfaNodeType = Normal | Start | Closure (* This is used for additional semantic information to attach to states *) type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern (* This is the type used for vertex data *) type NdfaNode = NdfaNodeType * NdfaNodeSubPatternData option (* This is the type used for edge data *) type NdfaEdge = | Auto (* used to represent free transitions *) | Simple of char | AnyChar | StartAssertion | EndAssertion | CharacterTest of RegExpCharacterRangeUnit | NonGreedyAnyCharacter (* used for partial matching *) (* shorthand for the state machine graph type *) type NdfaGraph = Graph<NdfaNode, NdfaEdge> type CompileState = NdfaGraph * (* graph *) int list * (* trailing nodes *) int * (* next edge priority *) bool (* should force a node *) module RegExCompiling = (* curried function to build a CompileState tuple *) let toCompileState g outs initialPriority forceNode : CompileState = (g, outs, initialPriority, forceNode) (* connects trailing edges to the node v p is the edge priority e is the edge data *) let patch (g:NdfaGraph) (outs:int list) v p e = outs |> List.fold (fun g' out -> g' |> Graph.addEdge p out v e (* returns the new edge id and the new graph, but just want the graph *) |> snd) g (* simplifies the graph by joining multiple trailing edges to a single marker state, but does not change the graph for a single trailing edge unless forceNode is specified which is useful in some cases to prevent incorrect graphs resulting.*) let join ((g, outs, initialPriority, forceNode):CompileState) = match outs with | [] -> failwith "invalid graph" | [out] when forceNode = false -> (out, g) | _ -> let (v, g') = g |> Graph.addVertex (Normal, None) let g'' = patch g' outs v initialPriority Auto (v, g'') (* adds a single node to the end of the graph *) let add1 ((g,outs,initialPriority,forceNode):CompileState) (finalPriority:int) (vertex:NdfaNode) (edge:NdfaEdge) = let (v, g') = g |> Graph.addVertex vertex let (out, g'') = join (toCompileState g' outs initialPriority forceNode) let (_, g''') =

g'' |> Graph.addEdge initialPriority out v edge toCompileState g''' [v] finalPriority false (* progressively builds the NDFA graph the first parameter is the state going into the processing of a step. It comprises: g: the graph so far outs: the tailing nodes that should be connected initialPriority: used to decide among multiple alternations forceNode: used to prevent errors involving alternations and multiplicities It returns this tuple as the result and as the accumulator going into List.fold *) let rec stepNdfa (currentState:CompileState) (next:RegExpParseToken) : CompileState = let (g, outs, initialPriority, forceNode) = currentState (* Note: the Option, AnyNumber and AtLeastOne match types have this code in common. The function adds the subpattern to the existing graph and returns the new graph with the identifiers for the nodes representing the entry and exit from the subpattern. Each of the match types handle these nodes slightly differently *) let multiplicity subpat = (* simplify the graph with join *) let (v, g') = join currentState (* connect the subpattern *) let state' = subpat |> stepNdfa (toCompileState g' [v] initialPriority false) (* simplify the output of the subpattern *) let (out', g'') = join state' (g'', v, out') match next with | Literal l -> add1 currentState 0 (Normal, None) (Simple l) | AnyCharacter -> add1 currentState 0 (Normal, None) AnyChar | CharacterRange range -> add1 currentState 0 (Normal, None) (CharacterTest range) | StartMarker -> add1 currentState 0 (Normal, None) StartAssertion | EndMarker -> add1 currentState 0 (Normal, None) EndAssertion | Option subpat -> let (g', vin, vout) = multiplicity subpat (* connect the input to the output directly since this is an optional step *) let g'' = patch g' [vin] vout 0 Auto toCompileState g'' [vout] 0 true | AnyNumber subpat -> let (g', vin, vout) = multiplicity subpat (* connect the output to the input to allow for more *) let g'' = patch g' [vout] vin 0 Auto (* the tailing vertex is the input in this case *) toCompileState g'' [vin] 0 true | AtLeastOne subpat-> let (g', vin, vout) = multiplicity subpat (* connect the output to the input to allow for more *) let g'' = patch g' [vout] vin 0 Auto (* the tailing vertex is the output of the subpattern in this case *) toCompileState g'' [vout] 0 true | SubPattern subpat -> (* v and v' mark the subpattern region *) let (v, g') = g |> Graph.addVertex (Normal, Some StartSubPattern) let (v', g'') = g' |> Graph.addVertex (Normal, Some EndSubPattern) (* connect v to the tail of the graph *) let g''' = patch g'' outs v initialPriority Auto (* build the subpattern *) let (g'''', outs', _, _) = subpat |> Seq.fold stepNdfa (toCompileState g''' [v] 0 true) (* attach the subpattern closure *) let g''''' = patch g'''' outs' v' 0 Auto (* return the new process state *) toCompileState g''''' [v'] 0 true | Alternation alts -> (* priority is used by alternation to choose a match when multiple are possible *) (* first join the inputs *)

let (v, g') = join currentState (* this creates the graph for an alternation *) let applyAlt g'alt p'alt subpat = subpat |> List.fold stepNdfa (toCompileState g'alt [v] p'alt true) (* but for folding, we need to build the combined list of tailing edges, which is what this function does *) let foldAlt (ins'alt, g'alt, p'alt, outs'alt:int list) subpat = let (g'alt', outs'alt':int list, _, _) = applyAlt g'alt p'alt subpat (ins'alt, g'alt', p'alt + 1, List.append outs'alt' outs'alt) (* apply foldAlt across the alternations *) let (_,g'',_,outs') = alts |> List.fold foldAlt ([v], g', 0, []) toCompileState g'' outs' 0 false | _ -> failwith "invalid syntax tree" (* this function turns the trailing vertices to be closures *) let completeNdfa ((g, outs, initialPriority, forceNode):CompileState) = let mutate (vId:int) ((nodeType,subMatchData):NdfaNode) = if List.exists (fun x -> x = vId) outs then (Closure, subMatchData) else (nodeType, subMatchData) g |> Graph.mapVertices mutate (* this is the main module entry point for building the NDFA *) let toNDFA (mode:RegExMatchMode) (syntax:RegExpParseSyntax) = let (v, g) = Graph.empty |> Graph.addVertex (Start, None) let startState = match mode with | FullMatch -> toCompileState g [v] 0 false | FirstMatch -> add1 (toCompileState g [v] 0 true) 0 (Normal, None) NonGreedyAnyCharacter syntax |> Seq.fold stepNdfa startState |> completeNdfa (* This produces the NDFA from a parsed syntax *) let compile (mode:RegExMatchMode) (syntax:RegExpParseSyntax) = syntax |> toNDFA mode (* This produces the NDFA from a pattern string *) let parseAndCompile (mode:RegExMatchMode) (pattern:string) = pattern |> RegExParsing.parseRegExp |> compile mode

RegExProcessor module

RegExProcessor.fsi

#light open Graph open RegExParsing open RegExCompiling module RegExProcessor = (* used to store captured text *) type CharList = System.Collections.Generic.LinkedList<char> (* used to represent the result of findFirst *) type RegExMatchResult = { matchState: int; matchPositions: (int*int) list buffer: CharList; bufferStart: int } (* tests whether the input sequence can match the NDFA *) val test : RegExCompiling.NdfaGraph -> char seq -> bool (* finds the first match and submatch data *) val firstMatch : RegExCompiling.NdfaGraph -> char seq -> RegExMatchResult option (* helper for building the ndfa directly *) val parseAndCompile : (RegExCompiling.RegExMatchMode -> string -> RegExCompiling.NdfaGraph)

RegExProcessor.fs

#light open Graph open RegExParsing open RegExCompiling module RegExProcessor = (* ------------------- TYPES ------------------- *) (* used to store captured text *) type CharList = System.Collections.Generic.LinkedList<char> (* used to represent the result of findFirst *) type RegExMatchResult = { matchState: int; matchPositions: (int*int) list buffer: CharList; bufferStart: int } (* represents a token in the input stream *) type RegExInput = | StartOfInput | EndOfInput of int (* index of last character *) | Char of char*int (* index of character *) (* represents a single possible state of the NDFA *) type RegExStateData = int (* state *) * int option (* index of start *) * int option (* index of end *) * int list (* submatch starts *) * int list (* submatch ends *) (* ------------------- GENERAL HELPERS ------------------- *) (* syntactic sugar for option types equivalent to the C# ?? coalescing operator *) let (=??) (x:'a option) (y:'a) = match x with | Some x -> x | _ -> y (* position identification from a RegExInput *) let inline currentIndex c = match c with | StartOfInput -> 0 | EndOfInput i -> i | Char (_, i) -> i (* successor position identification from a RegExInput *) let inline nextIndex c = match c with | StartOfInput -> 0 | EndOfInput i -> i | Char (_, i) -> i + 1 (* prior position identification from a RegExInput *) let inline previousIndex c = match c with | Char (_, i) when i > 0 -> i - 1 | EndOfInput i -> i - 1 | _ -> 0 (* start of a state *) let inline startOfState currentPos s = match s with | (_,Some x,_,_,_) -> x | _ -> currentPos (* start of a state option *) let inline startOfStateWrapped currentPos s = match s with | Some (_,Some x,_,_,_) -> x | _ -> currentPos

(* ------------------- SEQUENCE FLOWS ------------------- *) (* convert the input sequence into a RegExToken sequence *) let sequenceInput (i:char seq) = seq { yield StartOfInput let n = 0 let rn = ref n for x in i do yield (Char (x, !rn)) rn := !rn + 1 yield EndOfInput (!rn - 1) } (* FlowState<'T> wraps IEnumerator<'T> Note that this implementation does not dispose the IEnumerator and should not be used for production code *) type FlowState<'T> = { iter: System.Collections.Generic.IEnumerator<'T> ; isStart: bool ; isEnd: bool; count: int } (* get a new FlowState<'T> *) let seqToFlow (s:'T seq) = let enumerator = s.GetEnumerator() { iter = enumerator; isStart = true; isEnd = false; count = 0 } (* get the next input token from the flow *) let inline flowChar (s:FlowState<char>) : RegExInput * FlowState<char> = if (s.isStart) then (StartOfInput, {iter = s.iter; isStart = false; isEnd = false; count = 0 }) elif (s.isEnd) then (EndOfInput s.count, {iter = null; isStart = false; isEnd = true; count = s.count }) else let hasNext = s.iter.MoveNext() if hasNext <> true then (EndOfInput s.count, {iter = s.iter; isStart = false; isEnd = true; count = s.count }) else (Char (s.iter.Current, s.count), {iter = s.iter; isStart = false; isEnd = false; count = s.count + 1 }) (* test whether the flow is empty *) let inline flowEmpty (s:FlowState<'T>) = s.isEnd

(* ------------------- STATE TRANSITIONS ------------------- *) (* after a step, there may be multiple ways to be in a particular state. This function decides which of those is preferred, considering where the match began and the length of match *) let preferredStates ss : RegExStateData list = let initialstate = ([], None, None, None, None) let statefunc (result:RegExStateData list, id, start, endpos, y:RegExStateData option) (x:RegExStateData) = let (xid, xstart, xendpos, _, _) = x match id with | None -> (result, Some xid, xstart, xendpos, Some x) | Some idval when idval <> xid -> ((Option.get y)::result, Some xid, xstart, xendpos, Some x) | _ -> let s = start =?? 0 let s' = xstart =?? 0 let e = endpos =?? System.Int32.MaxValue let e' = xendpos =?? System.Int32.MaxValue if s < s' then (result, id, start, endpos, y) elif s > s' then (result, Some xid, xstart, xendpos, Some x) elif e <= e' then (result, id, start, endpos, y) else (result, Some xid, xstart, xendpos, Some x) let statecompletion (result, _, _, _, y) : RegExStateData list = match y with | Some yval -> yval::result | _ -> result ss |> List.sortBy (fun ((id,_,_,_,_):RegExStateData) -> id) |> List.fold statefunc initialstate |> statecompletion (* this function takes a pair of start and end lists for submatches and appends new entries if the vertex identified by id is a subpattern marker. Using previous and next allows proper handling of the start and end of input respectively *) let mapSubMatches (substarts, subends) (id:int) (previous:int) (next:int) (ndfa:NdfaGraph) = let targetNode = ndfa |> Graph.getVertex id |> Graph.vertexData let substarts' = match targetNode |> snd with | Some StartSubPattern -> next::substarts | _ -> substarts let subends' = match targetNode |> snd with | Some EndSubPattern -> previous::subends | _ -> subends (substarts',subends')

(* free transitions are state transitions that may be optional or are automatic. They do not depend on input tokens *) let freeTransitions (ndfa:RegExCompiling.NdfaGraph) (previous:int) (next:int) (x:RegExStateData) = let rec transition (newState:bool) (targetStates:RegExStateData list) ((id, start, stop, substarts, subends):RegExStateData) (es:Adjacency<NdfaEdge>) = match es with | [] when newState -> Some targetStates | [] -> None | e::es -> let target = edgeTarget e let (substarts',subends') = ndfa |> mapSubMatches (substarts,subends) target previous next match edgeData e with | Auto -> transition true ((target,start,stop,substarts',subends')::targetStates) (id,start,stop,substarts,subends) es | NonGreedyAnyCharacter -> transition true ((target,start,stop,substarts',subends')::targetStates) (id,Some next,stop,substarts,subends) es | _ -> transition newState targetStates (id,start,stop,substarts,subends) es let rec transitionAll accummulator ((id,start,stop,substarts,subends):RegExStateData) = let x = (id,start,stop,substarts,subends) let es = ndfa |> Graph.getEdges id let newStates = transition false [] x es match newStates with | None -> x::accummulator | Some ns -> ns |> List.collect (fun (x:RegExStateData) -> transitionAll (x::accummulator) x) (transitionAll [x] x) |> preferredStates (* test an individual character against a test pattern *) let testCharacter (criteria:RegExpCharacterRangeUnit) c = let test (criteria:RegExpCharacterRange) : bool = match criteria with | Character v -> v = c | Range (v1,v2) -> (v1 <= c) && (v2 >= c) match criteria with | Matching criteria -> criteria |> List.exists test | Inversion criteria -> (criteria |> List.exists test) <> true

(* attempt to transition from one state to another using the specified edge *) let step1edge ndfa bestMatch ((id, start, stop, substarts, subends):RegExStateData) (current:int) (previous:int) (next:int) c e = let et:NdfaEdge = Graph.edgeData e let tgt:int = Graph.edgeTarget e let (substarts',subends') = ndfa |> mapSubMatches (substarts,subends) tgt previous next let nt:NdfaNode = ndfa |> Graph.getVertex tgt |> Graph.vertexData let normalResult = [(tgt, start, Some current, substarts', subends')] let sourceResult = [(id, start, stop, substarts, subends)] let nonGreedyResult = [(tgt, Some (current+1), None, substarts', subends'); (id, Some (current+1), None, substarts, subends)] match c with | StartOfInput -> match et with | StartAssertion -> normalResult | _ -> sourceResult | EndOfInput _ -> match et with | EndAssertion -> normalResult | _ -> sourceResult | Char (c,_) -> match et with | NonGreedyAnyCharacter when bestMatch = None -> nonGreedyResult | Simple c' when c = c' -> normalResult | AnyChar -> normalResult | CharacterTest criteria when (testCharacter criteria c) -> normalResult | _ -> [] (* transition one state into a list of next states *) let step1state ndfa bestMatch current previous next c (x:RegExStateData) = let (id,_,_,_,_) = x let es:Adjacency<NdfaEdge> = ndfa |> Graph.getEdges id let mappedStates = es |> List.collect (step1edge ndfa bestMatch x current previous next c) match c with | EndOfInput _ -> (* can keep original state *) x::mappedStates | _ -> mappedStates

(* ------------------- STATE INITIALISATION ------------------- *) (* get an array with an entry for each state and a value indicating whether the state is a closure for the NDFA *) let getClosureMap (ndfa:RegExCompiling.NdfaGraph) = let maxVertex = (ndfa |> fst |> fst) - 1 let testClosure ndfa id = match Graph.tryGetVertex id ndfa with | Some v -> (v |> Graph.vertexData |> fst) = Closure | None -> false let isClosureMap = [| for x in 0..maxVertex -> testClosure ndfa x |] isClosureMap (* create a RegExStateData for the provided initial state *) let inline initialiseState v : RegExStateData = (v |> Graph.vertexId, Some 0, None, [], []) (* build the set of possible starting states *) let getStartStates (ndfa:RegExCompiling.NdfaGraph) = ndfa |> snd |> List.choose (fun v -> let nodeType = v |> Graph.vertexData |> fst if nodeType = RegExCompiling.Start then Some (v |> initialiseState) else None) |> List.collect (fun x -> freeTransitions ndfa 0 0 x) (* ------------------- MATCH SELECTION ------------------- *) (* given a previous best match and a set of states find the best match. If the previous match wins then return None, if there is no closure in the new states then return None, otherwise return Some newBestMatch *) let findBestMatch ndfa (closureMap:bool array) (bestMatch:RegExStateData option) (states:RegExStateData list) : RegExStateData option = let closureTest tgt = Array.get closureMap tgt let bestMatchTest (x:RegExStateData option) (y:RegExStateData) = let (id,start,stop,_,_) = y let yValid = closureTest id if yValid then match x with | None -> Some y | Some (_,start',stop',_,_) -> (* Check starts *) if (start' < start) then x elif (start < start') then Some y (* Also check length *) elif (Option.get stop) > (Option.get stop') then Some y (* Otherwise, prefer earliest match *) else x else x let bestMatch' = states |> List.fold bestMatchTest None match (bestMatch, bestMatch') with | None, _ -> bestMatch' | _, None -> None (* previous match wins *) | Some (id, start,stop,_,_), Some (id', start',stop', _, _) -> if (start < start') then None (* finish with previous match *) elif stop > stop' then bestMatch elif stop' > stop then bestMatch' else bestMatch'

(* ------------------- CHARACTER CAPTURE ------------------- *) (* Character capture methods: captureNull and captureNullInit allow for efficient testing of matches without returning the matched string captureList and captureListInit use a doubly-linked list of characters to capture the text *) let inline captureNullInit () = () let inline captureNull (bestMatch:RegExStateData option) (bestMatch':RegExStateData option) states c capture captureOffset = (capture, 0) let inline captureListInit () = new CharList() let inline captureList (bestMatch:RegExStateData option) (bestMatch':RegExStateData option) states c (capture:CharList) captureOffset = let currentPos = currentIndex c let start1 = startOfStateWrapped currentPos bestMatch let start2 = startOfStateWrapped currentPos bestMatch' let starts = List.append [start1;start2] (states |> List.map (startOfState currentPos)) let minOffset = starts |> List.min let trimLength = minOffset - captureOffset match c with | Char (x,_) -> capture.AddLast x |> ignore | _ -> () let rec reduce n = if n > 0 then capture.RemoveFirst() reduce (n-1) else () reduce trimLength (capture, minOffset)

(* ------------------- PROCESS ------------------- *) (* process a single input token *) let step1 ndfa closureMap bestMatch (states:RegExStateData list) (c:RegExInput) = let current = currentIndex c let next = nextIndex c let previous = previousIndex c let states' = states |> List.collect (step1state ndfa bestMatch current previous next c) let states'' = states' |> List.collect (fun (x:RegExStateData) -> freeTransitions ndfa previous next x) |> preferredStates let bestMatch' = states'' |> findBestMatch ndfa closureMap bestMatch (bestMatch', states'')

(* find a match *) let processInput captureMethod ndfa closureMap bestMatch capture captureOffset states (input:FlowState<char>) = (* processInner wraps the logic without repeated passing of captureMethod, ndfa and closureMap *) let rec processInner bestMatch capture captureOffset states (input:FlowState<char>) = match states with | [] -> (bestMatch, capture, captureOffset) | _ when (flowEmpty input) -> (bestMatch, capture, captureOffset) | _ -> let (c, input') = flowChar input (* remove states that start after an existing match *) let states' = if (Option.isSome bestMatch) then let currentPos = currentIndex c let minPos = startOfStateWrapped currentPos bestMatch states |> List.choose (fun (x:RegExStateData) -> if (startOfState currentPos x) > minPos then None else Some x) else states (* move forward a step *) let (bestMatch2, states2) = c |> step1 ndfa closureMap bestMatch states' (* store the characters if necessary *) let (capture2, offset2) = captureMethod bestMatch bestMatch2 states2 c capture captureOffset match (bestMatch, bestMatch2) with | None, _ -> (* might get a better match so carry on *) processInner bestMatch2 capture2 offset2 states2 input' | Some _, None when (List.isEmpty states2) -> (* no better match available *) (bestMatch, capture2, offset2) | Some _, None -> (* still might get a better match with more characters *) processInner bestMatch capture2 offset2 states2 input' | _ -> (* need to carry on even if already got a match because a longer match may follow, but prefer the most recent match, bestMatch2 *) processInner bestMatch2 capture2 offset2 states2 input' processInner bestMatch capture captureOffset states input

(* Starts the matching process *) let execute captureMethod initCaptureMethod ndfa cs = let input = seqToFlow cs let closureMap = getClosureMap ndfa let initialStates = getStartStates ndfa let bestMatch = findBestMatch ndfa closureMap None initialStates let initialCapture = initCaptureMethod () processInput captureMethod ndfa closureMap bestMatch initialCapture 0 initialStates input (* ------------------- ENTRY POINTS ------------------- *) (* tests whether the input sequence can match the NDFA *) let test (ndfa:RegExCompiling.NdfaGraph) (cs:char seq) : bool = let (bestMatch, chars, offset) = execute captureNull captureNullInit ndfa cs match bestMatch with | None -> false | _ -> true (* finds the first match and submatch data *) let firstMatch (ndfa:RegExCompiling.NdfaGraph) (cs:char seq) = let (bestMatch, chars, offset) = execute captureList captureListInit ndfa cs match bestMatch with | None -> None | Some (state, startpos, endpos, substarts, subends) -> let startpos' = startpos =?? 0 let endpos' = endpos =?? System.Int32.MaxValue let matches = (* starts are added in the order encountered ends are added in reverse order *) (startpos', endpos'):: (List.zip (substarts |> List.rev) subends) Some { matchState = state; matchPositions = matches; buffer = chars; bufferStart = offset } (* helper for building the ndfa directly *) let parseAndCompile = RegExCompiling.parseAndCompile

Main program

Program.fs

#light open Graph open RegExParsing open RegExCompiling open RegExProcessor open GleeGraph open Microsoft.Glee.Drawing (* time a function *) let time (f:unit->'T) = let timeZero = System.DateTime.Now let result = f () ; (System.DateTime.Now.Subtract(timeZero).TotalMilliseconds, result) let nodeAdapter (v:Vertex<NdfaNode, NdfaEdge>) (n:Microsoft.Glee.Drawing.Node) = let (nt:NdfaNodeType, sp:NdfaNodeSubPatternData option) = vertexData v n.Attr.Shape <- match nt with | Start -> Shape.Point | Closure -> Shape.DoubleCircle | Normal -> Shape.Circle n.Attr.Color <- match sp with | None -> Color.Black | Some StartSubPattern -> Color.Blue | Some EndSubPattern -> Color.Red let edgeAdapter ((id,priority,target,data):EdgeData<NdfaEdge>) (e:Microsoft.Glee.Drawing.Edge) = match data with | Auto -> e.Attr.Color <- Color.Blue | Simple c -> e.Attr.Label <- c.ToString() e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Black | CharacterTest t -> e.Attr.Label <- (sprintf "$test %20A" t) e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Black | AnyChar -> e.Attr.Label <- "$any" e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Black | StartAssertion -> e.Attr.Label <- "$start" e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Red | EndAssertion -> e.Attr.Label <- "$end" e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Red | NonGreedyAnyCharacter -> e.Attr.Label <- "$any*[non-greedy]" e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Blue

(* displays the main application form *) let run () = let frm = new System.Windows.Forms.Form() frm.Text <- "RegEx Compiler" let panel = new System.Windows.Forms.FlowLayoutPanel() panel.Size <- new System.Drawing.Size(100, 60) panel.Dock <- System.Windows.Forms.DockStyle.Top let txtBoxRegex = new System.Windows.Forms.TextBox() txtBoxRegex.Text <- "" let chkboxFullMatch = new System.Windows.Forms.CheckBox() chkboxFullMatch.Text <- "Initial matches" chkboxFullMatch.Checked <- true let btnGo = new System.Windows.Forms.Button() btnGo.Text <- "Generate" let txtBoxText = new System.Windows.Forms.TextBox() txtBoxText.Text <- "" let btnTest = new System.Windows.Forms.Button() btnTest.Text <- "Test" let label l = let c = new System.Windows.Forms.Label() c.Text <- l c :> System.Windows.Forms.Control panel.Controls.AddRange [| (label "regular expression") ; (txtBoxRegex :> System.Windows.Forms.Control) ; (chkboxFullMatch:>System.Windows.Forms.Control) ; (btnGo :>System.Windows.Forms.Control) ; (label "test text") ; (txtBoxText :>System.Windows.Forms.Control) ; (btnTest :>System.Windows.Forms.Control) |] panel.SetFlowBreak (btnGo, true) let gc = new Microsoft.Glee.GraphViewerGdi.GViewer() gc.Graph <- new Microsoft.Glee.Drawing.Graph("empty") gc.Graph.GraphAttr.Orientation <- Microsoft.Glee.Drawing.Orientation.Landscape gc.Dock <- System.Windows.Forms.DockStyle.Fill; frm.Controls.Add gc frm.Controls.Add panel frm.Size <- new System.Drawing.Size(600, 600) let getSyntax () = let syntax = txtBoxRegex.Text |> RegExParsing.parseRegExp let mode = if (chkboxFullMatch.Checked) then FullMatch else FirstMatch let g = syntax |> RegExCompiling.compile mode printf "Pattern: %A\n\n" txtBoxRegex.Text printf "Syntax: %A\n\n" syntax printf "Graph: %A\n\n" g g let applyGleeGraph () = let g = getSyntax() gc.Graph <- (Graph.toGlee txtBoxRegex.Text nodeAdapter edgeAdapter g) g let applyGraph () = let g = applyGleeGraph () printf "-------------------------------------------------------------\n\n" let testMatch () = let g = applyGleeGraph () let txt = txtBoxText.Text printf "Test text: %A\n\n" txt let fn () = RegExProcessor.firstMatch g txt let timeResult = time fn printf "Time: %Ams\n" (fst timeResult) printf "Test match: %A\n\n" (snd timeResult) printf "-------------------------------------------------------------\n\n" btnGo.Click.Add (fun _ -> applyGraph () ) btnTest.Click.Add (fun _ -> testMatch () ) applyGraph () System.Windows.Forms.Application.Run frm

run () (* Test the processing of a million characters *) let runTest () = let pattern = "a+b|a" let chars = Seq.append (Seq.init 1000000 (fun a -> 'a')) (Seq.singleton 'b') let result () = chars |> ( pattern |> RegExCompiling.parseAndCompile RegExMatchMode.FirstMatch |> RegExProcessor.test ) let resultVal = time result printf "time: %Ams ; result: %A\n\n" (resultVal |> fst) (resultVal |> snd) (* uncomment the following line to run the test. usual runtime is around 20 seconds *) (* runTest () *)

Date post:	03-Feb-2022
Category:	Documents
Upload:	others
View:	11 times
Download:	0 times

Building a Regular Expression Parser, Compiler and Processor

Documents