Recursive descent parsing
Dan S. Wallach and Mack Joyner, Rice University
Copyright © 2016 Dan Wallach, All Rights Reserved
Mac users: don’t upgrade! IntelliJ crashes. Don’t upgrade! Watch Piazza for instructions once they fix it and we verify it works.
Last time: recursive grammars
Vocabulary reminder: context-free grammars
The LHS is always a single non-terminal.
• Example: matched numbers of a’s and b’s
• S→A
• A→a A b
• A→∅ (empty)
We’ve seen these before!
Data definitions are also grammars!
A list is: a head value and a tail list; or an empty list.
List→value List
List→∅
A tree is: a value, a left-tree, and a right-tree; or an empty-tree.
Tree→value Tree Tree
Tree→∅
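The list productions above map directly onto a recursive data type. Here is a minimal sketch in plain Java (the names `IntList`, `Cons`, and `Empty` are made up for illustration; this is not the course’s `IList`):

```java
// List→value List | List→∅, as a recursive data definition.
interface IntList {
    int length();

    // List→value List : a head value and a tail list
    final class Cons implements IntList {
        final int head;
        final IntList tail;
        Cons(int head, IntList tail) { this.head = head; this.tail = tail; }
        public int length() { return 1 + tail.length(); }  // recurse on the tail
    }

    // List→∅ : the empty list
    final class Empty implements IntList {
        public int length() { return 0; }  // base case: nothing to count
    }

    static IntList of(int... values) {
        IntList result = new Empty();
        for (int i = values.length - 1; i >= 0; i--) {
            result = new Cons(values[i], result);
        }
        return result;
    }
}
```

Note how `length` has exactly one case per grammar production: the recursion over the data mirrors the recursion in the grammar.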
What’s a parser?
Given a sentence (list of tokens) and a language (grammar): construct a derivation or parse tree if the sentence is in the language. If the sentence is not in the language, report some sort of error.
Last week’s project: constructing the list of tokens. Regular expressions to match specific tokens.
This week’s project: constructing the parse tree. Recursive code to process the tokens.
What’s a parser? Parsers: list of tokens → parse tree
{ "name": "Dan Wallach", "email": "[email protected]", "classes": [ "Comp215", "Comp427" ] }
The resulting parse tree:
JObject
  JKeyValue
    JString name
    JString Dan Wallach
  JKeyValue
    JString email
    JString [email protected]
  JKeyValue
    JString classes
    JArray
      JString Comp215
      JString Comp427
How do you write a parser?
Option 1: “recursive-descent parser” You’re writing one this week for JSON. By hand.
Option 2: “table-driven parser” Tools take a BNF grammar and write your parser for you automatically. You’ll see this more in Comp412 and elsewhere.
Theory stuff
The set of languages you can parse with recursive descent (top-down) and one-token lookahead is called LL(1).
The set of languages you can parse with a table-driven (bottom-up) parser and one-token lookahead is called LR(1).
They’re not equivalent, but for Comp215, we don’t really care. It’s at least useful that you’ve heard the terms.
Recursive-descent parsing
Let’s use a simple example: s-expressions. Every LISP-family program is an s-expression. Example: factorial
(define (factorial n) (if (= n 0) 1 (* n (factorial (- n 1)))))
We’ve got only three kinds of tokens: • open parenthesis • close parenthesis • “words” (terminals)
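With only three token kinds, the scanner is tiny. A simplified sketch using plain `java.util.regex` (this stands in for the course’s NamedMatcher; the class and method names here are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SexprTokens {
    // One alternative per token kind: a paren, or a "word"
    // (any run of characters that is neither whitespace nor a paren).
    private static final Pattern TOKEN = Pattern.compile("[()]|[^\\s()]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {          // find() skips over whitespace for us
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

For example, `tokenize("(add 1 2)")` yields the five tokens `(`, `add`, `1`, `2`, `)`.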
S-expression grammar
S→word    an s-expression can be any word (i.e., any terminal)
S→( L )   an s-expression can be parentheses around a list
L→∅       a list can be nothing
L→S L     or an s-expression followed by another list
Hey, look, it’s the data definition of a list!
The grammar cannot be left-recursive
Why? Because we want our parser to “eat” one token each time. Guarantees recursion will terminate. (By inductive proof!)
S→word    RHS is a terminal. Definitely not left-recursive.
S→( L )   We eat a left-paren, so we’re always making progress.
L→∅
L→S L     Seems scary, but not a big deal: a non-empty list starts off with an s-expression, so a word or a left-paren.
The grammar cannot be ambiguous
Why? Because we want exactly one valid parse tree. Uniqueness is essential. Otherwise, what does the data “mean”?
S→word
S→( L )
L→∅
L→S L
The first token (word or open-paren) tells us unambiguously which production we’re supposed to use. If we find a word or open-paren, then we’re dealing with a non-empty list (so the bottom production). If we find a close-paren, then we’re dealing with an empty list. No ambiguity.
Writing a recursive-descent parser
Example code in this week’s code dump: edu.rice.sexpr
Three main .java files:
Scanner: uses our NamedMatcher to tokenize an s-expression
Value: functional data definition for our s-expressions
Parser: mutually recursive functions that eat tokens, return Values
Data definition
public interface Value {
  enum ValueType { SEXPR, WORD }

  class Sexpr implements Value {
    private final IList<Value> valueList;
    ...
  }

  class Word implements Value {
    private final String word;
    ...
  }
}
We’ve got two kinds of values we care about: words and s-expressions. An Sexpr is a Value that has a list of Values inside. A Word is a Value with a string inside.
Parser engineering
The externally visible builder / factory method:
  public static Option<Value> parseSexpr(String input) { ... }
The internal helper methods:
  static Option<Result<Value>> makeSexpr(IList<Token<SexprPatterns>> tokenList) { ... }
  static Option<Result<Value>> makeWord(IList<Token<SexprPatterns>> tokenList) { ... }
If the parser fails, you get back Option.none(). Given a list of tokens, makeSexpr returns an s-expression “result” if it can, and makeWord returns a word “result” if it can.
Internal results management
If the parsing function succeeds, it returns Option.some() of:
• a production (Word, Sexpr, etc.)
• a list of remaining tokens
And if it fails? It returns Option.none().
static class Result<T> {
  public final T production;
  public final IList<NamedMatcher.Token<Scanner.SexprPatterns>> tokens;
  ...
}
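The Result idea can be seen in isolation. A minimal sketch (hypothetical names, simplified to a plain `List<String>` token stream rather than the course’s types): a successful parse hands back both what it built and the tokens it didn’t consume.

```java
import java.util.List;
import java.util.function.Function;

class ParseResult<T> {
    final T production;          // what this production built
    final List<String> tokens;   // leftover tokens for the caller to consume

    ParseResult(T production, List<String> tokens) {
        this.production = production;
        this.tokens = tokens;
    }

    // Transform the production while passing the leftover tokens through,
    // the same rewrapping the slides' map() calls on Result<> perform.
    <R> ParseResult<R> map(Function<T, R> f) {
        return new ParseResult<>(f.apply(production), tokens);
    }
}
```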
List parsing is recursive, just like lists

static Option<Result<Value>> makeSexpr(IList<Token<SexprPatterns>> tokenList) {
  return tokenList.match(
      emptyList -> Option.none(),
      (token, remainingTokens) -> (token.type == SexprPatterns.OPEN)
          ? makeSexprHelper(remainingTokens)
              .map(result -> new Result<>(new Sexpr(result.production), result.tokens))
          : Option.none());
}

private static Option<Result<IList<Value>>> makeSexprHelper(IList<Token<SexprPatterns>> tokenList) {
  return tokenList.match(
      emptyList -> Option.none(),
      (token, remainingTokens) -> (token.type == SexprPatterns.CLOSE)
          ? Option.some(new Result<>(List.makeEmpty(), remainingTokens))
          : makeValue(tokenList)
              .flatmap(headResult -> makeSexprHelper(headResult.tokens)
                  .map(tailResults -> new Result<>(tailResults.production.add(headResult.production),
                      tailResults.tokens))));
}

If there are no tokens left, we can’t make anything! An Sexpr must start with an open-paren, and if so, we want to get an IList of all the Values within. The helper requires at least one token: if we find a close-paren, then we return an empty list. Otherwise…
Recursive list parsing

// tokenList: every token (the argument to the function)
// token: the first token (from the match(), which we know, at this point, is not a close-paren)
// remainingTokens: all but the first token (also from the match())

makeValue(tokenList)
    .flatmap(headResult -> makeSexprHelper(headResult.tokens)
        .map(tailResults -> new Result<>(tailResults.production.add(headResult.production),
            tailResults.tokens)));

Recursively, see if it’s another Sexpr or Word. (We ate the open-paren beforehand; we’re making progress!) If that succeeded (Option.some), then we’ll recursively call ourselves with the remaining tokens. And if that succeeded, we’ll take the IList<Value> from the tail and put our head value on the front. Whatever tokens we didn’t eat, after we hit the close-paren, we’ll pass along to our parent for further processing.
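The same mutually recursive structure can be sketched with plain JDK types. This stripped-down version uses `java.util.Optional`, a `List<String>` of tokens, and an index instead of the course’s Option/IList/Result types (all names here are simplified stand-ins, not the project’s API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class TinySexprParser {
    // Res pairs a parsed value with the index of the next unconsumed token.
    static final class Res<T> {
        final T value;
        final int next;
        Res(T value, int next) { this.value = value; this.next = next; }
    }

    // S→word | ( L )
    static Optional<Res<Object>> makeValue(List<String> toks, int i) {
        if (i >= toks.size()) return Optional.empty();  // no tokens left
        String t = toks.get(i);
        if (t.equals("(")) {
            // Eat the open-paren, then parse the list inside: progress guaranteed.
            return makeList(toks, i + 1);
        }
        if (t.equals(")")) return Optional.empty();     // a close-paren is not a value
        return Optional.of(new Res<>(t, i + 1));        // S→word: eat one token
    }

    // L→∅ | S L  (a close-paren ends the list; anything else must be a value)
    static Optional<Res<Object>> makeList(List<String> toks, int i) {
        if (i >= toks.size()) return Optional.empty();  // unbalanced parens
        if (toks.get(i).equals(")")) {
            return Optional.of(new Res<>(new ArrayList<>(), i + 1));  // L→∅
        }
        // Parse the head value, then recursively parse the tail of the list,
        // then put the head value on the front — just like the slide's flatmap/map.
        return makeValue(toks, i).flatMap(head ->
            makeList(toks, head.next).map(tail -> {
                @SuppressWarnings("unchecked")
                List<Object> list = (List<Object>) tail.value;
                list.add(0, head.value);
                return new Res<Object>(list, tail.next);
            }));
    }
}
```

Parsing the tokens `(`, `a`, `b`, `)` yields the list `[a, b]` with all four tokens consumed; a stray `)` yields `Optional.empty()`.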
Review: Option.map and Option.flatmap
default <R> Option<R> map(Function<T, R> func) {
  return flatmap(val -> some(func.apply(val)));
}

default <R> Option<R> flatmap(Function<T, Option<R>> func) {
  return match(Option::none, func);
}
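The same map/flatmap distinction shows up in `java.util.Optional`, the JDK’s analogue of our Option: map is for functions returning a plain value (it wraps the result for you), flatMap is for functions that already return an Optional. A small demo (`parseInt` and `doubled` are made-up helper names):

```java
import java.util.Optional;

class OptionDemo {
    // A function that already returns an Optional: parsing may fail.
    static Optional<Integer> parseInt(String s) {
        try {
            return Optional.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    static Optional<Integer> doubled(String s) {
        // flatMap: parseInt already returns Optional, so no extra wrapping.
        // map: n -> n * 2 returns a plain int, so map wraps it back up.
        return Optional.of(s).flatMap(OptionDemo::parseInt).map(n -> n * 2);
    }
}
```

If we had used `map(OptionDemo::parseInt)`, we’d get a doubly nested `Optional<Optional<Integer>>`; flatMap flattens that one level away, exactly as in the parser code.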
flatmap, map, etc.

// tokenList: every token (the argument to the function)
// token: the first token (from the match(), which we know, at this point, is not a close-paren)
// remainingTokens: all but the first token (also from the match())

makeValue(tokenList)
    .flatmap(headResult -> makeSexprHelper(headResult.tokens)
        .map(tailResults -> new Result<>(tailResults.production.add(headResult.production),
            tailResults.tokens)));

makeValue returns Option<Result<Value>>. makeSexprHelper already returns Option<Result<IList<Value>>>, so we don’t want to say Option.some() of it. It’s already an Option. Thus flatmap. The map gets us Result<IList<Value>> from the tail, and then we’re adding on the value from the head. It only runs if makeSexprHelper returned Option.some().
Cool trick: trying multiple productions

private static final IList<Function<IList<Token<SexprPatterns>>, Option<Result<Value>>>> MAKERS =
    List.of(Parser::makeSexpr, Parser::makeWord);

static Option<Result<Value>> makeValue(IList<Token<SexprPatterns>> tokenList) {
  return MAKERS.oflatmap(x -> x.apply(tokenList)).ohead();
}

Read this slowly. MAKERS is a list of lambdas that take lists of tokens and return an optional result. It’s less ugly than it seems. The oflatmap call produces a list with all the Option.some() values, and none of the Option.none() values. So no more Option! If that list is non-empty, ohead() returns Option.some() of the head. If the list is empty, it returns Option.none().
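The same “try each production, take the first success” trick can be sketched with JDK types. Here the makers work on a single token for brevity, and the names (`Makers`, `makeOpen`, `makeWord`) are made up, not the project’s:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class Makers {
    static Optional<String> makeOpen(String tok) {
        return tok.equals("(") ? Optional.of("OPEN") : Optional.empty();
    }

    static Optional<String> makeWord(String tok) {
        return tok.matches("[^\\s()]+") ? Optional.of("WORD:" + tok) : Optional.empty();
    }

    // A list of parser functions: token -> optional result.
    static final List<Function<String, Optional<String>>> MAKERS =
        Arrays.asList(Makers::makeOpen, Makers::makeWord);

    // Apply each maker, keep only the successes, return the first (if any) --
    // the stream analogue of oflatmap(...).ohead().
    static Optional<String> makeValue(String tok) {
        return MAKERS.stream()
            .map(f -> f.apply(tok))
            .filter(Optional::isPresent)
            .map(Optional::get)
            .findFirst();
    }
}
```

Because the makers have no side effects, trying `makeOpen` first and falling through to `makeWord` is safe: a failed attempt leaves nothing to undo.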
Functional program engineering
Each maker-method has no side-effects, so we can try them all. No worries that one will crash the other one.
• Mutating parsers tend to “push back” tokens they don’t need
• Testing and debugging is tedious!

private static final IList<Function<IList<Token<SexprPatterns>>, Option<Result<Value>>>> MAKERS =
    List.of(Parser::makeSexpr, Parser::makeWord);

static Option<Result<Value>> makeValue(IList<Token<SexprPatterns>> tokenList) {
  return MAKERS.oflatmap(x -> x.apply(tokenList)).ohead();
}
Your project this week: Parse JSON
The grammar is pretty simple. We’ve already stubbed out the functions you’ll implement.
Start early. This will take some work!
Lab this week is on paper!
Bring a pencil or pen.
(Time to get you warmed up for the midterm.)
Live coding: writing complex expressions
We’ll walk through how to write and debug something like this:
makeValue(tokenList)
    .flatmap(headResult -> makeSexprHelper(headResult.tokens)
        .map(tailResults -> new Result<>(tailResults.production.add(headResult.production),
            tailResults.tokens)));