Automating Grammar Comparison

[OOPSLA Artifact Evaluation badges: Consistent · Complete · Well Documented · Easy to Reuse · Evaluated · Artifact · AEC]

Automating Grammar Comparison

Ravichandhran Madhavan, EPFL, Switzerland

[email protected]

Mikaël Mayer, EPFL, Switzerland

[email protected]

Sumit Gulwani, Microsoft Research, USA
[email protected]

Viktor Kuncak ∗

EPFL, Switzerland
[email protected]

Abstract

We consider from a practical perspective the problem of checking equivalence of context-free grammars. We present techniques for proving equivalence, as well as techniques for finding counter-examples that establish non-equivalence. Among the key building blocks of our approach is a novel algorithm for efficiently enumerating and sampling words and parse trees from arbitrary context-free grammars; the algorithm supports polynomial time random access to words belonging to the grammar. Furthermore, we propose an algorithm for proving equivalence of context-free grammars that is complete for LL grammars, yet can be invoked on any context-free grammar, including ambiguous grammars.

Our techniques successfully find discrepancies between different syntax specifications of several real-world languages, and are capable of detecting fine-grained incremental modifications performed on grammars. Our evaluation shows that our tool improves significantly on the existing state-of-the-art tools. In addition, we used these algorithms to develop an online tutoring system for grammars that we then used in an undergraduate course on computer language processing. On questions involving grammar constructions, our system was able to automatically evaluate the correctness of 95% of the solutions submitted by students: it disproved 74% of cases and proved 21% of them.

∗ This work is supported in part by the European Research Council (ERC) Project Implicit Programming and Swiss National Science Foundation Grant Constraint Solving Infrastructure for Program Analysis.

S → S + S | S ∗ S | ID

S → ID E
E → + S | ∗ S | ε

Figure 1. Grammars recognizing simple arithmetic expressions. An example proven equivalent by our tool.

Categories and Subject Descriptors D.3.1 [Programming Languages]: Formal Definitions and Theory—Syntax; D.3.4 [Programming Languages]: Processors—Parsing, Compilers, Interpreters; K.3.2 [Computers and Education]: Computer and Information Science Education; D.2.5 [Software Engineering]: Testing and Debugging

General Terms Languages, Theory, Verification

Keywords Context-free grammars, equivalence, counter-examples, proof system, tutoring system

1. Introduction

Context-free grammars are pervasively used in verification and compilation, both for building input parsers and as foundation of algorithms for model checking, program analysis, and testing. They also play an important pedagogical role in introducing fundamentals of formal language theory, and are an integral part of undergraduate computer science education. Despite their importance, and despite decades of theoretical advances, practical tools that can check semantic properties of grammars are still scarce, except for specific tasks such as parsing.

In this paper, we develop practical techniques for checking equivalence of context-free grammars. Our techniques can find counter-examples that disprove equivalence, and can prove that two context-free grammars are equivalent, much like a software model checker. Our approaches are motivated by two applications: (a) comparing real-world grammars, such as those used in production compilers, and (b) automating tutoring and evaluation of context-free grammars. These applications are interesting and challenging for a number of reasons.


S → A ⇒ S | Int
A → Int , A | Int

S → Int G
G → ⇒ Int G | , Int A | ε
A → , Int A | ⇒ Int G

Figure 2. Grammars defining well-formed function signatures over Int. An example proven equivalent by our tool.

(a) S → A ⇒ Int | Int
    A → S , Int | Int

(b) S → A ⇒ S | Int
    A → S , S | Int

Figure 3. Grammars subtly different from the grammars shown in Fig. 2. The grammar on the left does not accept “Int ⇒ Int ⇒ Int”.

Much of the front end of modern compilers and interpreters is automatically or manually derived from grammar-based descriptions of programming languages, more so with integrated language support for domain-specific languages. When two compilers or other language tools are built according to two different reference grammars, knowing how they differ in the programs they support is essential. Our experiments show that two grammars for the same language almost always differ, even if they aim to implement the same standard. For instance, we found using our tool that two high-quality standard Java grammars (namely, the Java grammar1 used by the Antlr v4 parser generator [1], and the Java language specification [2]) disagree on more than 50% of words that are randomly sampled from them.

Even though the differences need not correspond to incompatibilities between the compilers that use these grammars (since their handling could have been intentionally delegated to type checkers and other backend phases), the sheer volume of these discrepancies does raise serious concerns. In fact, most deviations found by our tool are not purposeful. Moreover, in the case of dynamic languages like Javascript, where parsing may happen at run-time, differences between parsers can produce diverging run-time behaviors. Our experiments show that (incorrect) expressions like “++ RegExp - this” discovered by our tool while comparing Javascript grammars result in different behaviors on Firefox and Internet Explorer browsers, when the expressions are wrapped inside functions and executed using the eval construct (see Section 5).

Besides detecting incompatibilities, comparing real-world grammars can help identify portions of the grammars that are overly permissive. For instance, many real-world Java grammars generate programs like “enum ID implements char { ID }” and “private public class ID” (which were discovered while comparing them with other grammars). These imprecisions can be eliminated with little effort without compromising parsing efficiency.

Furthermore, grammars are often rewritten extensively to make them acceptable to parser generators, which is laborious and error prone. Parser generators have become increasingly permissive over the years to mitigate this problem. However, there still remains considerable overhead in this process, and there is a general need for tools that pinpoint subtle changes in the modified versions (documented in works such as [33]). It is almost always impossible to spot differences between large real-world grammars through manual scanning, because the grammars typically appear similar, and even use the same name for many non-terminals. A challenge this paper addresses is developing techniques that scale to real-world grammars, which have hundreds of non-terminals and productions.

1 github.com/antlr/grammars-v4/blob/master/java/Java.g4

An equally compelling application of grammar comparison arises from the importance of context-free grammars in computer science education. Assignments involving context-free grammars are harder to grade and provide feedback on, arguably even more so than programming assignments, because of the greater variation in the possible solutions arising due to the succinctness of the grammars. The complexity is further aggravated when the solutions are required to belong to subclasses like LL(1). For example, Figures 1 and 2 show two pairs of grammars that are equivalent. The grammars shown on the right are LL(1) grammars, and are reference solutions. The grammars shown on the left are intuitive solutions that a student comes up with initially. Proving equivalence of these pairs of grammars is challenging because they do not have any similarity in their structure, but recognize the same language. On the other hand, Figure 3 shows two grammars (written by students) that subtly differ from the grammars of Fig. 2. The smallest counter-example for the grammar shown in Fig. 3(a) is the string “Int ⇒ Int ⇒ Int”. We invite the readers to identify a counter-example that differentiates the grammar of Fig. 3(b) from those of Fig. 2.

In our experience, a practical system that can prove that a student’s solution is correct and provide a counter-example if it is not can greatly aid tutoring of context-free grammars. The state of the art for providing feedback on programming assignments is to use test cases (though there has been recent work on generating repair-based feedback [29]). We bring the same fundamentals to context-free grammar education. Furthermore, we exploit the large, yet under-utilized, theoretical research on decision procedures for equivalence of context-free grammars to develop a practical algorithm that can prove the correctness of solutions provided by the students.

Overview and Contributions. At the core of our system is a fast approach for enumerating words and parse trees of an arbitrary context-free grammar, which supports exhaustive enumeration as well as random sampling of parse trees and words. These features are supported by an efficient polynomial time random access operation that constructs a unique parse tree for any given natural number index. We construct a scalable counter-example detection algorithm by integrating our enumerators with a state-of-the-art parsing technique [25].


We develop and implement an algorithm for proving equivalence by extending decision procedures for subclasses of deterministic context-free grammars to arbitrary (possibly ambiguous) context-free grammars, while preserving soundness. We make the algorithm practical by performing numerous optimizations, and use concrete examples to guide the proof exploration. We are not aware of any existing system that supports both proving as well as disproving of equivalence of context-free grammars. The following are our main contributions:

• We present an enumerator for generating parse trees of arbitrary context-free grammars that supports the following operations: 1) a polynomial time random access operation lookup(i, l) that, given an index i, returns the unique parse tree generating a word of length l corresponding to the index, and 2) sample(n, l) that generates n uniformly random samples from the parse trees of the grammar generating words of length l (Section 2).

• We use the enumerators to discover counter-examples forequivalence of context-free grammars.

• We integrate and extend the algorithms of Korenjak and Hopcroft [15], Olshansky and Pnueli [24], and Harrison et al. [12], for proving equivalence of LL context-free grammars, to arbitrary context-free grammars. Our extensions are sound but incomplete. We show using experiments that the algorithm is effective on many grammars that are outside the classes with known decision procedures.

• We implement and evaluate an online tutoring system for context-free grammars. Our system is able to decide the veracity of 95% of the submissions, detecting counter-examples in 74% of the submissions, and proving correctness of 21% of the submissions.

• We evaluate the counter-example detection algorithm on 10 real-world grammars describing the syntax of 5 mainstream programming languages. The algorithm discovers deep, fine-grained errors, by finding counter-examples with an average length of 35, detecting almost 3 times more errors than a state-of-the-art approach.

2. Enumeration of Parse Trees and Words

A key ingredient of our approach for finding counter-examples is enumeration of words and parse trees belonging to a context-free grammar. Enumeration is also used in optimizing and improving the scope of our grammar equivalence proof engine. We model our enumerators as functions from natural numbers to the objects that are enumerated (which are parse trees or words), as opposed to viewing them as iterators for a sequence of objects as is typical in programming language theory. The enumerators we propose are bijective functions from natural numbers to parse trees in which the image and pre-image of any given value is efficiently computable in polynomial time (formalized in Theorem 1). The functions are partial if the set that is enumerated is finite. Using bijective functions to construct enumerators has many advantages; for example, it immediately provides a way of sampling elements from the given set. It also ensures that there is no duplication during enumeration. Additionally, the algorithm we present here can be configured to enumerate parse trees that generate words having a desired length.

Notations. A context-free grammar is a quadruple (N, Σ, P, S), where N is a set of non-terminals, Σ is a set of terminals, P ⊆ N × (N ∪ Σ)∗ is a finite set of productions, and S ∈ N is the start symbol. Let T denote the set of parse trees belonging to a grammar. We refer to sequences of terminals and non-terminals belonging to (N ∪ Σ)∗ as sentential forms of the grammar. If a sentential form has only terminals, we refer to it as a word, and also sometimes as a string. We adopt the usual convention of using Greek letters α, β to represent sentential forms and upper-case Latin characters to represent non-terminals. We use lower-case Latin characters a, b, c, etc. to represent terminals and w, x, y, etc. to denote words. We introduce more notations as they are needed.
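The quadruple notation above can be mirrored in a small, self-contained Python encoding (the dict-based representation and all names here are our own illustration, not the paper's):

```python
# A sketch of the quadruple (N, Σ, P, S) as plain Python data.
# A grammar is a dict mapping each non-terminal to its list of
# right-hand sides; a right-hand side is a tuple of symbols, and
# any symbol that is not a key of the dict is treated as a terminal.
grammar = {
    "S": [("A",), ("B", "A")],   # S -> A | BA   (the grammar of Fig. 4)
    "A": [("a",), ("a", "S")],   # A -> a | aS
    "B": [("b",)],               # B -> b
}
start = "S"

N = set(grammar)                                   # non-terminals
P = {(lhs, rhs) for lhs, rhss in grammar.items() for rhs in rhss}
Sigma = {s for _, rhs in P for s in rhs} - N       # terminals

print(sorted(Sigma))  # -> ['a', 'b']
```

Later sketches in this section reuse this representation.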

2.1 Constructing Random Access Enumerators

We use Enum[α] : N → T ∗ to denote an enumerator for a sentential form α of the input grammar. The enumerators are partial functions from natural numbers to tuples of parse trees of the grammar, one rooted at every symbol in the sentential form. For brevity, we refer to the tuple as parse trees of sentential forms. We define Enum[α] recursively following the structure of the grammar as explained in the sequel.

For a terminal a belonging to a grammar, Enum[a] is defined as {0 → leaf (a)}. That is, the enumerator for a terminal a maps the first index to a parse tree with a single leaf node containing a and is undefined for every other index. We now describe an enumerator for a non-terminal. Consider for a moment the non-terminal S of the grammar shown in Fig. 4. The parse trees rooted at S are constructed out of the parse trees that belong to the non-terminal A and the sentential form BA. Assume that we have enumerators defined for A and BA, namely Enum[A] and Enum[BA], that are functions from natural numbers to parse trees (a pair of them in the case of BA). Our algorithm constructs an enumerator for S compositionally using the enumerators for A and BA.

Recall that we consider enumerators as bijective functions from natural numbers. So, given an index i we need to define a unique parse tree of S corresponding to i (provided i is within the number of parse trees rooted at S). To associate a parse tree of S to an index i, we first need to identify a right-hand-side α of S and select a parse tree t of the right-hand-side. To determine a parse tree t of the right-hand-side α, it suffices to determine the index of t in the enumerator for α. Hence, we define a function Choose[A] : N → ((N ∪ Σ)∗ × N) for every non-terminal A, that takes an index and returns a right-hand-side of A, and an index for accessing an element


S → A | BA
A → a | aS
B → b

∀t ∈ {a, b}. Enum[t](i) = leaf (t) if i = 0
∀N ∈ {S, A, B}. Enum[N](i) = node(N, Enum[α](j)), where (α, j) = Choose[N](i)
Enum[BA](i) = (Enum[B](j), Enum[A](k)), where (j, k) = π(i, ∞, ∞)
Enum[aS](i) = (Enum[a](j), Enum[S](k)), where (j, k) = π(i, 1, ∞)

Figure 4. An example grammar and illustrations of the Enum functions for the symbols of the grammar. Choose and π are defined in Fig. 6 and Appendix A, respectively.

#t(α) = ∏_{i=0}^{n−1} #t(M_i), where α = M_0 · · · M_{n−1}, n > 1

#t(A) = ∑_{i=0}^{n−1} #t(α_i), where A → α_0 | · · · | α_{n−1}

#t(a) = 1, where a ∈ Σ

Figure 5. Equations for computing the number of parse trees of sentential forms. #t is initialized to ∞ for every non-terminal M.

of the right-hand-side. We define the enumerator for a non-terminal as: Enum[A](i) = node(A, Enum[α](j)), where (α, j) = Choose[A](i). That is, as a node labelled A and having the tuple Enum[α](j) as children.
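As a minimal illustration of this compositional construction, the following sketch enumerates parse trees of a small acyclic grammar. To stay short it substitutes a naive sequential choice and a mixed-radix decoding in place of the Fig. 6 Choose and the pairing function π, so it does not preserve the fairness and termination properties of the real algorithm; grammar, names, and tree encoding are our own assumptions:

```python
# Acyclic demo grammar: S -> A B, A -> a | c, B -> b
g = {"S": [("A", "B")], "A": [("a",), ("c",)], "B": [("b",)]}

def num_trees(sym):
    # #t for an acyclic grammar (no infinities needed here)
    if sym not in g:
        return 1
    return sum(prod_trees(rhs) for rhs in g[sym])

def prod_trees(rhs):
    n = 1
    for s in rhs:
        n *= num_trees(s)
    return n

def enum(sym, i):
    # random access: the i-th parse tree rooted at sym
    if sym not in g:
        assert i == 0
        return ("leaf", sym)
    for rhs in g[sym]:            # naive sequential choose, not Fig. 6
        n = prod_trees(rhs)
        if i < n:
            return ("node", sym, enum_rhs(rhs, i))
        i -= n
    raise IndexError(i)

def enum_rhs(rhs, i):
    # mixed-radix decoding in place of the pairing function pi
    children = []
    for s in rhs:
        n = num_trees(s)
        children.append(enum(s, i % n))
        i //= n
    return tuple(children)

def word(tree):
    return tree[1] if tree[0] == "leaf" else "".join(word(c) for c in tree[2])

print([word(enum("S", i)) for i in range(num_trees("S"))])  # -> ['ab', 'cb']
```

Each index yields a distinct tree, which is the bijection property the section relies on.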

In the simplest case, if A has n right-hand-sides α_0, α_1, · · · , α_{n−1}, the choose function Choose[A](i) could be defined as (α_{i%n}, ⌊i/n⌋). This definition, besides being simple, also ensures a fair usage of the right-hand-sides of A by mapping successive indices to different right-hand-sides, which ensures that any sequence of enumeration of the words belonging to a non-terminal alternates over the right-hand-sides of the non-terminal. However, this definition is well defined only when every right-hand-side of A has an unbounded number of parse trees. For instance, consider the non-terminal A shown in Fig. 4. It has two right-hand-sides, a and aS, of which a has only a single parse tree. Defining Choose[A] as (α_{i%2}, ⌊i/2⌋) is incorrect as, for example, Enum[A](2) maps to Enum[a](1), which is not defined. Therefore, we extend the above function so that it takes into account the number of parse trees belonging to the right-hand-sides, which is denoted using #t(α).

It is fairly straightforward to compute the number of parse trees of non-terminals and right-hand-sides in a grammar. For completeness, we show a formal definition in Fig. 5. We define #t as the greatest fix-point of the equations shown in Fig. 5, which can be computed iteratively starting with an initial value of ∞ for #t. As shown in the equations, the number of (tuples of) parse trees of a sentential form is the product of the number of parse trees of the symbols in the sentential form. The number of parse trees of a non-terminal is the sum of the number of parse trees of its right-hand-sides, and the number of parse trees of a terminal is one. Note that if the grammar has cycles, #t could be infinite in some cases.
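The fix-point computation of Fig. 5 can be sketched directly: start every non-terminal at ∞ and re-evaluate the sum-of-products equations until nothing changes. This is our own minimal rendering (it assumes no non-terminal ends up with zero parse trees, so no 0 · ∞ products arise):

```python
from math import inf

def count_trees(grammar):
    """Greatest fix-point of the Fig. 5 equations (#t for every non-terminal).
    grammar maps non-terminals to lists of right-hand-side tuples;
    any other symbol is a terminal with exactly one parse tree."""
    t = {n: inf for n in grammar}      # initialize non-terminals to infinity
    def of(sym):
        return t.get(sym, 1)           # a terminal contributes 1
    changed = True
    while changed:                     # values only decrease, so this terminates
        changed = False
        for n, rhss in grammar.items():
            v = 0
            for rhs in rhss:           # sum over right-hand-sides ...
                p = 1
                for s in rhs:          # ... of the product over their symbols
                    p *= of(s)
                v += p
            if v != t[n]:
                t[n], changed = v, True
    return t

# A Fig. 8(b)-style length-restricted grammar (unproductive B2 rule removed):
g = {"S3": [("B1", "S2")], "S2": [("B1", "S1")], "S1": [("a",)], "B1": [("b",)]}
print(count_trees(g))  # each non-terminal has exactly one parse tree
```

On a cyclic grammar such as the one in Fig. 4, the value of the recursive non-terminals correctly stays at ∞.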

Choose[A](i) =
  let A → α_0 | · · · | α_{n−1} s.t. ∀1 ≤ m < n. #t(α_{m−1}) ≤ #t(α_m) in
  let b_0 = 0, and b_1, · · · , b_n = #t(α_0), · · · , #t(α_{n−1}) in
  let ∀0 ≤ m ≤ n. i_m = b_m(n − m + 1) + ∑_{i=0}^{m−1} b_i in
  let k be such that 0 ≤ k ≤ n − 1 and i_k ≤ i < i_{k+1} in
  let q = ⌊(i − i_k)/(n − k)⌋ and r = (i − i_k)%(n − k) in
  (α_{k+r}, b_k + q)

Figure 6. Choose function for a non-terminal A.

Fig. 6 defines a Choose function, explained below, that can handle right-hand-sides with a finite number of parse trees. The definition guarantees that whenever Choose returns a pair (α, i), i is less than #t(α), which ensures that Enum[α] is defined for i. In Fig. 6, the right-hand-sides of the non-terminal A: α_0, · · · , α_{n−1}, are sorted in ascending order of the number of parse trees belonging to them. We define b_0 as zero and use b_1, · · · , b_n to denote the numbers of parse trees of the right-hand-sides. (Note that #t(α_i) is given by b_{i+1}.) The index i_m is the smallest index (of Enum[A]) at which the mth right-hand-side α_m becomes undefined, which is determined using the number of parse trees of each right-hand-side as shown. Given an index i, Choose[A](i) first determines the right-hand-sides that need to be skipped, i.e., those whose enumerators are not defined for the index i, by finding a k such that i_k ≤ i < i_{k+1}. It then chooses a right-hand-side (namely α_{k+r}) from the remaining n − k right-hand-sides whose enumerators are defined for the index i, and computes the index to enumerate from the chosen right-hand-side.

Most of the computations performed by Fig. 6 – such as computing the number of parse trees of the right-hand-sides (and hence b_0, · · · , b_n) and the indices i_0, · · · , i_n, and sorting the right-hand-sides of non-terminals by their number of parse trees – need to be performed only once per grammar. Therefore, for each index i, the Choose function may only have to scan through the right-hand-sides to determine the value of k, and perform simple arithmetic operations to compute q and r.


Note that the Choose function degenerates to the simple definition (α_{i%n}, ⌊i/n⌋) presented earlier when #t is unbounded for every right-hand-side of A. The function also preserves fairness by mapping successive indices to different right-hand-sides of the non-terminals. For instance, in the case of the non-terminal A shown in Fig. 4, the Choose function maps index 0 to (a, 0), index 1 to (aS, 0), but index 2 is mapped to (aS, 1) as a has only one parse tree, i.e., #t(a) = 1.
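A direct transcription of Fig. 6 can be checked against this worked example. The function signature and data layout below are our own sketch; it reproduces the mapping 0 → (a, 0), 1 → (aS, 0), 2 → (aS, 1) for A → a | aS:

```python
from math import inf

def choose(rhss_with_counts, i):
    """Fig. 6's Choose for one non-terminal.
    rhss_with_counts: list of (rhs, #t(rhs)) pairs.
    Returns (rhs, j) with j < #t(rhs)."""
    srt = sorted(rhss_with_counts, key=lambda rc: rc[1])  # ascending by #t
    n = len(srt)
    b = [0] + [c for _, c in srt]       # b_0 = 0, b_{m+1} = #t(alpha_m)
    def i_m(m):
        # smallest index at which alpha_m's enumerator runs out
        return b[m] * (n - m + 1) + sum(b[:m])
    k = max(m for m in range(n) if i_m(m) <= i)   # i_k <= i < i_{k+1}
    q, r = divmod(i - i_m(k), n - k)
    return srt[k + r][0], b[k] + q

# A -> a | aS from Fig. 4: #t(a) = 1, #t(aS) = infinity
A = [("a", 1), ("aS", inf)]
print([choose(A, i) for i in range(3)])  # -> [('a', 0), ('aS', 0), ('aS', 1)]
```

After index 2 every subsequent index falls through to aS with strictly increasing sub-indices, matching the bijectivity requirement.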

We now describe the enumerator for a sentential form α with more than one symbol. Let α = M_1 M_2 · · · M_m. The tuples of parse trees belonging to the sentential form are the cartesian product of the parse trees of M_1, · · · , M_m. However, eagerly computing the cartesian product is impossible for most realistic grammars because it is either unbounded or intractably large. Nevertheless, we are interested only in accessing a tuple at a given index i. Hence, it suffices to determine, for every symbol M_j, the parse tree t_j that is used to construct the tuple at index i. The tree t_j can be determined if we know its index in Enum[M_j]. Therefore, it suffices to define a bijective function π : N → N^m that maps a natural number (the index of Enum[α]) to a point in an m-dimensional space of natural numbers. The jth component of π(i) is the index of Enum[M_j] that corresponds to the jth parse tree of the tuple. In other words, Enum[M_1 · · · M_m](i) could be defined as (Enum[M_1](i_1), · · · , Enum[M_m](i_m)), where i_j is the jth component of π(i).

When m is two, the function π reduces to an inverse pairing function that is a bijection from natural numbers to pairs of natural numbers. Our algorithm uses only an inverse pairing function, as we normalize the right-hand-sides of the productions in the grammar to have at most two symbols. We use the well-known Cantor inverse pairing function [26]. But this function assumes that the two-dimensional space is unbounded in both directions, and hence cannot be employed directly when the numbers of parse trees generated by the symbols in the sentential form are bounded. We extend the inverse pairing function to two-dimensional spaces that are bounded in one or both of the directions. The extended functions take three arguments: the index that is to be mapped, and the sizes of the x and y dimensions (or infinity if they are unbounded). We present a formal definition of the functions in Appendix A. Using the extended Cantor inverse pairing function π, we define the enumerator for a sentential form with two symbols as: Enum[M_1 M_2](i) = (Enum[M_1](i_1), Enum[M_2](i_2)), where (i_1, i_2) = π(i, #t(M_1), #t(M_2)).
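Appendix A is not reproduced in this excerpt, so the bounded cases below are our own plausible sketch rather than the paper's exact definition: the unbounded/unbounded case is the classic Cantor inverse pairing (diagonal order), and when a dimension is bounded we fall back to divmod decoding, which keeps bijectivity but enumerates in a different (less "fair") order than a true bounded diagonal walk would:

```python
from math import inf, isqrt

def unpair(i, sx=inf, sy=inf):
    """Bijection N -> [0, sx) x [0, sy); either bound may be inf.
    Unbounded/unbounded: Cantor's inverse pairing."""
    if sx is inf and sy is inf:
        w = (isqrt(8 * i + 1) - 1) // 2    # index of the diagonal containing i
        y = i - w * (w + 1) // 2
        return w - y, y
    if sx is not inf:                      # x bounded (covers both-bounded too)
        return i % sx, i // sx
    return i // sy, i % sy                 # y bounded

# first diagonals of the Cantor order
print([unpair(i) for i in range(6)])  # -> [(0,0), (1,0), (0,1), (2,0), (1,1), (0,2)]
```

Note that for i > 1 both components are strictly smaller than i, which is the property the termination argument below relies on.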

Termination of Random Access. Later, in Section 2.3, we present a bound on the running time of the algorithm, but now we briefly discuss termination. If A is a recursive non-terminal, e.g., if it has a production of the form A → αAβ, the enumerator for A may recursively invoke itself, either directly or through other enumerators. However, for every index other than 0 and 1, the recursive invocations will always be passed a strictly smaller index. This follows from the definition of the Choose and the inverse pairing functions used by our algorithm. (In the case of the inverse pairing function, if π(i) = (j, k), then j and k are strictly smaller than i for all i > 1.) For indices 0 and 1 the recursive invocations may happen with the same index. However, this will not result in non-termination if the following properties are ensured: (a) for every non-terminal, the right-hand-side chosen for index 0 is the first production in the shortest derivation starting from the non-terminal and ending at a word; (b) there are no unproductive non-terminals (which are non-terminals that do not generate any word) in the input grammar.

Prgm → import QName ; ClassDef
QName → ID | ID . QName
ClassDef → class { Body }

Figure 7. A grammar snippet illustrating the need to bound the length of the generated words during enumeration.

From Parse Trees to Words. We obtain enumerators for words using the enumerators for parse trees by mapping the enumerated parse trees to words. However, when the input grammar is ambiguous, the resulting enumerators are no longer bijective mappings from indices to words. The number of indices that map to a word is equal to the number of parse trees of the word.

2.2 Enumerating Fixed Length Words

The enumeration algorithm we have described so far is agnostic to the lengths of the enumerated words. As a consequence, the algorithm may generate undesirably long words, and in fact may also favour the enumeration of long words over shorter ones. Fig. 7 shows a snippet from the Java grammar that results in this behavior.

In Fig. 7, the productions of the non-terminal Body are not shown for brevity. It generates all syntactically correct bodies allowed for a class in a Java program. Consider the enumeration of the words (or parse trees) belonging to the non-terminal Prgm starting from index 1. A fair enumeration strategy, such as ours, will try to generate almost equal numbers of words from the non-terminals QName and ClassDef. However, the lengths of the words generated for the same index differ significantly between the non-terminals. For instance, the word generated from the non-terminal QName at an index i has length i + 1. On the other hand, the lengths of the words generated from the non-terminal ClassDef grow slowly relative to their indices, since it has many right-hand-sides, and each right-hand-side is in turn composed of non-terminals having many alternatives. In essence, the words generated for the non-terminal Prgm will have long import declarations followed by very short class definitions.

Moreover, this also results in reduced coverage of rules, since the enumeration heavily reuses productions of QName, but fails to explore many alternatives reachable through


S → a | BS
B → b

(a)

S3 → B1S2 | B2S1

S2 → B1S1

S1 → a
B1 → b

(b)

Figure 8. (a) An example grammar. (b) The result of restricting the grammar shown in (a) to words of size 3.

[[N]]_l = [[N → α_1]]_l ∪ · · · ∪ [[N → α_n]]_l, where N → α_1 | · · · | α_n

[[N → a]]_l = {N_l → a} if l = 1, and ∅ otherwise

[[N → AB]]_l = ⋃_{i=1}^{l−1} ({N_l → A_i B_{l−i}} ∪ [[A]]_i ∪ [[B]]_{l−i})

Figure 9. Transforming non-terminals and productions of a grammar to a new set of non-terminals and productions that generate only words of length l.

ClassDef. We address this issue by extending the enumeration algorithm so that it generates only parse trees of words having a specified length. We accomplish this by transforming the input grammar in such a way that it produces only strings that are of the required length, and use the transformed grammar in enumeration. The idea behind the transformation is quite standard; e.g., works such as [14] and [19] that develop theoretical algorithms for random sampling of unambiguous grammars also resort to a similar approach. However, what is unique to our algorithm is using the transformation to construct bijective enumerators while guaranteeing the random access property for all words of the specified length.

Fig. 8 illustrates this transformation on an example, which is explained in detail below. For explanatory purposes, assume that the input grammar is in Chomsky Normal Form (CNF) [16], which ensures that every right-hand-side of the grammar is either a terminal or has two non-terminals.

Fig. 9 formally defines the transformation. For every non-terminal N of the input grammar, the transformation creates a non-terminal N_l that generates only those words of N that have a length l. The productions of N_l are obtained by transforming the productions of N. For every production of the form N → a, where a is a terminal, the transformation creates a production N_l → a if l = 1. For every production of the form N → AB that has two non-terminals on the right-hand-side, the transformation considers every possible way in which a word of size l can be split between the two non-terminals, and creates a production of the form N_l → A_i B_{l−i} for each possible split (i, l − i). Additionally, the transformation recursively produces rules for the non-terminals A_i and B_{l−i}. The transformed grammar may have

unproductive non-terminals and rules that do not generate any word (like the non-terminal B2 and the rule S3 → B2S1 of Fig. 8(b)), and hence may have to be simplified. Observe that the transformed grammar is acyclic and generates only a finite number of parse trees.
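The Fig. 9 transformation can be sketched for a CNF grammar as follows; the code and its key layout (pairs (N, l) as restricted non-terminals) are our own illustration, checked on the Fig. 8 grammar with l = 3:

```python
def restrict(grammar, start, l):
    """Fig. 9: the non-terminal (N, l) generates exactly N's words of length l.
    grammar: non-terminal -> list of rhs tuples, each either a terminal (a,)
    or a non-terminal pair (A, B) (CNF)."""
    rules = {}
    def build(N, l):
        key = (N, l)
        if key in rules:
            return
        rules[key] = []
        for rhs in grammar[N]:
            if len(rhs) == 1:                      # N -> a, kept only if l = 1
                if l == 1:
                    rules[key].append(rhs)
            else:                                  # N -> A B: split l = i + (l - i)
                A, B = rhs
                for i in range(1, l):
                    rules[key].append(((A, i), (B, l - i)))
                    build(A, i)
                    build(B, l - i)
    build(start, l)
    return rules

def words(rules, key):
    """All words of a restricted non-terminal; unproductive keys yield none."""
    out = set()
    for rhs in rules[key]:
        if len(rhs) == 1 and isinstance(rhs[0], str):
            out.add(rhs[0])
        else:
            left, right = rhs
            out |= {u + v for u in words(rules, left) for v in words(rules, right)}
    return out

g = {"S": [("a",), ("B", "S")], "B": [("b",)]}     # Fig. 8(a): S -> a | BS, B -> b
print(words(restrict(g, "S", 3), ("S", 3)))        # -> {'bba'}
```

The produced rules mirror Fig. 8(b): (S, 3) gets the splits ((B,1),(S,2)) and ((B,2),(S,1)), and (B, 2) is unproductive, just as the text describes.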

This transformation increases the sizes of the right-hand-sides by a factor of l in the worst case. Therefore, for efficiency reasons, we construct the productions of the transformed grammar on demand, when they are required during the enumeration of a parse tree.
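To make the construction concrete, the following Python sketch builds the productions of the length-indexed non-terminals Nl on demand and computes #t. The grammar encoding, names, and example grammar (a hypothetical CNF grammar for {aⁿbⁿ : n ≥ 1}) are ours for illustration; the paper's tool is written in Scala.

```python
from functools import lru_cache

# Hypothetical CNF grammar for {a^n b^n : n >= 1} (our encoding, not the paper's):
# S -> A X | A B,  X -> S B,  A -> a,  B -> b
GRAMMAR = {
    "S": [("A", "X"), ("A", "B")],
    "X": [("S", "B")],
    "A": [("a",)],
    "B": [("b",)],
}
TERMINALS = {"a", "b"}

@lru_cache(maxsize=None)
def count(sym, l):
    """#t(N_l): number of parse trees of words of length l derivable from sym."""
    if sym in TERMINALS:
        return 1 if l == 1 else 0
    total = 0
    for rhs in GRAMMAR[sym]:
        if len(rhs) == 1 and rhs[0] in TERMINALS:   # N -> a contributes only at l = 1
            total += 1 if l == 1 else 0
        else:                                        # N -> A B: sum over splits
            a, b = rhs
            total += sum(count(a, i) * count(b, l - i) for i in range(1, l))
    return total

@lru_cache(maxsize=None)
def rules(nt, l):
    """Productions of the length-indexed non-terminal N_l, built on demand.
    Each production is a list of (symbol, length) pairs; unproductive splits
    (those that generate no word) are dropped, i.e. the grammar is simplified."""
    out = []
    for rhs in GRAMMAR[nt]:
        if len(rhs) == 1 and rhs[0] in TERMINALS:
            if l == 1:
                out.append([(rhs[0], 1)])
        else:
            a, b = rhs
            for i in range(1, l):
                if count(a, i) > 0 and count(b, l - i) > 0:
                    out.append([(a, i), (b, l - i)])
    return out
```

As expected, only the even lengths are productive for this language; e.g. `rules("S", 4)` yields the single split `[("A", 1), ("X", 3)]`.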

Sampling Parse Trees and Words. Having constructed enumerators with the above characteristics, it is straightforward to sample parse trees and words of a non-terminal N having a given length l. We uniformly randomly sample numbers in the interval [0, #t(N) − 1], and look up the parse tree or word at the sampled index using the enumerators. Since we have a bijection from numbers in the range [0, #t(N) − 1] to parse trees of N, this approach guarantees a uniform random sampling of parse trees. However, sampling of words is guaranteed to be uniform only if the grammar is unambiguous. In general, the probability of choosing a word w of length l in a sample of size s is equal to (t · s)/#t(N), where t is the number of parse trees of the word w.
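Unranking makes the sampling step concrete: a uniformly random index in [0, #t(N) − 1] is decoded into a parse tree by scanning the productions, Choose-style, and subtracting the count contributed by each alternative. The Python sketch below uses our own hypothetical CNF encoding of {aⁿbⁿ}, not the paper's Scala code.

```python
import random
from functools import lru_cache

# Hypothetical CNF grammar for {a^n b^n : n >= 1} (an assumed encoding):
GRAMMAR = {"S": [("A", "X"), ("A", "B")], "X": [("S", "B")],
           "A": [("a",)], "B": [("b",)]}
TERMINALS = {"a", "b"}

@lru_cache(maxsize=None)
def count(nt, l):
    """#t(N_l): number of parse trees of words of length l rooted at nt."""
    total = 0
    for rhs in GRAMMAR[nt]:
        if len(rhs) == 1:
            total += 1 if (l == 1 and rhs[0] in TERMINALS) else 0
        else:
            a, b = rhs
            total += sum(count(a, i) * count(b, l - i) for i in range(1, l))
    return total

def unrank(nt, l, idx):
    """Bijectively map idx in [0, count(nt, l)) to a parse tree."""
    for rhs in GRAMMAR[nt]:
        if len(rhs) == 1:
            n = 1 if (l == 1 and rhs[0] in TERMINALS) else 0
            if idx < n:
                return (nt, rhs[0])
            idx -= n
        else:
            a, b = rhs
            for i in range(1, l):                      # one slot per split (i, l - i)
                n = count(a, i) * count(b, l - i)
                if idx < n:
                    q, r = divmod(idx, count(b, l - i))  # pair index for subtrees
                    return (nt, unrank(a, i, q), unrank(b, l - i, r))
                idx -= n
    raise IndexError("index out of range")

def word(tree):
    """Yield (the generated word) of a parse tree."""
    return tree[1] if isinstance(tree[1], str) else "".join(word(t) for t in tree[1:])

def sample_word(nt, l, rng=random):
    """Uniform over parse trees; uniform over words iff the grammar is unambiguous."""
    return word(unrank(nt, l, rng.randrange(count(nt, l))))
```

Because this example grammar is unambiguous, every word of a given length has exactly one parse tree, so word sampling is uniform here as well.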

2.3 Running Time of Random Access

The number of recursive invocations performed by Enum for generating a word of length l is equal to the number of nodes and edges in the parse tree of the word, which is O(l) for a grammar in CNF [16]. The time spent between two successive invocations of Enum is dominated by the Choose operation, whose running time is bounded by O(r · l · |i|²), since it performs a linear scan over the right-hand-sides of a non-terminal, performing basic arithmetic operations on the given index i. Therefore, we have the following theorem.

Theorem 1. Let G be a grammar in Chomsky Normal Form. Let r denote the largest number of right-hand-sides of a non-terminal. The time taken to access a parse tree generating a word of length l at an index i of a non-terminal belonging to the grammar is upper bounded by O(r · |i|² · l²), provided the number of parse trees generated by each non-terminal and right-hand-side is precomputed.

Notice that the time taken for random access is polynomial in the size of the input grammar, the number of bits in the index i (which is O(log i)), and the size of the generated word l. We now briefly discuss the complexity of computing the number of parse trees (#t). For a grammar in CNF, the number of parse trees that generate a word of length l is O(r^(2l)) in the worst case. (For unambiguous grammars, it is O(c^l), where c is the number of terminals.) Thus, computing the number of parse trees could, in principle, be expensive. However, in practice, the number of parse trees, in spite of being large, is efficiently computable.

Page 7: Automating Grammar Comparison

Prior theoretical works on uniform random sampling for context-free grammars (such as [19]) assume that the input grammar is a constant, and that arithmetic operations take constant time. (Our approach matches the best known running time O(l log l) under these assumptions.) However, this assumption is quite restrictive in the real world. For example, the Java 7 grammar has 430 non-terminals and 2447 rules when normalized to CNF, and the number of parse trees increases rapidly with the length of the generated word. In fact, for length 50, it is an 84-digit number (in base 10). Using numbers as big as these in computation introduces significant overhead which cannot be discounted. Our enumerators offer considerable flexibility in sampling by supporting random access. For example, we can sample only from a restricted range instead of using the entire space of parse trees. Since we ensure a fair usage of rules while mapping rules to indices, restricting the sizes of indices still provides good coverage of rules. In fact, our implementation exposes a parameter for limiting the range of the sample space, which we found useful in practice.

3. Counter-Example Detection

We apply the enumerators described in the previous sections to find counter-examples for the equivalence of two context-free grammars. We sample words (of length within a predefined range) from one grammar and check if they are accepted by the other, and vice versa. Bounding the length of words greatly reduces the parsing overhead, especially when using generic parsers.
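The loop can be sketched as follows. For illustration we brute-force all words over {a, b} up to a small length and test membership with a simple leftmost-derivation recognizer for GNF grammars; the tool instead samples words via the random-access enumerators and uses the parsers described below. The two grammars are hypothetical GNF grammars for aⁿbⁿ, plus a mutated copy of the second with the production R → b removed.

```python
from itertools import product

# Two GNF grammars for {a^n b^n : n >= 1} (tuples of symbols; lowercase = terminal).
G1 = {"S": [("a", "T")], "T": [("a", "T", "b"), ("b",)]}
G2 = {"P": [("a", "R")], "R": [("a", "b", "b"), ("a", "R", "b"), ("b",)]}
TERMS = {"a", "b"}

def accepts(g, start, w):
    """Membership test via leftmost derivations; terminates because the grammar
    is in GNF, so each derivation step consumes one character of w."""
    def go(w, alpha):
        if not w:
            return not alpha
        if not alpha or len(alpha) > len(w):   # no epsilon rules: safe to prune
            return False
        head, rest = alpha[0], alpha[1:]
        if head in TERMS:
            return head == w[0] and go(w[1:], rest)
        return any(rhs[0] == w[0] and go(w[1:], rhs[1:] + rest) for rhs in g[head])
    return go(w, (start,))

def counter_examples(g1, s1, g2, s2, max_len=6):
    """Words accepted by exactly one of the two grammars (brute force sketch)."""
    return ["".join(cs) for n in range(1, max_len + 1)
            for cs in product("ab", repeat=n)
            if accepts(g1, s1, "".join(cs)) != accepts(g2, s2, "".join(cs))]
```

On the equivalent pair the search finds nothing; dropping R → b from G2 loses exactly the word "ab", which the search reports.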

In section 5, we present detailed results on the efficiency and accuracy of counter-example detection. The results show that the implementation is able to enumerate and parse millions of words within a few minutes on large real-world grammars. In the sequel, we present an overview of the parsers used by the tool.

Parsing. We use a suite of parsers consisting of a CYK parser [16], the Antlr v4 parser [1] (in compiler and interpreter modes), and an LL(1) parser [3], and employ them selectively depending on the context. For instance, for testing large programming language grammars for equivalence, we compile the grammars to parsers (at runtime) using Antlr v4, which uses the adaptive LL(*) parsing algorithm [25], and use the parsers to check if the generated words are accepted by the grammar. The CYK parsing algorithm we implement in our tool is a top-down, non-recursive algorithm that memoizes the parsing information computed for the substrings of the word being parsed (using a trie data structure), and reuses the information on encountering the same substring again, during a parse of the same or another word. Though the top-down evaluation introduces some overhead compared to the conventional dynamic programming approach, it improves the performance of the CYK parser by orders of magnitude when used in batch mode to parse a collection of words using the same grammar.
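The memoized top-down idea can be sketched in a few lines of Python (an illustration under our own CNF encoding, not the tool's Scala implementation). The memo table, keyed by (non-terminal, substring), plays the role of the trie: it persists across words, which is what speeds up batch parsing.

```python
from functools import lru_cache

# Hypothetical CNF grammar for {a^n b^n : n >= 1}:
GRAMMAR = {"S": [("A", "X"), ("A", "B")], "X": [("S", "B")],
           "A": [("a",)], "B": [("b",)]}

def make_parser(grammar):
    """Top-down CYK recognizer: memoizes which non-terminals derive each
    substring; the cache is shared across all words parsed with this parser."""
    @lru_cache(maxsize=None)
    def derives(nt, s):
        for rhs in grammar[nt]:
            if len(rhs) == 1:                 # N -> a
                if s == rhs[0]:
                    return True
            else:                             # N -> A B: try every split point
                a, b = rhs
                if any(derives(a, s[:i]) and derives(b, s[i:])
                       for i in range(1, len(s))):
                    return True
        return False
    return derives

parse = make_parser(GRAMMAR)
```

A batch of words then reuses sub-results: once `derives("S", "ab")` is cached while parsing "aabb", parsing "aaabbb" hits the cache for that substring.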

(a)  S → aT
     T → aTb | b

(b)  P → aR
     R → abb | aRb | b

Figure 10. GNF grammars for the language a^n b^n.

We mostly rely on the optimized CYK parser for checking the correctness of students' solutions. We find that quite often the solutions provided by students are convoluted and tricky to parse using specialized parsers. For instance, for a grammar with productions S → a | B and B → aaBb | aB | ε, the performance of the Antlr v4 parser degrades severely with the length of the word being parsed.

4. Proving Equivalence

Our approach for proving equivalence is based on the algorithms proposed by Korenjak and Hopcroft [15], and extended by Olshansky and Pnueli [24] and Harrison et al. [12]. This family of algorithms is attractive because it works directly on context-free grammars, without requiring conversion to other representations such as pushdown automata. Moreover, these algorithms come with strong completeness guarantees. Korenjak and Hopcroft [15] introduce a decision procedure for checking equivalence of simple deterministic grammars, which are LL(1) grammars in Greibach Normal Form (GNF). Olshansky and Pnueli [24] extend this algorithm to LL(k) grammars in GNF, while Harrison et al. [12] extend the algorithm in another direction, namely to decide equivalence of deterministic GNF grammars of which one is LL(1). (A subtle point to note is that an LL(k) grammar with epsilon productions may become LL(k+1) when expressed in GNF [28].)

Our approach extends the work of Olshansky and Pnueli [24] by incorporating several aspects of the algorithm of Harrison et al. [12]. The resulting algorithm is applicable to arbitrary context-free grammars, but at the same time is complete for LL grammars (our implementation is complete only for LL(2) grammars, since we limit the lookahead for efficiency reasons). Furthermore, we perform several extensions to the algorithm that improve its precision, as well as its performance in practice. In particular, we extend the approach to handle inclusion relations, which provides an alternative way of establishing equivalence when the equivalence query is not directly provable. We also introduce transformations that use concrete examples to dynamically refine the queries during the course of the algorithm. Our experiments show that the algorithm succeeds in 82% of the cases that passed all test cases, even on queries involving ambiguous grammars (see section 5).

Algorithm. We use the grammars shown in Fig. 10 for the language a^n b^n as a running example. Observe that the grammar shown on the right is ambiguous: it has two parse trees for aabb. We formalize the verification algorithm as a proof system that uses the inference rules shown in Fig. 11. We later discuss an extension to the algorithm that augments the rules with a fixed lookahead distance k. Fig. 12 illustrates the algorithm on our running example.

In the sequel, we make the following assumptions: (a) the input grammars have the same set of terminals, (b) the grammars do not have any epsilon productions, and (c) the non-terminals belonging to the grammars are unique. In our implementation, if an input grammar has epsilon productions, we remove them using the standard transformations [16]. However, in the case of LL(1) grammars, we use the specialized but expensive algorithm introduced by Rosenkrantz and Stearns [28] that preserves the LL property. This ensures that the algorithm is complete for arbitrary LL(1) grammars, including those not in GNF.

Derivatives. We express our algorithm using the notion of a derivative of a sentential form, which is defined as follows. A derivative d : Σ* × (N ∪ Σ)* → 2^((N∪Σ)*) is a function that, given a word w and a sentential form α, computes the sentential forms β that remain immediately after deriving w from α. We formally define d using the notion of a leftmost derivation. We say that α derives β in one step, denoted α ⇒ β, iff β is obtained by replacing the leftmost non-terminal in α by one of its right-hand-sides. Let ⇒* denote the reflexive transitive closure of ⇒. For any non-empty string w and sentential form α,

d(w, α) = {β}   if α = wβ

d(w, α) = {β | ∃x ∈ Σ*, A ∈ N, γ ∈ (N ∪ Σ)*. α ⇒* xAγ ⇒ wβ ∧ |x| < |w|}   otherwise

For example, if A → a and B → bB | b, then d(aab, AaB) = {ε, B}, and d(b, AaB) = ∅. Though derivatives are defined for any grammar, they are more efficiently computable when a grammar is normalized to GNF. For a grammar in GNF, every production starts with a terminal. Hence, a word w that is derivable from a sentential form α is derivable in at most |w| steps.
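For GNF grammars, d can be computed by stepping through w one character at a time, keeping the set of remainders. The Python sketch below (our own encoding, not the paper's code) reproduces the example above.

```python
# GNF grammar fragment from the running text: A -> a, B -> bB | b.
G = {"A": [("a",)], "B": [("b", "B"), ("b",)]}
TERMS = {"a", "b"}

def derivative(g, w, alpha):
    """d(w, alpha) for a GNF grammar: the sentential forms remaining after all
    leftmost derivations of w from alpha. Sentential forms are symbol tuples."""
    forms = {tuple(alpha)}
    for ch in w:                         # GNF: each step consumes one character
        nxt = set()
        for f in forms:
            if not f:                    # exhausted form cannot derive more of w
                continue
            head, rest = f[0], f[1:]
            if head in TERMS:            # leading terminal must match directly
                if head == ch:
                    nxt.add(rest)
            else:                        # expand the leftmost non-terminal
                for rhs in g[head]:
                    if rhs[0] == ch:
                        nxt.add(rhs[1:] + rest)
        forms = nxt
    return forms
```

Running it on the text's example: `derivative(G, "aab", ("A", "a", "B"))` yields {ε, B} (encoded as the empty tuple and `("B",)`), and `derivative(G, "b", ("A", "a", "B"))` yields the empty set.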

We lift the derivative operation to a set ᾱ of sentential forms as: d̂(w, ᾱ) = ⋃_{α ∈ ᾱ} d(w, α).

Inference Rules. We consider two types of relations between sets of sentential forms: equivalence (≡) and inclusion (⊆). A relation ᾱ ≡ β̄ (or ᾱ ⊆ β̄) holds if the set of words generated by the sentential forms in ᾱ, i.e., ⋃_{α ∈ ᾱ} L(α), is equal to (or included in) the set of words generated by the sentential forms in β̄, i.e., ⋃_{β ∈ β̄} L(β). Though we are only interested in proving equivalence of sentential forms, our algorithm sometimes uses inclusion relations in the intermediate steps to establish equivalence. As a consequence, the approach can also be used to prove inclusion of grammars, but the rules do not guarantee completeness in that case. The rules shown in Fig. 11 use judgements of the form C ⊢ ᾱ ⊆ β̄, where C is a set of relations which can be assumed to hold when deciding the truth of ᾱ ⊆ β̄. Every inference rule shown in Fig. 11 provides a set of judgements, given by the antecedents, which, when established, guarantees that the consequent holds. In other words, the antecedents provide a sufficient condition for the consequent. (Sometimes they are also necessary conditions.)

Consider the illustration shown in Fig. 12. Our goal is to establish that the start symbols of the two grammars are equivalent under an empty context, i.e., ∅ ⊢ [S ≡ P]. We prove this by finding a derivation for the judgement using the inference rules. In Fig. 12, the relations that are added to the context are marked with †. At any step in the derivation, we can assume that every relation marked in the preceding steps holds.

BRANCH Rule. Initially, we apply the BRANCH rule to [S ≡ P]. The rule asserts that a relation α op β holds in a context C if, for every alphabet symbol a, the derivatives of α and β with respect to a are related under the same operation. The correctness of this part is obvious: if two sentential forms are equivalent, the sentential forms that remain after deriving the first character ought to be equivalent. Additionally, the BRANCH rule allows the relation α op β to be considered valid when proving the antecedents. This is because the BRANCH rule also incorporates inductive reasoning. To prove α op β, the rule hypothesises that the relation holds for all words of length smaller than k, and attempts to establish that the relation holds for words of length k. It suffices for the antecedents to hold for all words of length less than k, since we peel off the first character from the words generated by α and β by computing the derivative. Therefore, if the relation α op β is encountered again during the proof of the antecedents, we know that it needs to hold only for words of length less than k, which holds by hypothesis.

An equivalent contrapositive argument is that if the relation α op β has a counter-example, then the antecedents will have a strictly smaller counter-example. However, when α op β is encountered during the proof of the antecedents, it need not be explored any further, because it would not lead to the smallest counter-example. Harrison et al. [12] refer to this property, wherein the counter-examples of the newly created relations (antecedents) are strictly smaller than the counter-examples of the input relation (consequent) when they exist, as monotonicity. In our system, the only other rule that is monotonic is the SPLIT rule.

Applying the BRANCH rule to [S ≡ P] produces the relation [T ≡ R] for the terminal a, since T and R are the derivatives of S and P w.r.t. a, and produces the empty relation [∅ ≡ ∅] for the terminal b. The empty relation trivially holds, as asserted by the EMPTY rule, and hence is not shown.

Equivalence to Inclusion. The INCLUSION rule reduces equivalence relations to pairs of inclusion relations (e.g., see relation 3 in Fig. 12). The DIST rule simplifies the inclusion relations by distributing the inclusion operation over the left-hand-sides, as illustrated on relation 7. These rules ensure that every relation generated during the algorithm is normalized to the form {α} ≡ {β} or {α} ⊆ β̄.

The TESTCASES rule applies to a relation of the form {α} ⊆ β̄. It samples a predefined set of words from α and


BRANCH:
  ∀a ∈ Σ. C ∪ {ᾱ op β̄} ⊢ d̂(a, ᾱ) op d̂(a, β̄)
  --------------------------------------------
  C ⊢ ᾱ op β̄

INCLUSION:
  C ⊢ ᾱ ⊆ β̄    C ⊢ β̄ ⊆ ᾱ
  --------------------------
  C ⊢ ᾱ ≡ β̄

DIST:
  ∀1 ≤ i ≤ m. C ⊢ {αi} ⊆ β̄
  ---------------------------
  C ⊢ ⋃_{i=1}^{m} αi ⊆ β̄

INDUCT:
  rel ∈ C    rel ⇒ ᾱ op β̄
  --------------------------
  C ⊢ ᾱ op β̄

SPLIT:
  x ∈ ||A||    |γ| > 1    Ψ = ⋃_{i=1}^{m} ψiβi    d̂(x, Ψ) = ⋃_{i=1}^{m} ρiβi    ∀0 ≤ i ≤ m. |βi| > 0
  C′ = C ∪ {{Aγ} op Ψ}    C′ ⊢ {γ} op d̂(x, Ψ)    ∀0 ≤ i ≤ m. C′ ⊢ {Aρi} op {ψi}
  ---------------------------------------------------------------------------------
  C ⊢ {Aγ} op Ψ

TESTCASES:
  S ⊂ β̄    sample(n, α) ⊆ ⋃_{β ∈ S} L(β)    C ⊢ {α} ⊆ S
  --------------------------------------------------------
  C ⊢ {α} ⊆ β̄

EMPTY1:  ⊢ ∅ ≡ ∅        EMPTY2:  ⊢ ∅ ⊆ β̄

EPSILON:
  C ⊢ ᾱ op β̄
  --------------------------------
  C ⊢ (ᾱ ∪ {ε}) op (β̄ ∪ {ε})

Figure 11. Basic inference rules of the verification algorithm. In the figure, op ∈ {≡, ⊆}, ||A|| is the set of shortest words derivable from A, and rel1 ⇒ rel2 is a syntactic implication check that holds if rel1 is stronger than rel2.

1[S ≡ P]†  --BRANCH-->  2[T ≡ R]†  --BRANCH-->  3[Tb ≡ Rb ∪ bb] ∧ 4[b ≡ b]
4[b ≡ b]  --BRANCH-->  5[ε ≡ ε]  --EPSILON-->  [∅ ≡ ∅]  --EMPTY-->  proved
3[Tb ≡ Rb ∪ bb]  --INCLUSION-->  6[Tb ⊆ Rb ∪ bb] ∧ 7[Rb ∪ bb ⊆ Tb]
6[Tb ⊆ Rb ∪ bb]  --TESTCASES-->  8[Tb ⊆ Rb]†  --SPLIT-->  9[b ⊆ b] ∧ 10[T ⊆ R]  --INDUCT-->  11[b ⊆ b]  --*-->  proved
7[Rb ∪ bb ⊆ Tb]  --DIST-->  12[Rb ⊆ Tb] ∧ 13[bb ⊆ Tb]
13[bb ⊆ Tb]†  --BRANCH-->  14[b ⊆ b] ∧ 15[∅ ⊆ Tbb]  --*-->  proved
12[Rb ⊆ Tb]†  --SPLIT-->  16[b ⊆ b] ∧ 17[R ⊆ T]  --INDUCT-->  18[b ⊆ b]  --*-->  proved

Figure 12. Illustration of the application of the rules on the running example. A star (*) denotes application of one or more rules. Curly braces around singleton sets are omitted.

searches for a strict subset of β̄ that accepts all the samples. On finding such a subset S, it constructs a stronger relation {α} ⊆ S that implies the input relation. For instance, the rule reduces relation 6: [Tb ⊆ Rb ∪ bb] to [Tb ⊆ Rb] using a set of examples. This rule uses an enumerator to sample words from sentential forms, and a parser to check if the sampled words are accepted by the sentential forms. In our implementation, we use a CYK parser, extended for parsing sentential forms, for this check.

The TESTCASES rule, despite being incomplete, is useful in practice. It makes the tool converge faster, and also helps in proving more queries by making other rules applicable. For instance, removing smaller sentential forms from a union may make the SPLIT rule (described shortly) applicable. (Fig. 18 of section 5 sheds light on the usefulness of the TESTCASES rule in practice.)

INDUCT Rule. The INDUCT rule asserts that all relations implied by the context hold. The implication check uses only syntactic equality of the sentential forms. In particular, for equality relations α ≡ β, we check if the context contains the same relation or β ≡ α. For inclusion relations of the form α ⊆ β̄, we check if the context contains an equivalence relation between α and β̄, or an inclusion relation of the form α ⊆ S, where S has fewer sentential forms than β̄. For instance, in the derivation shown in Fig. 12, the relations 10 and 17 are implied by relation 2: [T ≡ R], added to the context in step 2 during the application of the BRANCH rule, and hence are considered valid.

SPLIT Rule. The main purpose of the SPLIT rule is to prevent the sentential forms in the relations from becoming excessively long. The key idea behind the rule is to split the sentential forms that are compared (say [Aγ ≡ βδ]) into smaller chunks that are piece-wise equivalent, e.g., as [Aρ ≡ β] and [γ ≡ ρδ] (where ρ is a sentential form derived from β), while preserving completeness under some restrictions. It identifies the point at which to split by deriving a shortest word of A from the other side of the relation.

We apply this rule only to a relation whose left-hand-side is a singleton (since all relations will be reduced to this form). Let r1 be the relation {Aγ} op Ψ (with non-empty γ). Let x be one of the shortest words derivable from A, denoted


||A||. The SPLIT rule requires that every sentential form in Ψ can be split into ψiβi such that ψi can derive x, and βi is non-empty. (However, this requirement can be relaxed, as described shortly.) This implies that the derivative of Ψ w.r.t. the word x will preserve the suffix βi. That is, d̂(x, Ψ) will be of the form ⋃_i ρiβi, where ρi is a union of sentential forms corresponding to the derivative of ψi.

Under the above conditions, the rule asserts that if γ op d̂(x, Ψ) and, for all i, Aρi op ψi hold, then so does r1. Furthermore, the rule allows assuming r1 while proving the antecedents. The requirement that all βi are non-empty ensures the monotonicity of the rule. (Note that βi cannot derive the empty string, as there are no epsilon productions.) If this requirement does not hold, the rule is still applicable, but we cannot add r1 to the context.

The soundness of this assertion is relatively easy to establish. For all i, Aρi op ψi implies Aρiβi op ψiβi (since we are concatenating the left- and right-hand-sides with the same sentential form). This entails that ⋃_i Aρiβi op ⋃_i ψiβi. We are also given that γ and ⋃_i ρiβi (which is d̂(x, Ψ)) are related by op, where op ∈ {≡, ⊆}. Substituting γ for ⋃_i ρiβi yields Aγ op ⋃_i ψiβi, which is the relation r1. Hence, the antecedents imply the consequent. However, the converse does not necessarily hold. It holds (for equivalence) when the grammars satisfy the suffix property, αβ ≡ γβ ⇒ α ≡ γ, and are strict deterministic [12] (which includes LL(1) grammars).

In the illustration shown in Fig. 12, the SPLIT rule is applied to relations 8 and 12. Consider relation 8: [Tb ⊆ Rb]. The shortest word derivable from T is b (see Fig. 10). Since d(b, Rb) = {b}, we can deduce that ψ1 is R, β1 is b, and ρ1 = {ε} (the sentential forms that remain after deriving b from R). The new relations created by the SPLIT rule are γ op d(b, Rb) and Aρ1 op ψ1, which correspond to [b ⊆ b] and [T ⊆ R]. Note that without the application of the SPLIT rule, the relation [Tb ⊆ Rb] would gradually grow with the application of the BRANCH rule and lead to non-termination.

Application Strategy and Termination Checks. In order to preserve termination and completeness of the algorithm for LL grammars, we adopt a specific strategy for applying the rules. We use the INCLUSION rule to convert an equivalence relation to inclusion relations only when at least one of the operands of the relation has size greater than one. Such cases will not arise if both grammars are LL(1) (or LL(k) when the rules are augmented with a lookahead distance of k). We prevent the sentential forms from growing beyond a threshold by applying the SPLIT rule whenever the threshold is reached. We prioritize the application of the rules EMPTY, INDUCT, and TESTCASES, which simplify the relations, over the BRANCH rule. Note that the TESTCASES rule, which is incomplete, will not apply to LL(1) grammars, since inclusion relations will not be created during their proof exploration.

We use a set of filters to identify relations that are false and to terminate the algorithm. An important filter is the Length filter, which checks, for every equivalence query {α} op {β}, whether the length of the left sentential form α is larger than the length of the shortest word that can be generated by β, and vice versa. If this check fails, one of the sentential forms cannot generate the shortest word of the other, and the relation does not hold. (Recall that the input grammars do not have epsilon productions.) We also use other filters, especially for inclusion relations, to quickly abort the search and report a failure. We elide the details for brevity.
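Both the Length filter and the choice of x ∈ ||A|| in the SPLIT rule only need the length of a shortest word derivable from each non-terminal, which is computable by a standard fixpoint iteration. A Python sketch under our own grammar encoding (not the tool's code):

```python
# Fig. 10 grammar (a): S -> aT, T -> aTb | b; lowercase symbols are terminals.
G1 = {"S": [("a", "T")], "T": [("a", "T", "b"), ("b",)]}
TERMS = {"a", "b"}

def shortest_lengths(g, terms):
    """Fixpoint computing, for each non-terminal, the length of a shortest
    derivable word. Starts from infinity and relaxes until stable."""
    INF = float("inf")
    short = {nt: INF for nt in g}
    changed = True
    while changed:
        changed = False
        for nt, rhss in g.items():
            for rhs in rhss:
                # A right-hand-side costs 1 per terminal plus the current best
                # estimate for each non-terminal it contains.
                cost = sum(1 if s in terms else short[s] for s in rhs)
                if cost < short[nt]:
                    short[nt] = cost
                    changed = True
    return short
```

For the grammar above this yields shortest lengths 1 for T (via T → b) and 2 for S (via S → aT); a non-terminal left at infinity is unproductive.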

The algorithm described above reduces to the algorithm of Korenjak and Hopcroft [15] for LL(1) grammars that are in GNF. Hence, our algorithm is a decision procedure for LL(1) grammars in GNF. However, our algorithm may not terminate for grammars outside this class, since the sentential forms in an inclusion relation can grow arbitrarily long. In our implementation, we abort the algorithm and return failure if the algorithm exhausts the memory resources or exceeds a parametrizable time limit (fixed at 10s in our experiments).

4.1 Incorporating Lookahead

In general, the SPLIT rule described above is incomplete. Recall that given a relation [Aγ ≡ ψ], the rule computes a word x of shortest length derivable from A, and equates γ with the derivative of ψ with respect to x. This is because, since Aγ ⇒* xγ, the rule optimistically hypothesises that γ and d(x, ψ) are equivalent. However, if there are other derivations of the form Aγ ⇒* xβ where β ≠ γ, equating γ alone with d(x, ψ) could be incomplete. In the case of LL(1) grammars, there is at most one derivation for a sentential form starting with a prefix x from a non-terminal A (if x is non-empty), which entails the completeness of the SPLIT rule. Interestingly, this property also holds for LL(k) grammars, provided we know the string w of length at least k − 1 that follows x. In the sequel, we briefly describe the extensions we perform to the proof rules shown in Fig. 11 to exploit this property. Our extensions are based on the algorithm of Olshansky and Pnueli [24]. In essence, the extensions statically enumerate all strings of length smaller than k, referred to as lookahead strings, and use them to create fine-grained relations between sentential forms.

We perform two major extensions to the relations and sentential forms: (a) We qualify every relation with a (possibly empty) word x, which restricts the relation to only words having x as a prefix. For instance, α ≡x β holds iff α and β generate the same set of words having the prefix x. (b) We introduce two special types of sentential forms: prefix-restricted sentential forms (PRS) and grouped variables. A PRS is of the form [[x, α]], where x is a word and α is a sentential form. It allows only those derivations of α that lead to a word having x as the prefix. A grouped variable is a disjoint union of two or more PRS that have different prefixes. A grouped variable allows all derivations that are possible through its individual members, akin to a union of sentential forms. PRS and grouped variables are formally defined by Olshansky and Pnueli [24]. They can be treated as any other sentential form, e.g., they can be concatenated with other sentential forms, used in derivative computation, and so on.

We extend the definition of a derivative d(w, α) so that it additionally accepts a string x and refines the result of d(w, α) to include only those sentential forms that can derive the string x. That is, d(w, x, α) = {β | β ∈ d(w, α) ∧ (∃γ s.t. β ⇒* xγ)}. We refer to this parameter x as a lookahead, as it is not consumed by the derivative but is used to select the sentential forms. We denote by d̂ the operation d lifted to a set of sentential forms.
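A sketch of this refined derivative for GNF grammars (Python, our own encoding, not the paper's code); the side condition β ⇒* xγ is itself conveniently checked with another derivative, since β can derive a word with prefix x iff d(x, β) is non-empty.

```python
# GNF non-terminals from the Fig. 10-style running example (lowercase = terminal):
G = {"T": [("a", "T", "b"), ("b",)],
     "R": [("a", "b", "b"), ("a", "R", "b"), ("b",)]}
TERMS = {"a", "b"}

def derivative(g, w, alpha):
    """Plain d(w, alpha) for a GNF grammar (sentential forms as symbol tuples)."""
    forms = {tuple(alpha)}
    for ch in w:
        nxt = set()
        for f in forms:
            if f and f[0] in TERMS:
                if f[0] == ch:
                    nxt.add(f[1:])
            elif f:
                for rhs in g[f[0]]:
                    if rhs[0] == ch:
                        nxt.add(rhs[1:] + f[1:])
        forms = nxt
    return forms

def derivative_la(g, w, x, alpha):
    """d(w, x, alpha): keep only remainders that can still derive the lookahead x.
    beta is kept iff beta =>* x.gamma for some gamma, i.e. d(x, beta) != empty."""
    return {b for b in derivative(g, w, alpha) if derivative(g, x, b)}
```

For instance, d(a, R) contains both bb and Rb, but with lookahead a only Rb survives, since bb cannot derive a word starting with a.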

We adapt the BRANCH and SPLIT rules shown in Fig. 11 to the extended domain of relations and sentential forms. (The other rules in Fig. 11 do not require any extensions.) We now discuss the extended BRANCH rule; for brevity, we present the extended SPLIT rule in Appendix B.

BRANCHEXT:
  x = aw    ∀b ∈ Σ. C ∪ {ᾱ op_x β̄} ⊢ d̂(a, wb, ᾱ) op_wb d̂(a, wb, β̄)
  --------------------------------------------------------------------
  C ⊢ ᾱ op_x β̄

Similar to the BRANCH rule, the BRANCHEXT rule removes the first character a (of the words considered by the relation) from the sentential forms in ᾱ and β̄. However, unlike the BRANCH rule, which compares all the sentential forms left behind after deriving the first character, the BRANCHEXT rule looks ahead at the string wb that follows the character a to choose the sentential forms that have to be compared. Note that the derivative operation only returns the sentential forms that can derive the lookahead string wb.

Given a lookahead distance k, and two grammars with start symbols S1 and S2, we begin the algorithm with the initial set of relations S1 ≡_{w_{k−1}} S2, where w_{k−1} ranges over the words of length ≤ k − 1. The grammars are equivalent if every relation is proven using the inference rules. In our implementation, we fix the lookahead distance at 2. Our implementation reduces to the algorithm of Olshansky and Pnueli [24] when the input grammars are LL(2) GNF grammars, and hence is complete for LL(2) grammars in GNF.

5. Experimental Results

We developed a grammar analysis system based on the algorithms presented in this paper, using the Scala programming language. All the experiments discussed in this section were performed on a 64-bit OpenJDK 1.7 VM running on the Ubuntu Linux operating system, executing on a server with two 3.5 GHz Intel Xeon processors and 128GB memory.

5.1 Evaluations with Programming Language Grammars

Fig. 13 presents details about the benchmarks used in the evaluation. The column Lang. shows the list of programming languages chosen for evaluation. For each language, we considered at least two grammars, which are denoted using the names shown in column B. For each language, we hand-picked

Query       # Ctr. Exs.   # Samples   RA time
c1 ≡ c2          82           227      1ms
p1 ≡ p2         417          1053      0.2ms
js1 ≡ js2        75           150      0.8ms
j1 ≡ j2         133           240      1.5ms
v1 ≡ v2          41            52      2.7ms

Figure 14. Counter-examples found in 1 min when comparing grammars of the same programming language. The column RA time denotes the average time taken for one random access.

grammars that cover almost all features of the language and are expected to be identical. For example, in the case of Javascript and VHDL, the grammars we chose were supposed to implement the same standard, namely the ECMA standard and VHDL-93, respectively. In some cases, the grammars even use identical names for many non-terminals. The column Size shows the number of non-terminals and productions in each grammar when expressed in standard BNF form. The column Source shows the source of the grammars.

Comparing Real-world Grammars. As an initial experiment, we compared the grammars belonging to the same programming language for equivalence. We ran the counter-example detector for 1 minute on each pair of grammars, fixing the maximum length of the enumerated words at 50. Fig. 14 shows the results of this experiment. The column Ctr. Exs shows the number of counter-examples that were found in 1 min, and the column Samples shows the number of samples generated during counter-example detection.

The column RA time shows the average time taken for accessing one word (of length between 1 and 50) uniformly at random. The results show that the operation is quite efficient, taking only a few milliseconds across all benchmarks.

Interestingly, as shown by the results, the grammars have a large number of differences even when they implement the same standard. In many cases, more than 40% of the sampled words are counter-examples. Manually inspecting a few counter-examples revealed that this is mostly due to rules that are more permissive than they ought to be. For instance, the string "enum ID implements char { ID }" is generated by j2 (the Antlr v4 Java grammar), but is not accepted by j1 [2]. The counter-examples were mostly distinct, but sometimes strings that have many parse trees occurred more than once.

From Counter-examples to Incompatibilities. Focusing on Javascript, we studied the usefulness of the counter-examples found by our tool in discovering incompatibilities between Javascript interpreters (embedded within browsers). In this experiment, we automatically converted every counter-example found by our tool (in one minute) while comparing Javascript grammars to a valid Javascript expression, wrapped the expression inside a function, and passed the function as a string to the eval construct. For example, for the counter-example "++ RegExp - this", we generate


Language     B     Size         Source
C 2011       c1    (228, 444)   Antlr v4
             c2    (75, 269)    www.quut.com/c/ANSI-C-grammar-y.html
Pascal       p1    (177, 79)    ftp://ftp.iecc.com/pub/file/pascal-grammar
             p2    (148, 244)   Antlr v3
JavaScript   js1   (128, 336)   www-archive.mozilla.org/js/language/grammar14.html
             js2   (124, 278)   Antlr v4
Java 7       j1    (256, 530)   docs.oracle.com/javase/specs/jls/se7/html/jls-18.html
             j2    (229, 490)   Antlr v4
VHDL         v1    (286, 587)   tams-www.informatik.uni-hamburg.de/vhdl/vhdl.html
             v2    (475, 945)   Antlr v4

Figure 13. Benchmarks, their sizes as pairs of number of non-terminals and productions, and their sources. Antlr v4 and Antlr v3 denote the repositories github.com/antlr/grammars-v4/ and www.antlr3.org/grammar/.

try {
  eval("var k = function() { ++ /ab*/ - this }");
  false;
} catch(err) {
  true;
}

Figure 15. A Javascript program, created using a counter-example discovered by our tool, that returns true in the Firefox/Chrome browsers, and false in Internet Explorer.

the program eval("var k = function(){ ++ /ab*/ - this }"). This program, when executed, may either assign a function value to the variable k, if the body of the function is parsed correctly, or throw an exception if the body has parse errors².

We executed the code snippets on the Mozilla Firefox (version 38), Google Chrome (version 43), and Microsoft Internet Explorer (version 11) browsers. On five counter-examples, the code snippet threw an exception (either a ParseError or a ReferenceError) in the Firefox and Chrome browsers, but terminated normally in Internet Explorer, assigning a function value to k. Exploiting this, we created a pure Javascript program, shown in Fig. 15, that returns true in the Firefox/Chrome browsers and false in Internet Explorer.

In essence, the experiment highlights that in dynamic languages that support constructs like eval, where parsing may happen at run-time, differences in parsers will likely manifest as differences in run-time behaviors.

Discovering Injected Errors. In this experiment, we evaluate the effectiveness of our tool on grammars that have comparatively fewer, and subtler, counter-examples. Since grammars obtained from independent sources are likely to have many differences, in order to obtain pairs of grammars that almost recognize the same language, we resort to automatic, controlled tweaking of our benchmarks. We introduce 3 types of errors, as explained below. (Let Gm denote the modified grammar and G the original grammar.)

2 www.ecma-international.org/ecma-262/5.1/#sec-15.1.2.1

• Type 1 Errors. We construct Gm by removing one production of G chosen at random. In this case, L(Gm) ⊆ L(G). The inclusion is proper only if the removed production is not redundant.

• Type 2 Errors. We create Gm by choosing (at random) one production of G having at least two non-terminals, and removing (at random) one non-terminal from the right-hand side. In this case, neither L(Gm) nor L(G) has to necessarily include the other.

• Type 3 Errors. We construct Gm as follows. We randomly choose one production of the grammar, say P, having at least two non-terminals, and also choose one non-terminal of the right-hand side, say N. We then create a copy (say N′) of the non-terminal N that has every production of N except one (determined at random). We replace N by N′ in the production P.
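The three mutation types can be sketched as follows (a minimal illustration, not our actual implementation; it assumes a grammar represented as a dict mapping each non-terminal to a list of productions, where a production is a list of symbols):

```python
import random

# A grammar is a dict: nonterminal -> list of productions, where each
# production is a list of symbols. Nonterminals are exactly the dict's
# keys; every other symbol is a terminal.

def copy_grammar(g):
    return {n: [list(p) for p in ps] for n, ps in g.items()}

def type1(g):
    """Type 1: drop one randomly chosen production."""
    gm = copy_grammar(g)
    n = random.choice([n for n, ps in gm.items() if ps])
    gm[n].remove(random.choice(gm[n]))
    return gm

def _prods_with_two_nonterminals(gm):
    # Productions containing at least two nonterminal occurrences.
    return [(n, p) for n, ps in gm.items() for p in ps
            if sum(s in gm for s in p) >= 2]

def type2(g):
    """Type 2: delete one nonterminal occurrence from the RHS of a
    production that has at least two nonterminals."""
    gm = copy_grammar(g)
    n, p = random.choice(_prods_with_two_nonterminals(gm))
    del p[random.choice([i for i, s in enumerate(p) if s in gm])]
    return gm

def type3(g):
    """Type 3: replace a nonterminal N in some production by a copy N'
    that lacks one of N's productions."""
    gm = copy_grammar(g)
    n, p = random.choice(_prods_with_two_nonterminals(gm))
    i = random.choice([i for i, s in enumerate(p) if s in gm])
    target, copy = p[i], p[i] + "'"
    prods = [list(q) for q in gm[target]]
    prods.remove(random.choice(prods))  # drop one production of N
    gm[copy] = prods
    p[i] = copy
    return gm
```

Each mutator deep-copies the grammar, so the original G is left intact for the subsequent equivalence query.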

The above error model has some correspondence to practical scenarios. For instance, most grammars we found do not enforce that the condition of an if statement should be a boolean-valued expression. They have productions like S → if E then E else E | · · ·, and E → E + E | E ≥ E | · · ·. Enforcing this property requires one or more type 3 fixes, as we need to create a copy E′ of E that does not have some productions of E, and use E′ in place of E to define an if condition.

We avoid injecting errors that can be discovered through small counter-examples using the following heuristic. We repeat the random error injection process until the modified grammar agrees with the original grammar on the number of parse trees (the function #t defined in Fig. 5) generating words of length ≤ 15. This ensures that the minimum counter-example, if it exists, is at least 15 tokens long. We relax this bound to 10 and 7 for the C and JavaScript grammars, respectively, since the approach failed to produce errors satisfying larger bounds within reasonable time limits. We also ensured that the same error is not reintroduced. Note that the counter-example detection algorithm is not aware of the similarities between the input grammars, nor does it attempt to discover such similarities.
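The agreement check on parse-tree counts can be sketched with a standard dynamic program (shown here, for simplicity, for grammars in Chomsky normal form; the function #t in the paper is defined for arbitrary grammars and is not reproduced here):

```python
def tree_counts(g, start, max_len):
    """t[l] = number of parse trees rooted at `start` whose yield has
    length l, for a grammar in CNF: each production is either
    [terminal] or [A, B] with A, B nonterminals (dict keys)."""
    t = {n: [0] * (max_len + 1) for n in g}
    for l in range(1, max_len + 1):
        for n, ps in g.items():
            for p in ps:
                if len(p) == 1 and l == 1 and p[0] not in g:
                    t[n][1] += 1           # terminal production
                elif len(p) == 2:
                    a, b = p               # split the yield at every k
                    t[n][l] += sum(t[a][k] * t[b][l - k]
                                   for k in range(1, l))
    return t[start]

def agree_up_to(g, gm, start, bound):
    """Do g and gm generate the same number of parse trees at every
    word length up to `bound`?"""
    return tree_counts(g, start, bound) == tree_counts(gm, start, bound)
```

Because each entry t[n][l] only depends on entries at strictly smaller lengths, a single outer loop over l suffices, giving O(|G| · bound²) arithmetic operations.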

Page 13: Automating Grammar Comparison

Type 1 Errors
B      Disproved    Avg.Time/query      Avg.Ctr.Size
       our  cfga    our      cfga       our    cfga
c1     10   7       12.7s    396.7s     29.1   10.0
c2     10   4       13.8s    325.0s     30.3   10.3
p1     10   7       6.8s     127.8s     39.3   15.0
p2     10   5       6.8s     329.2s     43.2   16.2
js1    10   0       10.9s    -          32.2   -
js2    10   9       9.6s     190.9s     31.2   8.1
j1     8    0       14.5s    -          41.1   -
j2     9    0       14.3s    -          32.1   -
v1     10   1       16.9s    810.4s     39.3   15.0
v2     10   0       23.4s    -          39.0   -

Type 2 Errors
c1     9    3       13.3s    319.1s     33.8   10.0
c2     10   6       9.1s     300.7s     35.6   10.3
p1     10   5       6.2s     358.5s     41.3   16.0
p2     10   5       7.9s     229.8s     40.0   15.8
js1    10   0       12.3s    -          33.8   -
js2    7    8       15.3s    52.8s      31.4   7.4
j1     7    0       16.3s    -          33.9   -
j2     9    0       15.1s    -          38.1   -
v1     10   2       16.4s    729.2s     43.7   15.0
v2     10   0       58.0s    -          35.8   -

Type 3 Errors
c1     5    4       37.2s    413.6s     17.8   10.3
c2     6    5       131.3s   361.2s     30.3   10.0
p1     10   3       11.0s    272.5s     34.8   15.0
p2     10   5       7.5s     526.8s     34.8   15.8
js1    5    0       198.6s   -          28.2   -
js2    5    2       34.0s    79.3s      33.2   7.5
j1     8    0       25.7s    -          35.4   -
j2     6    0       24.8s    -          36.3   -
v1     9    0       17.7s    -          38.6   -
v2     9    0       54.6s    -          37.3   -

Total  262  81      28.1s    342.6s     35.0   12.2

Figure 16. Identification of automatically injected errors, using our tool (our) and the implementation of [4] (cfga).

For each benchmark b and error type t, we create 10 defective versions of b, each containing one error of type t. In total, we create 300 defective grammars. In each case, we query the tool for the equivalence of the erroneous and the original versions, with a time out of 15 minutes. Fig. 16 shows the results of this experiment. We categorize the results based on the type of the error that was injected. For now, consider only the sub-columns labelled our.

The column Disproved shows the number of queries disproved, i.e., the cases where the defective grammar was identified as not equivalent to the original version. (The maximum possible value for this column is 10.) Note that for this experiment we ran our tool only until it found one counter-example. The column Avg.Time/query shows the average time taken by the tool on queries where it found a counter-example. The column Avg.Ctr.Size shows the average length of the counter-examples discovered by the tool. The last row of the table summarizes the results by showing the total number of queries disproved, the average time taken to disprove a query, and the average length of a counter-example.

The results show that the tool was successful in disproving all queries except 3 for Type 1 errors, and 92 out of 100 queries for Type 2 errors, within a few seconds. For Type 3 errors, which are quite subtle, the tool succeeded in finding counter-examples for 73 out of 100 queries, taking at most 200s. It timed out after 15 min in the remaining cases. We found that the tool generated millions of words before timing out on a query, across all benchmarks.

To put these results in perspective, we now present a comparison with the approach proposed by Axelsson et al. [4], which is also used in the more recent work of Creus and Godoy [7].

Comparison with a Related Work. The approach proposed in [4] finds counter-examples for equivalence by constructing a propositional formula that is unsatisfiable iff the input grammars are equivalent up to a bound l, i.e., they accept (or reject) the same set of words of length ≤ l. The approach uses an incremental SAT solver to obtain a satisfying assignment of the formula, which corresponds to a counter-example for equivalence. We ran their tool cfgAnalyzer on the same set of equivalence queries constructed by automatically injecting errors in our benchmarks as described earlier, with the same time out of 15 minutes. We present the results obtained using their tool in Fig. 16 adjacent to our results, under the sub-column cfga. The cfgAnalyzer tool was run in its default mode, wherein the bound l on the length of the words is incremented in unit steps starting from 1 until a counter-example is found. (We tried adjusting the start length to 15 and greater, and also tried varying the length increments, but they resulted in worse behaviour. This is probably because incremental solving benefits from starting at 1.)
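The property cfgAnalyzer decides — agreement on all words up to a length bound l — can be stated as the following brute-force check (a naive stand-in for illustration only, not the SAT encoding of [4]; it assumes an epsilon-free grammar in a dict representation and is practical only for toy grammars):

```python
def words_upto(g, start, max_len):
    """All words of length <= max_len derivable from nonterminal
    `start`. DFS over sentential forms; the length-based pruning is
    sound only for epsilon-free grammars (every symbol derives at
    least one terminal). Exponential: toy grammars only."""
    out, seen, stack = set(), set(), [(start,)]
    while stack:
        form = stack.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        # Expand the leftmost nonterminal, if any.
        i = next((j for j, s in enumerate(form) if s in g), None)
        if i is None:
            out.add("".join(form))
            continue
        for p in g[form[i]]:
            stack.append(form[:i] + tuple(p) + form[i + 1:])
    return out

def bounded_equiv(g1, s1, g2, s2, l):
    """Equivalence up to word length l: returns a shortest
    counter-example, or None if the bounded languages agree."""
    diff = words_upto(g1, s1, l) ^ words_upto(g2, s2, l)
    return min(diff, key=len) if diff else None
```

The SAT-based formulation decides the same bounded property far more efficiently, which is why the comparison below is against cfgAnalyzer rather than such an enumeration.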

The results show that our tool outperforms cfgAnalyzer by a huge margin on these benchmarks. When aggregated over all benchmarks, our tool disproves 3 times more queries than cfgAnalyzer. Observe that on the Java, VHDL and first Javascript (js1) benchmarks, cfgAnalyzer timed out on almost all queries. In general, we found that the performance of cfgAnalyzer degrades with the length of the counter-examples and with the sizes of the grammars. On the other hand, as highlighted by the results in Fig. 16, our tool discovers large counter-examples within seconds.

To the credit of cfgAnalyzer, in cases where it terminates, it finds the shortest counter-example (as a consequence of running it in the default mode). This, however, is not a limitation of our tool, since we can progressively search for smaller counter-examples by narrowing the range of possible word lengths after discovering a counter-example.


Query         Refuted        Proved        Unproved    Time/query
1395 (100%)   1042 (74.6%)   289 (20.7%)   64 (4.6%)   107ms

Figure 17. Summary of evaluating students’ solutions.

5.2 A Tutoring System for Context-free Grammars

We implemented an online grammar tutoring system, available at grammar.epfl.ch, using our tool. The tutoring system offers three types of exercises: (a) constructing (LL(1) as well as arbitrary) context-free grammars from English descriptions, (b) converting a given context-free grammar to normal forms like CNF and GNF, and (c) writing leftmost derivations for automatically generated strings belonging to a grammar. Each class of exercises has about 20 problems with varying levels of difficulty.

For exercises (a) and (b), the system automatically checks the correctness of the grammars submitted by the users by comparing them to a pre-loaded reference solution for the question. The following are the possible outcomes of checking a solution: (i) the shortest counter-example that was found within the time limits, or (ii) a message that the grammar has been proved correct, or (iii) a message that the grammar passed all test cases but was not proved to be correct.

The system also supports checking the LL(1) property and ambiguity of grammars. Moreover, it has experimental support for generating hints (a feature outside the scope of this paper). The system offers a very intuitive syntax for writing grammars, and also supports an EBNF form that permits using regular expressions in the right-hand sides of productions.

5.3 Evaluation of the Algorithms in the Context of a Tutoring System

We used our tutoring system in a 3rd-year undergraduate course on computer language processing. We summarize the results of this study in Fig. 17. The column Query shows the total number of distinct equivalence queries that the tool was run on. The system refuted 1042 queries by finding counter-examples. (It was configured to enumerate at most 1000 words of length 1 to 11.) Among the 353 submissions for which no counter-example was found, the tool proved the correctness of 289 submissions. For 64 submissions, the tool was neither able to find a counter-example nor able to prove correctness. In essence, the tool was able to decide the veracity of 95% of the submissions, and was incomplete on the remaining 5% (in which cases we report that the student's solution is possibly correct). The grammars submitted by students had on average around 3 non-terminals and 6 productions (the maximum was 9 non-terminals and 43 productions). Moreover, at least 370 of the submissions were ambiguous. We now present detailed results on the effectiveness of the verification algorithm, which is, to our knowledge, a unique feature of our grammar tutoring system.

Query        Proved        Time    LL1     LL2          Amb
353 (100%)   289 (81.9%)   410ms   7 (2%)  56 (15.9%)   101 (28.6%)

w/o TESTCASES rule
353          280           630ms   7       56           94

Figure 18. Evaluation of the verification algorithm on students' solutions.

Evaluation of the Verification Algorithm. Our tutoring system uses the verification algorithm described in Section 4 to establish the correctness of the submissions for which no counter-examples are found within the given time limit and sample size. In our evaluation, there are 353 such submissions. The first row of Fig. 18 shows the results of using the algorithm, with all of its features enabled, on the 353 submissions. We used a time out of 10s per query. The system proved almost 82% of the queries, taking on average less than half a second per query (as shown by the column Time). The remaining columns further classify the verified queries based on the nature of the grammars that are compared.

The column LL1 shows the number of queries in which the grammars that are compared are LL(1) when normalized to GNF. The algorithm of [15] is applicable only to these cases. The results show that a meager 2% of the queries belong to this category. This is expected, since even LL(1) grammars may become non-LL(1) when epsilon productions are eliminated [28] (which is required by the verification algorithm).

The column LL2 shows the number of queries in which the grammars compared are LL(2) but not LL(1), after conversion to GNF. About 16% of the queries belong to this category. This class is interesting because the algorithm of [24] is complete for these cases. (Although the algorithm of [12] is also applicable, it seldom succeeds on these queries since it uses no lookahead.) A vast majority (72%) of the proven queries involved at least one grammar that is not LL(2). In fact, about 28% of the queries involved ambiguous grammars. (Neither [24] nor [15] is directly applicable to this class of grammars, and [12] is likely to be ineffective.) This indicates that without our extensions a vast majority of the queries might remain unproven. We are not aware of any existing algorithm that can prove equivalence queries involving ambiguous grammars.

We also measure the impact of the TESTCASES inference rule, which uses concrete examples to refine inclusion relations (see Section 4). The second row of Fig. 18 shows the results of running the verification algorithm without this rule. Overall, the number of queries proven decreases by 9 when this rule is disabled. The impact is mostly on queries involving ambiguous grammars. Moreover, the verifier is slower in this case, as shown by the increase in the average time per query. It also timed out on 25 queries after 10s. This is due to the increase in the number and sizes of the relations created during the verification algorithm. We measured a twofold increase in the average number of sentential forms contained in a relation.

6. Related Work

Grammar Analysis Systems. Axelsson et al. [4] present a constraint-based approach for checking bounded properties of context-free grammars, including equivalence and ambiguity. In Section 5 we presented a comparison of our counter-example detection algorithm with this work, which shows that our approach does better, especially when the counter-examples and grammars are large. Creus and Godoy [7] present RACSO, an online judge for context-free grammars. RACSO integrates many strategies for counter-example detection, including the approach of Axelsson et al. [4]. We differ from this work in many aspects. For instance, our enumerators support random access and uniform sampling, and scale to large programming-language grammars, generating millions of strings within seconds. Our system can additionally prove equivalence of grammars. (An empirical comparison with this work was not possible since their interface restricts the sizes of grammars that can be used when creating problems, by requiring that non-terminals be single uppercase characters.)

Decision Procedures for Equivalence. Decision procedures for restricted classes of context-free grammars have been extensively researched [15], [24], [12], [28], [32], [23], [5], [31]. For brevity we only highlight a few important works. Korenjak and Hopcroft [15], and later Bastien et al. [5], developed algorithms for deciding equivalence of simple grammars. Rosenkrantz and Stearns [28] introduced LL(k) grammars and showed that their equivalence is decidable. Later, Olshansky and Pnueli [24] proposed a direct algorithm for deciding equivalence of LL(k) grammars. Nijholt [23] presented a similar result for LL-regular grammars, which properly contain LL(k) grammars. Decision procedures for several proper subclasses of deterministic grammars were studied by Harrison et al. [12], and Valiant [32]. Sénizergues [31] showed that equivalence of arbitrary deterministic grammars is decidable.

We are not aware of any practical applications of these algorithms. We extend the algorithms of Olshansky and Pnueli [24], and Harrison et al. [12] to a sound but incomplete approach for proving equivalence of arbitrary grammars, and use it to power a grammar tutoring system.

Uniform Sampling of Words. Hickey and Cohen [14], and Mairson [19] present algorithms for sampling words from unambiguous grammars uniformly at random (u.a.r.). Gore et al. [10] develop a subexponential-time algorithm for sampling words from (possibly ambiguous) grammars, where the probability of generating a word deviates from uniform by a factor 1 + ε, ε ∈ (0, 1). Bertoni et al. [6] present an algorithm for sampling from a finitely ambiguous grammar in polynomial time.

Our approach has a comparable running time for sampling a word u.a.r., and is not restricted to uniform random sampling. We are not aware of any implementations of these related works.
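The count-then-sample idea underlying these algorithms can be sketched for CNF grammars as follows (a toy Hickey–Cohen-style sampler, not taken from any of the cited works; it is uniform over parse trees, and hence over words only when the grammar is unambiguous):

```python
import random

def counts(g, max_len):
    """t[N][l] = number of parse trees rooted at N whose yield has
    length l, for a CNF grammar: productions are [terminal] or [A, B]."""
    t = {n: [0] * (max_len + 1) for n in g}
    for l in range(1, max_len + 1):
        for n, ps in g.items():
            for p in ps:
                if len(p) == 1 and l == 1 and p[0] not in g:
                    t[n][1] += 1
                elif len(p) == 2:
                    t[n][l] += sum(t[p[0]][k] * t[p[1]][l - k]
                                   for k in range(1, l))
    return t

def sample(g, t, n, l):
    """Yield of a parse tree rooted at n with yield length l, chosen
    uniformly at random among all such trees."""
    r = random.randrange(t[n][l])  # pick the r-th tree, then locate it
    for p in g[n]:
        if len(p) == 1 and l == 1 and p[0] not in g:
            if r == 0:
                return p[0]
            r -= 1
        elif len(p) == 2:
            a, b = p
            for k in range(1, l):  # how many trees split at position k?
                c = t[a][k] * t[b][l - k]
                if r < c:
                    return sample(g, t, a, k) + sample(g, t, b, l - k)
                r -= c
    raise ValueError("no tree of this length")
```

Drawing the index r first and then descending through the count table is what makes the distribution exactly uniform over trees of the chosen length.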

Enumeration in the Context of Testing. Grammar-based software testing approaches (such as [27], [22], [30], [21], [13], [18], [20], [9], [11]) generate strings belonging to grammars describing the structure of the input, and use them to test software such as refactoring engines and compilers. In contrast to our objective, there the focus is on generating strings from grammars satisfying complex semantic properties, such as data-structure invariants, type correctness, etc., that will expose bugs in the software under test.

Purdom [27], and Malloy [21] present specialized algorithms for generating a small number of test cases that result in semantically correct strings useful for detecting bugs. Maurer [22], Sirer and Bershad [30], and Guo and Qiu [11] propose approaches for stochastic enumeration of strings from probabilistic grammars where productions are weighted by probabilities. The probabilities are either manually provided or dynamically adjusted during enumeration. A difference compared to our approach is that they do not sample words by restricting their length (which is hard in the presence of semantic properties), but control the frequency with which the productions are used.

Hennessy [13], and Lämmel and Schulte [18] explore various criteria for covering the productions of the grammar that can be beneficial in discovering bugs in software. Majumdar and Xu [20], and Godefroid et al. [9] propose approaches for selectively generating strings from a grammar that will exercise a path in the program under test using symbolic execution.

Daniel et al. [8], and Kuraj and Kuncak [17] present generic approaches for constructing enumerators for arbitrary structures by way of enumerator combinators. They allow combining simple enumerators using a set of combinators (such as union and product) to produce more complex enumerators. These approaches (Kuraj and Kuncak [17] in particular) were an inspiration for our enumeration algorithm, which is specialized for grammars and provides additional functionality such as polynomial-time random access and uniform random sampling.
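The combinator style can be illustrated with a minimal sketch for finite enumerators (hypothetical names; the cited frameworks additionally handle infinite enumerators, and our algorithm adds polynomial-time random access and uniform sampling for grammars):

```python
class Enum:
    """A minimal indexed enumerator: a size and an indexing function
    mapping i -> the i-th element, for 0 <= i < size."""
    def __init__(self, size, at):
        self.size, self.at = size, at

def singleton(x):
    return Enum(1, lambda i: x)

def union(a, b):
    """Enumerate all of a, then all of b (finite case only)."""
    def at(i):
        return a.at(i) if i < a.size else b.at(i - a.size)
    return Enum(a.size + b.size, at)

def product(a, b, f=lambda x, y: (x, y)):
    """Combine every pair, indexing row-major into the
    a.size x b.size grid."""
    def at(i):
        return f(a.at(i // b.size), b.at(i % b.size))
    return Enum(a.size * b.size, at)
```

Because every combinator preserves the index-to-element function, any element of a composed enumerator can be retrieved directly from its index without generating its predecessors — the property our grammar enumerators generalize.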

7. Conclusions

We present scalable algorithms for enumerating and sampling words (and parse trees) belonging to context-free grammars, using bijective functions from natural numbers to parse trees that provide random access to words belonging to a grammar in polynomial time. Our experiments show that the enumerators are effective in finding discrepancies between large, real-world grammars meant to describe the same language, as well as for unraveling deep, subtle differences between grammars, outperforming the available state-of-the-art techniques. We also show that the counter-examples serve


as good test cases for discovering incompatibilities between interpreters, especially for languages like Javascript.

We also develop a practical system for proving equivalence of arbitrary context-free grammars, building on top of prior theoretical research. We built a grammar tutoring system, available at grammar.epfl.ch, using our algorithms. Our evaluations show that the system is able to decide the correctness of 95% of the submissions, proving almost 82% of the grammars that pass all test cases. To our knowledge, this is the first tutoring system for grammars that can certify the correctness of solutions. This opens up the possibility of using our tool in massive open online courses to introduce grammars to large populations of students.

References

[1] Antlr version 4. http://www.antlr.org/

[2] Java 7 language specification. http://docs.oracle.com/javase/specs/jls/se7/html/jls-18.html

[3] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986. ISBN 0-201-10088-6.

[4] R. Axelsson, K. Heljanko, and M. Lange. Analyzing context-free grammars using an incremental SAT solver. In Automata, Languages and Programming, ICALP, pages 410–422, 2008. URL http://dx.doi.org/10.1007/978-3-540-70583-3_34

[5] C. Bastien, J. Czyzowicz, W. Fraczak, and W. Rytter. Prime normal form and equivalence of simple grammars. Theor. Comput. Sci., 363(2):124–134, 2006.

[6] A. Bertoni, M. Goldwurm, and M. Santini. Random generation and approximate counting of ambiguously described combinatorial structures. In STACS 2000, pages 567–580, 2000.

[7] C. Creus and G. Godoy. Automatic evaluation of context-free grammars (system description). In Rewriting and Typed Lambda Calculi, RTA-TLCA, pages 139–148, 2014. URL http://dx.doi.org/10.1007/978-3-319-08918-8_10

[8] B. Daniel, D. Dig, K. Garcia, and D. Marinov. Automated testing of refactoring engines. In Foundations of Software Engineering, pages 185–194, 2007.

[9] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based whitebox fuzzing. In Programming Language Design and Implementation, pages 206–215, 2008.

[10] V. Gore, M. Jerrum, S. Kannan, Z. Sweedyk, and S. R. Mahaney. A quasi-polynomial-time algorithm for sampling words from a context-free language. Inf. Comput., 134(1):59–74, 1997.

[11] H. Guo and Z. Qiu. Automatic grammar-based test generation. In Testing Software and Systems, ICTSS, pages 17–32, 2013.

[12] M. A. Harrison, I. M. Havel, and A. Yehudai. On equivalence of grammars through transformation trees. Theor. Comput. Sci., 9:173–205, 1979.

[13] M. Hennessy. An analysis of rule coverage as a criterion in generating minimal test suites for grammar-based software. In Automated Software Engineering, pages 104–113, 2005.

[14] T. J. Hickey and J. Cohen. Uniform random generation of strings in a context-free language. SIAM J. Comput., 12(4):645–655, 1983.

[15] A. J. Korenjak and J. E. Hopcroft. Simple deterministic languages. In Symposium on Switching and Automata Theory (SWAT), pages 36–46, 1966.

[16] D. Kozen. Automata and Computability. Undergraduate Texts in Computer Science. Springer, 1997. ISBN 978-0-387-94907-9.

[17] I. Kuraj and V. Kuncak. SciFe: Scala framework for efficient enumeration of data structures with invariants. In Scala Workshop, pages 45–49, 2014.

[18] R. Lämmel and W. Schulte. Controllable combinatorial coverage in grammar-based testing. In Testing of Communicating Systems, TestCom, pages 19–38, 2006.

[19] H. G. Mairson. Generating words in a context-free language uniformly at random. Inf. Process. Lett., 49(2):95–99, 1994.

[20] R. Majumdar and R. Xu. Directed test generation using symbolic grammars. In Automated Software Engineering, pages 553–556, 2007.

[21] B. A. Malloy. An interpretation of Purdom's algorithm for automatic generation of test cases. In International Conference on Computer and Information Science, pages 3–5, 2001.

[22] P. M. Maurer. Generating test data with enhanced context-free grammars. IEEE Software, 7(4):50–55, 1990.

[23] A. Nijholt. The equivalence problem for LL- and LR-regular grammars. pages 149–161, 1982.

[24] T. Olshansky and A. Pnueli. A direct algorithm for checking equivalence of LL(k) grammars. Theor. Comput. Sci., 4(3):321–349, 1977.

[25] T. Parr, S. Harwell, and K. Fisher. Adaptive LL(*) parsing: the power of dynamic analysis. In Object-Oriented Programming, Systems, Languages & Applications, OOPSLA, pages 579–598, 2014.

[26] S. Pigeon. Pairing function. http://mathworld.wolfram.com/PairingFunction.html

[27] P. Purdom. A sentence generator for testing parsers. BIT Numerical Mathematics, pages 366–375, 1972.

[28] D. J. Rosenkrantz and R. E. Stearns. Properties of deterministic top-down grammars. In Symposium on Theory of Computing, STOC, pages 165–180, 1969.

[29] R. Singh, S. Gulwani, and A. Solar-Lezama. Automated feedback generation for introductory programming assignments. In Programming Language Design and Implementation, PLDI, pages 15–26, 2013.

[30] E. G. Sirer and B. N. Bershad. Using production grammars in software testing. In Domain-Specific Languages, DSL, pages 1–13, 1999.

[31] G. Sénizergues. L(A)=L(B)? Decidability results from complete formal systems. Theoretical Computer Science, 251(1–2):1–166, 2001.

[32] L. G. Valiant. Decision procedures for families of deterministic pushdown automata. Technical report, University of Warwick, Coventry, UK, 1973.

[33] A. Warth, J. R. Douglass, and T. D. Millstein. Packrat parsers can support left recursion. In Symposium on Partial Evaluation and Semantics-based Program Manipulation, PEPM, pages 103–110, 2008.

A. Cantor’s Inverse Pairing Functions forBounded and Unbounded Domains

The basic inverse pairing function π that maps a naturalnumber in one dimensional space to a number in two dimen-sional space that is unbounded along both directions [26].π(z) = (x, y), where x and y are defined as follows:

x = w − yy = z − t

(t, w) = simple(z)

where, simple(z) = (t, w) is a function defined as follows:

t =w(w + 1)

2

w =

⌊⌊√8z + 1

⌋− 1

2

We extend Cantor's inverse pairing function to two-dimensional spaces bounded along one or both directions. The inverse pairing function π takes three arguments: the number z that has to be mapped, and the bounds xb and yb of the x and y dimensions (which could be ∞). xb is the (inclusive) bound on the x-axis, i.e., ∀x. x ≤ xb, and yb is the (exclusive) bound on the y-axis, i.e., ∀y. y < yb. We define π(z, xb, yb) = (x, y), where

  x = w − y
  y = z − t

  (t, w) = bskip(z)    if z ≥ zb
           xskip(z)    if zx ≤ z < zb
           yskip(z)    if zy ≤ z < zb
           simple(z)   otherwise

where zx, zy and zb are the indices at which the bounds along the x, y, or both directions are crossed, respectively. The values are defined as follows:

  zy = yb(yb + 1) / 2
  zx = (xb + 1)(xb + 2) / 2

  zb = yb(xb − yb + 1) + zy         if xb > yb − 1
       (xb + 1)(yb − xb − 1) + zx   if yb − 1 > xb
       zy                           otherwise

We define xskip(z) as (t, w), where t and w are defined as follows:

  t = (2w·xb − xb² + xb) / 2
  w = ⌊(2z + xb² + xb) / (2(xb + 1))⌋

Define yskip(z) as (t, w), where

  t = (2w·yb − yb² + yb) / 2
  w = ⌊(2z + yb² − yb) / (2yb)⌋

Define bskip(z) as (t, w), where

  t = ((2wb − 1)w − w² − sb + wb) / 2
  w = (r − ⌈√(r² − 8z − 4sb + 4yb − 4xb)⌉) / 2
  r = 2wb + 1
  wb = xb + yb
  sb = xb² + yb²

The above definitions use only integer arithmetic operations. (Note that we always compute the floor or ceiling of divisions and square roots.) These operations take at most quadratic time in the sizes of the inputs, and can be optimized even further. Moreover, many multipliers and divisors are powers of 2, and hence can be computed by bit-shift operations.
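As an illustration, the basic (unbounded) inverse pairing function transcribes directly into integer-only code (a sketch of the definitions above, not our optimized implementation):

```python
import math

def simple(z):
    # w = floor((floor(sqrt(8z + 1)) - 1) / 2),  t = w(w + 1) / 2
    w = (math.isqrt(8 * z + 1) - 1) // 2
    return w * (w + 1) // 2, w

def unpair(z):
    """Basic Cantor inverse pairing: maps z in N to (x, y) in N x N."""
    t, w = simple(z)
    y = z - t
    return w - y, y
```

Since the mapping is bijective, enumerating z = 0, 1, 2, … walks the diagonals of N², visiting every pair exactly once.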

B. Extended Split Rule

Fig. 19 shows the extended split rule that incorporates a finite amount of lookahead. We assume that the lookahead distance k is at least 2. We define Θk−1(γ) as the set of all words of length at most k − 1 derivable from a sentential form γ. That is,

  Θk−1(γ) = { w | w ∈ ⋃_{j=1..k−1} Σ^j ∧ γ ⇒* w }

Recall the definition of SPLIT shown in Fig. 11. The SPLITEXT rule has a similar structure, but it creates more constrained relations by specializing the SPLIT rule for every possible lookahead string in Θk−1(γ).

Let x be one of the shortest words derivable from A. Akin to the SPLIT rule, the SPLITEXT rule applies only when Ψ can be expressed as a union of sentential forms ψi, each of which has a non-empty suffix of length at least two (represented using δiβi) that is preserved by its derivative with respect to x. The rule computes the derivative of ψi w.r.t. x looking ahead at the strings w ∈ Θk−1(γ) (which are possible strings that may follow x). We denote by ρ^w_i the derivative of (the prefix of) ψi w.r.t. x when the lookahead string is w.

The relations shown in the antecedents of the SPLITEXT rule are straightforward extensions of the antecedents asserted


SPLITEXT

  x ∈ ||A||    Ψ = ⋃_{i=1..m} α_i δ_i β_i    ∀w ∈ Θk−1(γ). d(x, w, α_i δ_i β_i) = ρ^w_i δ_i β_i    ∀0 ≤ i ≤ m. |δ_i| > 0, |β_i| > 0
  C′ = C ∪ { {Aγ} op_z Ψ }
  ∀w ∈ Θk−1(γ). C′ ⊢ {γ} op_w d̂(x, w, Ψ)    ∀0 ≤ i ≤ m. C′ ⊢ ( ⋃_{w ∈ Θk−1(γ)} A[[w, ρ^w_i δ_i]] ) op_z {α_i δ_i}
  ─────────────────────────────────────────────────────────────
  C ⊢ {Aγ} op_z Ψ

Figure 19. Extended Split Rule. Θk−1(γ) is the set of all words of length at most k − 1 derivable from the sentential form γ. ||A|| is the set of shortest words derivable from A. [[w, α]] denotes a prefix-restricted sentential form defined by Olshansky and Pnueli [24].

by the SPLIT rule, taking into account the lookahead strings in Θk−1(γ). For instance, the antecedent relations on the left assert that, for any lookahead string w, the strings generated by γ and d(x, w, Ψ) starting with the prefix w should be related by op. The sentential form [[w, ρ^w_i δ_i]] used in the antecedent relations on the right denotes a prefix-restricted sentential form [24] that generates only those strings of ρ^w_i δ_i having the prefix w.

