Published in IET Software
Received on 1st June 2007
Revised on 22nd November 2007
doi: 10.1049/iet-sen:20070061

In Special Issue on Language Engineering

ISSN 1751-8806

Learning context-free grammar rules from a set of programs

A. Dubey 1, P. Jalote 2, S.K. Aggarwal 2
1 Philips Research Asia, Bangalore 560045, India
2 Indian Institute of Technology, Kanpur 208016, India
E-mail: [email protected]

Abstract: The grammar of a programming language is important because it is used in developing program analysis and modification tools. Sometimes programs are written in dialects – minor variations of standard languages. Normally, grammars of standard languages are available but grammars of dialects may not be available. A technique for reverse engineering context-free grammar rules is presented. The proposed technique infers rules from a given set of programs and an approximate grammar, using an iterative approach with backtracking. The correctness of the approach is explained and a set of optimisations is proposed to make the approach more efficient. The approach and the optimisations are experimentally verified on a set of programming languages.

1 Introduction

The problem of grammar extraction or inference has attracted researchers in the natural language processing (NLP) and theoretical domains for many decades, but its importance in the software engineering field has only recently been felt [1–3]. The problem of grammar inference is important in the programming language (PL) domain because grammars are used to develop many software engineering tools. Sometimes programs are written in languages that are minor variations of standard languages. This is because some of the features frequently needed for a particular software implementation are not available in the grammar of standard languages. In order to generate software engineering tools for such programs we need the grammar of the dialect. While grammars of standard languages are generally available, grammars of dialects may not be. For example, C* is a data parallel dialect of C developed by Thinking Machines Corp for Connection Machine (CM2 and CM5) SIMD processors. Since Thinking Machines Corp no longer exists, it is very difficult to find the source code or the reference manual for C*. Sometimes, even the available reference manuals are incomplete, as discussed in [2]. In such situations, one way of obtaining the grammar of a dialect could be to reverse engineer it from the source code of the available programs and the grammar of the standard language. This paper addresses the problem of reverse engineering such grammars (that is, when PLs undergo growth).

Since PL grammars can be represented with context-free grammars (CFG), inferring CFG rules is the focus here. Most of the results concerning CFG learning are negative. For example, Gold's theory states that no language in the Chomsky hierarchy can be learned from positive samples (a set of valid sentences) alone [4]. A CFG, though, can be learned from positive and negative samples [5], but the question whether it can be learned in polynomial time is still open. Despite these limitations, a great deal of work on grammatical inference has been done, mainly in NLP applications and in the theoretical domain [6–10]. In the latter, researchers have addressed the problem of inferring small subclasses of context-free languages. Only some efforts have been made in the field of PLs [3, 11–17]. A detailed survey of grammar inference attempts can be found in [18, 19]. Most of these attempts use some heuristic to guess the grammar rules, with or without user input, and then check the correctness of the rule selection by parsing the given programs with the grammar, which now includes the proposed rule. Also, some of the above techniques are not grammatical inference techniques because they do not infer the grammar from input programs; rather, they extract the grammar from other artifacts (such as reference manuals, parser source codes, and so on) or use a knowledge-base.

This paper proposes a technique for inferring grammar rules from positive samples (a set of correct programs) and an approximate grammar. Syntactic growth of a PL is nothing but an extension of the CFG of the base language. The most commonly observed extensions in PL dialects are:

1. New declaration specifiers to support new data types, new scopes, etc. (Note that the observations discussed may not completely cover all kinds of growth, but these are the most perceived dimensions along which minor growth happens.) For example, RC [20] is a dialect of C which adds safe, region-based memory management to C with three additional type qualifiers: sameregion, parentptr and traditional.

2. New expressions to support new kinds of expressions. For example, C* [21] has new operators such as max, min, minto, maxto, dimof and rankof.

3. New keywords to support new statements. For example, Java 1.5 [22] has a statement with the keyword enum to support enumerated types, whereas lower versions of Java do not have this statement.

The above extensions require modifications in the lexical analyser to identify tokens corresponding to new terminals (that is, new types, operators, keywords) and modifications in the parser to support the new grammar rules.

Our technique is iterative with backtracking. A set of possible rules is built in each iteration and one of them is added to the grammar. After a certain number of iterations, the modified grammar is checked to see whether it parses all the programs or not. If the grammar does not parse all the programs then the algorithm backtracks to the previous iteration and selects another rule. The technique builds upon the previous work of [11], who used a heuristic-based approach, which does not always guarantee that the inferred grammar will parse all the programs. The approach discussed in this paper is deterministic. We also describe ways to reduce the search space of possible rules.

The organisation of the paper is as follows. Problem formalisation and notations are discussed in Section 2. Section 3 describes our approach and its correctness. Section 4 proposes some optimisations that are applied in our approach. Section 5 discusses the implementation, Section 6 discusses our experiences with the approach and Section 7 concludes the article.

2 Problem definition and notations

The problem of grammar inference under PL growth is defined as follows:


Given a set of programs P = {w1, w2, ...} and an incomplete grammar G = (N, T, P, S), find a grammar G' = (N', T', P', S) such that G' is complete w.r.t. P (notations are borrowed from [23]).

Definition 1: A grammar G' is complete w.r.t. input programs P if P ⊆ L(G').

From Gold's theorem [4], no language in the Chomsky hierarchy can be exactly identified using a set of positive samples alone; hence our goal is not to find the 'actual' grammar (suppose it is GT) of the entire language, as there can be an infinite number of grammars that accept a given set of programs and it is impossible to determine whether L(GT) = L(G') [24].

With respect to the above criteria of completeness, there are two extreme possibilities of a complete grammar: (1) the most general grammar, where rules are of the form S → T*, and (2) the least general grammar, where rules are of the form S → w1, S → w2, etc. The above two grammars are semantically meaningless. Hence, we make the following assumptions on G', based on observed growth in PLs.

1. The relationship between G' and G is as follows: N = N', T ⊆ T', P ⊆ P'. G and G' are epsilon-free grammars. The set (T' − T) is known beforehand and denoted as Tnew.

2. The rules missing in the grammar are those for constructs that involve new keywords, operators or declaration specifiers. We call these the new terminals of the grammar; they fall in the set Tnew.

3. A single rule is sufficient to express the syntactic construct corresponding to each new terminal.

4. Additional rules in G' are of the form A → α anew β, where A ∈ N, anew is a new terminal (i.e. anew ∈ Tnew) and α, β ∈ (N ∪ T)*.

Assumption 1 implies that the new terminals of G' are known beforehand. Hence, the lexical analyser does not recognise them as an identifier or as an unknown token. We assume N = N' because N' ⊃ N would imply additional non-terminals in G', and those non-terminals would have to be reachable from the start symbol, hence requiring the addition of rules containing the new non-terminals in their RHSs. Assumption 2 is used to enforce the language growth model discussed earlier. Assumption 3 is based on observations found in PL grammars. For instance, consider the grammars of C and C* and the with construct of C*; here a single grammar rule is sufficient to express the syntax of the with construct. Assumption 4 is based on the observation of PL grammars. For instance, rules corresponding to keywords and declaration specifiers are usually of the form A → anew β, where anew is a keyword or a declaration specifier, and rules corresponding to operators are usually of the form A → α anew β, where anew is an operator.

Since a single rule is sufficient to express the syntax corresponding to each new terminal, there exists at least one complete grammar G' = (N', T', P', S) such that |P' − P| = |Tnew|; our goal is to obtain one of those complete grammars. A parser generated from a complete grammar is called a complete parser and a parser generated from the approximate grammar (G) is called an approximate parser.

In the case of a single missing rule (|Tnew| = 1), there can be many possible rules where each one makes the grammar complete; we call the set of all such rules a set of correcting rules corresponding to the new terminal. Similarly, in the case of multiple missing rules, there can be many sets of rules, where each set makes the grammar complete (each set contains one rule for each new terminal); we call such a set a correcting set of rules (note the difference between the terms set of correcting rules and correcting set of rules).

The input program is represented as a1,n, where n is the number of tokens in the program, ai (1 ≤ i ≤ n) denotes the ith token of the program and ai,j (1 ≤ i ≤ j ≤ n) denotes the substring that starts at the ith token and ends at the jth token. An LR parser's configuration is a pair whose first component is a stack and whose second component is the remaining input string (to be parsed):

(s0 X1 s1 X2 s2 ... Xm sm, ai ai+1 ... an)

This configuration represents the right sentential form

X1 X2 ... Xm ai ai+1 ... an

where sm denotes the top of the stack. Other terms and definitions related to the LR parser, if not defined, are taken from [23]. We ignore the lookahead part of an LR item where it is not required. Since each state corresponds to an LR itemset, we use a state and its corresponding itemset interchangeably.

A CYK parser [25, 26] maintains an upper triangular matrix C of size n × n, where n is the length of the input a1,n. Each entry C[i, j] is called a CYK-cell, which contains all the symbols that can derive the substring ai,j; that is, A ∈ C[i, j] iff A derives ai,j.
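To make the CYK-cell structure concrete, the following is a minimal Java sketch of the table-filling step for a toy grammar in Chomsky normal form; the grammar, the input string and all identifiers are illustrative assumptions, not material from the paper's experiments.

import java.util.*;

public class CykDemo {
    // A CNF production A -> B C; terminal productions live in a separate map.
    record Binary(String lhs, String left, String right) {}

    public static void main(String[] args) {
        // Toy CNF grammar (illustrative): S -> A B, A -> A B | a, B -> b.
        Map<Character, Set<String>> terminalRules =
            Map.of('a', Set.of("A"), 'b', Set.of("B"));
        List<Binary> binaryRules =
            List.of(new Binary("S", "A", "B"), new Binary("A", "A", "B"));
        String w = "abb";
        int n = w.length();
        // C[i][j] = set of non-terminals deriving tokens i..j (0-based, inclusive).
        List<List<Set<String>>> C = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            C.add(new ArrayList<>());
            for (int j = 0; j < n; j++) C.get(i).add(new HashSet<>());
        }
        for (int i = 0; i < n; i++)                      // length-1 spans
            C.get(i).get(i).addAll(terminalRules.getOrDefault(w.charAt(i), Set.of()));
        for (int len = 2; len <= n; len++)               // longer spans, bottom-up
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                for (int k = i; k < j; k++)              // split point
                    for (Binary p : binaryRules)
                        if (C.get(i).get(k).contains(p.left())
                                && C.get(k + 1).get(j).contains(p.right()))
                            C.get(i).get(j).add(p.lhs());
            }
        System.out.println("S derives \"" + w + "\"? " + C.get(0).get(n - 1).contains("S"));
    }
}

The same triangular table, filled with an incomplete grammar, is what the RHSs generation phase of Section 3 reads its candidate symbol strings from.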

3 Approach

We first present our approach for inferring a single correcting rule from a set of programs and an approximate grammar, and later extend it for multiple correcting rules.

3.1 Grammar completion with single rule

Here we assume that there exists at least one complete grammar that is just one rule away from the initial grammar. For now, we restrict our discussion to determining a correcting rule from one program. We assume that the program a1,n cannot be parsed by the approximate grammar G, and that G must be augmented. First, the input programs are parsed with the parser generated from G. If each program is parsed successfully then G is considered complete and returned intact. Otherwise a correcting rule is computed.

The basic approach (Fig. 1) consists of four phases: (1) the left-hand side (LHS) gathering phase, (2) the right-hand side (RHS) generation phase, (3) the rule building phase and (4) the rule checking phase. The LHS gathering phase computes the set (L) of possible LHSs of the correcting rule from the context of the error point. The RHS generation phase computes a set (R) of possible RHSs of correcting rules using the CYK-parser table. The rule building phase builds the set of possible rules (PR) using L and R. Each rule in PR is checked in the rule checking phase, that is, whether the grammar, after including the rule, is able to parse all the programs. If all the programs are parsed successfully after including some rule in G, then that rule is returned as a correcting rule.
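The control flow of the four phases can be summarised by the following Java sketch; the Phases interface is hypothetical (it stands in for the LR- and CYK-based computations described in the next sections) and only the overall pipeline mirrors the text.

import java.util.*;

public class InferSingleRule {
    record Rule(String lhs, List<String> rhs) {}

    // Hypothetical hooks standing in for the paper's phases (not an actual API).
    interface Phases {
        Set<String> gatherPossibleLhss(String program);           // phase 1: L
        Set<List<String>> buildRhss(String program, int maxLen);  // phase 2: R
        boolean parsesAll(List<String> programs, Rule candidate); // phase 4 check
    }

    static Optional<Rule> infer(Phases g, List<String> programs, int maxLength) {
        String witness = programs.get(0);                 // a program G cannot parse
        Set<String> L = g.gatherPossibleLhss(witness);
        Set<List<String>> R = g.buildRhss(witness, maxLength);
        for (String lhs : L)                              // phase 3: PR = L x R
            for (List<String> rhs : R) {
                Rule candidate = new Rule(lhs, rhs);
                if (g.parsesAll(programs, candidate))     // phase 4: check each rule
                    return Optional.of(candidate);        // first correcting rule wins
            }
        return Optional.empty();                          // none within max_length
    }
}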

We describe the phases in more detail below. The next few sections describe the approach that infers rules of the form A → anew α and later generalise it for rules of the form A → α anew β (A ∈ N, α, β ∈ (N ∪ T)* and anew ∈ Tnew).

Figure 1 Overall approach

Figure 2 LHSs gathering phase

3.1.1 LHSs gathering phase: The first phase returns a set of non-terminals (L) as possible LHSs of correcting rules (Fig. 2). The input to this phase is a program, a1,n, and the approximate grammar G. First, a1,n is parsed with the LR parser generated from G. Here we do not assume that G is strictly LR; however, we assume that an LR parser can be generated after resolving conflicts with the help of a given set of associativities and precedences in the grammar specification file of G. Since G is incomplete, the parser will stop at some point. If ai is the first occurrence of a new terminal, the parser will stop after shifting the token ai-1 because G does not have a production containing ai in its RHS. This state of the parser is called the error state. Possible LHSs are collected from the LR-parser stack as follows.

The algorithm starts with a set of stack configurations. (Our algorithm uses a modified LR parser because in the conventional LR parser there is only one stack, whereas in our algorithm there are multiple stacks. The remaining operations are the same as those of an LR parser.) Initially this set contains a single stack, that is, the LR parser stack in the error state. The algorithm iterates as follows until this set becomes empty: it removes a stack from the set and looks at the action table entry corresponding to the top of the stack. If some of the entries corresponding to the top of the stack are reduce entries, then it performs the reductions possible in that state without looking at the next token. We call this operation 'forced reduction'. In the case of multiple possible reductions, each reduction is performed on a separate copy of the stack. The new stack obtained after performing each reduction is added to the stack set. If the top of the stack has some shift entries, then for every item of type [A → α • B β] in the itemset corresponding to the top of the stack, the non-terminal that occurs after the dot (B) is collected in L.

The algorithm stops when the stack set becomes empty, which happens when no more reductions are possible on any of the stacks in the stack set. If a1,n is parsed with a complete parser, the parser will make a sequence of reductions between the shifts of tokens ai-1 and ai and then expand a non-terminal to derive the substring covered by a correcting rule. (In a few situations, there could be no reductions between the shifts.) Algorithm GATHER_POSSIBLE_LHSs simulates the behaviour of a complete parser by performing all possible reductions. By including all non-terminals that come after the dot in some item of the top itemsets, it considers every non-terminal a complete parser could expand while parsing the remaining string ai,n.
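A sketch of the forced-reduction loop is given below, assuming a hypothetical LrTables view onto the LR automaton: applyReductions returns one stack copy per applicable reduce entry, and nonterminalsAfterDot returns B for every item [A → α • B β] in a state's itemset. Neither is the paper's actual data structure.

import java.util.*;

public class GatherPossibleLhss {
    // Hypothetical view onto the LR automaton (states as integers).
    interface LrTables {
        List<Deque<Integer>> applyReductions(Deque<Integer> stack); // one copy per reduce
        Set<String> nonterminalsAfterDot(int state);                // B of [A -> a . B b]
    }

    static Set<String> gather(LrTables lr, Deque<Integer> errorStateStack) {
        Set<String> lhss = new HashSet<>();
        Deque<Deque<Integer>> stackSet = new ArrayDeque<>();
        stackSet.push(errorStateStack);                 // start from the error state
        while (!stackSet.isEmpty()) {
            Deque<Integer> stack = stackSet.pop();
            // Collect every non-terminal a complete parser could expand here.
            lhss.addAll(lr.nonterminalsAfterDot(stack.peek()));
            // 'Forced reduction': each possible reduce acts on its own stack copy.
            stackSet.addAll(lr.applyReductions(stack));
        }
        return lhss;
    }
}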

3.1.2 RHSs generation phase: The RHSs generation phase determines the possible RHSs of correcting rules (Fig. 3). The algorithm is called with a program, the approximate grammar and a parameter max_length. It builds a set of possible RHSs of length no more than max_length. First, the input program is parsed with G to obtain an incomplete CYK-table for the program. It then uses this CYK-table to build a set of possible RHSs.

For building the set of possible RHSs, first the index of the last occurrence of the new terminal is determined; suppose it is i. For each k (i < k ≤ n, where n is the number of tokens in the program), the algorithm computes the set of symbol strings that can derive the substring ai,k (where a1,n is the input program). Since a correcting rule starts with a new terminal, the last occurrence of the new terminal in the program is the point where the substring derived by the last occurrence of the correcting rule starts, and this substring, being innermost, does not embed a substring derived by a correcting rule. Since we consider each index as a possible endpoint of the substring derived by the last occurrence of the correcting rule, the RHSs of correcting rules will be built and added to the possible RHSs.

The set of symbol strings which can derive ai,j is computed by the BUILD_STRINGS algorithm (Fig. 4), which works similarly to the CYK parser. First each cell C[p, p] (i ≤ p < j) is filled with the symbol strings which derive the single token ap,p. Symbol strings that can derive the larger substrings are built in a bottom-up manner. To compute the symbol strings of length l which can derive am,m+q, the algorithm performs the following iteration for each index k (m ≤ k < m+q): a symbol string of length r (0 ≤ r ≤ l) is picked from the cell C[m, k] and a symbol string of length l − r is picked from the cell C[k+1, m+q]; these strings are concatenated to obtain a symbol string of length l. (Since the algorithm is bottom-up, the symbol strings for the substrings am,k and ak+1,m+q are already computed.)

For example, to build symbol strings of length 2 which can derive the substring ai,k, the algorithm considers a break point i1; suppose cell C[i, i1] has symbol strings X1 and X2 and C[i1+1, k] has symbol strings Y1 and Y2, then the symbol strings constructed from these two cells will be X1Y1, X1Y2, X2Y1 and X2Y2.
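A bottom-up Java sketch of this construction is shown below, assuming the (incomplete) CYK table has already been computed; cells are indexed relative to the start of the substring of interest, and maxLen bounds the length of the symbol strings as in the paper.

import java.util.*;

public class BuildStrings {
    // strings[p][q] = symbol strings (length <= maxLen) deriving tokens p..q,
    // indexed relative to the start of the substring of interest (token i).
    static List<List<Set<List<String>>>> build(List<List<Set<String>>> cyk,
                                               int i, int j, int maxLen) {
        int n = j - i + 1;
        List<List<Set<List<String>>>> strings = new ArrayList<>();
        for (int p = 0; p < n; p++) {
            strings.add(new ArrayList<>());
            for (int q = 0; q < n; q++) strings.get(p).add(new HashSet<>());
        }
        for (int len = 1; len <= n; len++)
            for (int m = 0; m + len - 1 < n; m++) {
                int end = m + len - 1;
                // Any single symbol in the CYK cell derives the whole span.
                for (String sym : cyk.get(i + m).get(i + end))
                    strings.get(m).get(end).add(List.of(sym));
                for (int k = m; k < end; k++)          // concatenate the two halves
                    for (List<String> left : strings.get(m).get(k))
                        for (List<String> right : strings.get(k + 1).get(end))
                            if (left.size() + right.size() <= maxLen) {
                                List<String> s = new ArrayList<>(left);
                                s.addAll(right);
                                strings.get(m).get(end).add(s);
                            }
            }
        return strings;
    }
}

Because the cell for a span also receives every single symbol from the corresponding CYK cell, a lone non-terminal competes with longer concatenations, exactly as in the example above.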

3.1.3 Rule-building and rule checking phase: In the rule-building phase, the sets of possible RHSs and LHSs constructed in the previous phases are used to build a set of possible rules (PR). The set PR is the cross-product of the sets L and R. Each rule in PR is checked in the rule-checking phase. One rule from PR is picked and added to the approximate grammar; the modified grammar is then checked to see whether it parses all the programs. If the modified grammar parses all the programs after adding some rule, then that rule is returned as a correcting rule. The above checking is done using an LR parser. The approach returns whenever it finds the first correcting rule. PR could also include some rules which are correcting but non-LR. By a non-LR rule, we mean that the grammar after including that rule becomes non-LR; that is, an LR parser cannot be generated from the modified grammar, even after resolving conflicts with the already available precedences and associativities. A rule which retains the LR property is called an LR-preserving rule.

Figure 3 RHSs generation

The parameter max_length is usually kept equal to the length of the largest RHS of rules in G. However, the algorithm can be slightly modified to incrementally build possible rules with larger RHS lengths and check their correctness when it does not find any correcting rule with RHS length less than or equal to max_length (Fig. 4).

3.1.4 Realistic example: Suppose the rule corresponding to the keyword 'while' is missing in the ANSI C grammar and the program in Fig. 5a is given as input. First an LR parser is generated from the incomplete ANSI C grammar and the program is parsed. The parser gets stuck at the first 'while'. Let the itemset corresponding to the top of the stack (Fig. 5b) be

{[statement → expression SEMICOL •]}

Since the item in the above itemset is of the form [A → β •], a forced reduction is performed and the parser reaches a state where the itemset corresponding to the top of the stack contains an item of the form [statement → statement_list • statement]. The non-terminal statement is now added to the set of possible LHSs because it occurs after the dot. For building possible RHSs, we start from the last occurrence of the 'while' and build a set of symbol strings which can derive substrings starting from that point. Using the sets of possible LHSs and RHSs, a set of possible rules is built as shown in Fig. 5b. (We are not showing all possible RHSs, LHSs and itemsets, as these were too many for the real grammar.) Now each rule in this set is checked for its correctness. The rule statement → while ( expression ) or statement → while ( expression ) statement is returned as a correcting rule.

Figure 4 Algorithm for building symbol strings

Figure 5 Example of rule inference process

3.1.5 Inferring a rule of form A → α anew β: The approach discussed previously works if the correcting rule is of the form A → anew α (i.e. the RHS of the rule starts with a new terminal). In this section, we weaken this restriction by making some modifications in the LHSs and RHSs gathering phases of the previous approach [27].

LHSs gathering phase: The input program, a1,n, is parsed with the LR parser generated from the approximate grammar. Unlike the earlier approach, where LHSs were gathered only after the parser arrives at the error state (at a new terminal), here possible LHSs are gathered from each configuration the approximate parser passes. That is, at each step, the approach checks the top itemset of the parser stack; if it contains an item of type [A → α • B β], then B is added to the set of possible LHSs. Once the parser reaches the error state, a forced reduction is performed in the same manner as discussed previously and other possible LHSs are collected.

The idea behind this modification is as follows. Suppose af is the first occurrence of the new terminal in a1,n, where f indexes the first and outermost occurrence of the correcting rule in a1,n. Suppose the substring covered by the first occurrence of the correcting rule starts from the index m (m ≤ f). If a1,n is parsed by a complete parser, the parser will start recognising the substring covered by a correcting rule from am. Since the value of m is not known, the modified approach considers each index i (0 ≤ i < f) as a possible starting point of the substring covered by the missing rule. Therefore possible LHSs are gathered from each configuration the approximate parser passes while parsing the input program. In the previous approach, the value of m was always equal to f (i.e. the index of the first new terminal) because the correcting rule always started with a new terminal.

RHSs generation phase: Since the RHS of the correcting rule is of the form α anew β (where α, β ∈ (N ∪ T)* and anew ∈ Tnew), we divide a possible RHS into two parts: (1) the part which occurs to the left of the new terminal, that is, α, and (2) the part which occurs to the right of the new terminal, that is, β. The approach builds sets of possible α and possible β separately and then uses these two sets to build a set of possible RHSs. The set of all possible symbol strings that occur to the left of the new terminal is denoted as RL. Similarly, the set of all possible symbol strings that occur to the right of the new terminal is denoted as RR.

For building the sets RL and RR, first the input program a1,n is parsed with the approximate grammar using the CYK parser. Suppose f and l are the indexes of the first and the last occurrences of the new terminal. For building RL, each index i (1 ≤ i < f) is considered as a possible starting point of the substring derived by a correcting rule, and a set of symbol strings that can derive the substring ai,f-1 is computed and added to the set RL. Similarly, for building the set RR, each index j (l < j ≤ n) is considered as a possible endpoint of the substring derived by a correcting rule, and all symbol strings that can derive the substring al+1,j are added to the set RR. The set of possible RHSs R is built by concatenating the symbol strings taken from the sets RL and RR as follows:

R = {α anew β | α ∈ RL and β ∈ RR}

Note: The sets RL and RR will both additionally contain the empty string ε. This is to cover the case when the RHS is of the form anew α or α anew (i.e. the new terminal occurs at the first or the last place in the RHS). The rule-building and rule-checking phase is the same as that discussed in the previous approach.

3.1.6 Using multiple programs: Since the syntax of each new terminal can be expressed with a single grammar rule, we can use the possible rules computed from multiple programs to obtain a reduced number of possible rules. Fig. 1 is changed as follows: GATHER_POSSIBLE_LHSs is called for each program to obtain a set of possible LHSs from each program, and then their intersection is computed to obtain a reduced set of possible LHSs. Similarly, a set of possible RHSs is computed from each program and then their intersection is used as the set of possible RHSs. The rule-building phase uses the reduced sets of possible LHSs and RHSs to build the set of possible rules.
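The intersection step itself is simple set arithmetic; a small self-contained Java illustration follows (the candidate names are made up).

import java.util.*;

public class IntersectCandidates {
    // Keep only the candidates that every program supports.
    static <T> Set<T> intersectAll(List<Set<T>> perProgram) {
        Set<T> common = new HashSet<>(perProgram.get(0));
        for (Set<T> s : perProgram) common.retainAll(s);
        return common;
    }

    public static void main(String[] args) {
        List<Set<String>> lhssPerProgram = List.of(
            Set.of("stmt", "expr", "decl"),
            Set.of("stmt", "expr"),
            Set.of("stmt"));
        System.out.println(intersectAll(lhssPerProgram)); // prints [stmt]
    }
}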

3.2 Grammar completion with multiple rules

A simple extension of the algorithm discussed in the previous section can address those situations when multiple rules are required to complete the grammar. An iterative algorithm with backtracking is proposed, where in each iteration a set of possible rules corresponding to a new terminal is built.

For systematically building a set of possible rules corresponding to a new terminal, programs are grouped according to the layout of new terminals in the programs. For each new terminal, K, two groups of programs, PK and P'K, are made. PK is the set of all programs in which K is the first new terminal and P'K is the set of all programs in which K is the last new terminal. The above grouping is done for the following reason.

For building a set of possible RHSs of rules corresponding to a new terminal K, those programs are used where K is the last new terminal (i.e. programs in P'K). For building a set of possible LHSs of rules corresponding to K, those programs are used where K is the first new terminal (i.e. programs in PK). Hence, the groups PK and P'K are used to build a set of possible rules (PRK) corresponding to K.


The approach is described in Fig. 6. Suppose the initial grammar is G = (N, T, P, S), the set of new terminals is Tnew = (T' − T) = {K1, ..., Km} and the set of programs is P. In each iteration, a pair of groups (PKi, P'Ki) is made for each new terminal Ki ∈ Tnew. Next, a terminal Ki ∈ Tnew is selected such that the sets PKi and P'Ki are non-empty, and the set of possible rules corresponding to Ki is built using the sets PKi and P'Ki. The set of possible LHSs (LKi) of rules corresponding to Ki is gathered from the programs in PKi using the algorithm GATHER_POSSIBLE_LHSs. The set of possible RHSs (RKi) of rules corresponding to Ki is gathered from the programs in the set P'Ki using the algorithm BUILD_RHSs. The set of possible rules PRKi corresponding to Ki is built by taking the cross-product of LKi and RKi. One rule from PRKi is selected and added to G, and Ki is removed from the set of new terminals (Tnew). Since Ki is no longer a new terminal and a rule corresponding to Ki has been added to G, the input programs are grouped again according to the layout of the modified set of new terminals in the next iteration.

The above steps are repeated until a rule corresponding to each new terminal has been added to G. Then the modified grammar is checked for completeness. If the grammar is complete w.r.t. P, then the rules added to G in the different iterations are collectively returned as a correcting set of rules; otherwise we backtrack to one of the previous iterations and select another rule. (Note that in some iterations the algorithm may not find a new terminal Ki ∈ Tnew such that the sets PKi and P'Ki are both non-empty. In such situations, it builds possible rules corresponding to those new terminals Kj for which at least the set P'Kj is non-empty. The set of possible RHSs corresponding to Kj is computed using P'Kj and the set of possible LHSs is taken to be N, the set of all non-terminals of G.)
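Viewed as a search, the iteration with backtracking is a depth-first traversal that commits to one rule per new terminal and applies the completeness check at the leaves. A schematic Java sketch is given below; the Oracle interface is hypothetical and hides the regrouping and the LHS/RHS phases.

import java.util.*;

public class InferRules {
    record Rule(String lhs, List<String> rhs) {}

    // Hypothetical hooks; the paper derives these from the groups PK and P'K.
    interface Oracle {
        List<Rule> possibleRules(String newTerminal, List<Rule> chosenSoFar);
        boolean complete(List<Rule> chosenRules); // does G plus these rules parse all of P?
    }

    // Call with a mutable 'chosen' list, e.g. new ArrayList<>().
    static Optional<List<Rule>> search(Oracle g, List<String> newTerminals,
                                       List<Rule> chosen) {
        if (newTerminals.isEmpty())               // one rule picked per new terminal
            return g.complete(chosen) ? Optional.of(List.copyOf(chosen))
                                      : Optional.empty();
        String k = newTerminals.get(0);
        List<String> rest = newTerminals.subList(1, newTerminals.size());
        for (Rule candidate : g.possibleRules(k, chosen)) {
            chosen.add(candidate);
            Optional<List<Rule>> found = search(g, rest, chosen);
            if (found.isPresent()) return found;
            chosen.remove(chosen.size() - 1);     // backtrack: try another rule for k
        }
        return Optional.empty();
    }
}

Note that, as in the paper, completeness is only tested once every new terminal has been assigned a rule; intermediate grammars are never checked.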

Figure 6 Inferring multiple rules

3.3 Proof of correctness

We will first prove the correctness of the approach for those situations when a single rule is sufficient to complete the grammar, and later show how the proof can be extended to multiple rules.

3.3.1 Correctness of INFER_SINGLE_RULE: We will show that a correcting rule will always fall in the set of possible rules (PR) built by the INFER_SINGLE_RULE algorithm. Although we assume that the missing rule is of the form A → anew α, the proof can be extended to rules of the form A → α anew β. Since there exist many correcting rules, we pick one correcting rule (suppose r = D → η) and show that D will fall in L and η will fall in R, and therefore D → η will fall in PR. The grammar obtained after adding D → η to G is denoted as G'. G' is a complete grammar w.r.t. P. The parser generated from a complete grammar (G') is called a complete parser (or modified parser) and denoted as ⊢G'. The parser generated from G is represented as ⊢G. a1,n is the input program and the substring derived by the first occurrence of rule r starts from index i.

It is evident that ⊢G, while parsing a1,n, will get stuck after shifting the token ai-1, because ai is a new terminal and does not fall in the set T. Suppose the sentential form corresponding to the parser configuration after shifting the token ai-1 is α ai ... an. Since the derivation α ⇒* a1,i-1 contains productions from G alone, ⊢G and ⊢G' will perform the same sequence of shifts and reductions until they shift the token ai-1. Suppose ⊢G' makes a sequence of reductions [p1, ..., pl] between the shifts of ai-1 and ai; these reductions are blocked when the program is parsed with ⊢G. The reductions p1, ..., pl are performed by ⊢G' before shifting ai; hence each one will be a production in the original grammar, because the substring derived by the missing rule starts from ai. Since GATHER_POSSIBLE_LHSs, through forced reductions, performs all the blocked reductions, it will definitely perform the reduce sequence [p1, ..., pl].

Lemma 2: The LHS of the correcting rule (i.e. D) will fall in the set L.

Proof: Suppose that after making the reductions [p1 ... pl], ⊢G reaches the configuration (s0 X1 s1 ... Xl sl, ai ai+1 ... an). We show that one of the non-terminals collected from the top itemset will be D. Suppose none of the collected non-terminals is D and the LHS is obtained from some state sj inside the stack. Suppose the itemset corresponding to the state sj has an item it = [A → γ • C δ] and the non-terminal C is the LHS of a correcting rule; then the string of symbols Xj+1 Xj+2 ... Xl (which includes all the symbols from the state sj+1 up to the top of the stack) will be appended before the new terminal in the RHS of the correcting rule, that is, the rule will be of the form C → Xj+1 Xj+2 ... Xl anew γ (γ is a string of terminals and non-terminals). This contradicts our assumption that η starts with a new terminal. Hence, D will fall in L. □


Lemma 3: The RHS of the correcting rule (i.e. η) will fall in the set R.

Proof: For building a set of possible RHSs, the algorithm BUILD_RHSs starts from the index of the last occurrence of the new terminal (suppose i) and adds all the symbol strings that can derive the substring ai,k (i < k ≤ n) to the set R. Suppose that the substring ai,j is derived by η, that is, by the last occurrence of the rule D → η. No substring derived from another instance of a correcting rule will be nested within the substring ai,j, because ai,j is the last and innermost occurrence of a substring derived by r. Since the algorithm considers each index k (i < k ≤ n) as a possible endpoint of the substring covered by a correcting rule, it will definitely consider index j. Since the BUILD_STRINGS algorithm constructs all the symbol strings that can derive a particular substring, η will be constructed while constructing the symbol strings that can derive ai,j. Hence, η will fall in the set R. □

Lemma 4: The approach will always return a correcting rule.

Proof: From Lemmas 2 and 3, the LHS and RHS of a correcting rule (D → η) will be in L and R, respectively. Therefore PR will contain D → η. Since the algorithm iteratively checks the correctness of the rules, it will select r in some iteration and return it as a correcting rule. □

3.3.2 Correctness of INFER_RULES: The correctness of the INFER_RULES algorithm, when the missing rules are of the form A → anew α, can be proved by showing that in each iteration the algorithm has enough information for building a set of possible rules corresponding to at least one new terminal (because there exists at least one new terminal, K, for which at least the set PK is non-empty), and then using the results of the single missing rule case discussed earlier.

Note: INFER_RULES will not work in those situations when the missing rules are of the form A → α anew β and the algorithm does not find (in some iteration) a new terminal K for which the sets PK and P'K are both non-empty.

3.4 Time complexity of INFER_RULES

Since INFER_RULES is a generalised version of INFER_SINGLE_RULE, we discuss only the time complexity of INFER_RULES. The algorithm INFER_RULES involves building a parser, parsing each program, building a set of possible rules and then checking the correctness of the rules. We first describe the time complexity of each step and then compute the overall time complexity. The size of an LR(1) parser for a grammar of size m (the size of a grammar is expressed as the sum of the lengths of all the productions, where the length of a production B → β is 1 + length(β), and length(β) is the number of symbols in β) can be O(2^(c√m)) in the worst case (c > 0 is a constant) [28]; hence, the worst-case time taken in building an LR(1) parser for a grammar of size m is O(2^(c√m)). Therefore the LHSs gathering phase takes O(n) + O(2^(c√m)) time. The maximum number of possible symbol strings of length less than or equal to max_length is v^max_length, where v = |N ∪ T|. The time taken to compute the set of symbol strings is O(n^3), because the computations done for larger substrings reuse the computations already done for the smaller substrings, in the same way as the CYK parser works. Hence, the time taken to build all the symbol strings is O(n^3 · v^max_length). The upper bound on the number of possible rules is O(v^max_length · |N|) = O(v^(max_length+1)).

Each iteration of INFER_RULES involves building a set of possible rules and adding one of them to the grammar; this takes O(n + 2^(c√m) + n^3 · v^max_length) time. The number of possible rules for each keyword is bounded by O(v^(max_length+1)). Since the algorithm checks the completeness of only those grammars which are not more than |Tnew| rules away from the initial grammar, the maximum number of iterations of the algorithm is equal to the number of nodes in a complete tree where the degree of each tree node is bounded by the number of possible rules, that is, at most O(v^(max_length+1)); such a tree has O((v^(max_length+1))^|Tnew|) = O(v^((max_length+1)·|Tnew|)) nodes. Hence, the maximum number of iterations of the algorithm is O(v^((max_length+1)·|Tnew|)). The maximum number of possible combinations of rules for all new terminals is p^|Tnew| (p is the upper bound on the number of possible rules for each keyword, that is, p = v^(max_length+1)). Hence, the worst-case time taken by the algorithm is O(n + 2^(c√m) + n^3 · v^max_length) · O(v^((max_length+1)·|Tnew|)) + O(v^((max_length+1)·|Tnew|)) · O(n + 2^(c√m)) = O(v^((max_length+1)·|Tnew|) · (2^(c√m) + n^3)).

4 Optimisations

The approach we have discussed above usually results in a very large set of possible rules to be checked, because the grammars of PLs are normally large (typically 200–400 productions). In this section, we discuss some optimisations to reduce the number of possible rules.

4.1 Utilising unit productions

PL grammars often have a large number of unit productions (a unit production has a single non-terminal as its RHS), as shown in Table 1, taken from [29]. We use this property to reduce the number of possible RHSs to be checked: we add only the most general symbol strings to the set of possible RHSs to be checked. To achieve this, the algorithm BUILD_STRINGS is modified as follows.


Suppose that cells [X1, X2] and [Y1, Y2] are used for building symbol strings, and X1 ⇒* X2 and Y1 ⇒* Y2. Rather than adding all the symbol strings built from these cells, that is, X1Y1, X1Y2, X2Y1 and X2Y2, we add only X1Y1 to the set of possible RHSs. A rule whose RHS is X1Y1, that is, A → X1Y1 (for some A ∈ L), is sufficient for checking the incorrectness of the rules whose RHSs are X1Y2, X2Y1 or X2Y2 (that is, A → X1Y2, A → X2Y1, A → X2Y2), because A → X1Y1 is more general than the other rules. Hence, the number of possible RHSs to be checked can be reduced without compromising the correctness of the approach. This optimisation significantly reduces the number of possible rules.
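Concretely, if the closure of the unit-production relation is available, a candidate set can be thinned to its most general members. The following small Java illustration uses a made-up unit-production chain; the symbol names are assumptions, not taken from the paper's grammars.

import java.util.*;

public class MostGeneralSymbols {
    // derives.get(X) holds every Y with X =>* Y through unit productions alone.
    static Set<String> mostGeneral(Set<String> symbols,
                                   Map<String, Set<String>> derives) {
        Set<String> keep = new HashSet<>(symbols);
        for (String x : symbols)
            for (String y : symbols)
                if (!x.equals(y)
                        && derives.getOrDefault(x, Set.of()).contains(y))
                    keep.remove(y); // y is derivable from x, so x is more general
        return keep;
    }

    public static void main(String[] args) {
        // Illustrative closure: expr =>* term =>* factor.
        Map<String, Set<String>> derives = Map.of(
            "expr", Set.of("term", "factor"),
            "term", Set.of("factor"));
        System.out.println(mostGeneral(Set.of("expr", "term", "factor"), derives));
        // prints [expr]: only the most general candidate survives
    }
}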

4.2 Optimisation in rule checking process

In this section, we propose an optimisation in the rule-checking process. We propose a modified CYK parsing algorithm with which the correctness of an RHS w.r.t. a set of possible LHSs can be checked in a single CYK parse. That is, for a symbol string α ∈ R and a set of possible LHSs L, the correctness of all the rules with α as their RHS and some non-terminal A ∈ L as their LHS can be checked in a single pass of the CYK algorithm. This is achieved by incorporating some additional steps in the CYK parser. The number of invocations of the rule-checking step in the earlier approach was |L| · |R| in the worst case, because each rule in the set PR (built from the sets L and R) is checked individually, whereas it becomes |R| after this optimisation.

The correctness of an RHS, α, w.r.t. a set of possible LHSs, L, is checked as follows. First, all the rules of the form B → α (∀B ∈ L) are added to the approximate grammar. The input program is then parsed with the modified grammar using the CYK parser. We illustrate this optimisation with an example. Consider an approximate grammar G = (V, T, P, S), where T = {a, b, d}, V = {A, B, C, S} and P has the following productions:

1. r1: S → a A B C
2. r2: S → a C d
3. r3: A → a
4. r4: A → A B
5. r5: B → b
6. r6: C → A B

Table 1 Summary of unit productions in different programming languages' grammars

Language | no. of productions | no. of unit productions
Algol | 170 | 78
ADA | 576 | 218
Cobol | 519 | 193
CPP | 785 | 237
CSTAR | 312 | 169
C | 227 | 106
Delphi | 385 | 177
grail | 122 | 20
Java | 265 | 124
MATLAB | 92 | 40
Pascal | 188 | 84


Suppose the input program is aabeabaeabbeab and we check the correctness of the RHS eab w.r.t. the possible LHSs set {A, B, C}. Rules A → eab, B → eab and C → eab will be added to the grammar and the input program aabeabaeabbeab will be parsed with the modified grammar. The parse tree is shown in Fig. 7; it is slightly different from a normal parse tree. The root of each subtree contains a set of pairs; the first part of a pair is a non-terminal which derives the substring covered by the subtree, and the second part is a set of non-terminals which is discussed later. For example, node 6 in Fig. 7 contains the pairs (A, {B}) and (C, {B}); the first parts of the pairs, that is A and C, show that A and C derive the substring a7,10 = aeab. The second part, that is, {B}, is called a set of unfiltered non-terminals. Dashed edges are not part of the parse tree.

The program is parsed with the modified CYK parser, which filters out incorrect LHSs for a given RHS while parsing the program. The idea is as follows. Initially, each rule with RHS α (i.e. A → α, for all A ∈ L) is considered a correcting rule and the parse tree is built bottom-up with the CYK parser. The CYK parser filters out all those non-terminals F for which F → α is not used in building subtrees of larger substrings. To support the filtering operation, a set of non-terminals, called unfiltered non-terminals, is associated with each non-terminal of each CYK cell. This set is shown as the second part of the pair in Fig. 7. For example, node 6 contains a pair (A, {B}), where A is the root of the subtree covering aeab and {B} is the set of unfiltered non-terminals. The set of unfiltered non-terminals associated with a non-terminal A ∈ C[i, j] is denoted as UFA(i, j). B ∈ UFA(i, j) implies that the rule B → α is used in the derivation A ⇒* ai,j, that is, A ⇒* δ B γ ⇒ δ α γ ⇒* ai,j (using B → α). For example, consider the substring a7,10 = aeab in Fig. 7; since B → eab is used in the derivation of a7,10 (A ⇒ AB ⇒ aB ⇒ aeab, using A → AB, A → a and B → eab), UFA(7, 10) = {B}. This set is associated with non-terminal A at node 6 in Fig. 7. UF sets are maintained as follows.

Figure 7 Example of LHSs filtering

First, ∀A ∈ C[i, i] (0 ≤ i ≤ n), UFA(i, i) is initialised as the empty set; next, the program is parsed with the CYK parser. Since the modified grammar has rules B → α (∀B ∈ L), whenever a substring derived by α is encountered (suppose ai,j), all non-terminals of L are added in C[i, j]. Initially, each new rule B → α (B ∈ L) is considered correcting, and therefore the set UFB(i, j) = {B} is associated with each non-terminal B (B ∈ C[i, j]). For example, consider nodes 3, 8 and 9 in Fig. 7. Non-terminals A, B and C are the roots of the subtrees covering the substring eab (substrings a4,6, a8,10 and a12,14). The figure shows the roots of the subtrees and their corresponding UF sets. For example, the roots of the subtree covering a4,6 (node 3) are A, B and C, and the associated UF sets are UFA(4, 6) = {A}, UFB(4, 6) = {B} and UFC(4, 6) = {C}, respectively.

Since the CYK parser builds the parse tree bottom-up, suppose that, while building the entry for cell C[p, q] (i.e. computing the non-terminals that derive ap,q), the production A → X Y is used, where X ∈ C[p, k] and Y ∈ C[k+1, q]. The set UFA(p, q) is updated according to the following rules.

1. UFA(p, q) = UFX(p, k) ∪ UFY(k+1, q) if at least one of the sets UFX(p, k) and UFY(k+1, q) is empty. For example, consider node 2 in Fig. 7. The production A → AB is used while building the cell entry C[2, 3], where both UFA(2, 2) and UFB(3, 3) are empty; hence UFA(2, 3) is empty. Consider node 6 in Fig. 7. The production A → AB is used, where UFA(7, 7) is empty but UFB(8, 10) is non-empty; hence UFA(7, 10) = {B}.

2. UFA(p, q) = UFX(p, k) ∩ UFY(k+1, q) if both UFX(p, k) and UFY(k+1, q) are non-empty. If UFA(p, q) is computed as empty, then the non-terminal A is dropped from the cell C[p, q]. Consider node 4 in Fig. 7. The production A → AB is used for building the cell entry for C[7, 14]. Here the sets UFA(7, 11) and UFB(12, 14) are both non-empty, therefore UFA(7, 14) = UFA(7, 11) ∩ UFB(12, 14) = {B}. Consider node 14, where the UF sets of the subtrees are UFA(8, 11) = {A} and UFB(12, 14) = {B}; hence UFA(8, 14) = UFA(8, 11) ∩ UFB(12, 14) = {}. Since UFA(8, 14) is empty, the non-terminal A is dropped from the cell C[8, 14].

Non-terminals which are correct LHSs w.r.t. a given RHS (i.e. α here) get added to the unfiltered sets of non-terminals of the subtrees of larger substrings and climb upward in the parse tree (this is shown by dashed arrows in Fig. 7). The non-terminals which are finally added to the set UFS(1, n) (in Fig. 7 it is {B}) are correct LHSs for the given RHS. Incorrect non-terminals are filtered out during the parsing. If the set UFS(1, n) is empty after parsing, then no rule with RHS α is correct. For the given program and the possible RHS eab, B is the correct LHS and A and C are not, because they are filtered out. Hence, B → eab is a correcting rule.
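The two UF-set update rules translate directly into code. Below is a minimal Java sketch of just the combination step applied when a production A → X Y joins two subtrees; an empty Optional means that A is dropped from the cell. The surrounding CYK machinery is omitted.

import java.util.*;

public class UnfilteredSets {
    // Combine the children's UF sets when A -> X Y builds a larger subtree.
    static Optional<Set<String>> combine(Set<String> ufX, Set<String> ufY) {
        if (ufX.isEmpty() || ufY.isEmpty()) {
            Set<String> union = new HashSet<>(ufX);   // rule 1: union when at least
            union.addAll(ufY);                        // one side is empty
            return Optional.of(union);
        }
        Set<String> inter = new HashSet<>(ufX);       // rule 2: intersection when
        inter.retainAll(ufY);                         // both sides are non-empty
        return inter.isEmpty() ? Optional.empty()     // drop A from the cell
                               : Optional.of(inter);
    }

    public static void main(String[] args) {
        System.out.println(combine(Set.of(), Set.of("B")));    // Optional[[B]]
        System.out.println(combine(Set.of("A"), Set.of("B"))); // Optional.empty: A filtered out
        System.out.println(combine(Set.of("B"), Set.of("B"))); // Optional[[B]]
    }
}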

5 Implementation

We have implemented the complete approach for inferring a single correcting rule as well as multiple correcting rules.


The implementation is done in Java and incorporates various optimisations. The schematic diagram of the whole grammar inference process is shown in Fig. 8. The input to the system is a grammar written in the yacc-readable format and a set of input programs. New terminals are tagged as %newkeyword in the grammar specification file. The grammar and the set of programs are fed to the group programs module (Fig. 8). Here programs are grouped according to the layout of the new terminals. After grouping the programs, the groups are fed to the LHSs generator and the RHSs generator modules as discussed in the earlier sections.

The main component of the LHSs generator module is a modified LR parser generator. Since our approach uses an additional operation called 'forced reduction', which is not present in conventional LR parsers, we use a modified LR(1) parser generator which generates an LR(1) parser with support for the forced reduction operation. For supporting the forced reduction operation, we use a special data structure called a graph-structured stack, which can represent multiple stacks compactly [30].

The RHSs generator module consists of a modified CYK parser. The modified CYK parser works in several modes that are used in different optimisations. Input programs are parsed with the CYK parser and the possible RHSs are generated. The rule-building module obtains a set of possible LHSs and a set of possible RHSs as input and builds a set of possible rules. The modified grammar is checked for completeness in the grammar-checking module (shown as the check grammar module in Fig. 8).

6 Experiments

We performed experiments on four PLs, viz. Java, C, MATLAB and Cobol. The grammars of the languages were obtained from [29]. We also performed experiments on a real dialect of C, that is, C*. Since the grammars were not truly LR, we used precedence and associativity to remove the non-LRness of the grammars (this is the most commonly used method for resolving conflicts in LR parsers [23]). In our implementation, the parser generator generates an LR parser only if there are fewer than 15 conflicts. Therefore, in the further discussion, we call a grammar LR if there are fewer than 15 shift-reduce conflicts. Experiments were conducted on a machine with an Intel Pentium(R) 4, 2.4 GHz processor and 512 MB RAM. To conduct the various experiments, we removed the rules corresponding to different keywords and operators. To validate the approach for inferring a single correcting rule, we first removed the rules corresponding to a single keyword. Most of the experiments were done in this setting; however, there are also a few experiments on inferring multiple correcting rules. We also performed experiments on those cases where the correcting rule was of the form A → α anew β (i.e. a new terminal occurs at an arbitrary position in the RHS of the rule). Since such rules are mostly used for representing expressions (e.g. expressions involving additive or multiplicative operators), we removed the rules corresponding to different expressions to perform these experiments.

The only parameter in our approach is the maximum length of the RHSs. After a study of different PL grammars, we found that the average RHS length of productions in most PL grammars was close to six. Hence, we chose six for this parameter in our experiments. In all the experiments, we found that there were at least a few correcting rules with length less than or equal to six, that is, the software never failed due to this parameter value.

Since there is no test suite available for checking the performance of a grammar inference approach in the PL domain, we downloaded programs and grammars from different sites to make a small test suite (the experimental suite can be obtained from http://www.cse.iitk.ac.in/users/alpanad/grammars/). The summary of all the experiments is given in Table 2. The size of the grammar is expressed as the sum of the lengths of the RHSs of all the productions in the grammar.

Figure 8 Schematic diagram of grammar inference module


Table 2 Summary of experiments

Language | Constructs | # Single missing rule experiments | # Multiple missing rules experiments | Size of programs | Size of grammar
C | for, while, switch, case, break | 25 | 5 | 100 | 465
Java | switch, case, try, enum, while, for, if | 160 | 8 | 75 | 568
MATLAB | for, case, switch, otherwise, while, &&, /., *., ,/, ==, --, * | 122 | 3 | 40 | 239
Cobol | move, display, read, perform | 21 | 1 | 64 | 674

6.1 Inferring a single rule without optimisation

In this section, we discuss the experiments done to validate our approach on different grammars. In each experiment, we removed a rule corresponding to one construct and then fed the result to our software, and the software returned a correcting rule. The rule returned by the software need not be the same as the rule removed, as there can be many possible correcting rules. The results of some of the experiments where a single program was used for inferring correcting rules are shown in Table 3; the table also gives examples of the correcting rules returned by the software. The number written inside parentheses in the last column of the table shows the number of rules the system checked before arriving at a correcting rule.

We observe that in some of the cases the time taken by the system is only a few seconds, but in some cases it is several hours. The time spent by the inference process involves (1) the time in generating the possible rules and (2) the time in checking the possible rules. Since the time spent in the first step is more or less constant when the grammar and input program size are the same, the major part of the time taken by these experiments depends on how early the approach arrives at a correcting rule. For example, in the case of the Java grammar and the 'while' construct, the time taken by the system is 102263 s (i.e. around 28 h); this is because the system checked 5922 rules before arriving at a correcting rule. (This time can be reduced by using a faster LR-parser generator; the parser generator used by us is not as efficient as yacc or bison.) In the case of the 'enum' construct of the Java grammar, the time taken was 83.7 s, because the very first rule checked by the system was a correcting rule.

As mentioned earlier, with multiple programs we can reduce the number of possible rules to be checked, which can otherwise be very large (Table 3). We experimented and studied the reduction achieved in the number of possible rules by this optimisation. Table 4 demonstrates the results. Each row represents the language and the construct removed. The numbers of possible rules obtained from each program are shown in the third column, separated by commas. The second column shows the size of the intersection of the sets of possible rules obtained from the programs. We can observe that the size of the intersection of possible rules is 10–100 times smaller than that of the unoptimised approach in many cases. In some experiments it reduces drastically; for example, in the case of the Java grammar and the while construct, the reduction achieved is 741 times.

6.2 Unit production optimisation

In this section, we study the effect of unit productionoptimisation on the grammar inference process. We studythe reductions achieved in the number of possible rules inPL language grammars. In each experiment, a single rulecorresponding to a keyword is removed and a set of allpossible rules and a set of rules with most general RHSsare built. The size of the above two sets are then compared.Table 5 shows the outcome of our experiments. The firstcolumn shows the language and the construct removedfrom the grammar. The second column compares thenumber of all possible rules and the number of rules withthe most general RHSs (shown as All/MG in the table)obtained from different programs. The last column showsthe overall reduction achieved after considering theintersection of the sets of rules obtained from each programin both the cases. That is, the last column shows thereduction achieved by the combination of unit productionoptimisation and the use of multiple programs.

We can observe that the above optimisation reduces the search space of possible rules to a great extent (i.e. by a factor of 10–500). Since the reduction achieved here is due to the abundance of unit productions, this optimisation should hold for most PL grammars (Table 1).
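As a rough illustration of how unit productions induce "most general" representatives, here is a sketch of ours (not the paper's algorithm); it assumes a grammar given as (lhs, rhs-tuple) pairs and acyclic unit-production chains. A unit production A -> B allows B to be replaced by the more general A in a candidate RHS, so only the most general representatives need to be enumerated.

from collections import defaultdict

def unit_parents(grammar):
    """Map each symbol B to the set of nonterminals A with a unit production A -> B."""
    parents = defaultdict(set)
    for lhs, rhs in grammar:
        if len(rhs) == 1:
            parents[rhs[0]].add(lhs)
    return parents

def most_general(symbol, parents):
    """Follow unit productions upwards from `symbol`; symbols with no
    unit parent of their own are the most general representatives."""
    seen, stack, roots = set(), [symbol], set()
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if parents[s]:
            stack.extend(parents[s])
        else:
            roots.add(s)
    return roots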

Since some of the rules obtained from the unit production optimisation may cause non-LR-ness in the grammar, the use of an LR parser for checking the correctness of the rules may sometimes fail. To study how often the set of rules obtained from the unit production optimisation contains a rule that is both correcting and LR-preserving, we checked the correctness of the rules in the reduced set using an LR parser. We also compared the rules returned by the LR-parser checker (where an LR parser is used for checking correctness) with the rules returned by the CYK-parser checker.


Table 3 Summary of the unoptimised approach on single rule inference

Language | Missing construct | Size of program (LOC) | Number of possible rules | Rule returned | Execution time in s (rules checked)
C | break | 103 | 6.3 × 10^5 | labeled_stmt → BREAK stmt_list | 1190.7 (79)
C | case | 103 | 7.3 × 10^6 | if_stmt → CASE cond_expr COLON stmt_list | 418.1 (4)
C | for | 89 | 9.0 × 10^5 | expr_stmt → FOR ( \ stmt_list cond_expr \ ) stmt_list | 236.57 (3)
C | switch | 78 | 1.9 × 10^5 | select_stmt → SWITCH translation_unit | 2324.79 (184)
Java | enum | 31 | 1.3 × 10^6 | Modifiers → ENUM IDENTIFIER | 83.7 (1)
Java | if | 26 | 3.5 × 10^8 | CondExpr → IF VarInit | 1019.18 (56)
Java | switch | 39 | 4.4 × 10^5 | Stmt → SWITCH ComplexPrimary Block | 21254.91 (1224)
Java | while | 47 | 1.8 × 10^7 | GuardingStmt → WHILE ComplexPrimary FieldDecl | 102263.69 (5923)
MATLAB | case | 20 | 5.7 × 10^6 | equality_expr → CASE CONSTANT | 38.99 (1)
MATLAB | otherwise | 29 | 6.8 × 10^5 | select_stmt → OTHERWISE IDENTIFIER | 44.07 (4)
MATLAB | switch | 20 | 1.0 × 10^7 | otherwise_stmt → SWITCH stmt_list END | 16641.19 (3626)
MATLAB | while | 12 | 4.9 × 10^6 | unary_expr → WHILE stmt_list END | 16914.48 (3745)
Cobol | display | 26 | 2.5 × 10^4 | stmt_list → DISPLAY ident_or_string id_stmt stmt_list | 18 (1)
Cobol | perform | 85 | 3.9 × 10^5 | ident → PERFORM loop_cond_part2 stmt_list END_PERFORM | 49.5 (1)
Cobol | read | 43 | 8850 | clause → READ ident_or_string opti_at_end_clause | 25 (1)

LOC stands for lines of code

Table 6 shows the results of the experiments. As is evident, the LR-parser checker did not fail in any of the cases. The times taken by both versions of the checker are compared in the bar charts shown in Fig. 9.

Since the LR checker accepts only LR-preserving rules, in some cases it took more time because it had to check more rules than the CYK checker. The time taken by the LR checker is the sum of the time taken in generating the parser and the time taken in parsing the programs [i.e. O(n), where n is the length of the input program]. The time in generating the parser depends only on the size of the grammar; therefore, for large programs the LR parser always outperformed the CYK parser, because the CYK parser is an O(n^3) algorithm. In our experiments, small-to-medium-sized programs were used, which did not make a significant difference between the LR checker and the CYK checker in terms of computation time. We can see in Fig. 9 that, except in a few cases, the LR checker performs better than the CYK checker. For example, in the experiment where the possible rule corresponding to 'while' in the C grammar is checked, the CYK checker performed better than the LR checker, because the LR parser had to check more rules than the CYK parser to obtain a rule that is both correcting and LR-preserving.
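The trade-off just described can be summarised in a toy cost model of ours; the constants below are placeholders, not measured values, and only the asymptotic shapes (constant generation cost plus a linear parse for LR, a cubic parse for CYK) come from the discussion above.

def lr_cost(n, gen_time=1000.0, per_token=0.01):
    """LR checking: parser generation (depends only on grammar size,
    modelled as a constant) plus an O(n) parse of the program."""
    return gen_time + per_token * n

def cyk_cost(n, per_token_cubed=0.001):
    """CYK checking: no generation step, but an O(n^3) parse."""
    return per_token_cubed * n ** 3

def pick_checker(n):
    """For long enough programs the linear parse always wins."""
    return 'LR' if lr_cost(n) < cyk_cost(n) else 'CYK'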

6.3 LHSs filtering optimisation in rule-checking phase

In this section, we discuss the effect of the LHSs filtering optimisation used in the rule-checking phase. The improvement gained from this optimisation is measured by comparing the time taken in checking all possible rules with the simple CYK parser against the time taken in checking all possible RHSs with the modified CYK parser. If L is the set of possible LHSs, then checking the correctness of a possible RHS α requires |L| invocations of the rule-checking process (i.e. invocations of the parser), whereas with the LHSs filtering optimisation only one invocation is needed. If t_r is the time taken by the simple CYK parsing algorithm and t_rhs is the time taken by the modified CYK parsing algorithm, then we compare the quantities t_r × |L| and t_rhs; we compared these two quantities for the C and Cobol grammars to obtain first-hand experience of the optimisation (Table 7).
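Why can one parser run serve all candidate LHSs? A CYK chart computes, for every substring of the input, the set of all nonterminals that derive it, so the viable left-hand sides can be read off a single chart instead of re-parsing once per candidate LHS. The recogniser below is our own minimal sketch of this property, assuming a CNF grammar encoded as two dicts; this is not the paper's modified parser or its data structures.

def cyk_chart(tokens, unary, binary):
    """unary: terminal -> set of nonterminals; binary: (B, C) -> set of A
    for rules A -> B C. Returns chart with chart[i][j] = all nonterminals
    deriving tokens[i:j]."""
    n = len(tokens)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1] = set(unary.get(tok, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        # Every nonterminal deriving this span is recorded,
                        # which is what makes LHS filtering a single pass.
                        chart[i][j] |= binary.get((b, c), set())
    return chart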

IET Softw., 2008, Vol. 2, No. 3, pp. 223–240doi: 10.1049/iet-sen:20070061

Page 13: Learning context-free grammar rules from a set of program

IETdoi

www.ietdl.org

Table 4 Number of possible rules generated from the programs and their intersection in different PL grammars

Language/construct | Size of intersection of possible rules | Number of rules obtained from different programs | Average reduction in no. of rules
C/for | 6.4 × 10^5 | 9.0 × 10^5, 1.3 × 10^6, 8.1 × 10^5, 9.0 × 10^5, 9.0 × 10^5, 1.2 × 10^6, 8.1 × 10^5 | 11.9
C/switch | 1.2 × 10^5 | 1.0 × 10^6, 1.9 × 10^5, 1.5 × 10^7, 1.5 × 10^6, 1.9 × 10^5, 4.6 × 10^6 | 225.6
C/case | 1.4 × 10^4 | 5.3 × 10^6, 7.3 × 10^6, 5.9 × 10^4 | 301.4
C/break | 1.3 × 10^2 | 8.8 × 10^3, 6.3 × 10^5, 5.2 × 10^2, 1.1 × 10^5 | 1441
Java/case | 4.7 × 10^4 | 4.0 × 10^5, 2.2 × 10^6 | 27.65
Java/for | 2.4 × 10^6 | 2.8 × 10^6, 2.5 × 10^6 | 1.1
Java/switch | 2.7 × 10^5 | 4.9 × 10^5, 4.4 × 10^5 | 1.7
Java/while | 2.9 × 10^4 | 1.8 × 10^7, 2.5 × 10^7 | 741.3
MATLAB/case | 5.4 × 10^6 | 1.1 × 10^7, 5.5 × 10^6 | 1.52
MATLAB/for | 4.4 × 10^5 | 3.2 × 10^6, 6.7 × 10^6, 6.0 × 10^6, 1.4 × 10^6, 4.6 × 10^6 | 38.9
MATLAB/otherwise | 6.8 × 10^5 | 6.8 × 10^5, 6.8 × 10^5, 6.8 × 10^5 | 1.0
MATLAB/while | 2.1 × 10^6 | 1.1 × 10^7, 4.9 × 10^6 | 14.2


Table 5 Comparison of number of all possible RHSs and number of most general RHSs

Language/construct | Number of all possible rules (All)/most general rules (MG) obtained from different programs | Intersection and reduction achieved
C/for | 9.0 × 10^5/1.4 × 10^4, 1.3 × 10^6/1.8 × 10^4, 9.0 × 10^5/1.4 × 10^4, 9.0 × 10^5/1.4 × 10^3, 1.2 × 10^6/1.8 × 10^4, 8.1 × 10^5/1.4 × 10^4, 8.1 × 10^5/1.4 × 10^4 | 6.4 × 10^5/8.0 × 10^3 = 80
C/switch | 1.0 × 10^6/2.7 × 10^4, 1.9 × 10^5/9.6 × 10^3, 1.5 × 10^7/6.2 × 10^4, 1.5 × 10^6/3.0 × 10^4, 1.9 × 10^5/9.6 × 10^3, 4.6 × 10^6/5.1 × 10^4 | 1.2 × 10^5/8.3 × 10^3 = 14.4
C/case | 5.3 × 10^6/3.4 × 10^4, 7.3 × 10^6/5.1 × 10^4, 5.9 × 10^4/2.9 × 10^3 | 1.5 × 10^4/8.6 × 10^2 = 17.4
Java/case | 4.0 × 10^5/1.3 × 10^4, 2.2 × 10^6/2.3 × 10^4 | 4.7 × 10^4/7.7 × 10^2 = 61
Java/for | 2.8 × 10^6/4.9 × 10^4, 2.5 × 10^6/4.1 × 10^4 | 2.4 × 10^6/2.2 × 10^4 = 109
Java/switch | 4.9 × 10^5/1.4 × 10^4, 4.4 × 10^5/3.5 × 10^5 | 2.7 × 10^5/9.1 × 10^3 = 29.6
Java/while | 1.8 × 10^7/1.4 × 10^5, 2.5 × 10^7/1.4 × 10^5 | 2.9 × 10^4/3.0 × 10^3 = 9.6
MATLAB/case | 1.1 × 10^7/4.1 × 10^4, 4.9 × 10^6/6.5 × 10^4 | 5.4 × 10^6/9.7 × 10^3 = 556.7
MATLAB/for | 3.2 × 10^6/3.4 × 10^4, 6.7 × 10^6/7.3 × 10^4, 6.0 × 10^6/6.3 × 10^4, 1.4 × 10^6/1.7 × 10^4, 4.0 × 10^6/4.7 × 10^4 | 4.4 × 10^5/4.8 × 10^3 = 91.6
MATLAB/otherwise | 6.8 × 10^5/5.4 × 10^3, 6.8 × 10^5/5.4 × 10^3, 6.8 × 10^5/5.4 × 10^3 | 6.8 × 10^5/5.4 × 10^3 = 125.9


Table 6 Comparison of LR parser and CYK parser as a grammar completeness checker

Language/construct | Rule returned by the LR-parser checker | Rule returned by the CYK-parser checker | No. of progs | Avg size of progs (LOC)
C/break | stmt → BREAK stmt_list | stmt → BREAK stmt_list | 4 | 114
C/case | stmt → CASE arg_expr_list COLON stmt_list | stmt → CASE arg_expr_expr COLON stmt_list | 6 | 131
C/for | stmt → FOR (stmt_list assign_expr) stmt_list | stmt → FOR (stmt_list stmt_list arg_expr_list) | 7 | 66
C/switch | stmt → SWITCH declarator stmt_list | stmt → SWITCH init_decl_list stmt_list | 6 | 131
C/while | expr_stmt → WHILE (arg_expr_list) stmt_list | stmt → WHILE arg_expr_list stmt_list | 7 | 101
Java/case | LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt | LocVarDecOrStmt → CASE ArrayInit COLON LocVarDecAndStmt | 4 | 84
Java/if | LocVarDecOrStmt → IF (ConstExpr) LocVarDecAndStmt | LocVarDecOrStmt → IF ConstExpr LocVarDecAndStmt | 5 | 97
Java/switch | LocVarDecOrStmt → SWITCH (ClassNameList) LocVarDecAndStmt | LocVarDecOrStmt → SWITCH ForIncr LocVarDecAndStmt | 4 | 84
Java/while | Block → WHILE (ArrayInitializers) | LocVarDecOrStmt → WHILE ArrayInitializers | 6 | 78
MATLAB/case | assign_expr → CASE array_list assign_expr | unary_expr → CASE translation_unit | 4 | 38
MATLAB/for | primary_expr → FOR translation_unit END | expr → FOR translation_unit END | 4 | 54
MATLAB/switch | primary_expr → SWITCH translation_unit END | primary_expr → SWITCH translation_unit END | 4 | 38.25
MATLAB/otherwise | stmt → OTHERWISE array_list | stmt → OTHERWISE array_list | 3 | 27
Cobol/move | if_clause → MOVE file_name_string loop_cond_part2 | if_clause → MOVE file_name_string loop_cond_part2 | 4 | 68
Cobol/read | if_clause → READ file_name_string opt_at_end_clause | if_clause → READ file_name_string opt_at_end_clause | 4 | 68
Cobol/perform | if_clause → PERFORM loop_cond_part2 stmt_list END_PERFORM | if_clause → PERFORM loop_cond_part2 stmt_list END_PERFORM | 3 | 56

These experiments were conducted on the C, Java and MATLAB grammars. In each experiment, we removed a rule corresponding to a keyword and built the set of possible rules. We then compared the times taken by the simple CYK parser and by the modified CYK parser in arriving at a correcting rule. The bar charts shown in Fig. 10 compare these times. We can see that the modified parser is either comparable to or better than the simple CYK parser. As an illustration of the potential saving, for the C experiment with switch and case removed (first row of Table 7), |L| = 67^2 ≈ 4.5 × 10^3 and t_r = 164 ms, so checking one RHS against every possible LHS costs t_r × |L| ≈ 7.3 × 10^5 ms, whereas the modified parser needs a single run of t_rhs = 386 ms.


6.4 Experiments on multiple rules inference

This section presents some experiments on multiple rule inference, that is, when more than one rule is needed to make the grammar complete. In each experiment, more than one keyword-based rule was removed from a PL grammar, and the grammar was then completed by inferring rules from a set of programs. These experiments were done with both optimisations enabled, that is, the unit production optimisation and the use of the modified CYK parser. Table 8 shows the results of the experiments. Our approach successfully inferred a correcting set of rules in each of these experiments. The time taken in most experiments was 2–15 min, but in some cases it took hours to infer the rules.
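For intuition, one way to organise such a search is to infer one correcting rule at a time and backtrack when a partial choice cannot be extended. The sketch below is ours, not necessarily the authors' algorithm, and reuses the same hypothetical helpers as the earlier sketch (candidates yields possible rules, parses_all checks all programs).

def infer_rules(grammar, programs, candidates, parses_all, depth=0, max_rules=6):
    """Return a list of rules completing `grammar`, or None if none is
    found within `max_rules` additions."""
    if parses_all(grammar, programs):
        return []  # grammar already complete w.r.t. the programs
    if depth == max_rules:
        return None  # give up on this branch
    for rule in candidates(grammar, programs):
        rest = infer_rules(grammar + [rule], programs, candidates, parses_all,
                           depth + 1, max_rules)
        if rest is not None:
            return [rule] + rest
    return None  # backtrack: no candidate works at this level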

Figure 9 Comparison of times taken by the LR parser and the CYK parser in the rule-checking module (times are in seconds on the Y-axis)

6.5 Experiments to infer C*-specific grammar rules

We discuss here a few experiments in which we inferred rules corresponding to different keywords, operators and declaration specifiers of the C* grammar when a C grammar and programs written in C* are given as input. We wrote small programs in C* which used constructs specific to C* (those which are not part of a standard C grammar). Table 9 summarises the additional constructs (which contain new terminals) of the C* grammar. This summary is not complete, as resources for C* are very scarce. (After an inquiry on forums such as comp.compilers, we could obtain only an incomplete manual and very few example programs.) We found that, except in a few cases, the additional rules in the C* grammar follow the assumptions we made in Section 2. One such exception is the declaration of a parallel data type. A parallel data type describes a structured collection of objects with a particular member type. For example, consider the following C* declaration:

shape [10]S;
int:S parallel_int;

'shape' is a new declaration specifier which is used to define the template of parallel objects. A parallel object is an object of a parallel data type. In the above example, the variable parallel_int represents a parallel object of type int whose structure is defined by shape S. That is, parallel_int represents an array of ten integers, and the operation on each element of parallel_int can be done in parallel on different processors. The declaration statement 'int:S parallel_int;' declares parallel_int to be of type 'int:S'. The grammar rule corresponding to the statement 'int:S parallel_int;' does not involve a new terminal; hence we do not discuss this case. However, the declaration corresponding to the shape keyword follows our assumption, and we therefore considered input programs containing the shape keyword.

Figure 10 Comparison of times taken by the modified CYK parser and the simple CYK parser in the rule-checking module (times are in seconds on the Y-axis)

Table 7 Experiments on the LHSs filtering optimisation (times are in milliseconds)

Language | Statements removed | |L| | t_rhs | t_r | t_r × |L| | Avg size of progs (LOC)
C | switch, case | 67^2 = 4.5 × 10^3 | 386 | 164 | 7.3 × 10^5 | 16
C | switch, case, break | 67^3 = 3.0 × 10^5 | 816 | 143 | 4.4 × 10^6 | 16
C | switch, case, break, default | 67^4 = 2.0 × 10^7 | 1.1 × 10^5 | 6.0 × 10^4 | 1.2 × 10^12 | 131
C | switch, case, break, default, for, while | 36 × 67^5 = 4.8 × 10^10 | 1.4 × 10^4 | 518 | 2.5 × 10^13 | 16
Cobol | perform | 5^1 = 5 | 3811 | 3268 | 16340 | 43
Cobol | display | 5^1 = 5 | 1237 | 680 | 3400 | 43
Cobol | read | 5^1 = 5 | 46172 | 43520 | 217600 | 43
Cobol | read, perform, move | 5 × 181^2 = 1.6 × 10^5 | 4092 | 3297 | 5.4 × 10^8 | 56


Table 8 Experiments on multiple rules inference

Language | Constructs | No. of programs | Avg size of a program (LOC) | Time
MATLAB | for, while | 2 | 33 | 1.5 min
MATLAB | for, switch, case, otherwise, while | 5 | 29 | 19.5 min
MATLAB | switch, case, otherwise | 3 | 27 | 2.2 min
Java | try, catch | 4 | 98 | 14.9 min
Java | if, while | 3 | 37 | 6.7 min
Java | switch, case, enum | 7 | 61 | 17.6 min
Java | switch, case, try, catch, enum | 7 | 61 | 24.23 min
C | switch, case | 1 | 16 | 3 s
C | switch, case, break | 1 | 16 | 7.3 s
C | switch, case, break, default, while, for | 1 | 16 | 14.1 s
Cobol | read, perform, move | 3 | 56 | 1.3 min

Table 9 Summary of new terminals in C*

New terminal | Type | Description
shape | declaration specifier | used for expressing the template of a parallel data type
shapeof | keyword | returns a pointer to a shape object
with | keyword | a control flow statement which operates on all the elements of a parallel data type
where | keyword | a control flow conditional statement for parallel objects; much like the 'if' statement, but operations are performed on parallel objects
CS__elemental | declaration specifier | used for facilitating parallel and non-parallel operations
everywhere | keyword | used for making each element of a parallel variable accessible within a function
positionof | keyword | NA
rankof | keyword | NA
__alignof__ | keyword | NA
__extension__ | keyword | NA
__attribute__ | keyword | NA
current | keyword | NA
dimof | keyword | NA
%% | operator | real modulus operator
<? | operator | minimum operator
>? | operator | maximum operator
<?= | operator | minimum assignment operator
>?= | operator | maximum assignment operator


Table 10 Rules inferred from programs written in the C* grammar

New terminals | Inferred rules | Time (s) | Size of prog (LOC)
shape, with | statement_list → WITH program statement_list; declaration_specifiers → SHAPE statement_list \ program | 3.2 | 25
shape, where | statement → SHAPE statement_list program statement_list; statement → WHERE return_expression statement_list | 4.2 | 28
with, where | statement → WHERE \ return_expression; statement → WITH \ program | 3.9 | 16
shape, elemental, with | external_declaration → SHAPE statement_list program; statement_list → WITH program statement_list; declaration_specifiers → ELEMENTAL statement_list | 11.7 | 28
>?= | expression → MAX_ASSIGN identifier_list | 207.5 | 23
<?= | expression → MIN_ASSIGN enumerator_list | 220.6 | 23

The experiments were conducted on input programs containing different combinations of new terminals whose corresponding grammar rules follow the assumptions given in Section 2. The experiments were done on a single input program with all the optimisations enabled. Table 10 shows the results: the system correctly inferred a grammar complete w.r.t. the input program in each of the experiments, and examples of the inferred rules are also shown in the table.

7 Conclusions

In this paper, we have presented a technique that reverse engineers grammar rules of a PL dialect when a set of programs written in the dialect and the grammar of the standard language (an approximate grammar) are given as input. The approach considers the important syntactic constructs of programming languages and their possible extensions, for instance, extensions specific to constructs containing keywords, operators or declaration specifiers. The approach does not cover all types of extensions that can occur in programming languages; hence, there is a need to address the problem of grammar inference for other types of extensions. We observed that there can be several correcting sets of rules; however, the criteria for choosing the best set of correcting rules are still not clear. This opens a possibility for future work in defining the goodness of a rule.

8 Acknowledgment

This work is a part of the doctoral thesis of the first author, which was completed at the Indian Institute of Technology, Kanpur, India.

9 References

[1] KLINT P., LAMMEL R., VERHOEF C.: 'Toward an engineering discipline for grammarware', ACM Trans. Softw. Eng. Methodol., 2005, 14, (3), pp. 331–380

[2] LAMMEL R., VERHOEF C.: 'Cracking the 500-language problem', IEEE Softw., 2001, 18, (6), pp. 78–88

[3] LAMMEL R., VERHOEF C.: 'Semi-automatic grammar recovery', Softw. Pract. Exp., 2001, 31, (15), pp. 1395–1438

[4] GOLD E.M.: 'Language identification in the limit', Inf. Control, 1967, 10, (5), pp. 447–474

[5] GOLD E.M.: 'Complexity of automaton identification from given data', Inf. Control, 1978, 37, (3), pp. 302–320

[6] ADRIAANS P.W.: 'Language learning from a categorial perspective'. PhD thesis, University of Amsterdam, Amsterdam, Netherlands, November 1992

[7] GEERTZEN J., VAN ZAANEN M.: 'Grammatical inference using suffix trees'. Proc. Int. Colloq. Grammatical Inference (ICGI), Athens, Greece, October 2004, pp. 163–174

[8] VAN ZAANEN M.: 'ABL: alignment-based learning'. COLING 2000 – Proc. 18th Int. Conf. Computational Linguistics, Saarbrucken, Germany, August 2000, pp. 961–967

[9] PAREKH R., HONAVAR V.: 'Grammar inference, automata induction, and language acquisition', in DALE R., MOISL H., SOMERS H. (EDS.): 'Handbook of natural language processing' (Marcel Dekker, New York, 2000), Chapter 29

[10] LAWRENCE S., GILES C.L., FONG S.: 'Natural language grammatical inference with recurrent neural networks', IEEE Trans. Knowl. Data Eng., 2000, 12, (1), pp. 126–140

[11] DUBEY A., AGGARWAL S.K., JALOTE P.: 'A technique for extracting keyword based rules from a set of programs'. CSMR'05: Proc. 9th European Conf. Software Maintenance and Reengineering, Manchester, UK, IEEE Computer Society, 2005, pp. 217–225

[12] SELLINK A., VERHOEF C.: 'Development, assessment, and reengineering of language descriptions'. Proc. 4th European Conf. Software Maintenance and Reengineering, IEEE Computer Society, March 2000, pp. 151–160

[13] JAVED F., BRYANT B.R., CREPINSEK M., MERNIK M., SPRAGUE A.P.: 'Context-free grammar induction using genetic programming'. Proc. 42nd Annual Southeast Regional Conf., ACM Press, 2004, pp. 404–405

[14] CREPINSEK M., MERNIK M., JAVED F., BRYANT B.R., SPRAGUE A.: 'Extracting grammar from programs: evolutionary approach', SIGPLAN Not., 2005, 40, (4), pp. 39–46

[15] CREPINSEK M., MERNIK M., ZUMER V.: 'Extracting grammar from programs: brute force approach', SIGPLAN Not., 2005, 40, (4), pp. 29–38

[16] MERNIK M., GERLIC G., ZUMER V., BRYANT B.: 'Can a parser be generated from examples?'. Proc. 18th ACM Symp. Applied Computing, ACM Press, 2003, pp. 1063–1067

[17] JAIN R., AGGARWAL S.K., JALOTE P., BISWAS S.: 'An interactive method for extracting grammar from programs', Softw. Pract. Exp., 2004, 34, (5), pp. 433–447

[18] DE LA HIGUERA C.: 'A bibliographical study of grammatical inference', Pattern Recognition, 2005, 38, pp. 1332–1348

[19] LEE L.: 'Learning of context-free languages: a survey of the literature'. Technical report TR-12-96, Harvard University, 1996, URL: ftp://deas-ftp.harvard.edu/techreports/tr-12-96.ps.gz

[20] RC: RC – safe, region-based memory-management for C, 2001. URL: http://berkeley.intel-research.net/dgay/rc/index.html

[21] C*: UNH C* – a data parallel dialect of C, 1998. URL: http://www.cs.unh.edu/~pjh/cstar/

[22] Java 1.5: 'JDK 5.0 documentation', 2004. URL: http://java.sun.com/j2se/1.5.0/docs/index.html

[23] AHO A.V., SETHI R., ULLMAN J.D.: 'Compilers: principles, techniques, and tools' (Pearson Education (Singapore) Pte. Ltd., 2002)

[24] HOPCROFT J.E., MOTWANI R., ULLMAN J.D.: 'Introduction to automata theory, languages, and computation' (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990)

[25] KASAMI T.: 'An efficient recognition and syntax analysis algorithm for context free languages'. Technical report AFCRL-65758, Air Force Cambridge Research Laboratory, Bedford, MA, 1965

[26] YOUNGER D.H.: 'Recognition and parsing of context-free languages in time n^3', Inf. Control, 1967, 10, (2), pp. 189–208

[27] DUBEY A., JALOTE P., AGGARWAL S.K.: 'Inferring grammar rules of programming language dialects'. Proc. Int. Colloq. Grammatical Inference (ICGI), Tokyo, Japan, September 2006, LNCS (Springer-Verlag), pp. 201–213

[28] UKKONEN E.: 'Lower bounds on the size of deterministic parsers', J. Comput. Syst. Sci., 1983, 26, (2), pp. 153–170

[29] Grammars: 'Compilers & interpreters', July 2000. URL: http://www.angelfire.com/ar/CompiladoresUCSE/COMPILERS.html

[30] MASARU T.: 'Graph-structured stack and natural language parsing'. Proc. 26th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, 1988, pp. 249–257


