Hierarchic syntax error repair for LR grammars

International Journal of Computer and Information Sciences, Vol. 11, No. 4, 1982


David T. Barnard 1,2 and Richard C. Holt 1

Received March 1982; revised June 1982

A description of a technique for handling syntax errors in compilers, called hierarchic error repair, is presented. The technique is simple to understand and to implement. It always repairs a source program into a syntactically valid program and never causes the parser to back up. Details of embedding the technique in the LR family of parsers are given.

KEY WORDS: parsing; LR grammar; syntax error.

1. INTRODUCTION

A parser is a device that recognizes strings in a language and can be built with four components: an input buffer containing the next symbol to be processed, a finite control, a set of tables (encoding the grammar, vocabulary, error messages, etc.), and a parse stack (used in different ways by different classes of parsers). The parser reads symbols from the input program (I-program), manipulates the parse stack, and produces output as directed by the tables. The output is either a parse of the I-program (if it is syntactically valid) or an error indication.

For practical purposes this model of a parser must be extended to enable it to give meaningful diagnostic messages and continue processing after an error in the string is detected. For hierarchic repair, the changes to the classical parser model are: 1) modifications to the finite control, and 2) inclusion of an error repair mechanism. A data structure called the Synchronization Stack is built by the modified finite control. The error repair mechanism can access the parse stack, access the state of the finite control, read but not modify the Synchronization Stack, and access the input buffer. The purpose of the hierarchic error repair mechanism is to force the parser to make a nonerror transition from the state in which an error is detected. This is done by modifying the input buffer to contain a symbol that causes a legal move in that state. This is all that the error repair mechanism is allowed to do. It then returns to the finite control and the parser makes a normal state transition.

1 Computer Systems Research Group, University of Toronto, Toronto, Ontario, Canada.
2 On leave from the Department of Computing and Information Science, Queen's University, Kingston, Ontario, Canada.

0091-7036/82/0800-0231$03.00/0 © 1982 Plenum Publishing Corporation

As a result of this behavior, the parser will go through a sequence of state transitions that corresponds to some syntactically valid program, thus effecting repair of errors. (14) This valid program is called the O-program (output program). If the I-program contains no syntax errors then the O-program is the same as the I-program. However, if the I-program does contain errors, then the O-program is a function of the I-program and the error repair mechanism.

The general problem of syntax error handling has attracted a great deal of interest. Ref. 4 is an extensive survey of the area (extracted from Ref. 3). Theoretical investigations have established pessimistic bounds, such as a cubic time requirement for general minimum distance string-to-string transformation, (1) but in practice, useful results have been obtained in restricted but interesting cases. Many approaches have distinguished a "global" and a "local" strategy. Minimum distance transformation can be used as a local strategy. (7) Another component of many approaches is the augmentation of the grammar with explicit error productions. (1)

One of the most important ideas in some techniques is the concept of the "forward move": after detecting a syntax error, parsing is continued on the text following the error to establish a right context. This can be applied, for example, to precedence parsers. (9) Because LR parsers can use complete knowledge of what has already been parsed in making state transitions, it is more difficult to restart them after an error. However, several authors have applied the forward move in LR parsers. (17,19)

The work reported in this paper can be seen as a development of ideas in the work of Levy. (16) Levy discussed the concept of a "beacon" which was a pair consisting of an input symbol and a state of the parser. When the parser was in the identified state and read the identified symbol, it was known that the parser and the input were synchronized: it was not necessary to consider any previous behavior in dealing with an error that was subsequently detected. Pai and Kieburtz (18) use a similar notion. They identify "fiducial symbols," the appearance of which in the input text is very significant, and on which the parser can synchronize. Rohrich describes work that is similar in philosophy to ours. C2~ A significant difference is that his method treats all terminal symbols homogeneously, allowing some differentiation as a final tuning process; our method is phrase oriented and exploits the structure inherent in modern programming languages to treat some symbols differently.

2. HIERARCHIC ERROR REPAIR

Most of the structure of the syntax tree derived from a program is irrelevant for a programmer and hence can be ignored for error repair. For example, an <expression> is an essential abstraction for a programmer while <term>, <factor>, and <primary> are of interest primarily to the compiler.

In order to effect good repair, the parser with a hierarchic error repairer localizes the effect of errors by recognizing the boundaries of text derived from the important nonterminals. When the left boundary of such text is recognized, the augmented parser enters a mode in which it is prepared to recognize symbols indicating the right boundary (these symbols are called synchronizing symbols). When the augmented parser determines that a synchronizing symbol should be the next symbol read, it will skip input until one is found. When a synchronizing symbol is found in the input, the augmented parser will generate some string of symbols that will lead to a state that can read the input symbol. By virtue of this behavior, the augmented parser will ensure that it is synchronized with the input each time it recognizes the boundary of an important nonterminal. The parser will be responsible for:

1. pushing a set of (right boundary) synchronizing symbols onto the Synchronization Stack when the left boundary of text derived from an important nonterminal is recognized in the program, and

2. popping a set of synchronizing symbols off the Synchronization Stack when the right boundary of text derived from the corresponding nonterminal is recognized in the program.

2.1. Abstractions Useful for Hierarchic Repair

Three abstractions are introduced here that are useful in hierarchic error repair: phases, lines, and nests. Informally, programs are lists of phases, phases are lists of lines, and nests are all other "important" nonterminals.

A language is phased (or has phases) if it has "statements" of more than one kind, and if all the "statements" of one kind must appear together, followed by all the "statements" of another kind, and so on. For example, a Pascal block comprises constant definitions, followed by type definitions, followed by variable definitions, followed by procedure and function definitions, followed by executable statements. Each of these is a phase.


Phases are very important for an error repairer as they often give rise to "cascades" of error messages. Consider the following example drawn from PL/1:

1  T: PROCEDURE OPTIONS(MAIN);
2     DECLARE X FIXED;
3       Y(2,2,2,...,2) /* 1 */ FLOAT;
4     DECLARE S CHARACTER(3);
      ...
6     /* 2 */
7     GET LIST(X, Y);
8     END;

PL/1 does not have a phase structure: declarations and executable statements can appear in any order. However, the SP/k subset of PL/1 requires all declarations to appear before any executable code, hence imposing phasing: there is a phase of declarations, followed by a phase of executable statements. (11) The example has an error at the end of line 2: a semicolon appears instead of a comma. However, a valid prefix parser operating on this program would not detect an error until it reached comment 1, an unbounded distance after the error! Up until comment 1, the parser would assume it was recognizing a subscripted variable appearing on the left-hand side of an assignment statement. The parser would have assumed that it was in the second phase of the program, and the error repairer would now probably skip over all the remaining declarations, stopping at comment 2. (2) An unbounded number of lines could be lost in this way.

If a valid prefix parser is prematurely convinced that a phase of the program is completed, it may skip over much of the text in attempting to resynchronize with the input. It is therefore very important that the error repair mechanism does not allow the parser to move from one phase to another unless strong evidence appears in the input to justify this. There are many reasons for introducing phases into a language--some of them having to do with programming methodology--and many modern languages have phases (e.g., Algol 60, Pascal, and Euclid). Previous work on error handling has not addressed the problem of phases.

A program consists of one or more phases, and each phase is essentially a list of lines. For example, the Pascal "type" phase consists of a list of type definition lines. A "line" may recursively contain phases; for example, a procedure definition can be considered as a line in the declaration phase, but it contains its own declaration phase, executable statement phase, and so on.


(Intuitively, a line is often that fragment of text which should appear on one line of the program listing.) Lines are natural units for localizing the damage caused by repairing an error, because the programmer thinks in terms of lines: they correspond roughly to the various kinds of statements that he uses to build his program.

While phases and lines are the two most important sets of nonterminals for the error repair scheme, there may be other nonterminals in the language that provide natural units for error repair. These nonterminals (that correspond to units the programmer thinks in terms of when building his program) are called nests. The most important example in typical languages is the expression. Expressions appear in several places in different types of lines (in Algol 60, for example), and if an error occurs inside an expression, it is desirable to localize the damage to the expression.

Nonterminals that generate phases, lines, and nests are called "distinguished" in the remainder of this paper.

As an example, here is a grammar with explicit designation of distinguished nonterminals.

        G = A B
phase:  A = vars {V}+ varend
line:   V = var id
phase:  B = stmts {S}+ end
line:   S = do for id
          | if id

The notation {X}+ indicates one or more occurrences of X. Ref. 3 discusses how to choose synchronizing symbols. The synchronizing sets for the various distinguished nonterminals in this grammar are:

G   {⊥}
A   {varend, stmts, end, do, if}
V   {var}
B   {end}
S   {do, if}

The set {⊥}, containing only the end marker, is pushed onto the Synchronization Stack initially. For the line nonterminals, we use symbols that indicate the beginning of the next line as synchronizers. For phase nonterminals, we use symbols that indicate the end of the current phase, or that unambiguously indicate either a subsequent phase or a line in a subsequent phase. We will not discuss the selection of synchronizers any further in this paper.
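The synchronizing sets listed above can be written down directly as data. The following is a minimal sketch in Python, assuming a list of sets as the representation of the Synchronization Stack. The names `SYNC_SETS`, `push_for`, and `is_synchronizing`, the key `"G"` for the goal symbol, and the token `"EOF"` for the end marker are all illustrative choices, not part of the paper.

```python
# Synchronizing sets for the example grammar, keyed by distinguished nonterminal.
# "EOF" stands in for the end marker; the name is an assumption of this sketch.
SYNC_SETS = {
    "G": {"EOF"},
    "A": {"varend", "stmts", "end", "do", "if"},
    "V": {"var"},
    "B": {"end"},
    "S": {"do", "if"},
}

def push_for(stack, nonterminal):
    """Push the synchronizing set of a distinguished nonterminal."""
    stack.append((nonterminal, SYNC_SETS[nonterminal]))

def is_synchronizing(stack, symbol):
    """A symbol is synchronizing if it occurs in any set currently on the stack."""
    return any(symbol in symbols for _, symbols in stack)

# Stack contents while the parser is inside a V line of phase A.
stack = []
for nt in ("G", "A", "V"):
    push_for(stack, nt)

inside_v = {sym: is_synchronizing(stack, sym) for sym in ("var", "stmts", "id")}

stack.pop()  # the right boundary of the V line has been recognized
var_after_pop = is_synchronizing(stack, "var")
```

Inside the V line both "var" and "stmts" count as synchronizing symbols, while "id" does not. Once the entry for V is popped, "var" is no longer synchronizing.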


2.2. The Context of an Error

At any point in the parsing of a program, a partial derivation tree has been constructed. In a top-down parser, the root exists and the tree is being built toward the leaves. In a bottom-up parser, several subtrees may exist that have not yet been joined together. A syntax error occurs when the parser cannot continue building a tree over the input symbols. (For a valid prefix parser, this means that the next symbol cannot be added to the tree.)

The error is an indication of a local inconsistency between the I-program and the set of tokens the parser expects to see. The error repair mechanism uses contextual information to determine how to handle the error. The left context of the error is available in the parse stack (that is, the contents of the parse stack have been determined by the fragment of program read before the detection of the error).

The right context has not yet been seen. Some schemes for error handling involve parsing ahead to determine the actual right context of the error. However, after accumulating this right context, handling the error may necessitate changing the parse stack, which is undesirable as it precludes repair, allowing only recovery. (Reparsing the input is another undesirable solution as it requires a buffering scheme and causes execution time to be worse than linear in the length of the input.)

In the hierarchic repair scheme, instead of using the actual right context we use the preferred right context. In practice, this means identifying a preferred transition from each state in the parser.

2.3. Preferred Transitions

A preferred transition from a state of the parser is determined by a symbol (called the preferred symbol). If an error is detected, the parser will be forced to proceed as if the preferred symbol were present.

The choice of a preferred transition from a state is governed in part by the distinguished nonterminals mentioned above. Because premature termination of phases can lead to cascading of errors, the error repair mechanism will adopt the strategy of "extending" a phase's subtree over as much of the input as possible. This means that when there is an error detected in a state that could read the beginning of another line in a phase, or else could move on to the next phase, the preferred transition will be one that finds another line in the phase. Phase termination will never occur implicitly during error handling.

For all other nonterminals the error repair mechanism will adopt the strategy of "completing" the subtree associated with the nonterminal. This strategy determines the preferred transition in a state that could read more symbols into the subtree for a nonterminal that is not a phase, or else stop growing the subtree. In such a state, the preferred transition will be one that stops growing the subtree. This strategy is motivated by the desire to localize the effect of errors.

In states that do not choose between either growing or not growing a subtree, the grammar designer is free to specify any preferred transition that does not lead to a loop of state transitions in attempting to repair the error. (In practice, the states and transitions of a parser's finite control are not explicitly specified by a compiler writer. Instead, a grammar is specified and the parsing tables are automatically derived from it. We show below how to write LR grammar specifications so that preferred transitions can be implicitly specified for states of the parser's finite control.)

In brief, this strategy in a hierarchic repairer will

1) extend phases until a symbol is seen that explicitly terminates the phase, and

2) complete all other nonterminals as soon as possible.

It should be evident that any grammar could be treated by a hierarchic error repairer, because the goal symbol is always available as a distinguished nonterminal (its boundary is signaled by the end marker). However, a hierarchic repairer will do a better job of error handling for languages with more useful redundancy. (3)

2.4. Handling an Error

We will now show how the hierarchic error repair mechanism modifies the state of the parser's finite control, and manipulates the input in dealing with an error.

When an error is detected, the hierarchic repair mechanism will force the parser to take the preferred transition--if necessary, by "finding" the terminal symbol that causes that transition. This "finding" can be effected by generating the symbol in question or by deleting input until the symbol is seen.

When an error is being repaired, the hierarchic error repairer optionally advances the input cursor depending on the context in which the error occurs. This context consists only of the particular terminal symbol required to cause the preferred transition and the particular input symbol.

The following pseudo-code shows how the hierarchic error repairer will behave once it has determined the symbol that will force the preferred transition. A symbol is called a synchronizing symbol if it appears on the Synchronization Stack when an error occurs.


if Input is a Synchronizing Symbol then
    Generate Preferred Symbol
else if Preferred Symbol is a Synchronizing Symbol then
    Delete Input
else
    Local Repair

When this pseudo-code has been executed, control returns to the parser. The parser may immediately take another error transition, in which case the pseudo-code is executed again. Thus, this action is effectively embedded in an iteration: it is repeated until the input symbol can be read by the parser; that is, until the input and the parser are synchronized.
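The three-way decision in the pseudo-code can be sketched as a small Python function. This is only an illustration: the function name `repair_step`, the action labels, and the flat set `sync_symbols` (standing for every symbol currently on the Synchronization Stack) are assumptions of the sketch, and local repair is left as a label because the paper does not detail it.

```python
def repair_step(input_symbol, preferred_symbol, sync_symbols):
    """Return the action taken for one error, following the pseudo-code."""
    if input_symbol in sync_symbols:
        # Keep the input symbol and supply the preferred symbol in front of it.
        return "generate"
    if preferred_symbol in sync_symbols:
        # Skip input until the synchronizing preferred symbol turns up.
        return "delete"
    # Neither symbol is synchronizing.
    return "local"

# Hypothetical set of symbols currently on the Synchronization Stack.
sync = {"EOF", "varend", "stmts", "end", "do", "if"}

actions = [
    repair_step("end", "id", sync),   # input is synchronizing
    repair_step("+", "end", sync),    # only the preferred symbol is synchronizing
    repair_step("+", "id", sync),     # neither is synchronizing
]
```

The parser would call such a step each time it hits an error entry, which gives the iteration described above.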

Local repair attempts to repair an error when neither the input symbol nor the preferred symbol is a synchronizing symbol. This paper will not treat local repair in detail. See Ref. 2 for the effective strategy used in the SP/k compiler. In general, local repair is only allowed to delete and/or insert one input symbol.

3. HIERARCHIC REPAIR IN LR PARSERS

This section shows how to embed hierarchic error repair in the LR family of parsers, the state-of-the-art method in bottom-up parsing. (15)

3.1. Review of LR Parsing

An LR parser consists of a stack for state symbols (described later), an input, and a finite control. The finite control is directed by a two-dimensional table.

There is a row in the table for each symbol that can appear on the stack, namely the state symbols and the bottom of stack marker, $. There is a column in the table for every possible lookahead string; for LR(1) these are the terminal and nonterminal symbols, and the endmarker. (The remainder of this paper is restricted to LR(1) for ease of presentation. LR(k) for k > 1 is not currently practical in any case.) The entries in the table are Reduce (with a production number), Shift (with a state), Accept, or Error. (They are abbreviated R, S, A, and E, respectively.) The parser's finite control is said to be in the state whose symbol is on top of the stack.

The actions taken are as follows. On a Reduce action, as many symbols as there are on the right side of the associated production are removed from the stack, and the symbol on the left side is inserted in the input. On a Shift action, the state specified is pushed onto the stack and the input is advanced one symbol. The unique Accept entry signifies that the input is accepted. Error entries signify syntax errors.
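The table-driven loop described in this review can be sketched as follows. The toy grammar, its table, and the names `PRODUCTIONS`, `TABLE`, and `parse` are assumptions made for illustration; they are not taken from the paper. The sketch follows the convention stated above that a Reduce puts the left-side symbol back into the input, so that nonterminals are shifted like any other symbol.

```python
from collections import deque

# Toy grammar (an assumption of this sketch):
#   0. S' = S    1. S = a S    2. S = b
PRODUCTIONS = {1: ("S", 2), 2: ("S", 1)}  # number -> (left side, right-side length)

# Rows are states; columns are lookahead symbols (terminals, nonterminals, "$").
# Entries are (kind, argument); missing entries are Error moves.
TABLE = {
    0: {"a": ("S", 2), "b": ("S", 3), "S": ("S", 1)},
    1: {"$": ("A", None)},
    2: {"a": ("S", 2), "b": ("S", 3), "S": ("S", 4)},
    3: {"$": ("R", 2)},
    4: {"$": ("R", 1)},
}

def parse(tokens):
    """Return the production numbers used, in order, or None on a syntax error."""
    stack = [0]                          # stack of state symbols
    inp = deque(list(tokens) + ["$"])    # input followed by the end marker
    reductions = []
    while True:
        kind, arg = TABLE[stack[-1]].get(inp[0], ("E", None))
        if kind == "S":                  # Shift: push the state, advance the input
            stack.append(arg)
            inp.popleft()
        elif kind == "R":                # Reduce: pop the right side, insert the left side
            lhs, length = PRODUCTIONS[arg]
            del stack[-length:]
            inp.appendleft(lhs)
            reductions.append(arg)
        elif kind == "A":                # Accept
            return reductions
        else:                            # Error
            return None
```

For the input `a a b` this driver reduces by production 2 once and then by production 1 twice. An input such as `b a` reaches an Error entry in state 3.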

States (which are designated by numeric state symbols) in an LR parser are formed from sets of items. An item is a production with a marked position on its right side. A right side with k symbols has k + 1 positions: before each symbol and after the last symbol. The occurrence of an item in a state indicates that the input recognized so far could possibly form that part of the production preceding the mark. An item with the mark to the left of the first symbol on the right side is called an initial item. An item with the mark to the right of the last symbol on the right side is called a final item.
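As a small illustration of these definitions, an item can be represented as a production together with a mark position. The class name `Item` and the helper `items_of` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    lhs: str      # left side of the production
    rhs: tuple    # symbols of the right side
    dot: int      # mark position, from 0 to len(rhs)

    def is_initial(self):
        """The mark is to the left of the first right-side symbol."""
        return self.dot == 0

    def is_final(self):
        """The mark is to the right of the last right-side symbol."""
        return self.dot == len(self.rhs)

    def __str__(self):
        before = "".join(self.rhs[:self.dot])
        after = "".join(self.rhs[self.dot:])
        return f"[{self.lhs} = {before}.{after}]"

def items_of(lhs, rhs):
    """All items of one production: a right side of k symbols yields k + 1 items."""
    return [Item(lhs, tuple(rhs), dot) for dot in range(len(rhs) + 1)]

items = items_of("A", "bXxx")
```

A right side with four symbols gives five items, from the initial item `[A = .bXxx]` to the final item `[A = bXxx.]`.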

One of the differences between the various methods of the "LR family" (LR(0), SLR(1), LALR(1), and LR(1)) is that all of the methods but LR(1) may make some reductions before signaling an error. In essence, the other methods defer reporting errors. Some researchers have noted this in their discussions of LR error handling. Ghezzi prefers no stack operations at all before announcing an error; this is what happens in LR(1). (8) Pennello and DeRemer also prefer LR(1) for this reason. (19) This means that the partial derivation tree is not pruned before the error is detected. However, as described above, hierarchic repair requires that the tree be collapsed until the lowest node being worked on corresponds to one of the distinguished nonterminals. Accordingly, we start with full LR parsers and modify them so that some reductions will be done before signaling an error: all nonterminals subordinate to the last distinguished nonterminal which the parser has started to recognize will be reduced. We call the resulting parsers h-LR for "hierarchic LR." (In fact, all of the examples are LALR(1), and the method can be applied to most practical LALR(1) grammars; see the Appendix.)

The modifications required for hierarchic repair comprise adding a new stack to the parser called the Synchronization Stack, marking some parsing table entries to indicate that this stack is to be manipulated, and specifying error moves.

3.2. Pushing Onto the Synchronization Stack

We do not allow a symbol in the Synchronization Stack to be read while handling errors; it can only be read by an S move that explicitly requires it. At any point during the parse, the Synchronization Stack must have an entry for every distinguished nonterminal currently being parsed. This is easily accomplished in LL parsers by pushing an entry onto the stack when a produce move for a distinguished nonterminal is taken. We want to do what corresponds to this in LR, namely, mark S moves that accept the initial symbols of text derived from a distinguished nonterminal. However, a single S move may require several entries to be pushed because recognition of a symbol, say "a", in the input while in state "p" may correspond to the beginning of zero or several nonterminals (and in particular, of several distinguished nonterminals). Consider the following grammar with distinguished nonterminals explicitly labeled:

S = A B

phase: A = {X}*

line: X = Y Z

nest: Y = a b e

(Here {X}* means zero or more occurrences of X.) The parser generated from this grammar will start to recognize the three distinguished nonterminals A, X, and Y when it shifts an "a" in its initial state. The synchronizing symbols for all of these must be pushed onto the Synchronization Stack on that Shift move, in the order: first A, then X, then Y.

The power of LR derives from the fact that a decision as to which production was applied is deferred until after all the text derived from a production has been seen. So in general it is not possible to tell on the S move that accepts the first terminal corresponding to a production, which production is the applicable one. For example, in

S = A
  | B
A = c a
B = c b

when "c" is read it is not possible (with lookahead of one) to tell whether A or B is being recognized. It is possible to produce a modified grammar (by left factoring and introducing new nonterminals) for which an LR parser can tell which unique production each terminal was derived from. Such rewriting is not acceptable because it may cause the distinguished nonterminals to be no longer recognizable. The approach of marking S moves that correspond to beginnings of distinguished nonterminals assumes that it is possible to tell when text derived from a phase, line, or nest nonterminal in a given grammar is starting. The parser cannot defer the decision as to whether the distinguished nonterminal will appear in the input; it must decide at the beginning.

It is easy to determine whether such early decisions can always be made, given a set of LR tables. Suppose an S move that may start to accept a distinguished nonterminal appears in some state P. The terminal symbol that indicates the start of the distinguished nonterminal is "a". If every item in P that has "a" to the right of the mark corresponds to some expansion of the same distinguished nonterminal, then the state causes no problems. Consider the following grammar fragment:

S = A B
A = a X
  | a Y

In this case, there will be a state containing the item [A = .aX] and the item [A = .aY]. The S move on "a" is acceptable because although there are multiple items with "a" to the right of the mark, they all correspond to the start of the same distinguished nonterminal. Here is another example:

S = a B
  | A
A = a X

There will be a state that has the items [S = .aB], [S = .A], and [A = .aX]. The S move on "a" is now not acceptable because "a" could mark the start of S, or of A.

3.2.1. h-LR Grammars

These simple examples illustrate how to decide from a set of tables and the items making up the states whether or not the scheme for manipulating the Synchronization Stack can be applied. A grammar is called h-LR if all of the states in the LR parser can be handled by the approach described. Essentially, this means that the LR condition applies at the beginning of phase, line, and nest nonterminals. More precisely, a grammar is h-LR if the following is true.

For every distinguished nonterminal, H, in the grammar, consider all states with items of the form [X = w.Hx], that is, items with H to the right of the mark. For each u in First(H), for every item of the form [Y = y.uz] in the states under consideration, it must be the case that the item was added to the state because of closure being applied to an item with H to the right of the mark (i.e., all symbols in First(H) always signal the beginning of text derived from H).
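The condition can be checked mechanically on one state at a time. The sketch below is a simplified approximation rather than the paper's procedure. It works on LR(0) item sets, assumes a grammar without empty right sides, and treats an item as "added by closure from H" when it is an initial item whose left side is reachable from H through leftmost symbols; it does not track an item that is reachable by two different routes. The function names and the grammar encoding are hypothetical.

```python
def left_corners(grammar, start):
    """Nonterminals reachable from `start` through leftmost symbols, including `start`."""
    seen, todo = {start}, [start]
    while todo:
        for rhs in grammar[todo.pop()]:
            head = rhs[0]
            if head in grammar and head not in seen:
                seen.add(head)
                todo.append(head)
    return seen

def first(grammar, nonterminal):
    """Terminals that can begin text derived from `nonterminal`."""
    return {rhs[0]
            for nt in left_corners(grammar, nonterminal)
            for rhs in grammar[nt]
            if rhs[0] not in grammar}

def closure(grammar, kernel):
    """LR(0) closure of a set of items, each written as (lhs, rhs, dot)."""
    items, todo = set(kernel), list(kernel)
    while todo:
        _, rhs, dot = todo.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for new_rhs in grammar[rhs[dot]]:
                item = (rhs[dot], new_rhs, 0)
                if item not in items:
                    items.add(item)
                    todo.append(item)
    return items

def state_is_h_lr(grammar, kernel, h):
    """Approximate test of the h-LR condition for one state and one nonterminal H."""
    state = closure(grammar, kernel)
    if not any(dot < len(rhs) and rhs[dot] == h for _, rhs, dot in state):
        return True  # H cannot start in this state, so nothing to check
    below_h, first_h = left_corners(grammar, h), first(grammar, h)
    return all(dot == 0 and lhs in below_h
               for lhs, rhs, dot in state
               if dot < len(rhs) and rhs[dot] in first_h)

# The two grammar fragments discussed above, with A distinguished. Symbols that
# have no productions here (B, X, Y) are simply treated as terminals.
GOOD = {"S'": [("S",)], "S": [("A", "B")], "A": [("a", "X"), ("a", "Y")]}
BAD = {"S'": [("S",)], "S": [("a", "B"), ("A",)], "A": [("a", "X")]}
START = {("S'", ("S",), 0)}

good_ok = state_is_h_lr(GOOD, START, "A")
bad_ok = state_is_h_lr(BAD, START, "A")
```

On the first fragment the initial state passes, because every item with "a" after the mark expands A. On the second it fails, because [S = .aB] also has "a" after the mark.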

3.2.2. Grammars that are not h-LR

It is possible to apply hierarchic error repair to grammars that are LR but not h-LR. To do so it would be necessary to defer pushing the synchronizing symbols onto the Synchronization Stack until the parser was sure that the distinguished nonterminal was being recognized. This requires a more complex modified parser than our h-LR grammars because pushes may have to be done at R moves for example. In addition, it is possible to construct pathological grammars where errors could prohibit detecting the occurrence of the distinguished nonterminal at all. In other words, the synchronizing symbols would be needed for error repair before the parser had "narrowed down" to looking for only the distinguished nonterminal.

3.2.3. h-LR and Language Design

The restriction of grammars to h-LR makes reading and writing programs easier for humans. Since distinguished nonterminals are important abstractions (by definition) it should be possible to tell in a left to right scan of the program whether text derived from such a nonterminal is starting or not. This should be true whether the "reader" is a human or a machine (parser).

3.2.4. Summary of Pushing

Symbols will be pushed onto the Synchronization Stack by modified S moves in the modified parser. These moves are indicated by being marked with a list of names of distinguished nonterminals. When such a move is encountered by the parser, the synchronizing sets corresponding to the nonterminals are pushed onto the Stack. The S moves so marked will be those that read the first symbol of text derived from distinguished nonterminals.

3.3. Popping the Synchronization Stack

The parser needs to know when to pop the Synchronization Stack. As will be seen shortly, generation of omitted symbols in the source program is accomplished in S moves. Thus by the time an R move is made for a distinguished nonterminal, the input and parser are synchronized and the Synchronization Stack should have the entry corresponding to the recognized distinguished nonterminal removed. Since distinguished nonterminals are strictly nested, all that is required is that certain R moves be flagged indicating a pop of the Synchronization Stack. This completes the discussion of the manipulation of the Synchronization Stack.
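The pushing and popping rules can be sketched together. The example below follows the grammar of Section 3.2 with phase A, line X, and nest Y. For brevity it pushes the names of the nonterminals rather than their synchronizing sets, which the paper does not list for that grammar. The class `SyncStack` and its method names are assumptions of this sketch.

```python
DISTINGUISHED = {"A", "X", "Y"}   # phase, line, and nest of the example grammar

class SyncStack:
    """One entry per distinguished nonterminal currently being parsed."""

    def __init__(self):
        self.entries = []

    def shift(self, marks=()):
        """A marked S move pushes an entry for each listed nonterminal, in order."""
        self.entries.extend(marks)

    def reduce(self, lhs):
        """A flagged R move pops the entry of the nonterminal just recognized."""
        if lhs in DISTINGUISHED:
            top = self.entries.pop()
            assert top == lhs, "distinguished nonterminals must be strictly nested"

sync = SyncStack()
trace = []

sync.shift(marks=("A", "X", "Y"))  # shift "a" in the initial state: A, then X, then Y
trace.append(list(sync.entries))

sync.shift()                       # the remaining symbols of Y are unmarked shifts
sync.shift()
sync.reduce("Y")                   # right boundary of the nest
trace.append(list(sync.entries))

sync.reduce("X")                   # right boundary of the line, once Z has been parsed
trace.append(list(sync.entries))
```

The entry for the phase A stays on the stack until the R move for A itself is made.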

3.4. Error Moves

This section specifies the action to be taken on E moves in the modified LR parser. The states in the parser are characterized according to the types of non-E moves made, and the corresponding behavior of E moves is discussed.

When the non-E move is A: there is a unique state that has an A (accept) move. (It has no other non-E moves because in the LR parser generator we always augment the grammar with the production S' = S. Instead of reducing by this production, the parser accepts.) Error repair will be achieved by moves in other states so that in this state the parser always accepts.

Other states in the LR parser may have more than one non-E move. There may be R and S moves; there may be R moves for different productions; there may be S moves for more than one terminal symbol. (S moves for nonterminals are of no concern here; hierarchic repair only inserts or deletes terminals.) The error repair mechanism must force the parser to take the "preferred" transition. The question is: which transition is "preferred" for a state? Consideration is first given to deciding between S and R moves if both occur in the same state, and then to picking which S or R move to make if there are several.

When there are both S and R moves it must be decided which type of move to make. Recall that the strategy is to extend phases as long as possible, that is until a synchronizing symbol is seen in the input, and not to extend other nonterminals (including line and nest distinguished nonterminals). This leads to the following rule:

If there is an R move for a nonterminal corresponding to a phase and the input symbol is not one of those currently appearing in the Synchronization Stack, then perform an S move, else perform an R move.

(The meaning of "a nonterminal corresponding to a phase" is clarified by the example presented below.) The rule will cause the parser to continue gathering text into the current nonterminal if it is a phase, and in any other case will cause a reduction.
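The rule for choosing between S and R moves is small enough to state as a function. In this sketch the Synchronization Stack is a list of sets, and the flag `reduces_phase` says whether the state's R move is for a nonterminal corresponding to a phase. Both the representation and the names are assumptions made for illustration.

```python
def choose_move_type(reduces_phase, input_symbol, sync_stack):
    """Decide between the S move and the R move of a state that offers both."""
    on_stack = any(input_symbol in symbols for symbols in sync_stack)
    if reduces_phase and not on_stack:
        return "S"   # extend the phase: keep gathering text into it
    return "R"       # complete the nonterminal

# Hypothetical stack contents: the end marker set and the set for a phase.
sync_stack = [{"EOF"}, {"varend", "stmts", "end", "do", "if"}]

choices = [
    choose_move_type(True, "id", sync_stack),    # phase, input not synchronizing
    choose_move_type(True, "end", sync_stack),   # phase, input is synchronizing
    choose_move_type(False, "id", sync_stack),   # not a phase
]
```

Only the first case extends the current nonterminal; the other two reduce.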

Now consider the case where more than one R move can be made, or where more than one S move can be made. In either of these cases a choice must be made among several productions of the grammar: in the case of R moves a production is to be applied, and in the case of S moves a production is to be extended on the parse stack. In order to accomplish this the expansions of each nonterminal must be totally ordered. (A total order will seldom be needed in practical grammars; all cases where ordering is used in deciding on proper recovery in the E moves can be detected and reported to the grammar designer.) The ordering is intended to be from most preferred to least preferred (most probable to least probable). Consider the following grammar fragment as an illustration:

S = B e
  | A a X
  | A b Y

There will be a state, Q, in the parser containing the items [S = A.aX] and [S = A.bY]. If an error occurs in state Q the parser must decide between S moves for "a" and "b". Although the first production may be the most probable, or preferred, expansion for S, when state Q is reached, the parser has narrowed the possibilities to some set of other expansions of S and must choose one of them.
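One way to make this choice is to take the order in which the expansions are written as the order of preference and pick the earliest expansion still present in the state. The sketch below assumes all items in the state share one left side; the names `ORDER` and `preferred_item` and the tuple encoding of items are hypothetical.

```python
# Expansions of S, most preferred first, as written in the fragment above.
ORDER = {"S": [("B", "e"), ("A", "a", "X"), ("A", "b", "Y")]}

def preferred_item(items):
    """Pick the item whose right side comes first in the order for its left side.

    Assumes every item is (lhs, rhs, dot) and that all items share one left side.
    """
    return min(items, key=lambda item: ORDER[item[0]].index(item[1]))

# State Q holds [S = A.aX] and [S = A.bY]; the listing order here is arbitrary.
state_q = [("S", ("A", "b", "Y"), 1), ("S", ("A", "a", "X"), 1)]

lhs, rhs, dot = preferred_item(state_q)
preferred_symbol = rhs[dot]   # the terminal that causes the preferred transition
```

With this ordering the preferred transition in state Q is the S move on "a", even though the most preferred expansion of S overall is no longer a candidate.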

3.5. Choosing a Preferred Transition

If all items in a state are derived from productions associated with the same left side, then a linear order imposed on the right sides is sufficient to distinguish one item: order the items according to the order of the right sides, and pick the first one. The only difficulty is with states that have items with different left sides in the associated productions. If items appear for two left sides, simply trace back in the LR parser to find the place where expansions of some nonterminal introduced into the state items that eventually (perhaps after several applications of successor and closure operations) led to the state under consideration. Consider the following example.

0. S' = S
1. S = A
2.   | B
3. A = b X x x
4. B = Y y y
5. X = a d
6. Y = b a c

Here is the parser generated from this grammar:

State  Items                                                           Move  Transitions
 0:    [S' = .S] [S = .A] [S = .B] [A = .bXxx] [B = .Yyy] [Y = .bac]   S     (S, 1)(A, 2)(B, 3)(b, 4)(Y, 5)
 1:    [S' = S.]                                                       A
 2:    [S = A.]                                                        R1
 3:    [S = B.]                                                        R2
 4:    [A = b.Xxx] [Y = b.ac] [X = .ad]                                S     (X, 6)(a, 7)
 5:    [B = Y.yy]                                                      S     (y, 8)
 6:    [A = bX.xx]                                                     S     (x, 9)
 7:    [Y = ba.c] [X = a.d]                                            S     (c, 10)(d, 11)
 8:    [B = Yy.y]                                                      S     (y, 12)
 9:    [A = bXx.x]                                                     S     (x, 13)
10:    [Y = bac.]                                                      R6
11:    [X = ad.]                                                       R5
12:    [B = Yyy.]                                                      R4
13:    [A = bXxx.]                                                     R3

Consider state 7. If an error occurs here (i.e., the input symbol is other than c or d), the parser must choose between the nonterminals X and Y in order to continue. However, on tracing back, it can be seen that state 7 is reached as a result of an S move on "a" in state 4. Here X is introduced because of its appearance in production 3. Now the choice can be reduced to one between A and Y in state 4. Again tracing back, it can be seen that state 4 is reached by an S move on "b" from state 0. In state 0, Y was introduced because of production 4. This means that the choice can be reduced to one between A and B. These were both introduced because they appear in productions of S, the start symbol. So the choice reduces to one among expansions of the same nonterminal. Such a choice can be made if there is a linear ordering on expansions of S.

This simple example illustrates the procedure used to trace back through the parser's states to make a decision as to which is the preferred item (and hence the preferred transition) in a state on the basis of the ordering on the productions of one nonterminal.

The following Lemma states that this can always be done. Following the proof, we identify a class of grammars where our method requires different results than a classical LR parser would produce.

Lemma. Establishing a linear order on the expansions of every nonterminal of an LR grammar is sufficient to establish a linear order on the items of each state in the corresponding LR parser.

Proof. States in the LR parser are generated by repeated application of the successor and closure operations. The basis for the first state in the parser is the set of items {[S' = .S]}, which is trivially ordered. We assume that the expansions of each nonterminal have a linear ordering, and show how to impose an ordering on items when applying successor and closure.

First consider closure. Closure can be viewed as an iterative operation: before applying closure to a state, all nonterminals are untreated; at each iteration, look down the ordered list of items in the state for an item with the mark to the left of an untreated nonterminal. Immediately following this item, insert into the list the set of initial items corresponding to the productions of that nonterminal in the order given in the grammar. The nonterminal is marked as having been treated. Since there are a finite number of nonterminals, this procedure will terminate for each basis set. At the completion of the closure operation, we have an ordered list of items, with the order determined by the assumed order on expansions of each nonterminal.

Now consider the successor operation. Successor is applied to a state, Q, to obtain the basis for another state. For each symbol (terminal and nonterminal) that appears to the right of the mark in any items in Q, there is a successor operation applied. The basis of the successor state for symbol x is formed from the set of items that had the mark to the left of x in Q, by moving the mark to the right of x. The order of the list of items so found is taken to be the same as the order of the corresponding items in Q.

Now, since the basis of the initial state is ordered, and successor and closure operations maintain order in a natural way, the assumed order on expansions of nonterminals is sufficient to establish a linear order on the items of each state, as required.
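The construction used in the proof can be sketched in Python as follows. This is only an illustration under stated assumptions: the representation (a dictionary `GRAMMAR` mapping each nonterminal to its right sides in preference order, items as `(left side, right side, mark position)` triples) and the names `ordered_closure`, `ordered_successor` and `show` are hypothetical and are not taken from the prototype implementation. The sketch is applied to the six-production example grammar of this section. Items are ordered as the Lemma prescribes, which need not coincide with the order in which a conventional listing prints them.

```python
# Hypothetical representation: right sides are listed from most preferred
# to least preferred for each nonterminal.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A",), ("B",)],
    "A": [("b", "X", "x", "x")],
    "B": [("Y", "y", "y")],
    "X": [("a", "d")],
    "Y": [("b", "a", "c")],
}


def ordered_closure(basis, grammar):
    """Close an ordered list of items, inserting the initial items of each
    untreated nonterminal immediately after the item that introduces it."""
    items = list(basis)
    treated = set()
    while True:
        for pos, (lhs, rhs, dot) in enumerate(items):
            if dot < len(rhs) and rhs[dot] in grammar and rhs[dot] not in treated:
                nonterminal = rhs[dot]
                items[pos + 1:pos + 1] = [
                    (nonterminal, alt, 0) for alt in grammar[nonterminal]
                ]
                treated.add(nonterminal)
                break  # restart the scan from the top of the list
        else:
            return items  # no untreated nonterminal follows a mark


def ordered_successor(state, symbol):
    """Basis of the successor state on `symbol`, kept in the order of `state`."""
    return [
        (lhs, rhs, dot + 1)
        for (lhs, rhs, dot) in state
        if dot < len(rhs) and rhs[dot] == symbol
    ]


def show(item):
    """Render an item in the bracketed notation used in the text."""
    lhs, rhs, dot = item
    return f"[{lhs} = {''.join(rhs[:dot])}.{''.join(rhs[dot:])}]"


state0 = ordered_closure([("S'", ("S",), 0)], GRAMMAR)
state4 = ordered_closure(ordered_successor(state0, "b"), GRAMMAR)
state7 = ordered_closure(ordered_successor(state4, "a"), GRAMMAR)
print([show(item) for item in state7])  # ['[X = a.d]', '[Y = ba.c]']
```

In this sketch the first item of state 7 comes from X, which agrees with the trace-back argument given earlier: the choice reduces to one between A and B, and A is the first expansion of S.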

A problem arises when the successor operation applied to two or more different states yields the same set of basis items but with different orderings. In a classical LR parser, states are sets of items so the order is immaterial and there would be one state generated; but this is unacceptable for hierarchic error repair, and the preceding Lemma implicitly requires generation of different states for each different ordering on the basis items. We call a grammar that yields more than one ordering for some basis set h-ambiguous. A parser generator could either automatically generate different states with the required orderings on basis sets of items, or it could reject the grammar and require the designer to specify explicitly the different orderings by introducing more nonterminals. In fact, most parser generators could easily generate different states simply by omitting the sorting routine that is applied to basis sets of states after generation. Thus, implementation does not require any new techniques.

As an illustration, here is a grammar that is h-ambiguous.

1. S = A
2.   | B
3. A = a X
4.   | a Y
5. B = b Y
6.   | b X
7. X = z x
8. Y = z y

The corresponding parser is:

 0: [S' = .S]     S    (S, 1)(A, 2)(B, 3)(a, 4)(b, 5)
    [S = .A]
    [A = .aX]
    [A = .aY]
    [S = .B]
    [B = .bY]
    [B = .bX]
 1: [S' = S.]     A
 2: [S = A.]      R1
 3: [S = B.]      R2
 4: [A = a.X]     S    (X, 6)(Y, 7)(z, 8)
    [X = .zx]
    [A = a.Y]
    [Y = .zy]
 5: [B = b.Y]     S    (Y, 9)(X, 10)(z, 8)
    [Y = .zy]
    [B = b.X]
    [X = .zx]
 6: [A = aX.]     R3
 7: [A = aY.]     R4
 8: [X = z.x]     S    (x, 11)(y, 12)
    [Y = z.y]
 9: [B = bY.]     R5
10: [B = bX.]     R6
11: [X = zx.]     R7
12: [Y = zy.]     R8

In this parser states 4 and 5 yield different preferred orders for the basis of state 8. Coming from state 4 yields the indicated order for the basis items of state 8, whereas coming from state 5 yields the other order. This can be resolved by state splitting as follows: add the state

13: [Y = z.y]     S    (y, 12)(x, 11)
    [X = z.x]

and make the z-transition from state 5 go to state 13. The ordering on items now indicates that "x" is the preferred read symbol in state 8 while "y" is the preferred read symbol in state 13.
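A parser generator could detect h-ambiguity along the following lines. The Python sketch below is an assumption-laden illustration, not the authors' generator (none has been written): it treats each basis as an ordered tuple of items, explores all successor bases, and reports every basis set that is reached in more than one order. The names `GRAMMAR`, `closure` and `h_ambiguous_bases` are hypothetical.

```python
from collections import deque

# The h-ambiguous example grammar; right sides are in preference order.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("A",), ("B",)],
    "A": [("a", "X"), ("a", "Y")],
    "B": [("b", "Y"), ("b", "X")],
    "X": [("z", "x")],
    "Y": [("z", "y")],
}


def closure(basis, grammar):
    """Ordered closure: initial items follow the item that introduces them."""
    items, treated = list(basis), set()
    while True:
        for pos, (lhs, rhs, dot) in enumerate(items):
            if dot < len(rhs) and rhs[dot] in grammar and rhs[dot] not in treated:
                nt = rhs[dot]
                items[pos + 1:pos + 1] = [(nt, alt, 0) for alt in grammar[nt]]
                treated.add(nt)
                break
        else:
            return items


def h_ambiguous_bases(grammar, start="S'"):
    """Map each basis set reached in more than one order to its orderings."""
    orderings = {}  # frozenset of basis items -> set of ordered tuples
    first = ((start, grammar[start][0], 0),)
    seen, work = {first}, deque([first])
    while work:
        basis = work.popleft()
        orderings.setdefault(frozenset(basis), set()).add(basis)
        state = closure(basis, grammar)
        symbols = []  # symbols to the right of a mark, in item order
        for lhs, rhs, dot in state:
            if dot < len(rhs) and rhs[dot] not in symbols:
                symbols.append(rhs[dot])
        for symbol in symbols:
            successor = tuple(
                (lhs, rhs, dot + 1)
                for (lhs, rhs, dot) in state
                if dot < len(rhs) and rhs[dot] == symbol
            )
            if successor not in seen:
                seen.add(successor)
                work.append(successor)
    return {b: o for b, o in orderings.items() if len(o) > 1}


AMBIGUOUS = h_ambiguous_bases(GRAMMAR)
for orders in AMBIGUOUS.values():
    for order in sorted(orders):
        print(order)
```

For this grammar the only basis set reported is the one containing [X = z.x] and [Y = z.y], reached in one order from state 4 and in the other from state 5. A generator could either split the state, as done above with state 13, or reject the grammar.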

We now use the Lemma to establish the desired result.

Theorem. Establishing a linear ordering on the expansions of every nonterminal in an LR grammar is sufficient to designate a preferred S or R move for each state in the corresponding LR parser.

Proof. The Lemma shows that the ordering of expansions imposes a corresponding ordering on the items in a state. Finding the preferred S move for a state corresponds to finding the preferred input symbol. Let the items be ordered as implied by the Lemma.

The preferred S move is the one corresponding to the first item in the list with a terminal to the right of the mark.


The preferred R move for a state corresponds to the first final item (i.e., item with the mark to the right of the rightmost symbol) in the list ordered according to the Lemma.
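Given a state whose items are already ordered as in the Lemma, selecting the preferred moves is a simple scan. The following Python sketch assumes items are `(left side, right side, mark position)` triples and that the set of terminals is supplied by the caller; the function name `preferred_moves` and the sample states are hypothetical illustrations based on the h-ambiguous example above.

```python
def preferred_moves(state, terminals):
    """Return (preferred read symbol, preferred final item) for an ordered
    state; either component is None when the state has no such move."""
    read_symbol = next(
        (rhs[dot] for lhs, rhs, dot in state
         if dot < len(rhs) and rhs[dot] in terminals),
        None,
    )
    final_item = next(
        ((lhs, rhs) for lhs, rhs, dot in state if dot == len(rhs)),
        None,
    )
    return read_symbol, final_item


TERMINALS = {"a", "b", "x", "y", "z"}

# States 8 and 13 of the h-ambiguous example: same items, different order.
STATE_8 = [("X", ("z", "x"), 1), ("Y", ("z", "y"), 1)]
STATE_13 = [("Y", ("z", "y"), 1), ("X", ("z", "x"), 1)]
# State 6 of the same parser holds a single final item.
STATE_6 = [("A", ("a", "X"), 2)]

print(preferred_moves(STATE_8, TERMINALS))   # ('x', None)
print(preferred_moves(STATE_13, TERMINALS))  # ('y', None)
print(preferred_moves(STATE_6, TERMINALS))   # (None, ('A', ('a', 'X')))
```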

3.6. Summary of Error Moves

The following pseudo-code summarizes this discussion of LR error moves.

if Reduce is Preferred and Input is a Synchronizing Symbol then
    Reduce
else if Input is a Synchronizing Symbol then
    Insert Preferred Symbol
else if Preferred Symbol is a Synchronizing Symbol then
    Delete Symbol
else
    Local Repair

One final point is that the parser starts with the set consisting of the end marker pushed on the Synchronization Stack.
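The decision order of the pseudo-code can be written as a small Python function. Two things here are assumptions made for the sketch and should be checked against the definitions in section 3: a symbol is taken to be synchronizing when it appears in any set on the Synchronization Stack, and whether a reduce is preferred is passed in by the caller as a flag. The name `error_move` and the sample stacks are hypothetical.

```python
def error_move(reduce_preferred, input_symbol, preferred_symbol, sync_stack):
    """Select an error move in the order given by the pseudo-code.

    sync_stack is a list of sets, top of stack first.
    """
    def is_sync(symbol):
        # Assumption: membership in any set on the stack makes a symbol
        # a synchronizing symbol.
        return any(symbol in sync_set for sync_set in sync_stack)

    if reduce_preferred and is_sync(input_symbol):
        return "reduce"
    if is_sync(input_symbol):
        return "insert preferred symbol"
    if is_sync(preferred_symbol):
        return "delete symbol"
    return "local repair"


# Illustrative stacks with the end marker set at the bottom.
STACK_A = [{"varend", "stmts", "end", "do", "if"}, set(), {"EOF"}]
STACK_B = [{"do", "if"}, {"end"}, set(), {"EOF"}]

print(error_move(True, "stmts", "var", STACK_A))      # reduce
print(error_move(False, "stmts", "varend", STACK_A))  # insert preferred symbol
print(error_move(False, "I", "end", STACK_B))         # delete symbol
print(error_move(False, "I", "for", STACK_B))         # local repair
```

The tests are ordered, so a synchronizing input symbol is never discarded: it either triggers a reduce or causes the preferred symbol to be inserted in front of it.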

4. An h-LR Example

4.1. The Grammar

The simple grammar given above will be used in this Section. The iteration is replaced by recursion and the labeling removed, to give the following grammar to be used in generating an LR parser.

 0. G' = G
 1. G  = A B
 2. A  = vars X1 varend
 3. X1 = V X2
 4. X2 = V X2
 5.    | ε
 6. V  = var id
 7. B  = stmts Y1 end
 8. Y1 = S Y2
 9. Y2 = S Y2
10.    | ε
11. S  = do for id
12.    | if id

4.2. The Parser

Here is the h-LR parser generated from the grammar:

 0: [G' = .G]                S    (G, 1)(A, 2)(vars, 3){G, A}
    [G = .A B]
    [A = .vars X1 varend]
 1: [G' = G.]                A
 2: [G = A .B]               S    (B, 4)(stmts, 5){B}
    [B = .stmts Y1 end]
 3: [A = vars .X1 varend]    S    (X1, 6)(V, 7)(var, 8){V}
    [X1 = .V X2]
    [V = .var id]
 4: [G = A B.]               R1
 5: [B = stmts .Y1 end]      S    (Y1, 9)(S, 10)(do, 11){S}(if, 12){S}
    [Y1 = .S Y2]
    [S = .do for id]
    [S = .if id]
 6: [A = vars X1 .varend]    S    (varend, 13)
 7: [X1 = V .X2]             S    (X2, 14)(V, 15)(var, 8){V}
    [X2 = .V X2]
    [X2 = .]                 R    (varend, 5)
    [V = .var id]
 8: [V = var .id]            S    (id, 16)
 9: [B = stmts Y1 .end]      S    (end, 17)
10: [Y1 = S .Y2]             S    (Y2, 18)(S, 19)(do, 11){S}(if, 12){S}
    [Y2 = .S Y2]
    [Y2 = .]                 R    (end, 10)
    [S = .do for id]
    [S = .if id]
11: [S = do .for id]         S    (for, 20)
12: [S = if .id]             S    (id, 21)
13: [A = vars X1 varend.]    R2
14: [X1 = V X2.]             R3
15: [X2 = V .X2]             S    (X2, 22)(V, 15)(var, 8){V}
    [X2 = .V X2]
    [X2 = .]                 R    (varend, 5)
    [V = .var id]
16: [V = var id.]            R6
17: [B = stmts Y1 end.]      R7
18: [Y1 = S Y2.]             R8
19: [Y2 = S .Y2]             S    (Y2, 23)(S, 19)(do, 11){S}(if, 12){S}
    [Y2 = .S Y2]
    [Y2 = .]                 R    (end, 10)
    [S = .do for id]
    [S = .if id]
20: [S = do for .id]         S    (id, 24)
21: [S = if id.]             R12
22: [X2 = V X2.]             R4
23: [Y2 = S Y2.]             R9
24: [S = do for id.]         R11

4.3. An Example Program

The following program is syntactically incorrect in several ways.

vars var I
stmts do I
⊥

We now trace the parser with this input.

input                      parse stack              synchronization stack

vars var I stmts do I ⊥    0$                       {⊥}$

The S move on input vars indicates that G and A are being recognized, and that the corresponding synchronization sets should be pushed.

var I stmts do I ⊥         3, 0$                    {varend, stmts, end, do, if}{ }{⊥}$
I stmts do I ⊥             8, 3, 0$                 {var} "
stmts do I ⊥               16, 8, 3, 0$             "

The required action now is a reduction to the distinguished nonterminal V. Since the input is a synchronizing symbol, no input is discarded.

V stmts do I ⊥             3, 0$                    {varend, stmts, end, do, if}{ }{⊥}$
stmts do I ⊥               7, 3, 0$                 "

Here the parser encounters an error move. Since the input is a synchronizing symbol, the R move is to be made.

X2 stmts do I ⊥            7, 3, 0$                 "
stmts do I ⊥               14, 7, 3, 0$             "
X1 stmts do I ⊥            3, 0$                    "
stmts do I ⊥               6, 3, 0$                 "

There is a unique valid terminal symbol in state 6. It must be produced by local repair. Since the input is a synchronizing symbol, the required symbol is generated and the input retained.

varend stmts do I ⊥        6, 3, 0$                 "
stmts do I ⊥               13, 6, 3, 0$             "

Reduction to another distinguished nonterminal.

A stmts do I ⊥             0$                       { }{⊥}$
stmts do I ⊥               2, 0$                    "
do I ⊥                     5, 2, 0$                 {end}{ }{⊥}$
I ⊥                        11, 5, 2, 0$             {do, if}{end}{ }{⊥}$

The symbol for is required. Local repair would likely transform the input identifier into the required symbol.

for ⊥                      11, 5, 2, 0$             "
⊥                          20, 11, 5, 2, 0$         "

An identifier is required and will be generated and inserted as a synchronizing symbol appears in the input.

$ID ⊥                      20, 11, 5, 2, 0$         "
⊥                          24, 20, 11, 5, 2, 0$     "
S ⊥                        5, 2, 0$                 {end}{ }{⊥}$
⊥                          10, 5, 2, 0$             "

At this point, there is an error in a state that has both S and R moves. Again, since the input is a synchronizing symbol, the R action is taken.

Y2 ⊥                       10, 5, 2, 0$             "
⊥                          18, 10, 5, 2, 0$         "
Y1 ⊥                       5, 2, 0$                 "
⊥                          9, 5, 2, 0$              "

Here end is required and will be generated.

end ⊥                      9, 5, 2, 0$              "
⊥                          17, 9, 5, 2, 0$          "
B ⊥                        2, 0$                    { }{⊥}$
⊥                          4, 2, 0$                 "
G ⊥                        0$                       {⊥}$
⊥                          1, 0$                    {⊥}$

And the program is accepted. The input has been transformed to

vars var I
varend* stmts
do for* $ID* end*
⊥

The symbols marked with * have been generated by the error repair process.

5. CONCLUSION

Hierarchic error repair can be applied to different parsing methods. We have successfully used the ideas presented here as the basis for the error handling in several compilers based on transition networks. (11,12,5,6,13) Hierarchic repair can also be used with LL parsers. (3)

This paper has shown how to use the general hierarchic repair techniques with LR parsers. A prototype implementation has been done. An h-LR parser skeleton has been written that behaves as described in section 3. Tables for the grammar presented in section 4.1 have been developed by hand. The parser with these tables was run on the input example of section 4.3 and produced exactly the trace that was shown. A parser table generator has not been written.

In a straightforward implementation there is some decrease in speed as the Synchronization Stack is manipulated while parsing correct programs. This decrease is negligible from a practical viewpoint because the entire parsing activity typically requires a small percentage of total compilation time. With good encodings of the synchronization data, R and S moves in an h-LR parser will be only a few machine instructions larger than in a standard LR parser. If this were deemed to be a serious problem (it is not by us) then a more sophisticated error handler could be built which constructed the Synchronization Stack from the parse stack when an error was detected.

The restrictions placed on the LR parsing method in order to use the techniques are reasonable in that they also facilitate human comprehension of the source language. Thus, the error repair techniques can give direction to the design of the concrete syntax of a programming language.

The method always effects repair. This means that the output of the parser always represents a syntactically valid program. Accordingly, when particularized for LR, it yields augmented parsers that always repair, for any grammar. No other published method is a general repair method for LR. The method is thus very attractive for practical compilers as it restricts all knowledge of syntax errors to the parser. Restrictions are placed on LR grammars to form h-LR grammars, and for these better repair is possible; that is, damage caused by errors can be localized.

The hierarchic error repair method is intuitive and easy to understand. It is natural in that it concentrates on the "important" parts of a program at the expense of the "less important" parts--as does a human reader. Its treatment of phases is particularly important. Many modern programming languages exhibit phasing. Hierarchic repair concentrates on this phenomenon to reduce cascading of diagnostic messages that can occur in other methods.

Thus the method warrants practical implementation and investigation of its behavior in real programming environments.


ACKNOWLEDGMENT

The presentation of the material in this paper benefited from comments by W. Lalonde, E. C. R. Hehner, E. S. Lee, and D. G. Corneil.

APPENDIX

In practice, the LR family of methods is only attractive if the LALR(1) tables can be constructed for a given grammar. This is because the LALR tables are smaller (fewer states) than the LR tables for all grammars with k > 0. From the point of view of error repair, LALR(1) tables are less attractive than LR(1) tables because some reductions may occur before detecting an error. Consider the following grammar:

0: S' → S
1: S  → x A a
2:    | y A b
3:    | z A d
4:    | z c e
5: A  → c

This grammar gives rise to the following parser:

 0: [S' → .S]      S    (S, 1)(x, 2)(y, 3)(z, 4)
    [S → .xAa]
    [S → .yAb]
    [S → .zAd]
    [S → .zce]
 1: [S' → S.]      A
 2: [S → x.Aa]     S    (A, 5)(c, 6)
    [A → .c]
 3: [S → y.Ab]     S    (A, 7)(c, 6)
    [A → .c]
 4: [S → z.Ad]     S    (A, 8)(c, 9)
    [S → z.ce]
    [A → .c]
 5: [S → xA.a]     S    (a, 10)
 6: [A → c.]       R5
 7: [S → yA.b]     S    (b, 11)
 8: [S → zA.d]     S    (d, 12)
 9: [S → zc.e]     S    (e, 13)
    [A → c.]       R5   {a, b}
10: [S → xAa.]     R1
11: [S → yAb.]     R2
12: [S → zAd.]     R3
13: [S → zce.]     R4

Because of the shift/reduce conflict in state 9, this grammar is not LR(0). In fact, it is SLR(1) and removal of production 4 would yield an LR(0) grammar (generating a different language, of course).

State 6 is the one of interest. In an LR(0) parser (assuming production 4 were deleted from the grammar) this state would have reduce entries for all terminal symbols. In an SLR(1) parser, it would have reduce entries for {a, b, d} and error entries for all other terminal symbols. In an LALR(1) parser, as above, it would have reduce entries for {a, b} and errors for all other terminal symbols. In an LR(1) parser state 6 would be split into two states, say 6 and 6': the transition under c in state 3 would go to state 6'; state 6 would reduce only on a, state 6' would reduce only on b.

Consider the action of the given parser on the input xcb.

input    parse stack

xcb⊥     0$
cb⊥      2, 0$
b⊥       6, 2, 0$

The next input symbol is an error: b cannot be appended to xc to form a valid prefix in the language. Since the LALR parser has the valid prefix property, it promises to declare error before shifting b. However, now in state 6 it is about to reduce, collapsing part of the parse tree. The parser continues:

Ab⊥      2, 0$
b⊥       5, 2, 0$

At this point an error is detected: before shifting b, but after having done a reduction.
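The behavior just traced can be reproduced with a small table-driven driver. The Python sketch below is an illustration only. It encodes just the LALR(1) entries reachable on inputs beginning with x or y, takes the merged lookahead {a, b} for state 6 from the discussion above, and uses a conventional goto table instead of pushing the nonterminal back onto the input. The names `ACTION`, `GOTO`, `PRODUCTIONS`, `END` and `parse` are hypothetical.

```python
END = "EOF"  # stands in for the end marker

# Production number -> (left side, length of right side).
PRODUCTIONS = {1: ("S", 3), 2: ("S", 3), 5: ("A", 1)}

# Partial LALR(1) action table: only states reachable after x or y.
ACTION = {
    (0, "x"): ("shift", 2),
    (0, "y"): ("shift", 3),
    (2, "c"): ("shift", 6),
    (3, "c"): ("shift", 6),
    (5, "a"): ("shift", 10),
    (7, "b"): ("shift", 11),
    (6, "a"): ("reduce", 5),  # merged lookahead set {a, b} in state 6
    (6, "b"): ("reduce", 5),
    (10, END): ("reduce", 1),
    (11, END): ("reduce", 2),
    (1, END): ("accept", None),
}
GOTO = {(0, "S"): 1, (2, "A"): 5, (3, "A"): 7}


def parse(tokens):
    """Return the moves made; the last one is 'accept' or an error report."""
    tokens = list(tokens) + [END]
    stack, pos, moves = [0], 0, []
    while True:
        kind, arg = ACTION.get((stack[-1], tokens[pos]), ("error", None))
        if kind == "shift":
            moves.append(f"shift {tokens[pos]}")
            stack.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = PRODUCTIONS[arg]
            del stack[len(stack) - length:]       # pop the right side
            stack.append(GOTO[(stack[-1], lhs)])  # then take the goto
            moves.append(f"reduce {arg}")
        elif kind == "accept":
            moves.append("accept")
            return moves
        else:
            moves.append(f"error at {tokens[pos]} in state {stack[-1]}")
            return moves


print(parse("xcb"))
# ['shift x', 'shift c', 'reduce 5', 'error at b in state 5']
```

On the input xcb the error is reported in state 5, after the reduction by production 5 has already been made. With LR(1) tables the split state 6 would reject b before reducing.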

The possibility of reductions occurring before errors are detected is a problem for hierarchic error repair if the LALR tables indicate a reduction where the error repair strategy dictates a shift. But hierarchic error repair usually wants to reduce, unless the corresponding nonterminal is one that generates the list of lines in a phase, and the input symbol is not a synchronizing symbol. This means that in order to have a conflict between the LALR tables and the hierarchic strategy, it must be the case that:

1. there is an LALR state, say Q, indicating a reduction to a nonterminal, say Y, representing the list of lines in a phase,

2. Q has at least two predecessor states,

3. one of the predecessor states leads to a state which requires Y to be followed by a symbol, say x,

4. another of the predecessor states leads to a state which does not allow Y to be followed by x, and

5. x is not a synchronizing symbol for Y (or any of the distinguished nonterminals that can produce Y) in the hierarchic repair scheme.

If all of these circumstances occur, then the appearance of x when the parser is in state Q must cause a reduction in an LALR parser. However, the hierarchic repair strategy requires that there not be a reduction because x is not a synchronizer for Y. This can only happen when the nonterminal Y is used in two different contexts: once to generate the list of lines for a phase (in this use it has the synchronizer x), and again in some other way. (Such dual usage is inadvisable because it makes the language more complex for human readers.)

In summary, it is possible to define h-LALR grammars as having the h-LR property, and the restriction that such states as Q described above do not appear in the generated LALR parser.

REFERENCES

1. A. V. Aho and T. G. Peterson, "A Minimum Distance Error-Correcting Parser for Context-Free Languages," SIAM Journal of Computing, 1(4):305-312 (December, 1972).

2. D. T. Barnard, "Automatic Generation of Syntax-Repairing and Paragraphing Parsers," Computer Systems Research Group, University of Toronto, Technical Report CSRG-52 (April, 1975), pp. 1-132.

3. D. T. Barnard, "Hierarchic Syntax Error Repair," Ph.D. Thesis, Department of Computer Science, University of Toronto (March, 1981), pp. 1-253.

4. D. T. Barnard, "Syntax Error Handling Techniques," Technical Report 81-125, Department of Computing and Information Science, Queen's University (Canada), pp. 1-24.

5. D. T. Barnard, H. Boldt, and H. Mankovitz, "Interactive Graphics for Compiler Specification," Proceedings CIPS Conference, Canadian Information Processing Society (June 1981), pp. 1-5.

6. J. R. Cordy and R. C. Holt, "Specification of Concurrent Euclid (Version 1)," Technical Report CSRG-131, Computer Systems Research Group, University of Toronto (August, 1981).


7. C. N. Fischer, B. A. Dion, and J. Mauney, "A Locally Least-Cost LR-Error Corrector," Technical Report 363, Computer Sciences Department, University of Wisconsin-Madison (August, 1979), pp. 1-47.

8. C. Ghezzi, "LL(1) Grammars Supporting an Efficient Error Handling," Information Processing Letters, 3(6):174-176 (July, 1975).

9. S. L. Graham and S. P. Rhodes, "Practical Syntactic Error Recovery," CACM, 18(11): 639-650 (November, 1975).

10. S. L. Graham, C. B. Haley, and W. N. Joy, "Practical LR Error Recovery," Proceedings SIGPLAN Symposium on Compiler Construction, SIGPLAN Notices, 14(8):168-175 (August, 1979).

11. R. C. Holt, D. B. Wortman, D. T. Barnard, and J. R. Cordy, "SP/k: a System for Teaching Computer Programming," CACM, 20(5):302-309 (May, 1977).

12. R. C. Holt, D. B. Wortman, J. R. Cordy, and D. R. Crowe, "The Euclid Language: A Progress Report," Proceedings ACM National Conference (December 1978).

13. R. C. Holt, J. R. Cordy, and D. B. Wortman, "An Introduction to S/SL: Syntax/ Semantic Language," TOPLAS, 4(2):149-178 (April, 1982).

14. J. J. Horning, "What the Compiler Should Tell the User," in Compiler Construction: An Advanced Course, F. L. Bauer and J. Eickel, Eds. (Springer-Verlag, 1976), pp. 525-548.

15. D. E. Knuth, "On the Translation of Languages From Left to Right," Information and Control, 8(6):607-639 (1965).

16. J.-P. Levy, "Automatic Correction of Syntax Errors in Programming Languages," Acta Informatica, 4:271-292 (1975).

17. M. D. Mickunas and J. A. Modry, "Automatic Error Recovery for LR Parsers," CACM, 21(6):459-465 (June, 1978).

18. A. B. Pai and R. B. Kieburtz, "Global Context Recovery: a New Strategy for Syntactic Error Recovery by Table-Driven Parsers," TOPLAS, 2(1):18-41 (January, 1980).

19. T. J. Pennello and F. DeRemer, "A Forward Move Algorithm for LR Error Recovery," Proceedings 5th ACM POPL Conference, ACM (1978), pp. 241-254.

20. J. Rohrich, "Methods for Automatic Construction of Error Correcting Parsers," Acta Informatica, 13:115-138 (1980).
