Compilers(Question about Computer Science)

gkswogh7220
compilers7.pdf

Dr. Ernesto Gomez : CSE 570/670 We now consider parsing algorithms. This material is in chapter 4

1. Support algorithms : First and Follow

Recall that parsing is the application of grammar definitions in reverse. Take our example grammar for arithmetic expressions: S → E E → E + T| E → T T → T ∗F | T → F F →id | F ⇁num | F → (E) with N = {S,E,T,F}, T = {+,∗,(,), id, num} (Same grammar, with added terminal “num” and using “or” for a more compact

representation). In order to pars, we need to have some idea of what different grammar rules can

generate, so we can make decisions on what text is a candidate for replacement by the left-hand side of a rule. (In the terminology we defined in the previous lecture notes, we need to identify the “handle” - which is in (N ∪T)∗

FIRST sets tell us, given any symbol in (N ∪T) what is the first character that can be produced by that symbol, FOLLOW tells us what can appear immediately after each symbol in N.

The following material gives a slight variation on the algorithms presented in 4.4.2,

2. The following material gives a slight variation on the algorithms presented in 4.4.2

3. First Sets:

To generate: FIRST(X) for all X ∈ N ∪T. (1) If X ∈ T , FIRST(X) = {X}. (2) If X → ε is a production, add ε to FIRST(X). (3) If X → Y1Y2 . . .Yi . . .Ykis a production: Base case i = 1. Induction:

add everything in FIRST(Yi) except for ε to FIRST(X), then if ε is in FIRST(Yi) increment i and repeat; else stop. If i > k add ε to FIRST(X) and stop. Repeat for all productions.

(4) Repeat step 3 until there is no change to any of the FIRST sets.

4. Follow Sets:

To generate: FOLLOW(X) for all X ∈ N. Add end marker $ /∈ N ∪T to symbol set. (1) Place $in FOLLOW(S), where S is the start symbol in G. (2) If A → αBβ is a production, add everything in FIRST(β) except ε to

FOLLOW(B). Repeat for every production and every variable that is not at the end of the production

(3) If A → αBβ and ε is in FIRST(β), or A → αB is a production, add everything in FOLLOW(A) to FOLLOW(B).

(4) Repeat step 3 for every non-terminal in every production until nothing new is added to any of the FOLLOW sets.

1

2

5. LR parsing automata

We here discuss material from sections 4.5 to 4.7 in Aho and Ullman, on bottom- up parsers using LR methods.

We are not going into recursive descent topdown methods (section 4.4) because they only work for a strict subset of the languages that can be parsed by the LR(k) bottom-up methods. The top-down method described in

https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler is more general than LR(k), it is claimed to be able to parse all the context-

free languages, whereas LR(k) can parse a subset of the deterministic context free grammars. As developed so far, the new top-down methods and tools work sub- stantially more slowly than LR(k) - both in complexity which is worst-case cubic, and in speed on equivalent languages. At present, the bottom-up methods that we will study are more practical, and have more mature tools that we can use. This may change in the future, so we should be aware of ongoing research in parsing.

The last two sections of Lecture notes 6 are essentially commentary on section 4.5 in Aho and Ullman. Read that section mostly for the examples of how bottom-up parsing works, but the real work begins in 4.6.

As we have commented previously, we are (most likely) unable to develop any intuition of how a specific bottom-up parse is going to work on all but the simplest of cases. What we are going to do is develop algorithms and theorems that will allow us to construct a push-down (shift-reduce) automaton, that will input source text w,shift characters onto the stack and, by performing reduction operations on the stack to build a string α ∈ (N ∪T)∗ on the stack. Appending the stack to the front of the part of wnot yet processed results in a right sentential form, part of the derivation of w. By showing the correctness of our algorithms we can show that our automaton will ether reduce w to our start symbol S (in which case the record of the parse steps is a derivation of w) or reject was not in the language.

6. LR(0) parse automaton

We construct the canonical LR(0) sets, these are the basic states of an LR automaton without lookahead. These will be used as the baseline on which we will construct more powerful automata with lookahead. It is easiest to go through an example of our construction - we will again start from a version of our example grammar.

We have T = {+,∗,(,), i} where istands for an identifier or number. U = {S,E,T,F}. Our grammar corresponds to example 1 in section 4.1 of the text, since it is left-recursive and unambiguous (of the examples, 2 is right recursive designed for top-down, and 3 is ambiguous - although it generates the same strings since, + and * are both produced from E,there is no restriction on the order in which these operations are generated so we can have more than one derivation for some of the strings.

Our example grammar is: S → E E → E + T E → T T → T ∗F T → F F → i

3

F → (E) When doing a derivation, we always begin with the first rule - also called pro-

duction.

Definition 1. LR(0) items : productions in a context free grammar L ← α with a dot (or period) anywhere in α.

For example, add a period before the E in to create an LR(0) item: S → .E Informally, the “.” means that we have not yet “read” or processed the E. After

we process the E our item would become S → E. - moving the dot is equivalent to a step in the parse. S → .E gives what we call the kernel of the state.

The kernel is the starting point for a state, the kernel S → .E also the starting point for every derivation using our example grammar, so we use it to construct our first LR(0) state. Since this is the start state, we will call it I0. Initially, I0 contains just the kernel.

to build the full state we define:

Definition 2. closure(I) : given an LR(0) kernel in a grammar G construct all the items in the state:

(1) Base case: add all items in the kernel of I to the set: closure(I). (2) For each item in closure(I), of the form A → α.Bβ , and there are one or

more production in Gof the form B → γi (for any γi), add items B → .γi to closure(I). Repeat until no more items are added. (Note that A → α.Bβ includes the possibility that α or β or both are the empty string.)

What we are doing here - if we have an item with a “.” in front of a non-terminal A, then A could be replaced by the right-hand side of any production that has A on the left hand side without having to process anything. For example, since the kernel of I0 is S → .E, then without processing anything this could be E → .E +T or E→ .T. Once we have these in the closure, then we have E → .T, and so we can add T → .T ∗F and T → .F . Once we have an item with .F in the closure, we can add F → .i and F → .(E) to the closure. So ultimately we get: I0 =closure(I0): S → .E E → .E + T E → .T T → .T ∗F T → .F F → .i F → .(E) The closure defines all the possibilities in the state - before anything gets pro-

cessed. We will take the closure to define the complete state. The characters immediately to the right of each dot are the things we could process next, either by a SHIFT action - read ( or i, and push it on the stack - or through a REDUCE action - pop a handle off the stack and push E, T or F on stack top. So - processing a character - signified by moving the dot from one side of it to the other - results in that symbol ending up on stack top, and these are the only symbols that could end up on top of the stack after this state.

How do we get other states?

4

Definition 3. GOTO: We define goto(Ik,X) → Ij where Ij,Ikare states (which could be the same state). We construct goto as follows:

(1) Let {A → α.Xβ} be the set of items in state Ik with a specific symbol X ∈ N ∪T .

(2) Move the . over X to construct the set of items {A → αX.β}. This is the goto action applied to Ij for items with .X, it builds the kernel of set Ij

(3) Closure(goto(Ij,X)) is a state Ik with items {A → αX.β} in the kernel. Note that Ij may already exist. When defining the goto function from any state, we compare the kernel {A → αX.β} to the kernels of existing states, if it matches a state Ik (where k = i is a possibility) then the transition is to that state, otherwise we create a new state with a new subscript.

For example, closure(I0,E) gives a state: I1 : S → E. E → E. + T (since neither item has a dot in front of a non-terminal, closure(I1) does not add

any items. Consider closure(I0,”(” ) to get the kernel for a state I2 . This gives a single

kernel item F → (.E). However, closure(I2) is: I2 : F → (.E) E → .E + T E → .T T → .T ∗F T → .F F → .i F → .(E) Because .E gives both E productions, which give both T productions and then

both F productions. Notice, however, that goto(I2, “(“ ) gives us the kernel item F → (.E) so it just goes back to itself.

We define a recursive algorithm to construct all possible states as follows:

Definition 4. Algorithm to construct the canonical LR(0) states for gram- mar G

Add the production S → A where Ais the left hand symbol of the first production in G.

Base case: State I0 is the closure of S → .A, the unique start production such that A ∈ N. Add I0 to the set LR(0).

Recursion: For every unmarked state I in LR(0) : calculate closure(goto(I),X) for all symbols X in I with dot on the left ( all .X). Mark state I as processed. Repeat until all states in LR(0) have been processed.

We know the algorithm must halt: As we add states, two things happen: we have more existing states so it becomes more likelier that the kernel produced by a goto() from any existing state will match a kernel that alreadey exists. Also, the dot keeps moving farther to the right on new kernel items, and once it is all the way to the right (as in S → E. in state I2) that item will not give us anything in the goto function. (Otherwise, we note that all items must be constructed from Gwhich has a finite number of productions, each production has a finite number of

5

possible dot positions, and there are a finite number of combinations of all possible items that may appear in a kernel). .

Does it produce all possible states? We begin from the kernel of the start state, which is built using the first rule in a context-free grammar. Since this is the starting point for every derivation, it is correct.

Does closure identify all the possible items in a state that do not require process- ing a symbol? Closure applies to every item in a state that has a lefthand side of the form α.Aβ . The dot signifies the current state, symbols to the left of the dot have already been processed, symbols after A will not be reached until A is processed. Since A is a non-terminal, it is the lefthand side of at least one production, so it could be replaced by the righthand side of those productions - that is, the next thing we see in the parse could be any of those productions - so it is correct to add the corresponding items A → .γ to the state. A dot in front of a terminal does not give us added possibilities, because the symbol can not be the lefthand side of a rule. Therefore closure correctly generates all the items in the state given a kernel.

Does goto generate all the states that can be reached from any state I? Goto takes every item in I and builds new kernels by moving the dot one place to the right of a symbol X, which would result from processing X. In LR(0) with no lookahead we can not distinguish between two items with the same next symbol X, a deterministic automaton would always transfer to the same state on the same symbol. Goto examines all instances of .X in every item of I , so it identifies all symbols that could appear in transitions from state I.

We conclude : Given an LR(0) item or set of such items, and a grammar G, closure(I) correctly generates all and only those additional items that do not re- quire processing any symbols. Then, repeated application of goto to every item in I correctly generates all and only those kernels that represent reachable states. Re- peated application of goto and closure from any starting state must halt, generating all and only states that are reachable from our start.

Every proof in G begins with the kernel written from the start production, S → .A, so our algorithm generates all and only those states that can be generated from grammar G.

Remark. When applying the above algorithm, by hand or in a program - be sure to save the GOTO functions as well as the Ij states. A common error is to calculate all the LR(0) states and then realize that to actually build the automaton we need the transitions between states, and that is given by GOTO. Also note - as is the case with finite automata, the particular state labels don’t matter, if the transitions are the same it is the same automaton. Notice that the states we showed in the example in these notes match states in the automaton constructed from this same grammar in the text (page 244) although our labels here are not always the same.