Compilers(Question about Computer Science)
Dr. Ernesto Gomez : CSE 570/670
Look at section 3.5 in the text, which goes into how LEX works and how to use it. There is a useful summary of the material in Chapter 3, starting on page 189.
1. Parsing
Read section 4.1, which is a general introduction to parsing. In particular, 4.1.2 lists three equivalent grammar specifications for the same arithmetic expressions - we will be using these, and variants of them in most of the examples. 4.1.3 and 4.1.4 are useful background, but: With modern compilers and computer speeds, the compilation cycle is much faster than it used to be - trying to find and report all the errors in a program can be necessary if the edit-compile program cycle takes hours (typically it took a day or more before the personal computer era), but when you can compile in minutes or less on an interactive system, it is legitimate to report the first one or two errors and halt the compilation. In practice, if we see a list of 50 errors (as an example), most of us will correct the first two and then try again, based on the experience that many of the later errors are triggered by earlier errors, many errors will disappear when we fix the early ones.
1.1. Context-Free languages and grammars (CFG).
1.1.1. Parsing and Derivation. We have defined a language L as a subset, selected from a set of strings that are combinations of specific building blocks. For regular languages, the building blocks were a set of symbols we called an alphabet Σ, we later allowed some of our building blocks to be more complicated things, that could be defined by regular expressions (keywords like if and while, things like number, built up of a restricted set of alphabetic characters and having a format that can be defined by a regular expression (RE)). We have called this the set of terminals T. Therefore L ⊆ T∗ (or Σ∗) and we have a set of rules defined in some way that say what combinations of symbols are in the set L and what combinations are not.
We divide languages into classes, depending on how complicated the rules can get, and what theoretical machine can implement the rules. Our simplest class, the regular languages, are defined by a regular expression and an alphabet Σ. There are is an allowed start state, denoting the symbol(s) that can appear at the start of a string, and what can be done next after seeing a particular character. The rules can be used to generate a string by successive application of rules, this is called a derivation. Or, given a string, the rules can be used to decide if the string is in the language or not, called a parse. We have seen that a particular machine, the Finite State Automaton (FA), which can be deterministic (DFA) or non-determinstin (NFA), is equivalent to a regular expression , and in fact that NFA is equivalent to DFA. This means - any derivation that can be done using the regular expression can also be done by a DFA or an NFA, and that any language that can be implemented on an FA of any kind can be defined by a regular expression. Parse of a regular language is usually done by submitting the characters of a string, in order, to a FA and verifying that the machine is in a halt state when the last character has been read.
We have other classes of languages, which correspond to more complicated def- initions and machines The regular expressions RE give us the regular languages. The next class of languages, the context-free languages CFL, are equivalent to a pushdown automaton PDA, which is an FA plus a stack. Same as is the case for
1
2
FA, There are two sub-classes, the deterministic context-free languages - DCFL and the non-deterministic NCFL. Unlike the case with FA, the NSCFL are a bigger class than DCFL - the NCFL can handle strings that a DCFL can not.
Example 1. Consider the language of palindromes - these are strings that look like 1123333211 - the second half of the string is the reverse of the first half. It looks like this could be done with a stack, like we do for parenthesis matching ( when we see “(“ we push it on the stack, when we see “)” we pop. If we have to pop but the stack is empty, we reject because we have a close parenthesis without an open. If we reach the end and the stack is not empty, then we reject because it means some open parentesis Like what we do to match parentesisis not matched by a close). We can pushing every character in the first half of our example, then the stack looks like 33211 (from top to bottom), the second half of the string is 33211, so every time we read a character in the second half then we match it with stack top. If they are the same, we POP the stack, read the next character and see if it matches. Keep this up - if we reach end of text without finding any mismatch between character and stack, and the stack is empty we accept - the string is a palindrome - like our example. If we find some character that does not match the stack, or the stack is not empty wehen we have finished the string, we know the second half is different and reject. But how do we know we have reached the middle, so we can stop pushing? A DCFL does not know - it would have to read the entire string and count it, then “unread” it - you can’t do this with just one stack. An NFCL can correctly guess where the middle of the string is.
Replacing a stack with a tape that you can move right or left, so you can read (or write) something and go back to it multiple times gives us a Turing Machine which is our most powerful computational device (quantum computers could be exponentially faster on some problems, but it looks like the total set of languages (problems) that could be solved by a quantum computer is the same as for a TM (also true for every other known computational device). A two way tape, by the way, can be simulated with two stacks - moving the tape is equivalent to moving the item at the top of one stack to the other. So you can think of:
• FA : a state machine • PDA : a state machine with a stack - if you ignore the stack, it is exactly
the same as an FA • TM : a state machine sith two stacks - take away one stack, it becomes a
PDA. Take away both, it is an FA • Adding more stacks or tapes allows a TM to work faster, but it doesn’t
change the set of problems/languages it can do. Neither does non-determinism, like FA a non-deterministic TM can solve the same set of problems as a de- terministic FA, it is just faster and easier to program.
• There are problems that can not be solved by a TM - the uncomputable problems.
So we get the following language classes: RE ⊂CFL ⊂TM languages ⊂ Uncomputable Many people divide the Turing Machine languages into sub-classes like context-
dependent, phrase-structure, others. I find the more general classification more useful, since it maps to hardware. The sub-classes of the Turing languages all need a TM to derive or parse.
3
Just in case you are wondering, it is not hard to define a language that fits into the Uncomputable category. Consider the language “all Lisp programs that can be written in less than a million bytes, that do halt on all inputs”. This is a finite set, because we have an upper limit on the program size. If we had the list of programs in L it could be generated or parsed by FA, our simplest machine. We build a specific FA for each proream text. then call them with a giant OR statement. An NFA could do this by always picking the right alternative, and we can always convert it to a DFA or simulate parallelism like Lex. The problem is, we cant build the list because it requires us to solve the halting problem, and we can prove that our most powerful device (TM) can’t sove it.
What kind of language do we need? It is tempting to go for a TM language, this is the largest category, it includes everything we know how to do on a machine with an algoritm. Recall, however that we are going for a very limited, formal and unambiguous language. Perhaps a simpler machine would do. The context-free languages start out with two properties we want.
To begin with, it turns out that the property that distinguishes a non- deter- ministic PDA from a deteministic one is ambiguity - an ambiguous language can be parsed by an automaton that always chooses the right path when chosing mul- tiple alternative possibilities. Unlike FA, there is no algorithm for converting an NPDA to a deterministic one. So by choosing deterministic context-free languages we restrict to unambiguous languages. We like context-free, it means translation depends on local information, not on context. Technically, a CFL is not defined by any property we can describe in English - it is defined by the form of a Context Free Grammar (CFG) we use to desfibe such a language. The end result, however, is that translation for CFL is local - we have replacement rules for parsing or deriva- tion that only depend on the text we are replacing or generating without regard for the rest of the program.
So what does a CFG look like? An example: T = {+,∗, (, ), id } - this is the set of terminals, that is text that can appear in
the final program. N = {S,E,T,F} - these are non-terminals - essentially these are names of gram-
matic structures, things like “noun” and “verb”, which can appear in specific places in a string but not others, and which will be replaced by terminals in a string of the language. In this case, E stands for expression, T is a term (that can be added), F is a factor (can be used in multiplication). S is the start symbol, technically it does not have to be an additional symbol - we could select one of the other symbols as start, but it is convenient to designate a special symbol S and a start rule that just says S can be replaced by some other string of symbols. We will use the symbol → to mean that whatever is to the left can be replaced by whatever is on the right, and the symbol | to mean “or”.
A grammar for arithmetic expressions could be (similar to example 4.1 in the text): S → E E → E + T E → T T → T ∗F T → F F →id
4
F → (E) Each line is a replacement rule called a production, it means - in a derivation, the
lefthand symbol can be replaced by the string on the right (in a parse, the string on the right can be replaced by the symbol on the left). The arrow links the two sides, and points in the direction of the replacement in a derivation.
We could have written this more compactly using “or”, for example the two rules that begin with E could have been written in the single line - this is still two replacement rules. E → E + T|T
Example 2. Derivation: we derive the string “x1+a2”, ( x1 and a2 follow the rules for an id). In a derivation, we replace the lefthand side of a rule by the string on the right. S → E → E + T → E + F → E+id → E+a2 → T+a2 → F+a2 → id + a2 →x1
+ a2 This is a rightmost derivation - we always replace the term farthest to the right.
A leftmost derivation would derive the same string, but expanding the term farthest to the left at each step. A parse is the reverse of a derivation, we would identify the left-hand side of a rule in the text (we call this the handle), then replace the handle with its left-had side (always a non-terminal in N). If we can reduce text to the single start symbol, the parse is successful, and reversing the steps in the parse will produce the derivation.
Notice that we start a derivation from a symbol in N(non-terminals), and end it with a string in T∗(terminals). The intermediate steps are in (N ∪T)∗. In order to make these distinctions clear without having to explain every time we use them, the Aho & Ullman use notational conventions described in section 4.2.2, p. 198, and so will these notes. To briefly summarize,
1.2. Notational conventions. • Lower case letters early in the alphabet (a, b, c, ...), math symbols, digits
and lower case short words like id in our grammar are in T. • Upper case letters early in the alphabet, (A, B, C,...) and upper case letters
that are meaningful initials (S for start, E for expression ...) are in N • Lower case letters late in the alphaber (w,x,...) are strings in T∗ • Lower case Greek letters eary in the alphaber (α,β, ...) are strings in
(N ∪T)∗ • Upper case letters late in the alphaber are single characters/symbols in N ∪T • Greek lower case � is the empty string, any set defined with the * operation
includes the empty string as a possibility. The book describes this in greater detail and uses more examples - you should bookmark page 198 - all pages after this can show something like X1X2X3 and expect you to know that this is a string of three symbols in N ∪ T and you will need to refer back to the page where the notation is defined.
Anyway, we can now give a formal definition of a Context Free Grammar (CFG):
1.3. CFG definition.
Definition 3. A CFG is a grammar with a set T of terminals , a set Nof non- terminals, a selected start symbol S ∈ N,and a list of productions (replacement
5
rules) of the form A → α, such that the first rule in the list is of the form S → β (NOTE: the start symbol could be S, but it could also be a different letter, and there could be more than one rule with the start symbol on the left. Our example grammar could have taken E to be the start symbol. Adding an unique start symbol S and defining a single production that uses it - sometimes called an extended grammar - avoids comfusion and makes proofs and algorithms a bit easier).
The form of all the productions in a CFG is what gives us context-freedom. This form says that the replacement of A with α. or vice-versa, is not dependent on surrounding characters (context). There are other advantages. Assume the grammar is epsilon-free, there are no productions of the form A → � . Consider what happens in a derivation, we can describe it as: S → β1 → β2 → . . .βn → w (we can write this S →∗ β → w - meaning we start
with S, make zero or more derivation steps until we end up with w ∈ T∗) A parse is the reverse of a derivation, every step replaces a string α in one of the
β steps with a single non-terminal, the resulting string will not be longer than the original. Starting from w, every step in the parse gives us somthing the same length or shorter. Therefore we can guarantee that the steps in the parse never lengthen our string, so we always halt by either getting stuck along the way or producing a single non-terminal character at the end of the parse, if we get here and end up with S we accept. If we get stuck before we reach a single symbol, or if we don’t end up with the start symbol, we reject. The parse has to halt and answer YES or NO.
Some grammars have productions with the empty string � on the right hand side, this violates the assumption that the parse step always makes the string we are working on in the parse shorter. But there is an algorithm to transform any grammar to epsilon-free form, so we can extend our argument that the parse always halts to grammars with epsilon productions. Our algorithms will have to be able to deal with epsilon productions, but all the grammars we will use in class or lab will be epsilon-free.
1.4. Derivations, parse trees and ambiguity. The book describes two methods of parsing in chapter 4, the Recursve Descent method, a top down method that works with left-recursive grammars and the bottom-up LR(k) methods that work with right-recursive grammars. Our example grammar is right recursive: Consider E → E +T and E → T . The first rule is recursive, it means we can keep adding +T to the right of E , to get a string of sums E+T +T +T.... The second rule is the halt condition - we can say that E becomes a Tand that lets us stop adding more +T . An alternative set of rules is in example 4.1 in the book: E → TE′, E′→ +TE′ and E′ → �. The first rule starts the recursion with a T, the second rule keeps adding +TE’ (this expands the chain of sums on the left, hence left-recursive), the third rule gets rid of E’ by changing it to epsilon and halts the recursion.
Remark 4. The left-recursive rules are alot more complicated than the right recursive rules - couldn’t we just say E → T + E , E → T ? Two problems - in parsing top- down we see a T - how deep do we have to read into the part of the input after the T to figure out if there is a +E following, or another +T? This has practical implications for the complexity of the algorithm. In the syntax from example 4.2, if the next thing after a T is not a +, we know immediately that we have reached the end of the sums.
6
A deeper problem - How do we know the two grammars 4.1 and 4.2 represent the same language? For examples 4.1 and 4.2, we have an algorithm that transforms the right-recursive 4.1 to the left-recursive 4.2, and the algorithm can be proved correct - this gives us a proof that both grammars define the same language. Absent a specific algorithm tht we can prove correct, we have no way of proving that two grammars define the same language. We could do this to FA because they are finite - so we could define straightforward (although high complexity) algorithms to make an NFA deterministic, and to minimize the resulting DFA and compare the final, minimal aotomata. Since a regular language is given by the structure of the automaton, if we minimize two different automata they either end up having the same graph or not. Since transforming NFA to DFA and minimizing do not change the language, comparing the final minimal DFA gives us the answer.
Pushdown automata (PDA) and Turing machines are technically infinite, not finite machines - because a complete state of a PDA includes an arbitrarily deep stach (or tape, for a TM). It turns out that we do not have an algorithm to decide if two PDA have the same language, we have no algorithm to determine a minimal PDA. The situation gets more complicated when we consider the language (of a PDA or CFG). Recall that a language L ⊆ T∗ (Why not (N ∪T)∗? because text in the language is made up of terminals - the non-terminals are part of the grammar, and different grammars, with different sets N can represent the same language). There is a chicken-egg problem here - how do we know a what a language L is, without a grammar? So typically we start with a context-free grammar (CFG). Once we have a grammar G for language L, we know there are other grammars for L. Is some grammar G′ have the same language? We dont have an algorithm to decide this, unless G′ is the result of an algorithm applied to G - for example, if G is a grammar with epsilon productions, we coule apply an algorithm that converts it to epsilong-free. We can start with our example grammar 4.1 and apply an algorithm to eliminate left-recursion - this gives us the right-recursive grammar in example 4.2. So we know that 4.1 and 4.1 are grammars for the same language. If we just write a new grammar with E → T + E insted of E → E + T it certainly looks like the same language, but we have no way of proving it.
Back to terminology: in a derivation, we begin from S and the first step might be something like S → E + T . What is our next step? In our example, we always expanded the rigthtmost non-terminal first, after a series of transformations gets us to a terminal, we go back to the next non-terminal to the left of our starting point (which is now the rightmost because we expanded every non-terminal as far as we went. We can represent the steps of a derivation as a tree, (see parse trees in section 4.2.4), a rightmost derivation generates the branches stating on the right, a leftmost derivation expands the leftmost term and generates the branches on the left first.
Definition 5. A grammar is not ambiguous if a rightmost derivation generates the same tree as a leftmost derivation. This is equivalent to saying that a grammar is not ambiguous if there is only one rightmost (or leftmost) derivation for any string w. Why define ambiguity in this way? Ambiguity implies that a string w has more than one meaning; but we are going to do a syntax-directed translation (because we are dealing with formal languages, the syntax is a hard line between strings in L and strings not in L) . We are going to attach the meaning to the parse tree - two different parse trees will correspond to different actions and different meaning.
7
1.5. Strings that are parts of derivations. We need terminology to describe a derivation. In particular - we know that the derivation begins with the start symbol, and ends in a string w in the language. What happens to all the terms in between? They are strings in (N ∪T)∗, which our standard notation lables with a lower case Greek letter (see above). we define:
Definition 6. Sentential form: an intermediate step in a derivation, a string in (N ∪T)∗. A right sentential form is part of a rightmost derivation, same concept for a left sentential form.
We are more concerned with right sentential forms, because we are going to be studying bottom-up parsing, which is equivalent to a rightmost derivation in reverse order. (The top-down recursive descent algorithms described in the text are generally not suitable for languages not designed for them, because their worst- case complexity can be exponential. The newer methods, which you can read about in the link from Lecture notes 5 and also linked from the class webpage, have worse case cubic complexity, but I dont know them well enough to describe them in class yet).
The bottom-up method we are going to study is:
Definition 7. LR(k) - L means “read the text Left to right”, R means we are going to match a rightmost derivation, k is the level of “lookahead”. As in a finite automaton, our procedure is to read a character from the input text, decide what to do and advance to the next character- we are not allowed to step back, read the character again and try something else. Lookahead means we are going to cheat and look at k characters from the input text before we actually read them. The various theorems and algorithms defined in LR(K) allow us to start from a CFG and construct a table-driven push-down automaton that parses that grammar. So an LR(1) parser for a grammar G is a pushdown automaton that parses input text, using a lookahead of 1 to decide on its next step. Input text is pushed onto the stack and processed there, so the stack will contain a mix of symbols from Tand N.
LR(0) follows the standard read character rules, LR(1) is what we really need for most languages, Yacc uses a slightly weaker form of LR(1) to save memory. We will study up to LR(1) in this class, our work in lab will build the routines we need for an SLR parser which is somewhere between LR(0) and LR(1) in strength. (Recursive descent parsers discussed in our text are LL(k), like LR(k) but the parse corresponds to a leftmost derivation).
In a top-down parse, you are generating text as you go, and whatever you gen- erate has got to match the input string. You follow a leftmost derivation, but in the same order as the derivation. So you try a production that you know gets you the first character in the text, and go on from there. it is easy to understand where you are and what to do next. The problem is, you might not know for many steps that you are on a wrong track, the mismatch you get in step 45 might follow from a decision you made in step 30 - you need to backtrack and try again, and your complexity can go exponential.
In LR(k) parsing you are following a rightmost derivation backwards. Since the rightmost derivation expands things at the start of the text last, the last thing generated, which is the first thing in your pars, is going to be close to the start of the text. So you read the text left to right, which is waht LR says, and replace the text that was generated last in the derivation. As mentioned, that is near the start
8
but not necessarily the actual first few characters - the thing derived just before the last step might have looked like xAy - a string of terminals x followed by a non-terminal followed by more terminals. Your last step removes A and replaces it with terminals, but the string x was generated by some previous step that had A in the middle, and the last step changed xAy to xzy. So how do you know to replace z first? Because z is the first handle.
Definition 8. handle - the right hand side of a production that will be used to reduce a substring β of a right sentential form that fits a rule A → β , so we replace β with the single non-terminal on the left of the rule we are using. (In a parser, this means that we have just processed the last symbol in β and pushed it on the stack - so β is on top of the stack, we pop it and push A.)
We will develop a shift-reduce parser. It alternates between reading text and pushing it on the stack, and popping a handle of the stack and replacing it with a non-terminal. The stack top is always the last symbol read or pushed. The order of symbols on the stack from bottom to top is what you would read from the left of a right sentential form - you would get a right sentential form by appending the stack to the rest of the text, since in a rightmost derivation you generate the text from right to left, the non-terminals not yet converted to final text would all be on the left of the form, that is, on the stack.
Looking at the derivation, it is evident that each time we apply a rule, replacing a non-terminal with the right hand side of the production, we are generating what will be a handle for the parser, but this handle will often be generated inside what was a handle generated by some previous text. So we expect handles to be nested inside other handles. To process this, we introduce the definition of:
Definition 9. viable prefix - this is a a string of symbols in (N ∪T)∗that could be part of a handle (possibly of more than one handle).
During a pars, the stack must always hold a viable prefix - in fact, the first character read and pushed on the stack has to be a viable prefix. If this were not the case, that would mean that text in the language could not ever start with the character just read, and we can reject immediatly (in our example arithmetic expression grammars the first character could not be “+”, as an example. It is not a viable prefix for any of the start productions).
So - how do we know we have a handle? Because the next character we push on the stack completes a viable prefix
Why do we have a viable prefix on the stack? Because the stack plus the re- maining unprocessed string is a right sentential form, and the first part of a right sentential form is a viable prefix
Why is this a right sentential form? Because it is part of a rightmost derivation. Why is this a rightmost derivation? Because reversing it yields the single symbol
S on the stack when we have read the entire string. But this is what we are trying to prove in the first place - isn’t this whole chain of definitions completely circular? Yes. The moment we get stuck in the parse (because we are in a state which does not lead to a following state, given what is on the stack and what we read from the input string) - when/if this happens we realize that whatever is on the stack is not a viable prefix, the stack+rest or input is not a right sentential form - in fact every previous step was not part of a derivation, so nothing done before was in fact
9
a handle, a viable prefix or a right sentential form - because the input string is not in the language.
What we have is a set of definitions and methods that are correct only if they result in a successful parse, otherwise they are meaningless. Of course, this does answer the question - is w in L? If the parse succeeds, then all our definitions were correct. Otherwise, everything we attempted to do in the parse was incorrect and w is not in L. But this is exactly what we expect - if we submit a string that is not in L, the pase should fail!
We still have a problem, however. The correctness of our definitions can only be justified by a successful parse, but they don’t give us any indication of which handle we should pick next or what that particular handle looks like. We can not directly understand what the next step in a parse is, we would need to know if the next step step is read or pop a handle and replace it (reduce), we need to know what the next handle looks like so we can tell if it is on the stack - we don’t know these things.
What the LR(k) method gives is is a way of defining a state machine from the grammar. For example, in our grammar 4.1 it tells us that, before we read anything at all, we could be looking at the start of either of the two productions that begin with E, because S could actually transfer to E before we read anything - then E could be part of an E+T production, or it could become a T, because we have a rule that says E could start by becoming a T. We continue in this fashion until we collect all the possible places we could be in the grammar. Then we consider what happens next - what symbol could we read? We realize, looking at the grammar, that there is no way we could see a + sign from the start state - the only production that has a + sign is E → E + T , and we can’t reach the + without first processing E. By continuing in this manner we can construct states that we reach from the start state depending on which symbol we process - this part looks like a DFA. We will reach some states in which we have processes an entire rule - for example, if we process the symbols E, + and T in sequence, we have reached the end of the rule E → E + T , the stack must have the handle E + T on top so we can pop those 3 symbols and push E. We don’t have to understand all the details, if we have an algorithm that generates the states and transitions between them, and identifies states in which we perform a reduction, we can build a pushdown automaton that parses the language. This is the LR family of algorithms, we will see how this is done in section 4.5.
1.6. FIRST and FOLLOW. We need some preliminary work, though. We can tell, from the grammar, which symbols could appear first in a string generated by a non-terminal (the FIRST sets) and which symbols could appear after a non- terminal - for example the symbol E could be expanded to a string that begins with “id” or “(“ so they would be in the set FIRST(E). The symbol “+” could not be the first character in any string produced from E. However, “+” could be the first symbol you see after the E, it would be in the set FOLLOW(E). FIRST and FOLLOW sets are a property of a grammar, there are algorithms to generate this.
The FIRST and FOLLOW algorithms (which you will implement in LAB 2 are on in section 4.4.2, starting on page 220 . There are also notes on FIRST and FOLLOW on the 570 web page, at the bottom of the page in the section >NOTES . We will need to implement these algorithms as the first stage in constructing the PDA for the shift-reduce parcer we just described for bottom-up parsing.
10
Because each symbol in N ∪ T has a FIRST set, and each symbol in N has a follow set, it looks like the FIRST/FOLLOW sets are properties of the symbols, and this can lead you to write programs to compute sets for one symbol at a time. This does not work. FIRST and FOLLOW sets are properties of the grammar! Keep this in mind when reading and programming the algorithms - you need to loop over the entire grammar, each iteration of the loop makes changes to some of the sets which lead to changes in other sets. Repeat until nothing changrs, then stop.