Compilers(Question about Computer Science)

gkswogh7220
compilers9.pdf

Dr. Ernesto Gomez : CSE 570/670

1. Syntax-directed translation

Human languages communicate information through a combination of syntax, meaning and context. Arguably meaning and context are more important, we can communicate well with bad grammar. One of the consequences is that human languages convey a much richer, but less precise meaning. Computer languages are much more limited in what they communicate, all they do is specify what actions a machine will perform to carry out some algorithm. We design computer languages to be unambiguous - any string in the language should have a unique meaning. We use context free grammars because this makes meaning local to the text we are translating - an expression in the language should be translatable only using information that appears next to it in the text. Technically, a context-free grammar lets us guarantee that we can always decide if a text string is in the language or not, and we can do this efficiently (The LR grammars and shift-reduce parsers we have defined earlier can parse in linear time, other algorithms may take somewhat longer (and allow dealing with extensions to context-free grammars) but we can still guarantee worst-case polynomial time, usually better than n3).

2. Simple translation withoug optimization.

Consider our expression grammar from chapter 4, which might be appropriate for a calculator - we here extend it to an assign statement. We add A (assign statement) to the non-terminals, and the symbols v,= to the terminals. In this context, istands for either a vatiable or a numeric value. We have annotated each line with an expression in curly brackets that says what it does - the arithmetic operators have their usual meaning, and we are adding two functtions - the comment field here describes what the functions do. We are using a pseudo-code to describe what we do to translate the expression, but the actual code, like what we would write in the curly brackets in Yacc, depends on what we we are translating our statments to - another language? binary code? functions of an interpreter?

2.1. Assign statement. (1) A → v = E { store( E.value, v ) // if v is a variable, store a value to its

address in memory - else error } (2) E1 → E + T { E1.value = E.value + T.value } (3) E → T { E.value = T.value } (4) T1 → T ∗F { T1.value = T.value + F.value } (5) T → F{ T.value = F.value } (6) F → i { F.value = value_of( i ) // if i is a string, get numeric value - else

if i is a variable, get its value - else error} (7) F → (E) { F.value = E.value }

Note that we have added something that is not a synctactic property, in productions 1 and 6, the term “variable”. We can define the syntax for what a variable should look like ( a string tht begins with a letter, followed by more letters, digits, special characters like $ and _ which ends just before a space, line end, punctuation or operator which are not allowed in the name ). In modern languages, however, such a name is only a variable if it has been assigned a type and a location in memory so we can store items of the correct type in the name.

1

2

Example 1. Derivation of x=2+y*(3+z) We have previously see how a the start production of the expression grammar

gets expanded into a series of string of mixed terminals and non-terminals (called right sentential forms for a rightmost derivation) that contain a handle (see notes 6 for more examples and definitions, see also page 218 in the text, examples of annotated parse trees in section 5.1, syntax tree in section 5.3). We now show how to display the derivation as a tree:

L → EiL → Ei

2.2. Variable declaration. Here is a grammar for what a variable declaration might look like: nonterminals are D for declaration, L for list, T for type. Terminals are i for identifier, int, string, and comma to separate list items.

(1) D → TL { // declare a list of names as variables of type T } (2) T → int { T.value = n // set type to number } (3) T → string { T.value = s // set type to string }

3

(4) L1 → L,i { add_symbol(i, T.value) // insert i in symbol table with type T.value }

(5) L → i { add_symbol(i, T.value) // insert i in symbol table with type T.value }

Example 2. Derivation: of int a,b,b

D ->TL ->TL ,c- >TL b, c ->T a, b, c ->n a, b, c , we will now show this as a tree.

818.44

2.3. Semantics. Two advantages of the tree representation - 1) The tree is more compact and easier to read than the list of derivation steps. 26) Any order of steps in a correct derivation (from an unambiguous grammar) will produce the same tree - for example, we can do a rightmost or a leftmost derivation, we not only get the same result, we display the derivation as the same syntax and parse trees.

Parse and syntax trees describe the semantics of a statement, as shown in some examples in chapter 5, when annotated with the translation for each rule (as in the annotated gram818.44mar), the graphs also describe the sequence of actions. What aentation - 1) The tree is more compact and easier to read than the list of derivation steps. 2) Any order of steps in a correct derivation (from an unambiguous grammar) will produce the same tree - for example, we can do a rightmost or a leftmost derivation, we not only get the same result, we display the derivation as the same syntax and parse trees.

Parse and syntax trees describe the semantics of a statement, as shown in some examples in chapter 5, when annotated with the translation for each rule (as in the annotated gram818.44mar), the graphs also describe the sequence of actions. What a statement does can be described as the actions resulting from a leftmost, depth first, (in order) traverse of the tree:

4

(1) depth first - we follow a path from the root as far as we can take it (that is, until we hit a leaf, or a point where we have already explored all downward paths, then we go up until we find another downward path we have not visited. When we return to the root and there are no paths we have not already explored, we stop. (The algorithm uses a stack to keep track of nodes visited - see your data structures book/notes. A breath first search uses a queue instead of a stack, so explores the tree one level at a time - first does all nodes one down from the root, then two, and so on)

(2) leftmost - given a choice, we take the unexplored branch farthest left. You can see, from the statements we are graphing, that this means we process terms in the same order in which we read them. When we do a rightmost derivation, the last term we expand is the leftmost node in our graph - since a bottom up parse follows the derivation in reverse, it matches the path followed by leftmost traverse.

(3) in-order - I put this in parenthesis, because it is a little misleading. You can think of it like - you have leaves labelled 1 and 5 and both come from a node labeled +, so read 1, then read +, and then read 5 and add. I find it cleaner to think of - process a node when you can’t go any deeper. So the situation really looks like - go left and store the one, go right (because you already did left), store the 5. That puts you back at the node with the + , and you cant go any further because you already did both paths. So you apply + to whatever data you have stored, and get 1+5.

The situation just described in 3. above is the easyest - we say that values (at- tributes, a more general term) that we get at an internal node in our tree that we read from links pointing down from our node - are synthesized. Looking back at how a shift-reduce parser works, this is the stuff that would be on the handle that we are popping from the stack. The addition example we described corresponds to parsing E → E + T , we pop the right hand side - so we can read the values of E and T from the stack (in Yacc notation, E+T is $1 $2 $3, and we can write something like $$.value=$1.value+$3.value (we don’t need $2, we know it is a + sign already if we are processing this rule).

A different thing happens in the grammar for declarations - Look at the anno- tated grammar above, and you see that the start production D doesnt do anything - it means get the type from the left branch and apply it to the list on the right, but since we don’t know how long the list or what is in it, we actually put values into the symbol table when we process the actual name in identifier i. This iacks fine, because our parser stack has all the i valies. However, the actual type name will be on the stack below all the i’s (think leftmost depth first, we put the type on the stack, then we process each of the i’s, then we are back in the node with D → TL , but by the time we get there we have done all the work - we really need the type when we are going down the right branch of our definition, and the type is on the stack but we don’t know how deep (and if you are using Yacc, the way to get at stuff below stack top is not well documented - if you are interested, things on the actual Yacc stack are $-1, $-2, .. and so forth. But which one is the type? It depends on how many names in the list have not yet been processed.

Things like the type name are called inherited values. Looking at the parse tree, inherited values were set on a branch to the left of whatever we are doing now. We have read them, they are on the stack, but you don’t know exactly where -

5

the link in the tree is through the parent node, the node we are processing inherits the values from its parent. Your book, the Lex/Yacc references tell you to insert a dummy production to hold inherited values - you add something that looks like H → � to the grammar, and you put this into the rules where it generates nothing, but you something like process L → Hi and the Hdoesn’t change the text but it becomes a synthesized thing that gives you the value you need. If you want to do this, go right ahead - in the parsers I have written, I don’t like this approach - even though adding an epsilon production doesn’t change the generated text, it messes up the grammar and can result in needing a higher level of lookahead and introducing conflicts where otherwise you would not have them.

What I actually do, in cases like this is define a second stack to hold inherited values, push values I will need later onto that stack, then when I need the inherited value, whatever is on top of the extra stack is what I need. I have not proved the correctness of this, but since in most languages, everything is nested, it works. Looked at another way - an inherited value deeper in the stack corresponds to a production that includes whatever created whatever is on stack top - we will process the production that used the top value before we get to whatever pushed values farther down. The lesson - we are producing a prctical piece of software. Doing things like adding ad hock code is not theoretically clean, but if it gets the job done, we are fine with it.

2.4. Symbol table. We have mentioned before that declared variables are not context-free. We don’t know if a statement in C++ like x=5.3+y; is correct without verifying that x and y are actually declared, and are of numeric type. Even if the statement is correct, its meaning is different depending on which numeric type we are using - if x and y are int, then the .3 will be truncated and the value of x will be different. So meaning is contextual, it is not determined by syntax. Same applies to function declarations, and the more extensible a language is, the farther we go from pure context freedom. An object-oriented language is much harder to do in anything like a context-free grammar.

Depending on the language, you will need different formats in the symbol table. When you are storing variables , all you need is: name, type, size and memory address. For a function, you also need the list of parameters. For a class, with multiple local variables, inheritance, public and private values, overloading - symbol tables become much harder to define, and their use can also limit the features you can build into the language. This is behind the current push to develop top-down parsing methods.

3. Basic blocks and control-flow graphs

(described in section 8.4 of text)