Compilers(Question about Computer Science)

profilegkswogh7220
compilers4.pdf

Dr. Ernesto Gomez : CSE 570/670

1. A note on parsing strategy

Suppose we have a set of TERMINALS T= { terminals - things that compose text in the language }. T stands for “Terminals”, they are a more general form of the definition of alphabet Σ. The alphabet is a finite list of symbols. The set of terminals is also a finite list, but the terminals themselves don’t have to be finite. Imagine we want to describe the syntax for adding a list of numbers. It might look something like:

ADD_LIST = { RESULT = SUMS, where SUMS = NUMBER or SUMS = SUMS + NUMBER }

(In Chapter 4 we will call this kind of thing a context-free grammar). T= {NUMBER, =, +}. Why not SUMS and RESULT? We can see that these

two words are placeholders (we will be calling them NON-TERMINALS later), when we actually express a sum, it will look like: NUMBER = NUMBER + ... + NUMBER. SUM and RESULT never appear in the actual sum, all we have are numbers, and the symbols “+” and “=”.

So : if we want to express a sum, we could start from RESULT = SUMS, then expand SUMS into SUMS+NUMBER. We can keep on doing this to SUMS as long as we want, then when we want to stop we use the rule SUMS = NUMBER. The we can add everything, and replace RESULT with whatever we have added ad we end up with NUMBER = list of numbers separated by + signs. (do we need a rule RESULT = NUMBER?). Anyway, once we have replaced everything with numbers and symbols + and =, we stop - there are no rules for changing a number or {+,=} into anything else. That is why we call them “terminals” - they end (terminate) the sequence of converting our syntactic definition to a particular sum.

(Whar, then are the terms RESULT and SUMS? They don’t appear in the final expression, we have rules that allow us to change them into something else - they are not “terminals” so we call them “non-terminals” (sometimes we can be very literal in our naming conventions!). Now, what are we to make of the term NUMBER? The word NUMBER appears in our description for the sum of a list of numbers - but when we actually write such a sum down, we will replace each instance of NUMBER with actual numbers - NUMBER=NUMBER+NUMBER would actually be something like 42=29+13. So NUMBER is a terminal, but it is not a fixed symbol - rather it describes a pattern that allows us to generate the actual text. For example, NUMBER=(−|ε)(1 . . . 9)(0 . . . 9)∗|0, the regular expression wi used before to describe what an integer looks like. When we said in (lecture notes 2) that we were going to use multiple machine types in translation, this is the kind of thing we meant. )

We will use finite automata to describe simple patterns that we will then use to simplify higher-level definitions, this is in Chapter 3 of our text.

1.1. Derivation. What we have done: ADD_LIST is a set of rules that describe what adding a list of numbers look like - it is a formal language definition, that describes which strings in (numbers, + and = signs)∗ are actually in the langugage ADD_LIST. When we show, using the rules, that 42=29+13 is in the language ADD_LIST, we are doing a derivation. The rules define the language ADD_LIST, which is a subset of all possible combinatios of (NUMBER,=>+). Give any specific

1

2

string in (numbers, + and = signs)∗, if it is in ADD_LIST, there mukasust be a derivation of that string:

Definition 1. Derivation: Generate a string in the language L by repeated aplica- tion of the rules that define the language until the string contains only terminals.

Example 2. ADD_LIST→RESULT=SUMS→RESULT=SUMS+NUMBER→RESULT=NUMBER+NUMBER→RESULT=NUMBER+13→RESULT=29+13→42=29+13

Definition 3. Parse: given a string, prove that it TOP DOWNeither is in language L or is not in language L . This is done by producing a derivation, or showing that it is not possible to produce a derivation.

1.2. Parsing. There are in general two ways to do a derivation: given a string w we can start from the rules that define the language and generate w - this is what we did in example 2 above, it is called a TOP DOWN parse. Chapter 2 of our text (Aho and Ullman) gives an example of top down parsing. BOTTOM UP parsing is, starting from w, replace things in the text with non-terminals, producing a derivation backwards - the last replacement we did in example 2 (replacing RESULT with the number 42) would be the first thing we do in a bottom up parse. A successful parse is when we can keep on doing until we get back to the start symbol of a derivation, in example 2 this is ADD_LIST. Chapter 4 of the text describes how to do this.

Top down parsing is easier to understand, but as languages get more complicated, it becomes harder to do. The issue is, at the start of a string, we need to choose a translation path. This is easy to do in our example of defining a sum of numbers, but in more complex languages you may be fairly deep into a translation before you know you are on the wrong path (a simple example: in FORTRAN the first statement in a loop could be “do i = 0, 400, 5 for a loop that starts at 0, and goes to 400 in steps of 5. Fortran does not require spaces, so it could be written doi=0,400,5. doi is a perfectly legitimate variable name, so you don’t know if this is the start of a loop or an assignment statement until you read the first “,”. It can get much worse in more complex languages.)

If you are doing top-down parsing you usually backtracking algorithms, which in the general case have exponential complexity. In the past, top-down methods were mostly used only on languages that were explicitly designed for top-down (for example, Lisp program syntax takes the form of lists delimited by parentheses, the first item in the list tells you what it does).

Bottom-up methods are really hard, if not impossible, to understand intuitively, but we have theorems that let us write algorithms to do it. These are limited to a specific class of language, the unambiguous context-free languages, but this is a pretty broad category - most declarative and functional languages fit comfortably here (we add a “symbol table “ tokasu deal with variable and function names, which are not context free, and mostly it works). We have tools called LEX and YACC (see : “John R. Levine, Tony Mason, Doug Brown; "lex & yacc", second edition, O’Reilly” for description of these tools - it may be out of print, but do an online search for pdf downloads, which may be free of charge). These tools can input definitions of finite automata and context-free grammars, and use them to generate parsers and translation programs. In our labs, you will write programs that show you how to do this - essentially you will implement theorems and algorithms from the text to produce your own code that does the kind of things lex and yacc do. This process will help you to understand what the programs can and can’t do (in

3

particular, most people will be unable to understand yacc error messages without understanding the code, which you get from doing the labs). You will use lex and yacc for your term project (see web page).

The limitation of the bottom up methods we are studying (unambiguous, context- free) are usually things we want in a computer language. We want the language to be unambiguous, this means our program can do only what we ask it to do (of course, we may not really understand what we are saying, this leads to logic errors, which are the hardest errors to fix). Context-free is a technical term that defines the form of the grammar that describes a language, and results in something that can be translated by a pushdown automaton - this is good, it means we can always get a true/false answer to the question “is the program in the language?” and we can guarantee that it won’t fall into an infinite loop (because we are not using a Turing machine).

Translated into human terms, context-free means that we can read a statement in the language and decide if it is correct or not without looking at the whole program - translation is local. Consider the following examples:

FORTRAN IV: is X=1/3 correct? Yes, because FORTRAN IV did not require variable declaritions, and any undeclared name that did not start wih the letters “i,j,k,l,m,n” was automatically floating point (i to n was integer, and we all still tend to use those letters as loop indices, continuing old habits from people who came before us).

C: Is x=1/3; correct? We don’t know. If x has not been declared, it is an “undefined” error. If x is declared as a string or an int, it is at least “incompatible type” warning, and maybe an error depending on how strict you tell the compiler to be when you compile. We can know immediately if every time we see a variable declaration, we put the variable name into a symbol table and list its properties (type, address, possibly other things).

C++: is x.func(1/3); correct? Well, if x is an object, with a member function func, and it is an imput function of some kind, and one of the things it is allowed to input is a floating point number - we don’t know without going deep into the class definition. We can fix some of this in the symbol table, but it becomes much more complicated - also it is not enough to put variable types in a symbol table, we need tables to describe classes, overloaded operators, many other things.

1.3. Languages that are not context free. Bottom up methods are easily ex- tended to deal with variable or function declarations. All we are dealing with here is names and possibly input parameters, but we have syntax in the language that describes what a variable or function name behaves like. Objects can extend the syntax, so they are much harder to handle by extending context-free methods. Top-down methods are easier to extend, because we have a better idea of what the program is doing.

Three things are making top down translators more practical than was the case even in 2007, when our text was pucblished:

• Algorithms based on Dynamic Programming (see https://www.geeksforgeeks.org/dynamic- programming/ - do not see the book we used for our Algorithms class, its section on this is confusing). This can reduce the complexity of recursive algorithms. As applied to backtracking it eliminates following the same path multiple times, which can significantly reduce the work required

4

• Modular programming - modern programming methods divide programs into many small chunks. Particularly true for objects, even 2n complexity is feasible for small n.

• Much faster computers allow us to run algorithms of higher complexity (and datasets divisible into medium sized chunks) in reasonable time.

The current (at least since version 5) GNU C++ compiler uses a top down trans- lation strategy, for example. The newer top down methods have not yet made it into texts, and there are as of this writing (2020) no available tools like LEX and YACC for top down compiler development. Look on the ACM or IEEE web sites for recent papers on parsing to get a better idea on the current state of the art.

Bottom up methods remain useful for many languages, and we still find them preferable for smaller projects and any language that is not object oriented. Further, the study of bottom up methods gives useful insight into language structures and methodology which will carry over to other methods.