Compilers(Question about Computer Science)

profilegkswogh7220
compilers1.pdf

Dr. Ernesto Gomez : CSE 570/670 We will be covering material from : Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman; "Com-

pilers - Principles, Techniques and Tools",Second Edition, Addison Wes- ley

(This is the classic text on Compilers - it has been around in various versions and updates since the 1970s and has been revised and updated, most recently in the second edition in 2007. Unlike many other texts, it gets better in every edition.

It is often called "the Dragon book", in part because it has a dragon on the cover, but mainly because it is not a friendly book - you will find that there are places where you have to study one page for an hour to understand a theorem or algorithm. It feels like the dragon is eating your brain - it is worth it, however - or it will feel that it is worth it after your brain has been rewired. You will love the book, appreciate and understand it the third time you read it. And you will come back to it - it is the compiler text that has shaped everybody in the field, the standard reference for compilers. It belongs in the professional library of every computer scientist and engineer.)

A tentative schedule: Chapter 1 is an introduction, chapter 2 gives an outline of how compilers work,

usint top-down parsing. We will be concentrating on bottom up parsing in the class, this starts with chapter 3. We will study most of chapters 3,4,5 ad then skip around the rest of the book - details wil appear in lecture notes.

We will not cover everything in the book,re and we will explore some topics that are not in the book - these will appear in the notes. All material covered in class may appear in examinations.

1. motivation

Why study compilers? Few of us will ever be called upon to write or modify a compiler. Here are some justifications for all the work involved.

<1> Pragmatism: Compilers are tools that we use to build all our applications. We need to understand the properties and limitations of our tools.

<2> Art: To the right(?) kind of mind, compilers are beautiful!

<3> Science: We need to be able to express things in a language understandable to us, then translate it to a language understandable to machines. The problem: how do we make sure that the meaning is translated?

• Syntax: the textual rules for expressing things in a language. • Semantics: the meaning conveyed by the expression. • The syntax is fully visible, but the meaning can be context dependent, may

require additional knowledge to interpret. • Syntax + (context) + (world knowledge) => Semantics. • Ambiguity: same syntax can mean different things. English is ambiguous. • "Time flies like an arrow." - What does it mean?Consider: Is the subject

"time" or "flies" (or "time flies", a kind of fly)? Is the verb "time", "flies", or "like"?

• Early computer translation example:"Out of sight, out of mind." => to Russian; back from Russian => "Invisible insanity."

1

2

Two tracks: "the Dragon book": Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jef- frey D. Ullman; "Compilers - Principles, Techniques and Tools",Second Edition, Addison Wesley

• How to understand natural languages => A.I. • How to build artificial languages that are not ambiguous (and: what can

be expressed by such languages, and how can they be understood.) Compilers follow track 2: We study classes of languages, what they can express, and what computation is required to determine what is expressed. This gets at the foundations of computer science, it turns out some languages are equivalent to finite automata, others to Turing machines, some are uncomputable. It also turns out that you can express uncomputable problems in languages which are themselves perfectly computable. There are open questions in semantics.

<4> Technology: Compilers are among the largest and most complicated single programs, and require very sophisticated programming techniques, worth studying in and of themselves. Many of these techniques are applicable to other problems in many areas.

The study and specification of unambiguous languages can be extended to many kinds of user interface control problems, in which one requires complicated input and it is necessary to distinguish correct combinations from incorrect or even dan- gerous sets of inputs.

A complicated application can often be understood in terms of a language that allows a user to specify combinations of actions or properties. Compiler technology provides tools to build and analyze such "languages".

<5> Curiosity: It’s nice to know how things work.

2. Computer and human languages

Human languages are contextual, ambiguous and meaningful. We can speak without paying much (or any!) attention to grammar, because the surrounding environment, situation context, and word meaning usually communicate what we want. For example, I once was lost in the Hiroshima train station, having taken the wrong bus from the airport. I was able to communicate that I did not know which train to take mearly by saying the name of the university where the conference was taking place and looking lost - the context of being in a train station supplied the added information that I needed to know where I could take a train there. Shouting “FIRE!”, as the professor in a compilers class talking about language is immediately recognized as an example - shouting “FIRE!” from the back of a crowded lecture hall is immediately recognized as a warning.

Human languages are ambiguous, however - sometimes we can’t really tell what something means without information not readily available. When I said “no sysad- min is better than Derec” as a student at my old university, it could be taken as a compliment. I could plausibly deny what I really meant, which was to say that the CS computer systems would run better without a system administrator.

Computer languages are different - to begin with, they are reall not languages for computers - they are languages designed for humans, which we can use to com- municate with computers. However, unlike normal human languages, computer languages are formal systems - we separate the language text from its meaning, and define the language as a set of grammatical rules that govern the shape (the

3

form in formal) of text written in the language. We define a set T of symbols that we call terminals - these are things that can appear in language text. Terminals can include letters, digits and special characters like arithmetic symbols and punc- tuation, but can also include higher level things like names that follow certain rules (for example, “a name must start with an alphabetic character, which is followed by any combination of letters and digits” is one possible rule. Same kind of rule applies to formats for integers and floating point numbers). Terminals may also in- clude specific keywords (like “if”, “while”). Text in the language (which we usually refer to as a “string”, meaning a sequence made up of characters and symbols) must be composed entirely of terminals, which also comply with rules of a grammar. A string satisfies all the rules, or it is not in the language.

Parsing and semantics. Some standard definitions: Given a set Tof symbols, let X and Y ∈ T we have operations: X and Y → XY , X or Y → X|Y, and “Kleene’s star” X∗ → �|X|XX|XXX|. . . (� is th e empty string, so X∗is 0 or more copies of X). So T ∗ is the set of all strings that can be constructed by concatenation from symbols in T , and a language L ⊂ T . The trick is to define the subset of T using rules that are unabiguous and computable. A string is said to be in L if it follows the rules so we can prove that it is in the subset. Meaning does not come into it at all, in formal systems. The advantage of using a formal system is that we can decide if a string is in Lor not using an algorithm that can be implemented on a computer.

A disadvantage is that the system is very rigid. Sometimes the translator will find a garbled part of a string, and mark it as an error, sometimes it will find something trivial like a missing semicolon at the end of a statement, and you will get an error message like “missing semicolon at the end of line 45” - which can result inthe programmer thinking “if you know there is a missing semicolon ther, just add it in and keep going!”. But even a seemingly trivial failure to follow the rules makes the translator decide “this string is not in L” and reject it as not in the language. The analysis of whether a string s ∈ L is called parsing, and it is the first step in computer language translation.

The actual translation requires assignment of meaning, this is defined by the semantics of L. This may be done after the entire program has been parsed, but usually we can assign meaning to smaller chings of code as we go - for example the string “x + 5” in C++ means “add 5 to the contents of variable x”, so we can translate that before we finish parsing everything else. This supports our intuition of what a computar program means : “It means what it does!” . (This is in accord to existentialist philosophy - Sartre believed that you are what you do, (in “Being and Nothingnes”) and this is what we want in a program, also known as Operational Semantics. We will annotate the grammar with the semantic equivalent of grammar definitions, which is to say what we want a particular chunk of code to do.

How we express the semantics, what is our output, depends on the form of translator we are building.

Compilers and interpreters and things in between. We have three general classifications of translaton programs:

• Compilers (C, C++) take a string in L as input, and generate an executable string in machine language as output. “x+5” becomes something like the machine language equivalent of “LOAD R1 @x , ADDI R2 R1 5”. All the

4

work of translation is done by the compiler before the program runs. The executable code is machine code, runs without any translation overhead. Disadvantages: (unless you compile in Debug mode), any runtime errors have no access to the original text, so you get messages like “overflow at 0AA023FF” which don’t tell you much about what happened or where. Not portable between systems, sometimes not even between different versions of the same system (try to run Win95 code on Windows 10 - or even on Vista)! You need access to the source code and must re-compile it if you want to run the program on a different OS, or machine architecture.

• Interpreters (old versions of Basic, like the ones that came with DOS) read the program one statement at a time and execute it directly: “x+5” may result in a function call that adds the two values, rather than an actual translation. Advantages - it is very easy to debug, you know where any er- ror occured in the source code. The source text is visible in the interpreter and you can edit it immediately inside the system. Disadvantage - inter- preted code runes very slowly, because translation has to be done at every statement. A loop with 10000 iterations traslates each statement inside the loop 10000 times every time it runs. You can’t do any optimization of code because you are always running the original source.

• Bytecode compilers (Java) translate to something that looks like assembly or machine language, that is then iterpreted by a virtual machine. The bytecode is designed to be easy and fast to read and execute by the virtual machine, so most of the heavy translation work is done before the program runs. Optimization can be done in going from the source language to the bytecode. So execution is much faster than on an interpreter. Compiled code is very portable, it can run unchanged in any system that has a virtual machine (you can download Java code and run it anywhere). There is sup- posed to be a security advantage because you can limit access of the virtual machine to real hardware - but experience with Java is that it doesn’t ac- tually work. Disadvantages - just as hard to debug as true compiled code, and not as fast due to the extra translation layer

• Many languages are between the above categories - Python and Lisp usually have the look/feel of an interpreter, but they actually translate the code to an intermediate version that looks like a machine language. It is then iterpreted by a virtual machine that is part of the interpreter. Microsoft languages like Visual Basic look like compilers, but in order to actually run the code you need massive runtime libraries (.dll files) which essentially do the same thing that the Java virtual engine - the “compiled” code passes fairly high level code to the dll which effectively interprets it like a vir- tual machine would - so no speed advantage over bytecode compilers. In systems that look like interpreters, youhave most of the advantages of in- terpreters, exceptonce you type in the program, you cant read it because it got translated it into code for the virtual machine. You usually have ability to enter and edit one line of code, but for example in Lisp or Scheme, any large program needs to be written in an external editor and loaded into the system. Code will run anywhere the interpreter runs (like bytecode compil- ers). Again like bytecode compilers, code does not run as fast as a machine code compiler, but runs faster than a traditional interpreter. Some of these

5

systems (python, GNU scheme, Matlab) can link to externally compiled modules, or even generate actual machine code from inside the interpreter.

Do not fall for the hype! Any action a computer takes is going to consume cycles. If you are compiling code for a virtual machine, when you run the code the virtual machine is using cycles, so your code gets fewer cycles to do its thing and therefore runs slower. A virtual machine on the cloud has the same problem - it will use cycles that are not needed if you run the program locally. (The cloud has the advantage that you may be able to use hardware you don’t otherwise have access to - your compute cluster on the cloud may only have half the speed of the same cluster of real machines on your own site, but if you can rent 100 nodes on the cloud and your local cluster only has 8, you still win. But it is just not true that there is no performance loss on virtual machinery)

Structure of compilers and other translators. In this class we are studying compilers, which input source code and output executable code. Note, however, that this description actually is valid for all the categories we described above with the possible exception of the classic interpreter, and even there we still have to do parsing and translation. It will turn out that the techniques we are studying will be applicable in a lot of cases that are not traditional compilers, and even the tools we will use in the term project can be used to generate interpreters or bytecode compilers, or even translators from one computer language to another (LEX and YACC, more on these later, also see the class webpage).