Compilers(Question about Computer Science)

profilegkswogh7220
compilers2.pdf

Dr. Ernesto Gomez : CSE 570/670 The structure of a computer language translator

1. Overview and practical issues

Recalling Lecture 1 - we are going to define a computer language using a formal description because we want to be able to decide in an unambiguous way if inout text is in the language or not. (This is a big part of the reason for the invention of formal systems - mathematicians got into arguments about whether something was a proof of a theorem or not. Formal systems came into being as definitions of the rules of resoning that are acceptable for proving stuff, so all could agree that if the proof folllowed the rules, it was a correct proof). When we do this, we split the translation problem into two hopefully simpler chunks: First, decide if the text follows the syntax rules. If it doesn’t, we reject it as not being in the language, rather than trying to guess what the programmer really wanted to say. Once we know some string is in the language, then we decide what it means (the semantics). Language theorists have several ways of defining meaning, but from the point of view of compilers, we need to translate the input text into actions by the computer. We therefore use operational semantics - the text means what the computer is required to do.

How should we implement these concepts? We have seen that we have a range of options, from a traditional interpreter to a compiler that outputs machine code, with a whole range of possibilities in betwen. There are advantages and disadvan- tages to every design choice, and as we have seen, there are real world versions of every one of them.

Consider first the traditional compiler: we want it to translate from source text to machine language. We could just build a large, monolithic program that incor- porates both the input in a computer language and the machine executable output. The problem here is, bot the input language and the machine environment are mov- ing targets. Languages are revised periodically, sometimes drastically. For example, there have been five official revisions of the C++ standard, in 1998, 2003, 2011, 2023 and 2017, and a new revision is due this year. But this is not the whole story - C++ started in 1985, and a lot changed before the International Organization for Standardization (ISO) defined the 1998 standard (for example, templates were not in the original language).

The runtime environment also changes. Early compilers (COBOL, FORTRAN, others) could build on the assumption that the generated code would have the ma- chine to itself, but modern compilers need to target both the machine architecture and the operating system that runs on it. Both of these change quickly with time, and are not uniqe - at any given time we may need to generate code for multiple versions of multiple architectures (different versions of Intel, AMD and ARM pro- cessors just to cover the most basic types), and operating systems (multiple versions of Windows, Linus, Unix, IOS, and others). We may choose to limit our compilers to a particular machine and operating system, but even then we have operating system and machine version changes.

To deal with the problem, we generally split the compiler into (at least) two parts, named with great imagination and originality:

• FRONT END - recognizes the input language and outputs what it means. optimization

1

2

• BACK END - gets the meaning from the FRONT END, generates exe- cutable code. optimization

This structure allows us to put al the language information into the FRONT, and all the machine information into the BACK - we can ideally have a single FRONT end for a specific language, and then multiple BACK ends, each specific to a particular machine/OS. Ideally we could write FRONT ends for different languages, generat- ing the same I-code. This should allow us to just mix and match any FRONT end translator for a language to any BACK end for a specific machine or OS.

1.1. I-code (intermediate code). We are missing something here - how do we transfer the meaning from the FRONT to the BACK? we need an intermediate language that both sides can recognize. It can’t be the original input - that is designed for humans and translating it is too complex. It cant be the machine code - machine code has too much dependence on hardware features and peculiarities (for example, MIPS processors have no instruction to copy from one register to another - to copy from R1 to R2, you do: R2 = R1 + 0 ). We usually define an intermediate code (I-code) and use that to transfer the meaning.

What is the I-code? It has to be at a high enough level to capture the meaning of the input code and ideally be readable by humans (otherwise how do we tell if a compilation error is in the output machine code or in the output from FRONT?). I-code should be easy to read by the back end, so it does not have to do the heavy translation work of the front end. So again, what is the I-code?

The quick answer is: we dont know. More detailed: we have tried many things, but we dont know which alternative is better, we have not decided on rues that allow us to decide which is the best way to design an I-code. We require (at a minimum!):

(1) readable by humans, (2) trivially easy to translate by machines (3) expressive enough to capture all the meaning of the input text

Bytecode (see bytecode compilers from Lecture 1) meets 2 and 3, but fails at 1. Our text uses 3-address code - this looks like assembly language, x=a+b trans-

lates to ADD x, a, b. This is easy to translate because it is a fixed format - 4 items in each statement, and it is always: operator, output, input, input. We can prove that operators that require more than two inputs can be translated to a series of binary operators. It looks a lot like assembly language, but without a lot of the machine dependent details that make assembly hard to read. It can have issues with requirement 3, however - for example, what do you do with a switch state- ment? You can always translate it into a sequence of IF statements, which easily fit 3-address code (if x then a else b translates to: IF test.x goto.a goto.b ). But executing the switch directly, as a conditional jump to one a set of addresses in an array is much faster. (This is not the only issue - in my parallel C compiler, which does asynchronous communication, I need information about high level variable use that gets buried in the details of 3-address code).

To deal with the problems of 3-address code, some compilers have used a high- level language as I-code, but one which is easier to read by machine such as Lisp or Scheme, or even Modula-2; this approach can run into problems with requirement 2. Another possibility that has been tried is to represent the program as a graph,

3

and use that as I-code - this preserves the high-level structure and makes it easier to optimize, but has problems with requirement 1 for large programs.

We will follow the book here, and use 3-address code in class and labs, but remember that this is not a definitive thing. What is the right I-code and why is not a settled issue.

A remaining question - does our suggestion at the end of section 1.1, that we can just take a front end from any language and match it to a back end for any machine or operating system? Not quite, unfortunately. The FRONT - BACK organization does allow us to split the language and hardware dependences, so for example we just have to change the back end to allow the same language to use new hardware featurs. The ideal of using the same I-code for a large set of languages seems to have more problems than we expected. The issue seems to be that there are enough detail differences between languages to force us to extend the I-code to support each particular language. But making the I-code broad enough to support everything makes it harder to translate by the back end, and increases its complexity (the back end would then need to support all the features of the I-code and be able to generate suitable machine code, which increases its complexity in ways that would not be needed to support a single FRONT end.

As a result, you get things like this: older versions of GNU Fortran front ends generated C code instead of any I-code. The C code was then passed to GNU gcc compiler, translated to I-code by a front end and then translated to executable. GNU could then match the same Fortran front end to whatever system specific gcc variant existed on the target system. C is sufficiently high-level to support whatever structure Fortran needs, and this avoids the need to put the complexity of the front end into the back end code. (C and C++ turn out to be similar enough that they can even share the front end - gcc and g++ can and do both use the same front-back end combination).

1.2. Optimization. In describing FRONT ans BACK ends, we placed optimiza- tion in both. What is going on?

First, what is optimization? Well, it is not what the word says - optimization should be producing optimal code. But deciding if a piece of code is optional or not is uncomputable! (For example, it is clearly not optimal if our code never halts and gives an answer, but this is the uncomputable Halting Problem. For similar reasons, we can’t even guarantee the code is correct - debugging is uncomputable.)

What optimization actually attempts to do is to improve the performance of the code - by making it run faster, use less memory, or both. It does this by modifying the code - whereas we try to guarantee that whatever the programmer wrote matches the executable, optimization breaks this promise. Optimization changes the code as written to different code. Does it work? Usually, yes. Does it always work? No - sometimes it fails by not producing significant improvement. Sometimes it fails by producing code that does not correctly do what was intended, or even code that fails at runtime. For this reason, optimization is optional. Gnu compilers give you for levels - O0, O1, O2, O3. Default is O0 - no optimization. levels 1, 2 and 3 perform increasingly aggressive optimization - each modifies the program more than the lower levels, but higher level optimization needs to make assumptions about what the code should be doing that may be wrong.

4

In my experience, aggressive optimization usually works. For example, O3 can more than double the speed of some programs. It is safer to debug without op- timization first, so you know if your code does what it is supposed to do, before optimizing.

So - is optimization in the BACK or FRONT end? Depends on what you are trying to optimize. If you want to optimize at all, you need to put it in the back end, because that is when you are able to take advantage of machine features that you can use. You are constrained in doing this by whateve information you are getting fro your I-code. If you are using 3-address code, then the I-code hides language structures that you could have used in optimization. So you need to do language-dependent optimization, as well as anything that depends on high level features (for instance objects and data structures) in the FRONT end.

1.3. Bytecode compilers and interpreters. The FRONT end is needed for everything, we always need to parse the code to determine if the syntax is correct and translate it - either directly into actions as in a pure interpreter (old BASIC ), or into some kind of intermediate code (almost everything else). A bytecode compiler like Java can have the same structure as a traditional compiler - the virtual machine still acts like a real machine, so the compiler can be structured in the same way. A key difference is that moving the compiler to a differrent machine or operating system does not require changing the compiler at all - the virtual machine makes all systems look the same.

Interpreters, either traditional or more recent ones that translate to an internal runtime representation, are essentially augmented FRONT ends - the main differ- ence between something like a Lisp interpreter and a Java compiler is that, in Java, the runtime engine is separated from the language, and so you can separately save the bytecode. Unless you change the program, you don’t have to compile more than once. In Lisp, the runtime engine is inside the same program that does the translation, so you don’t save the runtime code separately. It is possible in some interpreters to save the system runtime state when you shut down the interpreter. This state includes the translated code, so as long as you do this, you do not need to reload source code or translate it again.