Compilers(Question about Computer Science)
Dr. Ernesto Gomez : CSE 570/670 We revisit some concepts from lecture 1, then start on formal language defini-
tions.
1. Thoughts about human and computer languages
Human languages tend to be contextual and ambiguous. Contextual: the meaning of anything we say must be interpreted taking into
account the context in which it is said. For example: if I say “Look out! A shark!” in class, context tells you this is an example of something. If I yell the same thing while we are swimming in the ocean, it is either an urgent warning or a bad practical joke. Cultural context also plays a large part in interpretation of human language. Inoffensive statements in some cultures can be insulting in others, even when the same language is spoken in both - when I moved from Cuba to Puerto Rico (both Spanish speaking countries) as a child, I found that some common everyday Cuban words were at best impolite in Puerto Rican usage.
Ambiguous: the same phrase can be understood multiple ways. In this case, we are talking about the same expression in the same context. Consider the signs on the stairs in this building: IN and OUT. Does IN mean in to the building, or in to the stairwell? Or consider: if cat food is fed to cats, and baby food to babies, what is “cheese food”?
An interesting thing about human language is that even incorrect expressions carry meaning: for example, when we are starting to learn a language we can make ourselves understood by others before we have fully mastered the grammar. “I have hungry” is not correct, but we would understand it to mean the speaker is hungry.
All the above makes human languages drastically unsuitable for communicating with a computer. People have long tried to get a human language interface to control computers (computers that accept voice commands are the latest variation on this). What we really want, however, is a computer that does what we mean, and in human language we can’t always tell what we mean by what we actually say.
Our best solution to this problem to date is to design artificial languages for computers, with properties different from human languages. We want the following properties:
• Freedom from ambiguity: A statement in a computer language should have only one meaning.
• No dependence on external context: The meaning of a statement should be understandable given only the statement itself (or at least the program of which it is a part), and the meaning should not change with circumstances.
To these we add two practical consideration: • It should be possible to determine the meaning of a statement without doing
too much work or spending too much time. That is, the time complexity of translation should be small.
• The syntactic correctness of a statement should be unambiguous. That is, there should be no doubt, independent of meaning, whether a statement is correct (in the language) or not.
As computers become more powerful, what was considered “too much work” in the past may become practical, but some problems - e.g. exponential complexity -
1
2
remain intractable; we don’t want a compiler that could take exponential time to translate something.
The second practical consideration may not be obvious at first. We are not here saying that a statement needs to be unambiguous; we are saying that deciding whether something is a statement or not needs to be unambiguous. This is because, when we define a language, we don’t want one compiler translating a statement and a different compiler treating it as an error (and yes, we don’t really achieve this, as anybody who has ever uploaded a Visual C++ program to the lab and tried to compile it with Gnu knows..).
2. Sets
We limit our considerations to written languages. We are then faced with two problems: recognition (deciding if a particular expression is in the language) and translation (deriving the meaning of an expression and rendering it in a different language, e.g. machine language). The first problem refers to the syntax of the language, that is, its structure. The second problem is that of semantics (meaning), which we will try to derive from the syntax.
We first address the recognition problem. To do this we need a formal description of the syntax of a language; as we will see there are a number of ways of doing this. All, however, start from an alphabet, which is a set of characters which we denote by the symbol Σ (this is also true of human languages). We will use a superscript ∗(Kleene star operator) to denote zero or more of the expression to which it is applied; thus a∗ means a string of zero or more a characters, and Σ∗ is the set of all possible combinations of zero or more characters in Σ. Some of those combinations will be expressions in the language, others will not. The requirement that we unambiguosly be able to decide whether an expression is or is not in the language leads us to reject anything not syntactically correct, even if its meaning may be clear (e.g. a C++ statement missing a final semicolon, or the English examples given above).
We can then define a language L as a subset of Σ∗; the definition of the syntax then becomes the definition of what is included in this subset. From this definition, we would like to produce an algorithm that allows us to determine whether a particular string is or is not in L; such an algorithm solves the recognition problem for L.
2.1. Formal languages and machines. Let T be a set of symbols, it could be the same thing as the alphabet Σ , but we may later extend T to include not just single characters, but more complicated structures - for example specific words such as “while”, syntactic patterns such as floating point numbers. We can still say T∗ includes all possible combinations of things in T, and define a language L ⊂ T∗ in terms of rules to select patterns that are in the language. Now, the rules can be anything we can define as an algorithm. Some examples:
(1) T1 = {0, 1} ; strings of 0 and 1 in any order. ( binary numbers) (2) T2= T1and{+, =}; string_in_T1= string_in_T1+string_in_T1 (addition
expressions) (3) T3= T2and{∗, ”, )};string that consist of strings in T1 with “*”, “*(“ ,” +”,”
+(“ between each pair, and where every “(“ has a matching “string_in_T1)” - (describes syntax for arithmetic expressions with *,+ and parenthesis)
3
(4) T4 =T3and(=);string of the form : number = string_in_T3 (assignment statements)
(5) T5= T4 representing correct binary arithmetic.
1 through 4 are formal languages, 5 is not because it assigns meaning to the string. As we go down the list, the machinery required becomes more complicated. The first two can be done by a simple finite automaton (FA) - this is just a state machine, as we will see it decides what to do next based entirely on the state it is in. Three and four and more complicated, to match parenthesis you need to remember how many open parenteses you have that must be closed before the end of the string. We will see later that a stack is enough to keep track of things that need to match. A push-down automaton (PDA), essentially an FA with a stack can do this. T5, in addition to matching parenthesis, must keep track of partial results - it takes a Turing Machine (TM) - turns out that a TM can be thought of as a state machine (like an FA) with an unlimited tape, so it can go back and forth on the data, modify it, and read it multiple times.
The languages we want to traslate are Turing equivalent languages. Turing ma- chines were conceived as a mathematical definition of “algorithm” - we cant actually prove that this is true (the claim that all algorithms can be computed on a TM is the Church-Turing Hypothesis, from Alonzo Church and Alan Turing). However, we have never found anything that violates it, and every other computational model we have ever devised turns out to be able to solve exactly the same problems that a TM can solve - with the possible exception of quatum computers, all other com- putational models are within a polynomial factor of each other i terms of efficiency - no model is enough better or worse than any other to make a real difference to what can be done in practice. So we think there is something really basic about machine models.
Anyway, computer languages like Python and Lisp are equivalent to TM, and we write our programs in them. However, they are too powerful for the language translation task. TM programs can include the “Partial Recursive Functions” - these are functions that don’t have solutions for all input data. Submit such a function to a TM, sometimes it will run forever. If it halts, the time it will take is unpredicatable - just because a program has been running for millions of years doesn’t mean it won’t halt, it just means it hasn’t halted yet. This means that some problems are not computable. For example, the question “does a TM running program X halt when you input data Y “ is not computable - this is the halting problem.
Human languages are complicated enough to require this kind of power - we have all encountered phrases which are grammatically correct, but we don’t understand what they mean. Consider “What is the sound of one hand clapping?” or “Why is a mouse when it spins?” . The first phrase is a Zen koan - its is not supposed to have an answer (maybe?) - it is something you think about on the road to enlightenment (Does the Buddah understand this koan? Could he proved an answer we could understand? Not being enlightened, I dont know.) The second actually does have an answer - it is supposed to mean “the higher, the fewer”, and is said to be a reference to spinning valves which were used to indicate pressure on the first steam engines in Scotland. But given some random phrase, how do we know if there is some context that gives us a meaning or not?
4
As humans, we are intelligent enough to give up (or not - we could spend our entire lives thinking about Zen koans). Computers are not that smart - we want to be able to guarantee that a compiler will halt in finite (and reasonabley short) time, and either give us an error message or a translation. We can not guarantee this for a TM. Simpler machines can be guaranteed to halt, we can make good estimates of the time they will take, we can make them structurally unambiguous in ways we can not do on TM. Less powerful machines are simpler to build, as well.
2.1.1. Machine Models. You may wonder - how many machine choices do we have? It turns out, not as many as you would think, most of the alternatives are equivalent to broad theoretical categories. Consider Turing Machines - we would not actually compute anything on a TM, we would use something else that is faster/easier to program. Our computers are more or less VonNeumann machines - a CPU that does the work, a random access memory to store program and data. This is completely equivalent to a TM. We could use the Lambda Calculus, introduced by Alonzo Church to define mathematical functions - it is completely equivalent to a TM. string_in_T1e
In fact, we have just three computational models - everything is equivalent to one of them. These are:
• Finite automata : FA • Push Down Automata : PDA - an FA with a stack • Turing Machine : TM - an FA with a tape (and it turns out we can simulate
a tape with two stacks, wo we can define everything with FA and stacks!) Notice, by the way, that every machine can do everything that less powerful machine can do - Build a TM with 2 stacks - take away one stack and it is a PDA. Take the stack away from the PDA and it is an FA. The more powerful machine can always do whatever the less powerful machine does by simply not using some of its hardware.
Having decided not to use TM for the translation task, we have the choice of PDA or FA. Which will we use? We are going to use both FA and PDA. Even though PDA is more powerful, it is easier to define simple patterns (like numeric formats, rules for variable names) using an FA, so we will start with a Finite Automaton, and use its output as the input to a Push Down Automaton.
3. Regular languages
(The following discussion is similar to, but gives somewhat different algorithms from, sections 3.6-3.7 in the book)
We define a Regular Expression (R.E.) inductively as follows (see also pg. 94): Given an alphabet Σ: (1) a ∈ Σ is a regular expresion, denoting the set < a > (the set containing the
character a). (2) ε (denotes a string of length 0) is a regular expression, denoting the set<
ε >. (3) ∅ is a regular expression, denoting the empty set ∅. (4) Union: If A and B are regular expressions, A∪B is a regular expression. (5) Concatenation: If A and B are regular expressions, AB is a regular expres-
sion. (6) Star: If A is a regular expression, A∗ is a regular expression.
5
(7) Nothing else is a regular expression.
A regular expression over some alphaber defines a subset Σ∗; we call this a regular language. We claim that a specific type of machine, called a Finite Automaton, expresses an algorithm which is capable of recognizing any regular language. That is, if we can define a syntax by a regular expression, we can create a finite automaton that recognizes that syntax.
4. Finite automata
We usually depict a finite automaton as a directed graph, in which circles denote states and arrows denote transitions from one state to another. The input to a finite automaton is a string over some alphabet. There is some defined starting state that reads the first character of the string. The arrows departing a state are labelled with characters; the finite automaton takes an arrow labelled with the character it has just read and transitions to the state that arrow points to; then a new character is read and the process is repeated.
If, when all characters in the string have been read, the finite automaton is in a state marked as “accepting” (indicated by a concentric circle inside the circle which denotes the state), then the finite automaton is said to accept the string - that is, the string is in the language.
The finite automaton is deterministic if the choice of which transition to take out of every state is unique (no pair of departing arrows is marked with the same character).
It is useful to formalize this description of finite automata: We define a Non-deterministic Finite Automaton (NFA) as a 5-tuple: N = (Σ,Q,τ,q0,F) is an NFA, where: Σ is an alphabet. Q is a finite set of states. τ : Q× Σ → 2Q is a transition function from the cartesian product of the set of
states with the alphabet to the power set of the set of states (The power set of Q is the set of all subsets of Q). q0 is a single initial state. F ⊆ Q is the set of final states. A Deterministic Finite Automaton (DFA) differs from an NFA only in the tran-
sition function: τ : Q× Σ → Q. Note that Q ⊂ 2Q, therefore a DFA is a restricted case of NFA.
5. Construction of NFA from R.E.
Construction based on notes from Ernst Leiss, Department of Computer Science, University of Houston. The text gives a similar approach in section 3.7 (page 121). The construction given here has the advantage that it avoids the need for ε transitions.
The graphs are as described in the text. We use the following format for transi- tion tables:
state input smbol accept state number new state or - T or F
6
The start state is always numbered 0. “New state” is the number of the state to transition to on the given “input symbol”, “-“ denotes a blank entry, indicating that there is no transition.
5.1. Basis:Regular languages are defined by rules which only describe which possible combinations of characters from Σ are in the language, and which are not. Consider the following definition: Let Σ = {-,0,1,2,3,4,5,6,7,8,9} ; that is, numeric digits and the “-” sign.
Consider now the language: L=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0 What does L describe? The first parenthesis says that a string ∈ Lcan optionally
start with − , but the | with the empty string ε says it is not required. The next character after -, or the first character if - is absent, is a single digit from 1 to 9. This can be followed by zero or more digits between 0 and 9 (the * operator). Finally, the digit 0 by itself is also in L. This lexical description happens to match I, the set of the integers. without the -, we get the postitive integers, with the - we get the negative integers. a non-zero integer (positive or negative) may not begin with the digit 0, so we “or” all the non-zero integers with the digit 0.
We can use the notation (1 . . . 9) when we have a standard lexical ordering, as we do from the digits 1 to 9, or the letters i . . .n in alphabetical order, to mean a single letter selected from the start and end symbols and anything in between. This is a convenient abbreviation, but not part of the regular language definition, becausL=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0e it requires the use of an ex- ternal context (like numbers ordered by value, or alphabetical order). Using this, we could rewrite our language as: L=(−|ε)(1 . . . 9)(0 . . . 9)∗|0, which is both shorter and easier to read.
See if you can define a deterministic finite automaton that accepts this language (see chapter 3 of text, notes 3). < a > :
q q 0
States: Q = {q0,q1}, initial state q0 Transition function: τ(q0,a) → q1, no other transitions. Accepting states: F = {q} Table:
Regular languages are defined by rules which only describe which possible combinations of characters from Σ are in the language, and which are not.
Consider the following definition: Let Σ = {-,0,1,2,3,4,5,6,7,8,9} ; that is, numeric digits and the “-” sign. Consider now the language: L=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0 What does L describe? The first parenthesis says that a string ∈ Lcan optionally
start with − , but the | with the empty string ε says it is not required. The next character after -, or the first character if - is absent, is a single digit from 1 to
7
9. This can be followed by zero or more digits between 0 and 9 (the * operator). Finally, the digit 0 by itself is also in L. This lexical description happens to match I, the set of the integers. without the -, we get the postitive integers, with the - we get the negative integers. a non-zero integer (positive or negative) may not begin with the digit 0, so we “or” all the non-zero integers with the digit 0.
We can use the notation (1 . . . 9) when we have a standard lexical ordering, as we do from the digits 1 to 9, or the letters i . . .n in alphabetical order, to mean a single letter selected from the start and end symbols and anything in between. This is a convenient abbreviation, but not part of the regular language definition, becausL=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0e it requires the use of an ex- ternal context (like numbers ordered by value, or alphabetical order). Using this, we could rewrite our language as: L=(−|εSeeifyoucandefineadeterministicfiniteautomatonthatacceptsthislanguage(seechapter3oftext,notes3).)(1 . . . 9)(0 . . . 9)∗|0, which is both shorter and easier to read.
See if you can define a deterministic finite automaton that accepts this language (see chapter 3 of text, notes 3).
state a not a accept 0 1 - T 1 - - F
< ε >:
q 0
States: Q = {q0}, no transitions. Accepting states: F = {q0} Table:
state a accept 0 - T
∅:
q 0
States: Q = {q0}, no transitions. Accepting states: F = ∅ (no accepting states). Table:
state a accept 0 - F
8
5.2. Induction: Σ is the union of the alphabets of all the NFAs in the following operations. Union: N1 = Nα ∪Nβ
Description: Merge the initial states of the two automata into a single initial state. The transitions out of the combined initial state are the same as the transi- tions out of the two original initial states. The automata are otherwise unchanged.
- Q1 = (Qα −q0α ∪Qβ −q0β) + q0 - q0 = combination of q0α and q0β - τ: {τ(q0,a) = τα(q0,a) ∪ τβ(q0,a) τ(q,a) = τα(q,a) ∪ τβ(q,a)} - F1 = Fα ∪Fβ
Algorithm 1. Given: two tables, T1 and T2 representing finite automata A and B, to combine into a table T3 representing A∪B.
Assume all states in each table are numbered starting from 0. Let N1 be the highest numbered state in T1. If K is the number of a state in T2, replace number K with K+N1.
Now construct T3 as follows: 1. Start with a copy of T1, with the same states and column entries for all
columns labeled with input symbols. 2. If there is an input symbol x that appears as the label of a column in T2 that
is not in T1, add a column to T3 and label it with that symbol x. Enter a “-“ in the new column x for all rows representing states in T1.
3. Add rows for each state in T2 other than state 0, with the same entries for all columns labeled with input symbols. If some symbol y appeared as a column label in T1 and not in T2, enter a “-“ in all such columns.
4. Copy all non-blank entries from columns labeled by input symbols in row 0 of T2 into the corresponding column in row 0 of T3. If some column in row 0 of T3 already has a non-blank entry (that is, there is already a state number there), add the new state number separated by a comma.
5. Enter T in the accept column for any state in T3 copied from either T1 or T2 that had T in its accept column, enter F in the accept column of T3 for all other states.
Concatenation : N2 = NαNβ Description: eliminate the initial state of Nβ. Connect each of the final states
of Nα to every one of the states of Nβ that had transitions from the initial state of Nβ, with transitions on the same characters that are on transitions from q0β. The final states of N2 are the final states of Nβ; if Nβ accepts ε then the final states of N2 also include the final states of Nα.
- Q2 = Qα ∪Qβ −q0β - q0 = q0α - τ: { τα(q,a) ; q ∈ Qα −Fα τα(q,a) ∪ τβ(q0,a) ; q ∈ Fα τβ(q,a) ; q ∈ Qβ −q0β} - F2 = Fβ if Nβ does not accept ε, F2 = Fα ∪Fβ if Nβ accepts ε.
Algorithm 2. Given: two tables, T1 and T2 representing finite automata A and B, to combine into a table T3 representing AB. Assume all states in each table
9
are numbered starting from 0. Let N1 be the highest numbered state in T1. If K is the number of a state in T2, replace K with K+N1.
Now construct T3 as follows: 1. Start with a copy of T1, with the same states and column entries for all
columns labeled with input symbols. Copy all T and F entries from the accept column into the new table.
2. If there is an input symbol x that appears as the label of a column in T2 that is not in T1, add a column to T3 and label it with that symbol x. Enter a “-“ in column x for all rows representing states in T1.
3. Add rows for each state in T2 other than state 0, with the same entries for all columns labelled with input symbols. If some symbol y appeared as a column label in T1 and not in T2, enter a “-“ in all such columns.
4. If state 0 in T2 has a non-blank entry in the column for some input symbol z, copy that entry into the row labeled z for all states of T1 marked T in the accepts column. If some column and row of T3 already has a non-blank entry (that is, there is already a state number there), add the new state number separated by a comma.
5. Enter T in the accept column of T3 for any state of T2 that had T in its accept column. If state 0 of T2 was not an accepting state, replace the T entry in all states copied from T1 with F, else leave them unchanged. Enter F in the accept column of T3 for any states not otherwise marked.
Star: N3 = N∗ Description: If q0 is not an accepting state, make it an accepting state. Connect
each accepting state to the same states which can be reached from q0, by transitions on the same characters that are on the transitions from q0.
- Q3 = Q - q0 = q0 - τ1: { τ(q,a) ; q ∈ Q−F τ(q,a) ∪ τ(q0,a) ; q ∈ F } - F3 = F ∪{q0}
Algorithm 3. Given a table T1 representing a finite automaton A, construct a table T2 representing A∗.
1. Copy table T1 into table T2. 2. If a state K is marked T in the accept column of T1, copy all non-blank
entries in the input symbol columns of state 0 into the corresponding column of state K. If some column and row already has a non-blank entry (that is, there is already a state number there), add the new state number separated by a comma.
3. If a state is marked T in the accept column of T1, mark it T in the accept column of T2. If state 0 of T2 is not already marked as an accepting state, mark it as T in the accept column of T2. Mark the accept column of all states not already marked with F.
5.3. Construction of NFA from R.E.. Identify basis items in the expression (letters in Σ, ε, ∅), draw graphs from the basis set for each basis element. Combine pairs of elements using the operations in the recursive construction. Combine the resulting NFAs. Repeat until the entire expression is a single NFA.
5.4. DFA and minimal DFA:. See algorithm 3.2 on page 118 of text, algorithm 3.6 on page 142.
10
5.5. Equivalence of Regular Expression and Finite Automata. The above algorithms, taken together, constitute a proof that, for every regular expression there is a finite automaton. We have also proved the equivalence of NFA and DFA by providing algorithms to convert from NFA to DFA. It is not necessary to show that from a DFA we can produce an equivalent NFA, because every DFA can be considered to be an NFA in the limiting case in which every combination of state and simbol has only one possible path.
For our purposes in compiler construction, it is sufficient to show the above. However, for completeness, we also need to consider conversion of a finite automa- ton to a regular expression. See Hopcroft and Ullman, Introduction to Automata Theory, Languages and Computation, Theorem 2.4 for a proof that from every DFA we can produce a regular expression.
An alternative way of convincing ourselves that finite automata can always be converted to regular expressions is to devise an algorithm that does this. Following is an approach to such an algorithm.
We represent a FA as a directed graph G = {V,A} where states qi are vertices V , paths qi → qj are arcs in A, and we annotate the paths with the symbols that cause a transition along that path.
A depth-first traverse of G can be used to generate a regular expression that corresponds to the particular FA. Multiple symbols along a path are related by or = | in the regular expresion, multiple paths from a state are also connected by or. The ∗ operator is applied when a path qi → qj connects to a state that is on the stack, to all the symbols (on the stack) between qjand qi. Try to devise an algorithm that accomplishes this - consider pushing states and symbols onto a stack while traversing the graph, constructing the regular expression during the backtracking phase of the traverse when popping symbols off the stack.
The text book constructs (chapter 3) a detailed algorithm, first for converting a non-deterministic finite automaton (NFA) to a deterministic (DFA)y, and then another algorithm to minimize the resulting DFA to the smallest possible DFA that accepts the same language. We then go full circle by converting the DFA to a regular expression - this whole sequence serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same set of languages.
Further, the ability to minimize a DFA gives us the ability to determine if any pair of language definitions (RE, DFA or NFA) accept the same language - if two RE, or two FA minimize to the same DFA, then they accept the same language - that is, they mean the same thing.
All of this is of great theoretical interest but very little practical interest. The problem is, the algorithm to convert NFA to DFA has exponential complexity. Therefore the conversion can only be done for small languages and automata - the amount of work required for a more complex automaton or expression grows to fast. (The same applies to the backtracking algorithm we suggested in Notes 3 - it is a backtracking algorithm, and such algorithms also have exponential complexity).
5.6. Examples. Regular languages are defined by rules which only describe which possible combinations of characters from Σ are in the language, and which are not. Consider the following definition:
Let Σ = {-,0,1,2,3,4,5,6,7,8,9} ; that is, numeric digits and the “-” sign. Consider now the language:
11
L=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0 What does L describe? The first parenthesis says that a string ∈ Lcan optionally
start with − , but the | with the empty string ε says it is not required. The next character after -, or the first character if - is absent, is a single digit from 1 to 9. This can be followed by zero or more digits between 0 and 9 (the * operator). Finally, the digit 0 by itself is also in L. This lexical description happens to match I, the set of the integers. without the -, we get the postitive integers, with the - we get the negative integers. a non-zero integer (positive or negative) may not begin with the digit 0, so we “or” all the non-zero integers with the digit 0.
We can use the notation (1 . . . 9) when we have a standard lexical ordering, as we do from the digits 1 to 9, or the letters i . . .n in alphabetical order, to mean a single letter selected from the start and end symbols and anything in between. This is a convenient abbreviation, but not part of the regular language definition, becausL=(−|ε)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)∗|0e it requires the use of an ex- ternal context (like numbers ordered by value, or alphabetical order). Using this, we could rewrite our language as: L=(−|ε)(1 . . . 9)(0 . . . 9)∗|0, which is both shorter and easier to read.
See if you can define a deterministic finite automaton that accepts this language (see chapter 3 of text, notes 3).