Build an interpreter and a compiler to C ++ for the language B OOLexp

profilevic
PLCI_Chapters2_3_4.pdf

CO NF

ID EN

TI AL

D RA

FT

Chapter 2

Formal Languages and Grammars

Author: Saverio Perugini Copyright © 2018 by Saverio Perugini ALL RIGHTS RESERVED

2.1 Chapter Objectives

• Establish an understanding of syntax and semantics.

• Establish an understanding of formal methods for defining the syntax of a programming language.

• Establish an understanding of regular languages and grammars (lex- emes/tokens).

• Establish an understanding of context-free languages and grammars (expressions).

• Establish an understanding of the use of Backus-Naur form to define context-free grammars.

• Establish an understanding of role of context in programming lan- guages and the challenges in modeling it.

39

CO NF

ID EN

TI AL

D RA

FT40 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSTable 2.1: Progressive levels of sentence validity. candidate sentence lexically valid? syntactically valid? semantically valid? Socrates is a mann. no no no Man Socrates is a. yes no no Man is a Socrates. yes yes no Socrates is a man. yes yes yes

2.2 Introduction to Formal Languages

A sentence is a string of characters over some alphabet. For instance, given an alphabet Σ = {a, b, c}, the strings a, aa, aaa, bb, aba, and abc are sen- tences. A formal language is a set of sentences. For instance, {a, aa, aaa, bb, aba, abc} is a formal language. The legal strings of a formal language are called sentences. Recall that syntax refers to structure or form of language and semantics refers to meaning of language. Previously, both syntax and semantics were described intuitively. Now, well-defined, formal systems are available to define the syntax and, to a much lesser extent, semantics of formal languages. A formal grammar can be used to define the syntax of a formal language. There are finite and infinite languages. Finite lan- guages have a finite number of sentences. The language above is a finite language (i.e., it has six sentences). The C programming language is an in- finite language. Most interesting languages are infinite. A set of sentences define a formal language extensionally while a formal grammar defines a formal language intensionally. Of course, we cannot extensionally define an infinite language.

Since this text takes an implementation-oriented approach, we motivate our discussion of formal languages and grammars by thinking about the steps an interpreter or compiler must execute to determine the validity of a program (i.e., determine if the program is a sentence in the language). We then examine these steps from a more formal and theoretical perspective. Armed with an understanding of the theory of formal language definition mechanisms and methods, we return to practice and study how those de- vices can be used to recognize a valid program during interpretation or compilation.

There are three levels of sentence validity. A sentence is lexically valid if all the words of the sentence are valid. A sentence is syntactically valid if it is lexically valid and the order of the words is valid. A sentence is semanti-

CO NF

ID EN

TI AL

D RA

FT2.3. REGULAR LANGUAGES AND GRAMMARS 41cally valid if it is lexically and syntactically valid and if it has a valid mean- ing. Consider the sentences in Table 2.1. The first candidate sentence is not lexically valid because ‘mann’ is not a word and, therefore, the sentence cannot be syntactically or semantically valid. The second candidate is lexi- cally valid because all of its words are valid, but it is not syntactically valid because the arrangement of those words does not conform to the subject- verb-article-object structure of English sentences and, therefore, it cannot be semantically valid. The third candidate is lexically valid because all of its words are valid and syntactically valid because the arrangement of those words conforms to the subject-verb-object structure of English sen- tences, but it is not semantically valid because the sentence does not make sense. The fourth candidate sentence is lexically, syntactically, and seman- tically valid. Notice that these levels of sentence validity are progressive. Once a candidate sentence fails any test for validity, it automatically fails a more stringent test for validity. In others words, if a candidate sentence does not even have valid words, those words can never be arranged cor- rectly. Similarly, if the words of a candidate sentence are not arranged correctly, that candidate sentence can never make semantic sense. For in- stance, the second sentence in Table 2.1 is not even syntactically valid so it can never be semantically valid.

2.3 Regular Languages and Grammars

A regular grammar (also called a linear grammar) is a type of formal gram- mar used to define a formal language. Regular grammars define regular languages. Regular grammars have their own syntax, not to be confused with the syntax of the languages they are used to define. Thus, a reg- ular grammar is a metalanguage or a language to describe a language. Some characters in the alphabet of regular grammars, not the language being defined, are special and called metacharacters. Two important reg- ular grammar metacharacters are * and +. The semantics of the *, oth- erwise known as the Kleene star or Kleene closure operation, are 0 or more occurrences of previous character, and the meaning of + is union (i.e., OR). For instance, the regular grammar hw⋆ defines the language {h, hw, hww, hwww, . . . , hwwww, . . .}. In this case the set of sentences is infinite. All finite languages are regular, but the reverse is not true. More

CO NF

ID EN

TI AL

D RA

FT42 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSgenerally, any language which can be defined by a regular grammar is a regular language. Any finite language can be defined by a regular gram- mar because we can simply list all sentences of the language with interven- ing + metacharacters in the grammar. For instance, the regular language {a, b, c} is defined by the regular grammar {a+ b+ c}. A regular grammar is a generative device for a regular language. In other words, we can use the grammar to generate sentences from the language it defines (e.g., h, hw, hww).

There are extensions to the syntax of regular grammar which provide a shorthand notation to help us define regular grammars in a more concise manner. For instance, . often represents any single character. The se- mantics of [c1c2c3. . . cn] or [c1–cn], called a character class, is any one of the characters in the range c1–cn. For instance, [abcxyz] (which is equivalent to the regular grammar a+ b+ c+ x+ y + z) specifies any of a, b, c, x, y, z, and [a−zA−Z] indicates any alphabetic character. More examples include

• hw[1−9][0−9]⋆, which defines the language {hw1, hw2, . . . , hw9, hw10, hw11, . . .}, and

• [0−9][0−9][0−9]−[0−9][0−9]−[0−9][0−9][0−9][0−9], which defines the language of social-security numbers.

Prefacing a character class with a ∧ indicates the complement of the speci- fied class (i.e., [∧n−m] represents all characters outside of the range n−m). For example, [∧a−zA−Z] specifies any character except an alphabetic one. If you desire to define a language whose sentences contain the − special character, place it either at the beginning or end of a character class (e.g., [−abcd]). If you desire to define a language whose sentences contain the ∧ special character, place it somewhere other than the beginning of a charac- ter class (e.g., [abcd∧]).

While not technically part of the syntax of regular grammars we use these extensions here for ease of exposition. Since any regular grammar using these syntactic extension can be re-written using ⋆ and +, these ex- tensions are non-essential and, thus, are typically referred to as syntactic sugar. These extensions to regular grammar form the basis of UNIX regular expressions which involve a much richer set of extensions.

A regular grammar is said to match strings (i.e., the sentences of the language the grammar defines). For instance, the regular grammar

CO NF

ID EN

TI AL

D RA

FT2.3. REGULAR LANGUAGES AND GRAMMARS 43[_a−zA−Z0−9] [0−9]

s1

s2

s3

[_a−zA−Z]

[1−9]

Figure 2.1: A finite-state automaton for a legal identifier and positive integer in C defined by the regular grammar [ a− zA− Z][ a− zA− Z0− 9]⋆ + [1− 9][0− 9]⋆.

Table 2.2: Two-dimensional array implementing a finite state automaton for a legal iden- tifier and positive integer in C: [ a− zA− Z][ a− zA− Z0− 9]⋆ + [1− 9][0− 9]⋆.

current state input character 1 2 3

2 2 error [a− z] 2 2 error [A− Z] 2 2 error

0 error 2 3 [1− 9] 3 2 3

[∧ →][∧ →]⋆[ →][ →]⋆[∧ →][∧ →]⋆[ →][ →]⋆[∧ →][∧ →]⋆

matches any phrase of exactly three words separated by whitespace. Here tab is represented as→. This grammar matches the three underlined sub- strings (which are sentences by virtue of being matched) in the following string:

This is a short sentence.

Finite state automaton (FSA) is the model of computation used to recog- nize sentences defined by a regular grammar. An FSA accepts a string (i.e., candidate sentence) as input, and outputs ‘yes’ if, after running the en- tire string through the automaton, the automaton is left in an accepting state (i.e., one represented by a double circle, e.g., states two and three in Fig. 2.1) indicating that the string is valid and, thus, a sentence, and ‘no’ otherwise indicating that the string is not a sentence. A regular gram- mar is a model for the design of an FSA. Formally, an FSA decides a lan- guage. We return to this theme of the dual nature of grammars in while discussing context-free grammars below in § 2.4. Fig. 2.1 presents a finite

CO NF

ID EN

TI AL

D RA

FT44 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSstate automaton1 which recognizes sentences defined by the regular gram- mar [ a− zA− Z][ a− zA− Z0− 9]⋆ + [1− 9][0− 9]⋆ which describes pos- itive integers and legal identifiers in C.

Table 2.2 illustrates how one might implement an FSA using a two- dimensional array representation where the indices model current state and input character and the values in the cells represent the transitions (i.e., which state to transition to based on the current state and the input character).

In summary, a regular language (which is a formal language) is defined or generated by a regular grammar (which is a formal grammar) and rec- ognized by a finite state automaton (which is a model of computation). See the second row of Table 2.4.

2.3.1 Conceptual Exercises for Section 2.3

Exercise 2.3.1: Identify a keyword in C or C++.

Exercise 2.3.2: (true / false) The lexeme int is a keyword in C and C++.

Exercise 2.3.3: (true / false) There does not exist an infinite language L, where L is regular.

Exercise 2.3.4: Define a regular grammar which defines a language whose sentences are the set of all strings of letters which contain each of the five vowels exactly once and in order. Use only lower case letters. For example, your regular grammar must be capable of generating the strings aeiou, ateitou, and zaeztiouzpq, but not aeou, aaeiou, and aeitgaou. You may use only the ., ⋆, +, [ ], and ∧ metacharacters in your regular grammar.

Exercise 2.3.5: Explain with examples how a regular grammar using the . (dot) syntactic extension to regular grammar can be rewritten without use of the . (dot).

1The FSA in Fig. 2.1 is not a pure FSA because it, like the grammar which defines the language it recog- nizes, uses syntactic sugar. While this FSA only has three transitions, it should have one for each individual input character which moves the automaton from one state to another. For instance, there should be nine transitions between states one and three, one for each positive digit.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 452.4 Context-Free Languages and Grammars There is a limit on the expressiveness of a regular grammar. In other words, there are languages which cannot be defined by a regular grammar. As a result, there are computational limits to finite state automata. Consider, for instance, the language of balanced parentheses, where the strings ((()())()) and ()() are balanced and, thus, sentences in the language and the strings ((()()), )()(, and ())(( are unbalanced and not in the language. This lan- guage cannot be defined by a regular grammar. Regular grammar is not powerful enough to model such a language. What construct or capabil- ity is missing from regular grammar which prevents the definition of this language? Consider how we might implement a computer program to recognize strings in this language. We might use a data structure such as a stack to match the parentheses. Specifically, we might push onto the stack whenever we encounter an open parenthesis and pop from the stack when we see a closed parenthesis. When all the characters in the string have been consumed, if the stack is empty, the parentheses in the string are balanced and we have a sentence, otherwise we do not. Using such a data structure implies that we need some form or concept of memory to match parentheses (to keep track of the number of unclosed open paren- theses). While regular grammars can model the lexemes (e.g., identifiers) of programming languages, they cannot define languages such as those containing sentences involving balanced parentheses or other similar ex- pressions involving combinations of lexemes structured according to a pre- defined pattern relevant to programming languages. Therefore, we must study more expressive and powerful grammars to model and define more expressive languages. In short, regular grammars are powerful enough to model the lexemes of programming languages, but not the expressions and statements.

Linguist Noam Chomsky formalized a set of grammars in the mid- 1950s. Chomsky’s work resulted in what is known as the Chomsky Hier- archy (see § 2.6 below) which is a progressive classification of formal gram- mars used to describe the syntax of natural languages. Level 2 of the hi- erarchy defines a type of grammar, called a context-free grammar, which is most appropriate for defining programming languages. Context-free grammars define a class of languages called context-free languages.

CO NF

ID EN

TI AL

D RA

FT46 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS2.4.1 Backus-Naur Form A stream of tokens must conform to a grammar (i.e., they must be arranged in a particular order). Grammars define how sentences are constructed and are defined using a metalanguage (i.e., a language for defining a language). A metalanguage for defining context-free grammars is called Backus-Naur Form (BNF). BNF gets its name from the last name’s of John Backus who developed the notation and used it to define the syntax of Algol 58 at IBM and Peter Naur who later extended the notation and used it for Algol 60. Backus was the recipient of the 1977 ACM A.M. Turing Award while Naur was the recipient of the 2005 ACM A.M. Turing Award for these contribu- tions.

The following is a context-free grammar defined in BNF for very simple English sentences.

(r1) <sentence> ::= <article> <noun> <verb> <adverb> . (r2) <article> ::= a (r3) <article> ::= an (r4) <article> ::= the (r5) <noun> ::= dog (r6) <noun> ::= cat (r7) <noun> ::= Socrates (r8) <verb> ::= runs (r9) <verb> ::= jumps

(r10) <adverb> ::= slowly (r11) <adverb> ::= fastly

The elements of BNF notation are

• a set of production rules which constitute the grammar (each line in the previous example is a production rule),

• a set of non-terminals (e.g., {<sentence>, <article>, <noun>, <verb>, and <adverb>}),

• a start symbol (e.g., <sentence>), and

• terminals (e.g., a, an, the, dog, cat, Socrates, runs, jumps, slowly, fastly).

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 47We can think of terminals as lexemes and non-terminals as tokens. The formal definition of a grammar is G = (V,Σ, P, S), where

• V is a set of non-terminal symbols,

• Σ is a set of terminal symbols (i.e., an alphabet)

• P is a set of production rules (or a finite relation P : V → (V ∪ Σ)∗, where ⋆ represents the Kleene star operation), and

• S is a starting symbol and S ∈ V ,

An extension made to BNF by Peter Naur for Algol 60 involved the ad- dition of syntactic sugar. While we discuss the details of the extension, called Extended Backus-Naur Form (EBNF), later, we cover one element of the extension, alternation, now since we use it in the following examples. Al- ternation allows us to consolidate production rules whose left hand sides match into a single rule whose right hand side consists of the the right hand sides of each of the individual rules separated by the | symbol. There- fore, alternation is purely syntactic sugar in that any grammar using it can be re-written without it. With alternation we can define the grammar above which contains eleven production rules with only five rules.

(r1) <sentence> ::= <article> <noun> <verb> <adverb> . (r2) <article> ::= a | an | the (r3) <noun> ::= dog | cat | Socrates (r4) <verb> ::= runs | jumps (r5) <adverb> ::= slowly | fastly

2.4.2 Language Generation: Derivations

Context-free grammars not only define language, but can also be used to generate sentences in the language by applying the rules in a top-down fashion from the start symbol to construct a derivation which is a sequence of applications of the production rules of a grammar starting with the start symbol and ending with a sentence (i.e., a string of all terminal symbols arranging according to the grammar). For example, consider deriving the sentence ‘the dog runs fastly.’ from the above grammar (below the seman- tics of the symbol ⇒ are ‘derives’, and the rn parenthesized on the right

CO NF

ID EN

TI AL

D RA

FT48 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARShand side of each expansion indicates which production rule was used in the expansion):

<sentence> ⇒ <article><noun><verb><adverb> . (r1) ⇒ <article> <noun> <verb> fastly . (r5) ⇒ <article> <noun> runs fastly . (r4) ⇒ <article> dog runs fastly . (r3) ⇒ the dog runs fastly . (r2)

The result (on the right-hand side of the ⇒) of each step is a string con- taining terminals and non-terminals called a sentential form. A sentence is a sentential form containing only terminals.

Consider the following context-free grammar for arithmetic expressions for a four-function calculator:

(r1) <expr> ::= <expr> + <expr> (r2) <expr> ::= <expr> − <expr> (r3) <expr> ::= <expr> ⋆ <expr> (r4) <expr> ::= <expr> / <expr> (r5) <expr> ::= <id > (r6) <id> ::= x | y | z (r7) <expr> ::= (<expr>) (r8) <expr> ::= <number> (r9) <number> ::= <number> <digit>

(r10) <number> ::= <digit> (r11) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Notice that we can shorten this grammar without changing the language it defines by consolidating rules 1–4 in two rules by adding a new non- terminal <operator>.

(r1) <expr> ::= <expr> <operator> <expr> (r2) <expr> ::= <id > (r3) <operator> ::= + | − | ⋆ | / (r4) <id> ::= x | y | z (r5) <expr> ::= (<expr>) (r6) <expr> ::= <number> (r7) <number> ::= <number> <digit> (r8) <number> ::= <digit> (r9) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 49In the following derivations, we use the larger, first grammar. A derivation is called leftmost if the leftmost non-terminal is always replaced first in each step. The following is a leftmost derivation of 132.

<expr> ⇒ <number> (r8) ⇒ <number><digit> (r9) ⇒ <number><digit><digit> (r9) ⇒ <digit><digit><digit> (r10) ⇒ 1 <digit><digit> (r11) ⇒ 13 <digit> (r11) ⇒ 132 (r11)

A derivation is called rightmost if the rightmost non-terminal is always re- placed first in each step. The following is a rightmost derivation of 132.

<expr> ⇒ <number> (r8) ⇒ <number><digit> (r9) ⇒ <number> 2 (r11) ⇒ <number><digit> 2 (r9) ⇒ <number> 32 (r11) ⇒ <digit> 32 (r10) ⇒ 132 (r11)

Some derivations, such as the following two derivations, are neither left- most nor rightmost.

<expr> ⇒ <number> (r8) ⇒ <number><digit> (r9) ⇒ <number><digit><digit> (r9) ⇒ <number><digit> 2 (r11) ⇒ <number> 32 (r11) ⇒ <digit> 32 (r10) ⇒ 132 (r11)

CO NF

ID EN

TI AL

D RA

FT50 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS parsergenerator

grammar

sentencestart symbol

grammar

string if start symbol, then yes, a sentence, otherwise no

Figure 2.2: Dual nature of grammars as generative and recognition devices. (left) A lan- guage generator which accepts a grammar and a start symbol and generates a sentence from the language defined by the grammar. (right) A language parser which accepts a grammar and a string and determines if the string is in the language.

<expr> ⇒ <number> (r8) ⇒ <number><digit> (r9) ⇒ <number><digit><digit> (r9) ⇒ <number> 3 <digit> (r11) ⇒ <digit> 3 <digit> (r10) ⇒ 13 <digit> (r11) ⇒ 132 (r11)

Therefore, random replacement is acceptable. The following is a rightmost derivation of x+ y ∗ z.

<expr> ⇒ <expr> + <expr> (r1) ⇒ <expr> + <expr> ⋆ <expr> (r2) ⇒ <expr> + <expr> ⋆ <id> (r5) ⇒ <expr> + <expr> ⋆ z (r6) ⇒ <expr> + <id> ⋆ z (r5) ⇒ <expr> + y ⋆ z (r6) ⇒ <id> + y ⋆ z (r5) ⇒ x + y ⋆ z (r6)

2.4.3 Language Recognition: Parsing and Parse Trees

In the prior sub-section we used context-free grammars as language gen- eration devices to construct derivations. In other words, we can imple- ment a computer program to randomly pick how to expand non-terminals and given an input grammar output a random sentence in the language defined by that grammar (see left side of Fig. 2.2). One of the seminal discoveries in computer science is that grammars can also be used for lan- guage recognition—the reverse. In other words we can implement a com-

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 51puter program to accept a candidate string as input and construct a reverse derivation to determine if the input string is a sentence in the language de- fined by the grammar or not (see right side of Fig. 2.2). Such a computer program is called a parser and does what is called parsing (i.e., constructing a reverse derivation)—the primary topic of Chapter 3. In constructing the reverse derivation (i.e., working the derivation backwards), if we arrive back at the start symbol when the input string is expired, then the string is a sentence, otherwise it is not.

language generation: start symbol → sentence language recognition: sentence → start symbol

Run the grammar forward and you have a generator; run the grammar backward and you have a parser.

Consider parsing the string x+ y ⋆ z. In the following parse, . denotes ‘top of the stack.’

1 . x + y ⋆ z (shift) 2 x . + y ⋆ z (reduce r6) 3 <id> . + y ⋆ z (reduce r5) 4 <expr> . + y ⋆ z (shift) 5 <expr> + . y ⋆ z (shift) 6 <expr> + y . ⋆ z (reduce r6) 7 <expr> + <id> . ⋆ z (reduce r5) 8 <expr> + <expr> . ⋆ z (shift;

why not reduce r1 here instead?) 9 <expr> + <expr> ⋆ . z (shift)

10 <expr> + <expr> ⋆ z . (reduce r6) 11 <expr> + <expr> ⋆ <id> . (reduce r5) 12 <expr> + <expr> ⋆ <expr> . (reduce r3; emit multiplication) 13 <expr> + <expr> . (reduce r1; emit addition) 14 <expr> . (start symbol; this is a sentence)

The left-hand side of the . represents a stack and the right-hand side of the . (i.e., top of the stack) represents the remainder of the string to be parsed called the handle. The process is straightforward. At each step you either shift or reduce. To determine which to do you examine the stack. If the items at the top of the stack match the right-hand side of any production

CO NF

ID EN

TI AL

D RA

FT52 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSrule, you replace those items with the non-terminal on the left-hand side of that rule. This is known as reducing. If the items at the top of the stack do not match the right-hand sides of any production rule, you shift the next lexeme on the right-hand side of the . to the stack. When the input string has been entirely consumed (i.e., shifted), if the stack only contains the start symbol, then the string is a sentence, otherwise, it is not. This process is called shift-reduce or bottom-up parsing because you start with the string or, in other words, the lexemes or terminals, and work your way up through the non-terminals to the start symbol. Later we contrast this parsing method with top-down or recursive-descent parsing. The above parse proves that x + y ⋆ z is a sentence.

2.4.4 Ambiguity

The following parse, although different from the above parse, proves pre- cisely the same thing.

1 . x + y ⋆ z (shift) 2 x . + y ⋆ z (reduce r6) 3 <id> . + y ⋆ z (reduce r5) 4 <expr> . + y ⋆ z (shift) 5 <expr> + . y ⋆ z (shift) 6 <expr> + y . ⋆ z (reduce r6) 7 <expr> + <id> . ⋆ z (reduce r5) 8 <expr> + <expr> . ⋆ z (reduce r1; emit addition;

why not shift here instead?) 9 <expr> . ⋆ z (shift)

10 <expr> ⋆ . z (shift) 11 <expr> ⋆ z . (reduce r6) 12 <expr> ⋆ <id> . (reduce r5) 13 <expr> ⋆ <expr> . (reduce r3; emit multiplication) 14 <expr> . (start symbol; this is a sentence)

Which of these two parses is better? How can we evaluate which is better? The short answer to these questions is it does not matter. The objective of language recognition and parsing is to determine if the input string is a sentence (i.e., does its structure conform to the grammar). Both of the

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 53above parses meet that objective and, thus, with respect to syntax, they both equally meet the objective. The only concern is if the string is syn- tactically valid, not whether it makes sense or not (i.e., semantically valid). Parsing deals purely with syntax rather than semantics. However, often we, admittedly impurely, yet sometimes necessarily, address semantic is- sues with techniques originally intended only for dealing with syntax. The reason is that unfortunately unlike syntax, we do not have formal models of semantics which are easily implemented in a computer system. There- fore, although unsettling, we sometime address syntax and semantics at the same time. One way to gently introduce semantics into syntax is to think of syntax implying semantics as a desideratum. In other words, the form of an expression or command (i.e., its syntax) should provide some hint as to its meaning (i.e., semantics). Arguably one of the major com- plaints against UNIX systems versus systems which are perceived to be more user-friendly is that the form of UNIX commands does not imply the function of the command (e.g., ls, ps, grep versus date and cal). Con- sidering another example, the concept of pushing some semantics into syn- tax should not seem so foreign a concept. For instance, we are taught in our first course in computer programming to use identifier names which im- ply the meaning of the variable to which they refer (e.g., rate and index versus x and y). See the programming style guide in Appendix A for more information.

Here we would like to push semantics into parsing in a very identifiable way. We would like to evaluate the expression while parsing it. This helps avoid making unnecessary passes over the string if it is a sentence. Again, realize we are shifting from the realm of syntactic validity into interpre- tation. The two should not be confused as they serve different purposes. Determining if a string is a sentence is completely independent of evalu- ating it for a return value. We subconsciously impart semantics onto an expression such as x+ y ⋆ z because without any mention of meaning we presume it is a mathematical expression. However, it is simply syntax (i.e., form) and can have any interpretation or meaning we impart on it. There- fore, the meaning of an expression such as x+ y ⋆ z could be a list of five elements.

Thus, in evaluating the expression while parsing it we are imparting knowledge of how to interpret the expression (i.e., semantics). Here we

CO NF

ID EN

TI AL

D RA

FT54 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSinterpret these sentences as standard mathematical expressions. However, to evaluate these mathematical expressions, we must impart even more semantics beyond the simple interpretation of them as mathematical ex- pressions. If they are mathematical expressions, to evaluate them we must determine which operators have precedence over each other (i.e., is x+ y ⋆ z interpreted as (x+ y) ⋆ z or x+ (y ⋆ z) as well as the order in which each operator associates (i.e., is 6− 3− 2 interpreted as (6− 3)− 2 or 6− (3− 2). Precedence deals with the order of distinct operators (e.g., * computes be- fore +) while associativity deals with the order of operators with the same precedence (e.g., − associates left-to-right). Notice that both parses of the expression x+ y ⋆ z are the same until line 8 where a decision must be made to shift or reduce. The first parse shifts while the second reduces. Both lead to successful parses. However, if we evaluate the expression while parsing it (which we should not because parsing is solely concerned with syntax) each parse leads to different results. One way to evaluate a mathematical expression while parsing it is to emit the mathematical op- eration when reducing. For instance, in step 12 of the first parse, when we reduce <expr> ⋆ <expr> to <expr> we can compute y ⋆ z, and simi- larly in step 13 of that same parse, when we reduce <expr> + <expr> to <expr> we can compute x + <the result computed in step 12>. This in- terpretation (i.e., x+ (y ⋆ z)) is desired because in mathematics multiplica- tion has higher precedence than addition. Now consider the second parse. In step 8 of the second parse, when we (prematurely) reduce <expr> + <expr> to <expr> we compute x+ y, and then in step 13 of that same parse, when we reduce < expr > ⋆ < expr > to < expr > we compute <the result computed in step 8> ⋆ z. This interpretation (i.e., (x+ y) ⋆ z) is obviously not desired. If we shift at step 8, multiplication has higher precedence than addition (desired). If we reduce at step 8, addition has higher precedence than multiplication (undesired). Therefore, we prefer the first parse. These two parses exhibit a shift-reduce conflict. If we shift at step 8 then multiplication has higher precedence than addition (which is the desired semantics). If we reduce at step 8 then addition has higher precedence (which is the undesired semantics).

There is also the possibility of a reduce-reduce conflict (not exhibited in the above parses). Consider the following grammar to illustrate a reduce- reduce conflict.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 55<exp> <exp> <exp>+

<id> *<exp> <exp>

<id>x

y z

<id>

<exp>

<exp> <exp>

<exp> <exp>

<id><id>

<id>

*

+

z

yx

Figure 2.3: Two parse trees for the expression x+ y ⋆ z.

<exp>

<id>

x

<exp>

<term>

<id>

x

Figure 2.4: Parse trees for the expression x.

(r1) <expr> ::= <term> (r2) <expr> ::= <id> (r3) <term> ::= <id> (r4) <id> ::= x | y | z

and a bottom-up parse of the expression x.

. x (reduce r4) <id> . (reduce r2 or r3 here?)

The underlying source of shift-reduce and reduce-reduce conflicts is an ambiguous grammar. A grammar is ambiguous if there exists a sentence which can be parsed in more than one way. A parse of a sentence can be graphically represented using a parse tree. A parse tree is a tree whose root is the start symbol of the grammar, non-leaf vertices are non-terminals, and leaves are terminals, where the structure of the tree represents the sen- tences conformity to the grammar. Therefore, a grammar is ambiguous if we can construct more than one parse tree for the same sentence from the language defined by the grammar. Fig. 2.3 gives parse trees for the expres- sion x+ y ⋆ z. The left tree represents the first parse above and the right tree represents the second parse above. The existence of these trees prove

CO NF

ID EN

TI AL

D RA

FT56 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSTable 2.3: Effect of ambiguity on meaning. Sentence Derivation(s) Parse Tree(s) Meaning(s) 132 multiple one one (132) 1+3+2 multiple multiple one (6) 1+3*2 multiple multiple multiple (7 or 8) 6-3-2 multiple multiple multiple (1 or 5)

that the grammar is ambiguous. However, a much simpler proof of am- biguity exists in Fig. 2.4 which contains two parse trees for the expression x.

We can say ambiguity is a property of a grammar while shift-reduce and reduce-reduce conflicts are properties of a parse. If a grammar is am- biguous, a bottom-up parse of a sentence in the language it defines will exhibit either a shift-reduce or reduce-reduce conflict. In other words, if a bottom-up parse of a sentence exhibits a shift-reduce or reduce-reduce conflict, the grammar which defines the language of the sentence must be ambiguous.

Note that a shift-reduce conflict means that both a shift and a reduce action lead to successful parses. A situation where both a shift and re- duce are possible at a particular point in a parse, but only one of the two actions at that point in the parse will lead to a successful parse is not a shift-reduce conflict and must not be confused with one. When yacc, a shift-reduce parser generator, encounters a shift-reduce conflict, it shifts by default. However, when both a shift or reduce action are possible at a particular point in a parse, yacc always chooses the action which leads to a successful parse because the parsing tables are generated from the gram- mar a-priori.

Similarly, a situation where both a reduce using a rule and reduce using another rule are possible at a particular point in a parse, but only one of the two reduces will lead to a successful parse is not a reduce-reduce conflict and must not be confused with one. When yacc encounters a reduce- reduce conflict, it reduces based on the first rule in lexicographic order in the grammar file. However, when both a reduce using one rule and a reduce using another rule are possible at a particular point in a parse, yacc will always reduce on the rule which leads to a successful parse.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 57

2

<exp>

<number>

<number>

<number><digit>

<digit>

<digit>

1

3

Figure 2.5: Parse tree for the expression 132.

3

<exp>

<exp> <exp>

<exp> <exp>

<number><number>

<number>

+

+

2<digit> <digit>

<digit>

1 2

<exp>

<exp> <exp>+

<number>

<digit>

<exp> <exp>

<number><number>

+

<digit> <digit>1

3

Figure 2.6: Parse trees for the expression 1+ 3 + 2.

2

<exp>

<exp> <exp>+

<number>

<digit>

<exp> <exp>

<number><number>

*

<digit> <digit>1

3 3

<exp>

<exp> <exp>

<exp> <exp>

<number><number>

<number>

*

+

2<digit> <digit>

<digit>

1

Figure 2.7: Parse trees for the expression 1+ 3 ⋆ 2.

CO NF

ID EN

TI AL

D RA

FT58 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS<exp> <exp> <exp>

<exp> <exp>

<number><number>

<number>

2<digit> <digit>

<digit>

6 3

<exp>

<exp> <exp>−

<number>

<digit>

<exp> <exp>

<number><number>

<digit> <digit>6

3 2

Figure 2.8: Parse trees for the expression 6− 3− 2.

How to Prove a Grammar is Ambiguous

Use the following steps to prove a grammar is ambiguous:

1. generate an expression E from the grammar and show the expression

2. give two parse trees (using the grammar) for that expression E

Notes:

• the expression must be derived from the grammar

• a parse tree must be fully expanded; it has no leaves which are non- terminals; they are all terminals

• collected leaves in each parse tree must constitute the expression

• we cannot change the grammar while building the parse trees

• we cannot change the expression while building the parse trees

Following these steps, it is trivial to prove a grammar is ambiguous. All we need is two parse trees. Much more difficult, on the other hand, is proving that a grammar is unambiguous.

It is important to note that a parse tree is not a derivation. A deriva- tion illustrates how to generate a sentence. A parse tree illustrates the opposite—how to recognize a sentence. Moreover, while multiple deriva- tions of a sentence (which we illustrate above in § 2.4.2) are not a problem, having multiple parse trees for a sentence is a problem, not from a recog- nition standpoint, but from an interpretation (i.e., meaning) perspective.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 59Consider Table 2.3 which contains four sentences from the above gram- mar. While the first sentence 132 has multiple derivations, it only has one parse tree (see Fig. 2.5) and, therefore, only one meaning. The second sen- tence 1+ 3+ 2 on the other hand has multiple derivations and multiple parse trees. However, those parse trees (see Fig. 2.6) all convey the same meaning (i.e., 6). The third sentence on the other hand, 1 + 3 ⋆ 2 also has multiple derivations and parse trees. However, its parse trees each convey a different meaning (i.e., 7 or 8). Similarly, the fourth sentence 6− 3− 2 has multiple derivations and parse trees, and those parse trees each have dif- ferent interpretations (i.e., 1 or 5). The last three rows of Table 2.3 show the grammar to be ambiguous even though the ambiguity manifest in the sen- tence 1+ 3 + 2 is of no consequence to interpretation. The third sentence demonstrates the need for rules establishing precedence among operators, and the fourth sentence illustrates the need for rules establishing how each operator associates (left-to-right or right-to-left).

Again, keep in mind, that we are addressing semantics in an area in- tended for syntax. We are addressing semantics using formalisms and techniques reserved for syntax primarily because we do not have sound methods for dealing with context, which is necessary to effectively address semantics, in computer systems. By definition, context-free grammars are not intended to model context. However, notice that the semantics we ad- dress through syntactic methods, namely precedence and associativity, are not dependent on context. In other words, multiplication does not have higher precedence than addition in some contexts and vice versa in others (though it could since we are defining the language2). Similarly, subtrac- tion does not associate left-to-right in some contexts and right-to-left in others. Therefore, all we need to do is make a decision for each and codify it.

Typically semantic rules like precedence and associativity are specified in English (due to the lack of formalisms for semantics) in the program- ming manual of a particular language (e.g., ⋆ has higher precedence that +, − associates left-to-right). Thus, English is one way to specify semantic rules. However, we know English itself is ambiguous. Therefore, when the ambiguity (in the formal language, not English) is not dependent on context, as in the case here with precedence and associativity, we can in-

2In the programming language APL, addition has higher precedence than multiplication.

CO NF

ID EN

TI AL

D RA

FT60 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSstrument the grammar so that the ambiguity is removed making the mean- ing (or semantics) determinable from the grammar (syntax). When ambi- guity is dependent on context, disambiguating the grammar to force one interpretation is not possible because you want more than one interpreta- tion, but only one per context. For instance, the English sentence ‘Time flies like an arrow’ can be parsed multiple ways. It can be parsed to indi- cate that there are these creatures called ‘time flies’ which like arrows (i.e., <adjective> <noun> <verb> <article> <noun>), or metaphorically (i.e., <noun> <verb> <preposition> <article> <noun>). Obviously, English is defined using an ambiguous grammar. How can we determine intended meaning? We need the surrounding context provided by the sentences be- fore and after this sentence. Consider parsing the English sentence ‘I shot the man on the mountain with the camera.’ which also has multiple in- terpretations corresponding to the different parses of it. ‘Eats shoots and leaves’ is another example of an of English sentence with multiple inter- pretations.

We have two options for dealing with an ambiguous grammar and both have disadvantages. We can state disambiguating rules in English (i.e., attach notes to the grammar) which means we do not have to alter (i.e., lengthen) the grammar, but this is at the expense of being less formal (by using English). Alternatively, we can disambiguate the grammar by re- vising it which is a more formal approach than using English, but this in- flates the grammar as we see below. As we see later the size of a grammar, the number of production rules it contains, is directly proportional to the size of the parser (in lines of code), and a larger parser runs slower than a shorter parser. While disambiguating the grammar is always an option, stating a disambiguating rule may not be because the disambiguation may depend on context.

2.4.5 Disambiguating Grammars by Revision

Here, ‘having higher precedence’ means occurring lower in the parse tree because expressions are evaluated bottom-up. In general, disambiguating a grammar involves creatively introducing new non-terminals to prevent a sentence from being parsed two ways. To remove the ambiguity caused by (the lack of) operator precedence, we introduce new steps (i.e., non-

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 61terminals) in the non-terminal cascade so that multiplications are always lower than additions in the parse tree. Recall, we desire part of the mean- ing (or semantics) to be determined from the grammar (or syntax).

Precedence

Consider the following updated grammar which addresses precedence.

(r1) <expr> ::= <expr> + <expr> (r2) <expr> ::= <expr> − <expr> (r3) <expr> ::= <term> (r4) <term> ::= <term> ⋆ <term> (r5) <term> ::= <term> / <term> (r6) <term> ::= (<expr>) (r7) <term> ::= <number>

With this grammar it is no longer possible to construct two parse trees for the sentence x+ y ⋆ z. The sentence x+ y ⋆ z, by virtue of being parsed us- ing this revised grammar will always be interpreted as x+ (y ⋆ z). How- ever, while the grammar above addressed issue of precedence, it is still ambiguous because it is still possible to use it to construct two parse trees for the expression 6-3-2 since it does not address associativity. Recall, as- sociativity comes into play when dealing with operators with same prece- dence. Subtraction is left associative (e.g., 6− 3− 2 = (6− 3)− 2 = 1) while unary minus is right associative (e.g., − − − 6 = −(−(−6))). Associa- tivity is mute with certain operators, including addition (e.g., 1+ 3+ 2 = (1+ 3) + 2 = 1+ (3+ 2) = 6), but significant with others, including sub- traction and unary minus. Theoretically addition associates either left or right with the same result. However, when addition over floating-point numbers is implemented in a computer system, associativity is significant because left and right associativity can lead to different results. Thus, the grammar is still ambiguous for the sentences 1+ 3 + 2 and 6− 3− 2, al- though the former does not cause problem because both parses result in the same interpretation.

CO NF

ID EN

TI AL

D RA

FT62 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSAssociativity Consider the following updated grammar which addresses precedence and associativity.

(r1) <expr> ::= <expr> + <term> (r2) <expr> ::= <expr> − <term> (r3) <expr> ::= <term> (r4) <term> ::= <term> ⋆ <factor> (r5) <term> ::= <term> / <factor> (r6) <term> ::= <factor> (r7) <factor> ::= (<expr>) (r8) <factor> ::= <number>

In disambiguating the grammar for associativity we follow the same theme as above (i.e., obviate multiple parse trees by adding another level of indirection by introducing a new non-terminal). If we want an opera- tor to be left-associative then we write the production rule for that opera- tor in a left-recursive manner (i.e., left-recursion leads to left-associativity). Similarly, if we want an operator to be right-associative then we write the production rule for that operator in a right-recursive manner (i.e., right- recursion results in right-associativity). Since subtraction is a left-associa- tive operator we write the production rule as <expr> ::= <expr>−<term> (i.e., left-recursive) rather than <expr> ::= <term> − <expr> (i.e., right- recursive). The same holds for division. Since addition and multiplication are non-associative we write the production rules dealing with those op- erators in a left-recursive manner for purposes for consistency. Therefore, the final non-ambiguous grammar is shown above.

More Examples of Ambiguity: A Word of Caution

It is possible for a production rule (e.g., <expr> ::= <term>−<term> in a grammar of the C programming language) to generate sentences which have only one parse tree but which can still lead to different interpreta- tions. For instance, we can generate the sentence (10)− 2 interpreted as a subtraction or the sentence (int)− 3.5 interpreted as a type cast.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 63

x

<stmt>

if <cond> <stmt> <stmt>else

(a < 2)

if <cond> <stmt>

(b > 3)

y

y

<stmt>

if <cond> <stmt>

(a < 2)

if <cond> <stmt>

(b > 3) x

else <stmt>

Figure 2.9: Parse trees for the sentence if (a < 2) if (b > 3) x else y. (left) Parse tree for an if-(if)-else construction. (right) Parse tree for an if-(if-else) con- struction.

Classical Example of Grammar Ambiguity: the Dangling else Problem

The dangling else problem is another classical example of grammar am- biguity in programming languages. The problem can be described as fol- lows. In the absence of curly braces for disambiguation, when we have an if-else statement such as if expr1 if expr2 stmt1 else stmt2, the if to which the else is associated is ambiguous. In other words, without a semantic rule, the statement can be interpreted in the following two ways

1 i f expr1 2 i f expr2 3 stmt1 4 else 5 stmt2

1 i f expr1 2 i f expr2 3 stmt1 4 else 5 stmt2

Indentation is used to indicate to which if the else is intended to be as- sociated, though as we know in free-form languages indentation has no bearing on program semantics. In C, the semantic rule is that an else associates with the closest unmatched if and, therefore, the first interpre- tation above would be used in C.

Consider the following grammar for generating sentences involving if-else.

CO NF

ID EN

TI AL

D RA

FT64 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS<stmt> ::= if <cond> <stmt> <stmt> ::= if <cond> <stmt> else <stmt>

Using this grammar we can generate the following sentence (save for the comment):

1 i f (a < 2) 2 i f (b > 3) 3 x

4 else /* a s s o c i a t e s with which i f above ? */ 5 y

for which we can construct two parse trees (see Fig. 2.9) proving that the grammar is ambiguous. Again, since we do not have practical (i.e., easily implementable) formal methods for modeling semantics, we need to re- instrument the syntax (i.e., grammar) to imply the desired semantics. We can do that by disambiguating this grammar so that it is capable of gen- erating if sentences which can only be parsed (i.e., interpreted) to imply that any else associates with the nearest unmatched if (i.e., parse trees of the form of that on the right side of Fig. 2.9). We leave it as an exercise to develop an unambiguous grammar to solve the dangling else problem (see Exercise 2.4.26).

2.4.6 Extended Backus-Naur Form (EBNF)

We have already seen the extension to BNF for alternation. EBNF adds the following syntactic extensions to BNF.

• | means ‘alternation’

• [x] means ‘x is optional’

• {x}∗ means ‘0 or more of x’

• {x}+ means ‘1 or more of x’

• {x}∗(c) means ‘0 or more of x separated by cs’

• {x}+(c) means ‘1 or more of x separated by cs’

Consider the following context-free grammar defined in BNF (from [Lou02]:

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 65(r1) <expr> ::= ( <list> ) (r1) <expr> ::= a (r3) <list> ::= <expr> (r4) <list> ::= <expr><list>

which can be used to derive the following sentences: a, ((a)), and ( ((a)) a). We can re-express this grammar in EBNF using alternation as

(r1) <expr> ::= ( <list> ) | a (r2) <list> ::= <expr>|<expr><list>

We can express r2 more concisely using the optional syntax [Lou02].

(r1) <expr> ::= ( <list> ) | a (r2) <list> ::= <expr> [<list>]

As another example consider the following context-free grammar defined in BNF

(r1) <term> ::= <factor> + <factor> (r2) <factor> ::= <term>

which can be re-written in EBNF as a single rule

(r1) <term> ::= <factor> + <factor> + <factor>∗

Again, these extensions are purely syntactic sugar and used for ease of grammar definition. Any context-free grammar defined in EBNF can be expressed in BNF.

In summary, a context-free language (which is a formal language) is de- fined or generated by a context-free grammar (which is a formal grammar) and recognized by a pushdown automaton (which is a model of computa- tion). See the third row of Table 2.4.

2.4.7 Conceptual Exercises for Section 2.4

Exercise 2.4.1: Consider the following context-free grammar in EBNF from [FWH01]:

CO NF

ID EN

TI AL

D RA

FT66 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS<list> ::= ( {<datum>}⋆ ) <dotted-datum> ::= (<datum>+. <datum> )

<vector> ::= # ({<datum>}⋆ ) <datum> ::= <number>|<symbol>|<boolean>|<string>|

<list>|<dotted-datum>|<vector> <boolean> ::= #t | #f

a) Give a derivation of the sentence (#t (foo . ()) 3) from this context-free grammar in EBNF.

b) Give a derivation of the sentence (a “mixed′′ #(bag (of . data))) from this context-free grammar in EBNF proving it is a <datum>.

c) Is the string (a . b . c) a sentence? If so, prove it is a sentence and give a derivation for it. Otherwise, indicate why it is not a sentence.

d) Rewrite this context-free grammar in BNF.

Exercise 2.4.2: Define a context-free grammar in BNF for the language of Exercise 2.3.4.

Exercise 2.4.3: Define a context-free grammar in EBNF for the language of Exercise 2.3.4.

Exercise 2.4.4: Define a regular grammar which matches any phrase of ex- actly three words separated by white space.

Exercise 2.4.5: (true / false) Ambiguity in a grammar poses no problem for language generation.

Exercise 2.4.6: Describe in English, as specificly as possible, the language defined by the context-free grammar in BNF

<text> ::= ab <text> ::= ba <text> ::= ab <text> <text> ::= a <text> b <text> ::= a <text> b <text> <text> ::= ba <text> <text> ::= b <text> a <text> ::= b <text> a <text>

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 67where <text> is a non-terminal, and a and b are terminals. Exercise 2.4.7: Prove that the grammar of Exercise 2.4.6 is ambiguous.

Exercise 2.4.8: Prove that the following context-free grammar in BNF is am- biguous, where <E > and <L> are non-terminals and a, (, and ) are terminals, and <E> is the start symbol.

<E> ::= ( <L> ) <E> ::= a <L> ::= <E> <L> ::= <E><L> <L> ::= <E><E>

Exercise 2.4.9: Re-write the grammar from the previous problem in EBNF.

Exercise 2.4.10: Consider the following context-free grammar defined in EBNF (from [Lou02]), where < expr > and < list > are non-terminals, a, (, and ) are terminals, and <expr> is the start symbol:

<expr> ::= ( <list> ) | a <list> ::= <expr> [<list>] <list> ::= <expr> [<expr>]

Prove that this grammar is ambiguous.

Exercise 2.4.11: [Lou02] State whether or not the following context-free grammar defined in EBNF, where <expr> and <list> are non-terminals, a, (, and ) are terminals, and <expr> is the start symbol, is ambiguous. If it is ambiguous, prove that it is ambiguous.

<expr> ::= ( <list> ) | a <list> ::= <expr> [<list>]

Exercise 2.4.12: Consider the following context-free grammar in BNF, where <E> and <T> are non-terminals and +, ⋆, and id are terminals.

CO NF

ID EN

TI AL

D RA

FT68 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS<E> ::= <E> + <E> <E> ::= <T> <T> ::= <E> <T> ::= <T> ⋆ <T> <T> ::= id

a) Prove that this grammar is ambiguous.

b) Modify this grammar so that it is unambiguous, but still generates the exact same language as the ambiguous version.

c) Define an unambiguous version of the above grammar containing only two non-terminals.

Exercise 2.4.13: Express the regular grammar .⋆hw.⋆ as a context-free gram- mar.

Exercise 2.4.14: Define a context-free grammar for the language consisting of strings which have n copies of the letter a followed by the same number of copies of the letter b, where n > 0. For example, the strings ab, aaaabbbb, and aaaaaaaabbbbbbbb are sentences in the language, but the strings a, abb, ba, and aaabb are not. Is this language regular or context-free? Why?

Exercise 2.4.15: Define a context-free grammar in BNF capable of generating only binary numbers which contain the same number of 0s and 1s. For instance, the strings 01, 10, 0110, 1010, 011000100111, and 000001111011 are sentences in the language, while the strings 0, 1, 00, 11, 1111000, 01100010011, and 00000111011 are not. The empty string is not in this language. State whether or not your grammar is ambiguous, and if am- biguous, prove it is ambiguous.

Exercise 2.4.16: Solve Exercise 2.4.15 with an unambiguous grammar.

Exercise 2.4.17: Consider the language given by ‘strings of 0s and 1s of the form xx,’ where the first half of the string is identical to the second half. For instance, the strings 0000, 0101, 11, 010010, 101101, 11111111, and 1010 are sentences in the language, but not the strings 0110 and 1100 are not. Is this language regular (i.e., can be defined by a regular grammar)? If so, give a regular grammar. If not, state why not.

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 69Exercise 2.4.18: Define a context-free grammar for the language of the pre- vious problem.

Exercise 2.4.19: Define an unambiguous context-free grammar in BNF capable of generating only palindromes of binary numbers. A palindrome is a string which reads the same forward as backward. For example, the strings 0, 1, 00, 11, 101, and 100101001 are palindromes while the strings 10, 01, and 10101010 are not. The empty string is not in this language.

Exercise 2.4.20: Matching syntactic entities (e.g., parentheses, brackets, or braces) is an important aspect of many programming languages.

Define a context-free grammar in BNF capable of generating only balanced strings of (nested or flat) matched parentheses. The empty string is not in this language.

For instance, the strings (), ()(), (()), (()())(), and ((()())()) are sentences in this language, while the strings )(, )(), )()(, (()(), ())((, and ((()()) are not. Note that not all strings with the same number of open and close paren- theses are in this language (e.g., the strings )( and )()( are not sentences in this language).

State whether your grammar is ambiguous or not and if it is ambiguous, prove that it is ambiguous.

Exercise 2.4.21: Define an unambiguous context-free grammar in BNF for the language of previous problem.

Exercise 2.4.22: The following context-free grammar generates declarations for a single identifier:

CO NF

ID EN

TI AL

D RA

FT70 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS<stmt> ::= declare <id> <option list> <id> ::= any string beginning with an alphabetic character

other than declare, real, complex, fixed, floating, single, double, binary, and decimal

<option list> ::= <option> | <option list> <option> <option> ::= <mode> | <scale> | <precision> | <base> <mode> ::= real | complex <scale> ::= fixed | floating

<precision> ::= single | double <base> ::= binary | decimal

This grammar permits redundant and/or contradictory declarations such as:

declare zippy real fixed real floating

Revise the grammar so that it forbids such obviously incorrect declara- tions. Define a context-free grammar for legal declarations, where legal means each option appears at most once. In other words, there can be at most one mode, one scale, one precision, and one base. The options can still can appear in any order. For example, the following are legal declara- tions:

declare zippy floating double

declare zippy double floating

It is also legal to not have anything specified for one or more (or all) of the options. For instance, just

declare zippy

is legal.

Exercise 2.4.23: Prove that the following context-free grammar in BNF is am- biguous.

<stmt> ::= if <expr> <stmt> <stmt> ::= if <expr> <stmt> else <stmt> <stmt> ::= s <expr> ::= c

CO NF

ID EN

TI AL

D RA

FT2.4. CONTEXT-FREE LANGUAGES AND GRAMMARS 71Exercise 2.4.24: Re-write the grammar from the previous problem in EBNF. Exercise 2.4.25: The following grammar for if-else is proposed to rem- edy the dangling else ambiguity:

<stmt> ::= if <expr> <stmt>|<matched stmt> <matched stmt> ::= if <expr> <matched stmt> else <stmt> <matched stmt> ::= <other>

where the non-terminal <other> generates some non-if statement such as a print statement.

Prove that this grammar is still ambiguous.

Exercise 2.4.26: Define an unambiguous grammar for the dangling else problem without changing the language (see § 2.4.5.

Exercise 2.4.27: List all right-associative operators in C.

Exercise 2.4.28: Recall that, surprisingly enough, the abilities of program- mers have had little influence on programming language design and im- plementation historically despite the fact that programmers are the pri- mary users of programming languages! For instance, the ability to nest comments is quite helpful when a programmer desires to completely com- ment out a section of code which may already contain a comment. How- ever, the designers of C decided to forbid nesting comments. That is, com- ments cannot nest in C. Therefore, the following code is not syntactically valid in C:

1 /* the fol lowing funct ion f ( ) has an e r r o r ; 2 I ' l l j u s t comment i t out f o r now . 3

4 void f ( ) { 5

6 /* an i n t e g e r x */ 7 i n t x ; 8

9 . . . 10 } 11

12 */

CO NF

ID EN

TI AL

D RA

FT72 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSWhy did the designers of C decide to forbid nesting comments? We are looking for a formal technical answer here. Be specific and clear in your answer.

Exercise 2.4.29: Give a specific example of semantics in programming lan- guages?

Exercise 2.4.30: List all possible interpretations for the English sentence ‘I shot the man on the mountain with the camera.’ Diagram the sentence for each interpretation.

Exercise 2.4.31: List all possible interpretation for the English sentence ‘Eats shoots and leaves.’ Diagram the sentence for each interpretation.

Exercise 2.4.32: If we can address the absence of practical (i.e., easily im- plementable) formal methods for defining semantics by modeling seman- tics using formal methods intended for defining syntax (i.e., disambiguat- ing grammars and adding additional production rules to force the in- tended interpretation), why do so many languages, including C, still use an ambiguous grammar?

Exercise 2.4.33: (true / false) A language whose sentences are sets from an infinite universe of items cannot be defined using a context-free grammar. Explain your answer.

Exercise 2.4.34: (true / false) A language whose sentences are sets from a finite universe of items can be defined using a context-free grammar. Explain your answer.

Exercise 2.4.35: Show that the following context-free grammar does not sat- isfy the rules of predictive parsing.

<stmt> ::= if <stmt> [else <stmt>]

2.5 Context-Sensitivity

Context-free grammars, by definition, are not intended to represent con- text in language. A simple example of context-sensitivity in English is: the first letter of a sentence must be capitalized (e.g., The sky is blue. Blue is a color.). A context-sensitive grammar for this property of English is

CO NF

ID EN

TI AL

D RA

FT2.5. CONTEXT-SENSITIVITY 73<beginning><article> ::= A | An | The <article> ::= a | an | the

Notice that in a context-sensitive grammar the left hand side of a produc- tion rule is not limited to one non-terminal as is the case in context-free grammars. In the example above the non-terminal <beginning> provides the context.

Context and semantics are often confused. Recall, semantics deals with the meaning of a sentence. Context can be used to validate or decipher the meaning of a sentence and, therefore, has little to do with syntax. Context can be used in two ways:

• Determine semantic validity: A classical example of context-̄sensitivity in programming languages is: a variable must be declared before it is used. Therefore, while the following C pro- gram, where the variable y is not declared, is syntactically valid, context reveals that it is not semantically valid because y is never declared.

1 main ( ) { 2 i n t x ; 3 y = 1 ; 4 }

Even if all variables are declared we may still require context to iden- tify type mismatches. For instance, consider the following C++ pro- gram.

1 main ( ) { 2 i n t x ; 3 bool y ; 4

5 x = 1 ; 6 y = false ; 7 x = y ; 8 }

Again, while this program is syntactically correct, it is not semanti- cally correct because of the assignment of the value of a variable of one type to a variable of another, different type. We need methods of

CO NF

ID EN

TI AL

D RA

FT74 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSstatic semantics (i.e., before run-time) to address this problem. We can generate semantically invalid programs from a context-free grammar because the production rules of a context-free grammar always apply, regardless of the context in which a string appears; hence, they are called context-free.

• To disambiguate semantic validity: Another example of context-sen- sitivity in programming languages is the ⋆ operator in C. Its meaning is dependent upon the context in which it is used. It is used as the multiplication and pointer dereferencing operators, as well as in the declaration of pointer types. For instance, without context the seman- tics of the expression x ⋆ y are ambiguous. If we see the declarations int x = 1, y = 2; immediately above this expression, the meaning of the ⋆ is multiplication. However, if we see typedef int x immedi- ately above this expression, the meaning of the ⋆ is the declaration of a pointer to an int.

C is a context-sensitive language whose parser is implemented with an ambiguous, context-free grammar.

One approach to modeling context in programming languages is to use more expressive and powerful grammars such as context-sensitive gram- mars. While C and Scheme are context-sensitive languages, the parser for them is implemented using a context-free grammar. While context- free grammars lend themselves to natural implementation as parsers, context-sensitive grammars do not and, therefore, are not very helpful in parser implementation (or in computer science in general). While context-sensitive grammars are beyond the scope of this book, we remark briefly that context-sensitive grammars are capable of modeling context in a purely syntactic way. With a context-sensitive grammar for C, it is impos- sible to generate a program referencing an undeclared variable. Moreover, such a program would be syntactically incorrect.

There are not many formalisms for dealing with semantics and con- text in programming languages. An impractical, brute-force approach is to push as much semantics and context as possible into a context-free gram- mar. The idea is to include additional production rules to help force the syntax to imply the semantics. This is the approach used with operator precedence and associativity in the previous section. Applying this idea

CO NF

ID EN

TI AL

D RA

FT2.5. CONTEXT-SENSITIVITY 75to more sophisticated concepts, such as requiring declaration of variables, is often impossible. Though even if possible, it is unreasonable, imprac- tical, or intractable. This approach involves designing the context-free production rules in such a way that they cannot generate a semantically invalid program. For instance, determining whether a collection of items is a set requires context (e.g., while determining if an element disquali- fies the collection from being a set, we need to examine the other items in collection, namely the context). However, if the universe from which the items in the collection are drawn is finite, then we can simply enu- merate out all possible sets from that universe. That enumeration is not only a context-free grammar, but also a regular grammar. This approach, however, involves an extraordinary large number of production rules and, thus, a larger, slower parser. We leave as an exercise defining the lan- guage for capitalizing the first letter of English sentences, defined above with a context-sensitive grammar, using a context-free grammar (see Exer- cise 2.5.4).

Attribute grammars are a practical formalism contributed by Donald Knuth which can be used. Attribute grammars are context-free grammars annotated/decorated with semantics rules and checks.

2.5.1 Conceptual Exercises for Section 2.5

Exercise 2.5.1: Give an example of a property in programming languages (other than any of those given above) that is context-sensitive or, in other words, an example of something that is not context-free.

Exercise 2.5.2: (choose one in each group) C++ is a (context-free or -sensi- tive) language, but is implemented with a (context-free or -sensitive) gram- mar.

Exercise 2.5.3: A context-sensitive grammar can express context which a context-free grammar cannot. State what a context-free grammar can express that a regular grammar cannot.

Exercise 2.5.4: We said in this section that sometimes we can push context into a context-free grammar (often by adding additional production rules) even though a context-free grammar has no provisions for representing

CO NF

ID EN

TI AL

D RA

FT76 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARScontext. Express the context-sensitive grammar given in this section enforc- ing the capitalization of the first character of an English sentence using a context-free grammar.

Exercise 2.5.5: Define a context-free grammar for the language whose sen- tences correspond to sets of the elements a, b, and c. For instance, the sentences {a}, {a, b}, {a, b, c} are in the language, but the sentences {a, a}, {b, a, b}, and {a, b, c, a} are not.

Exercise 2.5.6: (true / false) Every regular language is a context-sensitive lan- guage.

Exercise 2.5.7: Complete the blank on the left with the name of a formal language and the blank on the right with the name of a formal grammar.

C is a language implemented with a grammar.

2.6 Chomsky Hierarchy of Syntax

While this is not a textbook on automata theory or the theory of computa- tion, we make some brief remarks on the Chomsky Hierarchy which is a con- tainment hierarchy of formal grammars. The following discussion is also a summary of syntax. In the following A and B are single non-terminals; α, β, and γ are strings of terminals and non-terminals; and a and b are terminals.

• Regular Grammar (G3, Type-3, linear, one-dimensional): production rules are of the form

A ::= a | aB (i.e., a one-for-one replacement of non-terminal for non-terminal in N )

• Context-free Grammar (G2, Type-2, two-dimensional): production rules are of the form

A ::= γ (i.e., replace A with γ everywhere)

• Context-sensitive Grammar (G1, Type-1, three dimensional): produc- tion rules are of the form

CO NF

ID EN

TI AL

D RA

FT2.6. CHOMSKY HIERARCHY OF SYNTAX 77

Re cu

rsiv ely-E

numerable Lang/Gram

Turing Machine

Figure 2.10: The Chomsky Hierarchy.

αAβ ::= αγβ (i.e., replace A with γ in the context of α . . .β)

The strings α and β may be empty in the production rules of a context- sensitive grammar, but γ must be non-empty. Moreover, the rule S ::= ϵ is also permitted as long as S does not appear on the right-hand side of any rule.

• Unrestricted Grammar (G0, Type-0): production rules are of the form

α ::= β (i.e., no restrictions on replacements)

The production rules in a regular grammar are restricted to a single non-terminal on the left- and right-hand sides. The non-terminal on the right-hand side may be preceded or followed, but not both, by a single terminal. Moreover, the rule S ::= ϵ is also permitted as long as S does not appear on the right-hand side of any rule.

Here, G0 ⊃ G1 ⊃ G2 ⊃ G3. However, while generally regarded as a hi- erarchy, the Chomsky Hierarchy is not formally a hierarchy. Specifically,

CO NF

ID EN

TI AL

D RA

FT78 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSTable 2.4: Summary of formal languages and grammars, and models of computation. (defined/generated by) (recognized by) (constraints on)

Type Formal Language Formal Grammar Automaton Production Rules (model of computation)

Type-3 regular language regular grammar finite state automaton A ::= a | aB Type-2 context-free language context-free grammar pushdown automaton A ::= γ Type-1 context-sensitive language context-sensitive grammar linear-bounded automaton αAβ ::= αγβ Type-0 recursively-enumerable language unrestricted grammar Turing machine α ::= β

while every regular language is context-free, not every context-free lan- guage is a context-sensitive. Only those context-free languages not con- taining the empty string are context-sensitive. Every context-sensitive lan- guage is recursive, and every recursive language is recursively enumer- able. Also note that these containment relationships are all proper inclu- sions which means that there are recursively-enumerable languages which are not context-sensitive, context-sensitive languages (e.g., C, Scheme) which are not context-free, and context-free languages (e.g., language of balanced, matched parentheses) which are not regular. Note that recursive languages, which differ from recursively-enumerable languages, are not in the hierarchy. The former can be decided by an always-halting Turing machine.

Table 2.4 summarizes the progressive classes of formal grammars in the Chomsky Hierarchy. Specifically, it summarizes each of the four types of formal grammars in the Chomsky Hierarchy, the class of formal language each generates, the type of automaton which recognizes it, and the con- straints on its production rules. While most programming languages are context-sensitive (because variables often have to be declared before used), context-free grammars are the theoretical basis for the syntax of most pro- gramming languages (and the syntactic component of an implementation of most programming languages). The production rules of context-free grammars are restricted to have a single non-terminal on the left hand side and non-terminals and terminals on the right hand side. Also notice that the data structure we used in § 2.4.3 to recognize sentences in a bottom-up fashion in a context-free language is a stack (whose model of computa- tion is a non-deterministic pushdown automaton). Similarly, a top-down parser uses the internal run-time stack of activation records for function calls, one per non-terminal, to recognize sentences in a context-free lan- guage. Recall, context-free grammars are expressive enough to model lan- guages which require memory (i.e., a stack) to recognize sentences therein. The intangible memory implicit in a context-free grammar in a rule such as

CO NF

ID EN

TI AL

D RA

FT2.7. SEMANTICS 79<S> ::= ( <S> ) | () which only generates strings with the same number of open as close parentheses is made explicit in the tangible stack used in the parser for sentences defined by that grammar, even if that stack is built-in as is the case with a top-down parser which uses the run-time stack. Pro- duction rules with more than one non-terminal on the left hand side are permitted in context-sensitive grammars. Phrase structured (unrestricted) grammars include all formal grammars.

2.6.1 Conceptual Exercises for Section 2.6

Exercise 2.6.1: (true / false) Every regular language is also a context-free lan- guage.

Exercise 2.6.2: (true / false) Every context-free language is also a regular lan- guage.

Exercise 2.6.3: (true / false) Every regular language can be specified using a context-free grammar.

2.7 Semantics

Consider the English sentence ‘I chose wisely’ which is in the past tense. If we replace the word ‘chose’ with ‘chos’ the sentence has a syntax error because the ‘chos’ is not lexically valid. However, if we replace the word ‘chose’ with ‘choose,’ the sentence remains syntactically and semantically correct, but in the present tense. The semantics of the sentence are valid, but unintended. Such an error, like a run-time error, is difficult to identify.

2.7.1 Static Semantics

For example, declaring a variable before its usage or type compatibility. Attribute grammars can be used for static semantics.

2.7.2 Dynamic Semantics

The three approach to dynamic semantics (in order from least to most for- mal): operational, denotational, and axiomatic.

CO NF

ID EN

TI AL

D RA

FT80 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARS size of look−ahead

Ambiguous

LR(1)

LR(k)

SLR

LL(1)

LL(k)

Unambiguous Grammars Grammars

LR(0) simple

fastest

Look−Ahead

LALR(1)

LL top−down parser

LR bottom−up parser

Leftmost derivation Left to right

Rightmost derivation Left to right

LL(0)

Figure 2.11: Hierarchy of parser classes.

We refer readers to [Seb10, § 3.5] and [Lou02, §§ 13.2-13.4] for a detailed discussion of dynamic semantics (operational, denotation, and axiomatic).

2.8 Key Terms

Abstraction, agile methods, ambiguity, ambiguous grammar, assembler, assembly code, associativity, attribute grammars, back end, Backus-Naur Form (BNF) bottom-up parsing, Chomsky hierarchy, compilation, com- piler, context-free grammar, context-free language, context-sensitive gram- mar, context-sensitive language, derivation, dynamic semantics, Extended Backus-Naur Form (EBNF), extensionally, extreme programming, fixed- format languages, formal grammar, formal language, fourth-generation languages, free-format languages, front end, grammar, handle, intension- ally, interpretation, interpreter, just-in-time compilation, just-in-time im- plementation system, keywords, language implementation system, lex- eme, lexer, lexical analysis, lexical analyzer, linker, linear-bounded au- tomata, machine code, Moore’s Law, non-terminal, parser, parse tree, pars- ing, precedence, processor, production rule, programming language im- plementation system, pushdown automata, recursive-descent parsing, recursively-enumerable languages, recursive languages, reduce-reduce

CO NF

ID EN

TI AL

D RA

FT2.9. THEMATIC TAKE-AWAYS 81conflict, regular expression regular grammar, regular language, reserved words, scanner, semantic analysis, semantics, sentence, sentential form, shift-reduce parsing, software crisis, static semantics, structured program- ming, syntactic analysis, syntactic analyzer, syntactic sugar, syntax, termi- nal, token, top-down parsing, Turing machine, virtual machine.

2.9 Thematic Take-Aways

• One of the seminal contributions to computer science was the discov- ery that grammars were language recognition devices as well as gen- erative devices for language.

• Semantic properties, including precedence and associativity, can be modeled in context-free grammar.

2.10 Chapter Summary

Regular grammars can capture the rules for a valid identifier in C. Context- free grammars can capture the rules for a valid mathematical expression in C. Neither can capture the fact that a variable must be declared before it is used. We can push semantic properties like precedence and associativity into the grammar.

If a sentence from a language has more than one parse tree, then the grammar for the language is ambiguous.

Constructs and Capabilities

Since the concept of memory is absent from a regular grammar and, thus, a regular grammar is unable to define the language of balanced, matched parentheses which require a stack to recognize, we can think of a regu- lar grammar as a memory-free grammar. Since the concept of memory is present in a context-free grammar, we can think of a context-free grammar as a memory-sensitive grammar. Also note that the use of words -free and -sensitive in the names of formal grammars is inconsistent. The context- free in context-free grammar indicates what such a grammar is unable to model, namely context. However, the context-sensitive in context-sensitive

CO NF

ID EN

TI AL

D RA

FT82 CHAPTER 2. FORMAL LANGUAGES AND GRAMMARSgrammar indicates, on the other hand, what such a grammar can model, namely context.

2.11 Bibliographic Notes

Donald Knuth is the recipient of the 1977 ACM A.M. Turing Award for con- tributions to computing including programming languages and attribute grammars.

CO NF

ID EN

TI AL

D RA

FT

Chapter 3

Scanning and Parsing

Author: Saverio Perugini Copyright © 2018 by Saverio Perugini ALL RIGHTS RESERVED

3.1 Chapter Objectives

• Establish an understanding of lexical analysis or scanning.

• Establish an understanding of syntactic analysis or parsing.

• Differentiate between scanning and parsing.

• Introduce bottom-up, shift-reduce parsing.

• Introduce top-down, recursive-descent parsing.

• Illustrate the natural relationship between a context-free grammar and a recursive-descent parser.

• Introduce flex and bison.

3.2 Lexical Analysis or Scanning

The lowest-level syntactic units of a string are called lexemes (e.g., +, main, int, x, h, hw, hww). The first step of lexical analysis is to parcel the char- acters (from the alphabet Σ) of the string into lexemes. Lexemes can be

83

CO NF

ID EN

TI AL

D RA

FT84 CHAPTER 3. SCANNING AND PARSINGTable 3.1: Parceling lexemes into tokens in the sentence int i = 20;. Lexeme Token int reserved word i identifier = special symbol 20 literal ; special symbol

list of lexemes)

source program (regular grammar) list of

tokens

(context−free grammar) parserscanner

parse tree (string or

Figure 3.1: Simplified view of scanning and parsing: the front end.

formally described by regular grammars. Lexical analysis is the process of determining if a string (typically of a programming language) is lexically valid (i.e., if all of the lexemes of the string are valid).

All programming languages must specify how the lexemes of the lan- guage are delimited. There are a variety of ways that languages determine where lexemes begin and end. Most programming languages, but not all, delimit lexemes using whitespace (i.e., spaces and tabs).

Free-format languages are languages where formatting has no effect on program structure, of course other than use of some delimiter to determine where lexemes begin and end.

Consider parceling the characters from the string int i = 20 ; (in C) into lexemes. Using whitespace as the delimiter, the lexemes are int, i, =, 20, and ;. The sentences int i=20;, int i = 20;, and int i = 20 ; have the exact same set of lexemes: int, i, =, 20, and ;. C, C++, and Java are free-format languages.

Some languages impose restrictions on formatting making the language not entirely free form. Languages where formatting has effect on program structure and lexemes must occur in predetermined areas are called fixed- format languages. Early versions of FORTRAN were fixed-format. For in- stance (from [Lou02]), DO 99 I = 1.10 in FORTRAN (which is equiv- alent to DO99I = 1.10; in C) is different from DO 99 I = 1, 10 in FORTRAN (which is equivalent to for (I=1; I<=10; I++) in C) even though the lexemes DO, 99, I, and = appear in the same format. This is be- cause FORTRAN ignores all whitespace and has no reserved words. Other

CO NF

ID EN

TI AL

D RA

FT3.2. LEXICAL ANALYSIS OR SCANNING 85(string) id1 = id2 * id3 + id4

=

+

scanner

parser

id1

*

id2 id3

id4

parse tree

list of tokens

"n = x * y + z"source program

Figure 3.2: More detailed view of scanning and parsing.

languages, including Miranda, Haskell, Python, and occam use layout- based syntactic grouping (i.e., indentation) which is inspired by its use in the experimental, and highly influential, family of languages ISWIM, de- scribed by Peter J. Landin in 1966 [Mil66].

Once we have a list of lexemes we must determine if each is valid. This can be done by checking them against the valid lexemes of the language, or by running them through a finite state automata which can recognize the lexemes of the language. Most programming languages have reserved words which cannot be used as a name (e.g., int in C). Reserved words are not the same as keywords which are only special in certain contexts (e.g., main in C).

As each lexeme is determined to be valid each is abstracted into a to- ken because the individual lexemes are superfluous for the next phase— syntactic analysis. Lexemes are partitioned into tokens which are categories of lexemes. Table 3.1 shows how the five lexemes in the string int i = 20 ; fall into four token categories. The next phase in verifying the valid- ity of a sentence is determining if the tokens are structured properly. The actual lexemes are not important in verifying the structure of a candidate sentence. The details of a lexeme (e.g., i) are abstracted in its token (e.g.,

CO NF

ID EN

TI AL

D RA

FT86 CHAPTER 3. SCANNING AND PARSING<identifier>). For the string to be a sentence, the order of the tokens must conform to a context-free grammar. Here we are referring to the grammar of entire expressions rather than the (regular) grammar of individual lex- emes. If a string is lexically valid, lexical analysis returns a list of tokens.

A lexical analyzer in software system which picks out the lexemes, val- idates them, and returns a list of tokens. It is often called a scanner or a lexer. Parsing validates the order of the tokens in this list and, if valid, organizes this list of tokens into a parse tree. The system that validates a program string and, if valid, converts it into a parse tree is called a front end (scanner and parser; see Fig. 3.1).

Table 2.2 depicts a possible data structure used in a scanner. We refer the reader to [Seb10, § 4.2] for a detailed, straightforward ex-

planation of implementing scanners. The theory behind scanners (i.e., regular languages and grammars, and

finite state automata) is so sound that building a scanner is a mechanical process which can be automated by a computer program and, therefore, is rarely done by hand. The software lex (or its free version flex) is a UNIX tool that takes a set of regular expressions (i.e,. a regular grammar) (in a .l file) as input and automatically generates a lexical analyzer in C which can recognize lexemes in the language defined by the grammar; each call to lex() retrieves the next token. In other words, flex is generates a scan- ner in C. The flex program is a metaprogram because processes (reads and writes) another program. We refer the reader to [Nie] for a detailed, straightforward explanation of implementing scanners using flex.

3.3 Syntactic Analysis or Parsing

Programs that process other programs, such as interpreters and compilers, are syntax directed.

need to say what a concrete and abstract representation is here Parsing is the process of determining if a string is a sentence (in some

language) and, if so, converting the concrete representation of that sen- tence into an abstract representation (e.g., parse tree) which easily facili- tates the intended subsequent processing of it. This latter representation abstracts away the details that are irrelevant to subsequent processing. Therefore, the parser (or syntactic analyzer) component of an interpreter or

CO NF

ID EN

TI AL

D RA

FT3.3. SYNTACTIC ANALYSIS OR PARSING 87Table 3.2: Top-down versus bottom-up parsing. Parser Requisite†/Preferred‡ Type of Recursion in Grammar

bottom-up, shift-reduce left-recursive‡

top-down, recursive-descent right-recursive†

compiler typically converts the source program, once syntactically vali- dated, into a parse tree or other similar, easily manipulable representation. The term syntactic analyzer is synonymous with parser.

The syntactic validation of the program and the construction of a parse tree for it can proceed in parallel. Parsing is independent of what you are going to do with the tree: interpret it or compile (or translate) it into another, typically, lower-level representation.

Syntactic analysis (or parsing) involves verifying the validity of a candi- date sentence.

Often lexical and syntactic analysis are combined into a single phase (under the umbrella of syntactic analysis) to obviate making multiple passes through the input string (e.g., the program).

3.3.1 Bottom-up, Shift-reduce Parsing

We demonstrated bottom-up parsing in § 2.4.3 where we parsed a string using the shift-reduce method. The bottom-up nature refers to starting the parse with the terminals or lexemes of the string and working our way in a bottom-up or reverse fashion back to the start symbol of the gram- mar. While parsing a string in this bottom-up fashion we can also build up a parse tree for it, if desired, by allocating nodes as we shift and set- ting pointers to already-allocated nodes in newly-created internal nodes as we reduce. (Again, we need not always build a parse tree; sometimes a traversal is enough, especially if semantic analysis or code generation phases will not follow.)

Building a shift-reduce parser is a well-studied area of computer science and rarely done by hand anymore. A parser generator is a program that accepts lexical and syntactic specifications and automatically generates a scanner and parser from them (i.e., a front end).

The program yacc (yet another compiler compiler) (or its free version bison) is a UNIX tool which takes a context-free grammar defined in EBNF

CO NF

ID EN

TI AL

D RA

FT88 CHAPTER 3. SCANNING AND PARSING(in a .y file) as input and generates a bottom-up, shift-reduce parser in C for the language it defines. Very high-level languages, including yacc (i.e., the yacc language describes a context-free grammar and actions rather than computation directly), are referred to as fourth-generation languages because three levels of language abstraction precede them (i.e., machine code, assembly language, and high-level language). The tools flex and bison together constitute a scanner/parser generator system. A parser generator is a program that takes a lexical and a syntactic specification and automatically generates a front end from them (i.e., scanner and parser).

We refer the reader to [KP84][Ch. 8], [Nie], and [Lou02, pp. 110–112] for detailed, straightforward explanations of automatically generating a shift- reduce, bottom-up parser by defining a flex and a bison specification of the grammar that defines the language of which parsed sentences are members.

3.3.2 Top-down, Recursive-descent Parsing

We demonstrated top-down, predictive parsing in § 2.4.3 when we proved the validity a sentence by building a parse tree for it in a top-down, pre- dictive fashion.

Recall, a seminal discovery in computer science is that the grammar used to define a language and generate sentences from the language can also be used to recognize/parse sentences from the language. This is clearly evident in a recursive-descent parser. A top-down parser (in contrast to a bottom-up parser) naturally follows from a context-free grammar. The code for a top-down parser completely mirrors the grammar for the lan- guage it parses. In other words, a grammar provides a design for the implementation of a top-down parser. Specifically, we construct a top- down parser as a collection of functions where each function corresponds to one non-terminal in the grammar and is responsible for recognizing the sub-language rooted at that non-terminal. The right hand side of a produc- tion rule provides a design for the definition of the function correspond- ing to the non-terminal on the left hand side of the rule. A non-terminal on the right hand side translates to a call to the function corresponding to that non-terminal in the definition of the function corresponding to the non-terminal on the left hand side. This type of parser is also called a

CO NF

ID EN

TI AL

D RA

FT3.3. SYNTACTIC ANALYSIS OR PARSING 89recursive-descent parser because a non-terminal on the left side of a rule will frequently appear on the right side. The entire parser is decomposed this way until we arrive at functions for non-terminals with no non-terminals on the right hand side of their production rules (i.e., the base case). The parser is invoked by calling the function corresponding to the start sym- bol. Hence, the term top-down parser. Sometimes a top-down parser is called a predictive parser because rather than starting with the string and working its way bottom-up to the start symbol, it predicts that the string conforms to the start symbol and if proven incorrect pursues alternative predictions. As functions are called while the string is being parsed, the run-time stack of activation records keeps track of the current state of the parse. If the stack is empty when the entire string is consumed, the string is a sentence; otherwise it is not. We refer the reader to [Seb10, § 4.3.2, § 4.4] for information on building top-down parsers.

Since a recursive-descent parser contains one function per non-terminal, the size of the parser is directly proportional to the size of the grammar (i.e., the number of production rules). We also know that the operation (i.e., the pushing and popping of functions on and off the stack) of the run- time stack in the implementation of a programming language is, while nec- essary to support functions, pure overhead and, thus, a program with less function calls runs faster than one with more. Therefore, in implementing a recursive-descent parser we desire a grammar with a minimal set of pro- duction rules. While this guideline also applies to bottom-up parsers, the striking similarity between the rules of the grammar and the functions of a recursive-descent parser make it clear.

A theme in the discussion on disambiguating grammars in § 2.4.5 is that doing so increases the number of productions rules and, thus, is in conflict with using it as the design for a recursive-descent parser. Therefore, since ambiguity only affects the determination of the semantics of a sentence, as opposed to the determination of its syntactic validity, and a parser only deals with syntactic validity, often an ambiguous grammar is used to im- plement a language. For instance, parsers for C use ambiguous grammars. Disambiguating a grammar causes its rules to get lengthy and, thus, im- practical to implement.

An ambiguous grammar is often smaller in size (i.e., number of pro- duction rules) than its unambiguous analog and, because of this, generally

CO NF

ID EN

TI AL

D RA

FT90 CHAPTER 3. SCANNING AND PARSINGleads to a smaller (in lines of code) and faster parser than an unambiguous version of it.

We refer the reader to [Seb10, § 4.3.2, § 4.4] and [Lou02, § 4.6] for a de- tailed, straightforward explanation of implementing a recursive-descent parser.

Table 3.2 provides a comparison between the top-down and bottom-up methods of parsing.

Unlike shift-reduce parsers, top-down, recursive descent parsers are of- ten written by hand. For instance, the popular gcc C compiler used to use an automatically yacc generated shift-reduce parser, but now uses a handwritten recursive-descent parser. Similarly, the clang C compiler uses a handwritten recursive-descent parser written in C++. Moreover, clang is the ‘C Language Family Front-end’ which is a unified front-end for the C family of languages (C, Objective C, C++, and Objective C++). The rationale behind the decision to use a recursive-descent parser is that it makes it easy for new developers to understand the code (i.e., simply mapping between grammar and parser).

3.4 PLY: Python Lex-Yacc

PLY is a parser generator for Python patterned after flex and bison for C. In PLY tokens are specified using regular expressions and a context- free grammar is specified using a variation of ENBF. The yacc.yacc() function is used to automatically generate the scanner/parser; it returns an object containing a parsing function. As with bison, it is up to the programmer to specify the actions performed during parsing to build an abstract syntax representation (a parse tree, in this case; § 9.6).

The following is a context-free grammar in EBNF for the language CHAMELEON (developed in Part III) through Chapter 11.

CO NF

ID EN

TI AL

D RA

FT3.4. PLY: PYTHON LEX-YACC 91<program> ::= <expression> <expression> ::= <number> | <identifier> | <function> <expression> ::= let {<identifier> = <expression>}+

in <expression> <expression> ::= let⋆ {<identifier> = <expression>}+

in <expression> <expression> ::= <primitive> ( <expression>+(,) ) <expression> ::= if <expression> <expression> else <expression> <function> ::= fun ({<identifier>}⋆(,)) <expression>

<expression> ::= (<expression> {<expression>}⋆(,)) <expression> ::= letrec {<identifier> = <function> }+

in <expression> <primitive> ::= + | - | * | inc1 | dec1 | zero? | eqv?

The following is a PLY scanner specification for the tokens in the CHAMELEON language.

1 import re 2 import sys 3 import operator 4 import ply .lex as lex 5 import ply .yacc as yacc 6 from collections import defaultdict 7

8 global_line_offset = 0 9

10 tokens = ('NUMBER' , 'PLUS' , 'WORD' , 'MINUS' , 'MULT' , 'DEC1' , 11 'INC1' , 'ZERO' , 'LPAREN' , 'RPAREN' , 'COMMA' , 12 'IDENTIFIER' , 'LET' , 'LETSTAR' , 'LETREC' , 'EQ' , 13 'IN' , 'IF' , 'ELSE' , 'FUN' , 'EQV' , 'COMMENT' ) 14

15 keywords = ('if' , 'else' , 'fun' , 'inc1' , 'dec1' , 16 'in' , 'let' , 'let*' , 'letrec' , 'zero?' , 'eqv?' ) 17

18 keyword_lookup = {'if' : 'IF' , 'else' : 'ELSE' , 'fun' : 'FUN' , 19 'inc1' : 'INC1' , 'dec1' : 'DEC1' , 'in' : 'IN' , 20 'let' : 'LET' , 'let*' : 'LETSTAR' , 21 'letrec' : 'LETREC' , 'zero?' : 'ZERO' , 22 'eqv?' : 'EQV' } 23

24 t_PLUS = r'\+' 25 t_MINUS = r'-'

CO NF

ID EN

TI AL

D RA

FT92 CHAPTER 3. SCANNING AND PARSING26 t_MULT = r'\*' 27 t_LPAREN = r'\(' 28 t_RPAREN = r'\)' 29 t_COMMA = r',' 30 t_EQ = r'=' 31 t_ignore = " \t" 32

33 def t_WORD (t ) : 34 r'[A-Za-z_][A-Za-z_0-9*?!]*'

35 pattern = re . compile ("ˆ[A-Za-z_][A-Za-z_0-9?!]*$" ) 36

37 # i f the i d e n t i f i e r i s a keyword , parse i t as such 38 i f t .value in keywords : 39 t . type = keyword_lookup[t .value ] 40 # otherwise i t might be a v a r i a b l e so check t h a t 41 e l i f pattern .match (t .value ) : 42 t . type = 'IDENTIFIER' 43 # otherwise i t s a syntax e r r o r 44 else : 45 p rin t ("Runtime error: Unknown word %s %d" % 46 (t .value [ 0 ] , t .lexer .lineno ) ) 47 sys .exit(−1) 48 return t 49

50 def t_NUMBER (t ) : 51 r'-?\d+'

52 # t r y to convert the s t r i n g to an int , f l a g overflows 53 t r y : 54 t .value = i n t (t .value ) 55 except ValueError : 56 p rin t ("Runtime error: number too large %s %d" % 57 (t .value [ 0 ] , t .lexer .lineno ) ) 58 sys .exit(−1) 59 return t 60

61 def t_COMMENT(t ) : 62 r'---.*'

63 pass 64

65 def t_newline(t ) : 66 r'\n+'

67 global global_line_offset 68 # cont inue to next l i n e 69 t .lexer .lineno += t .value .count ("\n" ) + global_line_offset 70 global_line_offset = 0 71

72 def t_error (t ) :

CO NF

ID EN

TI AL

D RA

FT3.4. PLY: PYTHON LEX-YACC 9373 p rin t ("Unrecognized token %s on line %d." % (t .value .rstrip ( ) , 74 t .lexer .lineno ) ) 75

76 lexer = lex .lex ( )

The following is a PLY parser specification for the CHAMELEON language defined by this grammar. The reader will notice that it also contains ac- tions to construct an abstract syntax tree, used later for interpretation (see Chapters 10–11).

1 import sys 2 import ply .yacc as yacc 3 from chameleonlex import * 4

5 tokens = tokens 6

7 def p_pgm_error(t ) : 8 ' ' ' programs : e r r o r ' ' ' 9 p rin t ("Syntax error: Line %d: " % (t .lexer .lineno ) )

10 sys .exit(−1) 11

12 def p_program_expr(t ) : 13 ' ' ' programs : program programs 14 | program ' ' ' 15 #do nothing 16

17 def p_line_expr(t ) : 18 ' ' ' program : express ion ' ' ' 19

20 def p_primitive_op(t ) : 21 ' ' ' express ion : p r i m i t i v e LPAREN e xpre s s ions RPAREN ' ' ' 22

23 def p_primitive(t ) : 24 ' ' ' p r i m i t i v e : PLUS 25 | MINUS 26 | INC1 27 | MULT 28 | DEC1 29 | ZERO 30 | EQV ' ' ' 31

32 def p_expression_number(t ) : 33 ' ' ' express ion : NUMBER ' ' ' 34

35 def p_expression_identifier(t ) : 36 ' ' ' express ion : IDENTIFIER ' ' '

CO NF

ID EN

TI AL

D RA

FT94 CHAPTER 3. SCANNING AND PARSING37 38 def p_expression_let(t ) : 39 ' ' ' express ion : LET le t s ta te me nt IN express ion ' ' ' 40

41 def p_expression_let_star(t ) : 42 ' ' ' express ion : LETSTAR l e t s t a r s t a t e m e n t IN express ion ' ' ' 43

44 def p_expression_let_rec(t ) : 45 ' ' ' express ion : LETREC le t re c s ta t e me nt IN express ion ' ' ' 46

47 def p_expression_condition(t ) : 48 ' ' ' express ion : IF express ion express ion ELSE express ion ' ' ' 49

50 def p_expression_function_decl(t ) : 51 ' ' ' express ion : FUN LPAREN arguments RPAREN express ion 52 | FUN LPAREN RPAREN express ion ' ' ' 53

54 def p_expression_function_call(t ) : 55 ' ' ' express ion : LPAREN express ion parameters RPAREN 56 | LPAREN express ion RPAREN ' ' ' 57

58 def p_expression_rec_func_decl(t ) : 59 ' ' ' rec func decl : FUN LPAREN arguments RPAREN express ion 60 | FUN LPAREN RPAREN express ion ' ' ' 61

62 def p_expression_list(t ) : 63 ' ' ' e xpre s s ions : express ion 64 | express ion COMMA e xpre s s ions ' ' ' 65

66 def p_let_statement(t ) : 67 ' ' ' l e t s ta te me nt : let assignment 68 | let assignment le t s ta te me nt ' ' ' 69

70 def p_letstar_statement(t ) : 71 ' ' ' l e t s t a r s t a t e m e n t : le t s tar as s ig nme nt 72 | l e t s tar as s ig nme nt l e t s t a r s t a t e m e n t ' ' ' 73

74 def p_letrec_statement(t ) : 75 ' ' ' l e t re c s ta te me nt : le t rec ass ignment 76 | le t rec ass ignment le t re c s t a te me nt ' ' ' 77

78 def p_let_assignment(t ) : 79 ' ' ' let assignment : IDENTIFIER EQ express ion ' ' ' 80

81 def p_letstar_assignment(t ) : 82 ' ' ' l e t s t ar as s ig nme nt : IDENTIFIER EQ express ion ' ' ' 83

CO NF

ID EN

TI AL

D RA

FT3.4. PLY: PYTHON LEX-YACC 9584 def p_letrec_assignment(t ) : 85 ' ' ' le t rec ass ignment : IDENTIFIER EQ rec func decl ' ' ' 86

87 def p_arguments(t ) : 88 ' ' ' arguments : IDENTIFIER 89 | IDENTIFIER COMMA arguments ' ' ' 90

91 def p_parameters(t ) : 92 ' ' ' parameters : express ion 93 | express ion COMMA parameters ' ' ' 94

95 def parse1 (s ,parser ) : 96 pattern = re . compile ("[ˆ \t]+" ) 97 i f pattern .match (s ) : 98 t r y : 99 parser .parse (s )

100 except Exception as e : 101 i f ( i s i n s t a n c e (e , chameleoninterpreter .InterpreterException) ) : 102 p rin t ( "Line %s: %s" % (e .linenumber , e .message ) ) 103 i f ( e .additional_information != None ) : 104 p rin t ("Additonal information:" ) 105 p rin t (e .additional_information) 106

107 else : 108 p rin t ("Unknown Error occured \ 109 (this is normally caused by a syntax error)" ) 110 r a i s e e 111

112

113 def main_func ( ) : 114

115 parser = yacc .yacc ( ) 116

117 while True : 118 s = "" 119 t r y : 120 # i n t e r a c t i v e REPL mode 121 i f len (sys .argv ) == 2 and sys .argv [ 1 ] == "-i" : 122 line = input ('ChAmElEoN> ' ) 123 # t e s t , non−i n t e r a c t i v e mode 124 else : 125 line = input ( ) 126

127 while line != "" : 128 s += (line + '\n' ) 129 line = input ( ) 130 global_line_offset = global_line_offset + 1

CO NF

ID EN

TI AL

D RA

FT96 CHAPTER 3. SCANNING AND PARSING131 parse1 (s ,parser ) 132

133 except : 134 # t e s t , non−i n t e r a c t i v e mode 135 i f len (sys .argv ) == 1 : 136 parse1 (s ,parser ) 137 sys .exit(−1) 138

139 main_func ( )

Notice that the action part of each pattern-action rule is empty. Thus, this parser does not build an abstract syntax tree. For a parser generator that builds an abstract syntax tree, see Listing 9.7.2 in Chapter 9 (Type Systems and Data Abstraction).

For the details of PLY, see http://www.dabeaz.com/ply/. These specifications have been tested and run in PLY 3.9. The scanner and parser generated by PLY from these specifications have been tested and run in Python 3.4.6.

3.5 SLLGEN

SLLGEN is a parser generator for Scheme (much like flex and bison for C). In SLLGEN tokens are specified using regular expressions and a context-free grammar is specified using a variation of ENBF. The sllgen:make-string-parserprocedure is used to automatically gen- erate the scanner and parser; it returns a function that takes a string and produces an abstract syntax representation (see § 9.6). The following is a scanner and parser specification for the language CHAMELEON (devel- oped in Part III) for use with SLLGEN.

1 #lang eopl 2

3 ; ; ; l e x i c a l s p e c i f i c a t i o n 4 (define tokens 5 ' ( ( white-sp 6 (whitespace) skip ) 7 (comment 8 ("---" (arbno ( not #\newline ) ) ) skip ) 9 (identifier

10 (letter (arbno ( or letter digit "?" ) ) ) symbol ) 11 (number

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 9712 (digit (arbno digit ) ) number ) ) ) 13

14 ; ; ; s y n t a c t i c s p e c i f i c a t i o n 15 (define grammar 16 ' ( ( program 17 (expr ) 18 a-program) 19

20 (expr 21 (number ) 22 literal-expr) 23

24 (expr 25 (identifier) 26 variable-expr) 27

28 (expr 29 (primitive "(" (separated-list expr "," ) ")" ) 30 prim-app-expr) 31

32 (primitive ("+" ) 33 addition-primitive) 34 (primitive ("-" ) 35 subtraction-primitive) 36 (primitive ("*" ) 37 multiplication-primitive) 38 (primitive ("inc1" ) 39 increment1-primitive) 40 (primitive ("dec1" ) 41 decrement1-primitive) ) ) 42

43 ; ; ; g e ne rate s the scanner and parser : 44 ; ; ; maps s t r i n g −> a b s t r a c t syntax 45 (define scan-parse-program 46 (sllgen :make-string-parser tokens grammar ) ) 47

48 (scan-parse-program "dec1(4)" )

For the details of SLLGEN see [FW08, Appendix B].

3.5.1 Conceptual Exercises for Chapter 3

Exercise 3.5.1: State whether it is preferable to use a left-recursive or right- recursive grammar with recursive-descent parsing and why. Explain. Be spe- cific.

CO NF

ID EN

TI AL

D RA

FT98 CHAPTER 3. SCANNING AND PARSINGExercise 3.5.2: A recursive-descent parser designed from an unambiguous grammar with excessive production rules is slow. Is it slow in speed of execution or speed of development or both? Explain.

3.5.2 Programming Exercises for Chapter 3

Exercise 3.5.3: Consider the following context-free grammar defined in EBNF (from [Lou02]):

<expr> ::= ( <list> ) | a <list> ::= <expr> [<list>]

where <expr> and <list> are non-terminals and a, (, and ) are terminals.

Does this grammar follow the rules of predictive parsing? If so, illustrate why it does. If not, illustrate that it does not.

a) Implement a recursive-descent parser in any language which accepts strings from standard input (one per line) until EOF and determines whether or not each string is in the language defined by this grammar. Thus, it might be help to think of defining this language using the fol- lowing context-free grammar in EBNF:

<input> ::= <input> <expr> \n | <expr> \n <expr> ::= (<list>) | a <list> ::= <expr> | <expr> <list>

where <input>, <expr>, and <list> are non-terminals and a, (, ), and \n are terminals. Factor your program into a scanner (lexical analyzer) and recursive- descent parser (syntactic analyzer) as shown in Fig. 3.1.

Recall a recursive-descent parser contains one procedure for each non- terminal in the grammar. For example, the following C and Python pro- cedures, respectively, each parse the non-terminal <list>:

1 bool list ( char * input ) { 2 bool valid ; 3

4 valid = expr (input ) ;

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 995 6 i f (valid && nextLexeme != '\0' ) 7 i f (nextLexeme == '(' | | nextLexeme == 'a' ) 8 valid = list (input ) ; 9

10 return valid ; 11 }

1 def l i s t ( ) : 2 global lexeme 3

4 expr ( ) 5

6 # opt ional par t 7 i f ( ( lexeme == "(" ) or (lexeme == "a" ) ) : 8 l i s t ( )

You may not assume that each lexeme will be valid and separated by exactly one space, or that each line will contain no leading or trail- ing whitespace. There are two distinct error conditions that your program must recognize. First, if a given string does not consist of valid lexemes, then respond with this message: ‘‘...’’ contains invalid lexemes and, thus, is not an expression. Sec- ond, if a given string consists of valid lexemes but it is not an expression according to the grammar, then respond with the message: ‘‘...’’ is not an expression. Note that the “invalid lexemes” message takes priority over the “not an expression” message (i.e., the “not an ex- pression” message can only be issued if the input string consists entirely of valid lexemes).

You may assume that whitespace is ignored, that no line of input will exceed 4,096 characters, that each line of input will end with a newline, and that no string will contain more than 200 lexemes.

Print only one line of output to standard output per line of input, and do not prompt for input. The following is a sample interactive session with the parser (> is simply the prompt for input and will be the empty string in your system):

> ( a)

"( a )" is an expression.

> a

"a" is an expression.

CO NF

ID EN

TI AL

D RA

FT100 CHAPTER 3. SCANNING AND PARSING> ( ( ( a a ) ) ) "( ( ( a a ) ) )" is an expression.

> ( a ) )

"( a ) )" is not an expression.

> ,(a)

",(a)" contains invalid lexemes and, thus, is not an expression.

> (( (a a ) ))

"( ( ( a a ) ) )" is an expression.

> ( a ( a ) ) )

"( a ( a ) ) )" is not an expression.

> (( a ) 1 )

"(( a ) 1 )" contains invalid lexemes and, thus, is not an expression.

> (a(a))

"( a ( a ) )" is an expression.

> ( ( a ) )

"( ( a ) )" is an expression.

> ( )

"( )" is not an expression.

> (

"(" is not an expression.

Start by reading [Seb10, § 4.3.2, § 4.4] and [Lou02, § 4.6] on recursive- descent parsing.

b) Automatically generate a shift-reduce, bottom-up parser by defining a flex and a bison specification of a parser for the language defined by this grammar. Start by reading [KP84][Ch. 8], [Nie], and [Lou02, pp. 110–112] to get familiar with flex and bison. You may assume the following code in your bison specification, though you must replace each ... with one line of code:

1 input : input expr '\n' { printf ("\"%s\" is an expression.\n" , 2 temp ) ; 3 . . . } 4 | expr '\n' { printf ("\"%s\" is an expression.\n" , temp ) ; 5 . . . } 6

7 | error '\n' { printf ("\"%s\" is not an expression.\n" , 8 temp ) ; 9 . . .

10 yyclearin ; /* d is card lookahead */ 11 yyerrok ; } 12 ; 13 /* bison s p e c i f i c a t i o n f i l e parser . y */

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 101Also write a Makefilewhich builds your parser. Your Makefilemust include target directives for every derived file produced during the com- pilation process (i.e., each program, each object file, and any other inter- mediate files produced during code generation and compilation). Make sure that each directive also lists all files on which the derived file de- pends in its dependency list. Also, your Makefile must be written to carry out only the commands necessary to bring any produced file up- to-date. Your Makefile must do just enough, but no extra, work to bring the final executable for your parser up-to-date every time make is invoked. In addition, it must have an all directive and a clean direc- tive to remove all generated files. Use variables where appropriate to improve the readability of your Makefile. Your Makefile must bring everything up-to-date, using only flex, bison, and gcc, without any warnings or errors, when make is invoked.

c) Recall that a seminal discovery in computer science was that grammars can be used as recognition devices as well as generative devices. Imple- ment a generator of sentences from the language defined by the gram- mar in this exercise as an efficient approach to test-case generation. In other words, write a program to output sentences from this language. A simple way to build your generator is to follow the theme of build- ing a recursive descent parser. In other words, develop one procedure per non-terminal where each such procedure is responsible for generat- ing sentences from the sub-language rooted at that non-terminal. You can develop this generator from your recursive descent parser by invert- ing each procedure to perform generation rather than recognition For example, the following C++ and Python procedures, respectively, each generate a sub-sentence rooted at the non-terminal <list>:

1 void list ( char *&buffer ) { 2 double randomvalue = ( double ) rand ( ) / ( double ) RAND_MAX ; 3

4 recursioncounter−−; 5

6 expr (buffer ) ; 7

8 i f (randomvalue < 0 . 5 ) 9 list (buffer ) ;

CO NF

ID EN

TI AL

D RA

FT102 CHAPTER 3. SCANNING AND PARSING1 def l i s t ( ) : 2

3 global no_tokens 4 global max_tokens 5

6 expr ( ) 7 i f (random .randint ( 0 , 1 ) == 1 and no_tokens < maxtokens) : 8 l i s t ( )

Your generator must produce sentences from the language in a random fashion. Therefore, when several alternatives exist on the right-hand side of a production rule, determine which non-terminal to follow randomly. Also, generate sentences with a random number of lexemes. To do so, each time you generate a sentence, generate a random number between the minimum number of lexemes necessary in a sentence and a maxi- mum number which keeps the generated string within character limit of the input strings to the parser from the problem. Use this random number to serve as the maximum number of lexemes in the generated sentence. Every time you encounter an optional non-terminal (i.e., one enclosed in brackets), flip a coin to determine if you should pursue that path through the grammar. Then only pursue the path if the flip indi- cates you should and if the number lexemes generated so far is less than the random number of maximum of lexemes you generated. Your gen- erator must read a positive integer given at the command-line and write that many sentences from the language to standard output, one per line.

Testing any program on various representative data sets is an important aspect of software development and this exercise will help you test your parser.

Exercise 3.5.4: Consider the following context-free grammar defined in EBNF:

<P> ::= () | (<P>) | ()(<P>) | (<P>)<P>

where <P> is a non-terminal and ( and ) are terminals.

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 103> () "()" is a sentence.

> ()()

"()()" is a sentence.

> (())

"(())" is a sentence.

> (()())()

"(()())()" is a sentence.

> ((()())())

"((()())())" is a sentence.

> (a)

"(a)" contains invalid lexemes and, thus, is not a sentence.

> )(

")(" is not a sentence.

> )()

")()" is not a sentence.

> )()(

")()(" is not a sentence.

> (()()

"(()()" is not a sentence.

> ())((

"())((" is not a sentence.

> ((()())

"((()())" is not a sentence.

Exercise 3.5.5: Consider the following context-free grammar defined in EBNF:

<expr> ::= <expr> + <expr> <expr> ::= <expr> * <expr> <expr> ::= − <expr> <expr> ::= <integer>

<integer> ::= 1 | 2 | 3 | . . . |∞

where <expr> and <integer> are non-terminals and +, *, −, and 1, 2, 3, . . . are terminals.

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

> 2+3*4

"2+3*4" is an expression.

CO NF

ID EN

TI AL

D RA

FT104 CHAPTER 3. SCANNING AND PARSING> 2+3*-4 "2+3*-4" is an expression.

> 2+3*a

"2+3*a" contains invalid lexemes and, thus, is not an expression.

> 2+*3*4

"2+*3*4" is not an expression.

d) At some point in your education, you may have encountered the con- cept of diagramming sentences. A diagram of an expression is a parse tree-like drawing representing the grammatical structure of the sentence, including parts such as the subject, verb, and object.

Complete Programming Exercise 3.5.3.a but this time build a recur- sive descent parser that produces a diagrammed version of the in- put string, which means an expression in properly parenthesized form which means that each non-terminal appearing in the input string now has parentheses around it (omitting all redundant parentheses). Print only one line of output to standard output per line of input as follows. Consider the following sample interactive session with the parser (> is simply the prompt for input and will be the empty string in your sys- tem):

Do not build a parse tree to solve this problem. Rather instrument your recursive-descent parser to construct the diagrammed sentence as fol- lows in the C and Python procedures, respectively, that each parse and diagram a sub-sentence rooted at the non-terminal <list>:

1 bool list ( char * input ) { 2 bool valid ; 3

4 printf ("(" ) ; 5

6 valid = expr (input ) ; 7 i f (valid && nextLexeme != '\0' ) 8 /* op t ional par t */ 9 i f (nextLexeme == '(' | | nextLexeme == 'a' )

10 valid = list (input ) ; 11

12 printf (")" ) ; 13

14 return valid ; 15 }

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 1051 def l i s t ( ) : 2 global lexeme 3

4 p rin t "(" 5

6 expr ( ) 7 # opt ional par t 8 i f ( ( lexeme == "(" ) or (lexeme == "a" ) ) : 9 l i s t ( )

10

11 p rin t ")"

The following is a sample interactive session with the parser diagram- mer:

> 2+3*4

"((2)+((3)*(4)))" is an expression.

> 2+3*-4

"((2)+((3)*(-(4))))" is an expression.

> 2+3*a

"2+3*a" contains invalid lexemes and, thus, is not an expression.

> 2+*3*4

"2+*3*4" is not an expression.

e) Complete Programming Exercise 3.5.5.d using flex and a bison. Write a Makefile as indicated in Programming Exercise 3.5.3.b.

Hint: Use an array implementation of a stack which contains elements of type char*. Also, use the sprintf function to convert an integer to a string. For example,

1 char * string_representation_of_an_integer = 2 malloc ( 1 0 * s izeo f ( *string_representation_of_an_integer) ) ; 3

4 /* p r i n t s the i n t e g e r 789 to 5 the s t r i n g v a r i a b l e s tr ing representat ion of an integer */ 6 sprintf (string_representation_of_an_integer , "%d" , 789) ; 7

8 /* next l i n e p r i n t s the i n t e g e r 789 to stdout */ 9 printf ("%s" , string_representation_of_an_integer) ;

f) Complete Programming Exercise 3.5.5.d, but this time build a parse tree in memory and traverse it to output the diagrammed sentence.

g) Complete Programming Exercise 3.5.5.f using flex and a bison. Write a Makefile as indicated in Programming Exercise 3.5.3.b.

CO NF

ID EN

TI AL

D RA

FT106 CHAPTER 3. SCANNING AND PARSINGExercise 3.5.6: Consider the following context-free grammar defined in EBNF (from [Nie]):

<program> ::= <program> <stmtlist> | <stmtlist> <stmtlist> ::= <stmtlist> <stmt> ; | <stmt> ;

<stmt> ::= <expr> | print <expr> <stmt> ::= <variable> = <expr> <expr> ::= <integer> | <variable> <expr> ::= − <expr> <expr> ::= <expr> + <expr> <expr> ::= <expr> − <expr> <expr> ::= <expr> * <expr> <expr> ::= <expr> / <expr> <expr> ::= <expr> ˆ <expr> <expr> ::= (<expr>)

<integer> ::= 1 | 2 | 3 | . . . |∞ <variable> ::= a | b | c | . . . | z

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

> a = 1; b = 2; print a+b;

"a = 1; b = 2; print a+b" is a program.

> x = -4; y = a + 1; print x;

"x = -4ˆ2; y = a + 1; print x;" is a program.

> x != 4

"x != 4" contains invalid lexemes and, thus, is not a program.

> x == 4;

"x == 4;" is not a program.

Exercise 3.5.7: Consider the following context-free grammar defined in EBNF (from [Nie]):

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 107<program> ::= <stmtlist> . <stmtlist> ::= <stmtlist> <stmt> | <stmt>

<stmt> ::= ; | print <expr> ; <stmt> ::= <variable> = <expr> ; <stmt> ::= while (<expr>) <stmt> <stmt> ::= if (<expr>) <stmt> [else <stmt>] <stmt> ::= { <stmt list> }

<stmt list> ::= <stmt> | <stmt list> <stmt> <expr> ::= <integer> | <var> | − <expr> <expr> ::= <expr> + <expr> | <expr> − <expr> <expr> ::= <expr> * <expr> | <expr> / <expr> <expr> ::= <expr> < <expr> | <expr> > <expr> <expr> ::= <expr> <= <expr> | <expr> >= <expr> <expr> ::= <expr> == <expr> | <expr> != <expr> <expr> ::= <expr> ˆ <expr> | (<expr>) <expr> ::= (<expr>)

<integer> ::= 1 | 2 | 3 | . . . |∞ <var> ::= a | b | c | . . . | z

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

> print 2ˆ3;.

"print 2ˆ3;." is a program.

> x=1; while (x <= 5) { print x; x=x+1; } print x;.

"x=1; while (x <= 5) { print x; x=x+1; } print x;." is a program.

> a + ˜b;

"a + ˜b" contains invalid lexemes and, thus, is not a program.

> x != 4

"x != 4" is not a program.

Exercise 3.5.8: Consider the following context-free grammar defined in EBNF:

CO NF

ID EN

TI AL

D RA

FT108 CHAPTER 3. SCANNING AND PARSING<expr> ::= <term> * <term> <expr> ::= <term> − <term> <expr> ::= <term> <term> ::= <factor> / <factor> <term> ::= <factor> + <factor> <term> ::= <factor>

<factor> ::= <identifier> | <number> | (<expr>) <identifier> ::= <alpha> <alphanumrest> | <alpha>

<alpha> ::= a | b | . . . | A | B | . . . | Z | <alphanumrest> ::= <alphanum> <alphanumrest> | <alphanum>

<alphanum> ::= <alpha> | <digit> <number> ::= <nonzerodigit> <rest> | <digit>

<rest> ::= <digit> <rest> | <digit> <nonzerodigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

> ( 6 )

"( 6 )" is an expression.

> a

"a" is an expression.

> ( i) )

"( i) )" is not an expression.

> ,a - 1

",a - 1" contains invalid lexemes and, thus, is not an expression.

> ( ( a ) )

"( ( a ) )" is an expression.

> id * index - rate * 1001 - (r - 32) * key

"id * index - rate * 1001 - (r - 32) * key" is not an expression.

> ( ( ( a ) ) )

"( ( ( a ) ) )" is an expression.

> ;10 - 10

";10 - 10" contains invalid lexemes and, thus, is not an expression.

> 01 - 10

"01 - 10" is not an expression.

> a * b - c

"a * b - c" is not an expression.

> ( ( ( a a ) ) )

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 109"( ( ( a a ) ) )" is an expression. > ( a ( a ) )

"( a ( a ) )" is not an expression.

> 2 * 3

"2 * 3" is an expression.

> ( )

"( )" is not an expression.

> 2 * rate - (((3)))

"2 * rate - (((3)))" is not an expression.

> (

"(" is not an expression.

> ( f ( t ) ) )

"( f ( t ) ) )" is not an expression.

> f!a+u

"f!a+u" contains invalid lexemes and, thus, is not an expression.

> a* "a*" is not an expression.

> \_aaa+1

"\_aaa+1" is an expression.

> \_\_\_\_aa+y

"\_\_\_\_aa+y" is an expression.

Exercise 3.5.9: Consider the following context-free grammar defined in EBNF:

<stmtlist> ::= ( {<statement>}⋆ ) <statement> ::= if <condition> <stmtlist> [else <stmtlist>] endif <statement> ::= while <condition> <stmtlist> endwhile <statement> ::= label = <expression> <statement> ::= print <expression> <statement> ::= func (<term>⋆) <stmtlist> endfunc <condition> ::= <expression> <cmpoperator> <expression>

<expression> ::= <term> {<arithoperator><term>}+ <term> ::= <label> | <digit>

<cmpoperator> ::= == | ! = | < | > <= | <= <arithoperator> ::= + | − | ∗ | /

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

Exercise 3.5.10: Consider the following context-free grammar defined in BNF (not EBNF):

CO NF

ID EN

TI AL

D RA

FT110 CHAPTER 3. SCANNING AND PARSING<expr> ::= <expr> & <expr> <expr> ::= <expr> | <expr> <expr> ::= ∼ <expr> <expr> ::= <literal>

<literal> ::= t <literal> ::= f

where t, f, |, &, and ∼ are terminals.

Complete Programming Exercise 3.5.5 (parts a–g) using this grammar sub- ject to all of the requirements given in that exercise.

The following is a sample interactive session with the parser:

> f | t & f | ˜t

"f | t & f | ˜t" is an expression.

> ˜t | t | ˜f & ˜f & t & ˜t | f

"˜t | t | ˜f & ˜f & t & ˜t | f" is an expression.

> f | t ; f | ˜t

"f | t ; f | ˜t" contains invalid lexemes and, thus, is not an expression.

> f | t & & f | ˜t

"f | t & & f | ˜t" is not an expression.

The following is a sample interactive session with the parser diagrammer:

> f | t & f | ˜t

"(((f) | ((t) & (f))) | (˜(t)))" is a diagrammed expression.

> ˜t | t | ˜f & ˜f & t & ˜t | f

"((((˜(t)) | (t)) | ((((˜(f)) & (˜(f))) & (t)) & (˜(t)))) | (f))"

is a diagrammed expression.

> f | t ; f

"f | t ; f" contains invalid lexemes and, thus, is not an expression.

> f | | t & ˜t

"f | | t & ˜t" is not an expression.

Exercise 3.5.11: Consider the following context-free grammar defined in BNF (not EBNF):

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 111<program> ::= (<declarations>, <expr>) <declarations> ::= [] <declarations> ::= [ <varlist> ]

<varlist> ::= <var> <varlist> ::= <var>, <varlist> <expr> ::= <expr> & <expr> <expr> ::= <expr> | <expr> <expr> ::= ∼ <expr> <expr> ::= <literal> <expr> ::= <var>

<literal> ::= t <literal> ::= f

<var> ::= a . . .e <var> ::= g . . .s <var> ::= u . . .z

where t, f, |, &, ∼, and a . . .e, g . . .s, and u . . .z are terminals.

Complete Programming Exercise 3.5.3 (parts a, b, and c) using this gram- mar subject to all of the requirements given in that exercise.

The following sample interactive session with the parser:

> ([], f | t & f | ˜t)

"([], f | t & f | ˜t)" is a program.

> ([], ˜t | t | ˜f & ˜f & t & ˜t | f)

"([], ˜t | t | ˜f & ˜f & t & ˜t | f)" is a program.

> ([p,q], ˜t | p | ˜e & ˜f & t & ˜q | r)

"([p,q], ˜t | p | ˜e & ˜f & t & ˜q | r)" is a program.

> ([], f | t ; f)

"([], f | t ; f)" contains invalid lexemes and, thus, is not a program.

> ([], f | | t & ˜t)

"([], f | | t & ˜t)" is not a program.

Exercise 3.5.12: Consider the following context-free grammar defined in EBNF for some simple English sentences:

CO NF

ID EN

TI AL

D RA

FT112 CHAPTER 3. SCANNING AND PARSING<sentence> ::= <subject><verb phrase><object> <subject> ::= <noun phrase>

<verb phrase> ::= <verb> | <verb> <adv> <object> ::= <noun phrase> <verb> ::= lifted | saw | found <adv> ::= quickly | carefully | brilliantly

<noun phrase> ::= [<adj phrase>] <noun> [<prep phrase>] <noun> ::= cow | alice | book

<adj phrase> ::= <adj> | <adj> <adj phrase> <adj> ::= green | lean | mean

<prep phrase> ::= <prep> <noun phrase> <prep> ::= of | at | with

For simplicity, we ignore articles, punctuation, and capitalization, includ- ing Alice’s name or the first word of the sentence, otherwise known as context.

Complete Programming Exercise 3.5.5 (parts a–g) using this grammar sub- ject to all of the requirements given in that exercise.

The following are Python and Pascal procedures, respectively, that each parse a sub-sentence rooted at the non-terminal <adj>:

1 def adj ( ) : 2 global lexeme 3 global lexeme_index 4 global num_lexemes 5 global error 6 i f ( (lexeme == "green" ) or (lexeme == "lean" ) 7 or (lexeme == "mean" ) ) : 8 getNextLexeme ( ) 9 else :

10 error = 1

1 procedure adj ; begin 2

3 ( * The g lobal v a r i a b l e ” next token ” holds the next 4 token in the input stream . * ) 5 i f ( (next_token = 'green' ) or (next_token = 'lean' ) or 6 (next_token = 'mean' ) ) then begin 7

CO NF

ID EN

TI AL

D RA

FT3.5. SLLGEN 1138 ( * A funct ion you write to advance to the next token * ) 9 lexical ;

10

11 end else 12 ( * A funct ion you write to handle e r r o r s * ) 13 error ('Input is not a program' ) ; 14 end ;

The following is a sample interactive session with the parser:

> alice found mean green book

"alice found mean green book" is a sentence.

> mean cow saw carefully green alice with book

"mean cow saw carefully green alice with book" is a sentence.

> alice found mean gren book

"alice found mean gren book" contains invalid lexemes and, thus,

is not a sentence.

> found alice mean green book

"found alice mean green book" is not a sentence.

The following are Python and Pascal procedures, respectively, that each parse and diagram a sub-sentence rooted at the non-terminal <adj>:

1 def adj ( ) : 2 global lexeme 3 global lexeme_index 4 global num_lexemes 5 global error 6 i f ( (lexeme == "green" ) or (lexeme == "lean" ) 7 or (lexeme == "mean" ) ) : 8 getNextLexeme ( ) 9 else :

10 error = 1

1 procedure adj ; begin 2

3 write ('(' ) ; 4

5 ( * The g lobal v a r i a b l e ” next token ” holds the next 6 token in the input stream . * ) 7 i f ( (next_token = 'green' ) or (next_token = 'lean' ) or 8 (next_token = 'mean' ) ) then begin 9

10 ( * A funct ion you write to advance to the next token * )

CO NF

ID EN

TI AL

D RA

FT114 CHAPTER 3. SCANNING AND PARSING11 lexical ; 12

13 write (')' ) 14 end else 15 ( * A funct ion you write to handle e r r o r s * ) 16 error ('Input is not a sentence' ) ; 17 end ;

The following is a sample interactive session with the parser diagrammer:

> alice found mean green book

((("alice")) ("found") ((("mean" ("green")) "book")))

> mean cow saw carefully green alice with book

(((("mean")"cow")) ("saw""carefully") ((("green")"alice"("with"("book")))))

> alice found mean gren book

"alice found mean gren book" contains invalid lexemes and, thus,

is not a sentence.

> found alice mean green book

"found alice mean green book" is not a sentence.

Exercise 3.5.13: Consider the following context-free grammar defined in EBNF for arithmetic expressions in postfix form:

<expr> ::= <expr> <expr> + <expr> ::= <expr> <expr> - <expr> ::= <expr> <expr> * <expr> ::= <expr> <expr> / <expr> ::= <number>

<number> ::= <nonzerodigit> <rest> | <digit> <rest> ::= <digit> <rest> | <digit>

<nonzerodigit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Build a postfix expression evaluator using any programming language. Specifically, build a parser for the language defined by this grammar using a stack. When you encounter a number, push it on the stack. When you encounter an operator, pop the top two elements off of the stack, compute the result of the operator applied to those two operands, and push the result on the stack. When the input string is exhausted, if there is only one number element on the stack, the string was a sentence in the language and the number on the stack is the result of evaluating the entire postfix expression.

CO NF

ID EN

TI AL

D RA

FT3.6. KEY TERMS 115Exercise 3.5.14: Build a graphical user interface, akin to that shown be- low, for the postfix expression evaluator developed in Programming Exer- cise 3.5.13.

Any programming language is permissible. Interesting choices include HTML 5 and JavaScript to build a web interface to your evaluator, or Java to build a standalone application, Qt (http://doc.qt.io/qt-4. 8/gettingstartedqt.html, http://zetcode.com/gui/qt5/), Squeak Smalltalk (try solving Programming Exercise 16.11.1 first and see http://squeak.org/), Python (https://wiki.python.org/ moin/GuiProgramming), Perl (http://perl-begin.org/uses/ GUI/), PHP, Tcl/Tk (https://www.tcl.tk/), or Dr. Racket (Scheme, https://docs.racket-lang.org/gui/). You could even built an Android or iOS app. All of these languages have a built-in or library stack data structure that you may use. Do not use a drag and drop WYSIWYG tool to create the interface.

Exercise 3.5.15: Define a tree data type and modify the parser generator for CHAMELEON given in Listing 3.4 by augmenting it with actions–in the pattern-action rules—to build a parse tree as an instance of that defined tree type.

3.6 Key Terms

ambiguity, ambiguous grammar, assembler, assembly code, associativ- ity, attribute grammars, back end, Backus-Naur Form (BNF) bottom-up parsing, Chomsky hierarchy, compilation, compiler, context-free gram- mar, context-free language, Extended Backus-Naur Form (EBNF), fixed- format languages, formal grammar, formal language, fourth-generation

CO NF

ID EN

TI AL

D RA

FT116 CHAPTER 3. SCANNING AND PARSINGlanguages, free-format languages, front end, grammar, handle, inter- pretation, interpreter, keywords, language implementation system, lex- eme, lexer, lexical analysis, lexical analyzer, machine code, non-terminal, parser, parser generator, parse tree, parsing, precedence, processor, pro- duction rule, programming language implementation system, pushdown automata, recursive-descent parsing, recursively-enumerable languages, recursive languages, reduce-reduce conflict, regular expression regular grammar, regular language, reserved words, scanner, semantic analysis, semantics, sentence, sentential form, shift-reduce parsing, syntactic analy- sis, syntactic analyzer, syntax, terminal, token, top-down parsing,

3.7 Thematic Take-Aways

• A seminal contribution to computer science was the discovery that grammars were language recognition devices as well as generative devices for language.

3.8 Chapter Summary

Scanning (or lexical analysis) picks out the lexemes, determines if all are valid, and returns a list of tokens. Parsing (or syntactic analysis) deter- mines if the list of tokens is in the correct order and, if so, structures this list into an abstract syntax tree.

lex (flex) is a scanner generator for C. yacc (bison) is a parser gen- erator for C.

3.9 Bibliographic Notes

CO NF

ID EN

TI AL

D RA

FT

Chapter 4

Programming Language Implementation

Author: Saverio Perugini Copyright © 2018 by Saverio Perugini ALL RIGHTS RESERVED

4.1 Chapter Objectives

• Establish an understanding of the differences between a compiler and an interpreter.

• Establish an understanding of a variety of implementations for pro- gramming languages.

4.2 Language Implementation

Regular and context-free grammars are important topics in the study of the theory of computation. In this context, they are useful for describing and defining the syntax of programming languages. Moreover, they can conveniently and naturally be used for language recognition which is a necessary step in language implementation and a precursor to program execution.

117

CO NF

ID EN

TI AL

D RA

FT118 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION (regular grammar)

Interpreter scanner

list of

tokens parser grammar)

(context−freesource program (string or

parse tree

Front End

list of lexemes)

program input

program output

Figure 4.1: Interpretation.

4.2.1 Front End

Regular and context-free grammars are essential ingredients in the syn- tactic component (sometimes called the front end; see left side of Fig. 4.1 labeled ‘Front End’) to any programming language implementation sys- tem (e.g., interpreter or compiler). The front end of these systems involves scanning and parsing. Recall, the source code of a program is simply a string of characters. After comments are purged from the string and it is partitioned into primitive units based on some delimiter (usually whites- pace), we are left with a list of lexemes which is input to a scanner. The scanner must model the regular grammar which defines the tokens of the programming language. Scanning determines the validity of the lexemes. If all lexemes are valid, the result of scanning, or lexical analysis, is a list of tokens which is input to a parser. The parser must model the context-free grammar which defines the structure of the language. Parsing determines if the program is syntactically valid. If the parser can construct a parse tree from the list of tokens, the program is syntactically valid; otherwise it is not. The result of parsing, or syntactic analysis, is a parse tree. Thus, the objective of the front end of a language implementation system is to produce a parse tree if the source program is valid. What is done with that parse tree determines whether the language implementation system is an interpreter or compiler.

4.2.2 Interpretation versus Compilation

Interpretation involves traversing the parse tree produced by the front end, given the program input, so to evaluate and directly execute the program (see right side of Fig. 4.1 labeled ‘Interpreter’). In stark contrast, compilation in-

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 119 translated program

scanner

(regular grammar) list of

tokens parser grammar)source program

(string or parse tree

Front End

list of lexemes)

(context−free

Compiler

Interpreter

(e.g., processor) code generator/

translatoranalyzer semantic

program output

program input

(e.g., object code)

Figure 4.2: Compilation.

volves translating the parse tree into another representation of the program (or, in other words, description of the computation modeled by the origi- nal program) which is typically (but not necessarily; e.g., the Java compiler javac outputs Java bytecode) more similar to the instruction set used by the processor intended to execute the program. There is no translation involved in interpretation. An interpreter is simply a computer program like any other program. It accepts input and produces output. “The in- terpreter for a computer language is just another program” [FW08]. This observation has been described as the most fundamental idea in computer programming [FW08].

The input to an interpreter is the source program to be executed and the input of that source program. We say the input of the interpreter is the source program because to the programmer of the source program the entire language implementation system (i.e., Fig. 4.1) is the interpreter as opposed to just the last component of that system (i.e., the interpreter com- ponent of it) which accepts a parse tree as input (see right side of Fig. 4.1 labeled ‘Interpreter’). The output of an interpreter is the output of the source program.

Compilation (see center of Fig. 4.2 labeled ‘Compiler’), on the other hand, translates the parse tree (which is already an intermediate represen- tation of the original source program) into another intermediate represen- tation (often assembly code) typically more similar to the instruction set used by the processor intended to execute the program. A compiler typ- ically involves two sub-components: the semantic analyzer and the code generator (neither covered here).

Abstraction, the general idea that primitive details of an entity can be hidden (i.e., abstracted away) by adding a layer to that entity which pro- vides higher-level interfaces to those details such that the entity can be interacted with without knowledge of its primitive details and so on, is a

CO NF

ID EN

TI AL

D RA

FT120 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONfundamental concept to computer science which recurs often in different contexts in the study of computer science. Progressively abstracting away from the details of the instruction set understood by the target processor has resulted in a series of programming languages, each at a high-level of abstraction than the prior:

1. machine language (e.g., x86)

2. assembly language (e.g., MIPS)

3. high-level language (e.g., Scheme and Java)

4. fourth-generation language (e.g., lex and yacc)

Assembly languages (e.g., MIPS) replaced the binary digits of machine lan- guage with mnemonics—short English-like words which represent com- mands or data. High-level languages (e.g., Scheme or Java) extend this abstraction with respect to control, procedure, and data. C is sometimes referred to as the lowest high-level language because it provides facilities for manipulating machine addresses and memory, and in-lining assembly language code. Fourth-generation languages are referred to as such since they follow three prior levels. Note that machine language is not the end of abstraction. The 0s and 1s in machine code are simply abstractions for electrical signals, and so on.

In this vein, the translation involved in compilation is typically a se- ries of translations or transformations from the source program to an in- termediate representation to another intermediate representation and so on, bringing the original source program closer, at each step, to the in- struction set understood by the target processor, often until an assembly language program is produced. An assembler (not shown as a component in Fig. 4.2) translates an assembly language program into machine code (i.e., object code) and is, therefore, another type of translator system (or compiler). It is important to note that a compiler1 is simply a translator; it does not execute the source program (or, for that matter, the final trans- lated representation of the program it produces) at all. Furthermore, its

1“A compiler was originally a program that ‘compiled’ subroutines [a link-loader]. When in 1954 the combination ‘algebraic compiler’ came into use, or rather misuse, the meaning of the term had already shifted to the present one” [BE75].

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 121translations need not even bring the source program any closer to the in- struction set of the targeted platform at all. It can bring it further away if desired. For instance, a system which translates C code to Java code is no less a compiler than a system which translates C code into assembly code. Another example is a LATEX compiler from LATEX source code (a high-level language for describing and typesetting documents) to Postscript (a lan- guage understood by laser printers). The Postscript document generated by this compiler can be printed by a laser printer which is a hardware in- terpreter for Postscript (i.e., thread 1 in Fig. 4.8) or rendered on a display using a software interpreter for Postscript such as Ghostscript (i.e., thread 2 in Fig. 4.8). Web browsers are software interpreters (compiled into ma- chine code) which directly interpret HTML (a markup language describing the presentation of a webpage) and JavaScript and now a variety of other high-level programming languages. Therefore, a more appropriate term for compiler is ‘translator.’ The term compiler derives from the use of the word to describe a program which compiled sub-routines—what we now call a linker. Later in the 1950s the term compiler, shortened from ‘algebraic compiler,’ was used (or misused) to describe a source-to-source translator conveying its present-day meaning.

Note that we can think of the front end as a compiler as well; it translates the source program (i.e., a string of characters) into a parse tree (i.e., an intermediate representation). Also note that if a code generator is correct (i.e., generates syntactically valid code) the interpreter on the right side of Fig. 4.2 does not require a front end. In other words it depicts the same component as shown on the right side of Fig. 4.1 labeled ‘Interpreter’.

Sometimes students, perhaps coming from the background of an intro- ductory course in computer programming in which they exclusively pro- gram in a compiled language, find it challenging to understand that an interpreter involves no translation. Perhaps this is because the idea that in a computer system everything must be translated or reduced to zeros and ones (i.e., machine code) to execute has been ingrained in them. While we build several interpreters in Chapters 10–12, we use a simple example here to demonstrate that an interpreter does not translate to dispel this doubt. Consider a very simple interpreter which interprets the language simple with the following grammar:

CO NF

ID EN

TI AL

D RA

FT122 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION<simple> ::= <number> + <number> <number> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The following is an interpreter, written in C, for the language simple.

# include<l i m i t s . h> # include<s t d i o . h> # include<ctype . h>

main ( ) { char string [LINE_MAX ] ; /* s e nte nce s have e x a c t l y t hre e non−whitespace c h a r a c t e r s */ char program [ 4 ] ;

i n t num1 , num2 , sum ; i n t i = 0 ; i n t j = 0 ;

fgets (string , LINE_MAX , stdin ) ;

/* purge whitespace */ while (string [i ] != '\0' ) {

i f ( ! isspace (string [i ] ) ) program [j ] = string [i ] ;

i++ , j++; } program [ 3 ] = '\0' ;

/* l e x i c a l and s y n t a c t i c analy s is , note lack of semantic a n a l y s i s */

i f (isdigit (program [ 0 ] ) && program [ 1 ] == '+' && isdigit (program [ 2 ] ) ) {

/* s u b t r a c t i n g the i n t e g e r value of the a s c i i c h a r a c t e r 0 from any a s c i i d i g i t re t urns the i n t e g e r value of the d i g i t ( e . g . , ' 2 ' − ' 0 ' = 2) */

num1 = program [ 0 ] − '0' ; num2 = program [ 2 ] − '0' ;

sum = num1 + num2 ;

printf ("%d\n" , sum ) ; } else /* i n v a l i d lexeme */

fprintf (stderr , "Program is invalid.\n" ) ;

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 123 num1

2+3 2 3 5

2 + 3

(i.e., a simple program)

5 (i.e., program output)

program num2 sum

Simple Interpreter (a C program compiled into machine code)

"2 + 3"

string

Figure 4.3: Interpreter for language simple illustrating that the simple program becomes part of the running interpreter process.

}

A session with the simple interpreter follows:

1 $ gcc −o simple simple .c 2 $ ./simple 3 2 + 3 4 5 5 $ ./simple 6 5+9 7 14 8 $ ./simple 9 3+ 8

10 11 11 $ ./simple 12 6 +0 13 6 14 $ ./simple 15 9 + 3 16 12 17 $ ./simple 18 123 19 Program is invalid . 20 $ ./simple 21 23 + 1 22 Program is invalid .

The simple program 2 + 3 is never translated prior to execution. It is however read in as input to the interpreter, which has been compiled into machine code and is currently executing on the processor, and, therefore, has become part and parcel of the image of the interpreter in memory (i.e., the interpreter process; see Fig. 4.3). In that sense, it has become part of the interpreter. An interpreter is never going to translate its source pro-

CO NF

ID EN

TI AL

D RA

FT124 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONgram into any representation other than a parse tree or other similar data structure to facilitate its evaluation.

Again, language implementations each involve two major components, the first of which, the front end, is the same (see left sides of Figs. 4.1 and 4.2). The differences in the various approaches to implementation lie in the back end.

4.2.3 Run-time Systems or Methods of Executions

Ultimately, the series of translations must end and a representation of the original source program must be interpreted (see right side of Fig. 4.2 la- beled ‘Interpreter (e.g., processor)’). Therefore, interpretation must follow compilation. That interpretation can be performed by the most primitive of interpreters, a hardware interpreter called the processor, or by a soft- ware interpreter (which itself is just another computer program which is compiled or executed by interpretation).

Interpretation by the processor is the more common and traditional ap- proach to execution after compilation (for purposes of speed of execution; see thread 1 in Fig. 4.8); it involves translating the source program all the way down, through the use of an assembler, to machine code (e.g., x86). This more traditional style is depicted in the language-neutral diagram in Fig. 4.4. For instance, the gcc (i.e., the gNU c compiler) translates a C pro- gram into object code (e.g., program.o). For purposes of simplification of exposition, we omit the optional, but common, code optimization step and necessary linking step from Figs. 4.2 and 4.4.2 A more detailed view of the phases of compiling a C program is illustrated in Fig. 4.5. Often the code optimization phase of compilation is part of what is called the back end of a language implementation system.

An example of the final representation being evaluated by a software interpreter is a compiler from Java code to Java bytecode where the re- sulting Java bytecode is executed by the Java Virtual Machine—an inter- preter. These systems are sometimes referred to as hybrid (of compila- tion and interpretation) language implementation systems (see thread 2 in Fig. 4.8). Though their approach to program execution is historically non-

2For a more detailed, internal view into all of the phases of execution through compilation and the interfaces between them, we refer the reader to Fig. 1.1 of [App04].

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 125

program output

mul id3 add id4 store id1

001101010110110 000110101010111 111100011100101 010101010101010

id1 = id2 * id3 + id4

=

+

scanner

parser

preprocessor

n = x * y + z

id1

Front End

n = x * y + z

code generator

Compiler

/* mathematical expression */

*

id2 id3

id4

parse tree

assembly code

assembler

object code

source program commented

list of tokens

list of lexemes

processorprogram input

load id2

Figure 4.4: Low-level view of execution by compilation.

CO NF

ID EN

TI AL

D RA

FT126 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION

dynamic linking

. . .

a.out

executable

.

. . .

.

.

f2.o

object code

f1.o main.o

linker

gcc f1.o f2.o main.o

.

.

9A 01 00 00 10 00 4C 01 04 00 00 00

.

.

40 00 30 C0 2E 62 61 00 00 00 00 00

.

.

24 01 00 00 00 00 00 00 83 C0 0F 83

.

.

.

.

.

.

f2.cf1.c main.c

expanded C source files

int main() {

strlen("...");

f2.c

C source files

f1.c main.c

stdio.h string.h

movl 2345, %esp call strlen

movl 10, %espmovl 1234, %esp call printf

input output

}

f2.s

assembly code

f1.s main.s

assembler

preprocessor

compiler

string.h stdio.h /usr/include/

f(g((10));

call f call g

} printf("...");

}

int g(int x) {

strlen("...");

#include<stdio.h> #include<string.h>

printf("...");

printf strlen f g

void f(int x) {

gcc −E f1.c f2.c main.c cpp f1.c f2.c main.c

gcc −S f1.i f2.i main.i

gcc −c f1.s f2.s main.s

} f(g(TEN)); /* comment */ int main() { int g(int x); void f(int x); #define TEN 10

4C D1 04 00 00 A0 4A 01 00 20 10 00

/usr/lib/libc.a

definition of printf of strlen definition

static linking

ar t /usr/lib/libc.a

printf.o strlen.o

printf strlen

nm −C /usr/lib/libc.so

/usr/lib/libc.so

definition of strlendefinition

of printf

.

Figure 4.5: Stages involved in compiling a C program.

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 127

Interpreterscanner

source program (string or

list of lexemes)

list of

tokens parser grammar)

(context−free parse tree

Front End

(regular grammar) program output

program input

interpreter (compiled to machine code)

(input to the interpreter)

(input to the interpreter)

(e.g., processor)

Figure 4.6: Alternate view of execution by interpretation.

compiler hardware software

interpreter

Figure 4.7: Dependency relationships between compilers and interpreters. Directed ar- rows depict a dependency relationship.

CO NF

ID EN

TI AL

D RA

FT128 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION

Hardware Interpreter

compiled

Software Interpreter

Software Interpreter

3

program output

Hardware Interpreter 2

4

program output

translation(s) more dynamic and not as fast hybrid systems:

(e.g., Java, Perl, C#)

(e.g., FORTAN and C) static bindings, and fast

traditional compilation:

pure interpretation: dynamic bindings and slow

executed by ...

Software Interpreters Infinite Tower of

by executed

interpreter

executed by

Interpretation

(e.g., Scheme, ML, Haskell)

interpreted interpreter executed by

Compilation

Software Interpreter

executed by

source program

source program

...

final representation intermediate representations

Hardware Interpreter

program output

1

Figure 4.8: Four different approaches to programming language implementation.

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 129traditional, the number of languages implemented in this manner (e.g., Perl3 or Java) have increased since the mid-1990s for a variety of factors such as the advent of faster processors, the web, and a higher premium on flexibility than speed of execution in some application domains. Any lan- guage implementation system involving compilation must eventually in- volve interpretation and, therefore, all language implementation systems involving compilation can be said to be hybrid systems. Here, we refer to hybrid systems only as those which compile to a representation inter- preted by a compiled software interpreter (thread 2 in Fig. 4.8).

While we do not have a hardware interpreter (i.e., processor or ma- chine) which natively executes programs written in high-level languages4, an interpreter or compiler creates a virtual machine for the language of the source program (i.e., a computer which understands, primitively and naively, that language). In other words, “[t]he interpreter program acts as a software simulation of a machine whose fetch-[decode-]execute cycle deals with high-level language program statements rather than machine instructions” [Seb10]. Therefore, an interpreter for language L (i.e., IL) is a virtual machine for executing programs written in language L. For exam- ple, a Scheme interpreter creates a virtual Scheme computer. Similarly, a compiler from language L to L′ (i.e., CL→L′) can translate a program in lan- guage L either to a language (i.e., L′) for which an interpreter executable on the target processor exists (i.e., IL′) or directly to code understood by the target processor and, therefore, the (CL→L′, IL′) pair also serves as a virtual machine for language L. For instance, a compiler from Java to Java byte- code and a Java bytecode interpreter (e.g., (javac, java) pair) provide a virtual Java computer.5 Programs written in the C# programming lan- guage within .NET run-time environment are compiled, interpreted, and executed in a similar fashion. Some language implementation systems, in the same spirit as dynamic linking, even delay the translation of parts

3While Perl programs are compiled to detect errors before interpretation the compiled intermediate code is not stored on disk in any conveniently accessible or reusable form (as in the case of bytecode .class files with Java) and, thus, Perl appears to be purely interpreted to the programmer. Moreover, this pre- compilation is performed every time a Perl program is interpreted.

4This is not entirely true. A LISP chip has been built as well as a PROLOG computer called the Warren Abstract Machine.

5The Java bytecode interpreter (i.e., java) is typically referred to as the Java Virtual Machine or JVM by itself. However, it really is a virtual machine for Java bytecode rather than Java. Therefore, it is more accurate to say that the Java compiler and Java bytecode interpreter (traditionally, though somewhat inac- curately, called a JVM) together provides a virtual machine for Java.

CO NF

ID EN

TI AL

D RA

FT130 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONof the final intermediate representation produced into machine code un- til they are required (if at all) and used during execution. These systems are called Just-in-Time (JIT) implementation systems and are said to use just- in-time compilation. Thus, a compiler must rely on a hardware or software interpreter. We build several interpreters in Chapters 10–12, where execu- tion by interpretation is the focus.

This view of an interpreter as a virtual machine is assumed in Fig. 4.1 where on the right side of that figure the interpreter is given the parse tree and the program input as input and executes it directly to produce program output. Unless that interpreter (on the right hand side) is a hard- ware processor and its input is machine code, that figure is an abstraction of another processes because the interpreter—as we have said is a pro- gram like any other—needs to be executed (i.e., interpreted or compiled itself). Therefore, a lower-level presentation of interpretation is given in Fig. 4.6. Specifically, an interpreter compiled into machine code is inter- preted by a processor (see thread 3 of Fig. 4.8). In addition to accepting the interpreter as input, the processor accepts the source program (or its parse tree which is input to the program the processor is running, namely the interpreter) and the input of the source program as additional inputs. However, an interpreter for a computer language need not always be com- piled directly into machine code. A software interpreter also can be in- terpreted by another (possibly the same) software interpreter, and so on (see thread 4 of Fig. 4.8). This has been referred to as an infinite tower of interpreters [Smi82, Smi84]. At some point, however, the final software in- terpreter must be executed on the target processor. Therefore, a software interpreter is dependent upon a compiler because the software interpreter eventually needs to be compiled into machine code, unless the software interpreter is originally written in machine code. Therefore, a compiler is more dependent on an interpreter than a software interpreter is dependent on a compiler.

For now, we say that compilers and interpreters are mutually- dependent on each other (see Fig. 4.7). More specifically, a compiler is dependent upon either a hardware or software interpreter. A software in- terpreter is dependent on a compiler so that the interpreter itself or one of its descendant interpreters can be translated into machine code and run. For instance, the simple interpreter above is written in C and compiled into

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 131machine code using the gNU c compiler (i.e., gcc) (see thread 3 of Fig. 4.8). Fig. 4.8 provides a summary of the four different approaches to pro-

gramming language implementation described in this subsection. Each approach is surrounded with a dotted box which is labeled with a circled number and presented below in order from fastest to slowest execution.

1. (blue) Traditional compilation directly to machine code (e.g., FOR- TRAN, C).

2. (magenta) Hybrid systems: interpretation of a compiled, final repre- sentation through a compiled interpreter (e.g., Java).

3. (cyan) Pure interpretation of a source program through a compiled interpreter (e.g., Scheme, ML).

4. (green) Interpretation of either a source program or a compiled fi- nal representation through an interpreted interpreter (i.e., Infinite Tower).

The study of language implementation and methods of execution, specif- ically as depicted in Fig. 4.8 through progressive levels of compilation and/or interpretation, again bring us face-to-face with abstraction.

Note that the figures in this section (with the exception of Fig. 4.5) are conceptual in that they identify the major components and steps in the interpretation and compilation process independent of any particular ma- chine-̄/architecture and are not intended to model any particular inter- preter or compiler. Some of these steps can be combined to obviate mul- tiple passes through the input program or representations thereof. For instance, we discuss in § 3.3 that lexical analysis can be performed dur- ing syntactic analysis. We also mentioned in § 2.4.4 that mathematical ex- pressions can be evaluated while being syntactically validated. We revisit Figs. 4.1 and 4.6 in Chapters 10–12 where we study fundamental concepts of programming languages through the implementation of interpreters op- erationalizing those concepts.

4.2.4 Comparison of Interpreters and Compilers

Table 4.1 summarizes the advantages and disadvantages of compilation (thread 1) and pure interpretation (thread 3). The primary difference be-

CO NF

ID EN

TI AL

D RA

FT132 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONTable 4.1: Advantages and disadvantages of interpreters and compilers. Implementation Advantages Disadvantages

Trad. Compiler fast execution inconvenient program development less source-level debugging less run-time flexibility

Pure Interpreter convenient program development direct source-level debugging run-time flexibility

slow execution (decoding) often requires more run-time space

tween the two approaches is speed of execution. Interpreting high-level language is slower than interpreting machine code primarily because de- coding high-level statements and expressions is slower than decoding ma- chine instructions. Moreover, a statement must be decoded as many times as it is executed in a program even though it may only appear in the pro- gram once and the result of that decoding is the same each time. For instance consider the following loop in a C fragment which computes 21,000,000 iteratively:

i n t i=0; i n t result=2; fo r (i=1; i < 1000000 ; i++)

result *= 2 ;

If this program was purely interpreted, the statement result *= 2 would be decoded once less than one million times! Thus, not only does a software interpreter decode a high-level statement such as result *= 2 slower than the processor decodes analogous machine code, but that per- formance degradation is compounded by repeatedly decoding the same statement every time it is executed. ‘Therefore, statement decoding, rather than the connection between the processor and memory, is the bottleneck of a pure interpreter’ [Seb10]. An interpreter also typically requires more run-time space because the run-time environment, a data structure which provides the bindings of variables, is required during interpretation (see Chapter 6) and, moreover, often the source program is represented in- ternally with a data structure designed for easy access and modification rather than one with minimal space requirements (see Chapter 9). Often the internal representation of the source program accessed and manipu- lated by an interpreter is an abstract syntax tree (or AST). An abstract syntax

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 133tree (see Chapter 9), like a parse tree, depicts the structure of a program. However, unlike a parse tree, it does not contain non-terminals; it also structures the program in a way which facilitates interpretation (see Chap- ter 10).

The advantages of a pure interpreter and the disadvantages of a tradi- tional compiler are complements of each other on progressive levels. At a core level, program development using a compiled language is inconve- nient because every time the program is modified it must be re-compiled to be tested and often the programmer expends copious time cycling through a program-compile-debug-recompile loop. Program development with an interpreter, on the other hand, involves one less step. Moreover, if pro- vided with an interpreter, a read-eval-print loop helps facilitate testing and debugging program units (e.g., functions) in isolation of the rest of the program, where possible.

More importantly, since an interpreter does not translate a program into another representation (other than an abstract syntax representation), it does not obfuscate the original source program and, therefore, can more accurately identify source-level (i.e., syntactic) sources (e.g., name of an ar- ray whose index is out-of-bounds) of run-time errors and refer directly to lines of code in error messages with more precision than a compiler which due to translation may have no way to identify the source of a compile- time error in the original pristine source program by the time the error is detected. Run-time errors in compiled programs are similarly difficult to trace back to the source program because the target program has no knowl- edge whatsoever of the original source program. Such run-time feedback can be invaluable to debugging a program in an efficient and systematic manner. Therefore, at a second level, the mechanics of testing and debug- ging can be more streamlined and cleaner using an interpreted as, opposed to a compiled, language.

On a higher level, the ability or lack thereof to delay bindings until run- time affect flexibility of program development. For instance, the more dynamic bindings a language supports, the less commitments the pro- grammer must make during program development, and vice versa, which leaves provisions for flexible and convenient debugging, maintenance, and re-design for dealing with issues such as errors or changing requirements. For instance, run-time binding of messages to methods in some object-

CO NF

ID EN

TI AL

D RA

FT134 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONoriented languages allow programs to be more easily designed during de- velopment and extended during maintenance.

4.2.5 Influence of Language Design Goals and Binding on Language Implementation and Vice Versa

The goals of a language (e.g., speed of execution, speed of development) and its subsequent design (e.g., choice of static or dynamic bindings) influ- ence the method through which it is implemented (e.g., interpretation or compilation). For instance, FORTRAN and C programs are intended to be fast and are, therefore, compiled. UNIX shell scripts, on the other hand, are intended to be quick and easy to develop and debug and, thus, are inter- preted. Incidentally, an aspect of ease of program development is writabil- ity while an aspect of ease of program modification is readability (espe- cially by someone other than the original developer) and yet often both of these aspects conflict each other. For instance, a writable language (e.g., Scheme) is generally terse and therefore difficult to read. On the other hand, a readable language (e.g., COBOL) is generally difficult for writing programs. This does not mean that a language cannot facilitate both pro- gram development and modification, or that a terse language cannot fos- ter program modification; Scheme, for example, happens to facilitate both well. Similarly, languages offering a great deal of run-time flexibility (e.g., where variables can be bound to any values), such as Perl or Scheme (e.g., where the types of variables are not declared before run-time) often in- volve interpretation because the translations involved in compilation often preclude changes at run-time.

However, there is nothing intrinsic in a programming language (i.e., in its definition) which precludes it from being implemented through inter- pretation or compilation. For instance, we can easily build an interpreter for C which is traditionally thought of as a compiled language. Such an approach to implementing C runs contrary to the design goals of C (i.e., efficiency) and provides no reasonable benefit to justify the degradation in performance. Similarly, we can build a compiler for Scheme which is typically referred to as an interpreted language. While the primary bene- fit of compiling Scheme is an improvement in speed, without altering the semantics of the language the compiler would need to generate code for

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 135a great deal of run-time checks to support the run-time flexibility the lan- guage provides. While speed of compilation decreases with the generation of code for run-time checks, that degradation only affects program devel- opment (i.e., debugging and maintenance) because the final program is only compiled once. In cases where an implementation provides both an interpreter and a compiler (to machine code) for a language (e.g., Java), the interpreter can be used for (speedy and flexible) program development while the compiler is reserved for producing the final, (fast-executing) pro- duction version of software.

Choice of bindings also affects choice of implementation. It is natu- ral to compile a language designed to support many static bindings be- cause establishing those bindings and semantic checks thereof can happen at compile-time and, thus, do not occupy CPU cycles at run-time yield- ing a fast executable. Therefore, the speed of executables produced by a compiler are a direct result of the efficient decoding of machine instruc- tions (versus high-level statements) at run-time coupled with very few semantic checks at run-time. The more static bindings a language sup- ports, the slower a compiler for it runs since the compiler is charged with the additional task of performing semantic checks. While this decreases compilation speed due to the generation of these checks in the target code and make program development slower and perhaps inconvenient (espe- cially for compiling a large software system from scratch), they are tol- erated because of the run-time speed they achieve in the target program. Optimizing target code also trades slower compilation time for a faster ex- ecutable. However, compilation is a ‘compile once, repeatedly run’ propo- sition because once a program is fairly bug-free, compilation is no longer performed and, therefore, the extra time required during compilation for these steps to improve the speed of the executable is ultimately negligible.

On the other hand, it is natural to interpret a language with many dy- namic bindings because, by definition, these bindings happen at run-time and, therefore, do not present an opportunity for a compiler to improve the speed of the executable produced. This does not mean that interpreted languages do not have static bindings. Scheme, for example, uses static scoping. It just means that static bindings do not improve run-time speed and, therefore, their justification lies outside of improving run-time perfor- mance.

CO NF

ID EN

TI AL

D RA

FT136 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONNote also that these influences work both ways. In other words, lan- guage implementation can influence bindings in the language. The fact that C is typically compiled encourages extending it with additional static bindings. Similarly, the fact that Scheme is typically interpreted encour- ages extending it with dynamic bindings because since all (static and dy- namic) semantics checks must happen at run-time (because there is no other time than run-time with an interpreter), there is no speedup in ex- ecution gained by supporting static bindings.

Again, though not technologically impossible to build an interpreter for a typically compiled language supporting many static bindings or to build a compiler for a typically interpreted language supporting many dy- namic bindings (because a compiler can generate code for dynamic seman- tic checks), it is unnatural because doing so runs contrary to the natural advantages in each approach (i.e., speed for compilation and flexibility for interpretation). Moreover, compiling languages which are traditionally in- terpreted removes some of the original advantages provided by interpre- tation. For instance, it is easy for an interpreted program to generate and run another program at run-time (an idea encapsulated in and realized with LISP macros as well as in the client/server paradigm of the web—see below). The same is nearly impossible in a compiled language because the program generated at run-time would need to be compiled itself before it could be run.

4.2.6 Speed of Execution versus Speed of Development

Historically, computer architecture influenced programming language de- sign and, therefore, implementation. For instance, we had the von Neu- mann architecture, where the connection between memory and the proces- sor is the biggest bottleneck and, therefore, speed of execution was the primary goal in language design. The goal of improving execution speed encourages compilation and compilation in turn encourages static bind- ings which make the resulting target program even faster. An emphasis on speed of program execution encourages compilation which encourages static bindings.

The software crisis in the 1960s and 1970s, which refers to the software industry’s inability to scale the software development process of large

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 137systems in the same way as other engineering disciplines, revealed that speed of program development is, like speed of execution, a criterion in the design of programming languages and resulted in software design methodologies such as structured programming (see Chapter 13). Moreover, advances in computer hardware, particularly due to Moore’s Law, which states that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years and describes the evolution of computer hardware, also helped reduce the emphasis of speed of execution as the primary criterion in the design of programming languages. Although the object-oriented paradigm was born in the late 1970s and early 1980s, and the ideas it espouses flourished with the devel- opment of Smalltalk in the 1980s (see outlier in Fig. 4.2), these methodolo- gies coupled with Moore’s Law (see thread on left side of Fig. 4.9) did not seem to shift the emphasis of language design and implementation away from hardware and speed of execution as the primary criteria from which to design programming languages to speed of development. As a result, there was little affect on the evolution of programming languages. To the contrary, and ironically, compilation flourished in the 1980s which saw very few interpreted languages (see center of Fig. 4.2). Therefore, while the software crisis and Moore’s Law created an awareness of speed of pro- gram development, this awareness did not surpass speed of execution in importance. Therefore, the merits of dynamic bindings and interpretation remained in the background. However, the advent of the World Wide Web (WWW) in the late 1990s and early 2000s created an unforeseen practical landscape for the exploration of these ideas of dynamic binding and fea- tures and interpretation as a way to address the challenges of developing software intended to run in this new client/server paradigm of the web.

Specifically, programs running on a server often generate, in response to a user input or action, new (high-level) programs (often in a language different from the original program) intended to be run on-the-fly in a browser. Compiling the generated program down to machine code before executing it in a browser is problematic for multiple reasons. First, the web runs on a variety of hardware platforms and, thus, program portability is an issue. Compilation of the generated programs would require the server to generate code targeted to each user’s individual hardware platform to support multiple target platforms which, while technologically feasible, is

CO NF

ID EN

TI AL

D RA

FT138 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONdifficult to sustain. Moreover, such an approach is wasteful because the program generated by the server running in the browser is typically run once and then discarded. The big sell of a compiler is its ‘compile once, re- peatedly run’ motif, while the web runs, quite the opposite, on a ‘generate multiple, run once’ motif. In other words, a program running on a server generates multiple (possibly different) programs (one for each client run- ning it) which are then run once and discarded. Compiling each of these generated programs repeatedly is analogous (though still faster) to an in- terpreter spending copious time decoding a high-level statement only to execute it once (because it will have to decode it again the next time it encounters and executes it). Second, again while technologically feasible, ...

These problems point to interpretation as a solution. However, fast re- sponse time is critical for maintaining user attention on the web and in- terpretation is notorious for slow execution speed. These requirements of portability and speed (see right thread in Fig. 4.9) created an ideal en- vironment for hybrid approaches to execution (see thread 2 in Fig. 4.8) which provide most of the advantages of both pure interpretation and compilation. Using a hybrid approach, high-level language is decoded only once and compiled into an architecturally-neutral, intermediate form which is portable in that it can be run on any system equipped with an interpreter for it. While the intermediate code cannot be interpreted as fast as machine code, it is interpreted faster than the original high-level source program while maintaining many of the run-time flexibilities afforded to traditional purely interpreted languages. Note that support for dynamic bindings and portability are to some extent the same issues. Compiling to an architecturally-neutral format means that, for example, the storage required for a variable (e.g., of type integer) can be delayed until run-time because the target architecture which dictates the size of an integer is un- known until run-time. These hybrid approaches are used on the web in two main ways:

• A program compiled into an architecturally-neutral format is down- loaded from a server and executed on the client by a browser plugin acting as a software interpreter (e.g., Adobe Flash).

• A program running on the server generates a webpage encapsulating

CO NF

ID EN

TI AL

D RA

FT4.2. LANGUAGE IMPLEMENTATION 139

speed of execution as a criterion for language design)

software crisis Moore’s Law (faster processors)

awareness of speed of development

increased emphasis on

dynamic bindings and features

advent of WWW

need for portability and speed of execution

hybrid compilation−interpretation language implementation systems

more purely interpreted languages

?

(where speed of development surpasses

Figure 4.9: The convergence of two independent threads in the evolution of language implementation and possible subsequent influence on the design of (future) languages. Arrows imply ‘lead to.’

Table 4.2: Methods of language implementation shift with evolving design goals. primary design goal: research speed speed/run-time flexibility

implementation method: early interpreted languages → nearly all compiled languages → hybrid compiled-interpreted languages language: (e.g., LISP, APL) → (e.g., Ada, C, C++, COBOL) → (e.g., Java, Perl, C#)

outliers: FORTRAN Smalltalk JavaScript, PHP, Python (compiled) (interpreted) (interpreted)

timeline: 1960s 1980s 2000s

a program (e.g., a Perl or PHP program running on a server gener- ates an HTML webpage with embedded JavaScript) which is directly executed by the browser which is a software interpreter.

The response to this shift to a networked client/server program exe- cution model, namely hybrid compilation-interpretation implementation approaches, encourages dynamic bindings and features which the ap- proach enables and therefore, also supported (or at least did not oppose), as a byproduct, improvements to the efficiency of the software develop- ment processes. For instance, we see the evolution of languages such as JavaScript, Perl, Python, PHP, and Ruby resembling UNIX scripting lan- guages in both syntax and style of rapid and easy development. Thus, this shift also heightened awareness of the importance of speed of devel-

CO NF

ID EN

TI AL

D RA

FT140 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION full−circle back home?

interpreted languages compiled languages

(e.g., Java, Perl, C#)

(e.g., APL, LISP, SNOBOL) (e.g., Ada, C, C++, COBOL)

hybrid compiled−interpreted languages

1960s 1980s

2000s

Figure 4.10: Towards a full-circle in the evolution of programming languages.

opment, particularly in non-engineering domains where speed of execu- tion is not always the most important design factor. (Ironically, Matlab, a programming language used frequently by engineers is interpreted.) Con- comitantly, we see research on agile software development methods such as extreme programming [] involving repeated and rapid tours through the software development cycle which benefit from the flexibility afforded by dynamic bindings and features. Therefore, hybrid implementations are a juncture (see bottom of Fig. 4.9) where these two independent threads in the evolution of language implementation met.

It is no surprise that a recent trend in the evolution of programming languages is toward more interpreted, so called ‘dynamic,’ languages (see Table 4.2 and Fig. 4.10). Only the future will tell if modern languages im- plemented with hybrid compiler-interpreters are a response to the need for architectural-neutrality and portability in the new computing landscape of the web or a part of a broader, yet slow, progression in a trend back to- ward purely interpreted languages (see Fig. 4.10 and dotted transition in Fig. 4.9).

4.3 Conceptual Exercises for Chapter 4

Exercise 4.3.1: Give an example (different from those given in this section) of something that a compiled language can do that is not possible in an interpreted language, and vice versa.

CO NF

ID EN

TI AL

D RA

FT4.4. PROGRAMMING EXERCISES FOR CHAPTER 4 1414.4 Programming Exercises for Chapter 4 Exercise 4.4.2: Consider the following context-free grammar defined in EBNF from Programming Exercise 3.3.5:

<program> ::= <program> <expr> \n | <expr> \n <expr> ::= <expr> + <expr> <expr> ::= <expr> * <expr> <expr> ::= − <expr> <expr> ::= <integer>

<integer> ::= 1 | 2 | 3 | . . . |∞

where <expr> and <integer> are non-terminals and +, *, −, and 1, 2, 3, . . . are terminals.

a) Extend your program from Programming Exercise 3.3.5.a to interpret programs.

Normal precedence rules hold: − has the highest, * has the second high- est, and + has the lowest. Assume left-to-right associativity. The follow- ing is sample input and output for the expression evaluator (> is simply the prompt for input and will be the empty string in your system):

> 2+3*4

14

> 2+3*-4

-10

> 2+3*a

"2+3*a" contains invalid lexemes and, thus, is not an expression.

> 2+*3*4

"2+*3*4" is not an expression.

Do not build a parse tree to solve this problem. Factor your program into a recursive-descent parser (solution to Programming Exercise 3.3.5.a) and an interpreter as shown in Fig. 4.1.

b) Extend your program from Programming Exercise 3.3.5.b to interpret expressions as shown in Programming Exercise 4.4.2.a.

Do not build a parse tree to solve this problem. Factor your program into a shift-reduce parser (solution to Programming Exercise 3.3.5.b) and an interpreter as shown in Fig. 4.1.

CO NF

ID EN

TI AL

D RA

FT142 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONc) Complete Programming Exercise 4.4.2.a, but this time build a parse tree and walk it to evaluate the expression.

d) Complete Programming Exercise 4.4.2.b, but this time build a parse tree and walk it to evaluate the expression.

e) Extend your program from Programming Exercise 3.3.5.d to interpret expressions as shown below:

> 2+3*4

((2)+((3)*(4))) = 14

> 2+3*-4

((2)+((3)*(-(4)))) = -10

> 2+3*a

"2+3*a" contains invalid lexemes and, thus, is not an expression.

> 2+*3*4

"2+*3*4" is not an expression.

f) Extend your program from Programming Exercise 3.3.5.e to interpret expressions as shown in Programming Exercise 4.4.2.e.

g) Extend your program from Programming Exercise 3.3.5.f to interpret expressions as shown in Programming Exercise 4.4.2.e.

h) Extend your program from Programming Exercise 3.3.5.g to interpret expressions as shown in Programming Exercise 4.4.2.e.

i) Complete Programming Exercise 4.4.2.e, but this time, rather than dia- gramming the expression, decorate each expression with parentheses to indicate the order of operator application and interpret expressions as shown below:

> 2+3*4

(2+(3*4)) = 14

> 2+3*-4

(2+(3*(-4))) = -10

> 2+3*a

"2+3*a" contains invalid lexemes and, thus, is not an expression.

> 2+*3*4

"2+*3*4" is not an expression.

j) Complete Programming Exercise 4.4.2.f, but this time, rather than dia- gramming the expression, decorate each expression with parentheses to indicate the order of operator application and interpret expressions as shown in Programming Exercise 4.4.2.i.

CO NF

ID EN

TI AL

D RA

FT4.4. PROGRAMMING EXERCISES FOR CHAPTER 4 143k) Complete Programming Exercise 4.4.g, but this time, rather than dia- gramming the expression, decorate each expression with parentheses to indicate the order of operator application and interpret expressions as shown in Programming Exercise 4.4.2.i.

l) Complete Programming Exercise 4.4.h, but this time, rather than dia- gramming the expression, decorate each expression with parentheses to indicate the order of operator application and interpret expressions as shown in Programming Exercise 4.4.2.i.

Exercise 4.4.3: Consider the following context-free grammar defined in EBNF from Programming Exercise 3.3.6:

<program> ::= <program> <stmtlist> | <stmtlist> <stmtlist> ::= <stmtlist> <stmt> ; | <stmt> ;

<stmt> ::= <expr> | print <expr> <stmt> ::= <variable> = <expr> <expr> ::= <integer> | <var> <expr> ::= − <expr> <expr> ::= <expr> + <expr> <expr> ::= <expr> − <expr> <expr> ::= <expr> * <expr> <expr> ::= <expr> / <expr> <expr> ::= <expr> ˆ <expr> <expr> ::= (<expr>)

<integer> ::= 1 | 2 | 3 | . . . |∞ <variable> ::= a | b | c | . . . | z

Complete Programming Exercise 4.4.2 (parts a–l) using this grammar sub- ject to all of the requirements given in that exercise.

Exercise 4.4.4: Consider the following context-free grammar defined in EBNF from Programming Exercise 3.3.7:

CO NF

ID EN

TI AL

D RA

FT144 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION<program> ::= <stmtlist> . <stmtlist> ::= <stmtlist> <stmt> | <stmt>

<stmt> ::= ; | print <expr> ; <stmt> ::= <var> = <expr> ; <stmt> ::= while (<expr>) <stmt> <stmt> ::= if (<expr>) <stmt> [else <stmt>] <stmt> ::= { <stmt list> }

<stmt list> ::= <stmt> | <stmt list> <stmt> <expr> ::= <integer> | <var> | − <expr> <expr> ::= <expr> + <expr> | <expr> − <expr> <expr> ::= <expr> * <expr> | <expr> / <expr> <expr> ::= <expr> < <expr> | <expr> > <expr> <expr> ::= <expr> <= <expr> | <expr> >= <expr> <expr> ::= <expr> == <expr> | <expr> != <expr> <expr> ::= <expr> ˆ <expr> | (<expr>) <expr> ::= (<expr>)

<integer> ::= 1 | 2 | 3 | . . . |∞ <variable> ::= a | b | c | . . . | z

Complete Programming Exercise 4.4.2 (parts a–l) using this grammar sub- ject to all of the requirements given in that exercise.

Exercise 4.4.5: Consider the following context-free grammar defined in BNF (not EBNF) from Programming Exercise 3.3.10:

<expr> ::= <expr> & <expr> <expr> ::= <expr> | <expr> <expr> ::= ∼ <expr> <expr> ::= <literal>

<literal> ::= t <literal> ::= f

where t, f, |, &, and∼ are terminals that represent true, false, or, and, and not, respectively. Thus, sentences in the language defined by this grammar represent logical expressions that can evaluate to true or false.

Complete Programming Exercise 4.4.2 (parts a–l) using this grammar sub- ject to all of the requirements given in that exercise.

CO NF

ID EN

TI AL

D RA

FT4.4. PROGRAMMING EXERCISES FOR CHAPTER 4 145Specifically, build a parser and an interpreter to evaluate and determine the order in which operators of a logical expression are evaluated. Normal precedence rules hold: ∼ has the highest, & has the second highest, and | has the lowest. Assume left-to-right associativity.

The following is a sample interactive session with the pure interpreter:

> f | t & f | ˜t

false

> ˜t | t | ˜f & ˜f & t & ˜t | f

true

> f | t ; f | ˜t

"f | t ; f | ˜t" contains invalid lexemes and, thus, is not an expression.

> f | t & & f | ˜t

"f | t & & f | ˜t" is not an expression.

The following is a sample interactive session with the diagrammed inter- preter:

> f | t & f | ˜t

(((f) | ((t) & (f))) | (˜(t))) is false.

> ˜t | t | ˜f & ˜f & t & ˜t | f

((((˜(t)) | (t)) | ((((˜(f)) & (˜(f))) & (t)) & (˜(t)))) | (f)) is true.

> f | t ; f

"f | t ; f" contains invalid lexemes and, thus, is not an expression.

> f | | t & ˜t

"f | | t & ˜t" is not an expression.

The following is a sample interactive session with the parentheses-for- operator-precedence interpreter:

> f | t & f | ˜t

((f | (t & f)) | (˜t)) is false.

> ˜t | t | ˜f & ˜f & t & ˜t | f

((((˜t) | t) | ((((˜f) & (˜f)) & t) & (˜t))) | f) is true.

> f | t ; f

"f | t ; f" contains invalid lexemes and, thus, is not an expression.

> f | | t & ˜t

"f | | t & ˜t" is not an expression.

Exercise 4.4.6: Consider the following context-free grammar defined in BNF (not EBNF) from Programming Exercise 3.3.11:

CO NF

ID EN

TI AL

D RA

FT146 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION<program> ::= (<declarations>, <expr>) <declarations> ::= [] <declarations> ::= [ <varlist> ]

<varlist> ::= <var> <varlist> ::= <var>, <varlist> <expr> ::= <expr> & <expr> <expr> ::= <expr> | <expr> <expr> ::= ∼ <expr> <expr> ::= <literal> <expr> ::= <var>

<literal> ::= t <literal> ::= f

<var> ::= a . . .e <var> ::= g . . .s <var> ::= u . . .z

where t, f, |, &, and ∼ are terminals that represent true, false, or, and, and not, respectively, and all lower case letters except for f and t are terminals each representing a variable. Each variable in the variable list is bound to true in the expression. Any variable used in any expression not contained in the variable list is assumed to be false. Thus, programs in the language defined by this grammar represent logical expressions, which can contain variables, that can evaluate to true or false.

Complete Programming Exercise 4.4.2 (parts a–d and i–l) using this gram- mar subject to all of the requirements given in that exercise.

Specifically, build a parser and an interpreter to evaluate and determine the order in which operators of a logical expression with variables are eval- uated. Normal precedence rules hold: ∼ has the highest, & has the second highest, and | has the lowest. Assume left-to-right associativity. The following is a sample interactive session with the pure interpreter:

> ([], f | t & f | ˜t)

false

> ([], ˜t | t | ˜f & ˜f & t & ˜t | f)

true

> ([p,q], ˜t | p | ˜e & ˜f & t & ˜q | r)

true

CO NF

ID EN

TI AL

D RA

FT4.5. PROGRAMMING PROJECT FOR CHAPTER 4 147> ([], f | t ; f) "([], f | t ; f)" contains invalid lexemes and, thus, is not a program.

> ([], f | | t & ˜t)

"([], f | | t & ˜t)" is not a program.

The following is a sample interactive session with the parentheses-for- operator-precedence interpreter:

> ([], f | t & f | ˜t)

((f | (t & f)) | (˜t)) is false.

> ([], ˜t | t | ˜f & ˜f & t & ˜t | f)

((((˜t) | t) | ((((˜f) & (˜f)) & t) & (˜t))) | f) is true.

> ([p,q], ˜t | p | ˜e & ˜f & t & ˜q | r)

((((˜t) | p) | ((((˜e) & (˜f)) & t) & (˜q))) | r) is true.

> ([], f | t ; f)

"([], f | t ; f)" contains invalid lexemes and, thus, is not a program.

> ([], f | | t & ˜t)

"([], f | | t & ˜t)" is not a program.

Notice that this language is context-sensitive because variables must be declared before used (e.g., ([a], b | t) is syntactically, but not seman- tically, valid).

4.5 Programming Project for Chapter 4

Putting It All Together Build an interpreter and a compiler to C++ for the language BOOLexp.

BOOLexp programs are defined by the context-free grammar in BNF (not EBNF) given in Programming Exercise 4.5.6.

Factor your system into the following three components:

• Front End (i.e., recursive-descent or shift-reduce parser; solution to Programming Exercise 3.3.11 part a or b augmented with a parse tree; see left side of Fig. 4.1)

• Interpreter (i.e., expression evaluator; see right side of Fig. 4.1)

• Compiler (i.e., translator; see middle of Fig. 4.2)

CO NF

ID EN

TI AL

D RA

FT148 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATIONThe first two components constitute the solution to Programming Exer- cise 4.4.6 (part k or l).

The general approach to this problem is to build a parse tree for each program and then implement two traversals of the tree: one traversal eval- uates the expression as it walks the tree (the interpreter component) and the other generates C++ code as it walks the tree (the compiler compo- nent).

The following is sample input and output for the interpreter (i.e., ex- pression evaluator) (> is simply the prompt for input and will be the empty string in your system).

> ([], f | t & f | ˜t)

((f | (t & f)) | (˜t)) is false.

> ([p,q], ˜t | p | ˜e & ˜f & t & ˜q | r)

((((˜t) | p) | ((((˜e) & (˜f)) & t) & (˜q))) | r) is true.

Notice that, akin to Programming Exercise 4.4.6 (part k or l), when inter- preting a BOOLexp program you must not only evaluate the logical ex- pression (the first element of the program pair) but also determine the or- der in which operators of it are evaluated and illustrate that order in the diagrammed output.

When compiling a BOOLexp program to C++ you must generate a C++ program with equivalent semantics as the BOOLexp program.

Requirements:

a) Use flex and bison to develop the front end of your system (i.e., scan- ner and parser, respectively).

b) Implement a -i option indicating to only interpret and a -c option in- dicating to only compile. If no command line options are given, then interpret and compile. Alternatively, generate two separate executa- bles: one for the interpreter and one for the compiler. Only the the first approach is demonstrated below.

c) Your program must read from standard input and write to standard output. Specifically, your program must read a set of expressions from standard input (one per line) and write the corresponding parenthe- sized expressions (also one per line, in the format used above) to stan- dard output. When compiling, the compiled programs are written to files, rather than standard output.

CO NF

ID EN

TI AL

D RA

FT4.5. PROGRAMMING PROJECT FOR CHAPTER 4 149d) Free all memory that you explicitly allocated from the heap. Specifi- cally, free the entire parse tree which means you must free each node, and for internal (operator) nodes you must free the buffer which stores the pointers to its children, if used.

e) The C++ programs you translate to must compile with g++ without er- rors or warnings.

f) Write a Makefile that builds your system (interpreter and compiler) as indicated in Programming Exercise 3.3.3.b.

Sample Test Data Sample standard input is available at http://perugini.cps.

udayton.edu/teaching/books/PL/www/files/boolexpstdin.

txt and sample standard output is available at http:// perugini.cps.udayton.edu/teaching/books/PL/www/files/

boolexpstdout.txt. A sample test session with boolexp on that data is available at http://perugini.cps.udayton.edu/teaching/ books/PL/www/files/boolexptestsession.txt.These test cases are not exhaustive. There is also a reference boolexp executable solu- tion for this system available at http://perugini.cps.udayton. edu/teaching/books/PL/www/files/boolexp. This sample test data with the reference executable is bundled and available at http:// perugini.cps.udayton.edu/teaching/books/PL/www/files/

boolexpdata.tar. The following is a sample input and output for the interpreter (only)

(> is simply the prompt for input and will be the empty string in your system).

$ ./boolexp -i

> ([] , f | t & f | ˜ t)

((f | (t & f)) | (˜t)) is false.

> ([p], f | t & f | ˜p)

((f | (t & f)) | (˜p)) is false.

> ([] , f | t | f & t | f | t & t & t | ˜ t)

(((((f | t) | (f & t)) | f) | ((t & t) & t)) | (˜t)) is true.

> ([p, q], ˜t | p | ˜e & ˜f & t & ˜q | r)

((((˜t) | p) | ((((˜e) & (˜f)) & t) & (˜q))) | r) is true.

> ([] , t & f & t | ˜ t & ˜ f & ˜ f | f & t & ˜ t)

((((t & f) & t) | (((˜t) & (˜f)) & (˜f))) | ((f & t) & (˜t))) is false.

> ([] , t & f | t & f | t & f | f & ˜ t | f)

CO NF

ID EN

TI AL

D RA

FT150 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION(((((t & f) | (t & f)) | (t & f)) | (f & (˜t))) | f) is false. > ([], t & t & ˜ f | f & ˜ t | ˜ t & f)

((((t & t) & (˜f)) | (f & (˜t))) | ((˜t) & f)) is true.

> ([ ], t & t | ˜ f & ˜ f | t & f | ˜ t)

((((t & t) | ((˜f) & (˜f))) | (t & f)) | (˜t)) is true.

> ([a,b,c], a & ˜ f & ˜ f & b | ˜ t | c)

(((((a & (˜f)) & (˜f)) & b) | (˜t)) | c) is true.

> ([], t & ˜ f & ˜ t | ˜ f & ˜ t & t)

(((t & (˜f)) & (˜t)) | (((˜f) & (˜t)) & t)) is false.

> ([], t & ˜ f | t & ˜ f)

((t & (˜f)) | (t & (˜f))) is true.

> ([], t | f | t & f | t | ˜ t & t | f)

(((((t | f) | (t & f)) | t) | ((˜t) & t)) | f) is true.

> ([], ˜ f & t & ˜ t | ˜ f | t & ˜ f)

(((((˜f) & t) & (˜t)) | (˜f)) | (t & (˜f))) is true.

> ([],˜ t | ˜ f | ˜ t & ˜ f & f & ˜ t)

(((˜t) | (˜f)) | ((((˜t) & (˜f)) & f) & (˜t))) is true.

> ([x,y], ˜x | t | ˜z & ˜f & y & ˜y | f)

((((˜x) | t) | ((((˜z) & (˜f)) & y) & (˜y))) | f) is true.

> ([],˜t|˜f&˜t|˜t&˜f|˜t&˜t)

((((˜t) | ((˜f) & (˜t))) | ((˜t) & (˜f))) | ((˜t) & (˜t))) is false.

> ˆD

$

The following is a sample interactive test session for the system (inter- preter and compiler):

$ ./boolexp

> ([p, q], ˜t | p | ˜e & ˜f & t & ˜q | r)

((((˜t) | p) | ((((˜e) & (˜f)) & t) & (˜q))) | r) is true.

> ([] , t & f & t | ˜ t & ˜ f & ˜ f | f & t & ˜ t)

((((t & f) & t) | (((˜t) & (˜f)) & (˜f))) | ((f & t) & (˜t))) is false.

> ([] , t & f | t & f | t & f | f & ˜ t | f)

(((((t & f) | (t & f)) | (t & f)) | (f & (˜t))) | f) is false.

> ([], t & t & ˜ f | f & ˜ t | ˜ t & f)

((((t & t) & (˜f)) | (f & (˜t))) | ((˜t) & f)) is true.

ˆD

$

$ cat 1.cpp

#include<iostream>

using namespace std;

main() {

bool p = true;

CO NF

ID EN

TI AL

D RA

FT4.5. PROGRAMMING PROJECT FOR CHAPTER 4 151bool q = true; bool e = false;

bool r = false;

bool result = !true || p || !e && !false & true && !q || r;

cout << "The result is ";

if (result)

cout << "true";

else

cout << "false";

cout << "." << endl;

}

$

$ cat 4.cpp

#include<iostream>

using namespace std;

main() {

bool result = true && true && !false || false && !true || !true & false;

cout << "The result is ";

if (result)

cout << "true";

else

cout << "false";

cout << "." << endl;

}

$

$ ./boolexp

> ([ ], t & t | ˜ f & ˜ f | t & f | ˜ t)

((((t & t) | ((˜f) & (˜f))) | (t & f)) | (˜t)) is true.

> ([a,b,c], a & ˜ f & ˜ f & b | ˜ t | c)

(((((a & (˜f)) & (˜f)) & b) | (˜t)) | c) is true.

> ([], t & ˜ f & ˜ t | ˜ f & ˜ t & t)

(((t & (˜f)) & (˜t)) | (((˜f) & (˜t)) & t)) is false.

> ([], t & ˜ f | t & ˜ f)

((t & (˜f)) | (t & (˜f))) is true.

> ([], t | f | t & f | t | ˜ t & t | f)

CO NF

ID EN

TI AL

D RA

FT152 CHAPTER 4. PROGRAMMING LANGUAGE IMPLEMENTATION(((((t | f) | (t & f)) | t) | ((˜t) & t)) | f) is true. > ([], ˜ f & t & ˜ t | ˜ f | t & ˜ f)

(((((˜f) & t) & (˜t)) | (˜f)) | (t & (˜f))) is true.

> ([],˜ t | ˜ f | ˜ t & ˜ f & f & ˜ t)

(((˜t) | (˜f)) | ((((˜t) & (˜f)) & f) & (˜t))) is true.

ˆD

$

$ ./boolexp -c

> ([x,y], ˜x | t | ˜z & ˜f & y & ˜y | f)

$

$ ./boolexp -ci

> ([],˜t|˜f&˜t|˜t&˜f|˜t&˜t)

((((˜t) | ((˜f) & (˜t))) | ((˜t) & (˜f))) | ((˜t) & (˜t))) is false.

ˆD

$

$ cat 1.cpp

#include<iostream>

using namespace std;

main() {

bool result = !true || !false && !true || !true && !false || !true && !true;

cout << "The result is ";

if (result)

cout << "true";

else

cout << "false";

cout << "." << endl;

}

4.6 Key Terms

Abstraction, agile methods, ambiguity, ambiguous grammar, assembler, assembly code, associativity, attribute grammars, back end, Backus-Naur Form (BNF) bottom-up parsing, Chomsky hierarchy, compilation, com- piler, context-free grammar, context-free language, context-sensitive gram- mar, context-sensitive language, derivation, dynamic semantics, Extended Backus-Naur Form (EBNF), extensionally, extreme programming, fixed- format languages, formal grammar, formal language, fourth-generation

CO NF

ID EN

TI AL

D RA

FT4.7. THEMATIC TAKE-AWAYS 153languages, free-format languages, front end, grammar, handle, intension- ally, interpretation, interpreter, just-in-time compilation, just-in-time im- plementation system, keywords, language implementation system, lex- eme, lexer, lexical analysis, lexical analyzer, linker, linear-bounded au- tomata, machine code, Moore’s Law, non-terminal, parser, parse tree, pars- ing, precedence, processor, production rule, programming language im- plementation system, pushdown automata, recursive-descent parsing, re- cursively-enumerable languages, recursive languages, reduce-reduce con- flict, regular expression regular grammar, regular language, reserved words, scanner, semantic analysis, semantics, sentence, sentential form, shift-reduce parsing, software crisis, static semantics, structured program- ming, syntactic analysis, syntactic analyzer, syntax, terminal, token, top- down parsing, Turing machine, virtual machine,

4.7 Thematic Take-Aways

4.8 Chapter Summary

4.9 Bibliographic Notes

We refer the reader to [Seb10, Ch. 4] for a detailed discussion of lexical and syntactic analysis from a scanner and parser implementation perspective. PLPP?