Compilers(Question about Computer Science)
Dr. Ernesto Gomez : CSE 570/670
1. Miscellaneous updates
These will be incorporated into Lecture Notes 1-4, where these topics are covered
1.1. Top down parsing. I have just found a reference on modern top-down pars- ing methods : https://www.sanity.io/blog/why-we-wrote-yet-another-parser-compiler. Have not had a chance to review in detail, but at first glance this looks good - it is also very new, the site went up in December 2019. It links to a selection of the papers where the techniques are developed, and it introduces a parser generator equivalent to Yacc for topdown parsing. This material will probably be covered in class next year, but you can start looking at it now - If your work in the future involves compilers or any large appication that incorporates a language, you will need to know about this.
1.2. Non-determinism and Finite Automata (this text has been incorpo- rated in Lecture Notes 3. In Lecture Notes 3, we suggested an approach to constructing a deterministic finite automaton from a non-deterministic one using depth-first search and backtracking, with a stack. The text book constructs (chap- ter 3) a detailed algorithm, first for converting a non-deterministic finite automaton (NFA) to a deterministic (DFA)y, and then another algorithm to minimize the re- sulting DFA to the smallest possible DFA that accepts the same language. We then go full circle by converting the DFA to a regular expression - this whole sequence serves as a proof that Regular Expressions (RE) are equivalent to NFA, and that NFA are equivalent to DFA. That is, RE, DFA and NFA can recognize the same set of languages.
Further, the ability to minimize a DFA gives us the ability to determine if any pair of language definitions (RE, DFA or NFA) accept the same language - if two RE, or two FA minimize to the same DFA, then they accept the same language - that is, they mean the same thing.
All of this is of great theoretical interest but very little practical interest. The problem is, the algorithm to convert NFA to DFA has exponential complexity. Therefore the conversion can only be done for small languages and automata - the amount of work required for a more complex automaton or expression grows to fast. (The same applies to the backtracking algorithm we suggested in Notes 3 - it is a backtracking algorithm, and such algorithms also have exponential complexity).
1.3. Lex and Yacc. The front end to the compiler uses a https://www.sa nity.io/blog/why- we-wrote-yet-another-parser-compilerlexical analyzer, built on top of finite automata, defined by regular expressions. An example of this is described in 5.7, Lecture Notes 3. You can find a RE that recognizes the format of integers, so it can read text and pick out integers, get their value, and report this to the parse program. We do this because that means we can define something like: “addition → number+number” and we don’t have to define the details of the number format when we define what addition looks like. We have tools : Lex and Yacc (and other compatible programs that do the same thing), we submit a list of regular expressions to Lex, and it builds a scanner program (in C or C++) which can identify text that matches the expressions, and then pass this to Yacc.
Yacc in turn has a description of the language syntax (a grammar), this descrip- tion is a lot easier to write because we can do it with higher level things (integer,
1
2
float, identifier, ...) rather than have to describe what a number looks like each time we need to use it in a definition. Yacc generates a parser from the description of the language syntax and the code from Lex, and ouputs a C or C++ parser program that we can guarantee matches the language definition (we will see the theory behind this and understand how Lex and Yacc work in chapters 3, 4 and 5 of the text).
1.3.1. Problem : a Lex specification is a lhttps://www.sa nity.io/blog/why-we-wrote- yet-another-parser-compilerarge NFA. We have many things we want Lex to iden- tify - integers, floating-point, names of variables and functions, keywords, arithmetic operators ...
We write RE for each one ot the things we want identified, these are converted to DFA (deterministic finite automata) using the techniqjes of Chapter 3. We may have, for example, 50 different defintions resulting in 50 DFAs. To analyze the text, and identify all 50 things we specify, we can immediately OR all the individual DFA, creating a giant NFA - we would get a start state with 50 possible paths out of it to the automata for each specific DFA. A non-deterministic machine just picks the right path and does it.
We don’t know what the right path is, however. We know it is possible to convert an NFA to a DFA, but this is not practical because the conversion algorithm has exponential complexity (also - the DFA it generates is much larger - even exponentially larger than the original NFA). We could use a backtracking algorithm - this tries one path, if it works, it accepts, otherwise it goes back to the beginning and takes another path. This also has exponential complexity, it is not practical.
If we could do all 50 alternatives in parallel, we would still be doing exponential work, but we would finish quickly. Consider the NFA we have specified as a tree, the start node is the root, and it has 50 paths spreading out from the root. We still have to do all 50 alternatives, but if we can do them all concurrently, the time is just what it takes to get from the root to a leaf node (recall from your data structures class, if the number of nodes in a tree is N, the height of the tree is log(N) - given the height h, the number of nodes is exp(h) - so the work to traverse the tree is still exponential, but doing it in parallel it takes time h). Ideally we would need 50 threads for this case, to get real speedup we would need to have a separate CPU core for each thread.
Note: In the DFA examples we have seen, we submit a string to the automaton, if there is any text after you reach an accept state, usually the automaton crashes and rejects. If we are submitting program text, however, we have to reach accept and then reset and continue reading to see what comes next. Some languages make this easy by saying that whitespace is needed to separate different items - you hit a space and you know this is a separator, and jut have to check if you were in an accept state. Many/most languages don’t do this - x1=act+5; is perfectly legitimate C++, you read the x1, notice that = is not a legitimate character in a variable and accept “x1” as a name, then identify “=” as an assignment. C++ is pretty good about forbidding operator and punctuation symbols in names, but not all languages are.
We can simulate parallelism on a single core, with timeslicing. This is what the operating system does tyolso simulate multiple programs per core running at the same time - it allocates something like 100 microseconds to a thread, swaps it out and allocates the same time to another thread, until it has done all 50 and then
3
returns to the first one. All 50 hreads get to advance a little bit every 5 miliseconds, so on a human scale it looks like they are all advancing together. What happens if several threads accept the string we are trying to identify? A concrete example: “if” is a keyword. “ifblue” is a perfectly good variable name. In a language like Fortran, which does not require spaces between different things, the thread looking for “if” keyword will accept, and so will the thread that decides “ifblue” is the name of something. We need a tiebreaker - for example a rule that says whoever accepts last wins. This is hard to enforce unless threads are actually all advancing at the same rate - we can’t do this (we can come close on some thing like a GPU, but it is provable that this can not be done on any real machine becaue Physics! Take my parallel algorithms course next time it is offered, or I can send you a copy of my Algebra of Synchronization paper) .
1.3.2. Simulating parallelism Miscellaneous updates. This is what Lex does to iden- tify strings in text. It builds ONE non-deterministic finite automaton by implicitly OR-ing each of many definitions. An important detail - the entire Lex specification, which is the parallel simulation we have described here, defines a SINGLE NFA! We do not allow a single DFA to reject and start immediately processing new text, the concept is, read one character, submit it to every DFA in turn, Any DFA that rejects waits until all the others have also halted. We select a single result, and only then do we reset all the DFA to their respective start states, read the next character and start to identify the next item.
1.3.3. Parallelism in Lex. It turns our, however, that we can execute an actual non-deterministic automaton by simulating parallelism, in such a way that it looks like all the automata we are trying to run in parallel are advancing together. First we list all the automaton in a fixed order (this happens more or less automatically, writing the RE for each DFA on a separate line lists them in order, textually from top to bottom.
Then: read a single character from the input. (This is important - it is tempting to speed things up by reading multiple characters - it doesn’t work, I’ve tried!). Submit that character, in turn, to every automaton in your OR statement. Each automaton takes one step. Some will reject immediately - if your character is the letter “e”, then automata looking for numbers or arithmetic operators reject immediately. Automata looking for identifiers and keywords continue. We keep going as long as any automaton on the list is still working. Suppose we are looking at the string “elsered” - the DFA looking for the keyword “else” will accept when we find the second e, but the DFA that is looking for identifiers will keep going. Our tiebreaker here is, “the DFA that matches the longest string wins” - in this case, by the time we reach the letter “r”, only the DFA checking for identifiers is still running - so it halts, accepts and reports an identifier with the value “elsered”
Suppose the text was “else {“. both the identifier and the keyword DFA accept when the reach the space, since spaces are not allowed in a name. We have a tie on the same string - what we do is, list the DFA in order of priority - whoever appears first on the list wins. We write the more specific DFA (like the one that only recognizes “else”) before the more general one that recognizes identifiers.
This is what Lex does to identify strings in text. It builds ONE non-deterministic finite automaton by implicitly OR-ing each of many definitions. An important detail - the entire Lex specification, which is the parallel simulation we have described
4
here, defines a SINGLE NFA! We do not allow a single DFA to reject and start immediately processing new text, the concept is, read one character, submit it to every DFA in turn, Any DFA that rejects waits until all the others have also halted. We select a single result, and only then do we reset all the DFA to their respective start states, read the next character and start to identify the next item.