test
*
Languages and Automata
*
Section 11.1 Regular Languages
Problem: Suppose the input strings to a program must be strings over the alphabet {a, b}
that contain exactly one substring bb. In other words, the strings must be of the form xbby,
where x and y are strings over {a, b} that do not contain bb, x does not end in b, and y does
not begin with b. In a few minutes, we’ll see how to describe the strings formally.
A regular language over alphabet A is a language constructed by the following rules:
- and {} are regular languages.
- {a} is a regular language for all a A.
- If M and N are regular language, then so are M N, MN, and M*.
Example. Let A = {a, b}. Then the following languages are a sampling of the regular
languages over A:
, {}, {a}, {b}, {a, b}, {ab}, {a}* = {, a, aa, aaa, …, an, … }.
A regular expression over alphabet A is an expression constructed by the following rules:
- and are regular expressions.
- a is a regular expression for all a A.
- If R and S are regular expressions, then so are (R), R + S, R·S, and R*.
The hierarchy in the absence of parentheses is, * (do it first), ·, + (do it last). Juxtaposition will be used in place of ·.
Example. Let A = {a, b}. Then the following expressions are a sampling of the regular
expressions over A:
, , a, b, ab, a + ab, (a + b)*.
*
*
Regular expressions represent regular languages
Regular expressons represent regular languages by the following correspondence, where
L(R) denotes the regular language of the regular expression R.
L() = ,
L() = {},
L(a) = {a} for all a A,
L(R + S) = L(R) L(S),
L(RS) = L(R)L(S),
L(R*) = L(R)*.
Example. The regular expression ab + a* represents the following regular language:
L(ab + a*) = L(ab) L(a*)
= L(a)L(b) L(a)*
= {a}{b} {a}*
= {ab} {, a, aa, aaa, …, an, … }
={ab, , a, aa, aaa, …, an, … }.
Example. The regular expression (a + b)* represents the following regular language:
L((a + b)*) = (L(a + b))* = {a, b}*, the set of all possible strings over {a, b}.
Back to the Problem: Suppose the input strings to a program must be strings over the
alphabet {a, b} that contain exactly one substring bb. In other words, the strings must be of
the form xbby, where x and y are strings over {a, b} that do not contain bb, x does not end
in b, and y does not begin with b. How can we describe the set of inputs formally?
Solution: let x = (a + ba)* and y = (a + ab)*.
*
Quiz. Find a regular expresson for {abn | n N} {ban | n N}.
Answer. ab* + ba*.
Quiz. Use a sentence to describe the language of (b + ab)*( + a).
Answer. All strings over {a, b} whose substrings of a’s have length 1.
The Algebra of Regular Expressions
Equality: Regular expressions R and S are equal, written R = S, when L(R) = L(S).
Examples. a + b = b + a, a + a = a, aa* = a*a, ab ≠ ba.
Properties of +, ·, and closure
+ is commutative, associative, is identity for +, and R + R = R.
· is associative, is identity for ·, and is zero for ·.
· distributes over +
(closure properties)
* = * = .
R* = R*R* = (R*)* = R + R*.
R* = + R* = + RR* = ( + R)* = ( + R)R*.
R* = (R + R2 + … + Rk)* = + R + R2 + … + Rk–1 + RkR* for any k ≥ 1.
R*R = RR*.
(R + S)* = (R* + S*)* = (R*S*)* = (R*S)*R* = R*(SR*)*.
R(SR)* = (RS)*R.
(R*S)* = + (R + S)*S and (RS*)* = + R(R + S)*.
Proof: All properties can be verified by showing that the underlying regular languages are
equal as sets. QED.
*
Quiz. Explain each inequality.
(1). (a + b)* ≠ a* + b*. (2) (a + b)* ≠ a*b*.
Answers. (1) ab L((a + b)*) – L(a* + b*).
(2) ba L((a + b)*) – L(a*b*).
Quiz. Simplify the regular expression aa(b* + a) + a(ab* + aa).
Answer. aa(b* + a) + a(ab* + aa)
= aa(b* + a) + aa(b* + a) · distributes over +
= aa(b* + a) R + R = R.
Example/Quiz. Show that (a + aa)(a + b)* = a(a + b)*.
Proof: (a + aa)(a + b)* = (a + aa)a*(ba*)* (R + S)* = R*(SR*)*
= a( + a)a*(ba*)* R = R and · distributes over +
= aa*(ba*)* ( + R)R* = R*
= a(a + b)* (R + S)* = R*(SR*)* QED.
Example/Quiz. Show that a*(b + ab*) = b + aa*b*.
Proof: a*(b + ab*) = ( + aa*)(b + ab*) R* = + RR*
= b + ab* + aa*b + aa*ab* · distributes over +
= b + (ab* + aa*ab*) + aa*b + is commutative and associative
= b + ( + aa*)ab* + aa*b R = R and · distributes over +
= b + a*ab* + aa*b R* = + RR*
= b + aa*b* + aa*b R*R = RR*
= b + aa*(b* + b) · distributes over +
= b + aa*b*. R* = R* + R QED.
*
Example. Show that a* + abb*a = a* + ab*a.
Proof: Starting on the RHS, we have
a* + ab*a = a* + a( + bb*)a R* = + RR*
= a* + aa + abb*a · distributes over +
= ( + aa*) + aa + abb*a R* = + RR*
= + (aa* + aa) + abb*a + is associative
= + a(a* + a) + abb*a · distributes over +
= + aa* + abb*a R* = R + R*
= a* + abb*a R* = + RR* QED.
Example. Show that (a + aa + … + an)(a + b)* = a(a + b)* for all n ≥ 1.
Proof (by induction): If n = 1, the statement becomes a(a + b)* = a(a + b)*, which is true. If
n = 2, the statement becomes (a + aa)(a + b)* = a(a + b)*, which is true by a previous
example. Let n > 2 and assume the statement is true for 1 ≤ k < n. We need to prove the
statement is true for n. The LHS of the statement for n is
(a + aa + … + an)(a + b)*
= a(a + b)* + (aa + … + an)(a + b)* · distributes over +
= a(a + b)* + a(a + aa + … + an–1)(a + b)* · distributes over +
= a(a + b)* + aa(a + b)* induction assumption
= (a + aa)(a + b)* · distributes over +
= a(a + b)* induction assumption
The last line is the RHS of the statement for n. So the statement is true for n. Therefore, the
statement is true for all n ≥ 1. QED.
*
Example/Quiz: Use regular algebra to show that R + (R + S)* = (R + S)*.
Proof:
R + (R + S)* = R + [L + (R + S) + (R + S)(R+ S)*] R* = + R + R2 + … + Rk–1 + RkR*
= L + (R + R) + S + (R + S)(R+ S)* + is associative
= L + R + S + (R + S)(R+ S)* R + R = R
= L + (R + S) + (R + S)(R+ S)* + is associative
= (R+ S)* R* = + R + R2 + … + Rk–1 + RkR*
QED.
Take-home quiz: Use regular algebra to show that (a + ab)*a = a(a + ba)*.
*
Section 11.2 Finite Automata
Can a machine(i.e., algorithm) recognize a regular language? Yes!
Deterministic Finite Automata
A deterministic finite automaton (DFA) over an alphabet A is a finite digraph (where
vertices or nodes are called states) for which each state emits one labeled edge for each
letter of A. One state is designated as the start state and a set of states may be final states.
Example. Either of the following alternatives is acceptable for representing a DFA, where
final states are indicated by double circles.
The Execution of DFA for input string w A* begins at the start state and follows a path
whose edges concatenate to w. The DFA accepts w if the path ends in a final state.
Otherwise the DFA rejects w. The language of a DFA is the set of accepted strings.
Example. The example DFA accepts the strings
a, b, ab, bb, abb, bbb, …, abn, bbn, ….
So the language of the DFA is given by the regular expression (a + b)b*.
Start
0
1
2
a
a
b
b
b
a
Start
0
1
2
a, b
a
b
a, b
*
*
Quiz. Find a DFA for the language of a + aa*b.
Theorem (Kleene) The class of regular languages is exactly the same as the class of
languages accepted by DFAs.
Quiz. Find an DFA for each of the following languages over the alphabet {a, b}.
(a) . (b) {}. (c) {(ab)n | n N}, which has regular expression (ab)*.
Start
0
1
4
a, b
a
a
a, b
b
a
b
b
2
3
Solution:
Start
a, b
a, b
(b):
Start
a, b
Solutions: (a):
Start
b
a, b
a
a
b
(c):
*
*
Table Representation of a DFA
A DFA over A can be represented by a transition function T : States A States, where
T(i, a) is the state reached from state i along the edge labeled a, and we mark the start and
final states. For example, the following figures show a DFA and its transition table.
Note: T can be extended to T : States A* States by
T(i, L) = i and T(i, aw) = T(T(i, a), w) for a A and w A*.
Quiz: Calculate T(0, bba).
Solution: T(0, bba) = T(1, ba) = T(1, a) = T(2, L) = 2.
Example/Quiz. Back to the problem of describing input strings over {a, b} that contain
exactly one substring bb. We observed that the strings could be described by the regular
expression (a + ba)*bb(a + ab)*. Find a DFA to recognize the language.
Start
0
1
2
a, b
a
b
a, b
Start
a
a
b
a
b
A solution:
b
a
b
a, b
*
*
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) over an alphabet A is similar to a DFA except
that L-edges are allowed, there is no requirement to emit edges from a state, and multiple
edges with the same letter can be emitted from a state.
Example. The following NFA recognizes the language of a + aa*b + a*b.
Table representation of NFA
An NFA over A can be represented by a function T : States A {L} power(States),
where T(i, a) is the set of states reached from state i along the edge labeled a, and we mark
the start and final states. The following figure shows the table for the preceding NFA.
Start
0
2
1
a
a
L
a
b
*
*
Theorem (Rabin and Scott) The class of regular languages is exactly the same as the
class of languages accepted by NFAs.
Quizzes. Find an NFA for each of the following languages over {a, b}.
(a) . (b) {}. (c) {(ab)n | n N}, which has regular expression (ab)*.
ExampleQuiz. Back to the problem of describing input strings over {a, b} that contain
exactly one substring bb. We observed that the strings could be described by the regular
expression (a + ba)*bb(a + ab)*. Find an NFA to recognize the language.
Start
Solutions: (a):
Start
(b):
Start
b
a
(c):
Start
a
b
a
b
b
a
a
b
A solution:
*
*
Algorithm: Transform a Regular Expression into a Finite Automaton
Start by placing the regular expression on the edge between a start and final state:
Apply the following rules to obtain a finite automaton after erasing any -edges.
Quiz. Use the algorithm to construct a finite automaton for (ab)* + ba.
Start
Regular expression
R + S
i
j
R
i
j
S
transforms to
RS
i
j
i
j
R
S
transforms to
R*
i
j
i
j
L
L
R
transforms to
Start
a
b
b
a
L
L
Answer:
*
*
Algorithm: Transform a Finite Automaton into a Regular Expression
Connect a new start state s to the start state of the FA and connect each final state of the FA
to a new final state ƒ as shown in the figure.
If needed, combine all multiple edges between the same two nodes into one edge with label
the sum of the labels on the multiple edges. If there is no edge between two states, assume
there is an -edge.
Now eliminate each state k of the FA by constructing a new edge (i, j) for each pair of edges
(i, k) and (k, j) where i ≠ k and j ≠ k. The new label new(i, j) is defined in terms of the old
labels by the formula
new(i, j) = old(i, j) + old(i, k)old(k, k)*old(k, j)
L
L
Given
FA
Connect to start state of FA
s
ƒ
Connect from each final state of FA
D
A
C
B
i
k
j
A + BC*D
i
j
Example.
becomes
*
*
Quiz. Use the algorithm to transform the following NFA into a regular expression. (Half the
class eliminate state 0 then state 1 and half the class eliminate state 1 then state 0.)
Solution: Connect the NFA to new a start
state s and a new final state f as pictured.
First Solution: Eliminate state 0 to obtain:
new(s, 1) = + La*a = a*a.
new(1, 1) = + ba*a = ba*a.
Second Solution: Eliminate state 1 to obtain:
new(0, f) = + a*L = a.
new(0, 0) = a + a*b = a + ab.
Eliminate state 1 to obtain:
new(s, f) = + a*a(ba*a)*L = a*a(ba*a)*.
Eliminate state 0 to obtain:
new(s, f) = + L(a + ab)*a = (a + ab)*a.
Start
a
a
b
0
1
a
a
b
0
1
s
ƒ
L
L
s
ƒ
1
a*a
ba*a
L
s
ƒ
a*a(ba*a)*
s
ƒ
0
L
a + ab
a
s
ƒ
(a + ab)*a
*
*
Quiz. Use regular algebra to show the following equality from the previous quiz.
a*a(ba*a)* = (a + ab)*a.
Proof: a*a(ba*a)* = a*[a((ba*)a)*] · is associative
= a*[(a(ba*))*a] R(SR)* = (RS)*R
= a*[((ab)a*)*a] · is associative
= [a*((ab)a*)*]a · is associative
= (a + ab)*a R*(SR*)* = (R + S)*. QED.
Automata with Output (Mealy and Moore)
Mealy machine: Associate an output action with each
transition between states.
Challenge: Describe a two-floor elevator system with a Mealy or Moore machine.
Moore machine: Associate an output action with each state.
Theorem: Mealy and Moore machines are equivalent.
Example/Quiz. Output the complement of a binary number.
i/A
a
j/B
a/A
i
j
Start
0/1
0
1/0
Start
0/L
2/0
1/1
0
1
0
1
0
1
Solutions:
*
*
Section 11.3 Constructing Efficient Finite Automata
First we’ll see how to transform an NFA into a DFA. Then we’ll see how to transform a
DFA into a minimum-state DFA.
Transforming an NFA into a DFA
The l-closure of a state s, denoted l(s), is the set consisting of s together with all states that
can be reached from s by traversing l-edges. The l-closure of a set S of states, denoted
l(S), is the union of the l-closures of the states in S.
Example. Given the following NFA as a graph and as a transition table.
Some sample l-closures for the NFA are as follows:
l(0) = {0, 1, 2}
l(1) = {1, 2}
l(2) = {2}
l() =
l({1, 2}) = {1, 2}
l({0, 1, 2}) = {0, 1, 2}.
S
0
2
1
b
b
L
a
L
a
*
*
Algorithm: Transform an NFA into a DFA
Construct a DFA table TD from an NFA table TN as follows:
1. The start state of the DFA is l(s), where s is the start state of the NFA.
2. If {s1, …, sn} is a DFA state and a A, then
TD({s1, …, sn}, a) = l(TN(s1, a) … TN(sn, a)).
3. A DFA state is final if one of its elements is an NFA final state.
Example. Given the following NFA.
The algorithm constructs the following DFA transition table TD, where it is also written in
simplified form after a renumbering of the states.
S
0
3
1
b
b
a
L
a
2
b
*
*
Quiz. Use the algorithm to transform the following NFA into a DFA.
Solution: The algorithm constructs the following DFA transition table TD, where it is
also written in simplified form after a renumbering of the states.
S
0
1
a
a
L
b
2
3
a
*
*
Transforming an DFA into a minimum-state DFA
Let S be the set of states that can be reached from the start state of a DFA over A.
For states s, t S let s ~ t mean that for all strings w A* either T(s, w) and T(t, w) are
both final or both nonfinal. Observe that ~ is an equivalence relation on S. So it partitions S
into equivalence classes.
Observe also that the number of equivalence classes is the minimum number of states
needed by a DFA to recognize the language of the given DFA.
Algorithm: Transform a DFA to a minimum-state DFA
1. Construct the following sequence of sets of possible equivalent pairs of distinct states:
E0 E1 … Ek = Ek+1,
where
E0 = {{s, t} | s and t are either both final or both nonfinal}
and
Ei+1 = {{s, t} Ei | {T(s, a), T(t, a)} Ei or T(s, a) = T(t, a)} for every a A}.
Ek represents the distinct pairs of equivalent states from which ~ can be generated.
2. The equivalence classes form the states of the minimum state DFA with transition table Tmin defined by
Tmin([s], a) = [T(s, a)].
3. The start state is the class containing the start state of the given DFA.
4. A final state is any class containing a final state of the given DFA.
*
*
Example. Use the algorithm to transform the following DFA into a minimum-state DFA.
Solution: The set of states is S = {0, 1, 2, 3, 4}. To find the equivalent states calculate:
E0 = {{0, 4}, {1, 2}, {1, 3}, {2, 3}}
E1 = {{1, 2}, {1, 3}, {2, 3}}
E2 = {{1, 2}, {1, 3}, {2, 3}} = E1.
So 1 ~ 2, 1 ~ 3, 2 ~ 3. This tells us that S is partitioned by {0}, {1, 2, 3}, {4}, which we
name [0], [1], [4], respectively. So the minimum-state DFA has three states.
Quiz: What regular expression equality arises from the two DFAs?
Answer: a + aa + (aaa + aab + ab)(a + b)* = a(a + b)*.
S
0
a
b
a
2
a, b
4
1
3
b
a, b
a, b
Min-state Table
Renamed Table
S
0
a
b
a, b
2
1
a, b
Min-state DFA graph
*
*
Quiz. Is the following DFA a minimum-state DFA?
Answer: No. E0 = {{0, 1}, {0, 5}, {1, 5}, {2, 3}, {2, 4}, {3, 4}}
E1 = {{1, 5}, {2, 4}}
E2 = {{1, 5}, {2, 4}} = E1.
So 1 ~ 5 and 2 ~ 4. This gives us {0}, {1, 5}, {2, 4}, {3}, with names
[0], [1], [2], [3], respectively with the following minimum-state DFA.
E0 = {{0, 1}}
E1 = {{0, 1}} = E0.
So 0 ~ 1, which tells us that the partition is the whole set of states {0, 1} = [0]. Therefore,
we obtain the following minimum-state DFA.
Answer. No. Use the minimum-state algorithm.
S
b
0
1
a
a
b
S
a, b
0
Quiz. Is the following DFA a minimum-state DFA?
*
*
Section 11.4 Regular Language Topics
Regular languages are also characterized by special grammars called regular grammars
whose productions take the following form, where w is a string of terminals.
A wB or A w.
Example. A regular grammar for the language of a*b* is
S L | aS | T
T b | bT.
Quiz. Write a regular grammar for {ab, acb, accb. …, acnb, …}.
Solution: A regular expression for the language is ac*b. A regular grammar is
S aT
T b | cT.
Regular grammars as we have defined them are often called right-regular because there are
also left-regular grammars that require productions to have the form
A Bw or A w.
Any language with a right-regular grammar also has a left-regular grammar, and
conversely.
Example. For the regular expression a*bc* we have the following grammars.
Right regular: S aS | bT
T L | cT.
Left Regular: S Sc | Tb
T L | Ta.
*
*
Transforming an NFA to a Regular Grammar
- State names become the nonterminals. So rename them to be uppercase letters.
2. The start state becomes the start symbol of the grammar.
3. For each state transition from I to J labeled with x construct a production I xJ.
4. For each final state F construct a production F L.
Example/Quiz. Transform the following NFA into a regular grammar.
Solution:
S aI | F
I aI | bF
F L
This grammar can be simplified to
S aI | L
I aI | b
Notice in the previous example that right-regular grammars derive strings by examining the
leftmost letter first, while left-regular grammars examine the rightmost letter first.
Example/Quiz. Find a right-regular grammar and a left-regular grammar for the language
of the regular expression a(b + c)*de*.
Right regular: S aT
T bT | cT | dU
U L | eU.
Left Regular: S Se | Td
T a | Tb | Tc.
Start
S
F
I
L
b
a
a
*
*
Example/Quiz. Transform the regular expression a*(ba)* into a regular grammar.
Solution: Draw an NFA and then use the algorithm.
An NFA:
The resulting grammar:
S aS | bI | L
I aF
F bJ | L
J aF.
Quiz. Simplify the grammar.
Answer:
S aS | baF | L
F baF | L
Quiz. What happens when a DFA has
an error state, like the following DFA?
Answer: The resulting grammar:
S aF | bE
F aE | bJ | L
J aF | bE
E aE | bE.
But E does not derive a
terminal string. So the
grammar simplifies to
S aF
F bJ | L
J aF.
Quiz: Simplify the grammar
further.
Answer: S aF
F baF | L.
Start
S
I
a
b
F
J
a
b
a
Start
S
F
J
a
b
a
E
b
a
b
a, b
*
*
Transforming a Regular Grammar to an NFA
1. Replace any production with multiple terminals by productions with single terminals.
2. The start state is the grammar start symbol.
3. Transform I aJ into a transition from I to J labeled with a.
4. Transform I J into a transition from I to J labeled with L.
5. Transform each I a into a transition from I to new single final state F labeled with a.
6. The final states are F together with each state I with a production I L.
Example/Quiz. Transform the following regular grammar into a NFA.
S abS | T | L
T cT | d
Quiz. What is the regular expression for the language of the grammar?
Answer: (ab)*(L + c*d)
Solution. Transform S abS into S aI
and I bS, so the grammar becomes
S aI | T | L
I bS
T cT | d
Now the NFA can be drawn:
Start
S
T
L
b
a
c
F
I
d
*
*
Properties of Regular Languages
When we know some properties of regular languages they can help us argue, BWOC, that
certain languages are not regular.
The Pumping Lemma
If L is an infinite regular language, then it is recognized by a DFA with, say, m states.
If s in L and | s | ≥ m, then an acceptance path for s must pass through some state twice.
The following graph depicts the situation.
The dotted arrows represent the path to acceptance for s and the letters x, y, and z represent
the concatenation of the letters along the edges of the path. So s = xyz and y ≠ L. Assume
that the middle state is the first repeated state on the path. So | xy | ≤ m. Since the loop can
be traversed any number of times, we have the pumping property: xykz L for all k N
Example. The language L = {anbn | n N} is not regular.
Proof: Assume, BWOC, that L is regular. Since L is infinite, the pumping lemma applies.
Choose s = ambm. Then s = xyz, where y ≠ L, | xy | ≤ m, and xykz L for all k N.
Since | xy | ≤ m and s = ambm = xyz, it follows that xy is a string of a’s.
Since y ≠ L, it must be that y = ai for some i > 0.
We’ll try for a contradiction with k = 2. The pumping property implies xy2z L.
But xy2z = am + ibm, which is not in L because i > 0.
This contradiction implies that L is not regular. QED.
Start
y
x
z
*
*
Quiz. In the previous proof we exhibited a contradiction when k = 2. Find similar
contradictions for k = 0 and k = 3.
Answer: (k = 0) The pumping property implies xy0z L. In other words, xz L.
But xz = am – ibm, which is not in L because i > 0. This contradiction implies that L is not
regular. QED.
(k = 3) The pumping property implies xy3z L. But xy3z = am + 2ibm, which is not in L
because i > 0. This contradiction implies that L is not regular. QED.
Example/Quiz. Prove the language L = {aabnac2n | n N} is not regular.
Proof: Assume, BWOC, that L is regular. Since L is infinite, the pumping lemma applies.
Choose s = aabmac2m. Then s = xyz, where y ≠ L, | xy | ≤ m and xykz L for all k N.
Since | xy | ≤ m, there are three cases for y:
Case 1: y = aabi for some 0 ≤ i ≤ m – 2. (Remember, | xy | ≤ m, so x = L in this case.)
Case 2: y = abi for some 0 ≤ i ≤ m – 2. (In this case, we have x = a.)
Case 3: y = bi for some 0 < i ≤ m – 2 – j. (In this case, x = aabj for some 0 ≤ j ≤ m – 3)
We can obtain a contradiction in each case by considering k = 0, which implies xz L.
Case 1: xz = bm – iac2m which is not in L because strings of L must begin with aa.
Case 2: xz = abm – iac2m which is not in L because strngs of L must begin with aa.
Case 3: xz = aabm – iac2m which is not in L because strings in L must have twice as many c’s as b’s which can’t be the case since i > 0.
So each of the three cases yields a contradiction. Therefore L is not regular. QED.
*
*
Section 11.5 Context-Free Languages
We know that the language {anbn | n N} is not regular, but it certainly has a non-regular
grammar, such as
S aSb | L.
A context-free grammar has productions of the form
N w
Where N is a nonterminal and w is any string containing terminals and/or nonterminals.
A context-free language is the set of strings derived from a context-free grammar.
Example. {anbn | n N} is a C-F language derived from the C-F grammar S aSb | L.
Example. Any regular grammar is context-free. So regular languages are C-F languages.
Quiz. Find a grammar for {anbn + 2 | n N}.
Answer. S aSb | bb.
Quiz. Find a grammar for{wwR| w {a, b}*}, where wR is the reverse of w.
Answer. S aSa | bSb | L.
Techniques for Constucting Grammars:
Let L and M be two C-F grammars with disjoint sets of nonterminals and with start symbols
A and B, respectively. Then
- L M has grammar S A | B.
- LM has grammar S AB.
- L* has grammar S AS | L.
*
*
Example. Let L be the language of strings over {a, b} with the same number of a’s and b’s.
Does L have the following grammar?
S aSbS | bSaS | L.
It’s easy to see that the language of the grammar is a subset of L.
What about the other way?
Assume that w L and show w is derived by the grammar.
If w = L, then S L.
Let w ≠ L and assume that if s L and | s | < | w |, then S + s.
Show that S + w. Consider the four cases:
1. w = asb for some string s. In this case, s L and | s | < | w |. So by induction we have
S + s. Therefore, we have S aSbS aSb + asb = w.
2. w = bsa for some string s. Similar to case 1.
3. w = axa for some string x. In this case, x has two more b’s than a’s. So x L.
What do we do now?
Notice, for example, if | w | = 4, then w = abba. If | w | = 6, then w has one of the forms
aabbba, ababba, abbaba abbbaa.
We claim that x can be written in the form x = ubbv where u, v L. (Can you prove it?)
So by induction we have derivations S + u and S + v. Therefore, we have
S aSbS + aubS aubbSaS + aubbvaS aubbva = axa = w.
4. w = bxb for some string x. Similar to case 3.
QED.
*
*
Section 11.6 Pushdown Automata
A pushdown automaton (PDA) is a finite automaton with a stack that has stack operations
pop, push, and nop. PDAs always start with one designated symbol on the stack. A state
transition depends on the input symbol and the top of the stack. The machine then performs
a stack operation and enters the next state.
Representation can be graphical or with sets of 5-tuples. For example:
or (i, b, C, pop, j)
Execution: If the machine is in state i and the input letter is b and C is on top of the stack,
then pop the stack and enter state j.
Nondeterminism can occur in two ways, as in the following examples.
Acceptance: A string w is accepted by a PDA if there is a path from the start state to a
final state such that the input symbols on the path edges concatenate to w. Otherwise, w
is rejected.
i
j
i
j
k
i
j
k
*
*
Example. A PDA to accept the language {anbn | n > 0} as a graph and as a set of 5-tuples.
(0, a, X, push(a), 0)
(0, a, a, push(a), 0)
(0, b, a, pop, 1)
(1, b, a, pop, 1)
(1, L, X, nop, 2).
Quiz. How would you modify the machine to accept {anbn | n N}?
Answer: Add the instruction (0, L, X, nop, 2).
Example/Quiz. Find a PDA to accept the language {a2nbn | n N}.
ID for input aaaabb:
(0, aaaabb, X)
(0, aaabb, aX)
(0, aabb, aaX)
(0, abb, aaaX)
(0, bb, aaaaX)
(1, b, aaaX)
(2, b, aaX)
(1, L, aX)
(2, L, X)
(3, L, X).
A solution:
0
Start
X
1
2
0
Start
X
1
2
3
*
*
Empty-Stack Acceptance. A PDA can also be defined to accept a string by the condition
that the stack is empty. The two types of acceptance are equivalent in that they both define
PDAs that accept the same class of languages.
Transform final-state acceptance to empty-stack acceptance:
Introduce a new start state s, a new start symbol Y, a new empty state e, and construct edges
and actions as shown in the diagram below. Note that ? stands for any stack symbol.
Quiz: If the final-state PDA is deterministic, the transformed empty-stack PDA could be
nondeterministic. Why?
Answer: If the final-state PDA has an edge emitted from a final state, then the empty-stack
PDA constructs a nondeterministic choice from that final state. For example, a final-state
PDA for {abncn | n N} could have a final state k to accept the string a that also emits an
edge to take care of any string abn + 1cn + 1. The empty-stack PDA would construct a
nondeterministic choice from k by adding instructions of the form (k, L, ?, pop, e).
Start
X
s
Y
Start
e
final-state PDA
*
*
Transform empty-stack acceptance to final-state acceptance:
Introduce a new start state s, a new start symbol Y, a new final state ƒ, and construct edges
and actions as shown in the diagram below.
Quiz: If the empty-stack PDA is deterministic, then the transformed final-state PDA is also
deterministic. Why?
Answer: Because the instructions of the empty-stack PDA do not use stack symbol Y. So the
only time a new instruction can be executed is when Y is on top of the stack, meaning that
the empty-stack PDA has emptied it’s original stack and can’t move.
Theorem: The context-free languages are exactly the languages accepted by PDAs.
Proof: We’ll see two algorithms, one to transform a C-F grammar into an empty-stack
PDA, and one to transform an empty-stack PDA into a C-F grammar. QED.
Start
X
s
Y
Start
empty-stack PDA
ƒ
*
*
Transform a C-F grammar to empty-stack PDA
The idea is to construct a PDA whose execution corresponds to a leftmost derivation.
The PDA has one state, but we allow multiple stack operations. The stack symbols are the
terminals and nonterminals of the grammar. The designated starting stack symbol is the
grammar start symbol. We’ll illustrate the algorithm with an example.
Example. A leftmost derivation of abbcb:
S aSb
aBCb
abBCb
abbCb
abbcCb
abbcb.
(0, abbcb, S)
(0, abbcb, aSb)
(0, bbcb, Sb)
(0, bbcb, BCb)
(0, bbcb, bBCb)
(0, bcb, BCb)
(0, bcb, bCb)
(0, cb, Cb)
(0, cb, cCb)
(0, b, Cb)
(0, b, b)
(0, L, L).
Example. ID for input string abbcb:
Terminals Corresponding PDA Instructions
a (0, a, a, pop, 0)
b (0, b, b, pop, 0)
c (0, c, c, pop, 0)
Productions
S aSb (0, L, S, pop, push(b), push(S), push(a), 0)
S BC (0, L, S, pop, push(C), push(B), 0)
B bB (0, L, B, pop, push(B), push(b), 0)
B b (0, L, B, pop, push(b), 0)
C cC (0, L, C, pop, push(C), push(c), 0)
C L (0, L, C, pop, 0)
Grammar
S aSb | BC
B bB | b
C cC | L.
*
*
Transform an empty-stack PDA to a C-F grammar
The idea is to construct a grammar whose leftmost derivations correspond to computation
sequences. For each stack symbol B and each pair of states i and j construct a nonterminal
Bij from which a leftmost derivation will correspond to a computation sequence that starts
in state i with B on the stack and ends in state j with B popped from the stack.
Bij a
Bik aBjk for each state k.
Bil aCjkBkl for each state k and l.
S Xsj for each state j. (S is start symbol)
Corresponding Grammar Productions
PDA Instruction
(For the following instructions, the dotted lines represent
possible paths to states that pop the desired stack symbol.)
i
j
i
j
k
i
j
k
l
s
X
Start
j
*
*
Example/Quiz. Use the algorithm to
transform the following empty-stack PDA
into a C-F grammar.
Quiz: Simplify the preceding grammar by
deleting productions that do not derive terminal
strings and productions that are not reachable
from the start symbol.
Solution:
The start state 0 with X on the stack gives:
S X00 | X01
The pop operation (1, a, X, pop, 1) gives:
X11 a
The pop operation (1, a, A, pop, 1) gives:
A11 a
The push operation (0, a, X, push(A), 0) gives:
X00 aA00X00 | aA01X10
X01 aA00X01 | aA01X11
The nop operation (0, a, A, nop, 1) gives:
A00 aA10
A01 aA11
Solution: S X01
X01 aA01X11
A01 aA11
X11 a
A11 a
or, S aaaa.
Start
0
X
1
*
*
Example. We’ll try to find a grammar for the language L = {w {a, b}* | na(w) = nb(w)} by
constructing an empty-stack PDA to accept L and then transforming it into a C-F grammar.
Consider the following single-state empty-stack PDA.
Idea of Proof: Assume, WLOG, that the input
string begins with a. Then A is pushed and this
action continues for each consecutive a. When the
first b occurs, A is popped, which means that an a-b
pair has been counted. The only way to get back to
X on top is if the string has an equal number of a’s
and b’s so far. If the input is exhausted, then pop X
and accept the string. Otherwise, the process
continues as described.
PDA Transformed into C-F grammar:
S X00
X00 L | aA00X00 | bB00X00
A00 b | aA00A00
B00 a | bB00B00 .
A Simplified C-F grammar:
S L | aAS | bBS
A b | aAA
B a | bBB.
Example Derivation of aababb:
S aAS aaAAS aabAS
aabaAAS aababAS
aababbS aababb.
Start
0
X
*
*
Nondeterministic PDAs are more powerful than deterministic PDAs.
An example is to consider the language of even palindromes over {a, b}. A context-free
grammar for the language is given by
S L | aSa | bSb
Any PDA to accept the language must make a nondeterministic decision to start comparing
the second half of a string with the reverse of the first half.
Quiz: Transform the grammar into a PDA
A Solution (using the algorithm)
(0, a, a, pop, 0)
(0, b, b, pop, 0)
(0, L, S, pop, 0)
(0, L, S, pop, push(a), push(S), push(a), 0)
(0, L, S, pop, push(b), push(S), push(b), 0).
*
*
Examples/Quizzes. For a string x and letter a let na(x) be the number of a’s in x.
Let L = {x {a, b}* | na(x) = nb(x)}. A grammar for L with start symbol E can be written as:
E aEbE | bEaE | L.
Use this information to find grammars for the following languages.
- {x {a, b}* | na(x) = 1 + nb(x)}.
Solution: S EaE.
2. {x {a, b}* | na(x) = 2 + nb(x)}.
Solution: S EaEaE.
3. {x {a, b}* | na(x) > nb(x)}.
Solution: S EaET
T aET | L.
4. {x {a, b}* | na(x) < nb(x)}.
Solution: S EbET
T bET | L.
5. {x {a, b}* | na(x) ≠ nb(x)}.
Solution: This language is the union of the languages in (3) and (4). Rename the
nonterminals in the grammars for (3) and (4) as follows:
(3) A EaET
T aET | L.
(4) B EbEU
U bEU | L.
Then S A | B is the desired grammar.
*
*
Section 11.7 Context-Free Language Topics
We know (via a theorem) that the context-free languages are exactly those languages that
are accepted by PDAs.
When a context-free language can be recognized by a deterministic final-state PDA, it is
called a deterministic context-free language.
An LL(k) grammar has the property that a parser can be constructed to scan an input string
from left to right and build a leftmost derivation by examining next k input symbols to
determine the unique production for each derivation step.
If a language has an LL(k) grammar, it is called an LL(k) language.
LL(k) languages are deterministic context-free languages, but there are deterministic
context-free languages that are not LL(k). (See text for an example on page 789.)
Example. Consider the language {anb | n N}.
- It has the LL(1) grammar S aS | b.
A parser can examine one input letter to decide whether to use S aS or S b for
the next derivation step.
(2) It has the LL(2) grammar S aaS | ab | b.
A parser can examine two input letters to determine whether to use S aaS or S ab
for the next derivation step. Notice that the grammar is not LL(1).
(3) Quiz. Find an LL(3) grammar that is not LL(2).
Solution. S aaaS | aab | ab | b.
*
*
Answer: Any derivation starts with S AB. The next derivation step uses one of the
productions A aAb or A L, depending on whether the scanned input letter is a or not.
The argument is the same for B-productions.
It is not LL(1): Let the input be ab. The first letter a tells us to start with S aSb.
The letter a in aSb matches the first input letter, so we lookahead to the second input letter b.
This tells us we must use S T to get S aSb aTb.
The lookahead remains at b, so we can’t determine whether it is the last b of the input string.
So we don’t know whether to choose T bT or T L for the next step. Thus the grammar is
not LL(1).
It is not LL(2): Let the input be aabb. The first two input letters aa, tell us S aSb.
The letter a in aSb matches the first input, so we lookahead to the next two-letter substring ab.
We must use S aSb to get S aSb aaSbb.
Now the lookahead becomes bb, so we must use S T to get S aSb aaSbb aaTbb.
The lookahead remains at bb, so we can’t determine whether these are the last two b’s of the
input string. So we don’t know whether to choose T bT or T L for the next step. Thus
the grammar is not LL(2).
Quiz (on your time): Find a general argument to show the grammar is not LL(k).
Example/Quiz. Why is the following grammar
for {anbn + k | n, k N} an-LL(1) grammar?
S AB
A aAb | L
B bB | L.
Example. The following grammar for {anbn + k | n, k N}
is not LL(k) for every k.
S aSb | T
T bT | L.
*
*
Example. The language {an + kbn | k, n N} is not deterministic context-free. So it has no
LL(k) grammar for every k.
Proof: Any PDA for the language must keep a count of the a’s with the stack so that when the
b’s come along the stack can be popped with each b. But there might still be a’s on the stack
(e.g., when k > 0), so there must be a nondeterministic state transition to a final state from the
popping state. i.e., We need two instructions like, (i, b, a, pop, i) and (i, L, a, pop, final).
Note: The previous two examples show that {ambn | m ≤ n} is LL(1) but {ambn | m ≥ n} is not
LL(k) for any k.
Grammar Transformations
Left factoring: Sometimes we can “left-factor” an LL(k) grammar to obtain an equivalent
LL(n) grammar where n < k.
Example. The grammar S aaS | ab | b is LL(2) but not LL(1). But we can factor out the
common prefix a from productions S aaS | ab to obtain
S aT
T aS | b.
This gives the new grammar: S aT | b
T aS | b.
Quiz: Find an LL(k) grammar where k is as
small as possible that is equivalent to the
following grammar.
S abS | abcT | ab
T cT | c.
Solution: The given grammar is LL(3)
and it can be left-factored to become
an LL(1) grammar as follows:
S abR
R S | cT | L
T cU
U T | L.
*
*
Removing left recursion: A left-recursive grammar is one which has a derivation of the form
A + Ax
for some nonterminal A and sentential form x.
- Left-recursive grammars are not LL(k) for any k.
Example. The language {ban | n N} has a grammar S Sa | b, which is left-recursive. It is
not LL(k) for any k. Consider the following cases:
LL(1) case: If the input string is ba, the lookahead is b. So we don’t know whether there are
any a’s to the right of b. Therefore we don’t know which production to start the derivation.
LL(2) case: If the input string is baa, the lookahead is ba. So the derivation starts with S
Sa. But the a in Sa denotes the a at the right end of the derived string. So the input string
could be ba or baa. Therefore we don’t know which production to choose next.
LL(k) case: If the input string is ba…a with k a’s, the lookahead is ba…a with k – 1 a’s. So
the derivation after k –1 steps is S Sa Saa … Sa…a. Now, the input string could
be ba…a (length k) or ba…a (length k + 1). Therefore we don’t know which production to
choose next.
- Sometimes we can obtain an LL(k) grammar by removing left recursion.
- Algorithm Idea for direct left recursion:
Transform: A Aw | Au | Av | a | b.
To: A aB | bB
B wB | uB | vB | L.
Example/Quiz. Remove left recursion.
S Sa | b.
Solution. S bT
T aT | L.
It is LL(1).
*
*
Example/Quiz. Remove left recursion.
S Saa | aab | aac.
Solution: S aabT | aacT
T aaT | L.
It is LL(3).
Example (Removing indirect left recursion). Consider the following grammar.
S Ab | a
A Sa | b.
The grammar is left-recursive because of the indirect left recursion S Ab Sab.
To remove the indirect left recursion, replace A in S Ab by the right side of A Sa | b to
obtain S Sab | bb. The grammar becomes S Sab | bb | a. Remove the left recursion:
S bbT | aT
T abT | L.
Quiz: Remove left recursion from the grammar: S Ab | a
A SAa | b.
Solution: Replace A in S Ab | a by the right side of A SAa | b to obtain
S SAab | bb | a
A SAa | b.
Now remove the direct left recursion: S bbT | aT
T AabT | L
A SAa | b.
Quiz: Rewrite the previous solution as LL(1).
Solution: S aaU
U bT | cT
T aaT | L.
*
*
Remove L-productions from grammars for langauges without L.
- Find nonterminals that derive L.
- For each production A w construct all productions A w’ where w’ is obtained from
w by removing one or more occurrences of the nonterminals from Step 1.
- Combine the original productions with those of step 2 and eliminate any L-productions.
Example. Remove L-productions from the grammar
S ABc
A aA | L
B bB | L.
Solution. Step 1: The nonterminals A and B derive L.
Step 2: From the production S ABc we construct S Bc | Ac | c.
From the production A aA we construct A a.
From the production B bB we construct B b.
Step 3: S ABc | Bc | Ac | c
A aA | a
B bB | b.
Quiz. Remove L -productions from
S ABc | Ab | c
A ABa | L
B Bbc | L.
Solution.
S ABc | Ab | c| Bc | Ac | b
A ABa | Ba | Aa | a
B Bbc | bc.
*
*
Chomsky Normal Form. Productions have one of the following forms
A a (a a terminal)
A BC
S L (if L is in the language).
Advantages: Parse trees are binary, which are easy to represent. Any string of length n > 0
can be derived in 2n – 1 steps.
Algorithm. Transform to Chomsky normal form (with the additional property that no start symbol occurs on the right side of a production)
1. If the start symbol S occurs on some right side, create a new start symbol S´ and a
new production S´ S.
2. Remove A L (if A ≠ S) by previous algorithm. (If S L is removed, add it back.)
3. Remove unit productions (i.e., A B): If A B or A + B, then construct productions
A w where B w is not a unit production. Now remove all unit productions.
4. For each production whose right side has two or more symbols, replace all occurrrences
of each terminal a with a new nonterminal A and also add the new production A a.
5. Replace each production B C1…Cn with n > 2 with B C1D where D C2 …Cn.
Repeat this step until all right sides have length two.
*
*
Example. Construct a Chomsky normal form for the grammar
S aSb | D
D Dc | L.
Solution.
Step 1: Add the production S´ S.
Step 2: S´ S | L
S aSb | ab | D
D Dc | c.
Step 3: S´ aSb | ab | Dc | c | L
S aSb | ab | Dc | c
D Dc | c.
Step 4: S´ ASB | AB | DC | c | L
S ASB | AB | DC | c
D DC | c
A a
B b
C c.
Step 5: Replace S´ ASB and S ASB by
S´ AE, S´ AE, and E SB.
*
*
Quiz. Construct a Chomsky normal form for the grammar
S aTbb | U | L
T cT | c
U Ud | d.
Solution. Step 1: No change. Step 2: No change.
Step 3: Remove unit production S U to obtain
S aTbb | Ud | d | L
T cT | c
U Ud | d.
Step 4: Transform right sides of length at least two into strings of nonterminals.
S ATBB | UD | d | L
T CT | c
U UD | d.
A a
B b
C c
D d.
Step 5: Replace S ATBB with the productions
S AE, E TF, F BB.
*
*
Greibach Normal Form. Productions have one of the following forms
A b (b a terminal)
A bD1…Dk
S L (if L is in the language).
Advantage: Any string of length n > 0 can be derived in n steps.
Algorithm (idea). Transform context-free grammar to Greibach normal form.
- Perform steps 1, 2, and 3 of the Chomsky algorithm.
- Remove all left-recursion, including indirect, without adding L
3. Make substitutions to transform the grammar into the proper form.
Example. Put the following grammar into Greibach normal form.
S AB | Ac | d
A aA | a
B Ab | c.
Solution: Steps 1 (Chomsky steps 1, 2, and 3) and 2 are non needed.
Step 3: Replace A in S AB | Ac | d with aA | a to obtain
S aAB | aB | aAc | ac | d.
Replace A in B Ab | c with theright side of A aA | a to obtain B aAb | ab | c.
Now add the new productions C c and D b to obtain the proper form:
S aAB | aB | aAC | aC | d
A aA | a
B aAD | aD | c
C c
D b.
*
*
Example. Put the following grammar into Greibach normal form.
S AB | Ac | d
A aA | a
B Ab | c.
Solution: Steps 1 (Chomsky steps 1, 2, and 3) and 2 are not needed.
Step 3: Replace A in S AB | Ac | d with aA | a to obtain
S aAB | aB | aAc | ac | d.
Replace A in B Ab | c with theright side of A aA | a to obtain
B aAb | ab | c.
Now add the new productions C c and D b and make appropriate replacements
to obtain the proper form:
S aAB | aB | aAC | aC | d
A aA | a
B aAD | aD | c
C c
D b.
*
Properties of Context-Free Languages
When we know some properties of context-free languages they can help us argue, BWOC,
that certain languages are not context-free.
The Pumping Lemma
If L is an infinite context-free language, then any grammar for L must be recursive, so there
must be derivations of the the following form where u, v, w, x, and y are terminal strings.
S + uNy
N + vNx (where v and x are not both L)
N + w.
These derivations lead to derivations like
S + uNy + uvNxy + uv2Nx2y + uvkNxky + uvkwxky L for all k N.
This is the basis for the Pumping Lemma:
There is an integer m > 0 such that if z L and | z | ≥ m, then z has the form z = uvwxy where
1 ≤ | vx | ≤ | vwx | ≤ m and uvkwxky L for all k N.
Note: The number m depends on the grammar as we’ll see in the following example.
*
*
For this grammar m = 4 can be used in the pumping lemma because any derivation of a
string z with | z | ≥ 4 must use the nonterminal N. For example, if | z | = 8 and z = abcccccd,
then the pumping lemma factors z = abcccccd = uvwxy where 1 ≤ | vx | ≤ | vwx | ≤ 4 and
uvkwxky L for all k N. In this case let u = a, v = L, w = b, x = c, and y = ccccd.
Example. The language L = {anbncn+k | k, n N} is not context-free.
Proof: Assume, BWOC, that L is context-free. L is infinite, so pumping lemma applies.
Choose z = ambmcm where m is the positive integer from the lemma.
Then z = ambmcm = uvwxy where 1 ≤ | vx | ≤ | vwx | ≤ m and uvkwxky L for all k N.
Observe neither v nor x can contain distinct letters. For example, if v = …a…b…, then
v2 = …a…b……a…b…, which can’t appear as a substring of any string in L.
So v and x must be strings of repeated occurrences of a single letter.
Now since | vwx | ≤ m, there are two possible places in ambmcm where v and x must occur:
(1) v and x occur in ambm.
(2) v and x occur in bmcm.
But we obtain the following contradictions because v and x are not both L.
(1) Let k = 2 to obtain uv2wx2y = am+ibm+jcm, where i > 0 or j > 0. So uv2wx2y L
(2) Let k = 0 to obtain uwy = ambm-icm–j, where i > 0 or j > 0. So we have uwy L.
These contradictions imply that L is not context-free. QED.
Example. Suppose we have the following
grammar for {L, bbc} {abcnd | n N}.
S aNd | bbc | L
N Nc | b.
Here are a few derivations:
S aNd abd
S aNd aNcd abcd
S aNd aNcd aNccd abccd
S + abckd for any k in N.
*
*
Example/Quiz. Prove that the language L = {ss | s {a, b}*} is not context-free.
Proof: Assume, BWOC, that L is context-free. L is infinite, so pumping lemma applies.
Choose z = ambmambm where m is the positive integer from the lemma.
Then z = ambmambm = uvwxy where 1 ≤ | vx | ≤ | vwx | ≤ m and uvkwxky L for all k N.
Now since | vwx | ≤ m, there are three possible places in ambmambm where v and x must occur:
(1) v and x occur in ambm (on the left of z).
(2) v and x occur in bmam (in the center of z).
(3) v and x occur in ambm (on the right of z).
Notice that v and x can consist only of repetitions a single letter. For example, in case (1)
suppose v = aibj for some i > 0 and j > 0 and x = bn for some n ≥ 0. Then, letting k = 0, we
would obtain uwy = am–ibm–j–nambm, which cannot be in L. The argument is similar for the
other cases. So v and x must consist only of repetitions of a single letter.
We need to find a contradiction in each of the three cases. We’ll do it by using k = 0.
This tells us that uwy L. But we obtain the following contradictions because v and x are not
both L.
(1) uwy = am–ibm–jambm where either i > 0 or j > 0 So uwy L,
(2) uwy = ambm–iam–jbm where either i > 0 or j > 0. So uwy L.
(3) uwy = ambmam–ibm–j where either i > 0 or j > 0. So uwy L.
Therefore L is not context-free. QED.
Remark: Be careful that the choice of z is not in a context-free sublanguage of L. For example, if we chose z = (ab)m(ab)m in the preceding example, we would not get any contradictions.
*