Mathematical Research Paper
IEOR 4701: Professor Whitt
Lecture Notes, Monday, July 16, 2007
Introduction to Markov Chains
1. Markov Mouse: The Closed Maze
We start by considering how to model a mouse moving around in a maze. The maze is a closed space containing nine rooms. The space is arranged in a three-by-three array of rooms, with doorways connecting the rooms, as shown in the figure below
The Maze
1 2 3
4 5 6
7 8 9
There are doors leading to adjacent rooms, vertically and horizontally. In particular, there are doors
from 1 to 2, 4 from 2 to 1, 3, 5 from 3 to 2, 6 from 4 to 1, 5, 7 from 5 to 2, 4, 6, 8 from 6 to 3, 5, 9 from 7 to 4, 8 from 8 to 5, 7, 9
from 9 to 6, 8
We assume that the mouse is a Markov mouse; i.e., the mouse moves randomly from room to room, with the probability distribution of the next room depending only on the current room, not on the history of how it got to the current room. (This is the Markov property.) Moreover, we assume that the mouse is equally likely to choose each of the available doors in the room it occupies. (That is a special property beyond the Markov property.)
We now model the movement of the mouse as a Markov chain. The state of the Markov chain is the room occupied by the mouse. We let the time index n refer to the nth room visited by the mouse. So we make a discrete-time Markov chain. Specifically, we let Xn be the state (room) occupied by the mouse on step (or time or transition) n. The initial room is X0. The room after the first transition is X1, and so forth. Then {Xn : n ≥ 0} is the discrete-time Markov chain (DTMC); it is a discrete-time discrete-state stochastic process. The mouse is in room Xn ( a random variable) after making n moves, after having started in room X0, which could also be a random variable.
We specify the evolution of the Markov chain by specifying the one-step transition prob- abilities. We specify these transition probabilities by specifying the transition matrix. For our example, making a discrete-time Markov chain model means that we define a 9×9 Markov transition matrix consistent with the specification above. For the most part, specifying the transition matrix is specifying the model. (We also must say how we start. The starting point could be random, in which case the initial position would be specified by a probability vector.)
Notation. It is common to denote the transition matrix by P and its elements by Pi,j ; i.e., Pi,j denotes the probability of going to state j next when currently in state i. (When the state space has m states (is finite), P is a square m × m matrix.)
For example, for our Markov mouse model, we have P1,2 = 1/2 and P1,4 = 1/2, with P1,j = 0 for all other j, 1 ≤ j ≤ 9. And we have P2,1 = P2,3 = P2,5 = 1/3 with P2,j = 0 for all other j. And so forth.
Here is the total transition matrix:
P =
1 2 3 4 5 6 7 8 9
0 1/2 0 1/2 0 0 0 0 0 1/3 0 1/3 0 1/3 0 0 0 0 0 1/2 0 0 0 1/2 0 0 0
1/3 0 0 0 1/3 0 1/3 0 0 0 1/4 0 1/4 0 1/4 0 1/4 0 0 0 1/3 1/3 0 0 0 1/3 0 0 0 1/2 0 0 0 1/2 0 0 0 0 0 1/3 0 1/3 0 1/3 0 0 0 0 0 1/2 0 1/2 0
.
It is common to label the columns, but we did not above. They are numbered in the same way as the rows. Here the columns are numbered from 1 to 9 starting at the left. Hence, the matrix element in the upper left corder is P1,1, while the matrix element in the lower right corner is P9,9. The matrix element P2,3 appears in the second row and third column. The rows represent the starting state, while the column represents the next step; i.e., P2,3 is the probability of going next to state 3 given that you are starting in state 2. The Markov property implies that the probability does not depend on the earlier history. Each time the Markov chain is in state 2, its transition probabilities are the same, independent of how it happened to get there. (We are assuming that the transition probabilities do not depend on either time or the previous states visited.)
2
We can analyze the evolution of the Markov chain over time (successive transitions) by looking at powers of the matrix P . In particular, the k-step transition matrix is just the kth power of the matrix P . Thus, P (k) denotes the k-step transition matrix, while P k
denotes the kth power of the transition matrix P . A main theorem is that P (k) = P k. You should look at the powers of the matrix P in MATLAB (available, for example, in
the IEOR Computer Lab in 301 Mudd) or some other suitable programming language. (Even the T I89 calculator can do matrix multiplication.) You can also access MATLAB from Butler Library and the Mathematics Building. Sample MATLAB code is given in my web-page exam- ples; see the Computational Tools link. Use the program stat.m or the program stationary.m to analyze this example.
It is important here to know how to do matrix multiplication. Suppose that A ≡ (Ai,j ) is a k × m matrix, while B ≡ (Bi,j ) is an m × n matrix. Then the matrix product AB is the k × n matrix with matrix elements
(AB)i,j ≡ m∑
p=1
Ai,pBp,j
Hence, if P is the transition matrix of an m-state DTMC, then P and P k are m × m matrices, with
P (2) i,j = P
2 i,j ≡
m∑
p=1
Pi,pPp,j
and
P (k) i,j = P
k i,j ≡
m∑
p=1
P k−1i,p Pp,j for k ≥ 2 .
Much can be done by thinking. Here are some probabilities to estimate:
P 31,1 = ?
P 71,1 = ?
P 71,9 = ?
P 72,4 = ?
P 172,8 = ?
P 171,1 = ?
P 182,2 = ?
P 171,2 = ?
P 191,2 = ?
P 184,2 = ?
P 182,2 = ?
P 181,1 = ?
P 181,7 = ?
P 183,9 = ?
P 181,5 = ?
P 185,5 = ?
3
The simple formulas that result can be seen as a consequence of periodicity and symmetry. Note that the state necessarily alternates between even and odd: You go from even to odd to even to odd, and so forth. Hence, you can go from an odd-numbered state to an odd-numbered state only in an even number of steps. Similarly, you can go from an even-numbered state to an even-numbered state in an even number of steps. This is a periodic DTMC with period 2.
One of the key ideas is that there is statistical regularity in the long run. This regularity can be seen by calculating high powers of the matrix P . Because of the periodicity, in this example we do not have P n converge as n → ∞. We would have that without the periodicity, but we have periodicity here. The matrix powers converge if we look at the powers P 2k or P 2k+1. We also obtain convergence if we look at (P 2k + P 2k+1)/2 (average of the entries). This convergence is remarkably fast. You should convince yourself by performing calculations.
In order to determine what the long-run probabilities are for this example, you can exploit symmetry. By symmetry, the long-run probabilities of being in the states 2, 4, 6 and 8 should be identical. Hence, for large k, we should have
P 2k2,2 ≈ P 2k2,8 ≈ P 2k4,2 ≈ P 2k8,8 ≈ 1 4
In fact, it may be somewhat surprising, but k need not actually be too large. You should verify this with Matlab.
Similarly, the long-run probability of being in the odd states except state 5 should be identical. The odd-numbered states are a bit tricky, but we can work from the even-numbered states:
P 2k1,1 = P 2k−1 1,2 P2,1 + P
2k−1 1,4 P4,1 ≈
( 1 4
) ( 1 3
) +
( 1 4
) ( 1 3
) =
1 6
With Markov chains, we are often interested in the long-run transition probabilities. These transition probabilities often converge as the number of transitions increases, but in this ex- ample they do not, because of the periodicity. Because of the periodicity, the probabilities do not simply converge. They alternate between positive values (which converge) and 0.
These properties that can be deduced from detailed analysis can be confirmed by looking at powers of the transition matrix P . So the probabilities above are:
P 31,1 = 0
P 71,1 = 0
P 71,9 = 0
P 72,4 = 0
P 172,8 = 0
P 171,1 = 0
P 182,2 ≈ 1/4 P 171,2 ≈ 1/4 P 191,2 ≈ 1/4 P 184,2 ≈ 1/4 P 182,2 ≈ 1/4 P 181,1 ≈ 1/6
4
P 181,7 ≈ 1/6 P 183,9 ≈ 1/6 P 181,5 ≈ 1/3 P 185,5 ≈ 1/3
For any initial probability vector α ≡ (α1, · · · , αm), αP gives the new probability vector after one transition. An important concept is the notion of a stationary probability vector. An important theorem about irreducible finite-state DTMC’s states that there exists a unique probability vector π ≡ (π1, · · · , πm) such that
π = πP .
That is a matrix equation. For the corresponding matrix elements, we have
πj = m∑
k=1
πiPi,j for all j, 1 ≤ j ≤ m .
(A DTMC is irreducible if it is possible to go from any state to any other state in some number of transitions.)
We can relate the positive limiting probability vectors to the stationary probabilities πj . These positive approximate limiting probabilities are exactly twice the elements of the station- ary probability vector π, because of the periodicity. For example, π2 = 1/8, while P 182,2 ≈ 1/4.
Looking ahead to future classes: As we will discuss later, the Markov maze here, as well as much larger mazes, can be analyzed very quickly by exploiting the structure of a random walk on a graph. The idea is to let the rooms by nodes in the graph. The doors from room to room then become arcs in the graph. There is an arc in the graph between node i and node j whenever there is a positive probability of going from room i to room j in the maze (in one step). Because of the structure of a random walk on a graph, the stationary probability of being in room i turns out to be the number of doors out of room i divided by the sum over the rooms of the number of doors out of each room. That simple formula is a consequence of the detailed balance that holds for this example. Detailed balance is equivalent to the Markov chain being time reversible. See Section 4.8 in the textbook. We will get there in due course.
Summary So Far
1. For a finite-state DTMC the model is mostly just a matrix P .
(There is also the initial distribution, a probability vector.)
2. Multi-step transition probabilities are given by powers of the transition matrix P .
3. The long-run behavior can be seen from looking at high powers of the matrix P .
4. We let Xn be the state of the Markov chain after the nth transtion; then {Xn : n ≥ 0} is the stochastic process (Markov chain). The transtion probability is
Pi,j ≡ P (Xn+1 = j|Xn = i)
5. The Markov property asserts that the conditional probability of future states, given the present states and information about past states, depends only on the present state. As a
5
formula, the Markov property states that
P (Xn+1 = j|Xn = i, X0 = i0, X1 = i1, . . . , Xn−1 = in−1) = P (Xn+1 = j|Xn = i)
for all possible past events {X0 = i0, X1 = i1, . . . , Xn−1 = in−1}.
Additional Points:
6. The long-run behavior can also be found from the stationary probabilities, assuming the Markov chain is irreducible, which it is. (We are also using the fact that the state space is finite. When the state space is infinite, we need to make extra assumptions.) The stationary probability can be found by solving π = πP . The stationary probability vector π is defined as being a probability vector such that, if we then transition according to P , the distribution after the transition is again π. That is, π is a stationary probability vector under the operator P . (In general, the matrix P is thought of as an operator because it maps probability vectors into probability vectors.
(Solving π = πP is just solving a system of linear equations. Subtle point: there is one redundant equation. The extra equation needed is obtained by the condition that the probabilities should add to 1.)
7. Often, but not always, the stationary probabilities (of an irreducible Markov chain) coincide with the long-run limiting probabilities. They do when the DTMC is aperiodic (period 1). They do not when the Markov chain is periodic. The closed maze is an example of a periodic chain (period 2).
8. If a chain is periodic with period d, then the stationary probabilities can be obtained approximately from the average of d successive high matrix powers; e.g., from (P n+1 + P n+2 + · · · + P n+d)/d for large n.
9. If a chain is periodic with period d, then the long-run limiting probabilities are obtained from the stationary probabilities, e.g., P ndi,i = dπi. The probabilities P
n i,j assume the values dπj
once and 0 d − 1 times as n increases. 10. Sample MATLAB output and programs are given on the computational tools web page
for this course.
2. Liberating Markov Mouse: The Open Maze
2.1 The First Absorbing Chain
Suppose that we add a door from room 9 to the outside, and that we allow the mouse to leave room 9 from any of the doors available to it. Now there are three doors out of room 9, one of which takes it outside. Suppose also that the mouse does not reenter the maze after it has left.
This modification significantly changes the problem. Now the mouse will eventually leave the maze with probability one. There is no interesting long-run distribution of being in the different rooms. Eventually the mouse will leave the maze.
However, we can still model the process as a Markov chain. We can add an extra state for the “outside.” We can call the outside state 10. We thus have a 10 × 10 Markov transition matrix. Now the probabilities of row 9 change: Now we have P9,6 = P9,8 = P9,10 = 1/3. And there is a 10th row with entries P10,10 = 1.0 and P10,j = 0 for all other j. We also have Pi,10 = 0 for all i < 9.
6
New questions become interesting for this absorbing Markov chain. Now we want to know about the time it takes until the mouse leaves the maze. Now we want to know about the expected number of visits to the various rooms before the mouse leaves the maze.
We now show how to compute the expected number of steps from each starting state until the mouse leaves the maze. We obtain a system of 9 equations in 9 unknowns by conditioning on what happens in the first step.
Let Ti be the time (number of steps) until the mouse leaves the maze (first enters state 10) starting in state i. We cannot find E[Ti] for one i directly, but we can solve for all of the E[Ti] in a system of equations. For example, the first equation is
E[T1] = 1 + (1/2)E[T2] + (1/2)E[T4] ,
while the second equation is
E[T2] = 1 + (1/3)E[T1] + (1/3)E[T3] + (1/3)E[T5] .
We can use MATLAB again to solve this system of equations.
2.2 The Second Absorbing Markov Chain
We now consider a larger example with more possible absorbing states. Now we put a door leading out of the maze from rooms 3, 7 and 9. The probabilities of leaving through each of these doors is 1/3. As before, we let the mouse choose each of the available doors with equal probability. Wherever the mouse leaves the original 9 rooms, we assume that the mouse cannot return. He stays outside. Here is the new picture:
The Maze with Outside Rooms
1 2 3
4
7 98
5 6
10
12
11
We now pay attention to the door from which the mouse leaves the maze. We thus obtain the 12-state absorbing Markov chain with 3 absorbing states. We say that the Markov chain
7
enters state 10 if the mouse leaves through the door out of room 3; we say that the Markov chain enters state 11 if the mouse leaves through the door out of room 7; we say that the Markov chain enters state 12 if the mouse leaves through the door out of room 9. We let the mouse stay outside the maze when it leaves, so the Markov chain is absorbing: we have P10,10 = P11,11 = P12,12 = 1. Here is the new transition matrix:
P =
1 2 3 4 5 6 7 8 9 10 11 12
0 1/2 0 1/2 0 0 0 0 0 0 0 0 1/3 0 1/3 0 1/3 0 0 0 0 0 0 0 0 1/3 0 0 0 1/3 0 0 0 1/3 0 0
1/3 0 0 0 1/3 0 1/3 0 0 0 0 0 0 1/4 0 1/4 0 1/4 0 1/4 0 0 0 0 0 0 1/3 1/3 0 0 0 1/3 0 0 0 0 0 0 1/3 0 0 0 1/3 0 0 1/3 0 0 0 0 0 1/3 0 1/3 0 1/3 0 0 0 0 0 0 0 0 1/3 0 1/3 0 0 0 1/3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
.
This is an absorbing Markov chain, which is reducible. We analyze it in a different way than we analyze an irreducible Markov chain. Use the program absorbing.m on this example.
2.3. Analyzing an Absorbing Chain
We now indicate how to analyze an absorbing Markov chain. This analysis applies to the absorbing Markov chain we have just defined, but also to other absorbing Markov chains. We first label the states so that all the absorbing states appear first, and then afterwards we put the transient states (the states that we will eventually leave, never to return. The transition matrix then has the block matrix form
P = (
I 0 R Q
) ,
where I is an identity matrix (1’s on the diagonal and 0’s elsewhere) and 0 (zero) is a matrix of zeros. In this case, I would be 3 × 3, R is 9 × 3 and Q is 9 × 9). The matrix Q describes the probabilities of motion among the transient states, while the matrix R gives the probabilities of absorption in one step (going from one of the transient states to one of the absorbing states in a single step). In general Q would be square, say m by m, while R would be m by k, and I would be k by k.
2.3.1 The Fundamental Matrix N
First suppose that we want to calculate the expected number of times the chain spends in transient state j starting in transient state i. Let Ti,j be the total number times and let Ni,j ≡ E[Ti,j ] be the expected number of times. It is convenient to write
Ti,j = T (0) i,j + T
(1) i,j + T
(2) i,j + T
(3) i,j + T
(4) i,j + T
(5) i,j + · · ·
where T (k)i,j is the number of times at the k th transition. Clearly, T (k)i,j is a random variable
that is either 1 (if the chain is in transient state j on the kth transition) or 0 (otherwise). By
8
definition, we say that T (0)i,j = 1 if i = j, but = 0 otherwise. Since these random variables assume only the values 0 and 1, we have
Ni,j ≡ E[Ti,j ] = E[T (0)i,j + T (1) i,j + T
(2) i,j + T
(3) i,j + T
(4) i,j + T
(5) i,j + · · ·]
= E[T (0)i,j ] + E[T (1) i,j ] + E[T
(2) i,j ] + E[T
(3) i,j ] + E[T
(4) i,j ] + E[T
(5) i,j ] + · · ·
= P (T (0)i,j = 1) + P (T (1) i,j = 1) + P (T
(2) i,j = 1) + P (T
(3) i,j = 1) + P (T
(4) i,j = 1) + · · ·
= Q0i,j + Q 1 i,j + Q
2 i,j + Q
3 i,j + Q
4 i,j + Q
5 i,j + · · ·
To summarize, Ni,j ≡ Q(0)i,j + Q
(1) i,j + Q
(2) i,j + Q
(3) i,j + · · · .
In matrix form, we have
N = Q(0) + Q(1) + Q(2) + Q(3) + · · · = I + Q + Q2 + Q3 + · · ·
where the identity matrix I here has the same dimension m as Q. (Since Q is the submatrix corresponding to the transient states, Qn → 0 as n → ∞, where here 0 is understood to be a matrix of zeros.)
Multiplying by (I −Q) on both sides, we get a simple formula, because there is cancellation on the righthand side. In particular, we get
(I − Q) ∗ N = I ,
so that, multiplying on the left by the inverse (I − Q)−1, which can be shown to exist, yields
N = (I − Q)−1
In MATLAB, we would write N = inv(I − Q). The matrix N = (I − Q)−1 is called the fundamental matrix of an absorbing Markov chain.
We can be a little more careful and write
Nn ≡ I + Q + Q2 + Q3 + · · · + Qn .
Then the cancellation yields (I − Q)Nn = I − Qn+1 .
We use the fact that Q is the part of P corresponding to the transient states, so that Qn
converges to a matrix of zeros as n → ∞. Hence, (I − Q)Nn → I as n → ∞. Writing
(I − Q)N = I ,
we see that both N and I − Q are nonsingular, and thus invertible, yielding N = (I − Q)−1, as stated above.
2.3.2 Other Quantities of Interest: M and B
Other quantities of interest can be computed using the fundamental matrix N , as we now show.
Let Mi be the expected number of steps until absorption starting in transient state i and let M be the m×1 column vector with elements Mi. The total number of steps until absorption is
9
the sum of the numbers of steps spent in each of the transient states before absorption (always starting in transient state i). Hence,
Mi = Ni,1 + Ni,2 + · · · + Ni,m , assuming, as before, that there are m transient states. In matrix form,
M = N ∗ w , where w is a m × 1 column vector of ones.
Let Bi,l be the probability of being absorbed in absorbing state l starting in transient state i. Breaking up the overall probability into the sum of the probabilities of being absorbed in state l in each of the possible steps, we get
Bi,l = Ri,l + (Q ∗ R)i,l + (Q2 ∗ R)i,l + · · · so that
B = R + Q ∗ R + Q2 ∗ R + Q3 ∗ R + · · · = (I + Q + Q2 + · · ·) ∗ R = N ∗ R .
Hence, B = N R, where N is the fundamental matrix above. In summary, it is easy to compute the matrices N , M and B describing the evolution of
the absorbing Markov chain, given the key model elements - the matrices Q and R. You should do the escaping Markov mouse example using MATLAB. The MATLAB pro-
gram absorbing.m does that for you. The data and the program are on my web page on the computational-tools page.
For some related material, see Sections 4.5.1 and 4.6 of Ross.
2.3.3. Summary
1. There are two kinds of basic Markov chains, those considered in so far in these notes.
2. The first kind, illustrated by the closed maze, is an irreducible Markov chain. It has a unique stationary distribution, obtained by solving π = πP . We have considered a DTMC with a finite state space. There are issues about what happens with an infinite state space. With an infinite state space, such as the nonnegative integers, it is possible that the DTMC is transient or null-recurrent instead of positive recurrent. With a finite state space, the irreducible DTMC is always positive recurrent, in which case the stationary probability vector π is proper (sums to 1). In the other two cases, it is the 0 vector.
3. The second kind, illustrated by the open maze (the escaping mouse), is an absorbing Markov chain.
4. You ask different questions for the two kinds of chains. (And so, there are different tools and different answers.)
5. There are even more complicated Markov chains, but we usually decompose them and analyze component chains of the two types above.
6. Finite matrices can be applied only when the state space is finite. However, the main story (and results) go over to infinite-state Markov chains. Indeed, Ross discusses the more
10
general case from the beginning. A simple example is a simple random walk, in which you go to the right or left at each step, each with probability 1/2.
3. Classification of States
See Section 4.3 of the textbook.
3.1 Concepts:
1. State j is accessible from state i if it is possible to get to j from i in some finite number of steps. (notation: i à j)
2. States i and j communicate if both j is accessible from i and i is accessible from j. (notation: i ! j)
3. A subset A of states in the Markov chain is a communication class if every pair of states in the subset communicate.
4. A communication class A of states in the Markov chain is closed if no state outside the class is accessible from a state in the class.
5. A communication class A of states in the Markov chain is open if it is not closed; i.e., if it is possible for the Markov chain to leave that communicating class.
6. A Markov chain is irreducible if the entire chain is a single communicating class.
7. A Markov chain is reducible if there are two or more communication classes in the chain; i.e., if it is not irreducible.
8. A Markov chain transition matrix P is in canonical form if the states are re-labelled (re-ordered) so that the states within closed communication classes appear together first, and then afterwards the states in open communicating classes appear together. The recurrent states appear at the top; the transient states appear below. The states within a communication class appear next to each other.
9. State j is a recurrent state if, starting in state j, the Markov chain returns to state j with probability 1
10. State j is a transient state if, starting in state j, the Markov chain returns to state j with probability < 1; i.e., if the state is not recurrent.
11. State j is a positive-recurrent state if the state is recurrent and if, starting in state j, the expected time to return to that state is finite.
12. State j is a null-recurrent state if the state is recurrent but, , starting in state j, the expected time to return to state j is infinite.
3.2. Canonical Form for a Probability Transition Matrix
Find the canonical form of the following Markov chain transition matrix:
(a)
P =
0.1 0.0 0.0 0.9 0.0 0.0 0.4 0.0 0.0 0.6 0.3 0.3 0.0 0.4 0.0 0.3 0.0 0.0 0.7 0.0 0.0 0.7 0.0 0.0 0.3
11
————————————————-
Notice that the sets {1, 4} and {2, 5} are closed communicating classes containing recurrent states, while {3} is an open communicating class containing a transient state.
So you should reorder the states according to the order: 1, 4, 2, 5, 3. The order 2, 5, 1, 4, 3 would be OK too, as would 5, 2, 4, 1, 3. We put the recurrent states first and the transient states last. We group the recurrent states together according to their communicating class. Using the first order - 1, 4, 2, 5, 3 - you get
P =
0.1 0.9 0.0 0.0 0.0 0.3 0.7 0.0 0.0 0.0 0.0 0.0 0.4 0.6 0.0 0.0 0.0 0.7 0.3 0.0 0.3 0.4 0.3 0.0 0.0
—————————————————
Notice that the canonical form here has the structure:
P =
P1 0 0 0 P2 0
R1 R2 Q
,
where P1 and P2 are 2 × 2 Markov chain transition matrices in their own right, whereas Ri is the one-step transition probabilities from the single transient state to the ith closed set. In this case, Q ≡ (0) is the 1 × 1 sub-matrix representing the transition probabilities among the transient states. Here there is only a single transient state and the transition probability from that state to itself is 0. The chain leaves that transient state immediately, never to return..
4. Some Underlying Theory: Optional Extra Stuff For those who have a good mathematics background and might be looking for a bit more...
4.1 The limit for aperiodic irreducible finite-state DTMC’s
There is a nice simple limit for aperiodic irreducible finite-state Markov chains. For any initial probability vector u ≡ (u1, . . . , un), the probability vector at time n is
π (n) j ≡ P (Xn = j) = (uP n)j =
n∑
i=1
uiP n i,j ;
i.e., in matrix notation, π(n) = uP n .
The key limiting result is
Theorem 0.1 If P is the transition matrix of an aperiodic irreducible m-state Markov chain with transition matrix P , then, for any initial probability vector u,
π(n) = uP n → π as n → ∞ ,
12
where the limiting probability vector π is the unique stationary probability vector, i.e., the unique solution to the fixed-point equation
π = πP or πj = n∑
i=1
πiPi,j for all j
satisfying the condition m∑
i=1
πi = 1 .
4.2 The Contraction Proof
One way to prove the theorem is to consider the transition matrix P as an operator on the space of all probability vectors. An operator on a space maps the space into itself. If u is a probability vector, then P maps u into the probability vector uP , corresponding to the probability vector starting with u and then taking one step according to P , i.e.,
(uP )j = n∑
i=1
uiPi,j for all j .
We want the underlying space to be a complete metric space and the operator to be a contrac- tion map. Then we can apply the Banach fixed-point theorem, also called the Banach-Picard fixed-point theorem or the contraction fixed-point theorem; see the book Principles of Real Analysis by Walter Rudin for example. Those pages are posted online.
The proof can be done in two steps:
step 1. In the first step, you prove the some power of P has all positive entries (using the assumption
that P is an m×m transition matrix of an irreducible aperiodic Markov chain). It can be shown that the minimum such power for an m×m matrix is less than or equal to m2−2m+2 ≤ m2; see pages 58-59 of E. Seneta, Nonnegative Matrices and Markov Chains, second edition, Springer, New York, 1981. The minimum such power is called the index of primitivity.
The worst case occurs when the transition probabilities satisfy: Pi,i+1 = 1 for 1 ≤ i ≤ m−1, but then Pm,1 > 0 and Pm,2 > 0, but Pm,i = 0 for all i ≥ 3.
step 2. In the second step, you assume that one column of P has all positive elements. (By step
1, that will necessarily occur after at most m − 1 steps.) We then want to show that P , regarded as an operator, is a contraction map, assuming
that at least one column of P has all positive elements. That implies that there exists a unique fixed point and that there is convergence to that fixed point at a geometric rate: For any initial probability vector u ≡ (u1, . . . , un),
d(uP k, π) ≤ ckd(u, π) for all k ,
for the metric d. That yields a geometric rate of convergence. In Markov-chain theory we speak of geometric ergodicity.
Finding the appropriate norm on Rn.
13
The space we will be looking at is a subset of Rn, containing all probability measures and their differences. We will use a distance defined via a standard norm. That means that the distance is
d(x, y) ≡ ||x − y|| , where || · || is the norm. We will consider a standard norm on Rn; the question is which one. Some norms do not work, but one does.
Consider the transition matrix:
P =
0.95 0.01 0.01 0.01 0.01 0.01 0.95 0.01 0.01 0.01 0.01 0.01 0.95 0.01 0.01 0.01 0.01 0.01 0.01 0.95 0.01 0.01 0.01 0.01 0.01 0.95 0.01 0.01 0.01 0.01 0.01 0.95 0.01 0.01 0.01 0.01
and the two probability vectors:
u = (1/3, 1/3, 1/3, 0, 0, 0)
and v = (0, 0, 0, 1/3, 1/3, 1/3) .
Then consider the probability vectors uP and vP . We want to have
||uP − vP|| ≤ c||u − v|| , where 0 ≤ c < 1, for some norm || · ||; then the desired distance is d(u, v) ≡ ||u − v||.
Look at the two vectors we get when we apply P to u and v:
uP = (0.95, 0.01, 0.01, 0.01, 0.01, 0.01)
and vP = (0.01, 0.95, 0.01, 0.01, 0.01, 0.01) .
There are three natural norms to consider on Rn: the l∞, l2 and l1 norms:
||x||∞ ≡ max {|xi|} ,
||x||2 ≡ √√√√
( n∑
i=1
|xi|2 )
,
||x||1 ≡ n∑
i=1
|xi| . (1)
It is not difficult to see that
||u − v||∞ = 1/3 , ||u − v||2 =
√ 6/9 =
√ 2/3 < 1 ,
||u − v||1 = 2 (2) and it is not difficult to see that
||uP − vP ||∞ = 0.94 , ||uP − vP||2 =
√ 2 × (0.94)2 = 1.329 ,
||uP − vP||1 = 2 × 0.94 = 1.88 . (3)
14
By this example we prove that the norms ||x||∞ and ||x||2 do not work. However, it turns out that the norm ||x||1 does work.
Theorem 0.2 Let P be a n×n Markov-chain transition matrix associated with an irreducible Markov chain. Assume that Pi,1 ≥ ² > 0 for all i, 1 ≤ i ≤ n. Then
||uP − vP||1 ≤ (1 − ²)||u − v||1 .
Proof. Note that
||uP − vP||1 = n∑
j=1
| n∑
i=1
uiPi,j − n∑
i=1
viPi,j| .
Now write Pi,1 ≡ ² + Qi,1 and Pi,j ≡ Qi,j for j 6= i , for all i .
Then Q is a nonnegative n × n matrix with row sums 1 − ². Now observe that n∑
j=1
| n∑
i=1
uiPi,j − n∑
i=1
viPi,j| = | n∑
i=1
ui(² + Qi,1) − n∑
i=1
vi(² + Qi,1)| + n∑
j=2
| n∑
i=1
uiQi,j − n∑
i=1
viQi,j|
= | n∑
i=1
uiQi,1 − n∑
i=1
viQi,1| + n∑
j=2
| n∑
i=1
uiQi,j − n∑
i=1
viQi,j|
= n∑
j=1
| n∑
i=1
uiQi,j − n∑
i=1
viQi,j|
≤ n∑
j=1
n∑
i=1
|ui − vi|Qi,j = n∑
i=1
n∑
j=1
|ui − vi|Qi,j = (1 − ²)||u − v||1 .(4)
with the equality in the second line holding because the epsilon terms can be dropped out, using
n∑
i=1
(ui − vi) = 0 ,
because u and v are probability vectors, summing to 1.
15