STOCHASTIC PROCESSES

KevinLiu


EC505

STOCHASTIC PROCESSES

Class Notes

Boston University College of Engineering 8 St. Mary’s Street

Boston, MA 02215

Fall 2019

Contents

1 Introduction to Probability
  1.1 Axioms of Probability
  1.2 Conditional Probability and Independence of Events
  1.3 Random Variables
  1.4 Characterization of Random Variables
  1.5 Important Random Variables
    1.5.1 Discrete-valued random variables
    1.5.2 Continuous-valued random variables
  1.6 Transformations of a Random Variable
    1.6.1 Method of equivalent events
    1.6.2 Jacobian method
  1.7 Pairs of Random Variables
  1.8 Conditional Probabilities, Densities, and Expectations
  1.9 Random Vectors
    1.9.1 Transformation of random vectors
    1.9.2 Expectations of functions of a random vector
  1.10 Properties of the Covariance Matrix
  1.11 Gaussian Random Vectors
  1.12 Inequalities for Random Variables
    1.12.1 Markov inequality
    1.12.2 Chebyshev inequality
    1.12.3 Chernoff Inequality
    1.12.4 Jensen’s Inequality
    1.12.5 Moment Inequalities

2 Sequences of Random Variables
  2.1 Convergence Concepts for Random Sequences
  2.2 The Central Limit Theorem and the Law of Large Numbers
  2.3 Advanced Topics in Convergence
  2.4 Martingale Sequences
  2.5 Extensions of the Law of Large Numbers and the CLT
  2.6 Large Deviations
  2.7 Spaces of Random Variables

3 Estimation of Parameters
  3.1 Introduction
  3.2 Quick Review of Random Vectors
  3.3 General Bayesian Estimation
    3.3.1 General Bayes Decision Rule
    3.3.2 General Bayes Decision Rule Performance
  3.4 Bayes Least Square Estimation
  3.5 The Orthogonality Principle for Least Squares Estimation
  3.6 Bayes Maximum A Posteriori (MAP) Estimation
  3.7 Bayes Absolute Error Estimation
  3.8 Bayes Linear Least Square (LLSE) Estimation
  3.9 Nonrandom Parameter Estimation
    3.9.1 Cramer-Rao Bound
    3.9.2 Maximum-Likelihood Estimation
    3.9.3 Comparison to MAP estimation

4 Recursive LLSE: The Kalman Filter
  4.1 Introduction
  4.2 Historical Context
  4.3 Recursive Estimation of a Random Vector
  4.4 The Discrete-Time Kalman Filter
    4.4.1 Initialization
    4.4.2 Measurement Update Step
    4.4.3 Prediction Step
    4.4.4 Summary
    4.4.5 Additional Points
    4.4.6 Example

5 Detection Theory
  5.1 Bayesian Binary Hypothesis Testing
    5.1.1 Bayes Risk Approach and the Likelihood Ratio Test
    5.1.2 Special Cases
    5.1.3 Examples
  5.2 Performance and the Receiver Operating Characteristic
    5.2.1 Properties of the ROC
    5.2.2 Detection Based on Discrete-Valued Random Variables
  5.3 Other Threshold Strategies
    5.3.1 Minimax Hypothesis Testing
    5.3.2 Neyman-Pearson Hypothesis Testing
  5.4 M-ary Hypothesis Testing
    5.4.1 Special Cases
    5.4.2 Examples
    5.4.3 M-Ary Performance Calculations
  5.5 Gaussian Examples

6 Stochastic Processes and their Characterization
  6.1 Introduction
  6.2 Complete Characterization of Stochastic Processes
  6.3 First and Second-Order Moments of Stochastic Processes
  6.4 Special Classes of Stochastic Processes
  6.5 Examples of Stochastic Processes
    6.5.1 The Random Walk
    6.5.2 The Poisson Process
    6.5.3 Digital Modulation: Phase-Shift Keying
    6.5.4 The Random Telegraph Process
    6.5.5 The Wiener Process and Brownian Motion
  6.6 Stationarity of Stochastic Processes
  6.7 Moment Functions of Vector Processes
  6.8 Moments of Wide-sense Stationary Processes
  6.9 Power Spectral Density of Wide-Sense Stationary Processes

7 Discrete State Markov Processes
  7.1 Discrete-time, Discrete Valued Markov Processes
    7.1.1 Process Description
    7.1.2 Hitting probabilities and mean hitting times
    7.1.3 Steady state behavior of discrete time Markov chains
  7.2 Continuous-Time, Finite Valued Markov Processes
    7.2.1 Process Description
    7.2.2 Hitting probabilities and mean hitting times
    7.2.3 Steady state behavior of continuous time Markov chains
  7.3 Birth-Death Processes
  7.4 Queuing Systems
  7.5 Inhomogeneous Poisson Processes
  7.6 Applications of Poisson Processes

8 Mean-Square Calculus for Stochastic Processes
  8.1 Continuity of Stochastic Processes
  8.2 Mean-Square Differentiation
  8.3 Mean-Square Integration
  8.4 Integration and Differentiation of Gaussian Processes
  8.5 Generalized Mean-Square Calculus
  8.6 Ergodicity of Stationary Random Processes

9 Linear Systems and Stochastic Processes
  9.1 Introduction
  9.2 Review of Continuous-time Linear Systems
  9.3 Review of Discrete-time Linear Systems
  9.4 Extensions to Multivariable Systems
  9.5 Second-order Statistics for Vector-Valued Wide-Sense Stationary Processes
  9.6 Continuous-time Linear Systems with Random Inputs

10 LLSE Estimation of Stochastic Processes and Wiener Filtering
  10.1 Introduction
  10.2 Historical Context
  10.3 LLSE Problem Solution: The Wiener-Hopf Equation
  10.4 Wiener Filtering
    10.4.1 Noncausal Wiener Filtering (Wiener Smoothing)
    10.4.2 Causal Wiener Filtering
    10.4.3 Summary

11 Series Expansions and Detection of Stochastic Processes
  11.1 Deterministic Functions
  11.2 Series Expansion of Stochastic Processes
  11.3 Detection of Known Signals in Additive White Noise
  11.4 Detection of Unknown Signals in White Noise
  11.5 Detection of Known Signals in Colored Noise

A Useful Transforms

B Partial-Fraction Expansions
  B.1 Continuous-Time Signals
  B.2 Discrete-Time Signals

C Summary of Linear Algebra
  C.1 Vectors and Matrices
  C.2 Matrix Inverses and Determinants
  C.3 Eigenvalues and Eigenvectors
  C.4 Similarity Transformation
  C.5 Positive-Definite Matrices
  C.6 Subspaces
  C.7 Vector Calculus

D The non-zero mean case

Chapter 1

Introduction to Probability

What is probability theory? It is an axiomatic theory which describes and predicts the outcomes of inexact, repeated experiments. Note the emphases in the above definition. The basis of probabilistic analysis is to determine or estimate the probabilities that certain known events occur, and then to use the axioms of probability theory to combine this information to derive probabilities of other events of interest, and to predict the outcomes of certain experiments.

For example, consider any card game. The inexact experiment is the shuffling of a deck of cards, with the outcome being the order in which the cards appear. An estimate of the underlying probabilities would be that all orderings are equally likely; the underlying events would then be assigned a given probability.

Based on the underlying probability of the events, you may wish to compute the probability that, if you are playing alone against a dealer, you would win a hand of blackjack. Certain orderings of the cards lead to winning hands, and the probability of winning can be computed from the combined information on the orderings.

There are several interpretations of what we mean by the probability of an event occurring. The frequentist interpretation is that, if an experiment is repeated an infinite number of times, the fraction of experiments in which the event occurs is its probability. On the other hand, the subjectivist interpretation is that a probability represents an individual belief that a certain event will occur. This interpretation is most appropriate when experiments cannot be repeated, such as in economics and social situations. Independent of which interpretation is used, the same axiomatic theory is used for manipulating probabilities.

In this chapter, we review some of the key background concepts in probability theory.

1.1 Axioms of Probability

A probability space is a triple (Ω, F, P) which is used to describe the outcomes of a random experiment. The set Ω is the set of all possible elementary experiment outcomes ω. For any set A ⊂ Ω, we use the notation A^c to denote the complement of A, that is, A^c = Ω − A, where B − A = {x | x ∈ B, x ∉ A}.

The set F is a collection of subsets of Ω which satisfies the following axioms:

1. Ω ∈ F

2. If A ∈ F, then A^c ∈ F

3. If A, B ∈ F, then A ∪ B ∈ F. Furthermore, if A_i ∈ F for i = 1, 2, . . ., then ∪_{i=1}^∞ A_i ∈ F

4. If A, B ∈ F, then A − B ≡ {x ∈ A | x ∉ B} ∈ F

To show the last result, note A − B = A ∩ B^c, which is the intersection of two elements of F. Thus, the set F is closed under countable unions and complementation. The set F is called a σ-field (or σ-algebra) because of its closure under countable union, and is referred to as the set of events. An element A ∈ F is called an event.

Note that the above properties also imply that F is closed under countable intersections. If F is a σ-field, and A, B ∈ F, then A ∩ B ∈ F, because A ∩ B = (A^c ∪ B^c)^c. Using similar reasoning, if A_1, A_2, . . . is a sequence of sets in F, then ∩_{i=1}^∞ A_i ∈ F.


Events A_i indexed by a set I are called mutually exclusive if A_i ∩ A_j = ∅ for all i, j ∈ I, i ≠ j.

The measure P assigns a probability value in [0, 1] to each event contained in F; that is, it maps the set of events into the closed unit interval [0, 1]. Furthermore, the probability measure has some important properties, described below.

The axioms which a probability measure must satisfy are:

1. P(Ω) = 1.

2. P(A) ≥ 0 for all A ∈F.

3. P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i) if A_i ∩ A_j = ∅ for all i ≠ j (that is, A_i, i = 1, 2, . . . is a collection of mutually exclusive events). This property is called the countable additivity property of the probability measure.

Based on the above properties, probability measures can be shown to satisfy additional properties, such as:

1. P(A) = 1 − P(A^c).

2. P(∅) = 0.

3. P(A∪B) = P(A) + P(B) −P(A∩B).

4. If B ⊂ A, then P(B) ≤ P(A).

5. P(A∪B ∪C) = P(A) + P(B) + P(C) −P(A∩B) −P(A∩C) −P(B ∩C) + P(A∩B ∩C)
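The derived properties above can be checked numerically on a small finite probability space. The following is a minimal sketch, assuming a hypothetical experiment (one roll of a fair six-sided die) and illustrative events `A`, `B`, `C` that do not appear in the notes; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction

# Hypothetical finite sample space: one roll of a fair six-sided die.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Uniform probability measure on OMEGA (an assumption of this sketch)."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({1, 2, 3})   # roll is at most 3
B = frozenset({2, 4, 6})   # roll is even
C = frozenset({3, 6})      # roll is a multiple of 3

# Property 1: P(A) = 1 - P(A^c)
complement_ok = prob(A) == 1 - prob(OMEGA - A)

# Property 3: P(A u B) = P(A) + P(B) - P(A n B)
union_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Property 5: inclusion-exclusion for three events
three_ok = prob(A | B | C) == (
    prob(A) + prob(B) + prob(C)
    - prob(A & B) - prob(A & C) - prob(B & C)
    + prob(A & B & C)
)

print(complement_ok, union_ok, three_ok)
```

A check of this kind only confirms the identities for one particular measure; the general statements follow from the axioms.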

Consider again the example of a shuffle of a deck of cards. The outcomes are the possible orderings. Events are combinations of outcomes; for example, an event may be the set of all orderings such that the ace of spades is the first card.

For another example, consider the toss of a fair coin, with outcomes H and T. The set of outcomes is Ω = {H, T}. The σ-field F is

F = {{H}, {T}, ∅, {H, T}}

If the coin is fair, the measure P will have the following properties:

P({H}) = 1/2; P({T}) = 1/2; P({H, T}) = 1; P(∅) = 0.

Consider a third example where Ω = [0, 1], the unit interval. Let the set F contain every interval of the form [a, b], where a ≤ b are points in Ω. These intervals alone do not form a σ-field, but they can serve as the beginning of a description for the set F. Note that individual points x can be represented as intervals [x, x]. Now, define the remainder of F to be the sets that can be obtained by complementation and countable unions of these intervals, recursively, as required by the axioms of probability. That is, the set F is the smallest σ-field that contains all of the closed intervals. Note that every open interval (a, b), a < b, is also contained in F, because it can be written as the complement of the union of two closed intervals [0, a] and [b, 1].

We could have also defined this σ-field starting from all the open subsets (a,b), 0 ≤ a < b ≤ 1. The smallest σ-field that contains all the open subsets of Ω is called the Borel σ-field. Note that every isolated point x will be a member of this σ-field because it can be written as the intersection of open intervals around x. Hence, every closed interval will also be an element of the Borel σ-field, because it can be written as the union of an open interval and its two endpoints. We refer to elements of the Borel σ-field as Borel sets. Note that not all subsets of Ω will be Borel sets, although most interesting subsets we encounter are likely to be Borel sets.

Why do we put such an emphasis on Borel sets? Primarily, because we would like to define the measure P for intervals, and extend the definition of the measure to the rest of the Borel sets. For an open interval (a, b), we define the measure as its length:

P((a,b)) = b−a

We can now use the axioms of probability to extend this definition to all Borel sets, because every Borel set can be written as complements and countable unions of intervals.

An important property that is needed for this extension is the continuity of probability. If we have a sequence A_1 ⊂ A_2 ⊂ ··· of increasing sets in F, the sequence A_j converges to the union ∪_{i=1}^∞ A_i. Will the probabilities converge also? The following lemma shows it to be true:


Lemma 1.1 Suppose A_1, A_2, . . . is a sequence of events in F. Then,

1. If A_1 ⊂ A_2 ⊂ ···, then lim_{k→∞} P(A_k) = P(∪_{k=1}^∞ A_k).

2. If A_1 ⊃ A_2 ⊃ ···, then lim_{k→∞} P(A_k) = P(∩_{k=1}^∞ A_k).

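Proof: For the first part, let D_1 = A_1 and D_k = A_k − A_{k−1} for k ≥ 2. Note that D_k ∈ F, and that the sets D_1, D_2, D_3, . . . are mutually exclusive. Then, by the axioms of probability,

$$P(A_k) = P\Big(\bigcup_{j=1}^{k} A_j\Big) = P\Big(\bigcup_{j=1}^{k} D_j\Big) = \sum_{j=1}^{k} P(D_j).$$

Note that P(A_k) is an increasing sequence of numbers bounded by 1, hence it has a limit. Since ∪_{k=1}^∞ A_k = ∪_{k=1}^∞ D_k, countable additivity gives

$$P\Big(\bigcup_{k=1}^{\infty} A_k\Big) = P\Big(\bigcup_{k=1}^{\infty} D_k\Big) = \sum_{j=1}^{\infty} P(D_j) = \lim_{k\to\infty} \sum_{j=1}^{k} P(D_j) = \lim_{k\to\infty} P(A_k).$$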

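For the second part, consider the sets B_k = A_k^c, which form an increasing sequence B_1 ⊂ B_2 ⊂ ···. By the first part, we know

$$\lim_{k\to\infty} P(B_k) = P\Big(\bigcup_{k=1}^{\infty} B_k\Big).$$

Now, note that ∩_{k=1}^∞ A_k = (∪_{k=1}^∞ B_k)^c, so

$$P\Big(\bigcap_{k=1}^{\infty} A_k\Big) = 1 - P\Big(\bigcup_{k=1}^{\infty} B_k\Big) = 1 - \lim_{k\to\infty} P(B_k) = \lim_{k\to\infty} P(A_k).$$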
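As a small worked illustration of Lemma 1.1, take Ω = [0, 1] with the length measure defined earlier. The particular sets A_k and C_k below are chosen here for illustration and do not appear in the notes.

```latex
% Increasing case: A_k = [0, 1 - 1/k], so A_1 \subset A_2 \subset \cdots
\[
P(A_k) = 1 - \tfrac{1}{k} \;\longrightarrow\; 1
       = P\big([0,1)\big) = P\Big(\bigcup_{k=1}^{\infty} A_k\Big).
\]
% Decreasing case: C_k = [0, 1/k], so C_1 \supset C_2 \supset \cdots
\[
P(C_k) = \tfrac{1}{k} \;\longrightarrow\; 0
       = P\big(\{0\}\big) = P\Big(\bigcap_{k=1}^{\infty} C_k\Big).
\]
```

In both cases the limit of the probabilities matches the probability of the limiting set, as the lemma asserts.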

Why is the concept of event needed over and above the concept of outcome? For experiments where the set of outcomes is discrete and finite, such as a shuffle of a deck of cards, one can consider only the set of outcomes, as each outcome can be an individual event. However, there are many situations where we want to model the set of possible outcomes as continuous, rather than discrete; for instance, the experiment of picking a random number in the interval [0, 1]. In such cases, it is often impossible to associate non-zero probabilities with individual outcomes; indeed, there are an uncountable number of outcomes, and none of the axioms of probability can be used to combine the probabilities of an uncountable number of outcomes. By the laws of probability, there are at most n mutually exclusive events which have probability of at least 1/n. Thus, by defining the probability measure on events rather than on an uncountable number of outcomes, we can focus our definition on the significant outcomes of experiments, and also provide a meaningful way of combining probabilities.

Another important issue is that not every subset of Ω can be an event, because it is not possible to assign a probability to each subset in a manner which is consistent with the axioms of probability measures. For instance, consider the following construction of a subset of [0, 1]: call two points x, y ∈ [0, 1] equivalent if (x − y) mod 1 is a rational number, and let A_s denote the resulting equivalence classes, so that (x − y) mod 1 is rational for all x, y ∈ A_s. Construct the set B = {one element from each distinct A_s}. Let the probability measure P be the uniform measure, such that P([a, b)) = b − a for 0 ≤ a ≤ b ≤ 1. Now, note the following properties of the constructed sets: ∪_s A_s = [0, 1], and A_s ∩ A_t = ∅ if A_s ≠ A_t.

Denote the translation of B by r as B + r = {y | y = (x + r) mod 1, x ∈ B}, and let B_i = B + r_i for each rational number r_i in [0, 1). Note the following:

1. There are a countable number of Bi.

2. B_i ∩ B_j = ∅ if i ≠ j, because B contains one and only one element from each A_s. If this were not true, there would be x, y ∈ B, x ≠ y, such that x + r_i = y + r_j, which would imply that x, y ∈ A_s for some s, contradicting the construction of B.

3. ∪_{i=1}^∞ B_i = ∪_s A_s = Ω.

Now, consider our dilemma: if B were an event, what probability would we assign to B? Clearly, by construction, P(B_i) = P(B_j) = P(B). So, if P(B) ≠ 0, then, since the B_i are mutually disjoint, P(Ω) = ∞. If P(B) = 0, then P(Ω) = 0 also! Thus, we have a set to which we cannot assign a probability that is compatible with the axioms of probability theory and the definition of the uniform probability measure.

Other useful properties of σ-fields are:


1. A σ-field F′ is said to be a refinement of F (written as F′ < F) if and only if every event A ∈ F′ is also an event in F.

2. Given a collection of sets {A_i}, A_i ⊂ Ω, there exists a smallest σ-field which contains the sets {A_i}, denoted by σ({A_i}).

That last property is what we used to define Borel sets over the unit interval. The definition of Borel sets can be generalized to the real line, or n-dimensional Euclidean spaces.
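On a finite set Ω, the smallest σ-field containing a given collection can be computed directly by repeatedly closing the collection under complement and union. The following is a minimal sketch under that finiteness assumption; the function name `generated_sigma_field` and the four-point example are hypothetical and introduced only for illustration.

```python
from itertools import combinations

def generated_sigma_field(omega, generators):
    """Smallest collection containing `generators` that is closed under
    complement and union. On a finite omega this equals sigma(generators)."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(field)
        # Close under complementation.
        for a in current:
            if omega - a not in field:
                field.add(omega - a)
                changed = True
        # Close under pairwise (hence finite) unions.
        for a, b in combinations(current, 2):
            if a | b not in field:
                field.add(a | b)
                changed = True
    return field

# The coin example: generating from {H} recovers all four events.
coin = generated_sigma_field({"H", "T"}, [{"H"}])

# Hypothetical four-point space generated by the singletons {1} and {2}.
small = generated_sigma_field({1, 2, 3, 4}, [{1}, {2}])
print(len(coin), len(small))
```

In the second example the outcomes 3 and 4 are never separated by the generators, so they appear only together, as the set {3, 4}, in the generated σ-field.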

As a final note, in any probability space, there can be numerous events which have zero probability of occurring. Thus, the difference between two events is often negligible; in such cases, we would like to define a notion of equivalence of events. This notion is stated as follows: two events A, B ∈ F are said to be equal with probability one if and only if P((A ∪ B) − (A ∩ B)) = 0.

1.2 Conditional Probability and Independence of Events

Consider a probability space, and a pair of events A,B ∈F such that P(B) > 0. We define the conditional probability of event A given that B has occurred as:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \tag{1.1}$$

Note that this is not defined when P(B) = 0. Such events have no probability of being observed in practice, which leads to the lack of a definition.

Conditional probability functions have an interesting property: they are also probability measures, and a conditional probability space can be defined! In particular, let P(·|B) denote this probability measure. Then, {Ω,F,P(·|B)} is a probability space.

The total probability theorem is an important result. It can be stated as follows. Let A1, . . . denote a countable set of pairwise mutually exclusive events with P(Ai) > 0 for i = 1, . . ., and assume that A1 ∪A2 ∪ . . . = Ω. Then, for any event B ∈F,

P(B) = P(B|A1)P(A1) + P(B|A2)P(A2) + . . .

Another important result in probability theory is Bayes’ theorem, which can be combined with the total probability theorem to state:

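$$
\begin{align}
P(A_i \mid B) &= \frac{P(B \mid A_i)\,P(A_i)}{P(B)} \tag{1.2} \\
&= \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{m} P(B \mid A_j)\,P(A_j)}. \tag{1.3}
\end{align}
$$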
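Equations (1.2) and (1.3) translate directly into a short computation. The sketch below assumes a finite partition A_1, . . . , A_m; the helper name `posterior` and the numerical priors and likelihoods are hypothetical values chosen only to illustrate the formula.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(A_i | B) for each i, given P(A_i) and P(B | A_i)."""
    # Total probability theorem: P(B) = sum_j P(B | A_j) P(A_j)
    p_b = sum(l * p for p, l in zip(priors, likelihoods))
    if p_b == 0:
        raise ValueError("P(B) = 0: the conditional probability is undefined")
    # Bayes' theorem, equation (1.3)
    return [l * p / p_b for p, l in zip(priors, likelihoods)]

# Hypothetical numbers: three mutually exclusive causes A_1, A_2, A_3.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]                  # P(A_i)
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]        # P(B | A_i)

post = posterior(priors, likelihoods)
print(post)
```

With these assumed numbers, the least likely cause a priori (A_3) becomes the most likely one after B is observed, because B is much more probable under A_3 than under the others.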

Two events A, B are said to be independent if P(A ∩ B) = P(A)P(B). When P(A) > 0 and P(B) > 0, this implies that P(B|A) = P(B) and P(A|B) = P(A). The concept of independence can be extended to a finite sequence of sets A_1, . . . , A_m, which are mutually independent if the product rule holds for every subcollection of the events; in particular, P(A_1 ∩ A_2 ∩ . . . ∩ A_m) = P(A_1)P(A_2) · · · P(A_m). Note that the above concept of mutual independence implies much more than pairwise independence; it is easy to construct examples of events which are pairwise independent, but not mutually independent. For example, consider the experiment of selecting an integer from 1 to 4. Consider the events {1, 2}, {1, 3}, {1, 4}; note that, if each number is equally likely, then the above events are pairwise independent but they are not mutually independent. Each event has probability 1/2 and each pairwise intersection is {1}, with probability 1/4 = (1/2)(1/2); however, the triple intersection is also {1}, with probability 1/4 rather than 1/8.
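The example of the three events {1, 2}, {1, 3}, {1, 4} can be verified by enumeration. This is a minimal sketch using the uniform measure on {1, 2, 3, 4} as stated in the text; the helper name `product_rule_holds` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})

def prob(event):
    """Uniform measure: each integer from 1 to 4 is equally likely."""
    return Fraction(len(event), len(OMEGA))

events = [frozenset({1, 2}), frozenset({1, 3}), frozenset({1, 4})]

def product_rule_holds(subcollection):
    """Check P(intersection) == product of the individual probabilities."""
    intersection = OMEGA
    product = Fraction(1)
    for e in subcollection:
        intersection = intersection & e
        product *= prob(e)
    return prob(intersection) == product

pairwise = all(product_rule_holds(pair) for pair in combinations(events, 2))
triple = product_rule_holds(events)
print(pairwise, triple)
```

The pairwise checks succeed while the check on all three events fails, which is exactly the gap between pairwise and mutual independence.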

Note that the property of independence of two events A, B is a function of the probability measure P, and not of whether A, B have outcomes in common. Indeed, if A, B are disjoint and both have positive probability, then they cannot be independent! That is because P(A|B) = 0 ≠ P(A), since knowing that the experiment outcome is in B implies that it cannot be in A.

If A, B are independent events, then A^c, B are also independent, because

P(A^c ∩ B) + P(A ∩ B) = P(B) = P(A^c ∩ B) + P(A)P(B)

This implies P(A^c ∩ B) = (1 − P(A))P(B) = P(A^c)P(B)


One of the important properties of independence exploits the continuity of probability measures we discussed earlier. Let A_k, k = 1, 2, . . . be a sequence of events. We define a new event as

{A_k infinitely often} = {ω ∈ Ω : ω ∈ A_k for infinitely many values of k}

There is a way of expressing this set using unions and intersections, as

{A_k infinitely often} = ∩_{k≥1} (∪_{n≥k} A_n)

Note that, for each k, the union on the right hand side is an event, and the countable intersection of events is an event, so the set {A_k infinitely often} is an event.

There is a famous result that computes the probability of this event, known as the Borel-Cantelli Lemma, stated below:

Lemma 1.2 (Borel-Cantelli) Let A_k, k = 1, 2, . . . be a sequence of events. Then,

1. If ∑_{k=1}^∞ P(A_k) < ∞, then P({A_k infinitely often}) = 0.

2. If ∑_{k=1}^∞ P(A_k) = ∞ and A_1, A_2, . . . are mutually independent, then P({A_k infinitely often}) = 1.

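Proof: For the first part, note that {A_k infinitely often} is the intersection of the non-increasing events ∪_{n≥k} A_n, k = 1, 2, . . ., as in Lemma 1.1. For each k,

$$P\Big(\bigcup_{n\ge k} A_n\Big) = \lim_{m\to\infty} P\Big(\bigcup_{n=k}^{m} A_n\Big) \le \lim_{m\to\infty} \sum_{n=k}^{m} P(A_n) = \sum_{n=k}^{\infty} P(A_n).$$

Thus,

$$P(\{A_k \text{ infinitely often}\}) \le \lim_{k\to\infty} \sum_{n=k}^{\infty} P(A_n) = 0$$

because the sum ∑_{k=1}^∞ P(A_k) < ∞.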

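For the second part, since the A_k are independent, we have P(A_1 ∪ A_2 ∪ A_3) = 1 − (1 − P(A_1))(1 − P(A_2))(1 − P(A_3)), and similar generalizations hold for more events. Thus

$$P\Big(\bigcup_{n\ge k} A_n\Big) = \lim_{m\to\infty} P\Big(\bigcup_{n=k}^{m} A_n\Big) = \lim_{m\to\infty} \Big[1 - \prod_{n=k}^{m} \big(1 - P(A_n)\big)\Big].$$

We now use a simple inequality: 1 − x ≤ e^{−x} for all x. Then,

$$P\Big(\bigcup_{n\ge k} A_n\Big) \ge \lim_{m\to\infty} \Big[1 - e^{-\sum_{n=k}^{m} P(A_n)}\Big] = 1 - e^{-\sum_{n=k}^{\infty} P(A_n)} = 1,$$

where the last equality holds because ∑_{n=k}^∞ P(A_n) = ∞ for every k. Since {A_k infinitely often} is the intersection of these non-increasing events, Lemma 1.1 implies that P({A_k infinitely often}) = 1.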
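The two quantities that drive the proof can be evaluated exactly for specific sequences. The sketch below assumes two hypothetical choices of P(A_n), namely 1/(n+1) (divergent sum, with the events taken to be independent) and 1/n² (convergent sum); the function names are illustrative only.

```python
from fractions import Fraction

def union_prob_independent(p, k, m):
    """P(A_k u ... u A_m) for independent events with P(A_n) = p(n),
    computed as 1 - prod_{n=k}^{m} (1 - p(n))."""
    miss = Fraction(1)
    for n in range(k, m + 1):
        miss *= 1 - p(n)
    return 1 - miss

def union_bound(p, k, m):
    """Upper bound sum_{n=k}^{m} P(A_n) on P(A_k u ... u A_m)."""
    return sum(p(n) for n in range(k, m + 1))

def divergent(n):
    return Fraction(1, n + 1)      # sum over n diverges

def summable(n):
    return Fraction(1, n * n)      # sum over n converges

# Divergent case: the product telescopes to k / (m + 1), so the union
# probability tends to 1 as m grows, for any fixed starting index k.
near_one = union_prob_independent(divergent, 10, 9999)

# Summable case: the tail sum, and hence the union probability, is small
# once the starting index k is large.
tail_small = union_bound(summable, 100, 2000)

print(float(near_one), float(tail_small))
```

These finite computations only illustrate the mechanism; the lemma itself concerns the limits m → ∞ and then k → ∞.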

1.3 Random Variables

A random variable is similar to a function; indeed, the most common definition of a random variable is a function which assigns real values to the outcomes in Ω. In this manner, it is possible to characterize experiments with similar outcomes entirely in terms of the numerical values of their outcomes. For instance, an experiment involving tossing two unbiased coins is similar in probability to an experiment for rolling a 4-sided unbiased die, as both experiments are equally likely to create any of 4 outcomes. However, their underlying probability spaces would be different. Assigning numerical values to the outcomes of each experiment lets us recognize that the resulting probability spaces give rise to equivalent random variables.

Formally, a random variable in a probability space (Ω, F, P) is a measurable function X : Ω → ℝ, where ℝ is the space of real numbers. By a measurable function, we mean a function where the sets {ω | X(ω) < a} are events in the original σ-field F. Thus, not every function can generate a random variable; it must be such that inverse images of intervals are well-defined events in the original probability space. We define the Borel σ-field B to be generated by the sets (−∞, a), for all real numbers a. In terms of functions, a random variable is a measurable function from (Ω, F) into (ℝ, B).

Often, we wish to allow a random variable to take the values +∞ or −∞ for some specific events. We can allow this extension, provided that P[{ω | X(ω) = +∞}∪{ω | X(ω) = −∞}] = 0.

A random variable X induces a probability measure P_X on (ℝ, B) using the function mapping. For any of the elementary events (−∞, a), this probability is given by

PX((−∞,a)) = P({ω | X(ω) < a}).

We can extend this definition to arbitrary open intervals (b,a) as

PX((b,a)) = P({ω | b < X(ω) < a})

and, more generally, for any set B ∈B, we have

P_X(B) = P({ω | X(ω) ∈ B}).

Indeed, with this induced probability, we can show that (ℝ, B, P_X) is also a probability space. We call this space the sample space. In our study of stochastic processes, we will typically characterize random variables in terms of the properties of their sample spaces, rather than in terms of their underlying probability spaces. However, derivation of the probability measure of the sample space depends explicitly on the definition of the original experiment which gives rise to the random variable.

As an example, consider the experiment of tossing two unbiased coins. In the original space Ω, there are four outcomes: HH, HT, TH and TT, where H denotes a heads outcome and T denotes a tails outcome. We define a random variable X as follows:

$$X(\omega) = \begin{cases} -1 & \text{if } \omega \notin \{HH, TT\} \\ 1 & \text{otherwise.} \end{cases}$$

The sample space can be taken to be either ℝ or {−1, 1}; in cases where the number of possible sample values is discrete, the random variable is said to be a discrete random variable, and we simplify the sample space to be the range of values taken by the random variable. Note that both of the sample values, 1 and −1, are equally likely. The induced probability PX is such that PX(1) = PX(−1) = 0.5.

Now, consider a second experiment, consisting of tossing a single unbiased coin, and define another random variable Y as

$$Y(\omega) = \begin{cases} -1 & \text{if } \omega = H \\ 1 & \text{otherwise.} \end{cases}$$

The sample space and induced probability of this random experiment and random variable Y are the same as those of the previous experiment and random variable X. Rather than treating these random variables as different, by using the sample space we can treat them as identical random variables, as they have identical distributions.
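The equivalence of the two experiments can be checked by direct enumeration. The following is a minimal sketch, assuming equally likely outcomes; the helper name `induced_pmf` is illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import product


def induced_pmf(outcomes, rv):
    """Return the pmf induced by rv when all outcomes are equally likely."""
    prob = Fraction(1, len(outcomes))
    pmf = {}
    for omega in outcomes:
        value = rv(omega)
        pmf[value] = pmf.get(value, Fraction(0)) + prob
    return pmf


# Experiment 1: two unbiased coins; X = 1 on HH or TT, and -1 otherwise.
two_coins = ["".join(t) for t in product("HT", repeat=2)]
pmf_x = induced_pmf(two_coins, lambda w: 1 if w in ("HH", "TT") else -1)

# Experiment 2: one unbiased coin; Y = -1 on H, and 1 otherwise.
one_coin = ["H", "T"]
pmf_y = induced_pmf(one_coin, lambda w: -1 if w == "H" else 1)

# Different underlying spaces, identical induced probabilities.
print(pmf_x == pmf_y)
```

Exact fractions are used so that the comparison of the two induced probabilities does not depend on floating-point rounding.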

1.4 Characterization of Random Variables

Consider a probability space (Ω,F,P), with a random variable X defined on it. We have the following definition:

Definition 1.1 (Probability Distribution Function) The probability distribution function of the random variable X is defined as the function PX : ℝ → [0, 1] which satisfies:

PX(a) ≡ PX((−∞,a]) = P({ω | X(ω) ≤ a}).

This function is also sometimes called the cumulative distribution function, since it is a “cumulative” measure of the probability of the random variable X falling in the interval (−∞,a]. It is sometimes referred to as the “PDF” or “CDF” (all upper case) of a random variable. We will often use the notation P(a) instead of PX(a) when it is clear which random variable we are referring to. In particular, for a generic argument, this is often written as PX(x) or just P(x). (Note that there is a possibility for confusion between P(·) as used for the probability of an event vs. for the PDF. The difference should be clear based on whether the argument is a set or a point, respectively.)

Probability distribution functions have the following properties:


1. PX(∞) = 1,PX(−∞) = 0

2. a ≤ b implies that PX(a) ≤ PX(b)

3. lim ε→0+ PX(a + ε) = PX(a) (continuity from the right)

Proof: The first property follows from the continuity of probabilities. Define the events An = {ω | X(ω) ≤ n}. These form a non-decreasing sequence, so by Lemma 1.1

$$\lim_{n\to\infty} P(A_n) = \lim_{n\to\infty} P_X(n) = P\left(\bigcup_{n=1}^{\infty} A_n\right) = P(\Omega) = 1.$$

Similarly, the events Bn = {ω | X(ω) ≤ −n} form a non-increasing sequence with an empty intersection, so

$$\lim_{n\to\infty} P(B_n) = \lim_{n\to\infty} P_X(-n) = 0.$$

The second property follows from the fact that {ω : X(ω) ≤ a} ⊂ {ω : X(ω) ≤ b} whenever a ≤ b. The final property can be shown as follows: define the sets An = {ω | X(ω) ≤ a + 1/n}. Again, these sets are non-increasing, so

$$\lim_{n\to\infty} P(A_n) = \lim_{n\to\infty} P_X(a + 1/n) = P\left(\bigcap_{n=1}^{\infty} A_n\right) = P_X(a).$$

Note that the probability distribution function can be used to completely characterize the induced probability PX on (ℝ,B), since it assigns a probability to each elementary set defining B.

Most of the random variables that we study are one of two types: discrete or continuous. A random variable X is discrete if there is a finite or countably infinite set of values xi ∈ ℝ such that P(X ∈ {xi, i ∈ I}) = 1. We describe these random variables in terms of a probability mass function (pmf) pX(x) = P(X = x). Note that discrete random variables have CDFs that are piecewise constant, with jumps at a discrete set of points.

A random variable X is a continuous random variable when its CDF is a continuous function, and the CDF can be written as an integral of a density function pX(a), as follows:

Definition 1.2 (Probability density function) Assume that the distribution function can be written as

$$P_X(a) = \int_{-\infty}^{a} p_X(s)\,ds.$$

Then, at points where PX(a) is differentiable, the probability density function is defined as:

$$p_X(a) = \frac{d}{da} P_X(a),$$

with the constraints that

$$p_X(x) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} p_X(x)\,dx = 1.$$

This function is sometimes referred to as the “pdf” (lower case) of a random variable. A probability measure PX defined on (ℝ,B) is absolutely continuous if, for any set A such that $\int_{\mathbb{R}} I_A(s)\,ds = 0$, we also have PX(A) = 0. For absolutely continuous probability measures, we have the following representation (known as the Radon-Nikodym theorem):

$$P_X(A) = \int_A p_X(s)\,ds,$$

where pX(s) is a non-negative measurable function corresponding to the probability density function. The probability density function can be interpreted in terms of the frequency of outcomes. If pX(a) is finite over an interval (a, a + ε], then, for very small ε, the probability that a sample value occurs in that interval is approximately pX(a)ε.

Even if PX(a) is discontinuous and not differentiable, we can often define a probability density function in terms of generalized functions such as the unit impulse function (or delta function) δ(a). Recall that the impulse function is defined by the following properties:

$$\delta(a) = 0 \quad \text{if } a \neq 0,$$

$$\int_b^c \delta(a)\,da = \begin{cases} 0 & \text{if } b \le c < 0 \\ 1 & \text{if } b \le 0 \le c \\ 0 & \text{if } 0 < b \le c, \end{cases}$$

$$\int_{-\infty}^{\infty} \delta(a-s)\,g(s)\,ds = g(a) \quad \text{if } g \text{ is continuous at } a.$$

Probability distribution functions are discontinuous at points where a random variable can take a specific value with nonzero probability. For example,

$$p_X(x) = 0.5\,\delta(x+1) + 0.5\,\delta(x-1)$$

is the density of a random variable taking on the values −1 and 1, each with equal probability.

Now let us define two types of probability measures. We say a probability measure PX defined on (ℝ,B) is singular if there exists a set A ⊂ ℝ such that PX(A) = 1 and $\int_{\mathbb{R}} I_A(s)\,ds = 0$, where IA(s) is the indicator function of the set A; that is,

$$I_A(s) = \begin{cases} 1 & \text{if } s \in A \\ 0 & \text{otherwise.} \end{cases}$$

Probability measures on discrete-valued random variables (such as the outcomes of a coin toss) are singular. In these cases we sometimes describe the distribution at discrete points, in the same way discrete-time quantities are often represented in digital signal processing.

Definition 1.3 (Probability mass function) For discrete-valued random variables, the probability distribution can be characterized using a probability mass function, where:

$$0 \le p_X(x) \le 1, \qquad \sum_i p_X(x_i) = 1.$$

This function is sometimes referred to as the “pmf” (lower case) of a random variable. Using the pmf, the probability of an event A is computed as

$$P_X(A) = \sum_{x_i \in A} p_X(x_i).$$

Given these definitions, every probability measure PX defined on (ℝ,B) (or more generally on the vector space (ℝⁿ,Bⁿ) for vectors of random variables) can be decomposed in canonical form (Lebesgue decomposition):

$$P_X = \alpha P_X^{(1)} + (1-\alpha) P_X^{(2)},$$

where $P_X^{(1)}$ is absolutely continuous and $P_X^{(2)}$ is singular. Typically, the singular measure is represented as a sum of delta functions, and the absolutely continuous part is represented by the continuous-valued probability density function. We summarize these ideas in Table 1.1.

Example 1.1 Consider the following random variable X: With probability 1/2, it has value 0; the remaining probability 1/2 is spread uniformly in the interval [0, 1]. This random variable is continuous-valued, but the probability distribution function is not absolutely continuous. Indeed, if we were to write the density of this random variable, it would consist of

$$p_X(s) = \tfrac{1}{2}\,\delta(s) + \tfrac{1}{2}\,I_{[0,1]}(s),$$

illustrating that the density is the sum of a singular part (corresponding to the non-zero probability at 0) and an absolutely continuous part.
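The mixed distribution of Example 1.1 can be made concrete with a short simulation. This is a sketch under the stated assumptions of the example; the function names, seed, and sample size are illustrative choices.

```python
import random


def mixed_cdf(a):
    """CDF P_X(a) of Example 1.1: mass 1/2 at 0 plus 1/2 uniform on [0, 1]."""
    if a < 0:
        return 0.0
    if a >= 1:
        return 1.0
    return 0.5 + 0.5 * a  # jump of 1/2 at a = 0, then slope 1/2


def sample_mixed(rng):
    """Draw one sample: 0 with probability 1/2, else uniform on [0, 1]."""
    return 0.0 if rng.random() < 0.5 else rng.random()


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_mixed(rng) for _ in range(100_000)]

# Empirical frequencies compared with the CDF at the jump and at 0.5.
frac_at_zero = sum(s == 0.0 for s in samples) / len(samples)
frac_le_half = sum(s <= 0.5 for s in samples) / len(samples)
print(frac_at_zero, mixed_cdf(0.0))
print(frac_le_half, mixed_cdf(0.5))
```

The jump of the CDF at 0 corresponds to the impulse term of the density, and the constant slope on (0, 1) corresponds to the absolutely continuous term.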

For both singular and absolutely continuous cases, the relationship between probability distribution functions and probability density functions can be inverted as follows:

$$P_X(a) = \int_{-\infty}^{a} p_X(s)\,ds.$$

The probability density function can be used to characterize the induced probability as

$$P_X((a,b]) = P(\{X(\omega) \in (a,b]\}) = \int_a^b p_X(s)\,ds.$$

| Case description | Probability mass function or probability density function pX(x) | Probability distribution function (cumulative distribution function) PX(x) |
|---|---|---|
| Discrete distribution as pmf | $p_X(x_i) \ge 0$, $\sum_i p_X(x_i) = 1$ | $P_X(x) = \sum_{z \le x} p_X(z)$; $P_X(A) = \sum_{x \in A} p_X(x)$ |
| Discrete distribution via impulses | $p_X(x) = \sum_i p_i\,\delta(x - x_i)$, $\sum_i p_i = 1$; $p_X(x) = \frac{d}{dx} P_X(x)$ | $P_X(x) = \int_{-\infty}^{x} p_X(z)\,dz$; $P_X(A) = \int_{z \in A} p_X(z)\,dz$ |
| Continuous distribution | $p_X(x) \ge 0$, $\int p_X(z)\,dz = 1$; $p_X(x) = \frac{d}{dx} P_X(x)$ | $P_X(x) = \int_{-\infty}^{x} p_X(z)\,dz$; $P_X(A) = \int_{z \in A} p_X(z)\,dz$ |
| Mixed distribution (Lebesgue decomposition) | $p_X(x) = \alpha\,\underbrace{p(x)}_{\text{cont. dist.}} + (1-\alpha)\,\underbrace{\sum_i p_i\,\delta(x - x_i)}_{\text{sing. dist.}}$; $p_X(x) = \frac{d}{dx} P_X(x)$ | $P_X(x) = \int_{-\infty}^{x} p_X(z)\,dz$; $P_X(A) = \int_{z \in A} p_X(z)\,dz$ |

Table 1.1: Summary of probability distribution function and probability density relationship

Using the probability density function also allows us to define certain operations and expectations of random variables. In order to avoid excess mathematics, let us define the concept of a measurable function of a random variable X to be a function g for which the integral $\int_{-\infty}^{\infty} g(s)\,p_X(s)\,ds$ is well-defined. For any measurable function g : ℝ → ℝ, we define its expected value as:

$$E[g(X)] = \int_{-\infty}^{\infty} g(s)\,p_X(s)\,ds = \int_{\Omega} g(X(\omega))\,P(d\omega).$$

Note that this integral may be infinite-valued if g is unbounded. For a discrete-valued random variable, it is typically more convenient to express the expected value in terms of the probability mass function

$$E[g(X)] = \sum_k g(x_k)\,p_X(x_k).$$

The expectation operation inherits all of the properties of integrals (and sums) of functions, including linearity. Thus,

E [ag(X) + bh(X)] = aE[g(X)] + bE[h(X)].

There are some standard expectations of random variables, which depend on the choice of function g. First, the mean or average of a random variable is defined as

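$$m_X = E[X] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} s\,p_X(s)\,ds & \text{(continuous-valued RV)} \\ \displaystyle\sum_k x_k\,p_X(x_k) & \text{(discrete-valued RV).} \end{cases}$$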

More generally, the n-th moment is defined as

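$$E[X^n] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} s^n\,p_X(s)\,ds & \text{(continuous-valued RV)} \\ \displaystyle\sum_k x_k^n\,p_X(x_k) & \text{(discrete-valued RV).} \end{cases}$$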

The variance of a random variable is defined in terms of its first and second moments, as

$$\sigma_X^2 = E\left[(X - m_X)^2\right] = E[X^2] - (E[X])^2.$$

The variance is the second-order case of the more general n-th central moment, $E[(X - m_X)^n]$. Another important expectation is the characteristic function, defined as

$$\Phi_X(w) = E\left[e^{jwX}\right] = \int_{-\infty}^{\infty} e^{jws}\,p_X(s)\,ds.$$

The characteristic function is the Fourier transform of the probability density function, where $j = \sqrt{-1}$. It uniquely characterizes the density function, since the density can be recovered from it by the inverse Fourier transform:

$$p_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-jwx}\,\Phi_X(w)\,dw.$$

Since the density function is integrable (it integrates to one), the characteristic function always exists. For discrete, integer-valued random variables, one often defines the moment generating function:

$$G_X(z) = E\left[z^X\right] = \sum_k z^k\,p_X(k),$$

which is the Z-transform of the discrete probability mass function. (Note that here we have used k instead of xk to make the integer values explicit.) As for the continuous case, the transform can be inverted to obtain the pmf.

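Both the characteristic function and the moment-generating function can be used to obtain the moments of X, assuming that the functions are differentiable, and that expectation and differentiation can be exchanged (which is permissible except in rare cases):

$$
\begin{aligned}
\left.\frac{d}{dw}\Phi_X(w)\right|_{w=0}
&= \left.\left(\frac{d}{dw}\int_{-\infty}^{\infty} e^{jws}\,p_X(s)\,ds\right)\right|_{w=0}
= \left.\left(\int_{-\infty}^{\infty} \frac{d}{dw}\,e^{jws}\,p_X(s)\,ds\right)\right|_{w=0} \\
&= \left.\left(\int_{-\infty}^{\infty} (js)\,e^{jws}\,p_X(s)\,ds\right)\right|_{w=0}
= j\int_{-\infty}^{\infty} s\,p_X(s)\,ds = jE[X]. && (1.4)
\end{aligned}
$$

More generally, assuming that the characteristic function is sufficiently differentiable,

$$\left.\frac{d^n}{dw^n}\Phi_X(w)\right|_{w=0} = j^n E[X^n].$$

Similarly, for the moment generating function, we have

$$
\begin{aligned}
G_X(z) &= E\left[z^X\right] \\
\left.\frac{d}{dz}G_X(z)\right|_{z=1} &= E[X] \\
\left.\frac{d^2}{dz^2}G_X(z)\right|_{z=1} &= E[X^2] - E[X]
\end{aligned}
$$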

and additional expressions can be developed for the higher-order moments.
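The derivative relations for GX(z) can be checked numerically. The sketch below assumes a binomial random variable, whose generating function (1 − p + pz)^n is stated later in these notes, and uses central finite differences; the step size and parameter values are illustrative.

```python
def binomial_generating_function(z, n, p):
    """G_X(z) = E[z^X] = (1 - p + p z)^n for a binomial random variable."""
    return (1.0 - p + p * z) ** n


def first_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


n, p = 10, 0.3  # illustrative parameter values


def G(z):
    return binomial_generating_function(z, n, p)


mean = first_derivative(G, 1.0)               # E[X]
second_factorial = second_derivative(G, 1.0)  # E[X^2] - E[X]
variance = second_factorial + mean - mean ** 2
print(mean, variance)  # compare with n p and n p (1 - p)
```

Finite differences are used here only to mirror the formulas; for a closed-form generating function the derivatives could equally be taken by hand.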

1.5 Important Random Variables

There are a number of random variables that arise in many applications. These random variables model fundamental mechanisms that underlie random behavior. In this handout, we discuss several of these random variables, and their interrelations. A good reference for this material, from which this writeup is adapted, is A. Leon-Garcia’s book, Probability and Random Processes, published by Addison Wesley.

1.5.1 Discrete-valued random variables

Discrete-valued random variables arise mostly in applications where counting is involved. For example, the Bernoulli random variable is a model for a single coin toss. By counting the outcomes of multiple coin tosses, other random variables such as the binomial, geometric and Poisson, are obtained.

Bernoulli random variable: Let A be an event related to the outcome of some random experiment, such as a toss of a biased coin. Define the indicator function of A as:

$$I_A(\omega) = \begin{cases} 0 & \text{if } \omega \text{ is not in } A \\ 1 & \text{if } \omega \text{ is in } A. \end{cases} \tag{1.5}$$

Thus, the indicator function is one if the event A occurs, and zero otherwise. Note that IA is a random variable, with discrete values in range {0, 1}, and with probability mass function given by:

$$p_{I_A}(0) = 1 - p, \qquad p_{I_A}(1) = p, \tag{1.6}$$

where P(A) = p. Such a random variable is called a Bernoulli random variable, since it identifies the outcome of a Bernoulli trial if we identify the outcome IA = 1 as a success.

The important expectations of a Bernoulli random variable X are easily computed in terms of p. They are listed below:

$$E[X] = p \qquad \text{(Mean)} \tag{1.7}$$

$$E[X^2] - E[X]^2 = p(1-p) \qquad \text{(Variance)} \tag{1.8}$$

$$E\left[z^X\right] = 1 - p + pz \qquad \text{(Moment Generating Function)} \tag{1.9}$$

Binomial random variable: Suppose that a random experiment is repeated n times. Let X denote the number of times that such an experiment was a success. In terms of the notation used above in the context of Bernoulli random variables, let A denote an event, and let X denote the number of times that such an event occurs out of n independent trials. Then, X is a random variable with discrete range {0, 1, . . . , n}.

A simple representation of X is given by

$$X = I_1 + I_2 + \dots + I_n, \tag{1.10}$$

where $I_i$ is the indicator that event A occurs at the independent trial i. The probability mass function of X is given by

$$P(X = k) = \frac{n!}{k!\,(n-k)!}\,p^k (1-p)^{n-k}, \tag{1.11}$$

where the factorial notation $k! = \prod_{j=1}^{k} j$ is used, and p is the single-trial probability that the event A occurs.

The binomial random variable arises in various applications where there are two types of outcomes, and we are interested in the number of outcomes of one type. Such applications include repeated coin tosses, correct/erroneous bits, good/defective items, active/silent stations, etc. The important expectations of binomial random variables are given below:

$$E[X] = np \qquad \text{(Mean)} \tag{1.12}$$

$$E[X^2] - E[X]^2 = np(1-p) \qquad \text{(Variance)} \tag{1.13}$$

$$E\left[z^X\right] = (1 - p + pz)^n \qquad \text{(Moment Generating Function)} \tag{1.14}$$
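The binomial mean and variance can be verified directly from the probability mass function (1.11). This is a minimal sketch; the values of n and p are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p, as in (1.11)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


n, p = 12, 0.25  # illustrative parameter values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                       # should equal 1
mean = sum(k * pk for k, pk in enumerate(pmf))         # compare with n p
second_moment = sum(k * k * pk for k, pk in enumerate(pmf))
variance = second_moment - mean ** 2                   # compare with n p (1 - p)
print(total, mean, variance)
```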

Geometric random variable: The binomial random variable is obtained by fixing the number of Bernoulli trials and counting the number of successes. A different random variable is obtained by counting the number of trials until the first success occurs. Denote this random variable as M; this is a geometric random variable, and it takes values in the discrete infinite set {1, 2, . . .}. The probability mass function of M is given by

$$P(M = k) = (1-p)^{k-1}\,p, \tag{1.15}$$

where p is the single-trial probability that the event occurs. One of the interesting properties of the geometric random variable is that it is “memoryless”; that is,

$$P(M \ge k + j \mid M > j) = P(M \ge k) \quad \text{for all } j, k \ge 1. \tag{1.16}$$

In words, the above expression states that, if a success has not occurred in the first j trials, the probability of having to perform at least k more trials until a success is the same as the probability of initially having to perform at least k trials. Thus, the system “forgets” and begins anew as if it were performing the first trial.
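The memoryless property can be checked from the tail probabilities of the geometric pmf (1.15), using P(M ≥ k) = (1 − p)^(k−1). The sketch below assumes an illustrative value of p; the function names are not from the notes.

```python
def geometric_tail(k, p):
    """P(M >= k) = (1 - p)^(k - 1) for a geometric variable on {1, 2, ...}."""
    return (1.0 - p) ** (k - 1)


def conditional_tail(k, j, p):
    """P(M >= k + j | M > j), using P(M > j) = P(M >= j + 1)."""
    return geometric_tail(k + j, p) / geometric_tail(j + 1, p)


p = 0.2  # illustrative single-trial success probability
for j in (1, 3, 10):
    for k in (1, 2, 5):
        # The two printed probabilities agree for every j and k.
        print(j, k, conditional_tail(k, j, p), geometric_tail(k, p))
```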

The geometric random variable arises in applications where one is interested in the time between occurrence of events in a sequence of independent experiments. Such random variables have broad applications in different aspects of queuing theory. The important expectations of geometric random variables are summarized below:

$$E[M] = \frac{1}{p} \qquad \text{(Mean)} \tag{1.17}$$

$$E[M^2] - E[M]^2 = \frac{1-p}{p^2} \qquad \text{(Variance)} \tag{1.18}$$

$$E\left[z^M\right] = \frac{pz}{1 - (1-p)z} \qquad \text{(Moment Generating Function)} \tag{1.19}$$

In some applications, it is useful to represent the space of outcomes as starting at zero, i.e. {0, 1, 2, . . .}. In this case, which can be thought of as a shift by 1, the variance does not change, the mean becomes mX = (1 − p)/p, and the moment generating function becomes GX(z) = p/[1 − (1 − p)z].

Poisson random variable: In many applications, we are interested in counting the number of occurrences of an event in a certain time period or in a certain region of space. The Poisson random variable arises in situations where the events occur “completely at random” in time or space; that is, where the likelihood of an event occurring at a particular time is equal to, and independent of, the likelihood of an event occurring at a different time. For example, Poisson random variables arise in counts of emissions from radioactive substances, in the number of photons emitted as a function of light intensity, in counts of demands for telephone connections, and in counts of defects in a semiconductor chip.

The probability mass function of a Poisson random variable N is given by

$$P(N = k) = \frac{\lambda^k}{k!}\,e^{-\lambda}, \tag{1.20}$$

where λ is the average number of event occurrences in the specified time interval or region of space.

One of the applications of the Poisson random variable is as an approximation to the binomial probabilities when the number of trials is large. If the number of trials $n_t$ is large, and if p is small, then, letting $\lambda = n_t p$,

$$\frac{n_t!}{k!\,(n_t - k)!}\,p^k (1-p)^{n_t - k} \approx \frac{\lambda^k}{k!}\,e^{-\lambda}. \tag{1.21}$$

This approximation is obtained by taking the limit $n_t \to \infty$ while keeping λ fixed.

The Poisson random variable appears naturally in many situations which can be approximated by the above limit. For example, imagine a sequence of Bernoulli trials taking place in time or space. Suppose the number of event occurrences in a T-second time interval is being counted. Divide the time interval into a very large number $n_t$ of subintervals, where a pulse in each subinterval can be viewed as a Bernoulli trial, and assume that the probability that an event occurs in each subinterval is $p = \lambda/n_t$, where λ is the average number of events observed in a T-second interval. Then, as $n_t \to \infty$, the distribution of the number of events converges to that of a Poisson random variable.
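The quality of the approximation (1.21) can be examined numerically by holding λ fixed and increasing the number of trials. This is a sketch with an illustrative λ; the helper `max_gap` is a hypothetical name introduced here.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam, as in (1.20)."""
    return lam ** k / factorial(k) * exp(-lam)


lam = 2.0  # illustrative average number of events


def max_gap(n_trials, k_max=10):
    """Largest pointwise gap between binomial(n_trials, lam/n_trials) and Poisson(lam)."""
    p = lam / n_trials
    return max(
        abs(binomial_pmf(k, n_trials, p) - poisson_pmf(k, lam))
        for k in range(k_max + 1)
    )


for n_trials in (10, 100, 1000, 10000):
    # The gap shrinks as the number of trials grows with lam held fixed.
    print(n_trials, max_gap(n_trials))
```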

The important expectations of Poisson random variables are summarized below:

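$$E[N] = \lambda \qquad \text{(Mean)} \tag{1.22}$$

$$E[N^2] - E[N]^2 = \lambda \qquad \text{(Variance)} \tag{1.23}$$

$$E\left[z^N\right] = e^{\lambda(z-1)} \qquad \text{(Moment Generating Function)} \tag{1.24}$$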

1.5.2 Continuous-valued random variables

Although most experimental measurements are of limited precision, it is often easier to model their outcomes in terms of continuous-valued random variables because it facilitates the resulting analysis. Furthermore, the limiting form of many discrete-valued random variables results in continuous-valued random variables. Below, we describe some of the most useful continuous-valued random variables.

Uniform random variable: The simplest continuous random variable is the uniform random variable X, where X is equally likely to achieve any value in an interval of the real line, [a,b]. The probability density function of X is given by:

$$p_X(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{otherwise.} \end{cases} \tag{1.25}$$

The corresponding probability distribution function is given, for x ∈ [a,b], by

$$P_X(x) = \frac{x-a}{b-a}, \tag{1.26}$$

with PX(x) = 0 for x < a and PX(x) = 1 for x > b.

The important expectations of uniform random variables are given by

$$E[X] = \frac{a+b}{2} \qquad \text{(Mean)} \tag{1.27}$$

$$E[X^2] - E[X]^2 = \frac{(b-a)^2}{12} \qquad \text{(Variance)} \tag{1.28}$$

$$\Phi_X(\omega) = \frac{e^{j\omega b} - e^{j\omega a}}{j\omega(b-a)} \qquad \text{(Characteristic Function)} \tag{1.29}$$

Exponential random variable: The exponential random variable arises in the modeling of the time between occurrence of events, such as the time between customer requests for call connections in phone systems, and the modeling of lifetimes of devices and systems. The exponential random variable X with parameter α has a probability density function

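$$p_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ \alpha e^{-\alpha x} & \text{if } x \ge 0 \end{cases} \tag{1.30}$$

and corresponding probability distribution function

$$P_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-\alpha x} & \text{if } x \ge 0 \end{cases} \tag{1.31}$$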

The exponential random variable can occur as the limit of the geometric random variable, as the spacing between successive trials gets small. For example, assume that an interval of length T was subdivided into subintervals of length T/n, and assume that, for each subinterval, there is a Bernoulli trial with probability of success p = α/n, where α is the average number of events per T seconds. Then, the number of subintervals until the occurrence of the next event is a geometric random variable M. Let X denote the time until the next successful event. Then, for any t which is a multiple of T/n,

$$P(X > t) = P\left(M > \frac{nt}{T}\right) = (1-p)^{nt/T} = \left[\left(1 - \frac{\alpha}{n}\right)^{n}\right]^{t/T}.$$

In the limit as n → ∞, we get

$$P(X > t) \to e^{-\alpha t/T},$$

which is the complement of the probability distribution function in (1.31) of an exponential random variable with parameter α/T.

Like the geometric random variable, the exponential random variable has the memoryless property. That is, for h > 0,

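$$P(X > t + h \mid X > t) = P(X > h). \tag{1.32}$$

This can be shown analytically as:

$$
\begin{aligned}
P(X > t + h \mid X > t) &= \frac{P\left[(X > t + h) \cap (X > t)\right]}{P(X > t)} && (1.33) \\
&= \frac{P(X > t + h)}{P(X > t)} = \frac{e^{-\alpha(t+h)}}{e^{-\alpha t}} && (1.34) \\
&= e^{-\alpha h} = P(X > h). && (1.35)
\end{aligned}
$$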
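Both the geometric-to-exponential limit and the memoryless property can be illustrated numerically. The sketch below assumes illustrative values for α, T, t, and h, with t chosen as a multiple of T/n for each n used.

```python
from math import exp


def geometric_survival(t, n, alpha, T=1.0):
    """P(X > t) = (1 - alpha/n)^(n t / T) for the subdivided-interval model."""
    return (1.0 - alpha / n) ** (n * t / T)


def exponential_survival(t, rate):
    """P(X > t) = exp(-rate * t) for an exponential random variable."""
    return exp(-rate * t)


alpha, T, t = 3.0, 2.0, 1.0  # illustrative values; t is a multiple of T/n below
for n in (10, 100, 10_000):
    # The discrete model approaches the exponential tail with rate alpha/T.
    print(n, geometric_survival(t, n, alpha, T), exponential_survival(t, alpha / T))

# Memoryless property (1.32): P(X > t + h | X > t) equals P(X > h).
h = 0.7
lhs = exponential_survival(t + h, alpha) / exponential_survival(t, alpha)
rhs = exponential_survival(h, alpha)
print(lhs, rhs)
```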

The important expectations of exponential random variables are given by

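$$E[X] = \frac{1}{\alpha} \qquad \text{(Mean)} \tag{1.36}$$

$$E[X^2] - E[X]^2 = \frac{1}{\alpha^2} \qquad \text{(Variance)} \tag{1.37}$$

$$\Phi_X(\omega) = \frac{\alpha}{\alpha - j\omega} \qquad \text{(Characteristic Function)} \tag{1.38}$$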

Gaussian random variable: Also known as the Normal random variable, the Gaussian random variable models many situations where the random event consists of the sum of a large number of small random variables. Developing the exact distribution of such a sum is unwieldy; fortunately, the central limit theorem provides general conditions under which, as the number of components becomes large, the distribution of the sum can be approximated by that of a Gaussian random variable.

The probability density function of a Gaussian random variable is given by

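$$p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty, \tag{1.39}$$

where µ is the mean and σ > 0 is the standard deviation. The probability distribution function is given by

$$P_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(x-\mu)/\sigma} e^{-\frac{y^2}{2}}\,dy, \tag{1.40}$$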

where the last expression follows from a simple substitution y = (t − µ)/σ. The Gaussian PDF is sometimes characterized in terms of the Q-function

$$P_X(x) = 1 - Q\left(\frac{x-\mu}{\sigma}\right) \quad \text{where} \quad Q(z) = \int_{z}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-\frac{y^2}{2}}\,dy, \tag{1.41}$$

and the Q-function is tabulated in many texts.
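In place of a table lookup, the Q-function can be evaluated with the complementary error function through the standard identity Q(z) = erfc(z/√2)/2. The following is a minimal sketch using Python's `math.erfc`; the function names are illustrative.

```python
from math import erfc, sqrt


def q_function(z):
    """Gaussian tail probability Q(z), via Q(z) = erfc(z / sqrt(2)) / 2."""
    return 0.5 * erfc(z / sqrt(2.0))


def gaussian_cdf(x, mu, sigma):
    """P_X(x) = 1 - Q((x - mu) / sigma), as in (1.41)."""
    return 1.0 - q_function((x - mu) / sigma)


print(q_function(0.0))              # half of the mass lies above the mean
print(gaussian_cdf(1.0, 0.0, 1.0))  # probability of falling below mu + sigma
```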

Gaussian random variables occur often enough that we use the notation N(X; µ, σ²) to denote the density or distribution in equations (1.39) and (1.40). The important expectations of Gaussian random variables are given by:

$$E[X] = \mu \qquad \text{(Mean)} \tag{1.42}$$

$$E[X^2] - E[X]^2 = \sigma^2 \qquad \text{(Variance)} \tag{1.43}$$

$$\Phi_X(\omega) = e^{j\omega\mu - \sigma^2\omega^2/2} \qquad \text{(Characteristic Function)} \tag{1.44}$$

Gamma random variable: The gamma random variable appears in many applications. For example, it is often used to model the time to service customers in queuing systems, the lifetime of devices in reliability studies, and the defect clustering behavior in VLSI chips. The probability density function of the gamma random variable has two parameters r > 0, α > 0, and is given for x ≥ 0 by

$$p_X(x) = \frac{\alpha(\alpha x)^{r-1} e^{-\alpha x}}{\Gamma(r)}, \tag{1.45}$$

where Γ(z) is the gamma function defined by the integral

$$\Gamma(z) = \int_{0}^{\infty} x^{z-1} e^{-x}\,dx, \quad z > 0.$$

The gamma function has the following properties:

$$\Gamma(0.5) = \sqrt{\pi} \tag{1.46}$$

$$\Gamma(z+1) = z\,\Gamma(z) \tag{1.47}$$

$$\Gamma(m+1) = m! \quad \text{for } m \text{ a positive integer} \tag{1.48}$$

The versatility of the gamma distribution is that, by properly choosing the two parameters, it can take a variety of shapes, which can be used to fit specific distributions. For instance, when r = 1, we obtain the exponential random variable. By letting α = 0.5,r = k/2, for a positive integer k, we obtain the distribution of a chi-square random variable, which is important in statistical problems as the sum of the squares of k independent, zero-mean, unit variance Gaussian random variables. By letting r = m, we obtain the m-stage Erlang distribution, which is the distribution of the sum of m independent and identical exponential random variables.
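The special cases described above, together with the gamma function properties (1.46) to (1.48), can be checked with the standard library. This is a sketch; the Erlang density used for comparison is the form listed in Table 1.2, and the test points are illustrative.

```python
from math import exp, factorial, gamma, pi, sqrt


def gamma_pdf(x, r, alpha):
    """Gamma density (1.45) for x > 0, with parameters r > 0 and alpha > 0."""
    return alpha * (alpha * x) ** (r - 1) * exp(-alpha * x) / gamma(r)


def exponential_pdf(x, alpha):
    """Exponential density alpha * exp(-alpha x) for x >= 0."""
    return alpha * exp(-alpha * x)


def erlang_pdf(x, m, alpha):
    """m-stage Erlang density: alpha^m x^(m-1) exp(-alpha x) / (m-1)!."""
    return alpha ** m * x ** (m - 1) * exp(-alpha * x) / factorial(m - 1)


# r = 1 reduces the gamma density to the exponential density.
print(gamma_pdf(0.7, 1, 2.0), exponential_pdf(0.7, 2.0))
# Integer r = m gives the m-stage Erlang density.
print(gamma_pdf(1.3, 3, 2.0), erlang_pdf(1.3, 3, 2.0))
# Gamma function properties (1.46) and (1.48).
print(gamma(0.5), sqrt(pi), gamma(5), factorial(4))
```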

The important expectations of gamma random variables are given by:

$$E[X] = \frac{r}{\alpha} \qquad \text{(Mean)} \tag{1.49}$$

$$E[X^2] - E[X]^2 = \frac{r}{\alpha^2} \qquad \text{(Variance)} \tag{1.50}$$

$$\Phi_X(\omega) = \frac{1}{(1 - j\omega/\alpha)^r} \qquad \text{(Characteristic Function)} \tag{1.51}$$

Rayleigh random variable: Given a pair of independent, zero-mean, variance α² Gaussian random variables Y and Z, the Rayleigh random variable is the magnitude of the vector corresponding to the ordered pair (Y,Z). That is, $X = \sqrt{Y^2 + Z^2}$. Based upon this, we can compute the probability density function of Rayleigh random variables, for x ≥ 0, as

$$p_X(x) = \frac{x}{\alpha^2}\,e^{-x^2/2\alpha^2} \tag{1.52}$$

with corresponding expectations

$$E[X] = \alpha\sqrt{\pi/2} \qquad \text{(Mean)} \tag{1.53}$$

$$E[X^2] - E[X]^2 = (2 - \pi/2)\,\alpha^2 \qquad \text{(Variance)} \tag{1.54}$$

Laplacian random variable: The Laplacian random variable models a two-sided exponential distribution. The probability density function is given by

$$p_X(x) = \frac{\alpha}{2}\,e^{-\alpha|x|}, \tag{1.55}$$

with expectations:

$$E[X] = 0 \qquad \text{(Mean)} \tag{1.56}$$

$$E[X^2] - E[X]^2 = \frac{2}{\alpha^2} \qquad \text{(Variance)} \tag{1.57}$$

$$\Phi_X(\omega) = \frac{\alpha^2}{\omega^2 + \alpha^2} \qquad \text{(Characteristic Function)} \tag{1.58}$$

Cauchy random variable: The Cauchy random variable is often used as an example to illustrate distributions whose densities do not decay fast enough as |x| → ∞, so that no moments exist. The probability density function of Cauchy random variables is given by

$$p_X(x) = \frac{\beta/\pi}{\beta^2 + x^2}. \tag{1.59}$$

Due to its symmetry, the mean is often taken to be zero, though the formal expected value of the density does not have a unique value. It is easy to verify that the variance of this distribution does not exist, as follows. Consider the following expression for the second moment:

$$E[X^2] = \int_{-\infty}^{\infty} x^2\,\frac{\beta/\pi}{\beta^2 + x^2}\,dx. \tag{1.60}$$

Note that, as x → ∞, the integrand approaches β/π rather than zero, so the integral is infinite (i.e. does not exist). However, the Cauchy random variable does have a characteristic function, given by

$$\Phi_X(\omega) = e^{-\beta|\omega|} \qquad \text{(Characteristic Function)} \tag{1.61}$$
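The divergence of the second moment can be seen by truncating the integral in (1.60) to [−L, L]. Using the antiderivative x − β arctan(x/β) of x²/(β² + x²), the truncated integral has a closed form that grows linearly in L. The sketch below is illustrative; the midpoint-rule helper is included only as a cross-check.

```python
from math import atan, pi


def truncated_second_moment(L, beta):
    """Integral of x^2 (beta/pi) / (beta^2 + x^2) over [-L, L], in closed form."""
    return (2.0 * beta / pi) * (L - beta * atan(L / beta))


def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


beta = 1.0  # illustrative scale parameter


def integrand(x):
    return x * x * (beta / pi) / (beta * beta + x * x)


# Cross-check the closed form numerically on a modest interval.
print(midpoint_integral(integrand, -5.0, 5.0), truncated_second_moment(5.0, beta))

# The truncated moment keeps growing with L instead of converging.
for L in (10.0, 1e3, 1e6):
    print(L, truncated_second_moment(L, beta))
```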

In Table 1.2 we summarize the characteristics of important random variables, where the more general (shifted) forms of the Laplacian and Cauchy distributions are given.

1.6 Transformations of a Random Variable

In this section we focus on new random variables defined as real-valued functions of a single, real-valued random variable. Let X be a real-valued random variable. Assume this random variable is characterized by its known probability density function (pdf) pX(x) or, equivalently, its probability distribution function (PDF) PX(x). Suppose we are given a transformation of this random variable:

Y = g(X) (1.62)

for the real-valued function g(·). Now the random variable X(ω) is a mapping of the sample space Ω to the real line, and thus so is g(X(ω)), so Y is a random variable as well. To characterize it we need to find its probability density function pY (y) or, equivalently, its probability distribution function PY (y), which is the focus of this section.

Discrete-valued X:

| Name | Range | Parameters | pmf pX(x) | PDF PX(x) | Mean | Variance | GX(z) = E[z^X] |
|---|---|---|---|---|---|---|---|
| Bernoulli | $\{0, 1\}$ | $0 \le p \le 1$ | $p^x (1-p)^{1-x}$ | N/A | $p$ | $p(1-p)$ | $1 - p + pz$ |
| Binomial | $\{0, \dots, n\}$ | $0 \le p \le 1$ | $\binom{n}{x} p^x (1-p)^{n-x}$ | N/A | $np$ | $np(1-p)$ | $(1 - p + pz)^n$ |
| Geometric | $\{1, 2, \dots\}$ | $0 < p < 1$ | $(1-p)^{x-1} p$ | $1 - (1-p)^x$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | $\frac{pz}{1 - (1-p)z}$ |
| Poisson | $\{0, 1, \dots\}$ | $0 < \lambda$ | $\frac{\lambda^x e^{-\lambda}}{x!}$ | N/A | $\lambda$ | $\lambda$ | $e^{\lambda(z-1)}$ |

Continuous-valued X:

| Name | Range | Parameters | pdf pX(x) | PDF PX(x) | Mean | Variance | ΦX(w) = E[e^{jwX}] |
|---|---|---|---|---|---|---|---|
| Uniform | $[a, b]$ | $a < b$ | $\frac{1}{b-a}$ | $\frac{x-a}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $\frac{e^{jwb} - e^{jwa}}{jw(b-a)}$ |
| Gaussian | $(-\infty, \infty)$ | $\mu, \sigma^2$ | $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/2\sigma^2}$ | $1 - Q\left(\frac{x-\mu}{\sigma}\right)$ | $\mu$ | $\sigma^2$ | $e^{jw\mu - \sigma^2 w^2/2}$ |
| Exponential | $[0, \infty)$ | $\alpha > 0$ | $\alpha e^{-\alpha x}$ | $1 - e^{-\alpha x}$ | $\frac{1}{\alpha}$ | $\frac{1}{\alpha^2}$ | $\frac{\alpha}{\alpha - jw}$ |
| Erlang | $[0, \infty)$ | $\alpha > 0,\ n > 0$ | $\frac{\alpha^n x^{n-1} e^{-\alpha x}}{(n-1)!}$ | $1 - e^{-\alpha x} \sum_{k=0}^{n-1} \frac{(\alpha x)^k}{k!}$ | $\frac{n}{\alpha}$ | $\frac{n}{\alpha^2}$ | $\frac{\alpha^n}{(\alpha - jw)^n}$ |
| Gamma | $[0, \infty)$ | $\alpha, r > 0$ | $\frac{\alpha(\alpha x)^{r-1} e^{-\alpha x}}{\Gamma(r)}$ | N/A | $\frac{r}{\alpha}$ | $\frac{r}{\alpha^2}$ | $\frac{\alpha^r}{(\alpha - jw)^r}$ |
| Rayleigh | $[0, \infty)$ | $\alpha^2$ | $\frac{x}{\alpha^2} e^{-x^2/2\alpha^2}$ | $1 - e^{-x^2/2\alpha^2}$ | $\alpha\sqrt{\frac{\pi}{2}}$ | $\left(2 - \frac{\pi}{2}\right)\alpha^2$ | N/A |
| Laplacian | $(-\infty, \infty)$ | $\alpha > 0,\ \mu$ | $\frac{\alpha}{2} e^{-\alpha\lvert x-\mu\rvert}$ | $\frac{1}{2} e^{\alpha(x-\mu)}$ for $x < \mu$; $1 - \frac{1}{2} e^{-\alpha(x-\mu)}$ for $x > \mu$ | $\mu$ | $\frac{2}{\alpha^2}$ | $\frac{\alpha^2 e^{jw\mu}}{w^2 + \alpha^2}$ |
| Cauchy | $(-\infty, \infty)$ | $\alpha \ge 0,\ \beta > 0$ | $\frac{\beta/\pi}{\beta^2 + (x-\alpha)^2}$ | $\frac{1}{2} + \frac{1}{\pi}\tan^{-1}\left(\frac{x-\alpha}{\beta}\right)$ | Undef | Undef | $e^{j\alpha w - \beta\lvert w\rvert}$ |

Table 1.2: Important random variables. (N/A under the PDF column indicates that there is no simplified form.)

Figure 1.1: Illustration of method of equivalent events (for a fixed y0, Pr(Y ≤ y0) = Pr(g(X) ≤ y0) = shaded area under pX(x) over the set {x | g(x) ≤ y0})

Throughout we will assume that the function g(·) has the following properties:

1. The domain of g(·) includes the support of X.

2. The set {ω | g(X(ω)) < y} is an event for every y.

3. The events {ω | g(X(ω)) = ±∞} have zero probability.

There are two ways to find the desired densities of Y . The first approach, which we term the method of equivalent events, first finds the probability distribution function PY (y) for Y , then finds the density function pY (y) through differentiation.

1.6.1 Method of equivalent events

The probability distribution function of Y is given by:

$$
\begin{aligned}
P_Y(y) &= \Pr(Y \le y) && (1.63) \\
&= \Pr(g(X) \le y) && (1.64) \\
&= \Pr(\{X \mid g(X) \le y\}) && (1.65) \\
&= \int_{\{x \mid g(x) \le y\}} p_X(x)\,dx && (1.66)
\end{aligned}
$$

Notice that the last expression is in terms of the known pdf of X. Basically what we are doing here is taking the events {Y | Y ≤ y} and mapping them into the X space through the inverse mapping g⁻¹(·).

At this point a picture should make things clear; see Figure 1.1. Suppose we fix y = y0. Through the mapping g(·) the event {Y | Y ≤ y0} is seen to be equivalent to the event {X | g(X) ≤ y0} (recall here that y0 represents a fixed, non-random value). That is, the event of getting a value of Y in the red hatched region is equivalent to the event of getting a value of X in the green hatched region. Since the events are equivalent, their probabilities must be the same. In particular, PY(y0) must equal the area under the pX(x) curve in Figure 1.1 that is shaded in green. Since we know the pdf of X we can in principle calculate this probability for the given value of y0.

Conceptually, this process can be applied to any and all values of y0 and in this way PY (y) can be found. Once found, the pdf of Y can be found from the relation:

$$p_Y(y) = \frac{d}{dy} P_Y(y) \tag{1.67}$$

Figure 1.2: Example of method of equivalent events (y = g(x) = x²; Pr(Y ≤ y0) = Pr(g(X) ≤ y0) = shaded area)

Example 1.2 Let us now do an example to illustrate (see Fig. 1.2). Suppose that X is uniformly distributed on the interval [0, 1] and that the function g is given by g(x) = x², so that Y = X². With this information we know that:

$$P_Y(y_0) = \Pr(Y \le y_0) = 0 \quad \text{when } y_0 < 0. \tag{1.68}$$

When y0 < 0 there are no values of X that correspond to {X|g(X) ≤ y0}. That event just does not happen. That is, Y cannot be less than zero in this case.

Similarly, since the values of X have to lie in the range [0, 1] we know that:

PY (y0) = Pr(Y ≤ y0) = 1 when y0 > 1 (1.69)

When y0 > 1 the event {Y |Y ≤ y0} corresponds to a set {X|g(X) ≤ y0} in the X space that includes the entire domain [0, 1] of X. In other words, Y never exceeds 1.

Now we will examine the case when y0 ∈ [0, 1]. Note that

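\begin{align}
\{Y \mid Y \le y_0\} &\equiv \{X \mid g(X) \le y_0\} \tag{1.70} \\
&= \{X \mid X^2 \le y_0\} \tag{1.71} \\
&= \{X \mid -\sqrt{y_0} \le X \le \sqrt{y_0}\} \tag{1.72}
\end{align}

so that

\begin{align}
P_Y(y_0) &= \int_{\{x \mid -\sqrt{y_0} \le x \le \sqrt{y_0}\}} p_X(x)\,dx \tag{1.73} \\
&= \int_0^{\sqrt{y_0}} 1\,dx \tag{1.74} \\
&= \sqrt{y_0} \tag{1.75}
\end{align}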

Collecting these pieces we have overall for PY (y):

\[
P_Y(y) = \begin{cases} 0 & y < 0 \\ \sqrt{y} & 0 \le y \le 1 \\ 1 & y > 1 \end{cases} \tag{1.76}
\]

Figure 1.3: Example of Jacobian method. (Panels: the densities pY(y) and pX(x) and the mapping y = g(x); Pr(y0 ≤ Y ≤ y0 + ∆y) = Pr(x1 ≤ X ≤ x1 + ∆x1) + Pr(x2 ≤ X ≤ x2 + ∆x2) + Pr(x3 ≤ X ≤ x3 + ∆x3) = shaded area.)

Differentiating this PDF we obtain the pdf:

\[
p_Y(y) = \begin{cases} 0 & y < 0 \\ \dfrac{1}{2\sqrt{y}} & 0 \le y \le 1 \\ 0 & y > 1 \end{cases} \tag{1.77}
\]

Notice there are discontinuities in the pdf at Y = 0 and Y = 1.
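As a quick numerical check of Example 1.2, the sketch below approximates PY(y0) by counting how many points of a fine, evenly spaced grid on [0, 1] fall in the equivalent event {x | x^2 ≤ y0}. The grid stands in for samples of the uniform X; the function name `cdf_of_square` and the grid size are illustrative choices, not part of the notes.

```python
import math

def cdf_of_square(y0, n=100_000):
    """Approximate P_Y(y0) = Pr(X**2 <= y0) for X uniform on [0, 1]."""
    # Deterministic stand-in for sampling: n equally weighted midpoints of [0, 1].
    xs = ((i + 0.5) / n for i in range(n))
    # Equivalent-events step: count the x values that land in {x | g(x) <= y0}.
    return sum(1 for x in xs if x * x <= y0) / n

for y0 in (-0.5, 0.04, 0.25, 0.81, 1.5):
    exact = 0.0 if y0 < 0 else min(math.sqrt(y0), 1.0)  # the three cases of (1.76)
    print(y0, cdf_of_square(y0), exact)
```

The two printed columns should agree to within the grid resolution, including the flat regions for y0 < 0 and y0 > 1.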

1.6.2 Jacobian method

When the function g(x) is continuous, one can use the Jacobian method. In this method one finds the pdf of Y directly from the pdf of X using the Jacobian of the transformation. This is a differential-based approach, as illustrated in Fig. 1.3. Suppose the equation y = g(x) has K roots or solutions xi for a given value of y. The key formula is given by:

\[
p_Y(y_0) = \sum_{i=1}^{K} \frac{p_X(x_i)}{\left|\dfrac{dg}{dx}(x_i)\right|} \tag{1.78}
\]

where the denominator is the magnitude of the Jacobian dg/dx evaluated at the roots xi of the transformation for a given value of y. Note that these roots will in general be different for different values of y. The Jacobian is the slope of the mapping and serves to scale the differential areas of the contributions in the X event space to the event [y,y + dy]. To use this approach the Jacobian needs to be well defined and one must be careful to understand the “big picture” involving the domain of definition of both y and x.

Notice from the figure that:

\[
\Delta x_i = \frac{\Delta y}{\left|\dfrac{dg}{dx}(x_i)\right|} \tag{1.79}
\]

That is, the incremental distance in the x domain is the incremental distance in the y domain divided by the magnitude of the Jacobian. The larger this slope, the smaller the corresponding distance in x. Also, as can be seen from the figure, a given incremental event in y has contributions in x at all the roots of the inverse mapping.

Example 1.3 Let us apply this method to the same example as before, where y = g(x) = x2 and X is uniformly distributed on the interval [0, 1]. In this case the Jacobian of the transformation is given by:

\[
\frac{dg}{dx} = 2x \tag{1.80}
\]


And the roots of the transformation are given by:

\begin{align}
x_1 &= +\sqrt{y} \tag{1.81} \\
x_2 &= -\sqrt{y} \tag{1.82}
\end{align}

Putting these pieces together with the Jacobian formula we obtain for the y pdf:

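\begin{align}
p_Y(y) &= \frac{p_X(+\sqrt{y})}{|2x|_{x=+\sqrt{y}}} + \frac{p_X(-\sqrt{y})}{|2x|_{x=-\sqrt{y}}} \tag{1.83} \\
&= \frac{1}{2\sqrt{y}} + \frac{0}{2\sqrt{y}} \tag{1.84} \\
&= \frac{1}{2\sqrt{y}} \tag{1.85}
\end{align}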

This result agrees with our previous answer for 0 ≤ y ≤ 1. But notice that the Jacobian method does not “automatically” define the domain of definition of the solution, and one must be careful to think about the global answer. In this case (and as argued previously) pY (y) = 0 when y < 0 or y > 1.
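The Jacobian formula (1.78) can be sketched as a small generic routine that sums pX(xi)/|g'(xi)| over the roots. The helper names (`jacobian_pdf`, `uniform_pdf`, `square_roots`, `slope`) are illustrative assumptions. In this sketch the density of X is coded with its support [0, 1], which is how the domain information enters; the formula itself still has to be handled separately where the slope vanishes (here at y = 0).

```python
import math

def jacobian_pdf(p_x, roots, dg_dx, y):
    """Evaluate (1.78): sum of p_X(x_i) / |g'(x_i)| over the roots x_i of y = g(x)."""
    return sum(p_x(x) / abs(dg_dx(x)) for x in roots(y))

def uniform_pdf(x):
    # Density of X uniform on [0, 1], zero outside the support.
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def square_roots(y):
    # Roots of y = x**2. There are none for y < 0; y = 0 is skipped
    # because the slope 2x vanishes there.
    if y <= 0.0:
        return []
    r = math.sqrt(y)
    return [r, -r]

def slope(x):
    # dg/dx for g(x) = x**2, as in (1.80).
    return 2.0 * x

def p_y(y):
    return jacobian_pdf(uniform_pdf, square_roots, slope, y)

for y in (0.04, 0.25, 0.81, 1.5):
    expected = 1.0 / (2.0 * math.sqrt(y)) if y <= 1.0 else 0.0
    print(y, p_y(y), expected)
```

For y > 1 both roots fall outside [0, 1], so both terms vanish, matching the "global" reasoning in the text.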

1.7 Pairs of Random Variables

Now let us consider a pair of random variables X, Y . In addition to the probabilistic structure and quantities associated with a single random variable, when we consider pairs of random variables we also have the additional richness of the interrelationship between the random variables. The joint probability density of the pair (X, Y ) is denoted by pX,Y (x,y) or, in short, p(x,y). Let ℝ² denote the plane. Suppose we want to know the probability of obtaining an (x,y) pair in any subset A of the plane. For any measurable set (roughly a set with some area to it) A ⊂ ℝ²,

\[
\Pr(\{\omega \mid (X(\omega), Y(\omega)) \in A\}) = \iint_A p(x,y)\,dx\,dy.
\]

This result just says that the probability of obtaining an outcome in this set is the integral of the probability mass over this region. In particular, suppose we want to know the probability of obtaining an x, y pair in a square located at x, y and of infinitesimal size dx, dy. Then we have:

Pr ({ω | x < X(ω) ≤ x + dx,y < Y (ω) ≤ y + dy}) = p(x,y) dxdy.

The joint Probability Distribution Function (PDF) (also referred to as the joint Cumulative Distribution Function (CDF)) of X, Y is defined as in the single random variable case as:

\[
P_{X,Y}(x,y) = \Pr(\{\omega \mid X(\omega) \le x,\ Y(\omega) \le y\}) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(x',y')\,dy'\,dx'.
\]

Thus the joint probability density and distribution functions are also related via:

\[
p_{X,Y}(x,y) = \frac{\partial^2}{\partial x\,\partial y} P_{X,Y}(x,y)
\]

The marginal density of either X or Y can be recovered by integrating out the other variable, e.g.

\[
p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy.
\]

Two random variables X, Y are said to be independent if and only if pX,Y (x,y) = pX(x)pY (y). Equivalently, in terms of CDFs, X, Y are said to be independent if and only if PX,Y (x,y) = PX(x)PY (y).

The expected value of a function of X, Y is given by

\[
E[f(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x,y)\,p_{X,Y}(x,y)\,dx\,dy.
\]


Note that this gives a consistent definition for the expected value of a function of X only:

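\begin{align}
E[f(X)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x)\,p_{X,Y}(x,y)\,dy\,dx \notag \\
&= \int_{-\infty}^{\infty} f(x) \left( \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy \right) dx \notag \\
&= \int_{-\infty}^{\infty} f(x)\,p_X(x)\,dx \tag{1.86}
\end{align}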

Given two random variables X and Y , there are two important expectations of interest.

Cross-Correlation: The cross-correlation (or just correlation) is given by E[XY ]. An important property of correlations is

\[
E[XY]^2 \le E[X^2]\,E[Y^2]. \tag{1.87}
\]

This follows from the Schwarz inequality, a well-known inequality for integrals. Furthermore, if $E[X^2] \neq 0$ and $E[Y^2] \neq 0$, then equality holds only if P(X = cY ) = 1 for some constant c, so that X, Y are linearly dependent.

To show the inequality above, note the following: $E[(X - \alpha Y)^2] \ge 0$ for any $\alpha$. Choose $\alpha = E[XY]/E[Y^2]$ and note

\begin{align*}
0 \le E[(X - \alpha Y)^2] &= E[X^2] - 2\alpha E[XY] + \alpha^2 E[Y^2] \\
&= E[X^2] - \frac{E[XY]^2}{E[Y^2]}
\end{align*}

This implies $E[XY]^2 \le E[X^2]\,E[Y^2]$. Equality can hold only if $X - \alpha Y = 0$ with probability 1.

Cross-Covariance: The cross-covariance is another measure of interrelationship and is defined by

Cov(X,Y ) ≡ σXY = E [(X −mX)(Y −mY )] = E[XY ] −mXmY . (1.88)

Using the Schwarz inequality on the random variables X −mX,Y −mY yields

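\[
|\mathrm{Cov}(X,Y)| = |\sigma_{XY}| \le \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \sigma_X \sigma_Y
\]

The correlation coefficient ρX,Y is defined using this inequality, as

\[
\rho_{X,Y} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
\]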

By the Schwarz inequality, the correlation coefficient takes values in −1 ≤ ρX,Y ≤ 1. Some very important properties of random variable pairs are defined in terms of these quantities:

Uncorrelated Random Variables: Two random variables X, Y are said to be uncorrelated if:

σXY = 0. (1.89)

From the definition of the cross-covariance we can see that an equivalent statement of the uncorrelated property is: E[XY ] = E[X]E[Y ].

Orthogonal Random Variables: The variables are said to be orthogonal if:

E[XY ] = 0. (1.90)

First note that orthogonal and uncorrelated are different concepts – be careful in your use of these terms! Also note that if two random variables are both orthogonal and uncorrelated, then the mean of at least one must be zero. Finally, for zero mean random variables, orthogonality and uncorrelatedness are equivalent.


Before moving on, note the extremely important (and often missed) fact that uncorrelated or orthogonal random variables are not necessarily independent, but independent random variables are always uncorrelated. Remember that independence is a strong property of the underlying densities, while uncorrelatedness is only a property of second order moments. Think, for example, of the difference between a random variable that is always zero and a zero mean random variable.
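A small discrete illustration of this point, offered as a sketch rather than an example from the notes: take X equally likely to be −1, 0, or 1 and set Y = X^2. The pair is uncorrelated, yet Y is completely determined by X. Exact rational arithmetic is used so the comparison is not blurred by rounding; the helper `expect` is a hypothetical name.

```python
from fractions import Fraction

# Hypothetical example: X takes the values -1, 0, 1 with equal probability and Y = X**2.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): p for x, p in pmf_x.items()}  # joint pmf of (X, Y)

def expect(f):
    """E[f(X, Y)] under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

m_x = expect(lambda x, y: x)
m_y = expect(lambda x, y: y)
cov_xy = expect(lambda x, y: (x - m_x) * (y - m_y))

# Independence would require Pr(X=0, Y=0) == Pr(X=0) * Pr(Y=0).
p_x0_y0 = joint[(0, 0)]
p_x0 = pmf_x[0]
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)

print("Cov(X, Y) =", cov_xy)
print("Pr(X=0, Y=0) =", p_x0_y0, "but Pr(X=0) Pr(Y=0) =", p_x0 * p_y0)
```

The covariance vanishes because E[XY] = E[X^3] = 0 and mX = 0, while the joint probability fails to factor.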

An important property of covariance is that it is a bilinear function of its two arguments. Suppose the random variables X,Y were defined in terms of other random variables U,V as

X = aU + bV ; Y = cU + dV

for some real-valued constants a,b,c,d. Then, by direct computation expressing the covariance as an expectation, we get:

Cov(X,Y ) = Cov(aU + bV,cU + dV ) = Cov(aU,cU) + Cov(bV,cU) + Cov(aU,dV ) + Cov(bV,dV )

= acCov(U,U) + bcCov(V,U) + adCov(U,V ) + bdCov(V,V )

= acV ar(U) + (bc + ad)Cov(U,V ) + bdV ar(V )

Like variance, covariance does not depend on the mean of random variables, but rather the variability of the random variable about their mean. Thus, for any constants a,b, we have

Cov(X + a,Y + b) = Cov(X,Y )

We will exploit these properties throughout the course.
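The bilinearity and shift-invariance properties can be checked exactly on any small joint distribution. The sketch below uses a made-up joint pmf for (U, V) and arbitrary constants a, b, c, d; all of these values, and the helper names, are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf of (U, V) on a few integer points; any valid pmf would do.
joint = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4),
         (1, 2): Fraction(1, 3), (3, 1): Fraction(1, 6)}

def expect(f):
    return sum(p * f(u, v) for (u, v), p in joint.items())

def cov(f, g):
    """Cov(f(U,V), g(U,V)) = E[f g] - E[f] E[g]."""
    return expect(lambda u, v: f(u, v) * g(u, v)) - expect(f) * expect(g)

a, b, c, d = 2, -1, 3, 5  # arbitrary constants

def U(u, v): return u
def V(u, v): return v
def X(u, v): return a * u + b * v
def Y(u, v): return c * u + d * v

# Direct computation versus the expanded bilinear form from the text.
direct = cov(X, Y)
bilinear = a * c * cov(U, U) + (b * c + a * d) * cov(U, V) + b * d * cov(V, V)
# Adding constants to X and Y should leave the covariance unchanged.
shifted = cov(lambda u, v: X(u, v) + 7, lambda u, v: Y(u, v) - 4)
print(direct, bilinear, shifted)
```

All three printed values should coincide.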

1.8 Conditional Probabilities, Densities, and Expectations

We defined conditional probabilities in terms of events A, B in a probability space (Ω,F,P). As we have discussed, random variables can be used to define events in the original probability space; conditioning on these events can be used to define a conditional probability in the original space, which will induce a conditional probability in the sample space. In this section, we discuss the properties of such a conditional probability, and the conditional densities associated with them.

To begin with, consider a random variable X, and denote by B the event {ω | a < X(ω) ≤ b}. Denote by A the event {ω | c < X(ω) ≤ d}. Then, by (1.1), we have

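\begin{align}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \notag \\
&= \frac{P(\{\omega \mid X(\omega) \in (a,b] \text{ and } X(\omega) \in (c,d]\})}{P(B)} \notag \\
&= \frac{P_X((a,b] \cap (c,d])}{P_X((a,b])} \equiv P_X((c,d] \mid (a,b]). \tag{1.91}
\end{align}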

Can we define the conditional probability of a single outcome, given observation of an event? Suppose we let d = c + ε, and we let ε → 0. Then, if the probability distribution function PX is differentiable at c, we know that the probability of the single outcome c is zero, and so is its conditional probability given (a,b]. However, we may be able to define the conditional probability density function pX(c|(a,b]) by taking derivatives, as follows:

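\begin{align}
p_X(c \mid (a,b]) &= \lim_{\epsilon \to 0^+} \frac{P_X((c,c+\epsilon] \mid (a,b])}{\epsilon} \tag{1.92} \\
&= \lim_{\epsilon \to 0^+} \frac{P_X((a,b] \cap (c,c+\epsilon])}{\epsilon\, P_X((a,b])} \tag{1.93} \\
&= \begin{cases} 0 & \text{if } c \notin (a,b) \\ \displaystyle\lim_{\epsilon \to 0^+} \frac{P_X((c,c+\epsilon])}{\epsilon\, P_X((a,b])} & \text{otherwise.} \end{cases} \tag{1.94}
\end{align}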

In the latter case,

\[
p_X(c \mid (a,b]) = \frac{p_X(c)}{P_X((a,b])}.
\]


What about the converse event, of defining the conditional probability of an event given observation of a particular outcome of a random variable? We can let b = a + ε in eq. (1.91), to obtain

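\begin{align}
P(A \mid a) &= \lim_{\epsilon \to 0^+} P(A \mid (a,a+\epsilon]) \tag{1.95} \\
&= \lim_{\epsilon \to 0^+} \frac{P(A \cap \{\omega \mid X(\omega) \in (a,a+\epsilon]\})}{P_X((a,a+\epsilon])} \tag{1.96} \\
&= \lim_{\epsilon \to 0^+} \frac{P(A)\,P_X((a,a+\epsilon] \mid A)}{P_X((a,a+\epsilon])} \tag{1.97} \\
&= \lim_{\epsilon \to 0^+} \frac{P_X((a,a+\epsilon] \mid A)}{\epsilon} \cdot \frac{\epsilon}{P_X((a,a+\epsilon])} \cdot P(A) \tag{1.98} \\
&= \frac{p_X(a \mid A)\,P(A)}{p_X(a)} \tag{1.99}
\end{align}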

assuming that $p_X(a) \neq 0$.

Using the above relationships, we have another version of the total probability theorem, expressed in terms of probability densities. In essence, since the set of possible values of a random variable automatically generates disjoint sets in the event space, we have

\[
P(A) = \int_{-\infty}^{\infty} P(A \mid a)\,p_X(a)\,da
\]

and the corresponding version of Bayes’ rule:

\[
p_X(a \mid A) = \frac{P(A \mid a)\,p_X(a)}{P(A)} = \frac{P(A \mid a)\,p_X(a)}{\int_{-\infty}^{\infty} P(A \mid a')\,p_X(a')\,da'}.
\]

Given two random variables, we can compute the conditional density of one random variable given the other as a straightforward extension of the previous conditional density relationships:

\[
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}
\]

Similarly, Bayes’ rule becomes

\[
p_{X|Y}(x \mid y) = \frac{p_{Y|X}(y \mid x)\,p_X(x)}{p_Y(y)}.
\]

In particular, note that for independent random variables, pX|Y (x | y) = pX(x). This is an equivalent condition for independence of two random variables.

Given the conditional density of a random variable based on observation of another random variable, pX|Y (x | y), we can define conditional expectation of X given Y = y as

\[
E[X \mid Y = y] = \int_{-\infty}^{\infty} x\,p_{X|Y}(x \mid y)\,dx. \tag{1.100}
\]

Note that if a particular value y is not specified, E[X | Y ] is a function of Y , and thus can be viewed as a random variable, since it is a function of a random variable. Therefore, we can take its expectation as

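\begin{align}
E[E[X \mid Y]] &= \int_{-\infty}^{\infty} E[X \mid Y = y]\,p_Y(y)\,dy \notag \\
&= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x\,p_{X|Y}(x \mid y)\,dx \right) p_Y(y)\,dy \notag \\
&= \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy \right) dx \notag \\
&= \int_{-\infty}^{\infty} x\,p_X(x)\,dx = E[X]. \tag{1.101}
\end{align}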

The above property is known as the smoothing property of conditional expectations. It can be very useful for finding expectations of random variables that have two sources of randomness, as in the example below.


Example 1.4 Assume that the number of people in line at the bank when you arrive is N, where N is random, having a Poisson distribution with parameter λ. The time Ti that it takes to serve each person ahead of you can be described by an exponential distribution with parameter α, and the times for different people are mutually independent. How long do you expect to wait before someone starts to serve you? Let T be the time you will wait, then

\[
T = \sum_{i=1}^{N} T_i
\]

and

E[T] = E[E[T|N]] = E[N/α] = λ/α.
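The smoothing step in Example 1.4 can be checked numerically by summing Pr(N = n) · E[T | N = n] over n. The sketch below assumes, consistently with E[T | N] = N/α above, that "parameter α" means the exponential rate, so each service time has mean 1/α. The values of λ and α and the truncation point `n_max` are illustrative assumptions.

```python
import math

def expected_wait(lam, alpha, n_max=200):
    """E[T] = sum_n Pr(N = n) * E[T | N = n], with E[T | N = n] = n / alpha."""
    total = 0.0
    pmf = math.exp(-lam)          # Pr(N = 0) for a Poisson(lam) count
    for n in range(n_max + 1):
        total += pmf * (n / alpha)
        pmf *= lam / (n + 1)      # Pr(N = n + 1) obtained from Pr(N = n)
    return total

lam, alpha = 4.0, 0.5   # hypothetical: 4 people on average, mean service time 1/alpha = 2
print(expected_wait(lam, alpha), lam / alpha)
```

The truncated sum and the closed form λ/α should agree to within floating-point error, since the Poisson tail beyond `n_max` is negligible for moderate λ.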

1.9 Random Vectors

We will frequently deal with several random variables. In this case rather than extend the notation introduced for two random variables, it will prove much more convenient and insightful to use vector notation. The vector notation simply serves as a compact way to “carry around” the associated collection of random variables. It is best if you are familiar and comfortable with such vector notation and concepts, so you should refer to the appendix or a suitable linear algebra text at this point if you need a review. Using vector notation, all of the concepts and results we developed for the cases of a single random variable and pairs of random variables can be generalized to random vectors where all of the elements are defined on a common probability space. Let:

\[
X = \begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}
\]

denote a vector of N random variables. (Note that we may use the alternate notation of X for vectors as well.) The joint distribution function is given by

PX(x) = PX1,...,XN (x1, . . . ,xN ) = P({ω | X1(ω) ≤ x1, . . . ,XN (ω) ≤ xN}).

The joint density function is

\[
p_X(x) = \frac{\partial^N}{\partial x_1 \cdots \partial x_N} P_X(x).
\]

For any measurable set A ⊂ ℝ^N, we have

\[
P(X(\omega) \in A) = \int \cdots \int_A p_X(x)\,dx.
\]

We can have several random vectors defined on the same probability space. Given two random vectors X,Y , we have a joint density pX,Y (x,y), from which we can recover marginal densities as

\[
p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y)\,dy.
\]

The vectors X,Y are said to be independent if pX,Y (x,y) = pX(x)pY (y).

The conditional density for X given Y is given by

\[
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{p_{Y|X}(y \mid x)\,p_X(x)}{p_Y(y)}.
\]

Of course, the above formulas can be extended to more than two random vectors.


1.9.1 Transformation of random vectors

The development in Sec. 1.6 can be extended to the case of random vectors. Suppose that Y = g(X) is a function of the random vector X. We can always compute the probability distribution of Y , based on the distribution of X using an extension of the method of equivalent events, as:

\[
P_Y(y) = P(\{\omega \mid g_1(X(\omega)) \le y_1, \ldots, g_M(X(\omega)) \le y_M\}) = \int_{A(y)} p_X(x)\,dx,
\]

where

A(y) = {x | g1(x) ≤ y1, . . . ,gM (x) ≤ yM}.

We can then obtain the density by differentiation of the distribution function.

When g(X) is differentiable with discrete roots, and the dimensions of X and Y are the same, we can use an extension of the Jacobian method:

\[
p_Y(y) = \sum_{i=1}^{K} \frac{p_X(x_i)}{\left|\det\left[\dfrac{dg}{dx}(x_i)\right]\right|} \tag{1.102}
\]

where [dg/dx] is the Jacobian matrix, det(·) denotes the determinant of the matrix, and xi are the roots of the transformation at the value y.

Example 1.5 Let U,V have the joint probability density function

\[
p_{U,V}(u,v) = \begin{cases} u + v & 0 \le u,v \le 1 \\ 0 & \text{else} \end{cases}
\]

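Let $X = U^2$, $Y = U(1 + V)$. Note that this maps the U, V domain $[0,1]^2$ one to one into the area $A = \{(x,y) : 0 \le x \le 1,\ \sqrt{x} \le y \le 2\sqrt{x}\}$. The Jacobian of this transformation is

\[
\begin{vmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[1ex] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} 2u & 0 \\ 1 + v & u \end{vmatrix} = 2u^2
\]

Using the above formula, we get: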

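\[
p_{X,Y}(x,y) = \frac{p_{U,V}(u,v)}{2u^2} = \frac{u + v}{2u^2}
\]

Of course, the above expression needs to substitute for u, v in terms of x, y, as: $u = \sqrt{x}$, $v = \frac{y}{\sqrt{x}} - 1$. The final result is:

\[
p_{X,Y}(x,y) = \begin{cases} \dfrac{\sqrt{x} + \frac{y}{\sqrt{x}} - 1}{2x} & (x,y) \in A \\ 0 & \text{else} \end{cases}
\]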

Example 1.6 Some of the easiest transformations are those that are linear in the random variables, because the Jacobian matrix is constant. As long as the determinant is non-zero, the transformation would be one-to-one and invertible. Let U,V be independent random variables with known density functions. Define

X = U + V ; Y = U −V

Note the inverse transformation is easy to compute:

U = (X + Y )/2; V = (X −Y )/2

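The Jacobian determinant is

\[
\begin{vmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[1ex] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{vmatrix} = \begin{vmatrix} 1 & 1 \\ 1 & -1 \end{vmatrix} = -2,
\]

which has magnitude 2. Hence,

\[
p_{X,Y}(x,y) = \frac{1}{2}\, p_U((x + y)/2)\, p_V((x - y)/2)
\]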
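One way to sanity-check the formula of Example 1.6 is to pick densities for which the answer is known. The sketch below assumes U and V are standard normal, in which case X = U + V and Y = U − V are independent N(0, 2) random variables (a standard fact used here only as a check, not something derived in this example). The function names are illustrative.

```python
import math

def normal_pdf(x, mean=0.0, var=1.0):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def p_xy(x, y, p_u=normal_pdf, p_v=normal_pdf):
    """Joint density of X = U + V, Y = U - V from the formula of Example 1.6."""
    return 0.5 * p_u((x + y) / 2.0) * p_v((x - y) / 2.0)

# For standard normal U and V, the result should equal N(x; 0, 2) * N(y; 0, 2).
for x, y in ((0.0, 0.0), (1.0, -0.5), (2.3, 1.7)):
    print(p_xy(x, y), normal_pdf(x, var=2.0) * normal_pdf(y, var=2.0))
```

The two columns should match at every test point.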


1.9.2 Expectations of functions of a random vector

The expectation of a function of a random vector X is given by

\[
E[g(X)] = \int_{-\infty}^{\infty} g(x)\,p_X(x)\,dx.
\]

The expectation of an M-dimensional vector-valued function g can be defined componentwise, as

\[
E[g(X)] = \begin{bmatrix} E[g_1(X)] \\ \vdots \\ E[g_M(X)] \end{bmatrix}.
\]

Some important expectations are:

Mean Vector: This is just the collection of individual expected values of each element of the random vector:

E[X] = mX. (1.103)

Covariance Matrix: This is the matrix of variances and cross-covariances of the variables within the random vector:

\[
\Sigma_{XX} = E\left[(X - m_X)(X - m_X)^T\right] = E\left[XX^T\right] - m_X m_X^T. \tag{1.104}
\]

Thus the covariance matrix gives us information about the interrelationship between the elements within a single random vector. In particular note that the elements of ΣXX are just:

\[
(\Sigma_{XX})_{ii} = \sigma_{X_i}^2; \qquad (\Sigma_{XX})_{ij} = E\left[(X_i - m_{X_i})(X_j - m_{X_j})\right] = \sigma_{X_i X_j} \tag{1.105}
\]

so that the covariance matrix can be seen to be a compact way of representing the collection of variance and cross-covariance information for the collection of random variables in the random vector. The covariance matrix can be seen to be the natural generalization of the concept of the variance of a random variable, extended to a random vector.

In these notes, we will often drop the double subscript of the covariance matrix for convenience and brevity. For instance, we will often write ΣX instead of ΣXX. Note the dimensions of the covariance matrix: if X is N-dimensional then ΣX is a square N ×N matrix. Properties of the covariance matrix are discussed in Section 1.10.

Cross-covariance Matrix: Analogous to the covariance matrix, the cross-covariance matrix is simply the matrix of cross-covariances between the elements of two different random vectors:

\[
\Sigma_{XY} = E\left[(X - m_X)(Y - m_Y)^T\right] = E\left[XY^T\right] - m_X m_Y^T. \tag{1.106}
\]

Thus the cross-covariance matrix gives us information about the interrelationship between the elements of two different random vectors. In particular note that the elements of ΣXY are just:

\[
(\Sigma_{XY})_{ij} = E\left[(X_i - m_{X_i})(Y_j - m_{Y_j})\right] = \sigma_{X_i Y_j} \tag{1.107}
\]

so that the cross-covariance matrix is a compact way of representing the collection of cross-covariance information between the collection of random variables in the two different random vectors. Like the covariance matrix, the cross-covariance matrix can be seen as the natural generalization of the cross-covariance between two random variables, extended to two random vectors. Note the dimensions of the cross-covariance matrix: if X is N-dimensional and Y is M-dimensional then ΣXY is an N ×M matrix, and so is not necessarily square. Properties of the cross-covariance matrix are discussed in Section 1.10.

Characteristic Function: This function is the generalization of the characteristic function we defined for a random variable to a random vector:

\[
\Phi_X(jw) = E\left[e^{j w^T X}\right]. \tag{1.108}
\]


Note that now the frequency variable w is itself a vector, since X is a vector. As a consequence the characteristic function for an N-dimensional random vector is really an N-dimensional function. High and low frequencies are defined by the magnitude or length of the w vector, while its orientation determines direction. It is just the multidimensional Fourier Transform of the pdf. As for the scalar case, it completely determines the pdf.

As stated above, the cross-covariance matrix gives us information about the interrelationship between two random vectors. Along these lines we have the following two definitions, which are the generalizations of our earlier definitions to pairs of random vectors:

Uncorrelated Random Vectors: Two random vectors X and Y are said to be uncorrelated if the cross-covariance matrix is identically zero (i.e. each element is zero):

\[
(\Sigma_{XY})_{ij} = 0 \quad \forall i,j \tag{1.109}
\]

From the definition of the cross-covariance we can see that an equivalent statement of the uncorrelated property is: $E[XY^T] = E[X]\,E[Y]^T$. If we are talking about a single random vector, then to say X is uncorrelated is to say that all elements are uncorrelated to each other, in which case the covariance matrix ΣX is diagonal.

Orthogonal Random Vectors: The random vectors X, Y are said to be orthogonal if the cross-correlation matrix is identically zero (i.e. each element is zero):

\[
\left(E\left[XY^T\right]\right)_{ij} = 0 \quad \forall i,j. \tag{1.110}
\]

Again note that orthogonal and uncorrelated are different concepts – be careful in your use of these terms! Also note that if two random vectors are both orthogonal and uncorrelated, then the mean vector of at least one must be zero. If we are talking about a single random vector, then to say X is orthogonal is to say that all elements are orthogonal to each other, in which case the correlation matrix E[XXT ] is diagonal. Finally, for zero mean random vectors, orthogonality and uncorrelated are equivalent (as for random variables).

We may also define conditional quantities for random vectors, in an analogous manner to our definitions for random variables.

Conditional Mean Vector:

\[
E[X \mid Y = y] = m_{X|y} = \int_{-\infty}^{\infty} x\,p_{X|Y}(x \mid y)\,dx \tag{1.111}
\]

Conditional Covariance Matrix:

\[
\Sigma_{X|y} = \int_{-\infty}^{\infty} \left(x - E[X \mid Y = y]\right)\left(x - E[X \mid Y = y]\right)^T p_{X|Y}(x \mid y)\,dx \tag{1.112}
\]

As before, these conditional quantities can have two interpretations. If we observe a particular value of y we can think of these conditional expectations as deterministic. Alternatively, mX|Y and ΣX|Y can be thought of as functions of Y , and thus as random quantities themselves. In this latter case, these quantities have their own densities and e.g. we can find their expectations. In particular, as for the scalar case, the smoothing property of conditional expectations still holds: E [E [X | Y ]] = E[X].

1.10 Properties of the Covariance Matrix

In this section we examine and summarize properties of covariance matrices and cross-covariance matrices. Since

\[
(\Sigma_X)_{ij} = \sigma_{X_i X_j} = \sigma_{X_j X_i} = (\Sigma_X)_{ji}, \tag{1.113}
\]

the first obvious property of the covariance matrix is that it is symmetric:

\[
\Sigma_X = \Sigma_X^T. \tag{1.114}
\]


To proceed, let us first understand how the covariance matrices of random vectors are transformed by linear operations such as matrix multiplication and vector addition. Let X, Y be random vectors, and define a new random vector Z by linear operations as follows:

Z = AX + BY + c

for some deterministic matrices A, B of appropriate dimensions and a deterministic vector c. Since expectation is a linear operation, we can compute

E[Z] = E[AX + BY + c]

= E[AX] + E[BY ] + E[c]

= AE[X] + BE[Y ] + c

= AmX + BmY + c (1.115)

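Similarly, we can compute the covariance ΣZ as

\begin{align}
\Sigma_Z &= E\left[ZZ^T\right] - E\left[m_Z m_Z^T\right] \notag \\
&= E\left[(AX + BY + c)(AX + BY + c)^T\right] - E\left[m_Z m_Z^T\right] \notag \\
&= A E\left[XX^T\right] A^T + A E\left[XY^T\right] B^T + B E\left[YX^T\right] A^T + B E\left[YY^T\right] B^T \notag \\
&\quad + E[AX + BY]\,c^T + c\,E[AX + BY]^T + cc^T - E\left[m_Z m_Z^T\right] \notag \\
&= A \Sigma_X A^T + A \Sigma_{XY} B^T + B \Sigma_{YX} A^T + B \Sigma_Y B^T \tag{1.116}
\end{align}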

These results are general, in that they apply to any linear transformation of a pair of random vectors.

Now let’s use this result to continue to characterize the covariance matrix. To this end, consider the special case arising if we define the scalar random variable Z as follows:

Z = aTX (1.117)

for a deterministic vector a and a random vector X of equal dimension. Then, from (1.116) we have that the variance of Z is given by:

\[
\sigma_Z^2 = a^T \Sigma_X a \tag{1.118}
\]

Since we know that $\sigma_Z^2 \ge 0$ and the vector a was general we thus have that:

\[
a^T \Sigma_X a \ge 0, \quad \text{for all } a \tag{1.119}
\]

A symmetric matrix that has this property is termed positive semi-definite. So we have just proved that a covariance matrix must be a positive semi-definite matrix. While the definition provided in (1.119) is correct, it is not convenient to apply (since forming the quadratic form aT ΣXa for all a is difficult). It turns out that an equivalent condition for positive semi-definiteness of a symmetric matrix is that all its eigenvalues must be nonnegative. This is not hard to see. Suppose the matrix ΣX has an eigen-decomposition as follows:

\[
\Sigma_X = U \Lambda U^T = U \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_N \end{bmatrix} U^T, \tag{1.120}
\]

where U is the matrix whose columns are the eigenvectors of ΣX and λ1, · · · ,λN are the corresponding eigenvalues. For a symmetric matrix, the eigenvector matrices can always be chosen as unitary, and so correspond to (generalized) rotation matrices. Now consider the quadratic form:

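\[
a^T \Sigma_X a = \underbrace{a^T U}_{\tilde{a}^T} \Lambda \underbrace{U^T a}_{\tilde{a}} = \tilde{a}^T \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_N \end{bmatrix} \tilde{a} \ge 0. \tag{1.121}
\]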


Now the above expression must be nonnegative for all possible choices of the vector a or, equivalently, the vector ã (since we can always find a vector a to generate any vector ã = UTa). In particular, it must hold if we choose ã as the unit coordinate vectors

\[
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \vdots \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ \vdots \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ \vdots \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \vdots \end{bmatrix}, \quad \cdots \tag{1.122}
\]

However, this choice of vectors just picks out each eigenvalue in turn! Thus, for the quadratic form to be nonnegative each eigenvalue must be nonnegative, so nonnegativeness of the eigenvalues is necessary. Is it sufficient? Yes: for any ã the quadratic form equals $\sum_i \lambda_i \tilde{a}_i^2$, which is a sum of nonnegative terms whenever every eigenvalue is nonnegative. In summary, a covariance matrix must be a symmetric, positive semi-definite matrix.

Note that if ΣX is positive definite, the quadratic form will be positive for every nonzero a or, equivalently, all the eigenvalues will be positive. In this case, ΣX will be invertible and the variance of the derived random variable Z = aTX will be strictly positive for every nonzero a.

The singular case, when ΣX has a zero eigenvalue, seems special. When does such a case arise? Let’s take a closer look. Suppose we know that one element of the random vector X is really a linear combination of the other elements (i.e. suppose the elements are linearly dependent). In this case, there exists a nonzero vector a such that:

aTX = 0, (1.123)

which implies that:

\[
a^T \Sigma_X a = a^T \left( E\left[XX^T\right] - E[X]\,E[X]^T \right) a = \left( E\left[a^T X X^T a\right] - E\left[a^T X\right] E\left[a^T X\right]^T \right) = 0, \tag{1.124}
\]

which implies that ΣX is singular. Thus, if one element of a random vector is a linear combination of the other elements, then the covariance matrix of the vector will be singular!
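The following sketch illustrates this with small integer matrices. It starts from a made-up 2 × 2 covariance for (X1, X2), builds the covariance of (X1, X2, X3) with X3 = X1 + X2 using the linear-transformation rule (1.116) with B = 0 and c = 0, and then evaluates the quadratic form for a = (1, 1, −1), for which aTX = 0. The matrix values and helper names are assumptions for illustration only.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def quad_form(a, S):
    """a^T S a for a vector a (list) and a square matrix S (list of rows)."""
    return sum(a[i] * S[i][j] * a[j] for i in range(len(a)) for j in range(len(a)))

# Hypothetical covariance of (X1, X2); X3 = X1 + X2 is then linearly dependent.
sigma_12 = [[2, 1], [1, 3]]
A = [[1, 0], [0, 1], [1, 1]]                         # (X1, X2, X3)^T = A (X1, X2)^T
sigma_x = matmul(matmul(A, sigma_12), transpose(A))  # A Sigma A^T, as in (1.116)

a = [1, 1, -1]                                       # a^T X = X1 + X2 - X3 = 0
print(sigma_x)
print(quad_form(a, sigma_x))                         # vanishes: sigma_x is singular
print(quad_form([1, 0, 0], sigma_x), quad_form([1, -2, 1], sigma_x))  # other directions
```

The quadratic form is zero along a and nonnegative along the other test directions, as expected for a singular positive semi-definite matrix.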

Finally, what about the cross-covariance matrix? Again, using the definition of the cross-covariance we see that

\[
(\Sigma_{XY})_{ij} = \sigma_{X_i Y_j} = \sigma_{Y_j X_i} = (\Sigma_{YX})_{ji}, \tag{1.125}
\]

so the cross-covariance matrix satisfies:

\[
\Sigma_{XY} = \Sigma_{YX}^T.
\]

Unlike the entries of the covariance matrix, the entries of the cross-covariance do not satisfy any restrictions. Indeed any matrix could be the cross-covariance of some pair of random vectors. It does not even need to be square.

1.11 Gaussian Random Vectors

A special case of random vectors is the case of what are termed Gaussian random vectors. Recall that Gaussian random variables have at least two extremely important properties. First, their probability density functions can be completely characterized by just two quantities: the mean and the variance. Second, linear functions of a Gaussian random variable result in another Gaussian random variable. The extensions to Gaussian random vectors will also possess generalizations of these properties. In particular, Gaussian random vectors will be completely characterized by their mean vectors and covariance matrices, and linear functions of Gaussian vectors will also result in Gaussian random vectors. This has important consequences. For example, the analysis of linear systems driven by Gaussian random variables can be restricted to analyzing the first and second-order expectations of the inputs and outputs.

Recall that, for a Gaussian random variable X, we use the notation N(m,σ2) to denote a Gaussian distribution of mean m and variance σ2 (cf. eq. (1.39)). Then, the random variable Z = aX + b has distribution N(am + b,a2σ2), from equations (1.115) and (1.116).


Now how will we define a Gaussian random vector? The answer is in terms of quantities we already know – in particular, in terms of Gaussian random variables. An n-dimensional random vector

\[
X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}
\]

is defined to be a Gaussian random vector (or equivalently, {X1, . . . ,Xn} are defined to be a set of jointly Gaussian random variables) if, for all constants

\[
a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix},
\]

the random variable Y = aTX is a Gaussian random variable. Note that it is not enough that each entry is marginally a Gaussian random variable for the vector to be a Gaussian random vector! All linear combinations of the entries must also be Gaussian. The converse, however is true: the entries of a Gaussian random vector are individually Gaussian random variables.
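A standard counterexample (not taken from these notes) makes the warning concrete: let X be N(0, 1) and let Y = SX, where S is an independent random sign. By symmetry Y is also N(0, 1), but X + Y equals zero whenever S = −1, so this linear combination has a point mass at zero and cannot be Gaussian. The simulation below is a rough sketch; the seed and sample size are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 10_000
zero_sum = 0
for _ in range(n):
    x = random.gauss(0.0, 1.0)          # X ~ N(0, 1)
    s = random.choice((-1.0, 1.0))      # independent random sign
    y = s * x                           # Y is also N(0, 1) by symmetry
    if x + y == 0.0:                    # happens exactly when s = -1
        zero_sum += 1

frac_zero = zero_sum / n
print(frac_zero)                        # fraction of samples with X + Y exactly zero
```

Roughly half of the samples give X + Y = 0, which no Gaussian random variable with nonzero variance can do.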

The probability density of Gaussian random vectors is completely described by the mean mX and the covariance ΣX. We use the notation X ∼ N(mX, ΣX) to denote this distribution. Then,

\[
p_X(x) = N(x; m_X, \Sigma_X) = \frac{1}{\sqrt{(2\pi)^n\,|\det \Sigma_X|}}\, e^{-0.5\,(x - m_X)^T \Sigma_X^{-1} (x - m_X)} \tag{1.126}
\]

where we have assumed that ΣX is invertible. By extension of the properties of Gaussian random variables, we can compute explicitly other important expectations of Gaussian random vectors. In particular, the joint characteristic function of Gaussian random vectors is given by

\[
\Phi_X(jw) = E\left[e^{j w^T X}\right] = e^{\,j w^T m_X - w^T \Sigma_X w / 2},
\]

where the above formula is valid even if ΣX is not invertible.

Using the above formula for characteristic functions, it is easy to show that linear combinations of Gaussian random vectors are Gaussian random variables. Let Z = aTX + b for some constants a, b. Consider now the characteristic function of Z, given by:

\begin{align}
\Phi_Z(jv) &= E\left[e^{jvZ}\right] = E\left[e^{jv(a^T X + b)}\right] \notag \\
&= E\left[e^{j(va)^T X}\right] e^{jvb} \notag \\
&= \Phi_X(jva)\, e^{jvb} = e^{\,jv(a^T m_X + b) - v^2 a^T \Sigma_X a / 2}, \tag{1.127}
\end{align}

which is the characteristic function of a Gaussian random variable with mean $a^T m_X + b$ and variance $a^T \Sigma_X a$. Recall that there is a one-to-one correspondence between characteristic functions and probability density functions.

An important property of Gaussian random vectors is that two Gaussian random vectors are independent if and only if they are uncorrelated! To see this, let X and Y be jointly Gaussian random vectors of dimensions n,m respectively. Define a Gaussian random vector Z as

\[
Z = \begin{bmatrix} X \\ Y \end{bmatrix}.
\]

Then, Z ∼ N(mZ, ΣZ), with

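\[
m_Z = \begin{bmatrix} m_X \\ m_Y \end{bmatrix}, \qquad \Sigma_Z = \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}.
\]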


If X and Y are uncorrelated, it means that ΣXY = ΣY X = 0. Under this condition, we have:

det ΣZ = det ΣX det ΣY

and

\[
\Sigma_Z^{-1} = \begin{bmatrix} \Sigma_X^{-1} & 0 \\ 0 & \Sigma_Y^{-1} \end{bmatrix}.
\]

Substituting into (1.126), we get

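\begin{align}
p_Z(x,y) &= \frac{1}{\sqrt{(2\pi)^{n+m} \det \Sigma_Z}}\, e^{-0.5\,(z - m_Z)^T \Sigma_Z^{-1} (z - m_Z)} \notag \\
&= \frac{1}{\sqrt{(2\pi)^n \det \Sigma_X}}\, \frac{1}{\sqrt{(2\pi)^m \det \Sigma_Y}}\, e^{-0.5\,(z - m_Z)^T \Sigma_Z^{-1} (z - m_Z)} \notag \\
&= \frac{1}{\sqrt{(2\pi)^n \det \Sigma_X}}\, e^{-0.5\,(x - m_X)^T \Sigma_X^{-1} (x - m_X)}\; \frac{1}{\sqrt{(2\pi)^m \det \Sigma_Y}}\, e^{-0.5\,(y - m_Y)^T \Sigma_Y^{-1} (y - m_Y)} \notag \\
&= p_X(x)\,p_Y(y). \tag{1.128}
\end{align}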
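The factorization in (1.128) can be checked numerically in the simplest case n = m = 1. The sketch below evaluates the two-dimensional Gaussian density (1.126) with an explicit 2 × 2 inverse and compares it with the product of the two one-dimensional densities, once with zero cross-covariance and once with a nonzero value. The means, variances, evaluation point, and function names are illustrative assumptions.

```python
import math

def gauss2_pdf(x, y, m, S):
    """Density (1.126) for n = 2, mean m = (m1, m2), covariance S = [[s11, s12], [s12, s22]]."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    dx, dy = x - m[0], y - m[1]
    # Quadratic form (z - m)^T S^{-1} (z - m), using the explicit 2x2 inverse.
    q = (S[1][1] * dx * dx - (S[0][1] + S[1][0]) * dx * dy + S[0][0] * dy * dy) / det
    return math.exp(-0.5 * q) / math.sqrt((2.0 * math.pi) ** 2 * det)

def gauss1_pdf(x, m, var):
    return math.exp(-0.5 * (x - m) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

m = (1.0, -2.0)                     # hypothetical means
x, y = 0.3, -1.1                    # arbitrary evaluation point
product = gauss1_pdf(x, m[0], 2.0) * gauss1_pdf(y, m[1], 0.5)

uncorrelated = gauss2_pdf(x, y, m, [[2.0, 0.0], [0.0, 0.5]])
correlated = gauss2_pdf(x, y, m, [[2.0, 0.6], [0.6, 0.5]])
print(uncorrelated, product, correlated)
```

With zero cross-covariance the joint density equals the product of the marginals; with a nonzero cross-covariance it does not.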

Another important property of Gaussian random vectors, which we will derive later when we deal with estimation, is that the conditional density of a Gaussian random vector $X$, given an observation of another jointly Gaussian random vector $Y$, is also Gaussian. This allows us to represent the conditional density in terms of two quantities which are readily computed: the conditional mean $E[X \mid Y]$ and the conditional covariance $\Sigma_{X|Y}$. Furthermore, we shall see that the conditional covariance does not depend on $Y$, but is a constant matrix. The resulting formulas, which we will derive later in the course, are:

$$
E[X \mid Y] = m_X + \Sigma_{XY} \Sigma_{YY}^{-1} (Y - m_Y)
$$

$$
\Sigma_{X|Y} = \Sigma_X - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}
$$

where $\Sigma_{YY} = \Sigma_Y$ denotes the covariance of $Y$, assumed invertible.

The properties of Gaussian random vectors are summarized in the next theorem.

Theorem 1.1

1. If $X$ is a Gaussian random vector, then each of the random variables $X_k$ is Gaussian.

2. If $X_i$, $i = 1, \ldots, n$, are each Gaussian and they are independent, then the vector $X = (X_1, \ldots, X_n)^T$ is a jointly Gaussian random vector.

3. The characteristic function of a jointly Gaussian random vector with mean $m$ and covariance $\Sigma$ is given by

$$
\Phi_X(j\omega) = e^{j\omega^T m - \omega^T \Sigma \omega / 2}
$$

4. If $X$ is a Gaussian random vector, and its covariance $\Sigma_X$ is diagonal, then the coordinate random variables $X_1, \ldots, X_n$ are independent.

5. A Gaussian random vector $X$ with mean $m_X$ and covariance $\Sigma_X$ that is invertible has probability density

$$
p_X(x) = \frac{1}{\sqrt{(2\pi)^n |\det \Sigma_X|}}\, e^{-0.5 (x - m_X)^T \Sigma_X^{-1} (x - m_X)}
$$

If the covariance is not invertible, then the probability density function does not exist.

6. If $X, Y$ are jointly Gaussian random vectors, they are independent if and only if $\mathrm{Cov}(X, Y) = 0$.

As a final note on Gaussian random vectors, since the joint density function depends only on the mean and covariance parameters, then all of the moments and other expectations must be expressible in terms of these parameters. Indeed, there are general formulas, based on using the characteristic function to obtain the moments.
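For intuition, the conditional mean and covariance formulas stated before Theorem 1.1 can be specialized to a pair of scalar jointly Gaussian random variables. The standard deviations $\sigma_X$, $\sigma_Y$ and the correlation coefficient $\rho$ are notation introduced here only for illustration; with $\Sigma_X = \sigma_X^2$, $\Sigma_Y = \sigma_Y^2$ and $\Sigma_{XY} = \Sigma_{YX} = \rho \sigma_X \sigma_Y$, the formulas reduce to:

```latex
\begin{aligned}
E[X \mid Y] &= m_X + \rho\, \frac{\sigma_X}{\sigma_Y}\, (Y - m_Y), \\
\Sigma_{X|Y} &= \sigma_X^2 - \frac{(\rho\, \sigma_X \sigma_Y)^2}{\sigma_Y^2} = \sigma_X^2 (1 - \rho^2).
\end{aligned}
```

In particular, when $\rho = 0$ the conditional mean equals $m_X$ and the conditional variance equals $\sigma_X^2$, which is consistent with the independence of uncorrelated jointly Gaussian random variables.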

1.12 Inequalities for Random Variables

In order to analyze notions of convergence of random variables, it is useful to bound the errors between the limit random variable and elements of the sequence using simple inequalities. Below, we present several of the most useful inequalities:


1.12.1 Markov inequality

Suppose that X is a non-negative random variable with known mean, and we want to obtain some bounds on the probability distribution function of X. A simple inequality is given by

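$$
P(X \ge a) = \int_a^\infty p_X(x)\, dx \le \frac{E(X)}{a}. \tag{1.129}
$$

This follows from

$$
E(X) = \int_a^\infty x\, p_X(x)\, dx + \int_0^a x\, p_X(x)\, dx \ge a \int_a^\infty p_X(x)\, dx = a P(X \ge a). \tag{1.130}
$$

The above argument can be generalized as follows: Let $f(x) \ge 0$ everywhere, and let $f(x) > a > 0$ for all $x \in A$, for a subset $A$ of the real line $\mathbb{R}$. Then,

$$
\begin{aligned}
E[f(X)] &= \int_{x \in A} f(x)\, p_X(x)\, dx + \int_{x \notin A} f(x)\, p_X(x)\, dx \\
&\ge \int_{x \in A} f(x)\, p_X(x)\, dx \ge a \int_{x \in A} p_X(x)\, dx = a P(X \in A).
\end{aligned}
\tag{1.131}
$$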

1.12.2 Chebyshev inequality

Suppose that the mean $m$ and variance $\sigma^2$ of a random variable $X$ are known, and we would like to bound the probability that the variable is far from its mean. The Chebyshev bound is given by

$$
P(|X - m| \ge a) \le \frac{\sigma^2}{a^2}. \tag{1.132}
$$

This bound is a special case of the Markov bound, since we can take the random variable $(X - m)^2$, which is nonnegative and has known mean $\sigma^2$. Then, by (1.129),

$$
P(|X - m| \ge a) = P((X - m)^2 \ge a^2) \le \frac{\sigma^2}{a^2}.
$$

The above can be generalized for any random variable with finite higher-order moments, as

$$
P[|X - m| \ge a] = P(|X - m|^n \ge a^n) \le \frac{E[|X - m|^n]}{a^n},
$$

or, more generally, for any real, nonnegative, even function $f(x)$ which is non-decreasing for $x > 0$, has finite expectation $E[f(X)]$, and satisfies $f(a) > 0$. Then,

$$
P[f(X) \ge f(a)] \le \frac{E[f(X)]}{f(a)}.
$$

1.12.3 Chernoff Inequality

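Given a random variable $X$, we can define a new random variable $Y_\epsilon$ as:

$$
Y_\epsilon(\omega) = \begin{cases} 1 & \text{if } X(\omega) \ge \epsilon \\ 0 & \text{otherwise} \end{cases}
$$

That is, $Y_\epsilon$ is the indicator random variable of the event $X \ge \epsilon$. Then, for all $t \ge 0$, for all outcomes $\omega$, the following inequality holds:

$$
e^{tX} \ge e^{t\epsilon}\, Y_\epsilon.
$$

Thus,

$$
E\left[e^{tX}\right] \ge E\left[e^{t\epsilon}\, Y_\epsilon\right] = e^{t\epsilon} P[X \ge \epsilon],
$$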


which implies that

$$
P[X \ge \epsilon] \le e^{-t\epsilon} E\left[e^{tX}\right], \quad t \ge 0.
$$

This bound can be tightened through the choice of $t$, as follows:

$$
P[X \ge \epsilon] \le \min_{t \ge 0} e^{-t\epsilon} E\left[e^{tX}\right].
$$

Note that this bound requires computation of $E\left[e^{tX}\right]$, the moment generating function of $X$, which is comparable to computing the characteristic function of $X$. Thus, this bound requires extensive knowledge of the full probability density function of $X$, and not just its mean and variance.
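To compare the three bounds of this section, the sketch below evaluates them for a unit-rate exponential random variable. This choice of distribution is an assumption made for illustration (it has mean 1, variance 1, tail $P(X \ge a) = e^{-a}$, and $E[e^{tX}] = 1/(1-t)$ for $t < 1$); the function name `tail_bounds_exp1` and the grid search over $t$ are likewise illustrative.

```python
import math

def tail_bounds_exp1(a, grid_size=10_000):
    """Return (exact, markov, chebyshev, chernoff) for P(X >= a), X ~ Exp(1), a > 1."""
    exact = math.exp(-a)                  # true tail of the unit-rate exponential
    markov = 1.0 / a                      # E[X] / a, with E[X] = 1
    chebyshev = 1.0 / (a - 1.0) ** 2      # {X >= a} is contained in {|X - 1| >= a - 1}
    # Chernoff: minimize exp(-t a) E[exp(tX)] = exp(-t a) / (1 - t) over a grid of 0 <= t < 1.
    chernoff = min(math.exp(-(k / grid_size) * a) / (1.0 - k / grid_size)
                   for k in range(grid_size))
    return exact, markov, chebyshev, chernoff

exact, markov, chebyshev, chernoff = tail_bounds_exp1(10.0)
print(exact, chernoff, chebyshev, markov)
```

For this distribution the minimizing value is $t = 1 - 1/a$, which gives the Chernoff bound $a e^{1-a}$. It decays exponentially in $a$, like the true tail, whereas the Markov and Chebyshev bounds decay only polynomially.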

1.12.4 Jensen’s Inequality

A convex function of a continuous variable in an interval I is a function such that, for any α ∈ [0, 1], any x, y ∈ I, the following is true:

f(αx + (1 −α)y) ≤ αf(x) + (1 −α)f(y).

If the function f is twice differentiable, then it is convex if and only if the second derivative f̈ ≥ 0 for all x ∈ I. If it is only once differentiable, it is convex if and only if f(x) + ḟ(x)(y −x) ≤ f(y) for all x,y ∈ I. Now, let X denote a random variable with probability density distributed over I, and let m denote its mean, which must be in I. Then, for any convex function f, we have

f(m) + ḟ(m)(X −m) ≤ f(X).

Taking expectation of both sides, we have

f(m) + ḟ(m) (E[X] −m) = f(m) = f (E [X]) ≤ E [f(X)] ,

which is known as Jensen’s inequality.

A concave function is a function $f$ whose negative $-f$ is convex. For concave functions, Jensen’s inequality is reversed.
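Jensen’s inequality also holds for the empirical distribution of any finite sample, which makes it easy to check numerically. The sketch below is an illustration only: the interval $I = [0.5, 3]$, the uniform sampling, and the choices $f(x) = e^x$ (convex) and $f(x) = \ln x$ (concave) are assumptions, not part of the notes.

```python
import math
import random

rng = random.Random(1)
xs = [rng.uniform(0.5, 3.0) for _ in range(10_000)]   # samples in I = [0.5, 3]
mean = sum(xs) / len(xs)

# Convex case, f(x) = exp(x): E[f(X)] - f(E[X]) should be nonnegative.
convex_gap = sum(math.exp(x) for x in xs) / len(xs) - math.exp(mean)

# Concave case, f(x) = ln(x): the inequality reverses, so f(E[X]) - E[f(X)] >= 0.
concave_gap = math.log(mean) - sum(math.log(x) for x in xs) / len(xs)

print(convex_gap, concave_gap)
```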

1.12.5 Moment Inequalities

Using Jensen’s inequality, we can derive a number of inequalities involving expectations. We list some of these below.

$$
E[|X + Y|^r] \le c_r \left( E[|X|^r] + E[|Y|^r] \right)
$$

where $r > 0$ and

$$
c_r = \begin{cases} 1 & \text{if } r \le 1 \\ 2^{r-1} & \text{if } r > 1 \end{cases}
$$

To show this, note that the function $f(z) = z^r + (1 - z)^r$ is a convex function on $(0, 1)$ if $r \ge 1$, and a concave function if $r < 1$. It is symmetric about the value $z = 0.5$, and achieves its minimum (if $r \ge 1$) or maximum (if $r < 1$) at this value, taking the value $2^{1-r}$. For $r < 1$, the minimum over $[0, 1]$ is attained at the endpoints, where $f = 1$. Thus, $c_r f(z) \ge 1$. Now, let $z = \frac{|x|}{|x| + |y|}$. Then,

$$
c_r f(z) = c_r\, \frac{|x|^r + |y|^r}{(|x| + |y|)^r} \ge 1.
$$

Multiplying through by the denominator gives

$$
c_r (|x|^r + |y|^r) \ge (|x| + |y|)^r \ge |x + y|^r.
$$

Taking expectations of both sides gives the result.

The next inequality is known as the Hölder inequality; it is

$$
E[|XY|] \le E[|X|^r]^{1/r}\, E[|Y|^s]^{1/s},
$$

where $r, s > 1$ and $1/r + 1/s = 1$. This follows because the function $f(z) = \ln z$ is concave on the interval $(0, \infty)$, so, for $p \in [0, 1]$, $x_1, x_2 > 0$, we have

$$
(1 - p) f(x_1) + p f(x_2) \le f((1 - p) x_1 + p x_2).
$$


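Taking exponentials of both sides yields

$$
x_1^{1-p}\, x_2^{p} \le (1 - p) x_1 + p x_2.
$$

Define

$$
X_1 = \frac{|X|^r}{E[|X|^r]}, \qquad X_2 = \frac{|Y|^s}{E[|Y|^s]},
$$

where $1/r = 1 - p$, $1/s = p$. Then, with $x_1 = X_1$ and $x_2 = X_2$, the above inequality becomes

$$
\frac{|XY|}{E[|X|^r]^{1/r}\, E[|Y|^s]^{1/s}} \le \frac{|X|^r}{r E[|X|^r]} + \frac{|Y|^s}{s E[|Y|^s]}.
$$

Taking expectations and multiplying through, we obtain

$$
\begin{aligned}
E[|XY|] &\le E[|X|^r]^{1/r}\, E[|Y|^s]^{1/s} \left( \frac{E[|X|^r]}{r E[|X|^r]} + \frac{E[|Y|^s]}{s E[|Y|^s]} \right) \\
&= E[|X|^r]^{1/r}\, E[|Y|^s]^{1/s}\, (1/r + 1/s) = E[|X|^r]^{1/r}\, E[|Y|^s]^{1/s}.
\end{aligned}
\tag{1.133}
$$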

The Schwarz inequality is obtained by taking $r = s = 2$ in the Hölder inequality.

The final inequality which we will show is the Lyapunov inequality. Consider the function $f(t) = \ln E[|U|^t]$, for $t \ge 0$. This is a convex function of $t$ on any interval in which the expectation exists: applying the Hölder inequality to the product $|U|^{\alpha t} |U|^{(1-\alpha)s}$ gives $E[|U|^{\alpha t + (1-\alpha)s}] \le (E[|U|^t])^{\alpha} (E[|U|^s])^{1-\alpha}$ for any $\alpha \in [0, 1]$. Also, note that $f(0) = 0$. Thus, the slope $f(t)/t$ must be non-decreasing in $t$. Taking exponentials maintains this monotonic property, so that $(E[|U|^t])^{1/t}$ is also a non-decreasing function of $t$. Thus,

$$
(E[|U|^t])^{1/t} \ge (E[|U|^s])^{1/s} \quad \text{for } t \ge s > 0.
$$
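The Lyapunov inequality applies in particular to the empirical distribution of a finite sample, so the monotonicity of $(E[|U|^t])^{1/t}$ in $t$ can be observed directly. In this sketch the Gaussian samples, the list of orders, and the helper name `lyapunov_norm` are illustrative assumptions.

```python
import random

def lyapunov_norm(samples, t):
    """Empirical version of (E[|U|^t])^(1/t) for t > 0."""
    return (sum(abs(u) ** t for u in samples) / len(samples)) ** (1.0 / t)

rng = random.Random(2)
us = [rng.gauss(0.0, 1.0) for _ in range(5_000)]

orders = [0.5, 1.0, 2.0, 3.0, 4.0]
norms = [lyapunov_norm(us, t) for t in orders]
for t, value in zip(orders, norms):
    print(t, value)
```

The printed values should be non-decreasing in the order $t$, as the inequality predicts.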


Chapter 2

Sequences of Random Variables

Consider a sequence of random variables {xn},n = 1, . . . ,∞, defined on a common probability space (Ω,F,P). Under the analogy that random variables are functions, a natural question to ask is when would there be a random variable which would be considered a limit of the sequence? Before we can answer that question, consider some options for what we mean by a limit:

1. For every single outcome ω, the numbers {xn(ω)} must approach a limit, and collectively across all outcomes, that limit is a random variable.

2. For almost every outcome ω (except for a negligible set of probability zero), the above must occur.

3. The probability distribution functions of {xn} must converge to a valid probability distribution.

Clearly, the above interpretations would all be useful notions of convergence, and are all important concepts of how to interpret limits of random variables. Before we begin the proper definitions, let’s consider a couple of motivating examples.

Example 2.1 Let y be a random variable, selected uniformly from the interval [0, 1]. For n = 1, . . . , define the random variable xn = y(1 − 1/n).

Example 2.2 Let $y_1$ be a random variable, selected uniformly from $[0, 1]$. For each $n > 1$, let $y_n$ be a random variable, selected uniformly from the interval $[y_{n-1}, 1]$. Define

$$
x_n = \begin{cases} 1 & \text{if } y_n > 1 - 1/n \\ 0 & \text{otherwise} \end{cases}
$$

Consider the two sequences $\{y_n\}$, $\{x_n\}$. In the first example, it is clear that the sequence $\{x_n\}$ converges, for each outcome, to whatever value $y$ has for that outcome. However, convergence in the second example is harder to define. Consider the experiment described in this example: it is a compound experiment, where an infinite number of $y_n$ will be generated, and each $y_n$ depends strongly on the value of the previous one. Intuitively, it seems that the random variables $y_n$ are increasing towards 1, but they do so at different rates for each outcome. Furthermore, there are uncountably many sequences of outcomes which do not converge to 1, such as $y_n = 0.5(1 - 1/n)$. Does this example converge, and in what sense?

2.1 Convergence Concepts for Random Sequences

Before discussing convergence of random sequences, it is useful to recall that random variables are similar to functions; thus, let us review first the notions of convergence for sequences of functions.

Let $D$ denote the domain of a sequence of real-valued functions $f_n(x)$, $x \in D$, and of the function $f$. We say that the sequence $\{f_n\}$ converges pointwise to $f$ if and only if, for any $\epsilon > 0$ and any point $x \in D$, there is an integer $N(x, \epsilon)$ such that, for $n \ge N$, $|f_n(x) - f(x)| < \epsilon$. A sequence of functions converges uniformly in $D$ to $f$ if and only if, for any $\epsilon > 0$, there is an integer $N(\epsilon)$ such that, for any $x \in D$ and for $n \ge N$, we have $|f_n(x) - f(x)| < \epsilon$. The difference is that, in the second case, the same value of $N$ works for all $x$, whereas in the first case the value of $N$ could depend on $x$.


The uniform convergence criterion is more concisely stated in terms of a distance norm. Let:

$$
\|f\| = \sup_{x \in D} |f(x)|
$$

Then, the sequence $\{f_n\}$ converges uniformly to $f$ if and only if

$$
\lim_{n \to \infty} \|f_n - f\| = 0
$$

You may wonder why uniform convergence is important. The reason is essentially that it allows you to interchange certain limit operations. For example, if $\{f_n\}$ converges uniformly to $f$ on a bounded domain $D$ (such as a finite interval), then

$$
\lim_{n \to \infty} \int_D f_n(x)\, dx = \int_D f(x)\, dx.
$$
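The difference between pointwise and uniform convergence can be seen on two simple sequences. The functions below are illustrative assumptions, not taken from the notes: $f_n(x) = x^n$ on $D = [0, 1)$ converges pointwise to 0 but not uniformly, because $f_n$ equals $1/2$ at the point $x = 2^{-1/n} \in D$ for every $n$, whereas $g_n(x) = x/n$ converges uniformly to 0 since its supremum over $D$ is $1/n$.

```python
def f(n, x):
    """f_n(x) = x**n on D = [0, 1): pointwise limit 0, but sup over D stays at 1."""
    return x ** n

def g(n, x):
    """g_n(x) = x / n on D = [0, 1): sup over D equals 1/n, so convergence is uniform."""
    return x / n

for n in (10, 100, 1000):
    x_bad = 0.5 ** (1.0 / n)    # a point of D where f_n is still equal to 1/2
    print(n, f(n, 0.9), f(n, x_bad), 1.0 / n)
```

At the fixed point $x = 0.9$ the values $f_n(0.9)$ vanish quickly, yet for each $n$ there remains a point of $D$ at which $f_n$ is $1/2$, so no single $N(\epsilon)$ works for all $x$ when $\epsilon < 1/2$.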

Based on the above discussion, let’s define several concepts of convergence for sequences of random variables.

Definition 2.1 (Sure Convergence) The sequence {xn} defined on a probability space (Ω,F,P) is said to converge surely or everywhere to a random variable x, if, for each outcome ω, the sequence of numbers {xn(ω)} converges to a limit x(ω).

Note that, in the above definition, the convergence could be at different rates for different outcomes; this is equivalent to the notion of pointwise convergence of functions. We can also define the notion of uniform convergence, by requiring that, for each $\epsilon > 0$, there would have to be an $N(\epsilon)$ such that $|x_n(\omega) - x(\omega)| < \epsilon$ for all $n > N(\epsilon)$, for all $\omega$; the uniformity arises because $N(\epsilon)$ is the same for all outcomes $\omega$.

Up to now, we have considered sequences of random variables as nothing more than sequences of functions, without exploiting the probability structure of the random variables. We now define some notions of convergence which take into account the probabilistic structure.

Definition 2.2 (Almost Sure Convergence) The sequence $\{x_n\}$ defined on a probability space $(\Omega, F, P)$ is said to converge almost surely or almost everywhere to a random variable $x$ if, for each outcome $\omega$ except those in a set $A \in F$ such that $P(A) = 0$, the sequence of numbers $\{x_n(\omega)\}$ converges to a limit $x(\omega)$. We write

$$
\lim_{n \to \infty} x_n \overset{a.e.}{=} x
$$

Mathematically, almost sure convergence requires that, for any given $\delta, \epsilon > 0$, there exists an $N(\epsilon, \delta)$ such that:

$$
P\left[ \bigcup_{n > N} \{\omega : |x_n(\omega) - x(\omega)| > \epsilon\} \right] < \delta
$$

or equivalently,

$$
P\left[ \sup_{n > N} |x_n(\omega) - x(\omega)| < \epsilon \right] > 1 - \delta
$$

Again, one can define the concept of uniformly almost everywhere convergence as a variation. Note that the concept of almost everywhere convergence implies that the set of “bad” outcomes (for which some $|x_n(\omega) - x(\omega)| > \epsilon$, $n > N$) shrinks as $N$ increases to a set with zero probability.

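Example 2.3 Consider a probability space on the unit interval $[0, 1]$, with a uniform probability measure. Define

$$
x_n(\omega) = \begin{cases} 1 & \text{if } 2^{-n} \le \omega < 2^{-n+1} \\ 0 & \text{otherwise} \end{cases}
$$

Denote the limit as $x(\omega) = 0$ everywhere. Note that the bad sets $B_n = \{\omega : |x_n(\omega) - x(\omega)| > \epsilon\}$, for each $n$ and any $0 < \epsilon < 1$, have probability $2^{-n}$. Furthermore,

$$
\sum_{n=1}^{\infty} P(B_n)
$$

converges. Then, for every $\delta > 0$, there exists an integer $N$ such that

$$
P\left[ \bigcup_{n > N} \{\omega : |x_n(\omega) - x(\omega)| > \epsilon\} \right] \le \sum_{n > N} P(B_n) < \delta
$$

which guarantees almost sure convergence.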


Example 2.4 Let $x_n$ be a sequence of random variables on the standard unit interval probability space $([0, 1], B([0, 1]), P)$ with $P$ being the standard Lebesgue measure on Borel sets. This sequence is defined as follows: $x_1 = 1$ everywhere. For arbitrary $n$, let $n = 2^k + j$, where $k = \lfloor \log_2 n \rfloor$ and $0 \le j < 2^k$. Then, $x_n(\omega) = 1$ on the interval $(j 2^{-k}, (j + 1) 2^{-k}]$, and 0 elsewhere. Thus, $x_2(\omega) = 1$ for $\omega \in (0, 0.5]$; $x_3(\omega) = 1$ for $\omega \in (0.5, 1]$.

As $n$ increases, the variables $x_n(\omega)$ have thinner support. However, this support sweeps over every possible value of $\omega \in (0, 1]$. While it appears that $x_n$ is converging to $x = 0$, we can show that, for any $\omega \in (0, 1]$, there is an infinite number of indices $n$ with $x_n(\omega) = 1$. Hence, for any $N$ and any $0 < \epsilon < 1$,

$$
\bigcup_{n > N} \{\omega : |x_n(\omega) - x(\omega)| > \epsilon\} = (0, 1]
$$

and this set has probability 1; therefore the sequence does not converge almost everywhere.
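The behavior in Example 2.4 can be made concrete with a short computation. This is a minimal sketch: the function names `x` and `support_probability` and the test outcome $\omega = 0.3$ are illustrative choices, and `n.bit_length() - 1` is used to compute $\lfloor \log_2 n \rfloor$ exactly for integers.

```python
def x(n, omega):
    """Indicator x_n(omega) of Example 2.4, with n = 2**k + j and 0 <= j < 2**k."""
    k = n.bit_length() - 1            # floor(log2(n)) for a positive integer n
    j = n - 2 ** k
    return 1 if j * 2.0 ** -k < omega <= (j + 1) * 2.0 ** -k else 0

def support_probability(n):
    """P(x_n = 1) = 2**(-floor(log2(n))), the length of the support interval."""
    return 2.0 ** -(n.bit_length() - 1)

omega = 0.3
# Indices n < 2**10 at which the trajectory through omega equals 1.
hits = [n for n in range(1, 2 ** 10) if x(n, omega) == 1]
print(hits)
print(support_probability(2 ** 10 - 1))
```

For each value of $k$ there is exactly one index $n$ in $[2^k, 2^{k+1})$ with $x_n(\omega) = 1$, so the trajectory keeps returning to 1 even though the probability of the support shrinks to zero.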

There are many examples of random variables which converge almost everywhere. However, in order to determine whether a particular sequence converges almost everywhere, we need to know in detail the probability law that governs the selection of $\omega$, and the relationship between the outcome $\omega$ and the sequence. There are weaker notions of convergence, which may not require knowledge of the behavior of entire sample sequences. We list some of these below.

Definition 2.3 (Convergence in Probability) The sequence of random variables $\{x_n\}$ is said to converge in probability to the random variable $x$ if, for any $\epsilon > 0$,

$$
\lim_{n \to \infty} P\left[ \{\omega \mid |x_n(\omega) - x(\omega)| > \epsilon\} \right] = 0
$$

We use the following notation to denote convergence in probability:

$$
\lim_{n \to \infty} x_n \overset{p.}{=} x
$$

Convergence in probability requires that the probability that $|x - x_n| < \epsilon$ increase to 1 as $n$ increases to $\infty$. However, the actual value $x - x_n$ can be large for some outcomes $\omega$. This set of bad outcomes can depend on $n$, and its probability decreases to zero. Almost sure convergence implies convergence in probability. Consider Example 2.4 above; by direct computation, for $0 < \epsilon < 1$ we have:

$$
P(|x_n - x| \ge \epsilon) = 2^{-\lfloor \log_2 n \rfloor}
$$

which converges to 0 as $n \to \infty$. Thus, this example converges in probability.

Note that convergence in probability is unique, in the sense that if $x, y$ are both limits of the same sequence in probability, then $P[x \ne y] = 0$. However, convergence in probability still allows a large gap between $x_n$ and $x$ for some values of $\omega$; this may not be acceptable, so we define a third form of convergence.

Definition 2.4 (Mean Square Convergence) The sequence of random variables $\{x_n\}$, with the property that $E[x_n^2] < \infty$, is said to converge in the mean-square sense to the random variable $x$ if

$$
\lim_{n \to \infty} E\left[ (x_n - x)^2 \right] = 0
$$

We denote mean-square convergence as

$$
\lim_{n \to \infty} x_n \overset{mss}{=} x
$$

Mean square convergence is also called convergence in quadratic mean. Note that the limit random variable will also have, by necessity, $E[x^2] < \infty$, because convergence implies that, for some $K$, $E[(x_K - x)^2] < 1$. Since $(a + b)^2 \le 2a^2 + 2b^2$, it follows that

$$
E[x^2] = E[(x - x_K + x_K)^2] \le 2E[(x - x_K)^2] + 2E[x_K^2] \le 2 + 2E[x_K^2] < \infty
$$

Mean square convergence is of great practical interest in engineering applications because of the interpretation of $E[(x_n - x)^2]$ as the power in the error signal, and because convergence can usually be established in terms of second-order statistics of the random variables involved. We also have a useful approach to verifying when a sequence converges in mean-square sense, without having to determine the limiting random variable first. This is known as the Cauchy Criterion:

Definition 2.5 (Cauchy Criterion) The Cauchy criterion for establishing mean-square convergence of a sequence of random variables $\{x_n\}$ states that $\{x_n\}$ converges in mean-square sense if and only if

$$
\lim_{m, n \to \infty} E\left[ (x_n - x_m)^2 \right] = 0
$$


Mean-square convergence does not imply that, as n increases, almost all sequences approach the limit and remain close to the limit; instead, it implies that, for each n, an increasing proportion of trajectories get close to the limit, but allows some trajectories to be far from the limit. Thus, mean-square convergence is more uniform across trajectories, but does not require that almost every trajectory converge. We illustrate the difference with two examples.

Example 2.5 Let $y$ be selected uniformly in the interval $[0, 1]$. Define the sequence of random variables

$$
x_n = e^{-n(ny - 1)}
$$

Note that the probability that $ny > 1$ increases to 1 as $n \to \infty$, which suggests that the $x_n$ approach a limit of 0. As a matter of fact, for $\{\omega : y(\omega) > 0\}$ (which is an event of probability one), $x_n$ will approach its limit of 0, and so

$$
\lim_{n \to \infty} x_n \overset{a.e.}{=} 0
$$

Now, consider whether the same sequence converges in the mean-square sense. Let us use the definition:

$$
E\left[ (x_n - 0)^2 \right] = E\left[ e^{-2n(ny - 1)} \right] = e^{2n} \int_0^1 e^{-2n^2 y}\, dy = \frac{e^{2n}}{2n^2} \left( 1 - e^{-2n^2} \right) \tag{2.1}
$$

This blows up as $n \to \infty$. Thus, almost sure convergence does not imply mean-square convergence.
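The two behaviors in Example 2.5 can be tabulated numerically. The sketch below evaluates the closed form in (2.1), compares it with a midpoint-rule approximation of the defining integral, and follows one sample path; the function names, the step count, and the fixed outcome $y = 0.1$ are illustrative assumptions.

```python
import math

def second_moment_closed_form(n):
    """E[x_n^2] from (2.1): exp(2n) (1 - exp(-2 n^2)) / (2 n^2)."""
    return math.exp(2 * n) * (1.0 - math.exp(-2 * n * n)) / (2 * n * n)

def second_moment_midpoint(n, steps=20_000):
    """Midpoint-rule approximation of the integral of exp(-2n(ny - 1)) over y in [0, 1]."""
    h = 1.0 / steps
    return h * sum(math.exp(-2 * n * (n * (i + 0.5) * h - 1.0)) for i in range(steps))

def x(n, y):
    """Value x_n = exp(-n(ny - 1)) along the sample path of a fixed outcome y."""
    return math.exp(-n * (n * y - 1.0))

moments = [second_moment_closed_form(n) for n in range(1, 11)]
path = [x(n, 0.1) for n in (10, 20, 50)]
print(moments)
print(path)
```

The second moments grow without bound, while the sample path through any fixed $y > 0$ collapses to zero once $n$ exceeds $1/y$.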

Example 2.6 Consider a communication channel transmitting bits, and let $x_n$ denote the indicator random variable of the event that the $n$-th bit was transmitted in error. Suppose the error mechanism in the channel is described as follows: the first bit is always in error; the next two bits have one error in total, with probability distributed equally among them. The next three bits have one error in total, distributed equally among them. The construction continues recursively as above, so that there is exactly one error between the $m(m - 1)/2$ bit and the $m(m + 1)/2$ bit (right side inclusive), distributed uniformly, for each positive integer $m$. Note that, as $n \to \infty$, the probability that a bit is in error decreases to zero. Indeed, let us verify that this sequence converges in mean-square sense. Let $n \in (m(m - 1)/2, m(m + 1)/2]$. Then, since $x_n$ takes only the values 0 and 1,

$$
\lim_{n \to \infty} E[(x_n)^2] = \lim_{n \to \infty} 1/m = 0
$$

so we have mean-square convergence. However, we cannot have almost everywhere convergence, because, no matter how large $n$ (or equivalently $m$) is, every sequence is guaranteed to have an errored bit in $(m(m - 1)/2, m(m + 1)/2]$ and thus has elements that are far from zero.

Mean-square convergence has several strong implications. First, we can show that, if $x$ is the limit in mean-square sense of $\{x_n\}$, then

$$
\lim_{n \to \infty} E[x_n] = E[x]
$$

This is because

$$
0 \le (E[x - x_n])^2 \le E[(x - x_n)^2]
$$

Taking limits establishes the result. Furthermore, the limit is unique in that, if $x, y$ are both limits in mean-square sense of $\{x_n\}$, then $P[x \ne y] = 0$.

Note that mean-square convergence implies convergence in probability. By the Chebyshev inequality,

$$
P\left[ \{\omega \mid |x_n(\omega) - x(\omega)| > \epsilon\} \right] \le \frac{E\left[ (x_n - x)^2 \right]}{\epsilon^2} \to 0
$$

if the sequence is mean-square convergent. As in mean-square convergence, the trajectories are not required to stay close to the limit, although, for $n$ large enough, most of them will be close to the limit.

With some additional conditions, we can obtain that mean-square convergence also implies almost everywhere convergence. Indeed, if for some $p > 0$ we have $\sum_{n=1}^{\infty} E[|x - x_n|^p] < \infty$, then the sequence is guaranteed to converge almost everywhere to $x$. In particular, for $p = 2$, the sequence converges both in mean-square sense and almost everywhere.

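Convergence in probability can imply convergence in mean square sense if there exists a random variable $x_0$ with $E[x_0^2] = K < \infty$ and $|x_n| \le x_0$ almost everywhere. In that case $|x| \le x_0$, and hence $|x_n - x| \le 2x_0$, almost everywhere. Since $E[x_0^2]$ is finite, the expectation of $x_0^2$ over an event tends to zero as the probability of that event tends to zero. Thus, for any $\epsilon > 0$,

$$
E[(x_n - x)^2] \le \epsilon^2 P(|x_n - x| \le \epsilon) + 4 E\left[ x_0^2\, 1\{|x_n - x| > \epsilon\} \right]
$$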


which goes to 0 as $n \to \infty$, $\epsilon \to 0$.

As a final concept, we define the notion of convergence in distribution. This type of convergence does not require trajectories to remain close at all, but the resulting probability distributions must converge. We don’t even require that all variables be defined on the same probability space, unlike the other types of convergence defined above.

Definition 2.6 (Convergence in Distribution) The sequence of random variables $\{x_n\}$ with probability distribution functions $P_n(x)$ is said to converge in distribution to the random variable $x$ with probability distribution $P(x)$ if

$$
\lim_{n \to \infty} P_n(x) = P(x)
$$

for all $x$ at which $P(x)$ is continuous. We use the following notation to denote convergence in distribution:

$$
\lim_{n \to \infty} x_n \overset{d.}{=} x
$$

Convergence in probability implies convergence in distribution, since the probability distribution functions are defined in terms of inequalities on the values, $P_n(x) = P[\{\omega : x_n(\omega) \le x\}]$. Thus convergence in distribution is a weaker concept. Consider the following example:

Example 2.7 Define the sequence of random variables $x_n$ consisting of independent, identically distributed random variables, each uniformly distributed on the interval $[0, 1]$. Clearly, the sample sequences will not converge almost everywhere, or in mean-square sense or in probability, since each value is chosen independently of the previous ones. However, every $x_n$ has an identical probability distribution function, so the sequence converges trivially in distribution.

Note that a distribution function is uniquely determined by its values at points of continuity. This is because distribution functions are nonnegative, monotone, right-continuous and bounded by 1; thus, the number of jumps of size $1/n$ or greater is at most $n$. Hence, the points of discontinuity are at most countable, so that the points of continuity are dense in $(-\infty, \infty)$, and so the full distribution function can be determined from the right-continuity property.

An important property of convergence in distribution is that expectations of bounded, continuous functions of the random variables also converge; moments converge as well under additional conditions, such as uniform boundedness of the random variables. That is because these statistics are defined in terms of integrals with respect to the distribution functions, which are converging. The following lemma provides a characterization of convergence in distribution, which is also known as weak convergence in functional analysis:

Lemma 2.1 Let $\{x_n\}$ be a sequence of random variables, and let $x$ be another random variable. The following conditions are equivalent:

1. $\lim_{n \to \infty} x_n \overset{d.}{=} x$

2. For any bounded, continuous function $f$, $\lim_{n \to \infty} E[f(x_n)] = E[f(x)]$

3. The characteristic functions $\Phi_{x_n}(u)$ converge to $\Phi_x(u)$ for each frequency value $u$.

The following result summarizes the relationships between the different types of convergence established above.

Theorem 2.1 Let $\{x_n\}$ be a sequence of random variables on the same probability space, and let $x$ be another random variable on the same probability space. Then

1. If $\lim_{n \to \infty} x_n \overset{a.e.}{=} x$, then $\lim_{n \to \infty} x_n \overset{p.}{=} x$.

2. If $\lim_{n \to \infty} x_n \overset{mss}{=} x$, then $\lim_{n \to \infty} x_n \overset{p.}{=} x$.

3. If $P(|x_n| \le x_0) = 1$ for some random variable $x_0$ with $E[x_0^2] < \infty$, and $\lim_{n \to \infty} x_n \overset{p.}{=} x$, then $\lim_{n \to \infty} x_n \overset{mss}{=} x$.

4. If $\lim_{n \to \infty} x_n \overset{p.}{=} x$, then $\lim_{n \to \infty} x_n \overset{d.}{=} x$.

5. Suppose $\lim_{n \to \infty} x_n = x$ and $\lim_{n \to \infty} x_n = y$ in the a.e., p., or m.s. sense. Then, $P(x = y) = 1$, so the limits are unique except for a set of outcomes with zero probability.

6. Suppose $\lim_{n \to \infty} x_n \overset{d.}{=} x$ and $\lim_{n \to \infty} x_n \overset{d.}{=} y$. Then, $x, y$ have the same CDF or probability distribution function.

We conclude this section with a result for sequences of Gaussian random variables.


Theorem 2.2 Let $\{x_n\}$ be a sequence of Gaussian random variables on the same probability space, and let $x$ be another random variable on the same probability space. Then, if $\lim_{n \to \infty} x_n = x$ in either the a.e., m.s., p., or d. sense, the limit $x$ must be a Gaussian random variable.

Note that one only has to show this for convergence in distribution, since all the other types of convergence imply this one. It is easy to see that, as long as the limit is not a constant, the CDF of the limit random variable will be continuous everywhere, because of the strong continuity and differentiability properties of the Gaussian CDFs of xn. It is also easy to show the limit will have finite mean and variance, because otherwise the CDFs would be spreading out and not converging. Once that is determined, it is clear that the shape of the CDF will be Gaussian.

2.2 The Central Limit Theorem and the Law of Large Numbers

The two most famous examples of convergence are the law of large numbers and the central limit theorem. We discuss these below.

Let $\{x_n\}$ denote a sequence of independent, identically distributed random variables, and define the sample mean

$$
M_N = \frac{1}{N} \sum_{i=1}^{N} x_i
$$

The claim is that the new random sequence $\{M_n\}$ converges to the mean of the distribution of $x_n$. This establishes an empirical relationship for computing the mean of any random variable, by repeating independent experiments and averaging the observed values of the random variable.

Let $m_x$ denote the mean of the random variables $x_n$. Then, we can compute

$$
\begin{aligned}
E[(M_n - m_x)^2] &= E\left[ \left( \frac{1}{n} \sum_{i=1}^{n} x_i - m_x \right)^2 \right] \\
&= E\left[ \left( \frac{1}{n} \sum_{i=1}^{n} (x_i - m_x) \right)^2 \right] \\
&= \frac{1}{n^2} \sum_{i=1}^{n} E[(x_i - m_x)^2] = \frac{n \sigma_x^2}{n^2}
\end{aligned}
\tag{2.2}
$$

where the last line follows from the independent, identically distributed property of the $x_n$, and $\sigma_x^2$ is the variance of each $x_n$. It is easy to see that $\lim_{n \to \infty} E[(M_n - m_x)^2] = 0$, which shows that the sequence $\{M_n\}$ converges in mean-square sense to $m_x$ ($\lim_{n \to \infty} M_n \overset{mss}{=} m_x$). This proves the weak law of large numbers in the case that $x_n$ has a finite variance. The more general version of the weak law of large numbers (even if there is no finite variance) is stated as:

Theorem 2.3 (Weak Law of Large Numbers) Let $\{x_n\}$ be a sequence of independent, identically distributed random variables with finite means, and define the sequence of sample means $\{M_n\}$ as

$$
M_n = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

Then, $\{M_n\}$ converges in distribution to $m_x$, the mean of the random variables $x_n$.

We have proven a stronger statement if the random variables $x_n$ have finite variance, that convergence is at least as strong as in the mean-square sense. The Strong Law of Large Numbers states a third result, which is:

Theorem 2.4 (Strong Law of Large Numbers) Let $\{x_n\}$ be a sequence of independent, identically distributed random variables with finite mean $m_x$ and finite variance. Then the sequence of sample means $\{M_n\}$:

$$
M_n = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

converges almost everywhere to $m_x$.
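The mean-square error $\sigma_x^2 / n$ computed in (2.2) can be compared with a simulation. The following sketch assumes, purely for illustration, that the $x_n$ are uniform on $[0, 1]$ (so $m_x = 1/2$ and $\sigma_x^2 = 1/12$); the function name `sample_mean_mse` and the trial counts are hypothetical choices.

```python
import random

def sample_mean_mse(n, trials, rng):
    """Estimate E[(M_n - m_x)^2] for i.i.d. Uniform(0, 1) samples, where m_x = 1/2."""
    total = 0.0
    for _ in range(trials):
        m_n = sum(rng.random() for _ in range(n)) / n   # one realization of M_n
        total += (m_n - 0.5) ** 2
    return total / trials

rng = random.Random(3)
sigma2 = 1.0 / 12.0                      # variance of Uniform(0, 1)
for n in (10, 40, 160):
    print(n, sample_mean_mse(n, 2000, rng), sigma2 / n)
```

Each fourfold increase in $n$ should reduce the estimated mean-square error by roughly a factor of four, in line with (2.2).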


The law of large numbers states that the sample means converge to a deterministic quantity. However, it is often of interest to characterize the error. This is the purpose of the central limit theorem.

Theorem 2.5 (Central Limit Theorem) Consider a sequence of independent, identically distributed random variables $\{x_n\}$ with finite mean $m_x$ and finite variance $\sigma_x^2$. Denote the partial sum $S_n$ as

$$
S_n = \sum_{i=1}^{n} x_i
$$

Define the new random sequence $\{y_n\}$ as

$$
y_n = \frac{S_n - n m_x}{\sigma_x \sqrt{n}}
$$

Then, the sequence $\{y_n\}$ converges in distribution to a Gaussian random variable with mean zero and variance 1.

The surprising part of the Central Limit Theorem is that the distribution of the individual random variables can be arbitrary. This is why Gaussian random variables are used so often in probabilistic analysis, since they approximately model sums of many independent effects.

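We sketch a brief proof of the Central Limit Theorem using characteristic functions. We note that

$$
y_n = \frac{1}{\sigma_x \sqrt{n}} \sum_{i=1}^{n} (x_i - m_x)
$$

is also a sum of independent, zero-mean random variables. Thus, its characteristic function is given by:

$$
\begin{aligned}
\Phi_{y_n}(w) &= E\left[ e^{jw y_n} \right] = E\left[ e^{jw \frac{1}{\sigma_x \sqrt{n}} \sum_{i=1}^{n} (x_i - m_x)} \right] \\
&= E\left[ \prod_{i=1}^{n} e^{jw \frac{x_i - m_x}{\sigma_x \sqrt{n}}} \right] = \prod_{i=1}^{n} E\left[ e^{jw \frac{x_i - m_x}{\sigma_x \sqrt{n}}} \right] \\
&= \left( E\left[ e^{jw \frac{x - m_x}{\sigma_x \sqrt{n}}} \right] \right)^n
\end{aligned}
\tag{2.3}
$$

where the last equalities follow from the independent, identically distributed assumption, and $x$ denotes a generic random variable with the common distribution of the $x_i$. Now, we need to expand the exponential in the expression, since, for large $n$, the exponent is small, and thus the exponential can be approximated by its first few terms.

$$
e^{jw \frac{x - m_x}{\sigma_x \sqrt{n}}} \approx 1 + \frac{jw (x - m_x)}{\sigma_x \sqrt{n}} - \frac{w^2 (x - m_x)^2}{2 \sigma_x^2 n} + \ldots \tag{2.4}
$$

Keeping only the first three terms, we have

$$
\begin{aligned}
E\left[ e^{jw \frac{x - m_x}{\sigma_x \sqrt{n}}} \right] &\approx 1 + \frac{jw E[x - m_x]}{\sigma_x \sqrt{n}} - \frac{w^2 E[(x - m_x)^2]}{2 \sigma_x^2 n} \\
&\approx 1 - \frac{w^2}{2n}
\end{aligned}
\tag{2.5}
$$

because $E[x - m_x] = 0$, $E[(x - m_x)^2] = \sigma_x^2$. Thus,

$$
\Phi_{y_n}(w) \approx \left( 1 - \frac{w^2}{2n} \right)^n
$$

and, taking limits as $n \to \infty$, we get

$$
\lim_{n \to \infty} \Phi_{y_n}(w) = e^{-w^2/2}
$$

which is the characteristic function of a zero-mean, unit variance Gaussian random variable.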
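Both the limit used at the end of the proof sketch and the statement of the theorem can be examined numerically. The sketch below assumes, for illustration only, that the $x_i$ are uniform on $[0, 1]$ (so $m_x = 1/2$, $\sigma_x^2 = 1/12$); the function names and sample sizes are hypothetical choices, and the Gaussian distribution function is evaluated with the error function.

```python
import math
import random

def clt_char_fn_approx(w, n):
    """The approximation (1 - w^2 / (2n))^n to the characteristic function of y_n."""
    return (1.0 - w * w / (2.0 * n)) ** n

def normalized_sum(n, rng):
    """y_n = (S_n - n m_x) / (sigma_x sqrt(n)) for i.i.d. Uniform(0, 1) terms."""
    s = sum(rng.random() for _ in range(n))
    return (s - 0.5 * n) / math.sqrt(n / 12.0)

rng = random.Random(5)
samples = [normalized_sum(30, rng) for _ in range(20_000)]
empirical = sum(1 for y in samples if y <= 1.0) / len(samples)
gaussian = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))   # standard normal CDF at 1

print(clt_char_fn_approx(1.0, 10 ** 6), math.exp(-0.5))
print(empirical, gaussian)
```

The empirical probability $P(y_n \le 1)$ for $n = 30$ should already be close to the standard Gaussian value, even though each term of the sum is uniformly distributed.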


2.3 Advanced Topics in Convergence

This subsection provides additional mathematical background on convergence of sequences of random variables. In particular, concepts of sequences of events and Cauchy sequences are introduced.

A sequence of real numbers $\{x_n\}$ is a Cauchy sequence if, for any $\epsilon > 0$, there is an $N(\epsilon)$ such that $|x_n - x_m| < \epsilon$ for all $n, m > N$. It is a known property of the real line that all Cauchy sequences converge to a finite limit. The concept of Cauchy sequence can be generalized to any metric space, which consists of a set $M$ and a distance metric $d(x, y)$, $x, y \in M$, satisfying the following properties:

1. $d(x, y) > 0$ if $y \ne x$.

2. $d(x, x) = 0$

3. $d(x, y) + d(y, z) \ge d(x, z)$ for all $x, y, z$ (Triangle inequality).

Then, a Cauchy sequence is such that for any $\epsilon > 0$, there is an $N(\epsilon)$ such that $d(x_n, x_m) < \epsilon$ for all $n, m > N$. However, for general metric spaces, Cauchy sequences are not guaranteed to converge to a limit in the space. The real numbers (and in general other Euclidean spaces) have additional properties which ensure convergence.

Cauchy sequences can be used to develop sufficient conditions for sure and almost-sure convergence. In particular, if, almost everywhere, the sequence of real values $\{x_n(\omega)\}$ is a Cauchy sequence, then it converges almost everywhere to a value. Denote that value by $x(\omega)$; the question is whether the limit defined pointwise in this manner will be a random variable. The answer is affirmative, since we can write sets such as

$$
\{\omega : x(\omega) < a\} = \bigcup_{m=1}^{\infty} \bigcup_{n=1}^{\infty} \bigcap_{k \ge n} \{\omega : x_k(\omega) < a - 1/m\}
$$

which are countable unions and intersections of events, and thus are events themselves.

We can also consider sequences of events in a probability space, as follows. Let $\{A_n\}$ denote a sequence of events in $(\Omega, F, P)$. The sequence is said to be increasing if $A_n \subset A_{n+1}$, and decreasing if $A_{n+1} \subset A_n$. It is monotone if it is either decreasing or increasing. For any monotone sequence, the limit is defined as

$$
\lim_{n \to \infty} A_n = \begin{cases} \bigcup_{n=1}^{\infty} A_n & \text{if increasing} \\ \bigcap_{n=1}^{\infty} A_n & \text{if decreasing} \end{cases}
$$

which is guaranteed to be an event. For general, non-monotone sequences, we define sup and inf limits as

$$
\limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k
$$

$$
\liminf_{n \to \infty} A_n = \bigcup_{n=1}^{\infty} \bigcap_{k \ge n} A_k
$$

The sup limit is the set of all outcomes which belong to infinitely many of the $A_n$, while the inf limit is the set of all outcomes which belong to all $A_n$ except for a finite number. If the two coincide, we say the sequence has a limit.
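The sup and inf limits can be computed explicitly for a simple sequence of events. In the sketch below, the outcome space $\{0, 1, 2\}$, the alternating sequence of events, and the truncation of the infinite unions and intersections to a finite horizon are all illustrative assumptions; the truncation is exact here only because the sequence is periodic.

```python
def lim_sup_inf(event, horizon):
    """Approximate limsup and liminf of A_n = event(n), using tails truncated at 2 * horizon."""
    tails = [[event(k) for k in range(n, 2 * horizon)] for n in range(horizon)]
    lim_sup = set.intersection(*[set.union(*tail) for tail in tails])
    lim_inf = set.union(*[set.intersection(*tail) for tail in tails])
    return lim_sup, lim_inf

def alternating(n):
    """Hypothetical events on the outcome space {0, 1, 2}: {0, 1} for even n, {1, 2} for odd n."""
    return {0, 1} if n % 2 == 0 else {1, 2}

lim_sup, lim_inf = lim_sup_inf(alternating, horizon=10)
print(lim_sup, lim_inf)
```

Every outcome belongs to infinitely many of the events, but only the outcome 1 belongs to all of them, so the sup and inf limits differ and this sequence of events has no limit.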

One of the important properties of probability measures is that they are sequentially continuous with respect to limits of events. That is, if $\{A_n\}$ converges, then

$$
\lim_{n \to \infty} P(A_n) = P\left( \lim_{n \to \infty} A_n \right)
$$

This has the following famous lemma, which we discussed in the previous chapter, as a consequence:

Theorem 2.6 (Borel-Cantelli Lemma) For an arbitrary sequence of events $\{A_n\}$, if $\sum_{n=1}^{\infty} P(A_n) < \infty$, then

$$
P\left( \limsup_{n \to \infty} A_n \right) = 0.
$$
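As an illustration of the lemma, consider events with $P(A_n) = 1/n^2$, which are summable. The simulation below additionally assumes that the events are independent, which the lemma does not require; this assumption, the function name `count_occurrences`, and the truncation at $n = 500$ are choices made only to keep the sketch simple.

```python
import random

def count_occurrences(n_max, rng):
    """Count how many independent events A_n, with P(A_n) = 1/n^2, occur for n <= n_max."""
    return sum(1 for n in range(1, n_max + 1) if rng.random() < 1.0 / (n * n))

rng = random.Random(6)
counts = [count_occurrences(500, rng) for _ in range(1000)]
average = sum(counts) / len(counts)
expected = sum(1.0 / (n * n) for n in range(1, 501))   # expected number of occurrences

print(average, expected, max(counts))
```

In every simulated trajectory only a handful of the events occur, which is consistent with the conclusion that, with probability one, only finitely many of the $A_n$ occur.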

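The Borel-Cantelli Lemma is used primarily in proving that certain properties occur with probability one. The proof is straightforward, as

$$
\begin{aligned}
P\left( \limsup_{n \to \infty} A_n \right) &= P\left( \lim_{n \to \infty} \bigcup_{k \ge n} A_k \right) \\
&= \lim_{n \to \infty} P\left( \bigcup_{k \ge n} A_k \right) \le \lim_{n \to \infty} \sum_{k=n}^{\infty} P(A_k) = 0
\end{aligned}
\tag{2.6}
$$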


which is zero due to the summability assumption.

One of the key questions in convergence is determining conditions for when convergence in probability is equivalent to mean-square convergence or almost sure convergence. The following theorem, due to Loève, characterizes convergence in probability:

Theorem 2.7 Let {xn} be a sequence of random variables. Then

1. If the sequence converges almost surely, then it converges in probability to the same limit.

2. If the sequence converges in probability, then there is a subsequence {xnk} which converges almost surely to the same limit.

The first part of the theorem is straightforward. The second part is cumbersome to prove; if interested, see Loeve’s book on Probability Theory.

The Borel-Cantelli lemma is useful in establishing the following theorem due to Gnedenko, which gives sufficient conditions for a sequence of random variables to converge almost surely:

Theorem 2.8 Suppose that, for a sequence of random variables {xn}, for every positive integer r, we have

\[
\sum_{n=1}^{\infty} \sup_{m \ge n} P(\{\omega : |x_n - x_m| \ge 1/r\}) < \infty
\]

Then, the sequence converges almost surely.

Clearly, the sequence converges in probability, since the conditions imply

\[
\lim_{n\to\infty} \sup_{m \ge n} P(\{\omega : |x_n - x_m| \ge 1/r\}) = 0
\]

Thus, there is a limiting random variable x. Define

\[
A_n^r = \{\omega : |x_n - x| \ge 2/r\}
\]

Since $A_n^r \subset \{\omega : \max(|x_n - x_m|, |x - x_m|) \ge 1/r\}$, we have

\[
P(A_n^r) \le P(|x_n - x_m| \ge 1/r) + P(|x - x_m| \ge 1/r)
\]

Letting $m \to \infty$, we have

\[
P(A_n^r) \le \sup_{m \ge n} P(|x_n - x_m| \ge 1/r)
\]

The conditions imply that the right-hand side is summable, so the left-hand side must be summable too. The Borel-Cantelli lemma thus implies that $P(\limsup_{n\to\infty} A_n^r) = 0$, so that, for every r, $|x_n - x| \ge 2/r$ for at most a finite number of n, almost surely. If we define $A = \bigcup_{r=1}^{\infty} \limsup_{n\to\infty} A_n^r$, we see that P(A) = 0, and outside of A, we have $\lim_{n\to\infty} |x_n(\omega) - x(\omega)| = 0$. Thus, we have almost sure convergence.

We can also provide conditions whereby convergence in probability also implies convergence in mean square sense. The following theorem is due to Loeve:

Theorem 2.9 If the sequence {xn} converges to x in probability, then it converges in mean-square sense if one of the following conditions holds:

1. $\lim_{n\to\infty} E[|x_n|^2] = E[x^2] < \infty$

2. The $|x_n|^2$ variables have a uniform expectation; that is, there exists, for every $\epsilon > 0$, a value $K(\epsilon)$ such that

\[
\int_B |x_n(\omega)|^2 \, P(d\omega) < \epsilon
\]

for any set $B = \{\omega : |x_n(\omega)| \ge K(\epsilon)\}$.

To tie these concepts together, convergence in mean-square sense implies condition 2 in the theorem above. In particular, there are some simple conditions which can guarantee convergence in mean-square sense when the sequence already converges in probability:


1. $\sup_n E[|x_n|^2] = c < \infty$.

2. $|x_n| < y$ for $n > N$, and $E[y^2] < \infty$.

As a final topic in convergence, let’s focus on convergence in distribution. In particular, consider a sequence {xn} of random variables with probability distribution functions Pn(x), and characteristic functions Φn(w). The following results are standard, and can be found in most advanced probability books:

1. If, for every bounded, continuous function g : R → R, we have limn→∞E[g(xn)] = E[g(x)], then the sequence converges in distribution to x.

2. If, for every real number w, the characteristic function Φn(w) converges pointwise to Φ(w), the sequence converges in distribution to x.

3. If, for any two values x1,x2, we have Pn(x1) − Pn(x2) converges to P(x1) − P(x2), the sequence converges in distribution to x.

In particular, the equivalence between the convergence of distribution functions and the convergence of characteristic functions pointwise is known as the continuity theorem of probability. This name is derived from the fact that the 1-1 correspondence between distributions and characteristic functions is preserved by the limit operation. That is, the limit of the characteristic functions is the characteristic function of the limit.

Using the additional fact that, if the limit of characteristic functions is continuous at w = 0, then a distribution function corresponding to this limit exists, we have a new result: A necessary and sufficient condition for convergence in distribution, such that

\[
\lim_{n\to\infty} P_n(x) = P(x)
\]

at all points of continuity of P(x), is that the corresponding sequence of characteristic functions converges to a characteristic function which is continuous at w = 0.

There is a special case where convergence in probability and distribution are equivalent: when the limit random variable is a constant, such as 0. In such cases, the limiting distribution is a step function, switching from zero to 1 instantly. Then,

\[
\lim_{n\to\infty} P_n(x) = u(x - c)
\]

where u is the unit step function, and c is the limiting constant. Thus,

\[
\lim_{n\to\infty} P[|x_n - c| > \epsilon] = 0
\]

for any $\epsilon > 0$. This implies convergence in probability.

As a final result, consider when convergence in distribution implies convergence of the probability densities to the probability density of the limit. The following condition is sufficient: if $\Phi_n(w)$ and $\Phi(w)$ are absolutely integrable (i.e. $\int_{-\infty}^{\infty} |\Phi(w)| \, dw < \infty$), and

\[
\lim_{n\to\infty} \int_{-\infty}^{\infty} |\Phi_n(w) - \Phi(w)| \, dw = 0
\]

then the density functions $p_n(x)$ converge uniformly to the density p(x). The integrability assumption guarantees that the densities are bounded and continuous, and defined pointwise by the inverse Fourier transform.

2.4 Martingale Sequences

In the previous subsection, we discussed conditions whereby we could establish almost sure convergence for sequences which converge only in probability. In this subsection, we present a special class of sequences of random variables, called a martingale sequence, for which stronger results can be established.


Definition 2.7 A sequence of random variables {xn} is called a martingale if

\[
E[x_n | x_0, \ldots, x_{n-1}] = x_{n-1}
\]

almost everywhere for all n > 1.

Thus, a martingale has the property that increments xn+1−xn are zero-mean, conditioned on xn. Martingales arise naturally in the study of sequences which are partial sums of independent random variables. For instance, in the law of large numbers, it is clear that the partial sums s(n) form a martingale, since

E[s(n)|s(0), . . . ,s(n− 1)] = E[xn + s(n− 1)|x0, . . . ,xn−1] = E[s(n− 1)|x0, . . . ,xn−1] = s(n− 1) (2.7)

By the above argument, any sequence with independent, zero-mean increments will form a martingale. Martingales have the following properties

1. $E[x_n] = E[x_0]$.

2. $E[x_{n+m} x_n] = E[x_n^2]$ if $m \ge 0$.

3. $E[x_n(x_{n+m} - x_n)] = 0$ for $m > 0$.

4. $E[x_{n+m}^2] \ge E[x_n^2]$

5. $E[(x_{n+m} - x_n)^2] \ge 0$

6. For any $m \ge 0$, the sequence $y_n = x_{n+m} - x_m$ is a zero-mean martingale.
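The first four properties can be checked exactly on a small example. The sketch below assumes a symmetric ±1 random walk started at $x_0 = 0$, which is a martingale with $E[x_n^2] = n$, and enumerates all equally likely paths. The helper name `expectation` and the choice n = 3, m = 4 are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction
from itertools import product

def expectation(f, steps):
    """Exact E[f(path)] for a symmetric +/-1 random walk with `steps` steps.

    path[k] is x_k, with x_0 = 0; all 2**steps paths are equally likely.
    """
    total = Fraction(0)
    for increments in product((-1, 1), repeat=steps):
        path = [0]
        for d in increments:
            path.append(path[-1] + d)
        total += Fraction(f(path))
    return total / 2 ** steps

n, m = 3, 4
mean_n = expectation(lambda x: x[n], n + m)                      # property 1
cross = expectation(lambda x: x[n + m] * x[n], n + m)            # property 2
second_n = expectation(lambda x: x[n] ** 2, n + m)
second_nm = expectation(lambda x: x[n + m] ** 2, n + m)          # property 4
orth = expectation(lambda x: x[n] * (x[n + m] - x[n]), n + m)    # property 3
print(mean_n, cross, second_n, second_nm, orth)
```

Exact rational arithmetic is used so that the equalities hold without rounding error.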

All of the above are easily established from the martingale definition, by using the smoothing property of conditional expectations. To illustrate this, we demonstrate (4) above, which relies on (3):

\[
\begin{aligned}
E[x_{n+m}^2] &= E[(x_{n+m} - x_n + x_n)^2] \\
&= E[x_n^2] + E[(x_{n+m} - x_n)^2] + 2E[x_n(x_{n+m} - x_n)] \\
&\ge E[x_n^2] + 2E[x_n(x_{n+m} - x_n)] \qquad (2.8)
\end{aligned}
\]

To complete the demonstration, use the smoothing property to show

\[
E[x_n(x_{n+m} - x_n)] = E[E[x_n(x_{n+m} - x_n) | x_n]] = E[x_n(E[x_{n+m} | x_n] - x_n)] = 0 \qquad (2.9)
\]

since $E[x_{n+m} | x_n] = x_n$ by the definition of martingales.

The importance of martingales is that we can establish a useful bound on the convergence of a martingale to its limit. The following theorem provides such a result, similar to the Chebyshev inequality for random variables:

Theorem 2.10 For a martingale {xn}, given any $\epsilon > 0$ and any n, we have

\[
P\Big[\max_{0 \le k \le n} |x_k| \ge \epsilon\Big] \le \frac{E[x_n^2]}{\epsilon^2}
\]

A proof of this result goes as follows: Construct the sets $A_j$ as

\[
A_j = \{\omega : |x_j| \ge \epsilon, \; |x_k| < \epsilon \text{ for } k < j\}
\]

Then,

\[
\Big\{\max_{0 \le k \le n} |x_k| \ge \epsilon\Big\} = \bigcup_{j=0}^{n} A_j
\]


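Let $I_j$ be the indicator function of $A_j$; then, since the $A_j$ are disjoint,

\[
\begin{aligned}
E[x_n^2] &\ge E\Big[x_n^2 \sum_{j=0}^{n} I_j\Big] \\
&= E\Big[\sum_{j=0}^{n} x_n^2 I_j\Big] \\
&= E\Big[\sum_{j=0}^{n} (x_n - x_j + x_j)^2 I_j\Big] \\
&\ge E\Big[\sum_{j=0}^{n} x_j^2 I_j\Big] + 2E\Big[\sum_{j=0}^{n} x_j(x_n - x_j) I_j\Big] \qquad (2.10)
\end{aligned}
\]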

Now, using the smoothing property of conditional expectations, we have

\[
E\Big[\sum_{j=0}^{n} x_j(x_n - x_j) I_j\Big] = \sum_{j=0}^{n} E\big[E[x_j(x_n - x_j) I_j | x_0, \ldots, x_j]\big] = 0 \qquad (2.11)
\]

because the only random quantity in the inner expectation, conditioned on knowing $x_0, \ldots, x_j$, is $x_n - x_j$, and the martingale property guarantees that this has zero conditional mean. Thus,

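\[
\begin{aligned}
E[x_n^2] &\ge \sum_{j=0}^{n} E[I_j x_j^2] \\
&\ge \sum_{j=0}^{n} \epsilon^2 E[I_j] = \sum_{j=0}^{n} \epsilon^2 P(A_j) \\
&= \epsilon^2 P\Big(\bigcup_{0 \le j \le n} A_j\Big) = \epsilon^2 P\Big(\max_{0 \le k \le n} |x_k| \ge \epsilon\Big) \qquad (2.12)
\end{aligned}
\]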
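The bound of Theorem 2.10 can be checked exactly on a small case. The sketch below again assumes a symmetric ±1 random walk started at 0, for which $E[x_n^2] = n$. The function names and the values n = 8, ε = 4 are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def walk_paths(steps):
    """Yield every path [x_0, ..., x_steps] of a +/-1 random walk with x_0 = 0."""
    for increments in product((-1, 1), repeat=steps):
        path = [0]
        for d in increments:
            path.append(path[-1] + d)
        yield path

def max_exceedance_probability(steps, eps):
    """Exact P[max_{0<=k<=steps} |x_k| >= eps] for the symmetric walk."""
    hits = sum(1 for path in walk_paths(steps)
               if max(abs(v) for v in path) >= eps)
    return Fraction(hits, 2 ** steps)

n, eps = 8, 4
lhs = max_exceedance_probability(n, eps)   # probability that the running maximum exceeds eps
rhs = Fraction(n, eps ** 2)                # E[x_n^2] / eps^2, using E[x_n^2] = n
print(lhs, rhs, lhs <= rhs)
```

Note that the right-hand side is the same bound the Chebyshev inequality gives for $|x_n|$ alone; the martingale inequality extends it to the maximum over the whole path.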

Based on the above bound, we can show the following theorem:

Theorem 2.11 (Martingale Convergence Theorem) Let {xn} be a martingale such that

\[
E[x_n^2] \le C < \infty
\]

for all n. Then, {xn} converges almost surely to a random variable x with finite variance.

Proof: Based on the martingale property, we know that $E[x_n^2]$ is a monotone nondecreasing sequence in n, and is bounded above by the assumption, so that it has a limit. Since this has a limit, and $E[(x_{n+m} - x_m)^2] = E[x_{n+m}^2] - E[x_m^2]$ by property (2), then

\[
\lim_{m,n\to\infty} E[(x_{n+m} - x_m)^2] = 0
\]

which shows immediately mean-square convergence. To show almost sure convergence, we apply the martingale inequality to the martingale $x_{k+m} - x_m$, $k \ge 0$, to obtain

\[
P\Big[\max_{0 \le k \le n} |x_{k+m} - x_m| \ge \epsilon\Big] \le \frac{E[(x_{n+m} - x_m)^2]}{\epsilon^2}
\]

so that

\[
\lim_{m\to\infty} P\Big[\max_{k \ge 0} |x_{k+m} - x_m| \ge \epsilon\Big] = 0
\]

Since probabilities are continuous as a function of events, this also implies

\[
P\Big[\lim_{m\to\infty} \max_{k \ge 0} |x_{k+m} - x_m| \ge \epsilon\Big] = 0
\]

which makes this a Cauchy sequence of numbers, almost surely, so that a limit random variable x exists, and convergence is almost sure.


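A simple application of the above result is the Strong Law of Large Numbers. Define

\[
y_n = \sum_{i=1}^{n} \frac{x_i - m_x}{i} \qquad (2.13)
\]

This is a martingale, since it has independent, zero-mean increments. Furthermore,

\[
E[y_n^2] = \sum_{i=1}^{n} \frac{\sigma_x^2}{i^2} = \sigma_x^2 \sum_{i=1}^{n} \frac{1}{i^2} \qquad (2.14)
\]

which is bounded above, and thus satisfies the conditions of the martingale convergence theorem. Thus, $\lim_{n\to\infty} y_n = y$ almost everywhere, for some random variable y with finite variance. Next, note that we can write the sequence of partial sums in terms of the $y_n$, since $x_i - m_x = i(y_i - y_{i-1})$, where $y_0 = 0$:

\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} (x_i - m_x) &= \frac{1}{n}\sum_{i=1}^{n} i\,y_i - \frac{1}{n}\sum_{i=1}^{n} i\,y_{i-1} \\
&= y_n + \frac{1}{n}\sum_{i=1}^{n-1} (i - (i+1))\,y_i \\
&= y_n - \frac{1}{n}\sum_{i=1}^{n-1} y_i \qquad (2.15)
\end{aligned}
\]

Taking limits,

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (x_i - m_x) = \lim_{n\to\infty} y_n - \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n-1} y_i = 0
\]

almost surely, since the running averages of a convergent sequence converge to the same limit y. This proves the strong law of large numbers.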
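The summation-by-parts identity (2.15) holds for any numbers, not only random ones, so it can be checked directly. In the following sketch the values in `centered` stand in for $x_i - m_x$ and are arbitrary; the helper name `weighted_sums` is an assumption of this example.

```python
def weighted_sums(centered):
    """Return [y_0, ..., y_n] with y_0 = 0 and y_n = sum_{i<=n} centered_i / i."""
    y = [0.0]
    for i, c in enumerate(centered, start=1):
        y.append(y[-1] + c / i)
    return y

centered = [0.7, -1.2, 0.4, 2.5, -0.9, -1.5]   # arbitrary stand-ins for x_i - m_x
n = len(centered)
y = weighted_sums(centered)

sample_average = sum(centered) / n             # left-hand side of (2.15)
rhs = y[n] - sum(y[1:n]) / n                   # y_n - (1/n) * sum_{i=1}^{n-1} y_i
print(sample_average, rhs)
```

Up to floating-point rounding, the two printed values agree, which is all that (2.15) claims. The probabilistic content of the argument lies entirely in the convergence of $y_n$.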

2.5 Extensions of the Law of Large Numbers and the CLT

Now that we have the mechanisms for analyzing convergence of random sequences, we can state the stronger versions of the Law of Large Numbers and the Central Limit Theorem. In particular, we want to relax the independent, identically distributed assumption which seemed to play such a crucial role. We begin by recalling the definition of the sequence of partial sums

\[
s(n) = \frac{1}{n}\sum_{i=1}^{n} (x_i - E[x_i])
\]

A sufficient condition for convergence in probability to a limit is that $\lim_{n\to\infty} E[s(n)^2] = 0$. Note that this does not require independence, or identical distributions. If the random sequence {xn} has the property that each $x_n$ has finite variance $\sigma_{x_n}^2$, and the sequence is pairwise uncorrelated, then

\[
E[s(n)^2] = \frac{1}{n^2}\sum_{i=1}^{n} \sigma_{x_i}^2
\]

A sufficient condition for this quantity to converge to zero is Kolmogorov's condition that

\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} \sigma_{x_i}^2 < \infty
\]

To extend this to almost sure convergence, we use the martingale convergence theorem argument as in the previous subsection. Assume that we have a sequence of independent random variables, satisfying the Kolmogorov condition. Using the definition of $y_n$ in (2.13), equation (2.14) becomes

\[
E[y_n^2] = \sum_{i=1}^{n} \frac{\sigma_{x_i}^2}{i^2} \le \Big(\sum_{i=1}^{n} \sigma_{x_i}^2\Big)\Big(\sum_{i=1}^{n} \frac{1}{i^2}\Big) \le K \sum_{i=1}^{n} \sigma_{x_i}^2 \qquad (2.16)
\]

for some constant K, which is bounded as n → ∞ by assumption. Thus, $y_n$ satisfies the conditions used to prove the strong law of large numbers above.

Similar relaxations of the central limit theorem are possible. Consider a sequence of random variables {xn} with finite means $m_{x_n}$ and finite variances $\sigma_{x_n}^2$. Denote the partial sum $S_n$ as

\[
S_n = \sum_{i=1}^{n} (x_i - m_{x_i})
\]

Let $\sigma_n^2$ denote the variance of $S_n$, and define the new random sequence {yn} as

\[
y_n = \frac{S_n}{\sigma_n}
\]

The extensions of the central limit theorem give conditions under which the sequence {yn} converges in distribution to a Gaussian random variable with mean zero and variance 1.

One of the strongest extensions is due to Lindeberg, and is summarized below:

Theorem 2.12 If the elements of the sequence {xn} are independent, then the central limit theorem holds if

\[
\lim_{n\to\infty} \frac{1}{\sigma_n^2} \sum_{i=1}^{n} \int_{|x - m_{x_i}| > \epsilon\sigma_n} (x - m_{x_i})^2 \, p_i(x) \, dx = 0
\]

for any $\epsilon > 0$.

The Lindeberg condition is also necessary as long as $\lim_{n\to\infty} \sigma_n = \infty$. Assuming independence,

\[
\sigma_n^2 = \sum_{i=1}^{n} E[(x_i - m_{x_i})^2]
\]

and, as long as the contributions of the individual terms $(x_i - m_{x_i})^2$ vanish sufficiently slowly, then $\lim_{n\to\infty} \sigma_n = \infty$.

Proving the Lindeberg theorem is complex; Loeve has a proof in his book. Instead, we focus on a simpler result:

Theorem 2.13 If the elements of the sequence {xn} are independent, then the central limit theorem holds if

\[
C_1 \le E[(x_n - m_{x_n})^2] \le C_2
\]

for all n.

Note that this implies $C_2 n \ge \sigma_n^2 \ge C_1 n$. The upper bound also implies that, as n → ∞, the terms

\[
\int_{|x - m_{x_i}| > \epsilon\sigma_n} (x - m_{x_i})^2 \, p_i(x) \, dx
\]

in the numerator in the Lindeberg condition are decaying to zero, while the denominator grows at least as fast as $nC_1$. Since there are only n terms in the numerator, their sum, divided by the denominator, will decay to 0. This establishes that the Lindeberg condition is satisfied. In practice, what the Lindeberg condition requires is that each term

\[
\frac{x_n - m_{x_n}}{\sigma_n}
\]

be uniformly small, so that the sum of the terms is not dominated by a finite subset of the terms, but instead is the sum of many individually negligible components.


2.6 Large Deviations

We saw in the previous section that, when we have a sequence $x_1, x_2, \ldots$ of independent, identically distributed random variables with finite mean $m_x$, the partial sums $S_n = x_1 + \ldots + x_n$ have the property that, if $a > m_x$, $P[S_n/n \ge a] \to 0$ as n → ∞, by the weak law of large numbers. If the $x_i$ have finite variance, one can use the central limit theorem to compute the probabilities of small deviations of this average sum from its mean, where the deviations are written in terms of multiples of $1/\sqrt{n}$. In this section, we are interested in evaluating $P[S_n/n \ge a]$ for fixed a as n → ∞, and in particular, how fast this converges to zero with increasing n. If a remains fixed, the deviation from the mean will be large relative to the variance of the scaled sum.

Define the moment generating function of $x_i$ as $M(u) = E[e^{u x_i}]$. Note that $e^{u x_i}$ is a non-negative random variable for any u, so the expectation is well-defined, although it could take the value +∞. By independence, the moment generating function of $S_n/n$ is:

\[
E\big[e^{\frac{u}{n}(x_1 + \ldots + x_n)}\big] = M\Big(\frac{u}{n}\Big)^n
\]

Using the Chernoff inequality of Section 1.12.3 on the variable $S_n/n$, we get

\[
P\Big[\frac{S_n}{n} \ge a\Big] \le e^{-ua} M\Big(\frac{u}{n}\Big)^n = e^{-n(u'a - \ln M(u'))}
\]

for any $u' = u/n \ge 0$. To optimize the bound, one selects $u'$ to maximize $au' - \ln M(u')$. Note that, in general, $\ln M(u)$ is a convex function, $\ln M(0) = 0$, and $\frac{d}{du} \ln M(u)|_{u=0} = E[x_i]$. Thus, as long as $a > E[x_i]$, the maximum value will be well-defined, and given by:

\[
\ell(a) = \sup_{u \ge 0} \big[ua - \ln M(u)\big]
\]

The resulting Chernoff bound is

\[
P\Big[\frac{S_n}{n} \ge a\Big] \le e^{-n\ell(a)}, \quad a > E[x_i]
\]

The Chernoff bound gives an upper bound on the tail probability of the average sum. The next result provides a lower bound:

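Theorem 2.14 (Cramér's theorem) Let $E[x_i]$ be finite, and let $a > E[x_i]$. Then, for every $\epsilon > 0$, there exists $N_\epsilon$ such that, for $n > N_\epsilon$, we have

\[
P\Big[\frac{S_n}{n} \ge a\Big] \ge e^{-n\ell(a) - n\epsilon}, \quad a > E[x_i]
\]

Combining this with the Chernoff bound above, one gets

\[
\lim_{n\to\infty} \frac{1}{n} \ln P\Big[\frac{S_n}{n} \ge a\Big] = -\ell(a)
\]

In particular, if $P[x_i \ge a] > 0$, then $\ell(a)$ is finite, and

\[
P\Big[\frac{S_n}{n} \ge a\Big] = e^{-n\ell(a) - n\epsilon_n}
\]

for some sequence $\epsilon_n \ge 0$ with $\lim_{n\to\infty} \epsilon_n = 0$.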

To prove the lower bound in the theorem, we make the additional assumption that the variables $x_i$ are bounded. This can be relaxed, but requires a longer argument. We also assume $P[x_i > a] > 0$. Since the $x_i$ are bounded, this guarantees that M(u) is finite and infinitely differentiable. We define a change of probability measure for the individual random variables $x_i$, with the new probability density function as

\[
\hat{p}_{x_i}(x) = \frac{p_{x_i}(x) e^{ux}}{M(u)}
\]

Note that this density integrates to 1, and is well-defined for all x, and depends on the parameter u. We assume the random variables $x_i$ remain independent under this new measure. Note that, under this new measure, the new expected value of $x_i$ is

\[
\int x \hat{p}_{x_i}(x) \, dx = \frac{\int x \, p_{x_i}(x) e^{ux} \, dx}{M(u)} = \frac{d}{du} \ln M(u)
\]

Similarly, the new variance is

\[
\int x^2 \hat{p}_{x_i}(x) \, dx - \Big(\int x \hat{p}_{x_i}(x) \, dx\Big)^2 = \frac{d^2}{du^2} \ln M(u)
\]

We denote the new measure on the sequence of random variables as $P_u$, and the original measure as P. Under the assumption that $P[x_i > a] > 0$, there is a unique value that maximizes $au - \ln M(u)$. Let that maximizing value be $u^*$, so $\ell(a) = au^* - \ln M(u^*)$. At $u^*$, the derivative must be zero, so $\frac{d}{du} \ln M(u)|_{u=u^*} = a$, so the average value of the iid random variables $x_n$ under the new measure is a. Note now that, for b > a,

\[
\begin{aligned}
P\Big[\frac{S_n}{n} \ge a\Big] &= \int_{\{\omega : S_n \ge na\}} \frac{M(u^*)^n}{M(u^*)^n} e^{-u^* S_n} e^{u^* S_n} \, dP \\
&= M(u^*)^n \int_{\{\omega : S_n \ge na\}} e^{-u^* S_n} \, dP_{u^*} \\
&\ge M(u^*)^n \int_{\{\omega : na \le S_n \le nb\}} e^{-u^* S_n} \, dP_{u^*} \\
&\ge M(u^*)^n e^{-u^* nb} P_{u^*}[na \le S_n \le nb] \\
&= e^{-n\ell(a) - nu^*(b-a)} P_{u^*}[na \le S_n \le nb]
\end{aligned}
\]

Note that, under $P_{u^*}$, $E[x_i] = a$, so by the central limit theorem, $\lim_{n\to\infty} P_{u^*}[na \le S_n \le nb] = 1/2$. So, for n large enough, we have $P_{u^*}[na \le S_n \le nb] > 1/3$. Thus, for n large enough,

\[
P\Big[\frac{S_n}{n} \ge a\Big] > \frac{1}{3} e^{-n\ell(a) - nu^*(b-a)} = e^{-n\left(\ell(a) + u^*(b-a) + \frac{\ln 3}{n}\right)}
\]

establishing the lower bound, by picking b close to a.

Example 2.8 Suppose $x_i$ were independent, identically distributed exponential random variables with parameter λ = 2. Then,

\[
M(u) = \int_0^{\infty} 2 e^{ux} e^{-2x} \, dx = \begin{cases} \dfrac{2}{2-u} & u < 2 \\ \infty & u \ge 2 \end{cases}
\]

\[
\ell(a) = \max_u \big[au - \ln M(u)\big] = \max_{u<2} \big[au + \ln(2-u) - \ln 2\big]
\]

Differentiating and setting the result to zero yields

\[
a - \frac{1}{2-u^*} = 0 \;\to\; u^* = 2 - \frac{1}{a}
\]

and $\ell(a) = 2a - 1 - \ln(2a)$, $a > 0$.
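The closed form in Example 2.8 can be compared against a direct numerical maximization. The sketch below approximates the supremum over 0 ≤ u < 2 by a simple grid search; the function names, the grid size, and the test value a = 1.5 are assumptions made for illustration.

```python
import math

RATE = 2.0  # parameter lambda of the exponential distribution in Example 2.8

def log_mgf(u):
    """ln M(u) = ln(2 / (2 - u)), valid for u < 2."""
    return math.log(RATE / (RATE - u))

def rate_numeric(a, grid=200_000):
    """Approximate sup over 0 <= u < 2 of a*u - ln M(u) by a grid search."""
    best = 0.0  # value of the objective at u = 0
    for k in range(grid):
        u = RATE * k / grid          # grid points stay strictly below 2
        best = max(best, a * u - log_mgf(u))
    return best

def rate_closed_form(a):
    """Closed form from the example: 2a - 1 - ln(2a)."""
    return RATE * a - 1.0 - math.log(RATE * a)

a = 1.5  # any value above the mean E[x_i] = 1/2
numeric = rate_numeric(a)
exact = rate_closed_form(a)
print(numeric, exact)
```

For a above the mean 1/2, the maximizer $u^* = 2 - 1/a$ is nonnegative, so restricting the search to u ≥ 0 (as in the definition of the rate function) does not change the answer.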

2.7 Spaces of Random Variables

In order to get geometric insight into several of the basic operations used in this course, it is useful to understand how collections of random variables resemble the normal n-dimensional Euclidean spaces which we use in normal vector operations.

To that end, consider the collection of random variables with finite second moments defined on a probability space (Ω,F,P). This includes random variables that are constant. It is clear that, if X,Y belong to this collection, then, for any real numbers a,b, aX + bY is also in the collection, since

E[(aX + bY )2] = a2E[X2] + b2E[Y 2] + 2abE[XY ] ≤ a2E[X2] + b2E[Y 2] + 2|ab|(E[X2]E[Y 2])1/2 < ∞

Thus, like vectors, linear combinations of random variables with finite second moments are also in the same space. Also, we have addition and scalar multiplication defined in the collection of finite mean and variance random variables. Other properties of this collection, which are similar to vectors, include:


1. There is a zero random variable, which, when added to every other random variable, is an additive identity. Define the random variable X(ω) = 0 for all ω ∈ Ω. Then, for any other random variable Y, Y + X = Y.

2. For every random variable X, there is a second random variable in the collection, Y, such that X + Y = 0. In essence, define Y(ω) = −X(ω); it is clear that Y has finite mean and variance.

The above properties guarantee that the collection of finite second moment random variables is a linear vector space, or a vector space for short. We now carry the analogy a little further: we define the concept of an inner product among vectors, or among random variables, as follows: Given any two finite second moment random variables X,Y , the inner product of X,Y , denoted by < X,Y >, is given by

< X,Y >= E[XY ]

Note that the above expectation is guaranteed to exist, and thus always assigns a real number to the inner product of two variables.

The inner product operation satisfies the following conditions:

1. < X,X >≥ 0

2. < aX + bY,Z >= a < X,Z > +b < Y,Z > for any real numbers a,b and random variables X,Y,Z.

3. < Z,aX + bY >= a < Z,X > +b < Z,Y >

Indeed, this almost satisfies the concepts of inner product operations in Euclidean space. The one difference is that, for Euclidean vectors, < X,X > > 0 if X ≠ 0, whereas we cannot say that unequivocally for zero-mean random variables. In essence, we can have some random variables which are non-zero only on sets of probability measure zero, for which < X,X > = 0.

The trouble is that our random variables are too many, and that two random variables can be essentially equivalent (equivalent almost everywhere), but treated as different random variables in our collection. Formally, what we have to do is to define equivalence classes of random variables, where two random variables are said to be equivalent (written as X ≡ Y, or X = Y a.e.), if X = Y almost everywhere (i.e. except for a set of zero probability).

Thus, for the space of finite second moment equivalence classes of random variables, we have the property < X,X > > 0 if X ≠ 0 a.e., which is the same property satisfied by the inner product in Euclidean spaces. Thus, the linear vector space of random variables also has an inner product structure defined on it; furthermore, the inner product defines a norm ‖X‖ = < X,X >^{1/2}, and satisfies the triangle inequality:

< X + Y,X + Y >1/2≤< X,X >1/2 + < Y,Y >1/2

Another important property of this space was established above: given any Cauchy sequence of elements in this space {Xn} (that is, a sequence satisfying the Cauchy criterion), there is a random variable X which has finite second moment, to which the sequence converges in mean-square sense. Note that convergence in mean-square sense is equivalent to convergence in the norm defined above, as

‖X −Xn‖2 =< X −Xn,X −Xn >= E[(X −Xn)2]

which is the same metric used to define mean-square sense convergence. This important property implies that all of the limits of Cauchy sequences of finite second-moment random variables are also finite second moment random variables, so that this is a complete space. Such a vector space (a complete vector space with an inner product structure) is called a Hilbert space, and has mathematical properties very similar to standard n-dimensional Euclidean spaces.

An important subspace of this space is the space of Gaussian random variables with finite second moments. As before, this space is closed under linear operations of addition and scalar multiplication, since the sum of Gaussian random variables is also Gaussian. Furthermore, the space is also closed under limits of Cauchy sequences with the above metric, since these Cauchy sequences converge in mean-square sense, and thus in distribution, so that the limit will also be a Gaussian random variable. Thus, the space of Gaussian random variables with finite means and variances forms a closed subspace of the space of all random variables with finite second moments. The above spaces will be used extensively in the solution of estimation problems.


Chapter 3

Estimation of Parameters

3.1 Introduction

In this chapter we consider the problem of estimating or inferring the values of unknown quantities based on observation of a related set of random variables. The model of the general parameter estimation situation we are considering is depicted in Figure 3.1. The basic idea is that based on an observation y we want to estimate an unknown quantity x by using an estimation rule x̂(y). In particular, this model has three components:

1. A model of nature or the parameter space that generates x

2. A model of the observation process as represented by the density pY |X(y | x).

3. An estimation rule mapping each actual observation to a corresponding estimate x̂(y).

Note that this model captures the essential elements of many problems in engineering and science, including: finding the location of a target based on radar observations, estimating the heart rate of a patient from electrical measurements, discerning O+ density in the atmosphere from brightness measurements, estimating depth in a scene from apparent motion. In all cases let us emphasize that the “estimation rule” is really nothing more than a function that maps each point in the observation space to a corresponding estimate. Thus it can be seen that estimation is closely related to detection. Indeed, the only difference is really in the nature of the variable being estimated. In particular, if the unknown x is discrete valued we generally call it a detection problem, while if x is continuously valued we say we are doing estimation.

[Figure: block diagram with the labels Parameter Space (x), Observation Space (y), observation model p_{Y|X}(y | x), and Estimator, whose Estimation Rule (mapping of each y to an estimate) produces x̂(y).]

Figure 3.1: Parameter Estimation Problem Components

The first element of our general model in Figure 3.1 is the unknown quantity (or quantities) whose estimate we desire. We denote this element by X (or by the vector X if there are a number of such quantities, though we will be lax in this regard). There are two common models for this unknown X which lead to two distinct, though related, approaches to estimation. In the first model, termed “Bayesian,” X is viewed as a random quantity, which leads to what are termed “Bayesian” approaches to estimation. The other model views X as unknown, but nonrandom, and corresponds to what is variously known as nonrandom parameter or “Fisher” estimation. Nonrandom parameter estimation is usually accomplished via maximum likelihood estimation. Our primary focus will be on Bayesian approaches to estimation, which we discuss in Sections 3.3–3.6. We will discuss nonrandom parameter estimation, and in particular maximum likelihood estimation, in Section 3.9.

3.2 Quick Review of Random Vectors

In the remainder of this chapter, we assume we will be estimating a vector of unknowns, based on observations of another vector of unknowns. We briefly review some relevant aspects of random vectors. Recall from Section 1.9 that random vectors on a common probability space are defined as a collection of random variables. We use the notation

\[
X = \begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}
\]

to denote a vector of N random variables. The joint density function is given by

\[
p_X(x) = \frac{\partial^N}{\partial x_1 \ldots \partial x_N} P_X(x).
\]

We define the mean $E[X] = m_X$ as

\[
E[X] = \begin{bmatrix} E[X_1] \\ \vdots \\ E[X_N] \end{bmatrix}
\]

Suppose Y is another random vector on the same probability space, with dimension M. The cross-correlation matrix of X and Y is given by the N × M matrix $E[XY^T]$, which has as its ij-th entry $E[X_i Y_j]$. The cross-covariance matrix, denoted by Cov(X,Y) and also as $\Sigma_{XY}$, is given by

\[
\Sigma_{XY} = E\big[(X - m_X)(Y - m_Y)^T\big] = E\big[XY^T\big] - m_X m_Y^T.
\]

Note that

\[
(\Sigma_{XY})_{ij} = E\big[(X_i - m_{X_i})(Y_j - m_{Y_j})\big] = \sigma_{X_i Y_j}
\]

so that the cross-covariance matrix is a compact way of representing the collection of cross-covariance information between the random variables in the two different random vectors. Like the covariance matrix, the cross-covariance matrix can be seen as the natural generalization of the cross-covariance between two random variables, extended to two random vectors.

Note that the cross-covariance of X with itself is referred to as the covariance of X, denoted Cov(X) or $\Sigma_X$, as in Section 1.10. Expectation is a linear operator on its arguments, and covariance is a bilinear operator. Thus, they obey the following properties: Let A,B,C denote non-random matrices and b,d non-random vectors. Then,

\[
\begin{aligned}
E[AX + b] &= AE[X] + b \\
\mathrm{Cov}(X,Y) &= E[X(Y - E[Y])^T] = E[(X - E[X])Y^T] = E[XY^T] - E[X]E[Y]^T \\
E[(AX)(CY)^T] &= AE[XY^T]C^T \\
\mathrm{Cov}(AX + b, CY + d) &= A\,\mathrm{Cov}(X,Y)\,C^T \\
\mathrm{Cov}(AX + b) &= A\,\mathrm{Cov}(X)\,A^T \\
\mathrm{Cov}(W + X, Y + Z) &= \mathrm{Cov}(W,Y) + \mathrm{Cov}(W,Z) + \mathrm{Cov}(X,Y) + \mathrm{Cov}(X,Z)
\end{aligned}
\]
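The affine rule Cov(AX + b, CY + d) = A Cov(X,Y) C^T also holds exactly for sample cross-covariances, which makes it easy to check numerically. The following sketch uses NumPy; the helper `cross_cov`, the dimensions, and the randomly drawn A, b, C, d are all assumptions made for illustration.

```python
import numpy as np

def cross_cov(x, y):
    """Sample cross-covariance of row-wise samples x (n x N) and y (n x M)."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    return xc.T @ yc / x.shape[0]

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))                 # samples of X (N = 3), one per row
y = x[:, :2] + rng.normal(size=(1000, 2))      # samples of Y (M = 2), correlated with X

A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
C = rng.normal(size=(2, 2))
d = rng.normal(size=2)

# With samples stored as rows, A X + b becomes x @ A.T + b.
lhs = cross_cov(x @ A.T + b, y @ C.T + d)
rhs = A @ cross_cov(x, y) @ C.T
print(np.max(np.abs(lhs - rhs)))
```

The offsets b and d drop out because the cross-covariance is computed from centered samples, mirroring the way they drop out of the population formula.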

Not every square matrix can be a covariance matrix of a random vector X. The following Lemma summarizes the required properties from Section 1.10:

Lemma 3.1 For a given random vector X, its correlation and its covariance matrix must be symmetric positive semidefinite matrices. Conversely, given any symmetric positive semidefinite matrix K, then K is the covariance matrix for some zero mean random vector X.


3.3 General Bayesian Estimation

In this section we focus on random parameter estimation wherein we model our unknown X as itself being random. Let us examine the three elements of an estimation problem in this light:

1. Parameter Model: As we discussed X is modeled as a random variable (or vector). Our model of nature is captured by the prior density pX(x) of the random vector X.

2. Observation Model: As Figure 3.1 indicates, the observation model captures the relationship between the observed quantity Y and the unknown X. In a Bayesian problem, this relationship is given by the conditional density $p_{Y|X}(y | x)$.

3. Estimation Rule: In a Bayesian setting the estimation rule is usually obtained by minimizing the expected value of a nonnegative cost function:

\[
\hat{x}^*(y) = \arg\min_{\hat{x}(\cdot)} E[J(\hat{x}(Y), X)] \qquad (3.1)
\]

where J(x̂,x) is the cost of estimating x̂ when x is the true value of the unknown and arg min denotes the quantity that achieves the minimum (vs the value of the minimum). From a modeling standpoint then this component reduces to choice of an appropriate cost function J(x̂,x). The quantity E [J(x̂(Y ),X)] is referred to as the Bayes risk for the problem.

Example 3.1 Suppose we wish to estimate the random variable X from noisy observations of the form:

\[
Y = X^2 + V, \quad V \sim N(0, 9) \qquad (3.2)
\]

We wish the estimate to minimize the mean square error: $E[(\hat{x}(Y) - X)^2]$. In the absence of any other information we believe that X is Gaussian distributed with mean 2 and variance 4.

Given this problem statement, our first problem element (the parameter model) is given by $p_X(x) = N(x; 2, 4)$. We can derive our second element (the observation model) by using the description in (3.2). In particular, it is straightforward to show that (3.2) implies that $p_{Y|X}(y | x) = N(y; x^2, 9)$. Finally, since we want to minimize mean square error we would choose our cost function as the square error cost $J(\hat{x}, x) = (\hat{x} - x)^2$.

3.3.1 General Bayes Decision Rule

Before proceeding to a discussion of the specific estimators we obtain through various choices of the cost function $J(\hat{x}, x)$, we will derive a general expression for the Bayes decision rule which minimizes the expected cost and the associated performance of this estimator. Recall that the Bayes estimator is derived as the minimizer of the expected cost. Thus we have:

\[
\begin{aligned}
\hat{x}^*(y) &= \arg\min_{\hat{x}(y)} E[J(\hat{x}(Y), X)] && (3.3) \\
&= \arg\min_{\hat{x}(y)} E\big[E[J(\hat{x}(Y), X) \mid Y = y]\big] && (3.4) \\
&= \arg\min_{\hat{x}(y)} \int E[J(\hat{x}(Y), X) \mid Y = y] \, p_Y(y) \, dy && (3.5) \\
&= \int \arg\min_{\hat{x}} E[J(\hat{x}, X) \mid Y = y] \, p_Y(y) \, dy && (3.6) \\
&= \arg\min_{\hat{x}} E[J(\hat{x}, X) \mid Y = y] && (3.7) \\
&= \arg\min_{\hat{x}} \int J(\hat{x}, x) \, p_{X|Y}(x \mid y) \, dx && (3.8)
\end{aligned}
\]

In going from (3.3) to (3.4) we have used the properties of iterated expectation. In going from (3.5) to (3.6) we have used the fact that pY (y) is always positive and independent of x̂, so that minimization of the conditional expected value in (3.5) for each value of y is the same as minimization of the entire function for all values of y. Recall that the estimator is nothing more than a mapping of each observation to a corresponding estimate, thus what we are saying is that we can do this minimization independently for each y. This situation is similar to what we saw in the case of detection! We can go no further than the general expression (3.8) without knowing more about the cost function J(x̂,x). We will examine special cases shortly.


3.3.2 General Bayes Decision Rule Performance

Before examining specific examples of Bayes estimators let us examine the performance measures we can use for evaluating these estimators. To this end, let us define the estimation error of the estimate x̂ as: e ≡ X − x̂(Y ). Now one reasonable way to quantify performance is to focus on the behavior of this error. Note that e itself is a random variable, one which we want “close to zero.” We present four performance measures next.

Bias: The bias is defined as:

\[
b \equiv E[e] = \int\!\!\int \big[x - \hat{x}(y)\big] \, p_{X,Y}(x,y) \, dx \, dy \qquad (3.9)
\]

It is a measure of the average value of the error, which we want to be small. Note that since b is defined as an expected value over all the random quantities in the problem, it is just a deterministic number for an estimator. Thus it is easy to correct for if it is known. In particular, if b is known, we can create an “unbiased estimator” by just correcting our estimate: x̂(y) + b.

Error Covariance: The error covariance is defined by:

$$
\Lambda_e \equiv E\big[ (e-b)(e-b)^T \big] = E\big[ ee^T \big] - bb^T \qquad (3.10)
$$

It is a measure of the variability of the error. Certainly we would like this small. For example, for a Gaussian problem, if the bias is zero and the error covariance zero, then the error would be nonrandom and identically zero!

Mean Square Error: The mean square error or “MSE” is defined as:

$$
\mathrm{MSE} \equiv E\big[ e^T e \big] = \mathrm{tr}\big( E\big[ ee^T \big] \big) = \mathrm{tr}\big( \Lambda_e + bb^T \big) \qquad (3.11)
$$

where tr denotes the trace of the matrix (see Appendix C). The MSE is the average of the squared error, and so having it small is certainly good. From the last form of the MSE we can see that it depends on both the bias and the variance of the error. When the bias and the variance are both zero the MSE is also zero. In general, we may have to trade off between estimators with different bias and variance to get the smallest MSE. Finally, note that when the bias is zero the MSE is equal to the trace of the error covariance.
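The decomposition in (3.11) can be checked numerically in the scalar case, where it reads MSE = error variance + bias². The following is a minimal sketch under assumed conditions: the model (X standard Gaussian, Y equal to X plus noise) and the deliberately biased estimator 0.5y + 0.2 are hypothetical choices made only for illustration and do not come from the notes.

```python
import random

random.seed(0)

# Hypothetical setup (not from the notes): X ~ N(0, 1), Y = X + noise,
# and a deliberately biased estimator xhat(y) = 0.5 * y + 0.2.
n = 10_000
errors = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = x + random.gauss(0.0, 0.5)
    errors.append(x - (0.5 * y + 0.2))  # e = X - xhat(Y)

bias = sum(errors) / n                              # sample version of b = E[e]
err_var = sum((e - bias) ** 2 for e in errors) / n  # sample error variance
mse = sum(e * e for e in errors) / n                # sample version of E[e^2]

print(bias, err_var, mse, err_var + bias ** 2)
```

For sample moments the identity holds exactly (up to rounding), so the last two printed numbers agree; the bias itself should come out close to −0.2 for this particular estimator.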

Expected Cost: One final and obvious measure of performance is the actual value of the expected cost itself: E[J(x̂(Y ),X)]. If you have chosen the cost J(x̂,x) to have meaning for you, then its average or expected value is certainly one reasonable measure of how well your estimator is doing. Note that in general the value of the expected cost is not obviously related to the MSE, the bias, and the variance (though we will see that for certain choices of the cost J(x̂,x) they are related).

3.4 Bayes Least Square Estimation

We will now start our examination of different choices for the cost function J(x̂,x) in the Bayesian approach to estimation and see which estimators these choices produce. Our first cost function is the square error cost:

$$
J_{\mathrm{BLSE}}(\hat{x},x) = (x-\hat{x})^T (x-\hat{x}) = \|x-\hat{x}\|^2 = \sum_{i=1}^{M} (x_i - \hat{x}_i)^2 \qquad (3.12)
$$

This cost function is depicted in Figure 3.2 as a function of the error e = x − x̂. Note that with this choice of cost function $E[J_{\mathrm{BLSE}}(\hat{x}(Y),X)] = E[e^T e]$, and thus this estimator is the minimum mean square error (MMSE) estimator by design! Now let us find the estimator. Using the expression (3.8) we obtain:

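$$
\begin{aligned}
\hat{x}_{\mathrm{BLSE}}(y) &= \arg\min_{\hat{x}} \int J(\hat{x},x)\, p_{X|Y}(x \mid y)\, dx = \arg\min_{\hat{x}} \int (x-\hat{x})^T (x-\hat{x})\, p_{X|Y}(x \mid y)\, dx && (3.13)\\
&= \arg\min_{\hat{x}} \underbrace{\int \big[ x^T x - \hat{x}^T x - x^T \hat{x} + \hat{x}^T \hat{x} \big]\, p_{X|Y}(x \mid y)\, dx}_{f(\hat{x})}
\end{aligned}
$$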


Figure 3.2: Square error cost function J(e).

Note that x̂ is a constant with respect to the integral. Now we can take the derivative of (3.13) with respect to x̂, set it equal to zero, and solve for x̂. We also need to check that the second derivative is positive at the solution to verify it is really a minimum.

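$$
\begin{aligned}
\frac{\partial}{\partial \hat{x}} \left\{ \int (x-\hat{x})^T (x-\hat{x})\, p_{X|Y}(x \mid y)\, dx \right\} &= \int \big[ 0 - x - x + 2\hat{x} \big]\, p_{X|Y}(x \mid y)\, dx \\
&= -2 \int x\, p_{X|Y}(x \mid y)\, dx + 2\hat{x} \\
&= 0 \\
\frac{\partial^2}{\partial \hat{x}^2} \left\{ \int (x-\hat{x})^T (x-\hat{x})\, p_{X|Y}(x \mid y)\, dx \right\} &= 2I > 0
\end{aligned}
$$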

The resulting estimate is given by:

$$
\hat{x}_{\mathrm{BLSE}}(y) = \int x\, p_{X|Y}(x \mid y)\, dx = E\big[ X \mid Y = y \big] \qquad (3.14)
$$

Thus the Bayes least square error estimate is just the conditional mean, i.e. the mean of the conditional density pX|Y (x | y). Note that this density is a function of the observed value y, so the estimate is as well. The Bayes least square error estimate (BLSE) is sometimes also referred to as the Bayes minimum mean square estimate (MMSE).
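The claim that the conditional mean minimizes the conditional expected square error can be illustrated by brute force. The sketch below assumes a small discrete posterior for one fixed observation; the support values and probabilities are hypothetical and chosen only for illustration.

```python
# Hypothetical discrete posterior p(x | y) for one fixed observation y.
values = [0.0, 1.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]


def expected_sq_cost(xhat):
    """Conditional expected cost E[(X - xhat)^2 | Y = y]."""
    return sum(p * (x - xhat) ** 2 for x, p in zip(values, probs))


# Conditional mean, i.e. the estimate given by (3.14).
cond_mean = sum(p * x for x, p in zip(values, probs))

# Brute-force search over a grid of candidate estimates in [0, 5].
grid = [i / 1000 for i in range(0, 5001)]
best = min(grid, key=expected_sq_cost)

print(cond_mean, best)
```

The grid minimizer lands on the conditional mean (to within the grid spacing), even though the posterior here is skewed and its mode is at a different value.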

Let us now examine performance of the BLSE estimator. First let us find the bias:

$$
b = E\big[ X - \hat{x}_{\mathrm{BLSE}}(Y) \big] = E[X] - E\big[ E[X \mid Y] \big] = 0 \qquad (3.15)
$$

Thus the BLSE estimator is always unbiased. Independent of the prior density or the observation density the bias will always be zero. This is a very good thing.

Now let us examine the error covariance. This quantity is given by:

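$$
\begin{aligned}
\Lambda_{\mathrm{BLSE}} &= E\big[ (e-b)(e-b)^T \big] = E\big[ ee^T \big] \\
&= E\Big[ \big( X - \hat{x}_{\mathrm{BLSE}}(Y) \big) \big( X - \hat{x}_{\mathrm{BLSE}}(Y) \big)^T \Big] && (3.16)\\
&= E\Big[ E\big\{ (X - E[X \mid Y]) (X - E[X \mid Y])^T \mid Y \big\} \Big] && (3.17)\\
&= E\big[ \Lambda_{X|Y}(Y) \big] && (3.18)
\end{aligned}
$$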

Thus, the BLSE error covariance is the expected value of the conditional covariance. Note that in general the conditional covariance is a function of y, the observation. Thus the expectation in (3.18) is over the random variable Y .

Now let us find the mean square error (MSE). Note that in this particular case (though not in general!) the MSE is the same as the value of the expected cost E[JBLSE(x̂,x)]. Since the bias is zero, the MSE is the same as the trace of the error covariance:

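$$
\mathrm{MSE} = E[J_{\mathrm{BLSE}}(\hat{x},x)] = \mathrm{tr}\big( \Lambda_{\mathrm{BLSE}} \big) = \mathrm{tr}\big( E\big[ \Lambda_{X|Y}(Y) \big] \big) \qquad (3.19)
$$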

Note that the MSE does not depend on y (due to the expectation operation) while the conditional covariance does in general.

In general, finding the BLSE and its associated performance is difficult, as the conditional density must first be found. In certain special cases it can be done however. Let us consider some examples.


Example 3.2 In this example we wish to estimate X by observing a related random variable Y , where the random variables X and Y are jointly distributed with the density shown in Figure 3.3. This density is uniform over the depicted diamond shaped region. Note that this characterization provides all the information to find both a prior model for X (i.e. the marginal distribution pX(x)) as well as the relationship between Y and X as given by pY |X(y | x).

Figure 3.3: BLSE Example. Left: the joint density pX,Y (x,y) = 1/2, uniform over the diamond shaped region. Right: the conditional density pX|Y (x | y), uniform on [−(1 −|y|), (1 −|y|)] with height 1/(2(1 −|y|)).

To find the BLSE estimate for this problem the quantity we need to find is the conditional density pX|Y (x | y), from which we can find E[x | y] and λx|y. To find pX|Y (x | y) we could use Bayes rule, but the geometry of the problem allows us to do this almost by inspection. Recall that pX|Y (x | y) will be a slice of the joint density pX,Y (x,y) parallel to the x-axis and scaled to have unit area. This conditional density is shown on the right in Figure 3.3 for any nontrivial value of y. Since the original density is “flat,” each slice will be flat, so all we really need to determine are the edges. The height follows from the constraint that the density has unit area.

Now, given this density it is easy to see that the BLSE, which is the mean of the density, is given by x̂BLSE(y) = E[x | y] = 0. Thus for this example the BLSE of x does not depend on y. Note that this is true in spite of the fact that X and Y are clearly dependent random variables! This may seem strange, but remember that the BLSE is based on the mean of the conditional density. Next let us find the conditional variance λx|y. From the box-like form of the conditional density pX|Y (x | y), which is uniform on an interval of half-width 1 −|y|, this variance is easily found to be λx|y = (1 −|y|)²/3. Note that the conditional variance depends on y. Finally let us find the MSE for the BLSE estimator. Since pY (y) = 1 −|y| for |y| ≤ 1, this can be shown to be MSE = E[λx|y] = 1/6. Note the MSE is independent of y, as it must be.

Example 3.3 In this example we wish to estimate X by observing a related random variable Y , where the random variables X and Y are jointly distributed with density given by:

$$
p_{X,Y}(x,y) = \begin{cases} 6x & 0 \le x \le y,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.20)
$$

Again, this characterization provides all the information to find both a prior model for X (i.e. the marginal distribution pX(x)) as well as the relationship between Y and X as given by pY |X(y | x).

To find the BLSE estimate for this problem we need to find the conditional density pX|Y (x | y). By integrating pX,Y (x,y) with respect to x we can find the marginal density for y:

$$
p_Y(y) = \int_0^y 6x\, dx = 3y^2, \quad 0 \le y \le 1. \qquad (3.21)
$$

Now we can use Bayes’ rule to obtain the conditional density:

$$
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{6x}{3y^2} = \frac{2x}{y^2}, \quad 0 \le x \le y \qquad (3.22)
$$

The mean of the conditional density is now found as:

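$$
E[x \mid y] = \int_{-\infty}^{\infty} x\, p_{X|Y}(x \mid y)\, dx = \int_0^y \frac{2x^2}{y^2}\, dx = \left. \frac{2x^3}{3y^2} \right|_0^y = \frac{2}{3}\, y \qquad (3.23)
$$

Thus $\hat{x}_{\mathrm{BLSE}}(y) = \frac{2}{3} y$.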

Next let us find the conditional variance λx|y:

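$$
\begin{aligned}
\lambda_{x|y} &= E[x^2 \mid y] - E[x \mid y]^2 = \int_{-\infty}^{\infty} x^2\, p_{X|Y}(x \mid y)\, dx - E[x \mid y]^2 && (3.24)\\
&= \int_0^y \frac{2x^3}{y^2}\, dx - \frac{4}{9}\, y^2 = \frac{y^2}{2} - \frac{4y^2}{9} = \frac{y^2}{18} && (3.25)
\end{aligned}
$$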


Note that λx|y is a function of y. Finally, the mean square error is obtained as:

$$
\mathrm{MSE} = E\big[ \lambda_{x|y} \big] = \int_{-\infty}^{\infty} \lambda_{x|y}\, p_Y(y)\, dy = \int_0^1 \frac{3y^4}{18}\, dy = \frac{1}{30} \qquad (3.26)
$$

Example 3.4 Suppose X and Y are related by the following joint density function:

$$
p_{X,Y}(x,y) = \begin{cases} 10x & 0 \le x \le y^2,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.27)
$$

To find the BLSE estimate for this problem we need to find the conditional density pX|Y (x | y). By integrating pX,Y (x,y) with respect to x we can find the marginal density for y:

$$
p_Y(y) = \int_0^{y^2} 10x\, dx = 5y^4, \quad 0 \le y \le 1. \qquad (3.28)
$$

Now we can use Bayes’ rule to obtain the conditional density:

$$
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{10x}{5y^4} = \frac{2x}{y^4}, \quad 0 \le x \le y^2 \qquad (3.29)
$$

The mean of the conditional density is now found as:

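$$
E[x \mid y] = \int_{-\infty}^{\infty} x\, p_{X|Y}(x \mid y)\, dx = \int_0^{y^2} \frac{2x^2}{y^4}\, dx = \left. \frac{2x^3}{3y^4} \right|_0^{y^2} = \frac{2}{3}\, y^2 \qquad (3.30)
$$

Thus $\hat{x}_{\mathrm{BLSE}}(y) = \frac{2}{3} y^2$. Note that this estimate is a nonlinear function of y in this case.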

Next let us find the conditional variance λx|y:

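$$
\begin{aligned}
\lambda_{x|y} &= E[x^2 \mid y] - E[x \mid y]^2 = \int_{-\infty}^{\infty} x^2\, p_{X|Y}(x \mid y)\, dx - E[x \mid y]^2 && (3.31)\\
&= \int_0^{y^2} \frac{2x^3}{y^4}\, dx - \frac{4}{9}\, y^4 = \frac{y^4}{2} - \frac{4y^4}{9} = \frac{y^4}{18} && (3.32)
\end{aligned}
$$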

Note that λx|y is a function of y. Finally, the mean square error is obtained as:

$$
\mathrm{MSE} = E\big[ \lambda_{x|y} \big] = \int_{-\infty}^{\infty} \lambda_{x|y}\, p_Y(y)\, dy = \int_0^1 \frac{y^4}{18}\, (5y^4)\, dy = \frac{5}{162} = 0.0309 \qquad (3.33)
$$
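The value in (3.33) can be sanity-checked by simulation. The following is a rough Monte Carlo sketch; the inverse-CDF sampling scheme, the sample size, and the seed are assumptions made for this illustration and are not part of the notes. It draws Y from pY (y) = 5y⁴ and X from pX|Y (x | y) = 2x/y⁴, then averages the squared error of the estimate (2/3)y².

```python
import random

random.seed(1)
n = 200_000
sq_err = 0.0
for _ in range(n):
    y = random.random() ** (1 / 5)         # inverse CDF of p_Y(y) = 5 y^4
    x = y ** 2 * random.random() ** 0.5    # inverse CDF of p_{X|Y}(x|y) = 2x / y^4
    sq_err += (x - (2 / 3) * y ** 2) ** 2  # squared error of the BLSE from (3.30)

mse_mc = sq_err / n
print(mse_mc, 5 / 162)
```

The simulated value should agree with 5/162 to within Monte Carlo fluctuation.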

Example 3.5 (Scalar Gaussian Case) Suppose we wish to estimate X by observing a related random variable Y , where the random variables X and Y are jointly Gaussian scalar random variables. Thus the random variables X and Y are completely characterized by their joint Gaussian distribution:

$$
\begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} m_x \\ m_y \end{bmatrix}, \begin{bmatrix} \lambda_x & \lambda_{xy} \\ \lambda_{yx} & \lambda_y \end{bmatrix} \right) \qquad (3.34)
$$

Again, note that from this characterization a prior model for X could be found as well as the relationship between Y and X provided by the density pY |X(y | x).

To find the BLSE estimate for this problem we again need to find the conditional density pX|Y (x | y), from which we can find E[x | y] and λx|y. We proceed by manipulating the expression for the conditional density until we have it in a Gaussian form, then simply read off the mean and variance of this Gaussian.

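$$
\begin{aligned}
p_{X|Y}(x \mid y) &= \frac{p_{X,Y}(x,y)}{p_Y(y)} && (3.35)\\
&\propto p_{X,Y}(x,y) && (3.36)\\
&\propto \exp\left\{ -\frac{1}{2} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix}^T \begin{bmatrix} \lambda_x & \lambda_{xy} \\ \lambda_{yx} & \lambda_y \end{bmatrix}^{-1} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix} \right\} && (3.37)\\
&\propto \exp\left\{ -\frac{1}{2}\, \frac{\left[ x - \left( m_x + \frac{\lambda_{xy}}{\lambda_y} (y - m_y) \right) \right]^2}{\lambda_x - \frac{\lambda_{xy}^2}{\lambda_y}} \right\} && (3.38)
\end{aligned}
$$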


First we use Bayes rule to express the conditional density in terms of the joint density, scaled by pY (y). For a given y this is just a normalization. Thus we concentrate on just the exponential component of the density. The last line shows that the form of this density is again a Gaussian, with mean given by the numerator term in parentheses and variance given by the denominator term. We directly obtain the BLSE and the conditional variance as:

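$$
\hat{x}_{\mathrm{BLSE}}(y) = E[x \mid y] = m_x + \frac{\lambda_{xy}}{\lambda_y} (y - m_y) \qquad (3.39)
$$

$$
\lambda_{x|y} = \lambda_x - \frac{\lambda_{xy}^2}{\lambda_y} \qquad (3.40)
$$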

Let us make some interesting observations. First, note that the estimate is a linear function of the observation in the Gaussian case! This is not true in general. In addition, the conditional variance λx|y is independent of y for the Gaussian case. Thus, for the Gaussian case the MSE is the same as the conditional variance (and the error variance): MSE = E[λx|y] = λx|y. These are yet other ways in which Gaussians are special.

The structure of the BLSE for the Gaussian case has an intuitively pleasing form. For example, if the cross-covariance λxy between X and Y is zero, then X and Y are independent since they are jointly Gaussian. In this case, note that the BLSE reduces to the prior mean: x̂BLSE(y) = mx, which is independent of y. In this case, observations do not help us. Similarly, as the observations become more variable, i.e. as λy → ∞ with λxy fixed, we have that x̂BLSE(y) → mx. In other words, we ignore the data. Conversely, as the statistical tie between X and Y becomes larger (i.e. as λxy increases relative to λy) more weight is placed on the observation relative to the prior mean. Finally, note that the conditional variance (which is also the MSE and the error variance for this case) is reduced relative to the prior variance λx. In particular, we can write λBLSE = λx|y ≤ λx.
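These limiting behaviors are easy to see numerically. The sketch below simply evaluates (3.39) and (3.40); the function name `scalar_gaussian_blse` and all numerical values are hypothetical and chosen only for illustration.

```python
def scalar_gaussian_blse(y, mx, my, lam_x, lam_y, lam_xy):
    """Return (estimate, conditional variance) from (3.39) and (3.40)."""
    gain = lam_xy / lam_y
    return mx + gain * (y - my), lam_x - lam_xy ** 2 / lam_y


# Hypothetical numbers chosen only for illustration.
est, var = scalar_gaussian_blse(3.0, mx=1.0, my=2.0, lam_x=4.0, lam_y=5.0, lam_xy=3.0)

# Zero cross-covariance: the estimate falls back to the prior mean.
est0, var0 = scalar_gaussian_blse(3.0, mx=1.0, my=2.0, lam_x=4.0, lam_y=5.0, lam_xy=0.0)

# Very noisy observation (huge lam_y): the data are effectively ignored.
est_noisy, _ = scalar_gaussian_blse(3.0, mx=1.0, my=2.0, lam_x=4.0, lam_y=1e9, lam_xy=3.0)

print(est, var, est0, var0, est_noisy)
```

In each case the conditional variance never exceeds the prior variance, consistent with λx|y ≤ λx.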

Example 3.6 (Vector Gaussian Case) Now let us examine the vector Gaussian case. In particular, suppose that X and Y are jointly Gaussian random vectors with mean vectors mx, my and covariance matrices Λx, Λy respectively and cross-covariance matrix Λxy. Again, we can manipulate the joint density to obtain an expression for the conditional density, and from this density we can find the conditional mean and covariance needed to find the BLSE. Using Bayes’ rule the conditional density is given by:

$$
p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \qquad (3.41)
$$

Assume that x is n-dimensional, and y is m-dimensional. In this case, the conditional probability density becomes:

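$$
p_{X|Y}(x \mid y) = (\sqrt{2\pi})^{-n}\, \frac{\sqrt{\det \Lambda_y}}{\sqrt{\det \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}}}\; \frac{\exp\left\{ -\frac{1}{2} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix}^T \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}^{-1} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix} \right\}}{\exp\left\{ -\frac{1}{2} (y - m_y)^T \Lambda_y^{-1} (y - m_y) \right\}} \qquad (3.42)
$$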

The constant in front of the exponential ratio can be ignored, since it is merely a normalization factor to ensure that the resulting density integrates to 1. The important term to focus on is the ratio of the exponentials. In order to understand this ratio, one needs a formula for the inverse of the joint covariance of $[x, y]^T$, which we derive next. The following matrix identity will prove useful:

$$
\begin{bmatrix} I & -\Lambda_{xy} \Lambda_y^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix} = \begin{bmatrix} \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T & 0 \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix} \qquad (3.43)
$$

By inverting the block triangular matrices we obtain the formula:

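$$
\begin{aligned}
\begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}^{-1} &= \begin{bmatrix} \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T & 0 \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}^{-1} \begin{bmatrix} I & -\Lambda_{xy} \Lambda_y^{-1} \\ 0 & I \end{bmatrix} && (3.44)\\
&= \begin{bmatrix} \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} & 0 \\ -\Lambda_y^{-1} \Lambda_{xy}^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} & \Lambda_y^{-1} \end{bmatrix} \begin{bmatrix} I & -\Lambda_{xy} \Lambda_y^{-1} \\ 0 & I \end{bmatrix} && (3.45)
\end{aligned}
$$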

Thus, the desired inverse is given by:

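$$
\begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}^{-1} = \begin{bmatrix} \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} & -\left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} \\ -\Lambda_y^{-1} \Lambda_{xy}^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} & \Lambda_y^{-1} + \Lambda_y^{-1} \Lambda_{xy}^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} \end{bmatrix} \qquad (3.46)
$$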
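The block inverse formula (3.46) can be verified numerically against a general-purpose matrix inverse. The following is a small sketch using NumPy; the dimensions, the random positive definite joint covariance, and all variable names are assumptions made for this check only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

# Random symmetric positive definite joint covariance (illustrative only).
A = rng.standard_normal((n + m, n + m))
joint = A @ A.T + (n + m) * np.eye(n + m)
Lx, Lxy, Ly = joint[:n, :n], joint[:n, n:], joint[n:, n:]

Ly_inv = np.linalg.inv(Ly)
S_inv = np.linalg.inv(Lx - Lxy @ Ly_inv @ Lxy.T)  # inverse of the upper-left block in (3.43)

# Assemble the four blocks of (3.46).
block_inv = np.block([
    [S_inv, -S_inv @ Lxy @ Ly_inv],
    [-Ly_inv @ Lxy.T @ S_inv, Ly_inv + Ly_inv @ Lxy.T @ S_inv @ Lxy @ Ly_inv],
])

max_err = np.max(np.abs(block_inv - np.linalg.inv(joint)))
print(max_err)
```

The printed discrepancy should be at the level of floating point rounding.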


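With the above inverse, one can now compute the exponent of the exponential fraction in (3.42), as

$$
\begin{aligned}
&\frac{\exp\left\{ -\frac{1}{2} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix}^T \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{xy}^T & \Lambda_y \end{bmatrix}^{-1} \begin{bmatrix} x - m_x \\ y - m_y \end{bmatrix} \right\}}{\exp\left\{ -\frac{1}{2} (y - m_y)^T \Lambda_y^{-1} (y - m_y) \right\}} && (3.47)\\
&= \exp\Big\{ \tfrac{1}{2} (y - m_y)^T \Lambda_y^{-1} (y - m_y) - \tfrac{1}{2} (x - m_x)^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} (x - m_x) \\
&\qquad\quad + (x - m_x)^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \\
&\qquad\quad - \tfrac{1}{2} (y - m_y)^T \left[ \Lambda_y^{-1} + \Lambda_y^{-1} \Lambda_{xy}^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} \right] (y - m_y) \Big\} && (3.48)\\
&= \exp\Big\{ -\tfrac{1}{2} (x - m_x)^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} (x - m_x) + (x - m_x)^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \\
&\qquad\quad - \tfrac{1}{2} (y - m_y)^T \Lambda_y^{-1} \Lambda_{xy}^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \Big\} && (3.49)\\
&= \exp\Big\{ -\tfrac{1}{2} \left( x - m_x - \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \right)^T \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right)^{-1} \left( x - m_x - \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \right) \Big\} && (3.50)
\end{aligned}
$$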

Thus the conditional density of x given y is again Gaussian (which it must be, since they are jointly Gaussian), with an exponent given by (3.50). Now we know this conditional Gaussian distribution must be of the following general form:

$$
p_{X|Y}(x \mid y) \propto \exp\left\{ -\tfrac{1}{2} (x - E[x \mid y])^T \Lambda_{x|y}^{-1} (x - E[x \mid y]) \right\} \qquad (3.51)
$$

By identifying similar terms between (3.50) and (3.51) we immediately find that:

$$
E\big[ x \mid y \big] = m_x + \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \qquad (3.52)
$$

$$
\Lambda_{x|y} = \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \qquad (3.53)
$$

The first of these provides the BLSE for the vector Gaussian problem:

$$
\hat{x}_{\mathrm{BLSE}}(y) = E\big[ x \mid y \big] = m_x + \Lambda_{xy} \Lambda_y^{-1} (y - m_y) \qquad (3.54)
$$

Note the similarity between this expression and the BLSE for the scalar Gaussian case given in (3.39). Similarly, the conditional covariance is given in (3.53), which does not depend on y! Again, for general distributions the conditional covariance will depend on the value observed, but for jointly Gaussian random vectors, this covariance is constant! As a result, for the vector Gaussian case the MSE is simply the trace of the conditional covariance:

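$$
\begin{aligned}
\mathrm{MSE} &= \mathrm{tr}\big[ \Lambda_{x|y} \big] && (3.55)\\
&= \mathrm{tr}\big[ \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \big] && (3.56)
\end{aligned}
$$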

The above formulas provide explicit equations for the BLSE estimator (which, recall is the MMSE estimator) and associated error for the case of jointly Gaussian random vectors. Note again that the estimator is a linear function of y.
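The formulas (3.53) to (3.55) translate directly into a few lines of linear algebra. The following is a minimal sketch using NumPy; the function name `gaussian_blse` and the numerical values (a two-dimensional x and a scalar y) are hypothetical and serve only as an illustration.

```python
import numpy as np


def gaussian_blse(y, mx, my, Lx, Ly, Lxy):
    """Conditional mean (3.54) and conditional covariance (3.53)."""
    # gain = Lxy @ inv(Ly), computed with a solve; valid because Ly is symmetric.
    gain = np.linalg.solve(Ly, Lxy.T).T
    return mx + gain @ (y - my), Lx - gain @ Lxy.T


# Hypothetical problem: two-dimensional x, scalar y.
mx = np.array([1.0, -1.0])
my = np.array([0.5])
Lx = np.array([[2.0, 0.3], [0.3, 1.0]])
Ly = np.array([[1.5]])
Lxy = np.array([[0.8], [0.2]])

x_hat, L_cond = gaussian_blse(np.array([2.0]), mx, my, Lx, Ly, Lxy)
mse = np.trace(L_cond)  # (3.55)
print(x_hat, mse, np.trace(Lx))
```

Note that `L_cond` does not involve the observed value at all, so the MSE can be computed before any data are collected.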

Let us close by summarizing the properties of BLSE estimates:

• The BLSE estimate is the conditional mean E[x | y].

• The BLSE estimate is always unbiased.

• The BLSE estimate is always the MMSE estimate.

• In general, the BLSE is a nonlinear function of the data.

• For jointly Gaussian problems only, the BLSE estimate is linear and the conditional variance is independent of the data.

• In general, finding the BLSE estimate requires finding the conditional density, and thus is challenging.


3.5 The Orthogonality Principle for Least Squares Estimation

As discussed in Section 2.7, the space of random variables with finite second moments on a probability space (Ω,F,P) forms a Hilbert space, which is a linear space that has an inner product. We denote this Hilbert space as L2(Ω,F,P), using notation from functional analysis to indicate that the random variables have finite first and second moments. This space has an inner product, defined as $\langle X, Y \rangle = E[XY]$, and a norm $\|X\| = \sqrt{\langle X, X \rangle} = \sqrt{E[X^2]}$.

Consider a random vector Y , of random variables in L2(Ω,F,P) as a set of observations. Define the space

$$
\mathcal{V} = \{ f(Y) \text{ for any function } f(\cdot) \text{ such that } E[f(Y)^2] < \infty \}
$$

Note the following:

• V ⊂ L2(Ω,F,P), as every random variable generated in V has finite second moment.

• V is a subspace, which contains the zero random variable, and linear combinations of elements in V are also in V.

• V is a closed subspace, in that, if V1,V2, . . . is a sequence of elements of V, and limn→∞ Vn = V in the mean-square sense, then V ∈V.

Note also the following: any estimator of X based on Y that has finite second moment is an element of V. Hence, we can pose the problem of Bayes Least Squares Estimation as follows: Find the element X∗ ∈V such that

‖X −X∗‖2 = min Z∈V ‖X −Z‖2

The solution has a very appealing geometric interpretation. The closest element will be the element where the error is orthogonal to the linear subspace V. This element will be the perpendicular projection of X onto the subspace V. Mathematically, this means

E[(X −X∗)g(Y )] = 0 for any function g() so that E[g(Y )2] < ∞

Using this characterization, it is easy to derive the form for the Bayes Least Square estimate, as

E[(X −X∗)g(Y )] = E[E[(X −X∗)g(Y )|Y ]] = E[E[(X −X∗)|Y ]g(Y )] = 0

for any function g() that maps into V. In particular, one such function could be g(Y ) = E[(X − X∗)|Y ], so the equality implies that E[(X − X∗)|Y ] = 0. Recalling that X∗ is a function of Y , this means that X∗ = E[X | Y ], the solution we obtained before.

Figure 3.4: Illustration of the projection theorem for BLSE (x̂BLSE is the projection of x onto the space of functions of y).

We also get a nice expression for the minimum mean square error E[(X −X∗)2] using the orthogonality principle:

E[X2] = E[X∗2] + E[(X −X∗)2] ⇒ E[e2] = E[X2] −E[X∗2]
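Both the orthogonality condition and the resulting energy decomposition can be checked by simulation for the density of Example 3.3, where X∗ = E[X | Y ] = (2/3)Y. The following is a rough Monte Carlo sketch; the inverse-CDF sampler, the seed, and the test function g(Y ) = Y² are arbitrary choices made for this illustration.

```python
import random

random.seed(2)
n = 200_000
ortho = ex2 = exs2 = ee2 = 0.0
for _ in range(n):
    y = random.random() ** (1 / 3)  # inverse CDF of p_Y(y) = 3 y^2
    x = y * random.random() ** 0.5  # inverse CDF of p_{X|Y}(x|y) = 2x / y^2
    x_star = 2 * y / 3              # E[X | Y = y] from (3.23)
    ortho += (x - x_star) * y ** 2  # test function g(Y) = Y^2 (arbitrary choice)
    ex2 += x * x
    exs2 += x_star ** 2
    ee2 += (x - x_star) ** 2

ortho, ex2, exs2, ee2 = (s / n for s in (ortho, ex2, exs2, ee2))
print(ortho, ex2, exs2 + ee2, ee2)
```

The first printed value should be near zero, the second and third should nearly coincide, and the last should be near the MSE of 1/30 from (3.26).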

Interpreting Bayesian Least Squares estimation as an orthogonal projection has many interesting implications. For instance, if V1 were a closed linear subspace of V (such as functions that depend only on a subset of elements of the vector Y ), then the projection onto V1 is the projection onto V subsequently projected onto V1. This is a version of the iterated expectation equality:

E[X | V1] = E[E[X | V] | V1]

There are other nice properties of projections that we will exploit subsequently.

The above exposition focused on estimating a random variable X given a vector of observations Y . When we are estimating a vector of random variables X of dimension n, we are looking for a vector X∗ of elements of V such that

$$
E\big[ (X - X^*)^T (X - X^*) \big] = \sum_{k=1}^{n} E\big[ (X_k - X_k^*)^2 \big] = \min_{Z \in \mathcal{V}^n} \sum_{k=1}^{n} E\big[ (X_k - Z_k)^2 \big]
$$

Thus, it is like estimating each element of X individually, or projecting each Xi orthogonally onto V. The orthogonality conditions can now be written as

E[(X −X∗)g(Y )] = 0 for any function g() so that E[g(Y )2] < ∞

and the optimal solution as X∗ = E[X | Y ]. The resulting minimum mean square error is

$$
\sum_{i=1}^{n} E\big[ (X_i - X_i^*)^2 \big] = \sum_{i=1}^{n} \left( E[X_i^2] - E[X_i^{*2}] \right) = E[X^T X] - E[X^{*T} X^*]
$$

3.6 Bayes Maximum A Posteriori (MAP) Estimation

In this section we will examine Bayes’ estimation with a different choice of cost function. In particular, we now focus on the “uniform cost” function given by:

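$$
J_{\mathrm{MAP}}(\hat{x},x) = \begin{cases} 1 & |x_i - \hat{x}_i| \ge \epsilon \text{ for some } i \\ 0 & |x_i - \hat{x}_i| < \epsilon \text{ for all } i \end{cases} \qquad (3.57)
$$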

This cost function is depicted in Figure 3.5 as a function of a scalar error e = x − x̂. Note that this cost function treats all errors as equally bad, no matter how large. It is reminiscent of the “0-1” cost structure we saw in our study of detection. You might expect that we will obtain similar estimates, and you will not be disappointed.

Figure 3.5: Uniform or MAP cost function J(e).

Now let us find the Bayesian estimator corresponding to the uniform cost function. As before we start with the general Bayes’ estimator definition (3.8) and go from there:

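$$
\begin{aligned}
\hat{x}_{\mathrm{MAP}}(y) &= \arg\min_{\hat{x}} \int J_{\mathrm{MAP}}(\hat{x},x)\, p_{X|Y}(x \mid y)\, dx && (3.58)\\
&= \arg\min_{\hat{x}} \int_{\{x \,\mid\, \max_k |x_k - \hat{x}_k| \ge \epsilon\}} p_{X|Y}(x \mid y)\, dx && (3.59)\\
&= \arg\min_{\hat{x}} \left[ 1 - \int_{\{x \,\mid\, |x_k - \hat{x}_k| < \epsilon\ \forall k\}} p_{X|Y}(x \mid y)\, dx \right] && (3.60)
\end{aligned}
$$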

Now the integral in (3.60) is over an infinitesimally small “gap” around x̂, so that the overall expression is minimized by placing the gap centered at x̂ at the maximum of the conditional density. The geometric situation is depicted in Figure 3.6.

Figure 3.6: Illustration of geometry behind MAP estimate derivation (showing the cost J(x̂,x), the conditional density pX|Y (x | y), and the shaded conditional expected cost E[J | y]).

The shaded area depicts the right hand side in (3.58). Note that this is given by the area of the conditional density minus the area centered around x̂. We minimize the shaded area by placing the “gap” at the peak of the conditional density. This observation yields:

$$
\hat{x}_{\mathrm{MAP}}(y) = \arg\max_{x}\, p_{X|Y}(x \mid y) \qquad (3.61)
$$

Thus we have the result that the optimal Bayes’ estimate corresponding to the uniform cost structure in (3.57) is the value of x that maximizes the conditional density pX|Y (x | y). Since this density can be thought of as the density obtained for x after having observed Y = y, this conditional density is often referred to as the “posterior density,” and for this reason the corresponding estimate is referred to as the “Maximum A Posteriori” or MAP estimate. Note that whereas the BLSE estimate was the mean of pX|Y (x | y), the MAP estimate is the peak or “mode” of this density. Evidently the conditional density pX|Y (x | y) plays a key role in both estimators. Note that in general the mean and mode of a density can be different, so the BLSE and MAP estimates will be different in general.

The definition of the MAP estimate given in (3.61) is the fundamental one. For differentiable densities we can characterize the potential maxima of the density as the locations where the derivative is zero and the second derivative is negative. This approach leads to a common characterization of the MAP estimate we will derive next. First, using Bayes’ rule and the monotonic properties of the natural logarithm we can rewrite the MAP estimate as follows:

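$$
\begin{aligned}
\hat{x}_{\mathrm{MAP}}(y) &= \arg\max_{x}\, p_{X|Y}(x \mid y) = \arg\max_{x}\, \frac{p_{X,Y}(x,y)}{p_Y(y)} = \arg\max_{x}\, \frac{p_{Y|X}(y \mid x)\, p_X(x)}{p_Y(y)} && (3.62)\\
&= \arg\max_{x}\, p_{Y|X}(y \mid x)\, p_X(x) = \arg\max_{x}\, \ln\big[ p_{Y|X}(y \mid x)\, p_X(x) \big] && (3.63)\\
&= \arg\max_{x}\, \Big( \ln\big[ p_{Y|X}(y \mid x) \big] + \ln\big[ p_X(x) \big] \Big) && (3.64)
\end{aligned}
$$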

Now the MAP estimate is obtained as the maximum of the expression in parentheses in (3.64). A necessary condition for the maximum is that the derivative of this expression with respect to x be zero. Thus the MAP estimate must satisfy the following equation (if it exists!):

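$$
\left. \left( \frac{\partial \ln\big[ p_{Y|X}(y \mid x) \big]}{\partial x} + \frac{\partial \ln\big[ p_X(x) \big]}{\partial x} \right) \right|_{x = \hat{x}_{\mathrm{MAP}}(y)} = 0 \qquad (3.65)
$$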

This equation is sometimes referred to as the “MAP equation.”

Before proceeding to some examples, note that, unlike the BLSE estimator, there are no particularly nice general expressions for either the bias or the variance of the MAP estimator. In particular, the general MAP estimator can be biased and will not be the MMSE estimator. So why do MAP estimation? One reason is that the structure of the estimator as expressed in (3.63) rationally and, some would argue, naturally combines both an observation model (the term pY |X(y | x)) and a prior model pX(x). Another reason is that often maximizing a function is easier than averaging, which requires weighted integration of some sort. This idea of finding a solution by maximizing a function appears throughout engineering and outside of stochastic concerns. Let us look at some examples of MAP estimation.

Example 3.7 In this example let us revisit the problem of Example 3.2, but this time seek the MAP estimate. We again need the conditional density pX|Y (x | y), which is already given in Figure 3.3. Now for the MAP estimate we seek the value of x at which this density is maximum. Inspection of Figure 3.3 will show that the maximum is found for a whole range of values of x! Thus for this case the MAP estimate is not unique and x̂MAP (y) is any x in the interval [−(1 −|y|), (1 −|y|)]. Note that any x in this range will produce the same value for the expected risk or cost. While this is true, also note that different choices of this value will have different MSEs in general. For example, the BLSE is one consistent choice of x, which will then have minimum MSE. Suppose instead for x̂MAP (y) that we always choose the right end of the interval so that x̂MAP (y) = (1 −|y|). The MSE for this latter choice is:

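$$
\begin{aligned}
\mathrm{MSE} &= E\big[ (x - \hat{x}_{\mathrm{MAP}}(y))^2 \big] = \int\!\!\int_{-\infty}^{\infty} \big( x - (1 - |y|) \big)^2\, p_{X,Y}(x,y)\, dx\, dy && (3.66)\\
&= \int_0^1 \int_{-1+y}^{1-y} \big( x - (1 - y) \big)^2\, \frac{1}{2}\, dx\, dy + \int_{-1}^0 \int_{-1-y}^{1+y} \big( x - (1 + y) \big)^2\, \frac{1}{2}\, dx\, dy && (3.67)\\
&= \frac{2}{3} && (3.68)
\end{aligned}
$$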

Note that this MSE is larger than the MSE of 1/6 we found for the BLSE estimator.
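Both MSE values for the diamond density can be approximated by direct numerical integration. The sketch below uses a plain midpoint rule on a square grid; the function name `mse_on_diamond` and the grid size are assumptions made for this illustration, and the result is only accurate to a couple of decimal places because the grid cuts the diamond boundary coarsely.

```python
def mse_on_diamond(estimator, steps=401):
    """Midpoint-rule approximation of E[(X - estimator(Y))^2] for the
    uniform density p(x, y) = 1/2 on the diamond |x| + |y| <= 1."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        y = -1.0 + (i + 0.5) * h
        for j in range(steps):
            x = -1.0 + (j + 0.5) * h
            if abs(x) + abs(y) <= 1.0:
                total += (x - estimator(y)) ** 2 * 0.5 * h * h
    return total


mse_blse = mse_on_diamond(lambda y: 0.0)                # conditional mean
mse_right_end = mse_on_diamond(lambda y: 1.0 - abs(y))  # right end of the MAP interval
print(mse_blse, mse_right_end)
```

The two numbers should be close to 1/6 and 2/3 respectively, confirming that this particular choice among the non-unique MAP estimates costs a factor of four in MSE.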

Example 3.8 For this example let us revisit the problem of Example 3.3, but again seek the MAP estimate. We have already calculated the conditional density in (3.22), which is given by:

$$
p_{X|Y}(x \mid y) = \frac{2x}{y^2}, \quad 0 \le x \le y \qquad (3.69)
$$

The maximum of this conditional density occurs at x = y. Therefore:

x̂MAP(y) = y (3.70)

Note that in this case the MAP estimate is unique and it is different from the BLSE estimate. Now we know the BLSE is an unbiased estimator. What about the MAP estimate? The bias for this example can be found as:

$$
\begin{aligned}
b &= E\big[ x - \hat{x}_{\mathrm{MAP}}(y) \big] = E[x] - E[y] && (3.71)\\
&= \int_{-\infty}^{\infty} x\, p_X(x)\, dx - \int_{-\infty}^{\infty} y\, p_Y(y)\, dy = \int_0^1 x\, 6x(1 - x)\, dx - \int_0^1 y\, 3y^2\, dy = \frac{1}{2} - \frac{3}{4} = -\frac{1}{4} && (3.72)
\end{aligned}
$$

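Thus the bias is not necessarily 0 for the MAP estimate. We can again show that the mean square error is higher than that for the BLSE estimator by direct calculation:

$$
\begin{aligned}
\mathrm{MSE} &= E\big[ (x - \hat{x}_{\mathrm{MAP}}(y))^2 \big] = \int\!\!\int_{-\infty}^{\infty} (x - \hat{x}_{\mathrm{MAP}}(y))^2\, p_{X,Y}(x,y)\, dx\, dy && (3.73)\\
&= \int_0^1 \int_0^y (x - y)^2\, 6x\, dx\, dy = \int_0^1 \frac{y^4}{2}\, dy && (3.74)\\
&= \frac{1}{10} && (3.75)
\end{aligned}
$$

which is greater than the MSE of 1/30 associated with the BLSE estimate which we found in (3.26).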
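The bias and MSE comparison for this example can be checked by simulation. The following is a rough Monte Carlo sketch; the inverse-CDF sampler, the sample size, and the seed are assumptions made for this illustration. It draws Y from pY (y) = 3y² and X from pX|Y (x | y) = 2x/y², then compares the MAP estimate y with the BLSE (2/3)y. The MAP error equals the BLSE error minus y/3, so by orthogonality its MSE should be about 1/30 + E[y²]/9.

```python
import random

random.seed(3)
n = 200_000
bias_map = mse_map = mse_blse = 0.0
for _ in range(n):
    y = random.random() ** (1 / 3)    # inverse CDF of p_Y(y) = 3 y^2
    x = y * random.random() ** 0.5    # inverse CDF of p_{X|Y}(x|y) = 2x / y^2
    bias_map += x - y                 # error of the MAP estimate xhat = y
    mse_map += (x - y) ** 2
    mse_blse += (x - 2 * y / 3) ** 2  # error of the BLSE from (3.23)

bias_map, mse_map, mse_blse = bias_map / n, mse_map / n, mse_blse / n
print(bias_map, mse_map, mse_blse)
```

The simulated MAP bias should be near −1/4, and the simulated BLSE mean square error should be near 1/30 and clearly below that of the MAP estimate.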

Example 3.9 For this example let us revisit the problem of Example 3.4, but again seek the MAP estimate. We have already calculated the conditional density in (3.29), which is given by:

$$
p_{X|Y}(x \mid y) = \frac{2x}{y^4}, \quad 0 \le x \le y^2 \qquad (3.76)
$$

The maximum of this conditional density occurs at the endpoint of the interval x = y². Therefore:

$$
\hat{x}_{\mathrm{MAP}}(y) = y^2 \qquad (3.77)
$$

Note that in this case the MAP estimate is unique and it is different from the BLSE estimate. Also note that since the maximum is at the endpoint of the interval, equation (3.65) cannot be used to find the estimate in this case.

We can again show that the mean square error is higher than that for the BLSE estimator by direct calculation:

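$$
\begin{aligned}
\mathrm{MSE} &= E\big[ (x - \hat{x}_{\mathrm{MAP}}(y))^2 \big] = \int\!\!\int_{-\infty}^{\infty} (x - \hat{x}_{\mathrm{MAP}}(y))^2\, p_{X,Y}(x,y)\, dx\, dy && (3.78)\\
&= \int_0^1 \int_0^{y^2} \big( x - y^2 \big)^2\, 10x\, dx\, dy && (3.79)\\
&= \frac{5}{54} = 0.0926 && (3.80)
\end{aligned}
$$

which is greater than the MSE associated with the BLSE estimate which we found in (3.33).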
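Because every integrand in this example is a polynomial, the integrals in (3.79) and (3.33) can be evaluated exactly with rational arithmetic. The sketch below carries out the term-by-term integration by hand in the comments and uses Python's `fractions` module only for the bookkeeping; the variable names are chosen for this illustration.

```python
from fractions import Fraction

# Inner integral over x of (x - a)^2 * 10 x from 0 to a, with a = y^2:
#   10 * (a^4 / 4 - 2 a^4 / 3 + a^4 / 2) = coeff * a^4 = coeff * y^8
coeff = 10 * (Fraction(1, 4) - Fraction(2, 3) + Fraction(1, 2))

# Outer integral of coeff * y^8 over [0, 1] contributes a factor 1/9.
mse_map = coeff * Fraction(1, 9)

# BLSE value from (3.33): integral of (y^4 / 18) * 5 y^4 over [0, 1].
mse_blse = Fraction(5, 18) * Fraction(1, 9)

print(mse_map, mse_blse, mse_map / mse_blse)
```

The exact ratio shows how much worse the MAP estimate is than the BLSE in the mean square sense for this density.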

Example 3.10 (Gaussian Case) Let us now examine the problems of Examples 3.5 and 3.6 with regard to the MAP estimate. Recall that the conditional density for a jointly Gaussian problem is again a Gaussian density, and in the scalar case, for example, is proportional to the expression given in (3.38). Being a Gaussian, the conditional density has a single maximum which occurs at the same place as its mean. Therefore in the case of jointly Gaussian densities the MAP and BLSE estimates are identical with identical MSE!

$$
\hat{x}_{\mathrm{MAP}}(y) = \hat{x}_{\mathrm{BLSE}}(y) = m_x + \frac{\lambda_{xy}}{\lambda_y} (y - m_y) \qquad (3.81)
$$

More generally, with a little thought we can see that the BLSE and MAP estimates will be the same whenever pX|Y (x | y) is symmetric and unimodal (i.e. has a single maximum), since in these cases the mean and maximum of the density are one and the same.

Example 3.11 (Gaussian Problems with Linear Observations) Let us now focus on a particularly important problem: MAP estimation for problems with linear observations, Gaussian densities, and vector state. In particular, assume we have the following general problem, wherein our noisy vector observation y is linearly related to our quantity of interest x, which itself is Gaussian:

$$
\begin{aligned}
y &= Cx + w, \quad w \sim N(0,R) && (3.82)\\
x &\sim N(0,Q) && (3.83)
\end{aligned}
$$

where y = [y1, · · · ,yN ]T , x = [x1, · · · ,xN ]T , w = [w1, · · · ,wN ]T , R is the covariance matrix of the observation noise, w is independent of x, and Q is the covariance matrix of x.

In this problem we can see with a bit of thought that x and y will be jointly Gaussian random vectors, so the results of Example 3.10 apply and we know the MAP estimate will be given by the conditional mean. Using the formulas (3.52) and (3.53) we obtain:

$$
\hat{x}_{\mathrm{MAP}} = QC^T \left( CQC^T + R \right)^{-1} y
$$

$$
\Lambda_{\mathrm{MAP}} = \Lambda_{x|y} = Q - QC^T \left( CQC^T + R \right)^{-1} CQ
$$

While this result is certainly correct, we may also derive an alternative expression for the MAP estimate based directly on the MAP equation. This alternative result is widely used, so worth deriving. To this end note that pY |X(y|x) = N(y; Cx,R), since w is independent of x, thus:

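$$
\begin{aligned}
\hat{x}_{\mathrm{MAP}} &= \arg\max_{x}\, p_{Y|X}(y \mid x)\, p_X(x) = \arg\max_{x}\, \ln[p_{Y|X}(y \mid x)] + \ln[p_X(x)] && (3.84)\\
&= \arg\max_{x}\, -(y - Cx)^T R^{-1} (y - Cx) - x^T Q^{-1} x && (3.85)\\
&= \arg\min_{x}\, \| y - Cx \|^2_{R^{-1}} + \| x \|^2_{Q^{-1}} && (3.86)\\
&= \arg\min_{x}\, y^T R^{-1} y - 2x^T C^T R^{-1} y + x^T C^T R^{-1} C x + x^T Q^{-1} x && (3.87)
\end{aligned}
$$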

In going from (3.84) to (3.85) we have simply inserted the densities in question and eliminated any constants not affecting the optimizations. In going from (3.85) to (3.86) we have simply switched from a maximization to a minimization by eliminating the leading minus sign and we have written the quadratic forms in (3.85) as weighted norms. Note that when the MAP problem is written in the form (3.86) we can easily see its relationship to least square minimization. The expression (3.87) is obtained by multiplying out the quadratic forms in (3.85) or (3.86).

Now, a necessary condition for our solution is that the derivative of (3.87) be zero at the MAP estimate:

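$$
\begin{aligned}
&\left. \frac{\partial}{\partial x} \Big[ y^T R^{-1} y - 2x^T C^T R^{-1} y + x^T C^T R^{-1} C x + x^T Q^{-1} x \Big] \right|_{\hat{x}_{\mathrm{MAP}}} = 0 && (3.88)\\
&\Longrightarrow\ -2C^T R^{-1} y + 2C^T R^{-1} C\, \hat{x}_{\mathrm{MAP}} + 2Q^{-1} \hat{x}_{\mathrm{MAP}} = 0 && (3.89)
\end{aligned}
$$

where in going from (3.88) to (3.89) we have made use of the rules of vector calculus. Finally, we obtain that the MAP estimate must satisfy the following set of so called normal equations:

$$
\left( C^T R^{-1} C + Q^{-1} \right) \hat{x}_{\mathrm{MAP}} = C^T R^{-1} y \qquad (3.90)
$$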

Note that since the problem is Gaussian, this is also the BLSE and the LLSE estimate.


Note that we have specified the MAP estimate implicitly as the solution of a set of linear equations. For very large problems (as arise, for example, in image processing), the computational cost of explicitly inverting the left hand side of (3.90), which is O(N³), is prohibitive. As a result, in these cases this set of equations is usually solved iteratively using a method such as conjugate gradient. For many such problems the left hand side of (3.90) is very sparse, which is well suited to such iterative techniques.
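For a small problem, the normal equations (3.90) can be solved directly and compared with the conditional mean form of the estimate. The following is a minimal sketch using NumPy; the dimensions and the randomly generated C, Q, R and y are hypothetical, and a dense direct solve stands in here for the iterative methods that would be used on large sparse problems.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y = 4, 6

# Hypothetical problem data (random C, diagonal Q and R) for illustration.
C = rng.standard_normal((n_y, n_x))
Q = np.diag(rng.uniform(0.5, 2.0, n_x))
R = np.diag(rng.uniform(0.1, 1.0, n_y))
y = rng.standard_normal(n_y)

R_inv, Q_inv = np.linalg.inv(R), np.linalg.inv(Q)

# Normal equations (3.90): solve the linear system rather than forming an inverse.
x_normal = np.linalg.solve(C.T @ R_inv @ C + Q_inv, C.T @ R_inv @ y)

# Conditional mean form from (3.52): Q C^T (C Q C^T + R)^{-1} y.
x_gain = Q @ C.T @ np.linalg.solve(C @ Q @ C.T + R, y)

gap = np.max(np.abs(x_normal - x_gain))
print(gap)
```

The two forms agree to rounding error. Which one is cheaper depends on the dimensions: the normal equations involve a system of the size of x, while the conditional mean form involves one of the size of y.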

We can also obtain an alternate expression for the estimation error covariance matrix ΛMAP associated with the MAP estimate.

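\begin{align*}
E&\left[ (x - \hat{x}_{MAP})(x - \hat{x}_{MAP})^T \right] \\
&= Q - E\left[ \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} (Cx + w)\, x^T \right] - E\left[ x \left( x^T C^T + w^T \right) R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \right] \\
&\quad + E\left[ \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} (Cx + w)(Cx + w)^T R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \right] \\
&= Q - \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} C Q - Q C^T R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&\quad + \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} \left[ C Q C^T + R \right] R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&= Q - \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} C Q - Q C^T R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&\quad + \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} C Q \left[ C^T R^{-1} C + Q^{-1} \right] \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&= Q - \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} C Q - Q C^T R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} + \left( C^T R^{-1} C + Q^{-1} \right)^{-1} C^T R^{-1} C Q \\
&= Q - Q C^T R^{-1} C \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&= \left[ Q \left( C^T R^{-1} C + Q^{-1} \right) - Q C^T R^{-1} C \right] \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&= \left[ Q C^T R^{-1} C + I - Q C^T R^{-1} C \right] \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \\
&= \left( C^T R^{-1} C + Q^{-1} \right)^{-1}
\end{align*}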

Thus we have that

$$\Lambda_{MAP} = \left( C^T R^{-1} C + Q^{-1} \right)^{-1} \tag{3.91}$$

The inverse of the error covariance for the MAP estimate is thus given by:

$$\Lambda_{MAP}^{-1} = \underbrace{C^T R^{-1} C}_{\text{Info in Obs}} + \underbrace{Q^{-1}}_{\text{Prior Info}} \tag{3.92}$$

The inverse of a covariance (i.e. the inverse of the variability) can reasonably be taken as a measure of information. Indeed, such inverse covariances are referred to as “information matrices.” We can thus see that the information in the estimate after observing data is composed of two parts, as indicated in (3.92): the prior information and the information in the observation. The interesting thing is that these two pieces of information simply add!
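The additivity of information can be illustrated numerically. The sketch below assumes, purely for illustration, that the noise covariance is diagonal, R = diag(r), so that each scalar observation row c_i contributes its own term c_i c_i^T / r_i; the sizes and numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 6
C = rng.standard_normal((m, n))          # rows c_i^T of the observation matrix
r = rng.uniform(0.5, 2.0, size=m)        # independent noise variances, R = diag(r)
Q = np.diag([4.0, 1.0, 0.25])            # prior covariance

info = np.linalg.inv(Q)                  # start from the prior information Q^-1
traces = [np.trace(np.linalg.inv(info))] # total error variance before any data
for c_i, r_i in zip(C, r):
    info = info + np.outer(c_i, c_i) / r_i        # add information of one observation
    traces.append(np.trace(np.linalg.inv(info)))  # total error variance so far

# Batch form of (3.92): C^T R^-1 C + Q^-1
info_batch = C.T @ np.diag(1.0 / r) @ C + np.linalg.inv(Q)
```

Accumulating the observations one at a time gives the same information matrix as the batch expression (3.92), and the trace of the error covariance never increases as observations are added.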

Example 3.12 (Gaussian Problems with Nonlinear Observations) Here we consider a case often arising in practice, that of a nonlinear observation model with additive Gaussian noise and a Gaussian prior model. In particular, suppose we have the following model for our observation y and unknown x:

\begin{align}
y &= H(x) + v, \qquad v \sim N(0, R) \tag{3.93} \\
x &\sim N(0, Q) \tag{3.94}
\end{align}

where we assume that the noise v is independent of x. Now pY |X(y | x) is Gaussian. In particular, we have that:

$$\ln\left[ p_{Y|X}(y \mid x) \right] = -\frac{1}{2} \left( y - H(x) \right)^T R^{-1} \left( y - H(x) \right) + \text{constant} \tag{3.95}$$

Thus:

$$\frac{\partial}{\partial x} \ln\left[ p_{Y|X}(y \mid x) \right] = y^T R^{-1} H_x(x) - H^T(x) R^{-1} H_x(x) \tag{3.96}$$


where $[H_x(x)]_{ij} = \frac{\partial H_i(x)}{\partial x_j}$ is the matrix of partial derivatives of the vector function $H(x)$ with respect to its arguments.

Continuing, pX(x) is also Gaussian, so that we have:

$$\frac{\partial}{\partial x} \ln\left[ p_X(x) \right] = -x^T Q^{-1} \tag{3.97}$$

Combining (3.96) and (3.97) using the MAP equation (3.65) we obtain an equation for the MAP estimate for this case:

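\begin{align}
y^T R^{-1} H_x(\hat{x}_{MAP}) - H^T(\hat{x}_{MAP}) R^{-1} H_x(\hat{x}_{MAP}) - \hat{x}_{MAP}^T Q^{-1} &= 0 \tag{3.98} \\
\implies H_x^T(\hat{x}_{MAP}) R^{-1} H(\hat{x}_{MAP}) + Q^{-1} \hat{x}_{MAP} &= H_x^T(\hat{x}_{MAP}) R^{-1} y \tag{3.99}
\end{align}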

In general this represents a set of nonlinear equations, since H depends on x. If the mapping happens to be linear, so that H(x) = Hx, this yields:

$$\left( H^T R^{-1} H + Q^{-1} \right) \hat{x}_{MAP} = H^T R^{-1} y \tag{3.100}$$

which is a set of linear equations for the MAP estimate. Compare this solution with that obtained for the LLSE estimate with a linear observation model in Section 3.8 (recall the LLSE estimate is the same as the MAP estimate under a Gaussian assumption).
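When H is nonlinear, (3.99) generally has to be solved numerically. The notes do not prescribe a method; the sketch below uses a Gauss-Newton iteration as one possible choice, on a hypothetical mildly nonlinear observation function H(x) = x + 0.1 sin(x). The function, the covariances, and the data vector are all assumptions made for illustration.

```python
import numpy as np

def H(x):
    # Hypothetical mildly nonlinear observation function (elementwise)
    return x + 0.1 * np.sin(x)

def H_x(x):
    # Jacobian [H_x]_ij = dH_i/dx_j of the function above
    return np.eye(x.size) + 0.1 * np.diag(np.cos(x))

R_inv = np.linalg.inv(np.array([[0.5, 0.1], [0.1, 0.4]]))   # assumed noise covariance R
Q_inv = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))   # assumed prior covariance Q
y = np.array([1.0, -0.5])                                   # assumed observation

x = np.zeros(2)                      # start at the prior mean
for _ in range(50):
    J = H_x(x)
    # Transpose of the left side of (3.98): gradient of the log posterior
    grad = J.T @ R_inv @ (y - H(x)) - Q_inv @ x
    # Gauss-Newton step: linearize H about the current x
    x = x + np.linalg.solve(J.T @ R_inv @ J + Q_inv, grad)

# Residual of (3.99) at the computed estimate
residual = H_x(x).T @ R_inv @ H(x) + Q_inv @ x - H_x(x).T @ R_inv @ y
```

Each step solves a linear system with the same structure as (3.100), with the Jacobian playing the role of H. Convergence is not guaranteed for strongly nonlinear H, and (3.99) is only a necessary condition, so a converged point need not be the global maximum.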

Implicit Gaussian Prior Models

It is sometimes convenient in MAP estimation with Gaussian models to specify our prior model for x implicitly rather than explicitly as in (3.83). Such a situation arises, for example, when the elements of x are related by a dynamic equation. Our problem statement in such cases can be taken to be:

\begin{align}
y &= Cx + w, \qquad w \sim N(0, R) \tag{3.101} \\
Lx &= v, \qquad v \sim N(0, Q_v) \tag{3.102}
\end{align}

where L is a matrix and v a zero-mean Gaussian process with covariance Qv. Note that our “standard” prior model for x, as given by its covariance matrix, is implicitly specified through (3.102). Assuming that L is invertible, it is a simple matter to obtain the explicit model of x given such an implicit model as:

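$$x \sim N\left( 0, L^{-1} Q_v L^{-T} \right) = N\left( 0, \left[ L^T Q_v^{-1} L \right]^{-1} \right) \tag{3.103}$$

Given this prior model and our solution in (3.90), we see that the MAP estimate for an implicit prior model must satisfy:

$$\left( C^T R^{-1} C + L^T Q_v^{-1} L \right) \hat{x}_{MAP} = C^T R^{-1} y \tag{3.104}$$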

Note, in particular, that the normal equations can be formed without the need of inverting L. The corresponding estimation error covariance is given by:

$$\Lambda_{MAP} = \left( C^T R^{-1} C + L^T Q_v^{-1} L \right)^{-1} \tag{3.105}$$

You may wonder why we would care about implicit specification of prior models as in (3.102). One case of interest arises when x is specified through an autoregressive model driven by white noise. For example, suppose the elements xi of x are specified as the output of the following AR model:

\begin{align}
x_n &= a x_{n-1} + v_n, \qquad v_n \sim N(0, 1) \tag{3.106} \\
x_0 &= v_0, \qquad v_0 \sim N(0, 1) \tag{3.107}
\end{align}

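Then this structure implies the following model for x:

$$\underbrace{\begin{bmatrix} 1 & 0 & \cdots & & 0 \\ -a & 1 & & & \\ & -a & 1 & & \\ & & \ddots & \ddots & \\ & & & -a & 1 \end{bmatrix}}_{L} \underbrace{\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \\ x_N \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{N-1} \\ v_N \end{bmatrix}}_{v} \tag{3.108}$$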


where v ∼ N(0,I). As can be seen, the matrix L captures the elements of the implicit AR model of the process. Indeed, it is a relatively simple matter, given an arbitrary AR model, to specify the elements of L. In particular, for a p-th order AR model L will have exactly p + 1 nonzero bands, consisting of ones on the diagonal and the (negated) AR coefficients along the p sub-diagonals.

Given the results in (3.103), and the fact that the vector v is white, it is now a simple matter to specify the associated covariance matrix for the entire 1st order AR process as:

$$Q = \left( L^T L \right)^{-1} = \begin{bmatrix} 1 + a^2 & -a & & & 0 \\ -a & 1 + a^2 & -a & & \\ & -a & 1 + a^2 & -a & \\ 0 & & \ddots & \ddots & \ddots \end{bmatrix}^{-1} \tag{3.109}$$

Finally, as noted above, in forming the normal equations (3.104), since all we use is Q−1, we need never explicitly invert L. In addition, the matrix LTL has a highly sparse and banded structure, which will be reflected in the structure of the normal equations. Such structure, which will be associated with any AR model, is typical of a host of problems appearing in science and engineering.
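The banded structure is easy to see numerically. The following sketch builds L for the first-order AR model (3.106)-(3.107) and forms the normal equations (3.104) without inverting L. The values a = 0.8 and N = 5, the choices C = I and R = 0.5 I, and the data vector are hypothetical.

```python
import numpy as np

a, N = 0.8, 5                                  # illustrative AR coefficient and length
L = np.eye(N + 1) - a * np.eye(N + 1, k=-1)    # ones on the diagonal, -a on the sub-diagonal

Q_inv = L.T @ L                                # tridiagonal information matrix (Q_v = I)
Q = np.linalg.inv(Q_inv)                       # explicit covariance, for checking only

# The same covariance obtained from x = L^-1 v with Cov(v) = I, as in (3.103)
L_inv = np.linalg.inv(L)
Q_direct = L_inv @ L_inv.T

# Normal equations (3.104) with the assumed C = I and R = sigma2 * I
sigma2 = 0.5
y = np.arange(N + 1, dtype=float)              # made-up observation vector
x_map = np.linalg.solve(np.eye(N + 1) / sigma2 + Q_inv, y / sigma2)

# Equivalent covariance-form estimate, which does require Q explicitly
x_cov_form = Q @ np.linalg.solve(Q + sigma2 * np.eye(N + 1), y)
```

In this finite-length construction the last diagonal entry of L^T L is 1 rather than 1 + a^2, since x_N appears in only one of the AR equations. For long signals one would store `Q_inv` in a banded or sparse format instead of as a dense array.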

Let us close this section by summarizing what we have learned about MAP estimates:

• The MAP estimate is the conditional mode: arg maxx pX|Y (x | y).

• The MAP estimate may be biased.

• The MAP estimate is not necessarily the MMSE estimate.

• The MAP estimate may not be unique.

• In general, the MAP estimate is a nonlinear function of the data.

• For jointly Gaussian problems, the MAP estimate is the same as the BLSE estimate, and in this case is a linear estimate and MMSE.

• In general, finding the MAP estimate requires finding the conditional density, and thus is challenging.

• More generally, when pX|Y (x | y) is symmetric and unimodal, the MAP and BLSE estimates coincide.

3.7 Bayes Absolute Error Estimation

Another choice of the cost function in the Bayes’ estimation approach is the absolute error cost, given by:

$$J_{MAE}(\hat{x}, x) = \sum_{k=1}^{n} \left| x_k - \hat{x}_k \right| \tag{3.110}$$

Now let us find the Bayesian estimator corresponding to this absolute error cost function. As always we start with the general Bayes’ estimator definition (3.8) and go from there:

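\begin{align}
\hat{x}_{MAE}(y) &= \arg\min_{\hat{x}} \int J_{MAE}(\hat{x}, x)\, p_{X|Y}(x \mid y)\, dx \tag{3.111} \\
&= \arg\min_{\hat{x}} \left[ \int_{-\infty}^{\infty} \sum_{k=1}^{n} \left| x_k - \hat{x}_k \right| p_{X|Y}(x \mid y)\, dx \right] \tag{3.112} \\
&= \arg\min_{\hat{x}} \left[ \sum_{k=1}^{n} \int_{-\infty}^{\infty} \left| x_k - \hat{x}_k \right| p_{X_k|Y}(x_k \mid y)\, dx_k \right] \tag{3.113} \\
&= \sum_{k=1}^{n} \arg\min_{\hat{x}_k} \left[ \int_{-\infty}^{\hat{x}_k} (\hat{x}_k - x_k)\, p_{X_k|Y}(x_k \mid y)\, dx_k + \int_{\hat{x}_k}^{+\infty} (x_k - \hat{x}_k)\, p_{X_k|Y}(x_k \mid y)\, dx_k \right] \tag{3.114}
\end{align}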


Now we take the derivative of the expression in brackets in (3.114) and set it to zero to find the estimate.

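\begin{align}
\frac{\partial}{\partial \hat{x}_k} &\left[ \int_{-\infty}^{\hat{x}_k} (\hat{x}_k - x_k)\, p_{X_k|Y}(x_k \mid y)\, dx_k + \int_{\hat{x}_k}^{+\infty} (x_k - \hat{x}_k)\, p_{X_k|Y}(x_k \mid y)\, dx_k \right] = \tag{3.115} \\
&\int_{-\infty}^{\hat{x}_k} p_{X_k|Y}(x_k \mid y)\, dx_k - \int_{\hat{x}_k}^{+\infty} p_{X_k|Y}(x_k \mid y)\, dx_k = 0 \tag{3.116}
\end{align}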

This implies that the optimal minimum absolute error estimate must satisfy the following constraint:

$$\int_{-\infty}^{\hat{x}_k} p_{X_k|Y}(x_k \mid y)\, dx_k = \int_{\hat{x}_k}^{+\infty} p_{X_k|Y}(x_k \mid y)\, dx_k \tag{3.117}$$

Inspection of this constraint shows that it is just the definition of the median of the posterior density $p_{X_k|Y}(x_k \mid y)$! We have thus shown the following result:

$$(\hat{x}_k)_{MAE}(y) = \text{median of } p_{X_k|Y}(x_k \mid y) \tag{3.118}$$

To summarize the development to this point, we have that the BLSE estimate is the mean of pX|Y (x | y), the MAP estimate is the peak or “mode” of this density, and the MAE estimate is the median of this density. Evidently the conditional density pX|Y (x | y) plays a key role in all these estimators. Note that, in general, the mean, mode, and median of a density can be different, so the BLSE, MAP, and MAE estimates will be different in general. But if this density is symmetric and unimodal, then they are all the same.
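The difference between the three estimates can be seen on a skewed density. The sketch below assumes, purely for illustration, a scalar posterior proportional to x e^{-x} on x ≥ 0, approximated on a grid, and finds the minimizers of the absolute and squared error costs by brute-force search over the grid.

```python
import numpy as np

# Grid approximation of a hypothetical skewed posterior p(x | y) ∝ x exp(-x), x >= 0
x = np.linspace(0.0, 20.0, 2001)
dx = x[1] - x[0]
p = x * np.exp(-x)
p /= p.sum() * dx                        # normalize on the grid

# Expected cost of each candidate estimate c on the grid
abs_cost = np.array([np.sum(np.abs(x - c) * p) * dx for c in x])
sq_cost = np.array([np.sum((x - c) ** 2 * p) * dx for c in x])

x_mae = x[np.argmin(abs_cost)]           # minimizer of the absolute error cost
x_blse = x[np.argmin(sq_cost)]           # minimizer of the squared error cost
x_map = x[np.argmax(p)]                  # peak of the density

# Median and mean computed directly from the gridded density
cdf = np.cumsum(p) * dx
median = x[np.searchsorted(cdf, 0.5)]
mean = np.sum(x * p) * dx
```

For this density the mode, median, and mean are all different, so the MAP, MAE, and BLSE estimates differ, and the brute-force minimizers land on the median and the mean respectively, up to the grid resolution.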

3.8 Bayes Linear Least Square (LLSE) Estimation

In general (for non-Gaussian problems), the BLSE or MAP estimates are nonlinear functions of the data. Further, finding these estimates requires calculation of the posterior density. For these reasons, finding and implementing these estimates can be difficult in practice. As a result, we now modify our approach a bit. In particular, we will restrict our attention to estimators that are a linear function of the data (strictly speaking, an affine form). If we seek the estimator of this class that minimizes the square error cost function used in the BLSE estimate (i.e. minimize $E[J_{BLSE}(\hat{x}, x)]$ over the class of linear estimators), we will see that we obtain an estimate that only requires knowledge of second-order statistics, rather than the conditional density. The resulting estimate is termed the Linear Least Square Estimate (LLSE). In summary, we will focus on the BLSE cost function and its mean value, but will restrict the form of our estimator to linear functions of the data.

One easy way to derive the LLSE estimator is to use the orthogonality principle. Given observations Y with finite second moments, the space of all random variables of the form $Z = a^T Y + b$, for some constant vector a and scalar b, forms a closed linear subspace $V_1$ of $L_2(\Omega, F, P)$, which is also a subset of the space V discussed previously. This is because every such Z has a finite second moment, given by

$$E[Z^2] = a^T E[Y Y^T] a + 2 b\, a^T E[Y] + b^2 < \infty$$

Hence, the optimal LLSE estimator $X^*$ of a random variable X given observations of Y must satisfy the orthogonality principle:

$$E[(X - X^*) Z] = 0 \quad \text{for all } Z \in V_1$$

Since any such $Z = a^T Y + b$, and $X^* = a^{*T} Y + b^*$, we have:

$$E[(X - X^*) Z] = E[X Y^T] a + b E[X] - a^{*T} E[Y Y^T] a - b^* a^T E[Y] - b\, a^{*T} E[Y] - b b^*$$

Expand the second moments above into covariances and products of means, as

\begin{align*}
E[(X - X^*) Z] &= \mathrm{Cov}(X, Y) a + E[X] E[Y^T] a - a^{*T} \mathrm{Cov}(Y) a - a^{*T} E[Y] E[Y^T] a \\
&\quad + b E[X] - b^* a^T E[Y] - b\, a^{*T} E[Y] - b b^* \\
&= \left( \mathrm{Cov}(X, Y) - a^{*T} \mathrm{Cov}(Y) + E[X] E[Y^T] - a^{*T} E[Y] E[Y^T] - b^* E[Y^T] \right) a \\
&\quad + \left( E[X] - a^{*T} E[Y] - b^* \right) b = 0
\end{align*}


Note that a and b are arbitrary, and thus their coefficients must be zero for the equality to hold for all values of a and b. Thus,

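\begin{align*}
E[X] - a^{*T} E[Y] - b^* = 0 \;&\Rightarrow\; b^* = E[X] - a^{*T} E[Y] \\
0 &= \mathrm{Cov}(X, Y) - a^{*T} \mathrm{Cov}(Y) + E[X] E[Y^T] - a^{*T} E[Y] E[Y^T] - b^* E[Y^T] \\
&= \mathrm{Cov}(X, Y) - a^{*T} \mathrm{Cov}(Y) + \left( E[X] - a^{*T} E[Y] - b^* \right) E[Y^T] \\
&= \mathrm{Cov}(X, Y) - a^{*T} \mathrm{Cov}(Y) \\
&\Rightarrow\; a^{*T} = \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1}
\end{align*}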

where we used the optimal value of b∗ to simplify the above equation. Thus, grouping the terms in the above solution, we obtain the LLSE estimate:

$$\hat{x}_{LLSE}(y) = E[X] + \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} (y - E[Y])$$

For estimating a random vector X, we estimate each of its components as above based on Y . Note that the above estimate has the same structure for each component (it is based on the same Y ), with small variations. Thus,

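\begin{align*}
\hat{x}_{1,LLSE}(y) &= E[X_1] + \mathrm{Cov}(X_1, Y)\, \mathrm{Cov}(Y)^{-1} (y - E[Y]) \\
\hat{x}_{2,LLSE}(y) &= E[X_2] + \mathrm{Cov}(X_2, Y)\, \mathrm{Cov}(Y)^{-1} (y - E[Y]) \\
&\;\;\vdots \\
\hat{x}_{n,LLSE}(y) &= E[X_n] + \mathrm{Cov}(X_n, Y)\, \mathrm{Cov}(Y)^{-1} (y - E[Y])
\end{align*}

Stacking these estimates into a vector, the resulting vector estimate is:

$$\hat{x}_{LLSE}(y) = E[X] + \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} (y - E[Y])$$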

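We can further compute the statistics of the error $e = X - \hat{x}_{LLSE}(Y)$, as

$$E[e] = E[X] - E[X] - \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} (E[Y] - E[Y]) = 0$$

showing that the estimate is unbiased. Furthermore, the error covariance is:

\begin{align*}
E[e e^T] &= E[e X^T] \\
&= E\left[ \left( X - E[X] - \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} (Y - E[Y]) \right) X^T \right] \\
&= \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} \mathrm{Cov}(Y, X) \\
&= \mathrm{Cov}(X) - \mathrm{Cov}(X, Y)\, \mathrm{Cov}(Y)^{-1} \mathrm{Cov}(X, Y)^T
\end{align*}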

where the first equality follows from the orthogonality of the error and the estimate.

One can also derive these formulas using calculus instead of the orthogonality principle, as follows. The general form of the LLSE estimate is:

$$\hat{x}_{LLSE}(y) = C_1^T y + c_2 \tag{3.119}$$

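for some constant matrix $C_1$ and constant vector $c_2$. As before, we need to choose the constants $C_1$ and $c_2$ to minimize the mean square error cost criterion:

\begin{align}
E&\left[ J_{BLSE}(x, C_1^T y + c_2) \right] \tag{3.120} \\
&= E\left[ \left( x - C_1^T y - c_2 \right)^T \left( x - C_1^T y - c_2 \right) \right] \tag{3.121} \\
&= E\left[ x^T x \right] - 2 E\left[ x^T C_1^T y \right] - 2 c_2^T E[x] + E\left[ y^T C_1 C_1^T y \right] + 2 c_2^T C_1^T E[y] + c_2^T c_2 \tag{3.122} \\
&= E\left[ \mathrm{tr}\left( x x^T \right) \right] - 2 E\left[ \mathrm{tr}\left( C_1^T y x^T \right) \right] - 2 c_2^T m_x + E\left[ \mathrm{tr}\left( C_1^T y y^T C_1 \right) \right] + 2 c_2^T C_1^T m_y + c_2^T c_2 \tag{3.123} \\
&= \mathrm{tr}\left( E\left[ x x^T \right] \right) - 2\, \mathrm{tr}\left( C_1^T E\left[ y x^T \right] \right) - 2 c_2^T m_x + \mathrm{tr}\left( C_1^T E\left[ y y^T \right] C_1 \right) + 2 c_2^T C_1^T m_y + c_2^T c_2 \tag{3.124} \\
&= \mathrm{tr}\left( E\left[ x x^T \right] \right) - 2\, \mathrm{tr}\left( C_1^T E\left[ y x^T \right] \right) - 2 c_2^T m_x + \mathrm{tr}\left( C_1^T E\left[ y y^T \right] C_1 \right) + 2\, \mathrm{tr}\left( C_1^T m_y c_2^T \right) + c_2^T c_2 \tag{3.125}
\end{align}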


where in going from (3.122) to (3.123) and from (3.124) to (3.125) we have used the fact that $x^T y = \mathrm{tr}(y x^T)$ for the trace of a matrix (see Appendix C). Now to minimize this expression with respect to $C_1$ and $c_2$ we take derivatives with respect to these two quantities (using the rules in Appendix C) and set them equal to zero:

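\begin{align}
\frac{\partial}{\partial C_1} E\left[ J_{BLSE}(x, C_1^T y + c_2) \right] &= -2 E\left[ y x^T \right] + 2 E\left[ y y^T \right] C_1 + 2 m_y c_2^T = 0 \tag{3.126} \\
\frac{\partial}{\partial c_2} E\left[ J_{BLSE}(x, C_1^T y + c_2) \right] &= -2 m_x + 2 C_1^T m_y + 2 c_2 = 0 \tag{3.127}
\end{align}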

Solving (3.127) for c2 yields:

$$c_2 = m_x - C_1^T m_y \tag{3.128}$$

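Substituting this into (3.126) yields the following equation, which the optimal $C_1$ must satisfy:

\begin{align}
0 &= -2 E\left[ y x^T \right] + 2 E\left[ y y^T \right] C_1 + 2 m_y \left( m_x - C_1^T m_y \right)^T \tag{3.129} \\
&= -2 \left( E\left[ y x^T \right] - m_y m_x^T \right) + 2 \left( E\left[ y y^T \right] - m_y m_y^T \right) C_1 \tag{3.130} \\
&= -2 \Lambda_{yx} + 2 \Lambda_y C_1 \tag{3.131}
\end{align}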

Solving for C1 yields:

$$C_1 = \Lambda_y^{-1} \Lambda_{yx} \tag{3.132}$$

Substituting the expressions for C1 and c2 into the definition of the LLSE form we obtain for the vector LLSE estimate:

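\begin{align}
\hat{x}_{LLSE}(y) = C_1^T y + c_2 &= \left( \Lambda_y^{-1} \Lambda_{yx} \right)^T y + m_x - \left( \Lambda_y^{-1} \Lambda_{yx} \right)^T m_y \tag{3.133} \\
&= m_x + \Lambda_{yx}^T \Lambda_y^{-1} \left( y - m_y \right) \tag{3.134}
\end{align}

Since $\Lambda_{yx}^T = \Lambda_{xy}$ we obtain for the LLSE estimate in the vector case:

$$\hat{x}_{LLSE}(y) = m_x + \Lambda_{xy} \Lambda_y^{-1} \left( y - m_y \right) \tag{3.135}$$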

The corresponding error covariance can be obtained by direct substitution as:

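$$\Lambda_{LLSE} = E\left[ \left( x - \hat{x}_{LLSE}(y) \right) \left( x - \hat{x}_{LLSE}(y) \right)^T \right] = \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \tag{3.136}$$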

Using the properties of the trace, we see that the MSE is just the trace of the error covariance ΛLLSE:

$$\text{MSE} = E\left[ \left( x - \hat{x}_{LLSE}(y) \right)^T \left( x - \hat{x}_{LLSE}(y) \right) \right] = \mathrm{tr}\left( \Lambda_{LLSE} \right) \tag{3.137}$$
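As a hedged illustration of (3.135)-(3.137), the sketch below computes the LLSE from sample moments of simulated data. The data model (uniform x, Laplace noise, a mildly nonlinear observation, and the matrix `Hmat`) is entirely hypothetical and was chosen to be non-Gaussian, to emphasize that only means and covariances enter the formulas.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 20_000
x = rng.uniform(-1.0, 1.0, size=(n_samples, 2))              # non-Gaussian unknown
noise = rng.laplace(scale=0.3, size=(n_samples, 3))          # non-Gaussian noise
Hmat = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])        # hypothetical mixing matrix
y = x @ Hmat.T + 0.5 * x[:, :1] ** 2 + noise                 # mildly nonlinear observation

# Second-order statistics (sample versions of m_x, m_y, Lambda_xy, Lambda_y, Lambda_x)
m_x, m_y = x.mean(axis=0), y.mean(axis=0)
xc, yc = x - m_x, y - m_y
Lam_xy = xc.T @ yc / n_samples
Lam_y = yc.T @ yc / n_samples
Lam_x = xc.T @ xc / n_samples

K = Lam_xy @ np.linalg.inv(Lam_y)        # gain Lambda_xy Lambda_y^-1 from (3.135)
x_hat = m_x + yc @ K.T                   # LLSE applied to every sample
e = x - x_hat                            # estimation errors

Lam_llse = Lam_x - K @ Lam_xy.T          # error covariance (3.136)
mse = np.mean(np.sum(e ** 2, axis=1))    # empirical MSE, compare with tr(Lam_llse) per (3.137)
cross = e.T @ yc / n_samples             # sample correlation between error and data
```

Because the gain is built from the same sample moments it is evaluated on, the error is exactly uncorrelated with the centered data in-sample, and the empirical MSE matches the trace of `Lam_llse` up to floating-point error.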

Again note that these expressions only depend on the means, covariances, and cross-covariances of the underlying random variables. As for the scalar case, the formula for the LLSE is the same as that obtained for the vector Gaussian case of Example 3.6. Now let us examine some examples:

Example 3.13 In this example we return to Example 3.2, but this time seek the LLSE estimate. Let us apply the LLSE formula (3.135). We need the second-order quantities mx, my, λxy, λx, and λy. The means mx and my are zero by symmetry of the density. The covariances are obtained as:

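\begin{align}
\lambda_y = \lambda_x &= \int_{-\infty}^{\infty} y^2\, p_Y(y)\, dy = \int_{-1}^{0} y^2 (1 + y)\, dy + \int_{0}^{1} y^2 (1 - y)\, dy = \frac{1}{6} \tag{3.138} \\
\lambda_{xy} &= \int_{-\infty}^{\infty} xy\, p_{X,Y}(x, y)\, dx\, dy = 2 \int_{0}^{1} \int_{0}^{1-y} \frac{xy}{2}\, dx\, dy - 2 \int_{0}^{1} \int_{0}^{1-y} \frac{xy}{2}\, dx\, dy = 0 \tag{3.139}
\end{align}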

where we have used the symmetry of the density in obtaining the expression for λxy. Thus we obtain for the LLSE:

x̂LLSE(y) = 0 (3.140)

as before. Using the formula for the MSE we obtain

$$\text{MSE} = \lambda_x - \frac{\lambda_{xy}^2}{\lambda_y} = \frac{1}{6} - \frac{0}{1/6} = \frac{1}{6} \tag{3.141}$$


Example 3.14 For this example let us revisit the problem of Example 3.3, but seek the LLSE estimate. Note that the BLSE for this example happened to be linear. Since the LLSE is nothing more than the minimum MSE estimator restricted to have a linear form, the BLSE and the LLSE are the same for this example. In other words, if the BLSE happens to be linear, we certainly cannot find a different linear estimator with lower MSE! Even though we know the answer, let us find the LLSE via the formula.

Again, we need the second order quantities mx, my, λxy, λx, and λy:

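\begin{align}
m_y &= \int_{-\infty}^{\infty} y\, p_Y(y)\, dy = \int_{0}^{1} 3 y^3\, dy = \frac{3}{4} \tag{3.142} \\
m_x &= \int_{-\infty}^{\infty} x\, p_X(x)\, dx = \int_{0}^{1} x\, 6x (1 - x)\, dx = \frac{1}{2} \tag{3.143} \\
\lambda_y &= \int_{-\infty}^{\infty} y^2\, p_Y(y)\, dy - m_y^2 = \int_{0}^{1} 3 y^4\, dy - \frac{9}{16} = \frac{3}{80} \tag{3.144} \\
\lambda_x &= \int_{-\infty}^{\infty} x^2\, p_X(x)\, dx - m_x^2 = \int_{0}^{1} x^2\, 6x (1 - x)\, dx - \frac{1}{4} = \frac{1}{20} \tag{3.145} \\
\lambda_{xy} &= \int_{-\infty}^{\infty} xy\, p_{X,Y}(x, y)\, dx\, dy - m_x m_y = \int_{0}^{1} \int_{0}^{y} 6 x^2 y\, dx\, dy - \frac{1}{2} \cdot \frac{3}{4} = \frac{1}{40} \tag{3.146}
\end{align}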

Thus we obtain for the LLSE:

$$\hat{x}_{LLSE}(y) = m_x + \frac{\lambda_{xy}}{\lambda_y} (y - m_y) = \frac{1}{2} + \frac{1/40}{3/80} \left( y - \frac{3}{4} \right) = \frac{2}{3}\, y \tag{3.147}$$

as before. Using the formula for the MSE we obtain

$$\text{MSE} = \lambda_x - \frac{\lambda_{xy}^2}{\lambda_y} = \frac{1}{20} - \frac{(1/40)^2}{3/80} = \frac{1}{30} \tag{3.148}$$

Example 3.15 For this example let us revisit the problem of Example 3.4. Again, we need the second order quantities mx, my, λxy, λx, and λy:

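\begin{align}
m_y &= \int_{-\infty}^{\infty} y\, p_Y(y)\, dy = \int_{0}^{1} 5 y^5\, dy = \frac{5}{6} \tag{3.149} \\
m_x &= \int_{-\infty}^{\infty} x\, p_{X,Y}(x, y)\, dx\, dy = \int_{0}^{1} \int_{0}^{y^2} x\, 10x\, dx\, dy = \frac{10}{21} \tag{3.150} \\
\lambda_y &= \int_{-\infty}^{\infty} y^2\, p_Y(y)\, dy - m_y^2 = \int_{0}^{1} 5 y^6\, dy - \left( \frac{5}{6} \right)^2 = \frac{5}{252} \tag{3.151} \\
\lambda_x &= \int_{-\infty}^{\infty} x^2\, p_{X,Y}(x, y)\, dx\, dy - m_x^2 = \int_{0}^{1} \int_{0}^{y^2} x^2\, 10x\, dx\, dy - \left( \frac{10}{21} \right)^2 = \frac{5}{18} - \left( \frac{10}{21} \right)^2 = \frac{5}{98} \tag{3.152} \\
\lambda_{xy} &= \int_{-\infty}^{\infty} xy\, p_{X,Y}(x, y)\, dx\, dy - m_x m_y = \int_{0}^{1} \int_{0}^{y^2} 10 x^2 y\, dx\, dy - \frac{10}{21} \cdot \frac{5}{6} = \frac{5}{12} - \frac{10}{21} \cdot \frac{5}{6} = \frac{5}{252} \tag{3.153}
\end{align}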

Thus we obtain for the LLSE:

$$\hat{x}_{LLSE}(y) = m_x + \frac{\lambda_{xy}}{\lambda_y} (y - m_y) = \frac{10}{21} + \frac{5/252}{5/252} \left( y - \frac{5}{6} \right) = y - \frac{5}{14} \tag{3.154}$$

as before. Using the formula for the MSE we obtain

$$\text{MSE} = \lambda_x - \frac{\lambda_{xy}^2}{\lambda_y} = \frac{5}{98} - \frac{(5/252)^2}{5/252} = \frac{55}{1764} = 0.0312 \tag{3.155}$$

Note that this MSE is worse than that obtained by the optimal minimum MSE nonlinear BLSE estimator of Example 3.4 – but not much worse.
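The arithmetic in Example 3.15 can be checked with exact rational numbers. The sketch below assumes the joint density is 10x on the region 0 ≤ x ≤ y², 0 ≤ y ≤ 1, which is inferred here from the integrands and limits in (3.150)-(3.153) rather than restated from Example 3.4; the helper `moment` is a hypothetical name.

```python
from fractions import Fraction

def moment(i, j):
    """E[x^i y^j] for the assumed density 10x on 0 <= x <= y^2, 0 <= y <= 1.

    Integrating 10 x^(i+1) over x in [0, y^2] gives 10 y^(2i+4) / (i+2);
    integrating that times y^j over y in [0, 1] gives 10 / ((i+2)(2i+j+5)).
    """
    return Fraction(10, (i + 2) * (2 * i + j + 5))

m_x, m_y = moment(1, 0), moment(0, 1)
lam_x = moment(2, 0) - m_x ** 2
lam_y = moment(0, 2) - m_y ** 2
lam_xy = moment(1, 1) - m_x * m_y

slope = lam_xy / lam_y                   # coefficient of y in the LLSE
intercept = m_x - slope * m_y            # constant term of the LLSE
mse = lam_x - lam_xy ** 2 / lam_y        # MSE of the LLSE

print(slope, intercept, mse, float(mse))
```

The exact values reproduce the slope of 1 and offset of −5/14 in (3.154) and the MSE of 55/1764 in (3.155).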

Example 3.16 (Scalar Gaussian Case) Let us now examine the problem of Example 3.5 with regard to the LLSE. For this jointly Gaussian problem we can immediately see that the LLSE estimate is identical to the BLSE estimate!


Example 3.17 (Vector Gaussian Case) Finally, we have the vector Gaussian case. As for the scalar Gaussian case, we can immediately see that the LLSE estimate is identical to the BLSE estimate. Thus for jointly Gaussian problems we have the interesting result that the BLSE, MAP and LLSE estimators are all the same.

Example 3.18 In this example we examine the following problem: Let the random vector z be linearly related to the random vector x as follows:

z = Fx + Hw + c (3.156)

where c is deterministic, E[w] = mw, and w is correlated with x. Find the LLSE estimate of z based on observation of y in terms of the second-order statistics of x and y.

As always for LLSE estimates we need to find the second order quantities: mz, Λzy, and Λz:

mz = E [z] = Fmx + Hmw + c (3.157)

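\begin{align}
\Lambda_z &= E\left[ z z^T \right] - m_z m_z^T \tag{3.158} \\
&= F E\left[ x x^T \right] F^T + F E\left[ x w^T \right] H^T + H E\left[ w x^T \right] F^T \tag{3.159} \\
&\quad + H E\left[ w w^T \right] H^T - F m_x m_x^T F^T - F m_x m_w^T H^T - H m_w m_x^T F^T - H m_w m_w^T H^T \tag{3.160} \\
&= F \Lambda_x F^T + H \Lambda_w H^T + F \Lambda_{xw} H^T + H \Lambda_{xw}^T F^T \tag{3.161}
\end{align}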

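\begin{align}
\Lambda_{zy} &= E\left[ z y^T \right] - m_z m_y^T \tag{3.162} \\
&= E\left[ (Fx + Hw + c)\, y^T \right] - (F m_x + H m_w + c)\, m_y^T \tag{3.163} \\
&= F \Lambda_{xy} + H \Lambda_{wy} \tag{3.164}
\end{align}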

Thus the LLSE estimate is given by:

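\begin{align}
\hat{z}_{LLSE}(y) &= m_z + \Lambda_{zy} \Lambda_y^{-1} \left( y - m_y \right) \tag{3.165} \\
&= F m_x + H m_w + c + \left[ F \Lambda_{xy} + H \Lambda_{wy} \right] \Lambda_y^{-1} \left( y - m_y \right) \tag{3.166} \\
&= F \left[ m_x + \Lambda_{xy} \Lambda_y^{-1} \left( y - m_y \right) \right] + H \left[ m_w + \Lambda_{wy} \Lambda_y^{-1} \left( y - m_y \right) \right] + c \tag{3.167} \\
&= F \hat{x}_{LLSE}(y) + H \hat{w}_{LLSE}(y) + c \tag{3.168}
\end{align}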

Note that the LLSE estimate of z can be written in terms of the LLSE estimate of x and w! The corresponding error covariance for this case is given by:

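\begin{align}
\Lambda_{z,LLSE} &= \Lambda_z - \Lambda_{zy} \Lambda_y^{-1} \Lambda_{zy}^T \tag{3.169} \\
&= F \Lambda_x F^T + F \Lambda_{xw} H^T + H \Lambda_{xw}^T F^T + H \Lambda_w H^T - \left[ F \Lambda_{xy} + H \Lambda_{wy} \right] \Lambda_y^{-1} \left[ F \Lambda_{xy} + H \Lambda_{wy} \right]^T \tag{3.170} \\
&= F \left( \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T \right) F^T + H \left( \Lambda_w - \Lambda_{wy} \Lambda_y^{-1} \Lambda_{wy}^T \right) H^T \tag{3.171} \\
&\quad + F \left( \Lambda_{xw} - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{wy}^T \right) H^T + H \left( \Lambda_{xw} - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{wy}^T \right)^T F^T \tag{3.172} \\
&= F \Lambda_{x,LLSE} F^T + H \Lambda_{w,LLSE} H^T + F \left( \Lambda_{xw} - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{wy}^T \right) H^T + H \left( \Lambda_{xw} - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{wy}^T \right)^T F^T \tag{3.173}
\end{align}

where $\Lambda_{x,LLSE}$ is the error covariance associated with the LLSE of x based on y and $\Lambda_{w,LLSE}$ is the error covariance associated with the LLSE of w based on y. The remaining cross terms involve the cross-covariance between these two estimation errors. Thus we can also express the error covariance in terms of the LLSE estimate of x and w for this example. We will use these expressions in our study of the Kalman filter later in the notes.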

Linear Observation Model Here we examine a case that often is used in practice. In particular, suppose that y and x are related by the following linear observation equation:

y = Hx + v (3.174)

where x is a random vector with mean $m_x$ and covariance matrix Q, and v is a zero-mean random vector, uncorrelated with x, with covariance matrix R. The linear observation model in (3.174) is a common one in engineering practice. Let us find the LLSE estimate for this case. As usual, we need to find the second-order quantities $m_y$, $\Lambda_{xy}$, and $\Lambda_y$:

\begin{align}
m_y &= H m_x \tag{3.175} \\
\Lambda_y &= E\left[ \left( y - m_y \right) \left( y - m_y \right)^T \right] = H Q H^T + R \tag{3.176} \\
\Lambda_{xy} &= E\left[ (x - m_x) \left( y - m_y \right)^T \right] = E\left[ (x - m_x) \left( Hx + v - m_y \right)^T \right] = Q H^T \tag{3.177}
\end{align}


Thus we obtain for the LLSE estimate and the associated error covariance ΛLLSE:

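\begin{align}
\hat{x}_{LLSE}(y) &= m_x + Q H^T \left( H Q H^T + R \right)^{-1} \left( y - H m_x \right) \tag{3.178} \\
\Lambda_{LLSE} &= Q - Q H^T \left( H Q H^T + R \right)^{-1} H Q \tag{3.179}
\end{align}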

As usual, the MSE is the trace of the error covariance. There are a number of alternate forms associated with the LLSE for this case that are of particular interest. The first is the following alternative form for the error covariance:

$$\Lambda_{LLSE}^{-1} = Q^{-1} + H^T R^{-1} H \tag{3.180}$$

The inverse of a covariance is often interpreted as a measure of information. In fact, such a covariance inverse is sometimes termed an “information matrix.” With this interpretation, we see that (3.180) states that the total information after the incorporation of a measurement equals the prior information $Q^{-1}$ plus the information $H^T R^{-1} H$ available in the measurement. To verify (3.180) we must show that the following identity is true:

$$\left( Q - Q H^T \left[ H Q H^T + R \right]^{-1} H Q \right) \left( Q^{-1} + H^T R^{-1} H \right) = I \tag{3.181}$$

Multiplying out the terms on the left hand side, we find:

\begin{align}
I &- Q H^T \left[ H Q H^T + R \right]^{-1} H + Q H^T R^{-1} H - Q H^T \left[ H Q H^T + R \right]^{-1} H Q H^T R^{-1} H \tag{3.182} \\
&= I + Q H^T \left[ H Q H^T + R \right]^{-1} \left( -I + \left[ H Q H^T + R \right] R^{-1} - H Q H^T R^{-1} \right) H \tag{3.183} \\
&= I \tag{3.184}
\end{align}

so that (3.180) is verified.

Secondly, we note that there is an alternate expression for the gain term multiplying the observations in (3.178). This gain term is given in (3.178) by:

$$K = Q H^T \left[ H Q H^T + R \right]^{-1} \tag{3.185}$$

The alternate form is given by:

$$K = \Lambda_{LLSE} H^T R^{-1} = \left( H^T R^{-1} H + Q^{-1} \right)^{-1} H^T R^{-1} \tag{3.186}$$

where $\Lambda_{LLSE}$ is the error covariance. Given this equivalence, we see that the LLSE estimate (which is also the MAP estimate for the Gaussian case) must satisfy the following implicit relationship, termed the normal equations:

$$\left( H^T R^{-1} H + Q^{-1} \right) \left( \hat{x}_{LLSE}(y) - m_x \right) = H^T R^{-1} \left( y - m_y \right) \tag{3.187}$$

Note that the matrix on the LHS of this equation is the inverse of the error covariance, i.e. the “information matrix” for the problem!

To verify the alternate form for the gain K we proceed as follows

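\begin{align}
\Lambda_{LLSE} H^T R^{-1} &= \left( Q - Q H^T \left[ H Q H^T + R \right]^{-1} H Q \right) H^T R^{-1} \tag{3.188} \\
&= Q H^T \left[ H Q H^T + R \right]^{-1} \left( H Q H^T + R - H Q H^T \right) R^{-1} \tag{3.189} \\
&= Q H^T \left[ H Q H^T + R \right]^{-1} \tag{3.190}
\end{align}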

Finally, we can obtain another alternate expression for the error covariance ΛLLSE. To this end, note that we can write the estimation error in the following form:

\begin{align}
e &= x - \hat{x}_{LLSE}(y) \tag{3.191} \\
&= (I - KH)(x - m_x) - Kv \tag{3.192}
\end{align}

From this form we find our alternate expression for ΛLLSE:

$$\Lambda_{LLSE} = E\left[ e e^T \right] = (I - KH)\, Q\, (I - KH)^T + K R K^T \tag{3.193}$$
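The equivalence of these alternate forms is easy to confirm numerically. The sketch below draws a hypothetical H together with random positive definite Q and R (all values are assumptions for illustration) and evaluates the gain forms (3.185) and (3.186) and the covariance forms (3.179), (3.180), and (3.193).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3
H = rng.standard_normal((m, n))
A = rng.standard_normal((n, n)); Q = A @ A.T + np.eye(n)    # random SPD prior covariance
B = rng.standard_normal((m, m)); R = B @ B.T + np.eye(m)    # random SPD noise covariance
I = np.eye(n)

K = Q @ H.T @ np.linalg.inv(H @ Q @ H.T + R)                 # gain, form (3.185)

Lam_cov = Q - K @ H @ Q                                      # covariance form (3.179)
Lam_info = np.linalg.inv(np.linalg.inv(Q) + H.T @ np.linalg.inv(R) @ H)   # information form (3.180)
Lam_sym = (I - K @ H) @ Q @ (I - K @ H).T + K @ R @ K.T      # symmetric form (3.193)

K_alt = Lam_info @ H.T @ np.linalg.inv(R)                    # gain, form (3.186)
```

All three covariance expressions agree to floating-point precision, as do the two gains. The form (3.193) is a sum of two symmetric nonnegative definite terms, which is one reason it is often preferred in numerical work.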


Geometric Characterization of LLSE Estimates: Before leaving LLSE estimation we present an extremely important characterization of the LLSE estimate based on geometric notions. Specifically, x̂LLSE(y) is the unique linear function of y such that the error e = x − x̂LLSE(y) is zero mean (i.e. unbiased) and uncorrelated with any linear function of the data y. That is, an equivalent characterization of the LLSE is that it is the estimator that satisfies the following two conditions:

Unbiased: $E\left[ x - \hat{x}_{LLSE}(y) \right] = 0$

Error ⊥ Data: $E\left\{ \left[ x - \hat{x}_{LLSE}(y) \right] g(y) \right\} = 0$ for all linear functions g(·).

This geometric situation is depicted in Figure 3.7. The idea is that the optimal estimate is that linear function of the data which has no correlation with the error. Intuitively, if correlation remained between the error and the estimate, there would remain information in the error of help in estimating the signal that we should have extracted. Note that this geometric condition implies that the error is orthogonal to (i.e. uncorrelated with) both the data itself (which is obviously a trivial function of the data) as well as the LLSE estimate (which is clearly a linear function of the data).

[Figure 3.7: Illustration of the projection theorem for LLSE. Labels in the figure: the space of linear functions of y, the estimate x̂LLSE, and the error e = x − x̂LLSE.]

Let us close by summarizing the properties of LLSE estimates:

• The LLSE estimate is the minimum MSE estimate over all linear functions of the data.

• The LLSE estimate is always unbiased.

• The associated error covariances satisfy: 0 ≤ ΛBLSE ≤ ΛLLSE ≤ Λx

• The LLSE estimate equals the BLSE estimate for the jointly Gaussian case.

• The LLSE estimate only requires knowledge of second-order properties.

3.9 Nonrandom Parameter Estimation

In our discussion of Bayes or random parameter estimation we modeled the unknown parameter X as a random variable or vector. This was our “model of nature.” In many cases it is not realistic to model X in this way. For example, if we are attempting to estimate the bias of a coin or the orientation of a target, these quantities are not random, but they are still unknown. This leads us to a different model of nature better matched to such problems. In particular, we model X as an unknown but nonrandom parameter, so X is just a constant. This seemingly minor change impacts all the elements of our estimation problem. Let us examine the three elements of any estimation problem in this light:

1. Parameter Model: As we just discussed X is now modeled as an unknown deterministic parameter.

2. Observation Model: This is given by pY |X(y | x). Since X is nonrandom, pY |X(y | x) is now a parameterized density.


3. Estimation Rule: As we discuss in greater detail below, the direct approach to finding a good estimator as the minimizer of a criterion, such as we took in the Bayes case, will present problems. Basically, these can be traced to the fact that we can no longer average over X, since it is no longer random. Instead we take the approach of proposing an estimator and then evaluating its performance. In particular, we will evaluate candidate estimators on the basis of their bias, variance, and mean square error.

In the Bayes case we found our estimators by minimizing the expected value of a cost E[J(x̂(y),x)]. Since this expected value was over the randomness in both X and Y , for a given cost structure J(·, ·) and a given estimator x̂(y) the quantity E[J(x̂(y),x)] was a constant – i.e. each estimator produced a single cost, and we could simply search for the one with the smallest cost. Suppose we try this approach in the nonrandom case, e.g. for the square error cost J(x̂(y),x) = (x̂(y) − x)2. Since the only randomness in the problem is with respect to Y a rational approach is to try and find the minimum of this cost averaged with respect to the parameterized density pY |X(y | x). This yields:

\begin{align}
\hat{x}^* &= \arg\min_{\hat{x}} \int_{-\infty}^{\infty} (\hat{x} - x)^2\, p_{Y|X}(y \mid x)\, dy \tag{3.194} \\
&= x \tag{3.195}
\end{align}

Thus the optimal estimate is just the unknown parameter x itself! This result may seem strange, but recall that an estimator is just a mapping from the data y to a corresponding estimate x̂(y). In performing the minimization in (3.194) we are really asking, “what is the best value of x to assign to this particular value of the observation y?” Clearly, since x is fixed, x itself is the best value! This is right, but not very useful, since we are assuming we do not know its value.¹ What we will do instead is look at the behavior of the estimation error e(y) = x − x̂(y) and see if we can find estimators with desirable error behavior. The three measures of error behavior we will be interested in are the bias, the variance, and the mean square error (MSE). We examine each of these in the nonrandom context next.

Bias: In the nonrandom case, the bias b(x) of an estimator x̂(y) is given by:

$$b(x) = E[e \mid X = x] = E[x - \hat{x}(y) \mid X = x] = x - \int_{-\infty}^{\infty} \hat{x}(y)\, p_{Y|X}(y \mid x)\, dy \tag{3.196}$$

Unlike the random parameter case, we cannot average over the prior density of X. The consequence is that b(x) is in general a function of X! In particular, we can define 3 broad cases of bias behavior:

1. b(x) = 0 for all values of X = x. In this case we can say that the estimate is unbiased.

2. b(x) = c where c is a constant independent of X. Here the estimator has constant bias. If the constant c is known, we can always obtain an unbiased estimator by simply subtracting c from the estimate.

3. b(x) = f(x) for some function f(·). In this, the general case, the bias is unknown (since X itself is unknown) and we cannot simply subtract it out to obtain an unbiased estimate.

Clearly, we desire estimators whose error is small on average, i.e. that have small bias.

Variance: Having small bias is not enough. The average behavior of an estimator may be good, yet its variability may be high. What we also would like is for the variance of the estimate to be small so we are confident that on any particular instance the estimate is close to the true value. For the nonrandom case the error covariance matrix is given by:

$$\Lambda_e(x) = \mathrm{Cov}[e, e] = E\left[ [e - b(x)][e - b(x)]^T \right] = E\left[ e e^T \right] - b(x)\, b^T(x) \tag{3.197}$$

This provides a measure of the spread of the error. Again, Λe(x) is a function of X in general.

¹ This argument will hold true for any cost which is nonnegative and zero when x̂(y) = x.


MSE: The last measure of estimator quality we will be concerned with is the mean square error or MSE. This is given by:

$$\text{MSE} = E\left[ e^T e \right] = \mathrm{tr}\left( E\left[ e e^T \right] \right) = \mathrm{tr}\left[ \Lambda_e(x) + b(x)\, b(x)^T \right] \tag{3.198}$$

Thus we see that the MSE is a function of both the variance and the bias. In particular, we do not simply want to minimize the variance if this will lead to a large bias. For example, we could take as our estimate a constant C independent of y. The variance Λe(x) of this estimator would be identically zero, but the bias would be x−C and could be large.

In general, we seek unbiased estimators of minimum variance. Unfortunately, there is no straightforward procedure that leads to minimum variance unbiased estimators in the nonrandom case, thus we have to define an estimator and see how well it works. In this search for good estimators it is useful to know how well any unbiased estimator can do. Then we have a yardstick against which to measure a given estimator. We provide such a bound next.
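The trade-off between bias and variance can be simulated for a simple family of estimators. The sketch below assumes a hypothetical scalar problem, y_i = x + w_i with independent N(0, σ²) noise, and the estimator x̂(y) = c·ȳ, where ȳ is the sample mean. The scale factor c, the helper `error_stats`, and all numerical values are illustrative assumptions; c = 0 corresponds to the constant estimate C = 0 discussed above.

```python
import numpy as np

def error_stats(x_true, c, sigma=1.0, n_obs=10, n_trials=100_000, seed=0):
    """Empirical bias, variance, and MSE of the error e = x - c * mean(y)."""
    rng = np.random.default_rng(seed)
    y = x_true + sigma * rng.standard_normal((n_trials, n_obs))
    e = x_true - c * y.mean(axis=1)
    return e.mean(), e.var(), np.mean(e ** 2)

results = {}
for c in (1.0, 0.8, 0.0):
    bias, var, mse = error_stats(x_true=2.0, c=c)
    results[c] = (bias, var, mse)
    # For this model, theory gives bias (1 - c) x and variance c^2 sigma^2 / n_obs
    print(f"c={c:.1f}  bias={bias:+.4f}  var={var:.4f}  mse={mse:.4f}")
```

With c = 1 the estimator is unbiased and its MSE equals its variance; with c = 0 the variance is zero but the MSE is the squared bias x². In every case the empirical MSE equals the variance plus the squared bias, as in (3.198).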

3.9.1 Cramer-Rao Bound

Let x̂(y) be any unbiased estimate of the deterministic but unknown (scalar) parameter X and let $\lambda_e(x) = E[(x - \hat{x}(y))^2]$ be its associated error covariance (which is also the mean square error since it is unbiased). The Cramer-Rao Bound (CRB) is a lower bound on the estimation error covariance of any unbiased estimate of the deterministic but unknown parameter X. The result is as follows:

Theorem 3.1 (Cramer-Rao Bound) If x̂(y) is any unbiased estimate of the deterministic parameter X, and $\lambda_e(x) = E[(x - \hat{x}(y))^2]$ is its associated error covariance, then:

$$\lambda_e(x) \geq \frac{1}{I_y(x)} \tag{3.199}$$

where $I_y(x)$ is given by:

\begin{align}
I_y(x) &= E\left[ \left. \left( \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right)^2 \right| X = x \right] \tag{3.200} \\
&= -E\left[ \left. \frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x) \right| X = x \right] \tag{3.201}
\end{align}

The quantity Iy(x), which plays a central role in the CRB is called the Fisher information in y about x. Any unbiased estimator that achieves the CRB is termed efficient. While we do not discuss them here, there are also vector forms of the CRB and extensions to account for biased estimators.

Let us now prove the CRB and its two alternate expressions. Let x̂(y) be any unbiased estimate of x and define the error in this estimate as e(y) = x̂(y) − x. The error e is a random variable itself since it depends on y. In particular, since the estimate in question is unbiased we know that E[e] = 0. Note also that E[e2] = λe(x), the error variance associated with the estimate x̂(y). Since the estimate is unbiased we have:

$$E[e] = \int_{-\infty}^{\infty} (\hat{x}(y) - x)\, p_{Y|X}(y \mid x)\, dy = 0 \tag{3.202}$$

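Now differentiating with respect to x and using the product rule we obtain:

\begin{align}
\frac{\partial}{\partial x} \int_{-\infty}^{\infty} (\hat{x}(y) - x)\, p_{Y|X}(y \mid x)\, dy &= \int_{-\infty}^{\infty} \left[ (\hat{x}(y) - x)\, \frac{\partial}{\partial x} p_{Y|X}(y \mid x) - p_{Y|X}(y \mid x) \right] dy \tag{3.203} \\
&= 0 \tag{3.204}
\end{align}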

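Now the second term in the integral integrates to 1, so we have:

\begin{align}
\int_{-\infty}^{\infty} (\hat{x}(y) - x)\, \frac{\partial}{\partial x} p_{Y|X}(y \mid x)\, dy &= \int_{-\infty}^{\infty} (\hat{x}(y) - x)\, p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\, dy \tag{3.205} \\
&= 1 \tag{3.206}
\end{align}

Now since the expression is equal to 1 we can square it:

\begin{align}
1 &= \left( \int_{-\infty}^{\infty} (\hat{x}(y) - x)\, p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\, dy \right)^2 \tag{3.207} \\
&= \left( \int_{-\infty}^{\infty} \left[ (\hat{x}(y) - x) \sqrt{p_{Y|X}(y \mid x)} \right] \left[ \sqrt{p_{Y|X}(y \mid x)}\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right] dy \right)^2 \tag{3.208} \\
&\le \left[ \int_{-\infty}^{\infty} (\hat{x}(y) - x)^2\, p_{Y|X}(y \mid x)\, dy \right] \left[ \int_{-\infty}^{\infty} p_{Y|X}(y \mid x) \left[ \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right]^2 dy \right] \tag{3.209} \\
&= \lambda_e(x)\, E\left[ \left( \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right)^2 \,\middle|\, X = x \right] \tag{3.210} \\
&= \lambda_e(x)\, I_y(x) \tag{3.211}
\end{align}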

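where the inequality follows from the Schwarz inequality for functions, which states that:

$$\left( \int_{-\infty}^{\infty} f_1(y)\, f_2(y)\, dy \right)^2 \le \left( \int_{-\infty}^{\infty} f_1^2(y)\, dy \right) \left( \int_{-\infty}^{\infty} f_2^2(y)\, dy \right)$$

Thus we see that:

$$\lambda_e(x) \ge \frac{1}{I_y(x)} \tag{3.212}$$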

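as desired. The second form of the CRB can be shown as follows. Observe that:

$$\int_{-\infty}^{\infty} p_{Y|X}(y \mid x)\, dy = 1 \tag{3.213}$$

Differentiating once with respect to x we obtain

$$\int_{-\infty}^{\infty} \frac{\partial}{\partial x} p_{Y|X}(y \mid x)\, dy = \int_{-\infty}^{\infty} p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\, dy = 0 \tag{3.214}$$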

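Now differentiating this resulting expression with respect to x yields:

\begin{align}
0 &= \frac{\partial}{\partial x} \int_{-\infty}^{\infty} p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\, dy \tag{3.215} \\
&= \int_{-\infty}^{\infty} p_{Y|X}(y \mid x)\, \frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x)\, dy + \int_{-\infty}^{\infty} \frac{\partial}{\partial x} p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\, dy \tag{3.216} \\
&= \int_{-\infty}^{\infty} p_{Y|X}(y \mid x)\, \frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x)\, dy + \int_{-\infty}^{\infty} p_{Y|X}(y \mid x) \left[ \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right]^2 dy \tag{3.217} \\
&= E\left\{ \frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x) \,\middle|\, X = x \right\} + E\left\{ \left[ \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right]^2 \,\middle|\, X = x \right\} \tag{3.218}
\end{align}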

where we have used the fact that:

$$p_{Y|X}(y \mid x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) = \frac{\partial}{\partial x} p_{Y|X}(y \mid x) \tag{3.219}$$

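Thus from (3.218) we see that:

$$E\left\{ \frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x) \,\middle|\, X = x \right\} = -E\left\{ \left[ \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \right]^2 \,\middle|\, X = x \right\} \tag{3.220}$$

which demonstrates the equivalence.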


Now, from the condition for equality in the Schwarz inequality on which the CRB is based, equality in the CRB holds if and only if:

$$\hat{x}(y) - x = k(x)\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \tag{3.221}$$

for some k(x) > 0. Now when equality holds in the CRB the variance of both sides of (3.221) must be the same. The variance of the left hand side is given by λe(x) = 1/Iy(x) while the variance of the right hand side is given by k²(x)Iy(x), thus k(x) = 1/Iy(x). This implies that x̂(y) is an efficient estimate if and only if

$$\hat{x}(y) = x + \frac{1}{I_y(x)}\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \tag{3.222}$$

Now since the left hand side of (3.222) is only a function of y, an unbiased efficient estimator will exist if and only if the right hand side is independent of x. This gives us the following result:

Theorem 3.2 (Existence of Efficient Estimator) An unbiased efficient estimator of the nonrandom parameter x exists if and only if

$$x + \frac{1}{I_y(x)}\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) \tag{3.223}$$

is independent of x, where Iy(x) is the Fisher information in y about x.

The expression (3.223) does not depend on knowledge of a particular estimator and is computable. This gives us a way of telling if an efficient estimator exists for a given situation without needing to know the estimator.

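Example 3.19 Suppose we have the following measurement:

$$y = hx + w, \qquad w \sim N(0, r) \tag{3.224}$$

and consider the estimator of x given by:

$$\hat{x}(y) = \frac{y}{h} \tag{3.225}$$

To find the bias of this estimator, compute its mean:

$$E[\hat{x}(y)] = E\left[\frac{y}{h}\right] = E\left[x + \frac{w}{h}\right] = x \tag{3.226}$$

So the estimator is unbiased. Now consider the error variance:

$$\lambda(x) = \mathrm{Var}(\hat{x}(y)) = E\left[(x - \hat{x}(y))^2\right] = E\left[\frac{w^2}{h^2}\right] = \frac{r}{h^2} \tag{3.227}$$

Now let's compute the CRB and see how good the estimator is. First note that:

$$p_{Y|X}(y \mid x) = N(y;\, hx,\, r) \tag{3.228}$$

Therefore:

\begin{align}
\ln p_{Y|X}(y \mid x) &= -\ln\left(\sqrt{2\pi r}\right) - \frac{(y - hx)^2}{2r} \tag{3.229} \\
\implies \frac{\partial^2}{\partial x^2} \left[\ln p_{Y|X}(y \mid x)\right] &= -\frac{h^2}{r} \tag{3.230} \\
\implies I_y(x) = -E\left[\frac{\partial^2}{\partial x^2} \ln p_{Y|X}(y \mid x)\right] &= \frac{h^2}{r} \tag{3.231} \\
\implies \lambda(x) \ge \frac{1}{I_y(x)} &= \frac{r}{h^2} \tag{3.232}
\end{align}

So we see that the given estimator is efficient.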
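As a sanity check on Example 3.19, the following sketch simulates the estimator y/h and compares its empirical error variance with the bound r/h². It is an illustration only: the function name `simulate_linear_estimator` and the values of x, h, and r are assumptions, not part of the notes.

```python
import random

def simulate_linear_estimator(x, h, r, n_trials=200_000, seed=1):
    """Empirical mean and error variance of x_hat = y / h for y = h x + w."""
    rng = random.Random(seed)
    estimates = [(h * x + rng.gauss(0.0, r ** 0.5)) / h for _ in range(n_trials)]
    mean = sum(estimates) / n_trials
    err_var = sum((xh - x) ** 2 for xh in estimates) / n_trials
    return mean, err_var

x_true, h, r = 1.5, 2.0, 0.5          # illustrative values
mean_est, err_var = simulate_linear_estimator(x_true, h, r)
crb = r / h ** 2                      # 1 / I_y(x) with I_y(x) = h^2 / r, eq. (3.231)
print(mean_est, err_var, crb)
```

Up to Monte Carlo error, the empirical mean should sit near x and the empirical error variance near the bound, consistent with the estimator being unbiased and efficient.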


Example 3.20 In this example suppose we want to estimate x from the following nonlinear observation:

$$y = h(x) + w, \qquad w \sim N(0, r) \tag{3.233}$$

With some calculation we can show that

$$I_y(x) = \frac{1}{r} \left( \frac{\partial h(x)}{\partial x} \right)^2 \tag{3.234}$$

Computing the criterion (3.222) yields:

$$x + \frac{1}{I_y(x)}\, \frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x) = x + \frac{y - h(x)}{\partial h / \partial x} \tag{3.235}$$

For an efficient unbiased estimator to exist this expression must be independent of x. Suppose h(x) = x³. In this case we find:

$$x + \frac{y - h(x)}{\partial h / \partial x} = x + \frac{y - x^3}{3x^2} = \frac{2}{3}x + \frac{y}{3x^2} \tag{3.236}$$

This is a function of x, so we can tell that no efficient estimator exists when h(x) = x³.
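The existence test of Theorem 3.2 is easy to probe numerically. The sketch below evaluates the right-hand side of (3.235) at several trial values of x for a fixed observation; the helper name `efficiency_criterion`, the observation value, and the linear comparison case h(x) = 2x are illustrative assumptions.

```python
def efficiency_criterion(x, y, h, dh):
    """Right-hand side of (3.235): x + (y - h(x)) / h'(x)."""
    return x + (y - h(x)) / dh(x)

y_obs = 4.0
trial_x = (0.5, 1.0, 3.0)

# Linear observation h(x) = 2x: the criterion should not depend on x.
linear = [efficiency_criterion(x, y_obs, lambda v: 2.0 * v, lambda v: 2.0)
          for x in trial_x]

# Cubic observation h(x) = x^3: the criterion varies with x, as in (3.236).
cubic = [efficiency_criterion(x, y_obs, lambda v: v ** 3, lambda v: 3.0 * v ** 2)
         for x in trial_x]

print(linear)
print(cubic)
```

In the linear case every entry equals y/h, the efficient estimator of Example 3.19, while in the cubic case the entries differ, so no single function of y can satisfy (3.222).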

3.9.2 Maximum-Likelihood Estimation

One reasonable approach to estimation in the nonrandom parameter case is the maximum likelihood method. In general, we denote the function pY |X(y | x) viewed as a function of x as the likelihood function.

Definition 3.1 (Maximum-Likelihood Estimate) The maximum likelihood estimate x̂ML(y) is that value of x for which the likelihood function is maximized:

$$\hat{x}_{ML}(y) = \arg\max_{x}\; p_{Y|X}(y \mid x) \tag{3.237}$$

Let us give a graphical interpretation to this maximization. As shown in Figure 3.8(a), for each value of x we obtain a density for y. In this case pY |X(y | x) is viewed as a function of y for each fixed x. At a particular observation y = y0 we can imagine finding the value of these densities as we change x. The figure shows the values of the densities for two such values of x (x1 and x2) for a given observation. If we now plot these values as a function of x we obtain the graph shown in Figure 3.8(b). In this case pY |X(y0 | x) is viewed as a function of x with y = y0 fixed. Note that pY |X(y0 | x) (i.e. the function plotted in (b)) is not a density, but rather a graph of the density values as the parameter is changed. For example there is no requirement that pY |X(y0 | x) integrated over x sum to one. The ML estimate is the maximum of this graph. Finally, in Figure 3.8(c) we plot pY |X(y | x) as a function of both x and y. In this view, the ML estimate is the maximum of the corresponding surface in the x direction for the given observed value of y = y0.

In practice, we often work with the logarithm of the likelihood function ln pY |X(y | x) which is called the log likelihood function. If the maximum is interior to the range of x and the log likelihood function has a continuous first derivative, then the ML estimate must satisfy the following ML equation:

$$\left. \frac{\partial \ln p_{Y|X}(y \mid x)}{\partial x} \right|_{x = \hat{x}_{ML}(y)} = 0 \tag{3.238}$$

To show a fundamental and important property of ML estimates, consider the condition for an efficient estimate given in (3.222). Now if x̂(y) is any efficient estimate then it must satisfy (3.222) evaluated at any value of x. Suppose we use x = x̂ML(y). From the definition (3.238) we see that when x = x̂ML(y) the second term in (3.222) is equal to zero. Thus:

x̂(y) = x̂ML(y) (3.239)

so that if an efficient unbiased estimator exists it must be an ML estimator.


[Figure 3.8, panels (a), (b), (c). Panel (a) shows the densities pY|X(y | x1) and pY|X(y | x2) against y with the observation y0 marked; panel (b) shows pY|X(y0 | x) against x with x1, x2 and x̂ML marked; panel (c) shows the surface pY|X(y | x) over x and y.]

Figure 3.8: Interpretation of the ML Estimator: (a) pY|X(y | x) viewed as a function of y for fixed values of x, (b) pY|X(y | x) viewed as a function of x for fixed y, (c) pY|X(y | x) viewed as a function of both x and y. For a given observation y0, x̂ML(y) is the maximum with respect to x for the given y = y0.

Example 3.21 Consider the problem of Example 3.19 again. Let us find the ML estimator for this case:

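\begin{align}
\ln\left(p_{Y|X}(y \mid x)\right) &= -\ln\left(\sqrt{2\pi r}\right) - \frac{(y - hx)^2}{2r} \tag{3.240} \\
\implies \frac{\partial}{\partial x} \left[\ln p_{Y|X}(y \mid x)\right] &= \frac{2hy - 2h^2 x}{2r} = 0 \tag{3.241} \\
\implies \hat{x}_{ML}(y) &= \frac{y}{h} \tag{3.242}
\end{align}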

Thus the estimator we examined in Example 3.19 was really the ML estimator. We already know it is unbiased and efficient.

Example 3.22 Suppose the observation y ≥ 0 is given by an exponential distribution with parameter x:

$$p_{Y|X}(y \mid x) = \frac{1}{x}\, e^{-y/x} \tag{3.243}$$

where x > 0. The maximum likelihood estimate is obtained from:

\begin{align}
\frac{\partial}{\partial x} \left[\ln p_{Y|X}(y \mid x)\right] = \frac{\partial}{\partial x} \left[-\ln x - \frac{y}{x}\right] &= -\frac{1}{x} + \frac{y}{x^2} = 0 \tag{3.244} \\
\implies \hat{x}_{ML}(y) &= y \tag{3.245}
\end{align}

Now the bias of this estimate is given by:

$$E[x - \hat{x}(y)] = E[x - y] = 0 \tag{3.246}$$

so the ML estimate is unbiased. Next let's find the variance:

$$\lambda(x) = \mathrm{Var}(\hat{x}(y)) = E\left[(x - \hat{x}(y))^2\right] = E\left[(y - x)^2\right] = x^2 \tag{3.247}$$

since the variance of the exponentially distributed random variable y is x². Note that the error variance is a function of x in this problem. Now let's compute the CRB:

$$I_y(x) = E\left[\left(\frac{\partial}{\partial x} \ln p_{Y|X}(y \mid x)\right)^2\right] = E\left[\frac{(y - x)^2}{x^4}\right] = \frac{x^2}{x^4} = \frac{1}{x^2} \tag{3.248}$$

where we have again used the fact that the variance of y is x². We find that:

$$\lambda(x) = x^2 = \frac{1}{I_y(x)} \tag{3.249}$$

and thus the ML estimate is also efficient.
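The two quantities in (3.247) and (3.248) can be estimated by simulation. The following sketch is illustrative only (the function name `exponential_ml_check` and the value x = 3 are assumptions); it draws exponential samples with mean x and averages the squared error of x̂ML(y) = y and the squared score (y − x)/x².

```python
import random

def exponential_ml_check(x, n_trials=200_000, seed=2):
    """Monte Carlo error variance of x_hat = y and Fisher information for Exp(mean x)."""
    rng = random.Random(seed)
    ys = [rng.expovariate(1.0 / x) for _ in range(n_trials)]   # rate 1/x gives mean x
    err_var = sum((y - x) ** 2 for y in ys) / n_trials
    # Score: d/dx ln p(y | x) = -1/x + y/x^2 = (y - x) / x^2, from (3.244)
    fisher = sum(((y - x) / x ** 2) ** 2 for y in ys) / n_trials
    return err_var, fisher

err_var, fisher = exponential_ml_check(x=3.0)
print(err_var, 1.0 / fisher)   # both should be close to x^2 = 9
```

Agreement between the two printed numbers, up to sampling error, is the numerical counterpart of (3.249).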

3.9.3 Comparison to MAP estimation

Finally let us compare MAP and ML estimation. Recall from our treatment of the detection problem that these two forms of detection were closely related. Specifically, the ML detection rule was presented as a special case of MAP detection wherein the prior probabilities of the hypotheses were the same. We will see that, despite their differences, a similar tie can be made in the estimation context. To this end, consider the estimation problem of Example 3.19 again, where we desire an estimate of x based on the observation:

y = hx + w w ∼ N(0,r) (3.250)

We have already found the ML estimate and estimation error variance for this problem in (3.242) and (3.227). Now suppose that in addition we have the following prior information on x:

x ∼ N(mx,λx) (3.251)

In this case the MAP estimate is given by:

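$$\hat{x}_{MAP}(y) = m_x + \frac{h\lambda_x}{h^2\lambda_x + r}\,(y - h m_x) = \underbrace{\left[\frac{r/h^2}{\lambda_x + r/h^2}\right]}_{\text{Prior Weight}} \underbrace{m_x}_{\text{Prior Est}} + \underbrace{\left[\frac{\lambda_x}{\lambda_x + r/h^2}\right]}_{\text{Obs Weight}} \underbrace{\frac{y}{h}}_{\hat{x}_{ML}(y)} \tag{3.252}$$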

We can see that the MAP estimate is composed of two parts: a part due to the prior mean mx and a part that corresponds precisely to the ML estimate y/h. These two parts are weighted according to their relative reliability. In particular, suppose λx → ∞; then we see that x̂MAP(y) → x̂ML(y). That is, as the prior information goes to zero the MAP estimate approaches the ML estimate. These observations are summarized in Table 3.1.
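The weighting in (3.252) can be sketched in a few lines. The function names `ml_estimate` and `map_estimate` and the numerical values are hypothetical; the code simply evaluates the weighted-average form and shows the limit of a very diffuse prior.

```python
def ml_estimate(y, h):
    """ML estimate for y = h x + w, eq. (3.242)."""
    return y / h

def map_estimate(y, h, r, m_x, lam_x):
    """MAP estimate (3.252) written as a weighted average of prior mean and ML estimate."""
    obs_weight = lam_x / (lam_x + r / h ** 2)
    prior_weight = 1.0 - obs_weight            # equals (r/h^2) / (lam_x + r/h^2)
    return prior_weight * m_x + obs_weight * ml_estimate(y, h)

y, h, r, m_x = 3.0, 2.0, 1.0, 0.5              # illustrative values
for lam_x in (0.1, 2.0, 1e6):
    print(lam_x, map_estimate(y, h, r, m_x, lam_x))
print("ML:", ml_estimate(y, h))
```

As the prior variance grows the printed MAP estimates move from the prior mean toward y/h.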

This discussion has focussed on comparison of MAP and ML approaches for a particular example. More generally we can compare the MAP and ML equations that the corresponding estimates must satisfy:

\begin{align}
\text{MAP Equation:} \quad & \left. \left[ \frac{\partial \ln p_{Y|X}(y \mid x)}{\partial x} + \frac{\partial \ln p_X(x)}{\partial x} \right] \right|_{x = \hat{x}_{MAP}(y)} = 0 \notag \\
\text{ML Equation:} \quad & \left. \left[ \frac{\partial \ln p_{Y|X}(y \mid x)}{\partial x} \right] \right|_{x = \hat{x}_{ML}(y)} = 0 \tag{3.253}
\end{align}

We can directly see that x̂MAP(y) → x̂ML(y) as pX(x) becomes flatter and flatter over all of x, so that ∂ ln pX(x)/∂x → 0. Again, this implies that our prior information is going to zero as well, since the pdf for X approaches a uniform distribution over its entire range. Thus we see, as in the detection problem, that, for the same observation model, as the prior becomes more uniform in the MAP case, the MAP estimate approaches the ML estimate.


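| MAP Estimation | ML Estimation |
|---|---|
| $y = hx + v,\ v \sim N(0, r),\ x \sim N(m_x, \lambda_x)$ | $y = hx + v,\ v \sim N(0, r)$ |
| $\hat{x}_{MAP}(y) = \underbrace{\left[\frac{r/h^2}{\lambda_x + r/h^2}\right]}_{\text{Prior Weight}} \underbrace{m_x}_{\text{Prior Est}} + \underbrace{\left[\frac{\lambda_x}{\lambda_x + r/h^2}\right]}_{\text{Obs Weight}} \underbrace{\frac{y}{h}}_{\hat{x}_{ML}(y)}$ | $\hat{x}_{ML}(y) = \frac{y}{h}$ |
| $\underbrace{\left(\frac{1}{\lambda_{MAP}}\right)}_{\text{Total Info}} = \underbrace{\left(\frac{1}{\lambda_x}\right)}_{\text{Prior Info}} + \underbrace{\left(\frac{1}{\lambda_{ML}}\right)}_{\text{Obs Info}} \implies \lambda_{MAP} = \frac{1}{\frac{1}{\lambda_x} + \frac{1}{\lambda_{ML}}} \le \lambda_{ML}$ | $\lambda_{ML} = \frac{r}{h^2}$ |

Table 3.1: Comparison of MAP and ML Estimation for a particular example.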

Chapter 4

Recursive LLSE: The Kalman Filter

4.1 Introduction

In this Chapter we will study the recursive computation of the LLSE estimate of a sequence of random vectors. Such recursive computation of the LLSE is the centerpiece of the Kalman Filter, which is used throughout science and engineering. We will restrict our attention to the discrete-time case for simplicity. The flavor of the results for the continuous-time case is similar, but the theoretical development is more complicated. We will start by studying the simpler problem of recursively estimating a (static) random vector. Using the insights we develop there we will tackle the case of recursively estimating a random process. Throughout this chapter we will be concerned with LLSE estimates, which are also the MMSE estimates for the Gaussian case.

4.2 Historical Context

Before proceeding to the mathematical developments leading to the Kalman filter, it is again useful to first consider the historical context for its development. Let us begin with a brief history of Rudolf Kalman, the inventor of the Kalman filter.

Rudolf Kalman was born May 19, 1930 in Budapest, Hungary and is currently a Professor of Mathematics (Emeritus) at the Swiss Federal Institute of Technology (ETH) in Switzerland. He emigrated from Hungary to the United States with his family toward the end of World War II. He received his bachelor’s degree in 1953 and his master’s degree in 1954, both in Electrical Engineering from MIT. His master’s thesis topic was the behavior of the solutions of second-order difference equations. He continued his studies at Columbia University, where he received his Sc.D. in 1957 working on problems related to control theory. In 1958 he joined the Research Institute for Advanced Study (RIAS) where he worked from 1958 to 1964. It was during this time that he developed the Kalman filter.

Thus, Kalman was actively involved with the Kalman filter development during the late 1950’s and early 1960’s. This was the start of the computer revolution and Kalman’s view of the LLSE estimation problem must be understood in this context. In particular, Kalman derived his solution by viewing the problem in state-space form, which led him to an associated dynamic definition of the optimal filter as an algorithm. Contrast this to the explicit expression of the filter provided by Wiener. By taking such a state-space view, non-stationarities in the underlying processes could be dealt with as well. Note that Kalman’s solution would not have helped the engineers of Wiener’s time, had they even had it, because it requires a computer to implement. Thus, beyond his discovery of the filter bearing his name, Kalman’s contribution was showing the field a different way of conceptualizing what is meant by a “solution” to the LLSE problem – a conceptualization matched to the implementational paradigm of the times.

4.3 Recursive Estimation of a Random Vector

As a preamble, note that, in this Chapter and the remaining chapters, we use lower case notation to refer to random variables as well as their values. It will be clear by the context whether we mean the value or the random variable itself. This is consistent with the notation used in stochastic processes texts.


We begin our treatment by examining the simpler problem of estimating a random vector based on observation of a discrete-time vector random process (i.e. a series of random vectors). In particular, consider the following LLSE estimation problem. Let $y_0, y_1, \ldots$ be a sequence of random vectors, and let $\hat{x}_k$ denote the LLSE estimate of $x$ based on observation of $y_0, y_1, \ldots, y_k$. Let $\Sigma_k$ denote the corresponding error covariance of this estimate (so that e.g. $\mathrm{tr}(\Sigma_k) = \mathrm{MSE}$). What we would like to do is to develop a recursive procedure for computing $\hat{x}_{k+1}$ and $\Sigma_{k+1}$ from the previous estimate $\hat{x}_k$, $\Sigma_k$, the new observation $y_{k+1}$ and their joint second-order statistics. Ideally, we would like to use only the new measurement to perform this update.

Discrete-time Innovations Process

We will proceed by using the discrete-time innovations process. We saw the value of using an innovations approach in our treatment of the Wiener filter, and a similar approach will aid us here. To this end, let

$$e_k = x - \hat{x}_k \tag{4.1}$$

Then

$$E[e_k] = 0, \qquad E\left[e_k e_k^T\right] = \Sigma_k \tag{4.2}$$

Note that from the geometric characterization of the LLSE estimate we have that the error is uncorrelated with all the observations in the past:

$$E\left[e_k y_j^T\right] = 0, \quad \text{for all } j = 0, 1, \ldots, k. \tag{4.3}$$

We can restate our original problem as follows: Compute the LLSE estimate of

$$x = \hat{x}_k + e_k \tag{4.4}$$

given the information in the vector

$$Y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{k+1} \end{bmatrix} \tag{4.5}$$

To solve this problem we can apply the analysis in Example 3.18. In doing this note that $\hat{x}_k$ is a deterministic linear function of $y_0, y_1, \ldots, y_k$ so that its LLSE estimate based on these vectors is just the function itself. Further, thanks to (4.3), the error $e_k$ is uncorrelated with $\hat{x}_k$. Therefore, applying (3.168), we have that:

$$\hat{x}_{k+1} = \hat{x}_k + \hat{e}_k(Y) \tag{4.6}$$

where $\hat{e}_k(Y)$ is the LLSE estimate (MMSE estimate if the random variables are jointly Gaussian) of $e_k$ based on $Y$. We can write an explicit expression for this estimate as

$$\hat{e}_k(Y) = \Sigma_{eY} \Sigma_Y^{-1} (Y - m_Y) \tag{4.7}$$

Note that, thanks to the orthogonality properties in (4.3), we have

$$\Sigma_{eY} = E\{e_k Y^T\} = \begin{pmatrix} 0 & 0 & \cdots & 0 & E\left[e_k y_{k+1}^T\right] \end{pmatrix} \tag{4.8}$$

However, this is not enough to guarantee that (4.7) is a function only of $y_{k+1}$ (minus its mean). Indeed, if $y_0, y_1, \ldots, y_{k+1}$ are all correlated, then $\Sigma_Y$ is a full matrix, and in general $\hat{e}_k(Y)$ is a function of all these measurements. Suppose, however, that $y_{k+1}$ is uncorrelated with $y_0, y_1, \ldots, y_k$. Then,

$$\Sigma_Y = \begin{bmatrix} \mathrm{Cov}\left( \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_k \end{bmatrix}, \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_k \end{bmatrix} \right) & 0 \\ 0 & \Sigma_{y_{k+1}} \end{bmatrix} \tag{4.9}$$

where

$$\Sigma_{y_{k+1}} = \mathrm{Cov}\left(y_{k+1}, y_{k+1}\right) \tag{4.10}$$

Then, from (4.6)-(4.10), we have that

$$\hat{x}_{k+1} = \hat{x}_k + E\left[e_k y_{k+1}^T\right] \Sigma_{y_{k+1}}^{-1} \left(y_{k+1} - m_{y_{k+1}}\right) \tag{4.11}$$

Also, in this case, from (3.173), (4.8)–(4.10) and a bit of algebra, we obtain:

$$\Sigma_{k+1} = \Sigma_{e_{k+1}} = \Sigma_k - E\left[e_k y_{k+1}^T\right] \Sigma_{y_{k+1}}^{-1} E\left[e_k y_{k+1}^T\right]^T \tag{4.12}$$

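A simple way to recognize the correctness of the above formula is to notice that, from (4.11), we have

$$e_{k+1} = e_k - E\left[e_k y_{k+1}^T\right] \Sigma_{y_{k+1}}^{-1} \left(y_{k+1} - m_{y_{k+1}}\right) \tag{4.13}$$

Rearranging, $e_k$ is the sum of $e_{k+1}$ and the term $E\left[e_k y_{k+1}^T\right] \Sigma_{y_{k+1}}^{-1} (y_{k+1} - m_{y_{k+1}})$, and these two terms are uncorrelated: the second term is the LLSE estimate of $e_k$ based on $y_{k+1}$, and $e_{k+1}$ is the error in that estimate, which is orthogonal to $y_{k+1}$. Thus, the covariance of $e_k$ is the sum of the covariances of these two terms; that is,

$$\Sigma_k = \Sigma_{k+1} + E\left[e_k y_{k+1}^T\right] \Sigma_{y_{k+1}}^{-1} E\left[e_k y_{k+1}^T\right]^T \tag{4.14}$$

which yields (4.12). The consequences of (4.11), (4.12) are substantial: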

• In (4.11), we have the recursive equation we desire. The updated estimate $\hat{x}_{k+1}$ equals the previous estimate $\hat{x}_k$ plus the estimate of the previous estimation error based only on the latest measurement $y_{k+1}$ (it is here where the lack of correlation with the previous data is needed).

• Indeed, this lack of correlation has reduced our problem to the standard static estimation formula. Specifically, if we regard $\hat{x}_k$ as our prior mean, then equations (4.11), (4.12) are exactly the same as (3.135), (3.136)! That is, we use our latest measurement to estimate the remaining random portion, $e_k$, of $x$ and reduce the covariance according to the standard LLSE estimation formula for estimating $e_k$ based on $y_{k+1}$.

While this is quite nice, we don’t usually have the luxury of having an uncorrelated measurement sequence. However, what we can imagine doing in this case is the following. First note that if $\nu = Gy + b$ is an invertible transformation of $y$ with $G$ and $b$ deterministic, the information content of $y$ and $\nu$ are identical, and so are the LLSEs based on the two vectors. Now, suppose we can find such a transformation on the measurement sequence $y_0, y_1, \ldots, y_{k+1}$ so that:

• For each $k$ the transformation is such that $\nu_k$ is a linear function of $y_0, y_1, \ldots, y_k$, and the map $y_0, \ldots, y_k \to \nu_0, \ldots, \nu_k$ is invertible.

• The $\nu_j$, $j = 0, \ldots, k$ form an uncorrelated sequence of random variables.

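Then, since $\hat{x}_k$ is the LLSE of $x$ based on either $y_0, \ldots, y_k$ or $\nu_0, \ldots, \nu_k$, and thanks to (4.11), (4.12) and the lack of correlation of the sequence of $\nu_j$, we have

\begin{align}
\hat{x}_{k+1} &= \hat{x}_k + E\left[e_k \nu_{k+1}^T\right] \Sigma_{\nu_{k+1}}^{-1} \left[\nu_{k+1} - m_{\nu_{k+1}}\right] \tag{4.15} \\
\Sigma_{k+1} &= \Sigma_k - E\left[e_k \nu_{k+1}^T\right] \Sigma_{\nu_{k+1}}^{-1} E\left[e_k \nu_{k+1}^T\right]^T \tag{4.16}
\end{align}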

The process $\nu_0, \ldots, \nu_k$ is known as the innovations process, and can be obtained as a result of the basic properties of LLSE. Specifically, let $\hat{y}(k \mid k-1)$ denote the LLSE of the vector $y_k$ based on observation of the vectors $y_0, \ldots, y_{k-1}$, and define

\begin{align}
\nu_0 &= y_0 - m_{y_0} \tag{4.17} \\
\nu_k &= y_k - \hat{y}(k \mid k-1), \quad k = 1, \ldots \tag{4.18}
\end{align}


Then $\nu_k$ is obviously a function of $y_0, \ldots, y_k$. To show that there is no loss of information, so that we can recover $y_0, \ldots, y_k$ from the innovations sequence $\nu_0, \ldots, \nu_k$, note that this is obviously true for $k = 0$. Proceeding by induction, assume that it is also true for all $j \le k - 1$. Then, $\hat{y}(k \mid k-1)$ is also the least squares estimate of $y_k$ based on $\nu_0, \ldots, \nu_{k-1}$, so it is a linear function of the past innovations, and, from (4.18) we have

$$y_k = \nu_k + \hat{y}(k \mid k-1) \tag{4.19}$$

which shows that it is true for $j = k$ as well. Note finally that since $\nu_k$ is the estimation error in estimating $y_k$ based on $y_0, \ldots, y_{k-1}$, it is also uncorrelated with $y_0, \ldots, y_{k-1}$ and thus with $\nu_0, \ldots, \nu_{k-1}$. Thus, the sequence $\nu_0, \ldots, \nu_k$ satisfies the conditions we were looking for, and is a zero-mean innovations process.

Let us make several comments. Note first that the computation of the $\nu_k$ involves the solution of a sequence of LLSE problems. Thus, for the computational efficiency of (4.15)–(4.16) to be of real value, the computation of these LLSEs must be simple. While this is not always the case, it is true for the very important class of models discussed in this chapter. Finally, we note that the procedure which we have described for constructing the innovations process is the well-known Gram-Schmidt Orthogonalization Procedure for obtaining a set of orthogonal vectors from a set of linearly independent vectors. Indeed, there are strong geometric interpretations associated with the construction of the innovations sequence.

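Another insight which we obtain from the innovations sequence is that it represents a particular factorization of the covariance matrix of the observations. To illustrate this, consider a random vector $y$ with components $y_1, \ldots, y_p$, and with covariance $\Sigma_y = (\Sigma_{ij})$. In this case, constructing the innovations one component at a time, and using (3.135), (3.136) and the uncorrelated property of the innovations, we have the following:

\begin{align}
\nu_1 &= y_1, & \Sigma_{\nu_1} &= \Sigma_{11} \notag \\
\nu_2 &= y_2 - a_{21}\nu_1, & \Sigma_{\nu_2} &= \Sigma_{22} - a_{21}^2 \Sigma_{\nu_1} \tag{4.20}
\end{align}

and more generally,

$$\nu_k = y_k - a_{k1}\nu_1 - \ldots - a_{k,k-1}\nu_{k-1}, \qquad \Sigma_{\nu_k} = \Sigma_{kk} - a_{k1}^2 \Sigma_{\nu_1} - \ldots - a_{k,k-1}^2 \Sigma_{\nu_{k-1}} \tag{4.21}$$

where $a_{kj} = \Sigma_{y_k \nu_j} / \Sigma_{\nu_j}$, $j = 1, \ldots, k-1$. These coefficients can be computed recursively, since $\nu_j = y_j - a_{j1}\nu_1 - \ldots - a_{j,j-1}\nu_{j-1}$ gives:

\begin{align}
\Sigma_{y_k \nu_1} &= \Sigma_{k1} \notag \\
\Sigma_{y_k \nu_j} &= \Sigma_{kj} - a_{j1} \Sigma_{y_k \nu_1} - \ldots - a_{j,j-1} \Sigma_{y_k \nu_{j-1}}, \quad j = 2, \ldots, k-1. \tag{4.22}
\end{align}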

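Note that this procedure uses the elements of the matrix $\Sigma_y$ to construct a lower triangular matrix

$$A = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_{21} & 1 & 0 & \cdots & 0 \\ a_{31} & a_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & a_{p3} & \cdots & 1 \end{bmatrix} \tag{4.23}$$

so that

$$\begin{bmatrix} y_1 \\ \vdots \\ y_p \end{bmatrix} = A \begin{bmatrix} \nu_1 \\ \vdots \\ \nu_p \end{bmatrix} \tag{4.24}$$

Taking covariances of both sides of (4.24) we have

\begin{align}
\Sigma_y &= A \Sigma_\nu A^T \tag{4.25} \\
\Sigma_\nu &= \mathrm{diag}\left(\Sigma_{\nu_1}, \ldots, \Sigma_{\nu_p}\right) \tag{4.26}
\end{align}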

Equation (4.25) yields an LDU (lower-triangular, diagonal, upper-triangular) factorization of $\Sigma_y$. Since the lower and upper triangular parts of the factorization are transposes of each other, this factorization is known as a Cholesky factorization in linear algebra. Note that the matrix $A$ is trivially invertible, since it has 1 as its diagonal elements. Once this factorization is available, the inverse of $\Sigma_y$ can be computed directly, as

$$\Sigma_y^{-1} = \left(A^T\right)^{-1} \Sigma_\nu^{-1} A^{-1} \tag{4.27}$$

Note also that the matrix $A^{-1}$ is also lower-triangular.

What we have so far is a way, based on the innovations associated with a data sequence, to recursively estimate a random vector. In the next section we will use these results to find the LLSE estimate of a discrete-time stochastic process from such a data sequence. The solution to this problem is the famed Kalman filter.
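The component-by-component construction in (4.20)–(4.26) can be sketched with NumPy as follows. This is an illustration under stated assumptions: the components are scalar and zero-mean, the function name `innovations_factorization` and the example covariance are hypothetical, and the cross-covariance Cov(y_k, ν_i) is taken as a_ki Σ_νi, which follows from the definition of a_ki.

```python
import numpy as np

def innovations_factorization(sigma_y):
    """Return unit lower-triangular A and innovation variances with Sigma_y = A diag(.) A^T."""
    p = sigma_y.shape[0]
    A = np.eye(p)
    sigma_nu = np.zeros(p)
    for k in range(p):
        for j in range(k):
            # Cov(y_k, nu_j) = Sigma_kj - sum_{i<j} a_ji Cov(y_k, nu_i),
            # with Cov(y_k, nu_i) = a_ki * sigma_nu_i.
            cross = sigma_y[k, j] - sum(A[j, i] * A[k, i] * sigma_nu[i] for i in range(j))
            A[k, j] = cross / sigma_nu[j]
        # Innovation variance, as in (4.21)
        sigma_nu[k] = sigma_y[k, k] - sum(A[k, j] ** 2 * sigma_nu[j] for j in range(k))
    return A, sigma_nu

sigma_y = np.array([[4.0, 2.0, 1.0],
                    [2.0, 3.0, 0.5],
                    [1.0, 0.5, 2.0]])          # illustrative covariance
A, sigma_nu = innovations_factorization(sigma_y)
print(A)
print(sigma_nu)
print(np.allclose(A @ np.diag(sigma_nu) @ A.T, sigma_y))
```

Because A is unit lower-triangular and Σν is diagonal, the inverse in (4.27) only requires a triangular solve and a division by the innovation variances.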

4.4 The Discrete-Time Kalman Filter

In this section, we use the innovations theory to solve an estimation problem for a stochastic process evolving in discrete time. The exposition in this section is entirely in terms of a vector-valued stochastic process x, observed from another vector-valued stochastic process y. The exposition is presented in terms of finding the MMSE estimate for Gaussian random vectors. However, recall that the same algorithm must obtain the LLSE estimator for any process, given the first-order and second-order statistics of the process.

Now to develop the DT Kalman Filter, consider the dynamic system:

x(t + 1) = A(t)x(t) + B(t)u(t) + G(t)w(t) (4.28)

y(t) = C(t)x(t) + v(t) (4.29)

where $x(t) \in \mathbb{R}^n$, $y(t) \in \mathbb{R}^p$, $u(t)$ is a known input, and $w(t)$, $v(t)$ are independent, zero-mean, Gaussian white noise processes, with

\begin{align}
E[w(t)w^T(s)] &= Q(t)\,\delta(t-s) \tag{4.30} \\
E[v(t)v^T(s)] &= R(t)\,\delta(t-s) \tag{4.31} \\
E[w(t)v^T(s)] &= 0 \tag{4.32}
\end{align}

Thus, our assumptions imply that the random vectors w(t), w(s) are uncorrelated for t ≠ s. Similarly v(t), v(s) are uncorrelated for t ≠ s, and w(t), v(s) are uncorrelated for any s, t.

In addition, assume that the initial conditions x(t0) are Gaussian, with mean mx(t0) and covariance Px(t0), and suppose that the initial conditions are independent of the process noise w(t) and the measurement noise v(t) for all times t = 0, . . .. We also assume that the measurement noise covariance is positive definite (R(t) > 0), and thus invertible (i.e. there are no perfect observations). Note that the assumption that x(t0), w(t), v(t) are Gaussian implies that all of the random variables are Gaussian. In this context, independence and uncorrelatedness are equivalent concepts. Thus, extension of the innovations concept to this problem is equivalent to requiring that the innovations be independent.

We will use the following notation throughout the development:

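\begin{align}
\hat{x}(t \mid s) &= \text{LLSE of } x(t) \text{ given } y(\tau),\ \tau \le s \tag{4.33} \\
e(t \mid s) &= x(t) - \hat{x}(t \mid s) \tag{4.34} \\
P(t \mid s) &= E\left[e(t \mid s)\, e(t \mid s)^T\right] \tag{4.35}
\end{align}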

We are interested in developing a recursive approach for computation of x̂(t | t − 1) or x̂(t | t). This is the problem of performing optimal causal filtering. The solution of this problem is the Kalman Filter. The discrete-time Kalman filter is often presented as a series of three steps. In particular, each step can be considered as an estimation sub-problem in its own right. We consider each of these subproblems next and through their solution find the solution to the overall Kalman filter.

4.4.1 Initialization

First we have an initialization step:

\begin{align}
\hat{x}(t_0 \mid t_0 - 1) &= m_x(t_0) \tag{4.36} \\
P(t_0 \mid t_0 - 1) &= P_x(t_0) \tag{4.37}
\end{align}

That is, before any measurements are taken, the best estimates are based solely on the prior information.


4.4.2 Measurement Update Step

We start the actual estimation process by updating the estimate from the previous step to take into account the observation at the current time. That is:

Suppose we have: x̂(t | t− 1), P(t | t− 1) And we observe: y(t)

Now compute: x̂(t | t), P(t | t)

The solution of the update step is just a direct application of the analysis in the preceding subsections. Specifically, let us define the innovations

ν(t) = y(t) − ŷ(t | t− 1) (4.38)

Since v(t) is uncorrelated with y(0), . . ., y(t− 1) and with x(t), we can readily compute

ŷ(t | t− 1) = C(t)x̂(t|t− 1), ν(t) = C(t)e(t | t− 1) + v(t) (4.39)

and

Pν(t) = E [ ν(t)ν(t)T

] = C(t)P(t | t− 1)CT (t) + R(t) (4.40)

Writing x(t) = e(t | t− 1) + x̂(t|t− 1), and using the fact that the innovations are zero-mean, we can then apply the estimation formula for Gaussian random vectors with Gaussian observations to obtain the update relations:

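\begin{align}
\hat{x}(t \mid t) &= \hat{x}(t \mid t-1) + E\left[e(t \mid t-1)\,\nu(t)^T\right] P_\nu^{-1}(t)\, \nu(t) \tag{4.41} \\
P(t \mid t) &= P(t \mid t-1) - E\left[e(t \mid t-1)\,\nu(t)^T\right] P_\nu^{-1}(t)\, E\left[e(t \mid t-1)\,\nu(t)^T\right]^T \tag{4.42}
\end{align}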

Furthermore,

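\begin{align}
E\left[e(t \mid t-1)\,\nu(t)^T\right] &= E\left[e(t \mid t-1)\left(C(t)e(t \mid t-1) + v(t)\right)^T\right] \notag \\
&= E\left[e(t \mid t-1)\,e(t \mid t-1)^T\right] C^T(t) + E\left[e(t \mid t-1)\,v(t)^T\right] \notag \\
&= P(t \mid t-1)\, C^T(t) \tag{4.43}
\end{align}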

where we have used the definition of P(t|t−1) and the fact that the measurement noise v(t) is uncorrelated (independent) of x(t) and y(t0), . . . ,y(t− 1). Substituting (4.43) into (4.41), (4.42) yields the following set of equations for the update step:

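\begin{align}
\hat{x}(t \mid t) &= \hat{x}(t \mid t-1) + P(t \mid t-1)C^T(t)\left[C(t)P(t \mid t-1)C^T(t) + R(t)\right]^{-1}\left[y(t) - C(t)\hat{x}(t \mid t-1)\right] \tag{4.44} \\
P(t \mid t) &= P(t \mid t-1) - P(t \mid t-1)C^T(t)\left[C(t)P(t \mid t-1)C^T(t) + R(t)\right]^{-1} C(t)P(t \mid t-1) \tag{4.45}
\end{align}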

Notice that this step simply updates the current estimate to take into account the new observation – the dynamic equation is not used.
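The update relations (4.44)–(4.45) translate almost line by line into NumPy. The following is a minimal sketch under the assumptions of this section; the function name `measurement_update` and the two-state example values are hypothetical.

```python
import numpy as np

def measurement_update(x_pred, P_pred, y, C, R):
    """One measurement update: (x(t|t-1), P(t|t-1)) -> (x(t|t), P(t|t))."""
    innovation = y - C @ x_pred                 # nu(t), eqs. (4.38)-(4.39)
    S = C @ P_pred @ C.T + R                    # innovation covariance P_nu(t), eq. (4.40)
    # Gain P C^T S^{-1}; the transpose trick relies on S and P_pred being symmetric.
    K = np.linalg.solve(S, C @ P_pred).T
    x_upd = x_pred + K @ innovation             # eq. (4.44)
    P_upd = P_pred - K @ C @ P_pred             # eq. (4.45)
    return x_upd, P_upd, K

# Illustrative two-state example in which only the first state is observed.
x_pred = np.array([0.0, 1.0])
P_pred = np.diag([2.0, 1.0])
C = np.array([[1.0, 0.0]])
R = np.array([[2.0]])
y = np.array([4.0])
x_upd, P_upd, K = measurement_update(x_pred, P_pred, y, C, R)
print(x_upd, P_upd, K, sep="\n")
```

In this example the unobserved second state is uncorrelated with the first, so its estimate and variance are left unchanged by the update.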

4.4.3 Prediction Step

Now we perform a prediction step, where we use the dynamic equation to generate the best estimate at time t + 1 based only on the data up to time t. That is, we solve the following subproblem:

Suppose we have: x̂(t | t), P(t | t)

Now compute: x̂(t + 1 | t), P(t + 1 | t)


The solution of the prediction step is simple to derive, since by assumption w(t) is independent of (uncorrelated with) y(0), . . .y(t), and B(t)u(t) is deterministic. Thus,

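\begin{align}
\hat{x}(t+1 \mid t) &= E\left[x(t+1) \mid y(0), \ldots, y(t)\right] \tag{4.46} \\
&= E\left[A(t)x(t) + B(t)u(t) + G(t)w(t) \mid y(0), \ldots, y(t)\right] \notag \\
&= A(t)\,E\left[x(t) \mid y(0), \ldots, y(t)\right] + B(t)u(t) \tag{4.47} \\
&= A(t)\hat{x}(t \mid t) + B(t)u(t) \tag{4.48}
\end{align}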

Similarly, we can obtain an expression for the error e(t + 1|t) as

e(t + 1 | t) = x(t + 1) − x̂(t + 1 | t) = A(t)e(t | t) + G(t)w(t) (4.49)

Note that the two terms in the above equation are independent by assumption, because the process noise w(t) is independent of all past and current values of the state x and past and current values of the measurement noise v. Thus, we obtain for the predicted error covariance:

P(t + 1 | t) = E [ e(t + 1 | t)e(t + 1 | t)T

] = A(t)P(t | t)AT (t) + G(t)Q(t)GT (t) (4.50)

Together we have for the prediction step:

\begin{align}
\hat{x}(t+1 \mid t) &= A(t)\hat{x}(t \mid t) + B(t)u(t) \tag{4.51} \\
P(t+1 \mid t) &= A(t)P(t \mid t)A^T(t) + G(t)Q(t)G^T(t) \tag{4.52}
\end{align}

Note this is simply a one-step prediction and does not involve the observation.
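A sketch of the prediction step (4.51)–(4.52) in NumPy follows. The function name `predict` and the constant-velocity style example matrices are assumptions chosen only for illustration.

```python
import numpy as np

def predict(x_upd, P_upd, A, B, u, G, Q):
    """One prediction step: (x(t|t), P(t|t)) -> (x(t+1|t), P(t+1|t))."""
    x_pred = A @ x_upd + B @ u                  # eq. (4.51)
    P_pred = A @ P_upd @ A.T + G @ Q @ G.T      # eq. (4.52)
    return x_pred, P_pred

# Illustrative example: position/velocity state with no control input.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.zeros((2, 1))
u = np.zeros(1)
G = np.eye(2)
Q = 0.1 * np.eye(2)
x_pred, P_pred = predict(np.array([1.0, 2.0]), np.eye(2), A, B, u, G, Q)
print(x_pred)
print(P_pred)
```

The predicted covariance is never smaller than A P Aᵀ, reflecting the uncertainty added by the process noise.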

4.4.4 Summary

Combining these steps we obtain the DT Kalman Filter, which we summarize here for convenience. First, the process in question is described by the following autoregressive dynamic equation and observation equation:

x(t + 1) = A(t)x(t) + B(t)u(t) + G(t)w(t) (4.53)

y(t) = C(t)x(t) + v(t) (4.54)

where u(t) is a known input, and w(t), v(t) are independent, zero-mean, white noise processes, with

\begin{align}
E[w(t)w^T(s)] &= Q(t)\,\delta(t-s) \tag{4.55} \\
E[v(t)v^T(s)] &= R(t)\,\delta(t-s) \tag{4.56} \\
E[w(t)v^T(s)] &= 0 \tag{4.57}
\end{align}

where we assume that R(t) > 0. Also the second order statistics of the initial condition are given by:

$$E[x(t_0)] = m_x(t_0), \qquad E\left[(x(t_0) - m_x(t_0))(x(t_0) - m_x(t_0))^T\right] = P_x(t_0) \tag{4.58}$$

and the initial conditions are independent of the process noise w(t) and the measurement noise v(t) for all times t = 0, . . .. The Kalman filter for this system is given by:

Initialization:

$$ \hat{x}(t_0 \mid t_0 - 1) = m_x(t_0) \tag{4.59} $$

$$ P(t_0 \mid t_0 - 1) = P_x(t_0) \tag{4.60} $$

Update Step:

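$$ \hat{x}(t \mid t) = \hat{x}(t \mid t-1) + P(t \mid t-1)C^T(t) \left[ C(t)P(t \mid t-1)C^T(t) + R(t) \right]^{-1} \left[ y(t) - C(t)\hat{x}(t \mid t-1) \right] \tag{4.61} $$

$$ P(t \mid t) = P(t \mid t-1) - P(t \mid t-1)C^T(t) \left[ C(t)P(t \mid t-1)C^T(t) + R(t) \right]^{-1} C(t)P(t \mid t-1) \tag{4.62} $$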


Prediction Step:

$$ \hat{x}(t+1 \mid t) = A(t)\hat{x}(t \mid t) + B(t)u(t) \tag{4.63} $$

$$ P(t+1 \mid t) = A(t)P(t \mid t)A^T(t) + G(t)Q(t)G^T(t) \tag{4.64} $$

Note that the Kalman filter has an intuitively appealing structure: the filter mimics the noise-free dynamics for prediction (cf (4.63)) and corrects for the difference between the observations y(t) and the best prediction of y(t) based on the preceding data (cf (4.61)).
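The recursion (4.59) to (4.64) translates almost line by line into code. The following is a minimal sketch rather than a reference implementation: it assumes NumPy is available, it uses time-invariant matrices purely for brevity, and the function names `kalman_update`, `kalman_predict`, and `run_filter` are made up for this illustration.

```python
import numpy as np

def kalman_update(x_pred, P_pred, y, C, R):
    """Update step (4.61)-(4.62): fold the observation y(t) into the prediction."""
    S = C @ P_pred @ C.T + R                 # C P C^T + R
    K = P_pred @ C.T @ np.linalg.inv(S)      # gain that multiplies the residual
    x_upd = x_pred + K @ (y - C @ x_pred)    # (4.61)
    P_upd = P_pred - K @ C @ P_pred          # (4.62)
    return x_upd, P_upd

def kalman_predict(x_upd, P_upd, A, B, u, G, Q):
    """Prediction step (4.63)-(4.64): propagate through the dynamics."""
    x_pred = A @ x_upd + B @ u               # (4.63)
    P_pred = A @ P_upd @ A.T + G @ Q @ G.T   # (4.64)
    return x_pred, P_pred

def run_filter(ys, us, m0, P0, A, B, C, G, Q, R):
    """Run the filter over a data sequence; returns the updated (x, P) pairs."""
    x_pred, P_pred = m0, P0                  # initialization (4.59)-(4.60)
    estimates = []
    for y, u in zip(ys, us):
        x_upd, P_upd = kalman_update(x_pred, P_pred, y, C, R)
        estimates.append((x_upd, P_upd))
        x_pred, P_pred = kalman_predict(x_upd, P_upd, A, B, u, G, Q)
    return estimates
```

Here states, inputs, and observations are one-dimensional NumPy arrays and all model quantities are two-dimensional arrays, so a scalar problem is passed as 1x1 matrices. The explicit matrix inverse is kept only to mirror (4.61); a linear solve would be the more usual numerical choice.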

4.4.5 Additional Points

First there are several alternate forms for these equations. In particular, we have for the update step:

$$ \hat{x}(t \mid t) = \hat{x}(t \mid t-1) + K(t)\nu(t) \tag{4.65} $$

$$ K(t) = P(t \mid t-1)C^T(t) \left[ C(t)P(t \mid t-1)C^T(t) + R(t) \right]^{-1} \tag{4.66} $$

$$ = P(t \mid t)C^T(t)R^{-1}(t) \tag{4.67} $$

where K(t) as defined above is the Kalman gain. Also there is the equivalent formula for the error covariance:

$$ P(t \mid t) = \left[ I - K(t)C(t) \right] P(t \mid t-1) \left[ I - K(t)C(t) \right]^T + K(t)R(t)K^T(t) \tag{4.68} $$

Sometimes this form is preferred from a numerical standpoint, since it represents the error covariance as the sum of two positive semi-definite quantities (which must be positive semi-definite) rather than as the difference of two such quantities (which could, with e.g. roundoff, become indefinite).

In addition, for the update step there is the following equivalent “information” form of the error covariance update:

$$ P^{-1}(t \mid t) = P^{-1}(t \mid t-1) + C^T(t)R^{-1}(t)C(t) \tag{4.69} $$

This last form emphasizes some important insights about the Kalman filter. In particular, note that in the prediction step (4.64) it is the covariance, or uncertainty, that is additive. This makes sense since in this step we have no observation but are only taking into account the effect of the dynamic equation, which increases the uncertainty in the problem due to the additive noise w(t). In contrast, consider the update step. In this step, we are accounting for the influence of an observation. By viewing the inverse of a covariance as a measure of information, we can see from (4.69) that it is information that is additive. In other words, the uncertainty increases during the prediction step and decreases during the update step. The structure of this relationship has deep consequences for our ability to efficiently implement the Kalman filter, which are beyond the scope of this course.
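As a quick numerical check, the three forms of the covariance update, (4.62), (4.68), and (4.69), can be compared on a scalar case, where transposes drop out. The values of P(t|t−1), C, and R below are arbitrary numbers chosen only for illustration.

```python
# Arbitrary illustrative scalar values (not taken from the notes)
P_pred, C, R = 2.0, 1.5, 0.5

K = P_pred * C / (C * P_pred * C + R)              # Kalman gain (4.66)

P_standard = P_pred - K * C * P_pred               # (4.62): difference of two terms
P_joseph = (1 - K * C) ** 2 * P_pred + K * R * K   # (4.68): sum of two nonnegative terms
P_info = 1.0 / (1.0 / P_pred + C * C / R)          # (4.69): information form

print(P_standard, P_joseph, P_info)                # the three values agree up to roundoff
print(abs(K - P_info * C / R) < 1e-12)             # alternate gain expression (4.67)
```

In exact arithmetic the three expressions are identical; they differ only in how roundoff error enters, which is the reason one form may be preferred over another in practice.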

Finally, note that the recursion (4.62), (4.64) for the error covariance does not depend on the data y(t) and thus can be precomputed. Consequently, the gain $K(t) = P(t \mid t-1)C^T(t)\left[ C(t)P(t \mid t-1)C^T(t) + R(t) \right]^{-1}$ can also be precomputed. Together, equations (4.62) and (4.64) are referred to as the discrete-time Riccati equation.

4.4.6 Example

We now consider an example.

Example 4.1 The underlying process x(t) is zero mean with its covariance structure implicitly described by the following autoregressive model and observation equation:

x(t + 1) = 0.8x(t) + w(t) (4.70)

y(t) = x(t) + v(t) (4.71)

where w(t) and v(t) are zero mean, wide-sense stationary white noise processes, uncorrelated with each other, with covariance functions $K_{ww}(t) = 0.36\,\delta(t)$ and $K_{vv}(t) = \delta(t)$, respectively. We are also given that $E[x(0)] = 0$ and $E[x^2(0)] = 1$, where x(0) is independent of w(t) and v(t). Our goal is to find the Kalman filter for this problem, that is, to find the linear least squares estimate of x(t) based on the statistical initial condition and the data y(t) observed up to time t. Note that this is a filtering problem, in that the estimate only uses data up to the current time.

We simply apply the Kalman filtering equations in Section 4.4.4 with A = 0.8, B = 0, C = 1, G = 1:


Initialization:

$$ \hat{x}(0 \mid -1) = m_x(0) = 0 \tag{4.72} $$

$$ P(0 \mid -1) = R_{xx}(0) = 1 \tag{4.73} $$

Update Step:

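$$ \hat{x}(t \mid t) = \hat{x}(t \mid t-1) + \underbrace{\left[ \frac{P(t \mid t-1)}{P(t \mid t-1) + 1} \right]}_{\text{Kalman gain } K(t)} \left[ y(t) - \hat{x}(t \mid t-1) \right] \tag{4.74} $$

$$ P(t \mid t) = \frac{P(t \mid t-1)}{P(t \mid t-1) + 1} \tag{4.75} $$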

Prediction Step:

$$ \hat{x}(t+1 \mid t) = 0.8\,\hat{x}(t \mid t) \tag{4.76} $$

$$ P(t+1 \mid t) = (0.8)^2 P(t \mid t) + 0.36 \tag{4.77} $$

[Plot titled "Estimate, Data, and Truth": the estimates x̂(t|t) and x̂(t+1|t) versus time t = 0, ..., 10]

Figure 4.1: Kalman Filtering Example: Estimate

In Figure 4.1 we show an example where a “true” process and noisy observations were generated according to the description of (4.70) and (4.71). This noisy signal was then used as input to the Kalman filtering equations (4.72)–(4.77) to obtain an estimate. The figure shows the original signal in green and the noisy observations in red. In blue are shown both the predicted estimate x̂(t + 1|t) and the updated estimate x̂(t|t) after the current observation is taken into account. The predicted estimates lead from one discrete time point to the next using the system dynamics to propagate the estimate forward. This part of the filter corresponds to the piecewise linear parts of the blue curve. The predicted estimate follows the open-loop decay provided by the system dynamics, and thus in the absence of observations will decay to zero as $(0.8)^t$, which is why all these linear segments are “pointed” to 0.

At each time point the predicted estimate is then corrected to take into account the observation at that time. This correction exhibits itself as the “jumps” in the estimated signal at each time. For this example, where we are observing the points themselves in noise, this will tend to pull the estimate in the direction of the current observation, as can be seen in the figure. Overall, we can view the situation as follows: we have two estimates at each time – one just prior to the observation (that has not seen it yet) and one just after the observation (which has taken it into account). It is not surprising that they are different!
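An experiment of this kind can be reproduced with a short simulation. The sketch below is only an illustration under added assumptions: the example specifies second-order statistics only, so Gaussian noise and a Gaussian initial state are assumed here, and the seed and the function name `simulate_and_filter` are arbitrary choices.

```python
import random

def simulate_and_filter(T=10, seed=0):
    """Simulate (4.70)-(4.71) and run the filter (4.72)-(4.77) for t = 0, ..., T."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)              # x(0): zero mean, E[x^2(0)] = 1 (Gaussian assumed)
    x_pred, P_pred = 0.0, 1.0            # initialization (4.72)-(4.73)
    history = []
    for t in range(T + 1):
        y = x + rng.gauss(0.0, 1.0)      # observation (4.71), noise variance 1
        K = P_pred / (P_pred + 1.0)      # Kalman gain in (4.74)
        x_upd = x_pred + K * (y - x_pred)
        P_upd = P_pred / (P_pred + 1.0)  # (4.75)
        history.append((t, x, y, x_pred, x_upd, P_pred, P_upd))
        x_pred = 0.8 * x_upd             # (4.76)
        P_pred = 0.64 * P_upd + 0.36     # (4.77)
        x = 0.8 * x + rng.gauss(0.0, 0.6)  # dynamics (4.70); std 0.6 gives variance 0.36
    return history

for t, x, y, x_pred, x_upd, P_pred, P_upd in simulate_and_filter():
    print(f"t={t:2d} truth={x:+.3f} data={y:+.3f} pred={x_pred:+.3f} upd={x_upd:+.3f}")
```

Because the gain lies strictly between 0 and 1, each updated estimate falls between the predicted estimate and the observation, which is the "pull toward the observation" described above.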

[Three plots versus time t: (a) "Evolution of Estimation Error Variance", showing P(t|t) and P(t+1|t); (b) "Evolution of P(t|t) (which is MSE)"; (c) "Evolution of Kalman Gain", showing K(t)]

Figure 4.2: Kalman Filtering Example: Covariance and Gain

Now in Figure 4.2 we show the associated error covariance values and the corresponding Kalman gain. First consider Figure 4.2(a), which shows both the predicted error covariance P(t + 1|t) and the updated error covariance P(t|t). Can you guess from intuition which must be which? In the prediction step we have no new data, but are only taking the dynamics into account. Since the dynamic equation is driven by a noise process, our uncertainty will increase. Thus the linear segments connecting one discrete time step with the next are due to the prediction step and the larger value at each time is evidently P(t + 1|t). Next, the observation is taken into account during the update step. The observation helps the estimate, removing uncertainty and accounting for the drop in error covariance at each time. Thus the lower value at each time must be P(t|t). Overall we see the ongoing battle between the dynamics pumping uncertainty into the problem and the observations taking it out, as exhibited in the sawtooth nature of the waveform. In such a display it is difficult to see if any sort of steady state is being reached.

In Figure 4.2(b) we have displayed just the updated covariance P(t|t), which is the lower envelope of the curve in Figure 4.2(a). This is the mean square error for the problem. Note that this value is apparently approaching a steady state as time goes on. In Figure 4.2(c) we have displayed the corresponding Kalman gain as a function of time. It too appears to be approaching a steady state value as time evolves.

The above discussion raises the question of what happens in steady state. If a steady state exists then we would expect that

P = P(t + 1|t) = P(t|t− 1) (4.78)

where P denotes the steady state predicted covariance. Note that we are not saying that P(t + 1|t) will equal P(t|t), which it is clear from Figure 4.2(a) will never be the case! Now substituting the expression for the updated error covariance P(t|t) in (4.75) into the expression for the predicted error covariance P(t + 1|t) in (4.77) we obtain:

$$ P(t+1 \mid t) = (0.8)^2\, \frac{P(t \mid t-1)}{P(t \mid t-1) + 1} + 0.36 \tag{4.79} $$

$$ \Longrightarrow \quad P = (0.8)^2\, \frac{P}{P + 1} + 0.36 \tag{4.80} $$

where we have made the substitution for the steady state error covariance in obtaining the last equation. Solving this equation for P we obtain:

P = 0.6 (4.81)

Substituting this steady state value of the error covariance into the expression for the Kalman gain, we obtain a corresponding steady state value for the Kalman gain:

$$ K = \frac{P}{P + 1} = 0.375 \tag{4.82} $$

We can also find the value for the mean square error in steady state:

$$ \text{MSE} = P(t \mid t) = \frac{P}{P + 1} = 0.375 \tag{4.83} $$
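These steady state values can be checked by simply iterating the scalar recursion (4.75), (4.77) from the initial condition (4.73). The short sketch below does this; the choice of 50 iterations is arbitrary and is far more than needed for this example.

```python
P = 1.0                          # P(0|-1) from (4.73)
for t in range(50):
    K = P / (P + 1.0)            # Kalman gain at time t
    P_upd = P / (P + 1.0)        # updated error variance (4.75)
    P = 0.64 * P_upd + 0.36      # next predicted error variance (4.77)

print(P, K, P_upd)               # should settle near 0.6, 0.375, 0.375
```

Multiplying (4.80) through by P + 1 gives P² = 0.36, so the positive root P = 0.6 is the value the iteration settles to.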

We can now find the filter in steady state:

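$$ \hat{x}(t \mid t) = \hat{x}(t \mid t-1) + K\left[ y(t) - \hat{x}(t \mid t-1) \right] \tag{4.84} $$

$$ = (1 - K)\,0.8\,\hat{x}(t-1 \mid t-1) + K y(t) \tag{4.85} $$

$$ = 0.8\,(1 - 0.375)\,\hat{x}(t-1 \mid t-1) + 0.375\, y(t) \tag{4.86} $$

$$ = \frac{1}{2}\,\hat{x}(t-1 \mid t-1) + 0.375\, y(t) \tag{4.87} $$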

Notice that this is nothing more than a linear time-invariant (LTI) system with input y(t) (i.e. the data) and output x̂(t|t) (i.e. the estimate). Since it is an LTI system we can use transform techniques to find its system function, which relates the input and output:

$$ H_{kf}(z) = \frac{0.375}{1 - \frac{1}{2} z^{-1}} \tag{4.88} $$
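As a sanity check on (4.87) and (4.88), we can drive the steady state recursion with a unit impulse and compare the output with the causal inverse transform of (4.88), namely h(n) = 0.375 (1/2)^n for n ≥ 0. The function name `steady_state_filter` and the zero initial condition are assumptions made for this sketch.

```python
def steady_state_filter(ys, K=0.375, a=0.8):
    """Run x_hat(t|t) = a (1 - K) x_hat(t-1|t-1) + K y(t), starting from rest."""
    x_hat, out = 0.0, []
    for y in ys:
        x_hat = a * (1.0 - K) * x_hat + K * y
        out.append(x_hat)
    return out

impulse = [1.0] + [0.0] * 5
h = steady_state_filter(impulse)
expected = [0.375 * 0.5 ** n for n in range(6)]   # impulse response implied by (4.88)
print(h)
print(expected)
```

The two printed lists agree up to roundoff, confirming that the steady state Kalman filter for this example is a first-order recursive filter with a single pole at z = 1/2.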


Chapter 5

Detection Theory

In this chapter we start our investigation of detection theory, also referred to as hypothesis testing or decision theory. Our goal in these problems is to estimate or infer the value of an unknown “state of nature” based on noisy observations. A general model of this process is shown in Figure 5.1. Nature generates an unknown output H. By convention, we call this output a hypothesis. This outcome generated by nature then probabilistically affects the quantities Y that we are allowed to observe. Based on the uncertain observation Y , we must design a rule to decide what the unknown hypothesis was. In the theory of detection the set of possible hypotheses is taken to be discrete. When the set of possibilities is continuous we are in the realm of estimation, which is discussed in Chapter 3. From Figure 5.1 we see that we will need three components in our model:

1. A model of the generation process that creates H, i.e. a model of nature.

2. A model of the observation process.

3. A decision rule D(y) that maps each possible observation y to an associated decision.

In general, the first two elements are set by “nature” or the restrictions of the physical data gathering situation. For example, if we are trying to decide whether a tumor is cancerous or not, the true state of the tumor is decreed by processes outside of our control and the uncertainty or noise in the observations may arise from the physical processes in generating an X-Ray image. It is generally the last element, decision rule design, where the engineer plays the strongest role. Such decision rules can be of two types: deterministic and random. A deterministic decision rule always assigns the same decision or estimate to the same observation – i.e. when a given observation is seen the same decision is always made. In particular, deterministic decision rules can be viewed as a simple partitioning or labeling of the space of observations into disjoint regions marked with the decision corresponding to each observation, as shown in Figure 5.2. In the case of random decision rules, different decisions may arise from the same observation – i.e. when the same observation is made twice, two different decision outcomes are possible. Such random decision rules play an important role when the observed quantity y is discrete in nature. In general, however, our emphasis will be on the design of deterministic decision rules.

[Block diagram: Phenomenon/Experiment, with p(H), produces H; Transmission/Measurement Process, with p(Y|H), produces Y; Processing/Decision Rule/Estimator produces the decision Ĥ]

Figure 5.1: Detection problem components.

In Section 5.1 we discuss in detail the case that arises when there are only two possible hypotheses, termed binary hypothesis testing. In Section 5.4 we discuss the more general case of M hypotheses. Throughout this chapter we focus on the case of detection based on observations of random variables. In Chapter 11 we examine the more complicated case of detection based on observations of random processes.


[Diagram: the observation space divided into a "Declare H0" region and a "Declare H1" region]

Figure 5.2: Illustration of a deterministic decision rule as a division of the observation space into disjoint regions, illustrated here for the case of two possibilities.

5.1 Bayesian Binary Hypothesis Testing

In this section we consider the simplest case when there are only two possible states of nature or hypotheses, which by convention we label as H0 and H1. This situation is termed “binary hypothesis testing” and the H0 hypothesis is usually termed the “null hypothesis,” due to its typical association with the absence of some quantity of interest. The binary case is of considerable practical importance, as well as having a long and rich history. To give a flavor of the possibilities, let us examine a few examples before proceeding to more detailed developments.

Example 5.1 (Communications) Consider the following simplified version of a communication system, where a source broadcasts one bit (either 0 or 1). The transmitter encodes this bit by a voltage, which is either 0 or E, depending on the bit. The receiver observes a noisy version of the transmitted signal, where the noise is additive and is represented by a Gaussian random variable w with zero mean and variance σ². The receiver knows the signal level E, the noise variance σ², and the a priori probability p(k) that the bit sent was k, where k = 0, 1. The receiver must take the received signal, y, and map it using a rule D(y) into either 0 or 1, depending on the value of y. The problem is to determine the decision rule for which the probability of receiver error is minimized.

In the above example there are two possible hypotheses, H0 and H1, only one of which can be true. These hypotheses correspond to whether the transmitted bit was 0 or 1. There is a probabilistic relationship between the observed variable y and the hypotheses Hi. In particular, the observed variable is y = w for hypothesis H0, and y = E +w for hypothesis H1. The decision rule divides the space of possible observations into two disjoint decision regions, Z0 and Z1, such that, whenever an observation falls into Zi, the decision that Hi is the correct hypothesis is made. In the example, these regions correspond to the values of y for which D(y) = 0 and the values of y for which D(y) = 1. These decision regions are established to maximize an appropriate criterion of performance, corresponding to the probability of a correct decision.

Consider other examples:

Example 5.2 (Radar) A simple radar system makes a scalar observation y to determine the absence or presence of a target at a given range and heading. If a target is present (hypothesis H1), the observed signal is y = E + w, where E is a known signal level, and w ∼ N(0, σ²). If no target is present (hypothesis H0), then only noise is received: y = w. Find the decision rule for maximizing the probability of detecting the target, given a bound on the probability of false alarm.

Example 5.3 (Quality Control) At a factory, an automatic quality control device is used to determine whether a manufactured unit is satisfactory (hypothesis H0) or defective (hypothesis H1), by measuring a simple quality factor q. Past statistics indicate that one out of every 10 units is defective. For satisfactory units, q ∼ N(2, σ²), whereas for defective units, q ∼ N(1, σ²). The quality control device is set to remove all units for which q < t, where t is a threshold to be designed. The problem is to determine the optimal threshold setting in order to maximize the probability of detecting a defect, subject to the constraint that the probability of removing a satisfactory unit is at most 0.005.


All of the above examples illustrate the problem of binary hypothesis testing. We will develop the relevant theory next.

5.1.1 Bayes Risk Approach and the Likelihood Ratio Test

We are now interested in obtaining “good” decision rules for the binary hypothesis testing case. A rational and common approach is to minimize a cost function given our models of the situation. Building on the development of the introduction, the elements of this approach in the binary case are:

1. Model of Nature: In the binary case there are only two possibilities, denoted as H0 and H1. Our knowledge of these possibilities is captured by the prior probabilities Pi = Pr(H = Hi). Note that P1 = 1 −P0.

2. Observation Model: As Figure 5.1 indicates, the observation model captures the relationship between the observed quantity y and the unknown hypothesis H. This relationship is given by the conditional densities $p_{Y|H}(y \mid H_i)$.

3. Decision Rule: Our decision rule D(y) is obtained by minimizing the average cost, called the “Bayes risk.” Let $C_{ij}$ denote the cost of deciding D(y) = Hi when hypothesis Hj is true. Then the Bayes risk of the decision rule is given by:

$$ E\left[ C_{D(y),H} \right] = \sum_{i=0}^{1} \sum_{j=0}^{1} C_{ij} \Pr\left( D(y) = H_i,\ H_j \text{ true} \right) \tag{5.1} $$

Note that the outcome of deciding Hi in (5.1) is random, even if the decision rule is deterministic, because y itself is random. Thus the expectation in (5.1) averages over both the randomness in the true hypothesis Hj (i.e. the randomness in the state of nature) as well as the randomness in the observation, and thus decision outcome (i.e. the randomness in the data).
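To make the double averaging in (5.1) concrete, the Bayes risk of any given rule can be estimated by simulation: draw the hypothesis from the prior, draw an observation given that hypothesis, apply the rule, and average the resulting cost. The sketch below is a hypothetical illustration; the Gaussian observation model, the numbers, and the name `estimate_bayes_risk` are all assumptions made for the example.

```python
import random

def estimate_bayes_risk(rule, costs, P1, sample_y, trials=100_000, seed=0):
    """Monte Carlo estimate of the Bayes risk (5.1).

    rule(y) returns the index i of the decided hypothesis, costs[i][j] is C_ij,
    and sample_y(j, rng) draws an observation under hypothesis H_j.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        j = 1 if rng.random() < P1 else 0   # nature picks the true hypothesis
        y = sample_y(j, rng)                # observation generated under H_j
        total += costs[rule(y)][j]          # cost of deciding H_i when H_j is true
    return total / trials

# Hypothetical setup: y = w under H0 and y = 2 + w under H1, with w ~ N(0, 1)
sample = lambda j, rng: 2.0 * j + rng.gauss(0.0, 1.0)
costs = [[0.0, 1.0], [1.0, 0.0]]            # rows: C00, C01 and C10, C11
risks = {}
for thr in (0.0, 1.0, 2.0):
    risks[thr] = estimate_bayes_risk(lambda y: 1 if y > thr else 0, costs, 0.5, sample)
    print(thr, round(risks[thr], 3))
```

With these particular costs the Bayes risk is the probability of error, and among the three thresholds tried the midpoint between the two means gives the smallest estimate.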

There are two key assumptions in the Bayes risk approach to the hypothesis testing problem formulated above. First, the a priori probabilities Pi of each hypothesis occurring can be determined. Second, decision costs $C_{ij}$ can be meaningfully assigned. Under these two assumptions, the Bayes risk hypothesis testing problem above is well posed. Clearly, the key is the minimization of the Bayes risk $E\left[ C_{D(y)} \right]$.

Let us now focus on finding the decision rule that minimizes the Bayes risk. Recall (Figure 5.2) that a deterministic decision rule D(y) is nothing more than a division of the observation space Rn into disjoint decision regions Z0 and Z1 such that when y ∈ Zi our decision is Hi. Thus finding a deterministic decision rule in the binary case is simply a matter of figuring out which region to assign each observation to. Combining this insight with Bayes rule we proceed by rewriting the Bayes risk as follows:

$$ E\left[ C_{D(y)} \right] = E\left[ E\left[ C_{D(y)} \mid y \right] \right] = \int E\left[ C_{D(y)} \mid y \right] p_Y(y)\, dy \tag{5.2} $$

Now $p_Y(y)$ is always non-negative and the value of $E\left[ C_{D(y)} \mid y \right]$ only depends on the decision region to which we assign the particular value y, so we can minimize (5.2) by minimizing $E\left[ C_{D(y)} \mid y \right]$ for each value of y. Thus, the optimal decision is to choose the hypothesis that gives the smallest value of the conditional expected cost $E\left[ C_{D(y)} \mid y \right]$ for the given value of y.

Now the conditional expected cost is given by:

$$ E\left[ C_{D(y)} \mid y \right] = \sum_{i=0}^{1} \sum_{j=0}^{1} C_{ij} \Pr\left( D(y) = H_i,\ H_j \text{ true} \mid y \right) \tag{5.3} $$

But Pr (D(y) = Hi,Hj true | y) will either equal 0 or Pr (Hj true | y) for a deterministic decision rule, since the decision outcome given y is non-random! In particular, for a given observation value y, the expected value of the conditional cost if we choose to assign the observation to H0 is given by:

$$ \text{If } D(y) = H_0: \quad E\left[ C_{D(y)=H_0} \mid y \right] = C_{00}\, p_{H|Y}(H_0 \mid y) + C_{01}\, p_{H|Y}(H_1 \mid y) \tag{5.4} $$

where pH|Y (Hi | y) denotes Pr (Hi true | Y = y). Similarly, the expected value of the conditional cost if we assign this value of y to H1 is given by:

$$ \text{If } D(y) = H_1: \quad E\left[ C_{D(y)=H_1} \mid y \right] = C_{10}\, p_{H|Y}(H_0 \mid y) + C_{11}\, p_{H|Y}(H_1 \mid y) \tag{5.5} $$


Given the discussion above, the optimal thing to do is to make the decision that results in the smaller of the two conditional costs. We can compactly represent this comparison and its associated decision rule as follows:

$$ C_{00}\, p_{H|Y}(H_0 \mid y) + C_{01}\, p_{H|Y}(H_1 \mid y) \ \underset{H_0}{\overset{H_1}{\gtrless}} \ C_{10}\, p_{H|Y}(H_0 \mid y) + C_{11}\, p_{H|Y}(H_1 \mid y) \tag{5.6} $$

where $\underset{H_0}{\overset{H_1}{\gtrless}}$ denotes choosing H1 if the inequality is > and choosing H0 if the inequality is <. The decision rule given in (5.6) represents the optimal Bayes risk decision rule in its most fundamental form. Now from Bayes rule we have that:

$$ p_{H|Y}(H_i \mid y) = \frac{p_{Y|H}(y \mid H_i)\, p_H(H_i)}{p_Y(y)} \tag{5.7} $$

Substituting (5.7) into (5.6), multiplying through by $p_Y(y)$, and collecting terms, we obtain:

$$ (C_{01} - C_{11})\, P_1\, p_{Y|H}(y \mid H_1) \ \underset{H_0}{\overset{H_1}{\gtrless}} \ (C_{10} - C_{00})\, P_0\, p_{Y|H}(y \mid H_0) \tag{5.8} $$

which expresses the optimum Bayes risk decision rule in terms of the prior probabilities Pi and the data “likelihoods” pY |H(y | Hi). Note that the expressions (5.7) and (5.8) are valid for any assignments of the costs Cij.

If we further make the reasonable assumption that errors are more costly than correct decisions, so that

$$ (C_{01} - C_{11}) > 0 \tag{5.9} $$

$$ (C_{10} - C_{00}) > 0 \tag{5.10} $$

we can rewrite the optimum Bayes risk decision rule D(y) in (5.8) as follows:

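$$ L(y) = \left[ \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} \right] \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \frac{(C_{10} - C_{00})\, P_0}{(C_{01} - C_{11})\, P_1} \equiv \eta \tag{5.11} $$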

The consequences of (5.11) are considerable and we will take some time to discuss them. First, examining (5.11) we see that the form of the optimal Bayes risk decision rule is to compare the ratio L(y), which is termed the likelihood ratio, to a threshold, which is given by η. The value of this threshold is determined, in general, by both the prior probabilities and the assigned cost structure, both of which are known at the outset of the problem (i.e. involve prior knowledge). The test (5.11) is called a likelihood ratio test or LRT, for obvious reasons, and thus all optimal decision rules (in the Bayes risk sense) are LRTs (with perhaps different thresholds). Thus, while as engineers we may disagree on such details as the assignment of costs and prior probabilities, the form of the optimal test (i.e. the data processing) is always the same and given by the LRT. Indeed, while the threshold η can be set by choosing costs and prior probability assignments, it is also possible to view it simply as a tunable parameter.

Second, examining (5.11) we see that the data or observations enter the decision only through the likelihood ratio L(y). Because it is a function of the uncertain observation y, it is itself a random variable. Since this scalar function of the data is all that is needed to perform the optimal test, it is a sufficient statistic for the detection problem. That is, instead of making a decision based on the original observations y, it is sufficient to make the decision based only on the likelihood ratio, which is a function of y.

Finally, note that the sufficient statistic L(y) is a scalar. Thus the LRT is a scalar test, independent of the dimension of the observation space. This means we can make a decision in the binary case by making a single comparison, independent of whether we have 1 observation or 1 million.

Before moving on to look at special cases we note that there is another form of (5.11) that is sometimes used. In particular, taking logarithms of both sides of (5.11) does not change the inequality and results in the following equivalent test:

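$$ \ln\left[ L(y) \right] \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \ln\left[ \frac{(C_{10} - C_{00})\, P_0}{(C_{01} - C_{11})\, P_1} \right] = \ln(\eta) \tag{5.12} $$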

The quantity on the left hand side of (5.12) is called the log-likelihood ratio, and as we will see, is conveniently used in Gaussian problems.
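The structure of the LRT (5.11) is simple enough to sketch directly: compute the threshold η from the priors and costs, then compare the two likelihoods. In the sketch below the function names and the Gaussian example values are hypothetical, chosen only to show the mechanics.

```python
import math

def lrt_threshold(P0, P1, C00, C01, C10, C11):
    """Threshold eta in (5.11); assumes C01 > C11 and C10 > C00 as in (5.9)-(5.10)."""
    return ((C10 - C00) * P0) / ((C01 - C11) * P1)

def lrt_decide(y, lik0, lik1, eta):
    """Return 1 (decide H1) if L(y) > eta, else 0 (decide H0)."""
    # Comparing lik1 with eta * lik0 avoids dividing by a possibly tiny lik0(y)
    return 1 if lik1(y) > eta * lik0(y) else 0

def gaussian_pdf(y, m, sigma):
    return math.exp(-0.5 * ((y - m) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical problem: H0 ~ N(0, 1), H1 ~ N(2, 1), equal priors, unit error costs
eta = lrt_threshold(0.5, 0.5, 0, 1, 1, 0)
decisions = [
    lrt_decide(y, lambda v: gaussian_pdf(v, 0.0, 1.0), lambda v: gaussian_pdf(v, 2.0, 1.0), eta)
    for y in (0.2, 0.9, 1.1, 3.0)
]
print(eta, decisions)
```

Changing the costs or priors changes only the number `eta`; the data processing inside `lrt_decide` stays the same, which is the point made above about the form of the optimal test.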


5.1.2 Special Cases

Let us now consider some common special cases of the Bayes risk and the associated decision rules corresponding to them.

MPE cost assignment and the MAP rule

Suppose we use the following cost assignment:

Cij = 1 −δij (5.13)

where δij = 1 if i = j and δij = 0 if i ≠ j. Then the costs of all errors are the same (C10 = C01 = 1) and there is no cost for correct decisions (C00 = C11 = 0). In this case, the Bayes risk is given by:

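$$ E\left[ C_{D(y)} \right] = C_{00} \Pr[\text{Decide } H_0,\ H_0 \text{ true}] + C_{01} \Pr[\text{Decide } H_0,\ H_1 \text{ true}] + C_{10} \Pr[\text{Decide } H_1,\ H_0 \text{ true}] + C_{11} \Pr[\text{Decide } H_1,\ H_1 \text{ true}] \tag{5.14} $$

$$ = \Pr[\text{Decide } H_0,\ H_1 \text{ true}] + \Pr[\text{Decide } H_1,\ H_0 \text{ true}] \tag{5.15} $$

$$ = \Pr[\text{Error}] $$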

Thus the optimal detector for this cost assignment minimizes the probability of error. The corresponding decision rule is termed the minimum probability of error (MPE) decision rule and is given by:

$$ \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \frac{P_0}{P_1} \tag{5.16} $$

Since pY |H(y | Hi)Pi = pH|Y (Hi | y)pY (y) we can rewrite the MPE decision rule (5.16) in the following form:

$$ p_{H|Y}(H_1 \mid y) \ \underset{H_0}{\overset{H_1}{\gtrless}} \ p_{H|Y}(H_0 \mid y) \tag{5.17} $$

This decision rule says that for minimum probability of error we should choose the hypothesis whose posterior probability is higher. This is termed the maximum a posteriori probability or MAP rule. Thus we see that the MAP rule is also the MPE rule, whatever the prior probabilities happen to be.
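The equivalence of the MPE rule (5.16) and the MAP rule (5.17) can be checked numerically by implementing both and comparing their decisions. The priors, means, and variance below are hypothetical values picked for the illustration, as are the function names.

```python
import math

def gaussian_pdf(y, m, sigma):
    return math.exp(-0.5 * ((y - m) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posteriors(y, P0, P1, m0, m1, sigma):
    """Posterior probabilities of H0 and H1 given y, from Bayes rule (5.7)."""
    j0 = P0 * gaussian_pdf(y, m0, sigma)
    j1 = P1 * gaussian_pdf(y, m1, sigma)
    return j0 / (j0 + j1), j1 / (j0 + j1)

def map_decide(y, P0, P1, m0, m1, sigma):
    p0, p1 = posteriors(y, P0, P1, m0, m1, sigma)
    return 1 if p1 > p0 else 0                    # MAP rule (5.17)

def mpe_lrt_decide(y, P0, P1, m0, m1, sigma):
    L = gaussian_pdf(y, m1, sigma) / gaussian_pdf(y, m0, sigma)
    return 1 if L > P0 / P1 else 0                # MPE rule (5.16)

# Hypothetical values: priors 0.9 and 0.1, means 0 and 2, unit variance
ys = [-1.0, 0.5, 1.5, 2.5, 4.0]
map_decisions = [map_decide(y, 0.9, 0.1, 0.0, 2.0, 1.0) for y in ys]
lrt_decisions = [mpe_lrt_decide(y, 0.9, 0.1, 0.0, 2.0, 1.0) for y in ys]
print(map_decisions)
print(lrt_decisions)
```

Note that with a strong prior in favor of H0 the observation y = 1.5, which is closer to the H1 mean, is still assigned to H0 by both rules.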

The ML rule

Now suppose we again use the MPE cost criterion with Cij = 1 − δij, but also have both hypotheses equally likely a priori so that P0 = P1 = 1/2. In this case we essentially have no prior preference for one hypothesis over the other. With these assignments we can see that the threshold in (5.11) is given by η = 1 so that the decision rule becomes:

$$ p_{Y|H}(y \mid H_1) \ \underset{H_0}{\overset{H_1}{\gtrless}} \ p_{Y|H}(y \mid H_0) \tag{5.18} $$

In this case the decision rule is to choose the hypothesis that gives the higher likelihood of the observation. For this reason this rule is called the maximum likelihood or ML rule.

Scalar Gaussian Detection

Here we consider the problem of deciding which of two possible Gaussian distributions a single scalar observation comes from. In particular, under hypothesis Hi the observation is distributed according to:

$$ p_{Y|H}(y \mid H_i) = N(y;\, m_i, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\; e^{-\frac{1}{2} \frac{(y - m_i)^2}{\sigma_i^2}} \tag{5.19} $$

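These two possibilities are depicted in Figure 5.3.

[Plot: the two densities $p_{Y|H}(y \mid H_0)$ and $p_{Y|H}(y \mid H_1)$, with means m0 and m1]

Figure 5.3: General scalar Gaussian case

The likelihood ratio for this case is given by:

$$ L(y) = \frac{ \left( \frac{1}{\sqrt{2\pi\sigma_1^2}} \right) e^{-\frac{(y - m_1)^2}{2\sigma_1^2}} }{ \left( \frac{1}{\sqrt{2\pi\sigma_0^2}} \right) e^{-\frac{(y - m_0)^2}{2\sigma_0^2}} } \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \eta \tag{5.20} $$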

Now taking natural logs of both sides as in (5.12) and rearranging terms results in the following form of the optimal decision rule:

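$$ -\frac{(y - m_1)^2}{2\sigma_1^2} + \frac{(y - m_0)^2}{2\sigma_0^2} \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \ln\left( \frac{\sigma_1}{\sigma_0}\, \eta \right) \tag{5.21} $$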

Same Variances, Different Means: Let us consider some special sub-cases. First, suppose σ0 = σ1 = σ and m1 > m0. In this case the Gaussian distributions have the same variance but different means and the task is to decide whether the observation came from the Gaussian with the greater or lesser mean. After simplification, (5.21) can be reduced to the following form:

$$ y \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \frac{m_0 + m_1}{2} + \frac{\sigma^2 \ln(\eta)}{(m_1 - m_0)} \equiv \Gamma \tag{5.22} $$

This situation is depicted in Figure 5.4. There are some interesting things to note about this result. First there are two decision regions separated by Γ. In general, the boundary between the decision regions is an adjusted threshold, which takes into account both the costs and the prior probabilities. For example, if we consider the ML rule (i.e. the MPE cost structure with equally likely hypotheses), then Γ = (m0 + m1)/2 and the boundary between decision regions is halfway between the means. In particular, in this case η = 1 and we can write the decision rule in the form:

$$ \| y - m_0 \|^2 \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \| y - m_1 \|^2 \tag{5.23} $$

which says to choose the hypothesis whose mean is closest to the observation. If instead we use the MPE cost structure with P1 > P0, then η = P0/P1 < 1 and the decision boundary moves closer to m0, since we expect to see the H1 case more frequently. In any case, the data processing is linear: the observation y itself is compared to a threshold. This will not always be the case.
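The dependence of the boundary Γ in (5.22) on the priors is easy to tabulate. The sketch below uses the MPE cost structure, so that η = P0/P1, with hypothetical means and variance; the function name `gamma_threshold` is likewise just a label for this example.

```python
import math

def gamma_threshold(m0, m1, sigma, eta):
    """Threshold Gamma in (5.22); assumes m1 > m0 and a common variance sigma^2."""
    return 0.5 * (m0 + m1) + sigma ** 2 * math.log(eta) / (m1 - m0)

# Hypothetical means and variance; MPE costs, so eta = P0 / P1
m0, m1, sigma = 0.0, 2.0, 1.0
thresholds = []
for P1 in (0.5, 0.7, 0.9):
    eta = (1.0 - P1) / P1
    thresholds.append(gamma_threshold(m0, m1, sigma, eta))
    print(P1, thresholds[-1])
```

For equal priors the boundary sits at the midpoint of the means, and it moves steadily toward (and, for a strong enough prior, past) m0 as P1 grows.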

[Plot: the densities $p_{Y|H}(y \mid H_0)$ and $p_{Y|H}(y \mid H_1)$ with means m0 and m1, and the threshold Γ separating the "Declare H0" and "Declare H1" regions]

Figure 5.4: Scalar Gaussian case with equal variances


Different Variances, Same Means: Now consider what happens if we instead suppose σ0 < σ1 and m1 = m0 = 0. In this case the Gaussian distributions have the same mean, but different variances and the task is to decide whether the observation came from the Gaussian with the greater or lesser variance. After simplification, (5.21) can be reduced to the following form:

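$$ y^2 \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \left( \frac{2\sigma_1^2 \sigma_0^2}{\sigma_1^2 - \sigma_0^2} \right) \ln\left( \frac{\sigma_1}{\sigma_0}\, \eta \right) \equiv \Gamma' \tag{5.24} $$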

This situation is depicted in Figure 5.5. Note that the decision regions are no longer simple connected segments of the real line. Further, the decision rule is a nonlinear function of the observation y.
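The threshold on y² can be cross-checked against a direct evaluation of the likelihood ratio. Setting m0 = m1 = 0 in (5.21) and solving for y² gives the threshold 2σ0²σ1²/(σ1² − σ0²) · ln(σ1η/σ0), which is what the sketch below implements. The numerical values and function names are hypothetical.

```python
import math

def gamma_prime(sigma0, sigma1, eta):
    """Threshold on y^2 for zero-mean Gaussians; assumes sigma1 > sigma0."""
    scale = 2.0 * sigma0 ** 2 * sigma1 ** 2 / (sigma1 ** 2 - sigma0 ** 2)
    return scale * math.log(eta * sigma1 / sigma0)

def decide_by_threshold(y, sigma0, sigma1, eta):
    return 1 if y * y > gamma_prime(sigma0, sigma1, eta) else 0

def decide_by_lrt(y, sigma0, sigma1, eta):
    pdf = lambda v, s: math.exp(-0.5 * (v / s) ** 2) / (s * math.sqrt(2 * math.pi))
    return 1 if pdf(y, sigma1) > eta * pdf(y, sigma0) else 0

sigma0, sigma1, eta = 1.0, 2.0, 1.0      # hypothetical values
ys = [-3.0, -1.0, 0.0, 1.0, 3.0]
by_threshold = [decide_by_threshold(y, sigma0, sigma1, eta) for y in ys]
by_lrt = [decide_by_lrt(y, sigma0, sigma1, eta) for y in ys]
print(math.sqrt(gamma_prime(sigma0, sigma1, eta)))
print(by_threshold)
print(by_lrt)
```

Observations of large magnitude, of either sign, are assigned to the larger-variance hypothesis H1, so the H1 decision region consists of two disjoint half-lines.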

[Plot: the densities $p_{Y|H}(y \mid H_0)$ and $p_{Y|H}(y \mid H_1)$, with a "Declare H0" region between the thresholds −Γ^{1/2} and Γ^{1/2} and "Declare H1" regions outside them]

Figure 5.5: Scalar Gaussian case with equal means

5.1.3 Examples

Let us consider some examples.

Example 5.4 (Radar) Consider the radar example, Example 5.2, discussed earlier. This is really just a scalar Gaussian detection problem. The likelihood ratio for this example is given by:

$$ L(y) = \frac{ e^{-\frac{(y - E)^2}{2\sigma^2}} }{ e^{-\frac{y^2}{2\sigma^2}} } = e^{\frac{2Ey - E^2}{2\sigma^2}} \tag{5.25} $$

Thus, the optimal decision rule is given by:

$$ e^{\frac{2Ey - E^2}{2\sigma^2}} \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \eta \tag{5.26} $$

Taking logarithms of both sides means that the new decision rule can be restated as:

$$ y \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \frac{E}{2} + \frac{\sigma^2 \ln(\eta)}{E} \tag{5.27} $$

In the case that the cost criterion is minimum probability of error (MPE), so that C00 = C11 = 0, C01 = C10 = 1, and the two hypotheses are equally likely a priori (P0 = P1 = 1/2), we have that η = 1. Note that the optimal detection test in this case is to determine which mean (0 or E) the measurement is closer to! This is just an instance of the scalar Gaussian detection problem treated above.

Example 5.5 (Multiple Observations) Consider the radar detection example, except that N independent pulses are sent out, so that a vector of measurements is collected. This is the typical situation in radar systems, where multiple pulses are processed to improve the signal-to-noise ratio and thus obtain better detection performance. We assume that each pulse provides a measurement yi, where

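$$ y_i = \begin{cases} n_i & \text{if hypothesis } H_0 \text{ is true (no target present)} \\ E + n_i & \text{if hypothesis } H_1 \text{ is true (target present)} \end{cases} $$

and the $n_i$ are independent, identically distributed $N(0, \sigma^2)$ random variables. In this case, the likelihood ratio is given by:

$$ L(y) = \frac{ p_{Y_1, \ldots, Y_N | H}(y_1, \ldots, y_N \mid H_1) }{ p_{Y_1, \ldots, Y_N | H}(y_1, \ldots, y_N \mid H_0) } = \prod_{i=1}^{N} \frac{ e^{-\frac{(y_i - E)^2}{2\sigma^2}} }{ e^{-\frac{y_i^2}{2\sigma^2}} } = \prod_{i=1}^{N} e^{\frac{2Ey_i - E^2}{2\sigma^2}} = \exp\left( \frac{ 2E \sum_{i=1}^{N} y_i - NE^2 }{ 2\sigma^2 } \right) \tag{5.28} $$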

By again taking logs of both sides the decision rule can be reduced to:

$$ \sum_{i=1}^{N} y_i \ \underset{H_0}{\overset{H_1}{\gtrless}} \ \frac{NE}{2} + \frac{\sigma^2 \ln(\eta)}{E} \tag{5.29} $$

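Dividing through by N, the test compares the sample mean to the threshold E/2 + (σ²/N) ln(η)/E. Comparing with (5.27), the effect of using the extra measurements is to reduce the effective measurement variance by a factor of N, or equivalently the standard deviation by a factor of N^{1/2}.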

Before, we said that the likelihood ratio was a sufficient statistic. It may not be the simplest sufficient statistic however. Whenever there is a function of the data, g(y) such that the likelihood ratio can be computed strictly from g(y), this value is also a sufficient statistic. Thus sufficient statistics are not unique.

In the above example, it is clear that the sample mean, $\frac{1}{N} \sum_{i=1}^{N} y_i$, is a sufficient statistic for the detection problem; note that this is a linear function of the measurement vector y and much simpler than the likelihood ratio L(y) in (5.28).
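To illustrate that the sample mean carries everything the test needs, the sketch below makes each decision twice, once from the full log-likelihood ratio of (5.28) and once from the sample mean alone, and counts how often the two agree. The signal level, noise level, threshold, and function names are hypothetical choices for this example.

```python
import math
import random

def decide_full_lrt(ys, E, sigma, eta):
    """Evaluate the log of the likelihood ratio (5.28) term by term."""
    log_L = sum((-(y - E) ** 2 + y ** 2) / (2 * sigma ** 2) for y in ys)
    return 1 if log_L > math.log(eta) else 0

def decide_sample_mean(ys, E, sigma, eta):
    """The same test written on the sample mean; assumes E > 0."""
    N = len(ys)
    threshold = E / 2 + sigma ** 2 * math.log(eta) / (N * E)
    return 1 if sum(ys) / N > threshold else 0

rng = random.Random(1)
E, sigma, eta, N = 1.0, 2.0, 3.0, 25     # hypothetical values
agree = 0
for trial in range(1000):
    target = rng.random() < 0.5          # is a target present on this trial?
    ys = [(E if target else 0.0) + rng.gauss(0.0, sigma) for _ in range(N)]
    agree += decide_full_lrt(ys, E, sigma, eta) == decide_sample_mean(ys, E, sigma, eta)
print(agree, "of 1000 trials agree")
```

Apart from the negligible chance of a roundoff tie exactly at the threshold, the two rules make the same decision on every trial, even though the second one looks at a single number rather than at all N measurements.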

Example 5.6 Assume that, under hypothesis H0, we have a vector of N observations y, with independent, identically distributed $N(0, \sigma_0^2)$ components $y_i$. Under hypothesis H1, we have a vector of N observations y, with independent, identically distributed $N(0, \sigma_1^2)$ components $y_i$. Thus, the two hypotheses correspond to multiple observations of independent identically distributed random variables with the same mean but different variances. The likelihood ratio is given by:

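$$L(y) = \frac{\dfrac{1}{(2\pi\sigma_1^2)^{N/2}}\, e^{-\frac{\sum_{i=1}^{N} y_i^2}{2\sigma_1^2}}}{\dfrac{1}{(2\pi\sigma_0^2)^{N/2}}\, e^{-\frac{\sum_{i=1}^{N} y_i^2}{2\sigma_0^2}}} = \frac{\sigma_0^N}{\sigma_1^N}\, e^{-\frac{\sum_{i=1}^{N} y_i^2}{2\sigma_1^2} + \frac{\sum_{i=1}^{N} y_i^2}{2\sigma_0^2}} \tag{5.30}$$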

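Again, after taking logs (and assuming $\sigma_1 > \sigma_0$, so that the direction of the inequality is preserved), the optimal decision rule can be rewritten in terms of a simpler test, as:

$$\frac{1}{N}\sum_{i=1}^{N} y_i^2 \overset{H_1}{\underset{H_0}{\gtrless}} \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2-\sigma_0^2} \ln\left(\eta^{1/N}\,\frac{\sigma_1}{\sigma_0}\right) \tag{5.31}$$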

Clearly, a sufficient statistic for this problem is the quadratic function of the measurements: $\frac{1}{N}\sum_{i=1}^{N} y_i^2$.
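The equivalence between the likelihood ratio test and the simpler mean-square test can be checked numerically. The sketch below assumes $\sigma_1 > \sigma_0$ and uses hypothetical function names; it compares the decision obtained from the statistic $\frac{1}{N}\sum y_i^2$ with the decision obtained by comparing the log of (5.30) to $\ln(\eta)$ directly.

```python
import math

def log_lr(y, sigma0, sigma1):
    """Log of the likelihood ratio (5.30)."""
    s = sum(v * v for v in y)
    return len(y) * math.log(sigma0 / sigma1) + s * (1 / (2 * sigma0**2) - 1 / (2 * sigma1**2))

def variance_test_threshold(eta, N, sigma0, sigma1):
    """Right-hand side of (5.31); assumes sigma1 > sigma0."""
    scale = 2 * sigma0**2 * sigma1**2 / (sigma1**2 - sigma0**2)
    return scale * math.log(eta ** (1.0 / N) * sigma1 / sigma0)

def decide(y, eta, sigma0, sigma1):
    """Return 1 (declare H1) if the mean-square statistic exceeds the threshold, else 0."""
    N = len(y)
    stat = sum(v * v for v in y) / N
    return 1 if stat > variance_test_threshold(eta, N, sigma0, sigma1) else 0

# Hypothetical data sets, for illustration only.
samples = ([0.1, -0.2, 0.3], [2.0, -3.0, 1.5], [1.0, 1.5, -1.2])
decisions = [decide(y, 1.0, 1.0, 2.0) for y in samples]
```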

Before proceeding to another section, consider a problem which does not involve Gaussian random variables.

Example 5.7 Assume that we observe a random variable y which is Poisson distributed with mean m0 when H0 is true, and with mean m1 when H1 is true. Thus the likelihoods are given by:

$$p_{Y|H}(y \mid H_i) = \frac{m_i^y\, e^{-m_i}}{y!} \tag{5.32}$$

Note that the measurements are discrete-valued; thus, the likelihood ratios will involve probability distributions rather than densities. The likelihood ratio is given by:

$$L(y) = \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} = \frac{m_1^y\, e^{-m_1}}{m_0^y\, e^{-m_0}} \tag{5.33}$$

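Thus, taking logs and assuming $m_1 > m_0$, the optimal decision rule can be written as:

$$y \overset{H_1}{\underset{H_0}{\gtrless}} \frac{(m_1 - m_0) + \ln(\eta)}{\ln\left(\frac{m_1}{m_0}\right)} \tag{5.34}$$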
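The reduction from (5.33) to (5.34) can be sanity-checked with a short Python sketch. It assumes $m_1 > m_0$ (so that dividing by $\ln(m_1/m_0)$ preserves the inequality); the function names and numerical values are hypothetical.

```python
import math

def poisson_log_lr(y, m0, m1):
    """Log of (5.33): y ln(m1/m0) - (m1 - m0)."""
    return y * math.log(m1 / m0) - (m1 - m0)

def poisson_threshold(m0, m1, eta):
    """Right-hand side of (5.34); assumes m1 > m0."""
    return ((m1 - m0) + math.log(eta)) / math.log(m1 / m0)

# Hypothetical means and threshold, for illustration only.
m0, m1, eta = 2.0, 5.0, 1.0
threshold = poisson_threshold(m0, m1, eta)
decisions = [1 if y > threshold else 0 for y in range(10)]
```

For these values the threshold falls between 3 and 4, so counts of 4 or more lead to declaring H1.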


5.2 Performance and the Receiver Operating Characteristic

In the discussion so far we have focused on the form of the optimal test and on the nature of the data processing involved. We have found that the optimum Bayes risk test is the likelihood ratio test, where a function of the data (the likelihood ratio) is compared to a threshold. Let us now turn our attention to characterizing the performance of decision rules in general and LRT-based decision rules in particular. To aid in this discussion let us define the following standard terminology, arising from classical radar detection theory:

PF ≡ Pr(Choose H1 | H0 True) = Probability of False Alarm (called a “Type I” Error)

PD ≡ Pr(Choose H1 | H1 True) = Probability of Detection

PM ≡ Pr(Choose H0 | H1 True) = Probability of Miss (called a “Type II” Error)

The quantity PF is the probability that the decision rule will declare H1 when H0 is true, while PD is the probability that the decision rule will declare H1 when H1 is true and PM is the probability that the decision rule will declare H0 when H1 is true. Note carefully that these are conditional probabilities!

Now there are two natural metrics to evaluate the performance of a decision rule. The first metric is the expected value of the cost E[CD(y)], i.e. the value of the Bayes risk. Let us examine this cost in more detail. Following (5.14), and using Bayes rule and the definitions of the conditional probabilities PF, PM, and PD above, the Bayes risk is given by:

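$$\begin{aligned}
E\left[C_D(y)\right] &= C_{00}\Pr[\text{Decide } H_0 \mid H_0]\,P_0 + C_{01}\Pr[\text{Decide } H_0 \mid H_1]\,P_1 \\
&\quad + C_{10}\Pr[\text{Decide } H_1 \mid H_0]\,P_0 + C_{11}\Pr[\text{Decide } H_1 \mid H_1]\,P_1 \\
&= C_{00}(1-P_F)P_0 + C_{01}(1-P_D)P_1 + C_{10}P_F P_0 + C_{11}P_D P_1 \\
&= \underbrace{C_{00}P_0 + C_{01}P_1}_{\text{Fixed cost}} + \underbrace{(C_{10}-C_{00})\,P_0 P_F - (C_{01}-C_{11})\,P_1 P_D}_{\text{Varies as function of decision rule}}
\end{aligned} \tag{5.35}$$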

Note that this cost has two components. The first component is independent of the decision rule used, is based only on the “prior” components of the problem, and represents a fixed cost. The second component varies as a function of the decision rule (e.g. as the threshold η of the LRT is varied). In particular, of the elements in this second component it is PF and PD that will vary as the decision rule is changed. Thus, from a performance standpoint, we can say that E[CD(y)] can be expressed purely as a function of PF and PD (where we assume Cij and Pi are fixed).

A second natural performance metric of decision rules is the probability of error Pr[error]. Starting from (5.14) and again using Bayes rule and the definitions of PF , PM , and PD we find:

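$$\begin{aligned}
\Pr[\text{Error}] &= \Pr[\text{Decide } H_0,\ H_1 \text{ true}] + \Pr[\text{Decide } H_1,\ H_0 \text{ true}] \\
&= P_M P_1 + P_F P_0 \\
&= (1-P_D)P_1 + P_F P_0
\end{aligned} \tag{5.36}$$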

Again, the parts of this expression that will vary as the decision rule is changed are PD and PF . Thus, we can also express Pr[Error] as a function of just PD and PF (again, assuming Cij and Pi are fixed).
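Since both metrics depend on the decision rule only through PF and PD, they are easy to evaluate once an operating point is known. The following Python sketch of (5.35) and (5.36) uses hypothetical function names and numbers; with the minimum probability of error cost assignment (C01 = C10 = 1, C00 = C11 = 0) the two metrics coincide.

```python
def bayes_risk(PF, PD, C00, C01, C10, C11, P0, P1):
    """Bayes risk (5.35), split into its fixed and rule-dependent parts."""
    fixed = C00 * P0 + C01 * P1
    variable = (C10 - C00) * P0 * PF - (C01 - C11) * P1 * PD
    return fixed + variable

def pr_error(PF, PD, P0, P1):
    """Probability of error (5.36)."""
    return (1 - PD) * P1 + PF * P0

# Hypothetical operating point and priors, for illustration only.
PF, PD, P0, P1 = 0.1, 0.8, 0.6, 0.4
risk_mpe = bayes_risk(PF, PD, 0, 1, 1, 0, P0, P1)
error = pr_error(PF, PD, P0, P1)
```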

Let us summarize the development thus far. Given any decision rule we can determine its performance (i.e. either its corresponding Bayes risk $E[C_D(y)]$ or its Pr[Error]) by calculating PD and PF for the decision rule. Further, we know that “good” decision rules (i.e. those optimal in the Bayes risk sense) are likelihood ratio tests, i.e. they compare the likelihood ratio to a fixed threshold to make their decision. The only undetermined quantity in a LRT is its threshold. Given this discussion it seems reasonable to limit ourselves to consideration of LRT decision rules and to calculate PD and PF for every possible value of the threshold η. Given this information, we have essentially characterized every possible “reasonable” decision rule. This information may be conveniently and compactly represented as a graph of PD(η) versus PF(η), that is, a plot of the points (PF,PD) as the parameter η is varied. Such an important plot for a decision rule has a special name: it is called the Receiver Operating Characteristic or ROC for the detection problem. An illustration of a ROC is given in Figure 5.6.

Let us emphasize some features of the ROC. First, note that the threshold η is a parameter along the curve. Thus any one point on the ROC corresponds to a particular choice of threshold (and vice versa). The ROC itself does not depend on the costs Cij or the a priori probabilities Pi. These terms can be used, however, to determine a particular threshold, and thus a particular operating point corresponding to the optimal Bayes risk detector. Finding appropriate values of these costs and prior probabilities can be challenging, and the ROC allows us to characterize the performance of all possible optimal detectors.

Figure 5.6: Illustration of ROC. (Axes: PF horizontal, PD vertical, each running from 0 to 1; the point (PF(η0), PD(η0)) on the curve is marked for the threshold η = η0.)

The key challenge in generating the ROC for a particular problem is finding the quantities PD and PF as a function of a threshold parameter. To this end, note that a general LRT decision rule can always be expressed in the following form:

$$\ell(y) \overset{H_1}{\underset{H_0}{\gtrless}} \Gamma \tag{5.37}$$

where $\ell(y)$ is a sufficient statistic for the detection problem and Γ is a corresponding threshold. The sufficient statistic might be the original likelihood ratio $L(y) = p_{Y|H}(y \mid H_1)/p_{Y|H}(y \mid H_0)$ or it might be a simpler function of the observations, as we saw in the radar example. The important thing is that it completely captures the influence of the observations. Note that $\ell(y)$ is itself a random variable, since it is a function of y.

Now we can express PD and PF as follows:

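$$P_D = \Pr(\text{Choose } H_1 \mid H_1 \text{ True}) \tag{5.38}$$

$$\phantom{P_D} = \int_{\{y \,\mid\, \text{Choose } H_1\}} p_{Y|H}(y \mid H_1)\, dy \tag{5.39}$$

$$\phantom{P_D} = \int_{\ell > \Gamma} p_{L|H}(\ell \mid H_1)\, d\ell \tag{5.40}$$

$$P_F = \Pr(\text{Choose } H_1 \mid H_0 \text{ True}) \tag{5.41}$$

$$\phantom{P_F} = \int_{\{y \,\mid\, \text{Choose } H_1\}} p_{Y|H}(y \mid H_0)\, dy \tag{5.42}$$

$$\phantom{P_F} = \int_{\ell > \Gamma} p_{L|H}(\ell \mid H_0)\, d\ell \tag{5.43}$$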

The expressions (5.39) and (5.42) express the probabilities in terms of quantities in the space of the observations, i.e. in terms of the likelihoods. The expressions (5.40) and (5.43) express the probabilities in terms of quantities in the space of the test statistic and its densities. Both expressions are correct, and the choice of which to use is usually based on convenience, as we will see. Note that the region of integration (i.e. the set of values of y or $\ell$ used in calculation) is the same for both PD and PF; it is just the densities used that are different. We illustrate these ideas with an example.

Example 5.8 (Scalar Gaussian Detection) Consider again the problem of determining which of two Gaussian densities a scalar observation comes from. In particular, suppose y is scalar and distributed $N(0,\sigma^2)$ under H0 and distributed $N(E,\sigma^2)$ under H1. We have seen in (5.27) that the optimal decision rule was:

$$\ell(y) = y \overset{H_1}{\underset{H_0}{\gtrless}} \frac{E}{2} + \frac{\sigma^2 \ln(\eta)}{E} = \Gamma$$


In this case $\ell(y) = y$, so the observation space is the same as the space of the test statistic, and it is easy to see that $\ell(y)$ will be a Gaussian random variable under either hypothesis. In particular, we have:

$$p_{L|H_1}(\ell \mid H_1) = N\left(\ell;\, E, \sigma^2\right) \tag{5.44}$$

$$p_{L|H_0}(\ell \mid H_0) = N\left(\ell;\, 0, \sigma^2\right) \tag{5.45}$$

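Now we can combine these densities with (5.40) and (5.43) to find PD and PF as we vary Γ over (−∞,∞), which is the range of Γ that results from variations in η. Explicitly, we have

$$P_D = \int_{\Gamma}^{\infty} p_{L|H_1}(\ell \mid H_1)\, d\ell = \int_{\Gamma}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(\ell-E)^2}{2\sigma^2}}\, d\ell \tag{5.46}$$

$$P_F = \int_{\Gamma}^{\infty} p_{L|H_0}(\ell \mid H_0)\, d\ell = \int_{\Gamma}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\ell^2}{2\sigma^2}}\, d\ell \tag{5.47}$$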

These calculations of PD and PF are illustrated in Figure 5.7. Since these probabilities depend on the integral of Gaussian densities, we can express them in terms of the standard Q function $Q(x) = \frac{1}{\sqrt{2\pi}}\int_{x}^{\infty} e^{-z^2/2}\, dz$ as follows:

$$P_D = Q\left(\frac{\Gamma - E}{\sigma}\right) \tag{5.48}$$

$$P_F = Q\left(\frac{\Gamma}{\sigma}\right) \tag{5.49}$$
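Expressions (5.48) and (5.49) make it straightforward to trace out the ROC numerically by sweeping Γ. The Python sketch below uses the standard identity $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$; the function names and the values E = 1, σ = 1 are assumptions made only for illustration.

```python
import math

def Q(x):
    """Gaussian tail probability, Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gaussian_operating_point(Gamma, E, sigma):
    """(PF, PD) from (5.49) and (5.48) for the threshold Gamma."""
    PF = Q(Gamma / sigma)
    PD = Q((Gamma - E) / sigma)
    return PF, PD

# Sweep the threshold to trace the ROC (hypothetical E and sigma).
roc = [gaussian_operating_point(g / 10.0, E=1.0, sigma=1.0) for g in range(-40, 51)]
```

Each entry of `roc` is one (PF, PD) point; plotting them in order gives a curve of the kind sketched in Figure 5.6.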

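Figure 5.7: Illustration of PD and PF calculation. (The figure shows the densities $p_{L|H_0}(\ell \mid H_0) = p_{Y|H_0}(y \mid H_0)$, centered at 0, and $p_{L|H_1}(\ell \mid H_1) = p_{Y|H_1}(y \mid H_1)$, centered at E, on a common $\ell, y$ axis, with the “Declare H0” and “Declare H1” regions and the areas PF and PD marked.)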

Note that for this Gaussian detection example the performance of the detection rule really only depends on the separation of the means of the test statistic $\ell(y)$ under each hypothesis relative to the variance of the test statistic under each hypothesis, i.e. the normalized “distance” between the conditional densities. This relative or normalized distance is often an important indicator of the difficulty of a detection problem. As a result, this idea has been formalized in the definition of the so-called “$d^2$ statistic”:

$$d^2 \equiv \frac{\left(E[\ell \mid H_1] - E[\ell \mid H_0]\right)^2}{\sqrt{\mathrm{Var}(\ell \mid H_1)\,\mathrm{Var}(\ell \mid H_0)}} \tag{5.50}$$

The quantity $d^2$ can be seen to be a measure of the normalized distance between two hypotheses. In general, larger values of $d^2$ correspond to easier detection problems.

Example 5.9 (Scalar Gaussian Detection) Let us continue Example 5.8. Note that:

$$d^2 = \frac{E^2}{\sigma^2} \tag{5.51}$$

which is a measure of the relative separation of the means under each hypothesis. Further we can express PD and PF in terms of d as follows:

$$P_F = Q\left(\frac{\Gamma}{\sigma}\right) \qquad P_D = Q\left(\frac{\Gamma}{\sigma} - d\right) \tag{5.52}$$

Larger values of d result in higher values of PD for a given value of PF .
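To see this dependence on d concretely, one can eliminate Γ from (5.52): since $\Gamma/\sigma = Q^{-1}(P_F)$, we get $P_D = Q(Q^{-1}(P_F) - d)$. The Python sketch below evaluates this relation with the standard library's `statistics.NormalDist`; the function names are hypothetical.

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal, mean 0 and variance 1

def Q(x):
    """Gaussian tail probability."""
    return 1.0 - _std.cdf(x)

def Q_inv(p):
    """Inverse of Q for 0 < p < 1."""
    return _std.inv_cdf(1.0 - p)

def pd_from_pf(PF, d):
    """PD as a function of PF and d, obtained by eliminating Gamma from (5.52)."""
    return Q(Q_inv(PF) - d)

# PD at a fixed false alarm rate of 0.1 for increasing d.
pd_values = [pd_from_pf(0.1, d) for d in (0.0, 1.0, 2.0, 3.0)]
```

With d = 0 the two hypotheses are indistinguishable and PD equals PF; as d grows, PD at the same PF increases.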

5.2.1 Properties of the ROC

If we examine the expressions for PD and PF for Example 5.8 in more detail we can see that the corresponding ROC will possess a number of properties. First, PD ≥ PF for all thresholds Γ or η. In addition, $\lim_{\Gamma\to-\infty} P_D = \lim_{\Gamma\to-\infty} P_F = 1$. At the other extreme, $\lim_{\Gamma\to+\infty} P_D = \lim_{\Gamma\to+\infty} P_F = 0$. Finally, PD ≤ 1 and PF ≤ 1. Thus, the sketch in Figure 5.6 reasonably reflects this ROC. More interestingly, these properties (and others) are true for general ROC curves, and not just for the present example. We discuss these properties of the ROC next, starting with those we have just seen for our Gaussian example. We consider general likelihood ratio tests with threshold η as given in (5.11).

Property 1. The points (PF,PD) = (0, 0) and (PF,PD) = (1, 1) are always on the ROC. To see this, suppose we set the threshold η = 0. In this case, since the densities are non-negative, the decision rule will always select H1, so PD = PF = 1. At the other extreme, assume the threshold η = +∞. In this case the hypothesis H0 is always selected [1]. Since H0 is always selected, PF = 0 and PD = 0.

Property 2. The ROC is the boundary between what is achievable by any decision rule and what is not. In particular, the (PF ,PD) curve of any detection rule (including detection rules that are not LRTs) cannot lie in the shaded region shown in Figure 5.8.

Now, it is straightforward to see that we cannot get better PD for a given PF than that achieved by the LRT for the problem, since that would imply a detection rule resulting in lower Bayes risk (which would contradict our finding that the optimal Bayes risk decision rule is a LRT). What is perhaps less immediately obvious is that no decision rule can perform worse than the performance corresponding to the “reflection” of the ROC below the 45 degree line. The detector with this maximally bad performance is obtained by simply switching the decision regions for each value of η (and thus is doing the worst thing to do for every threshold). The reason is simple: if it were possible to design a decision rule with arbitrarily bad performance, then by just exchanging the decision regions we could obtain a decision rule with arbitrarily good performance. Note that the result of swapping the decision regions is that PD ⇒ 1 − PD and PF ⇒ 1 − PF.

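Figure 5.8: Illustration of ROC properties. (Axes: PF horizontal, PD vertical, each running from 0 to 1; the ROC of the LRT is shown with the point (PF(η0), PD(η0)) marked at η = η0, where the slope of the curve is η0.)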

Property 3. For a LRT with threshold η, the slope of the (continuous) ROC at the corresponding (PF (η),PD(η)) point is η.

[1] Note that the only way that H1 would be selected is if we had an observation such that pY|H(y | H0) = 0. However, for such observations there is no possibility of a false alarm, since those values cannot be generated under H0!


To show this, first note that we may express PD as follows:

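$$P_D = \int_{\{y \mid L(y) > \eta\}} p_{Y|H}(y \mid H_1)\, dy = \int_{\{y \mid L(y) > \eta\}} L(y)\, p_{Y|H}(y \mid H_0)\, dy \tag{5.53}$$

$$\phantom{P_D} = \int_{\eta}^{\infty} Z\, p_{L|H_0}(Z \mid H_0)\, dZ \tag{5.54}$$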

Now, differentiating (5.54) with respect to η we obtain:

$$\frac{dP_D(\eta)}{d\eta} = -\eta\, p_{L|H_0}(\eta \mid H_0) \tag{5.55}$$

Now we also know that

$$P_D = \int_{\eta}^{\infty} p_{L|H_1}(L \mid H_1)\, dL \tag{5.56}$$

$$P_F = \int_{\eta}^{\infty} p_{L|H_0}(L \mid H_0)\, dL \tag{5.57}$$

Differentiating these expressions with respect to η we also obtain:

$$\frac{dP_D}{d\eta} = -p_{L|H_1}(\eta \mid H_1) \tag{5.58}$$

$$\frac{dP_F}{d\eta} = -p_{L|H_0}(\eta \mid H_0) \tag{5.59}$$

Now equating (5.55) to (5.58) we obtain the result that:

$$\frac{p_{L|H_1}(\eta \mid H_1)}{p_{L|H_0}(\eta \mid H_0)} = \eta \tag{5.60}$$

Finally, the slope of the ROC is given by the derivative of PD with respect to PF :

$$\frac{dP_D}{dP_F} = \frac{dP_D/d\eta}{dP_F/d\eta} = \frac{-p_{L|H_1}(\eta \mid H_1)}{-p_{L|H_0}(\eta \mid H_0)} = \eta \tag{5.61}$$

which shows the result.

This property is illustrated in Figure 5.8. Note that a consequence of this property is that the ROC has zero slope at the point (PF ,PD) = (1, 1) (η = 0) and infinite slope at the point (PF ,PD) = (0, 0) (η = ∞).
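Property 3 can be checked numerically on the scalar Gaussian problem of Example 5.8. There the test statistic is y itself, so the LRT threshold η that corresponds to Γ is the likelihood ratio evaluated at y = Γ, namely $e^{(2E\Gamma - E^2)/(2\sigma^2)}$. The Python sketch below compares this value with a finite-difference estimate of the ROC slope; the function names, the step size h, and the parameter values are assumptions for illustration only.

```python
import math

def Q(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def likelihood_ratio(y, E, sigma):
    """L(y) for N(E, sigma^2) against N(0, sigma^2)."""
    return math.exp((2 * E * y - E**2) / (2 * sigma**2))

def roc_slope(Gamma, E, sigma, h=1e-5):
    """Central-difference estimate of dPD/dPF at the threshold Gamma."""
    dPD = Q((Gamma + h - E) / sigma) - Q((Gamma - h - E) / sigma)
    dPF = Q((Gamma + h) / sigma) - Q((Gamma - h) / sigma)
    return dPD / dPF

# Hypothetical thresholds with E = 1 and sigma = 1.
checks = [(roc_slope(G, 1.0, 1.0), likelihood_ratio(G, 1.0, 1.0)) for G in (0.2, 0.5, 1.0)]
```

Each pair in `checks` holds the estimated slope and the corresponding η; the two agree to within the accuracy of the finite difference.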

Property 4. The ROC of the LRT is convex downward (i.e. concave). In particular, PD ≥ PF. To show this property we use the concept of randomized decision rules, discussed in the following section on detection from discrete-valued observations. Suppose we select the endpoints of a randomized decision rule to be on the optimal ROC itself, as illustrated in Figure 5.9. Note that such a randomized decision rule is not necessarily optimal. As a result, the optimal test must have performance (i.e. PD for a given PF) that is at least as good as that of any randomized test. In particular, if $(P_F^*, P_D^*)$ are the points on the ROC for the optimal Bayes decision rule, then we must have:

$$P_D^* \geq P_D(p) \quad \text{when} \quad P_F^* = P_F(p) \tag{5.62}$$

This argument shows that points on the optimal ROC between our chosen endpoints must lie on or above the line connecting the endpoints, and thus that the optimal ROC is convex downward, as shown in Figure 5.9.

To see how the ROC can be used to compare the performance of different problems and detection rules, consider the following example, where we examine how the ROC changes as a function of the amount of data.

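Figure 5.9: Illustration of ROC convexity using randomized decision rules. (Axes: PF horizontal, PD vertical, each running from 0 to 1; two endpoints on the ROC are marked at PF1 and PF2, and the randomized-rule point (PF(p), PD(p)) lies on the straight line joining them.)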

Example 5.10 Suppose we observe N independent samples of a random variable: $y_i$, $i = 1, \cdots, N$. Under hypothesis H0, $p_{Y_i|H_0}(y_i \mid H_0) \sim N(0,\sigma^2)$, and under H1, $p_{Y_i|H_1}(y_i \mid H_1) \sim N(1,\sigma^2)$. Define the vector y to be the collection of samples. Our problem is to decide whether our vector of observations came from the H0 distribution or the H1 distribution. This problem is similar to the N-pulse radar detection problem of Example 5.5. Using our analysis there we find that the optimal test can be written as:

$$\ell(y) = \frac{1}{N}\sum_{i=1}^{N} y_i \overset{H_1}{\underset{H_0}{\gtrless}} \frac{1}{2} + \frac{\sigma^2 \ln(\eta)}{N} \tag{5.63}$$

Now note that since the observations $y_i$ are independent, the sufficient statistic for the test, $\ell(y) = \frac{1}{N}\sum_{i=1}^{N} y_i$, has the following probability density functions under each hypothesis:

$$H_0: \quad p_{L|H_0}(\ell \mid H_0) \sim N(0, \sigma^2/N) \tag{5.64}$$

$$H_1: \quad p_{L|H_1}(\ell \mid H_1) \sim N(1, \sigma^2/N) \tag{5.65}$$

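Thus, the probability of false alarm for a given threshold $\Gamma = \frac{1}{2} + \frac{\sigma^2 \ln(\eta)}{N}$ is given by

$$P_F = \frac{1}{\sqrt{2\pi\sigma^2/N}} \int_{\Gamma}^{\infty} e^{-\frac{N x^2}{2\sigma^2}}\, dx = Q\left(\frac{N^{1/2}\,\Gamma}{\sigma}\right) \tag{5.66}$$

where $Q(\Gamma) = \frac{1}{\sqrt{2\pi}} \int_{\Gamma}^{\infty} e^{-x^2/2}\, dx$. Similarly,

$$P_D = \frac{1}{\sqrt{2\pi\sigma^2/N}} \int_{\Gamma}^{\infty} e^{-\frac{N (x-1)^2}{2\sigma^2}}\, dx = Q\left(\frac{N^{1/2}(\Gamma - 1)}{\sigma}\right) \tag{5.67}$$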

Note that, as N increases, the ROC curves are monotonically increasing in PD for the same PF, and thus nest. In particular, as we make more independent observations the curves move to the northwest and closer to their bounding box. In the limit, for any fixed threshold 0 < Γ < 1, we have $\lim_{N\to\infty} P_D = 1$ and $\lim_{N\to\infty} P_F = 0$, which indicates that, as N → ∞, the detector approaches perfect performance. This effect is shown in Figure 5.10. Simply looking at the ROC curves for the different cases we can see the positive effect of using more observations.
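The nesting of the ROC curves can be illustrated numerically. Eliminating Γ between (5.66) and (5.67) gives $P_D = Q(Q^{-1}(P_F) - N^{1/2}/\sigma)$, so the effective separation grows as $N^{1/2}$. The following Python sketch evaluates PD at a fixed PF for several N; the function name and the values PF = 0.1, σ = 2 are hypothetical.

```python
import math
from statistics import NormalDist

_std = NormalDist()  # standard normal

def pd_at_fixed_pf(PF, N, sigma):
    """PD of Example 5.10 at a given PF, from (5.66) and (5.67).

    From (5.66), sqrt(N) * Gamma / sigma = Q^{-1}(PF); substituting
    into (5.67) gives PD = Q(Q^{-1}(PF) - sqrt(N) / sigma).
    """
    z = _std.inv_cdf(1.0 - PF)  # Q^{-1}(PF)
    return 1.0 - _std.cdf(z - math.sqrt(N) / sigma)

# PD at PF = 0.1 for a growing number of observations.
pd_by_n = {N: pd_at_fixed_pf(0.1, N, sigma=2.0) for N in (1, 4, 16, 64)}
```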

Finally, note that the idea of using the ROC to evaluate the performance of decision rules is so powerful and pervasive that it is used to evaluate decision rules even when they are not, strictly speaking, LRT rules for binary hypothesis testing problems.

5.2.2 Detection Based on Discrete-Valued Random Variables

The theory behind detection based on observations y that are discrete valued is essentially the same as when y is continuous valued. In particular, the LRT (5.11) is still the optimal decision rule, as considered in Example 5.7. There are some important unique characteristics of the discrete valued case that are worth discussing, however. When the observations y are discrete-valued the likelihood ratio L(y) will also be discrete-valued. In this case, varying the threshold η will have no effect on the values of PF, PD until the threshold crosses one of the discrete values of L(y). After crossing this discrete value, the values of PF, PD will then change by a finite amount. As a result, the ROC “curve” in such a discrete observation case, obtained by varying the value of the threshold, will be a series of disconnected and isolated points. This is illustrated in the following examples.

Figure 5.10: Illustration of ROC behavior as we obtain more independent observations. (Axes: PF horizontal, PD vertical, each running from 0 to 1; a family of ROC curves is shown, labeled in the direction of increasing N.)

Example 5.11 Assume that y is a binomial random variable, resulting from the sum of two independent, identically distributed Bernoulli random variables:

$$y = x_1 + x_2 \tag{5.68}$$

The probabilities of the $x_i$ under each hypothesis are given by:

$$\text{Under } H_0: \quad \Pr(x_i = 1) = \frac{1}{4}; \quad \Pr(x_i = 0) = \frac{3}{4} \tag{5.69}$$

$$\text{Under } H_1: \quad \Pr(x_i = 1) = \frac{1}{2}; \quad \Pr(x_i = 0) = \frac{1}{2} \tag{5.70}$$

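Note that y can only take 3 values: 0, 1, or 2. Under these conditions, the likelihood ratio for the problem is given by:

$$L(y) = \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} = \frac{\frac{2!}{y!(2-y)!}\left(\frac{1}{2}\right)^{y}\left(\frac{1}{2}\right)^{2-y}}{\frac{2!}{y!(2-y)!}\left(\frac{1}{4}\right)^{y}\left(\frac{3}{4}\right)^{2-y}} = \frac{1/4}{(1/4)^{y}(3/4)^{2-y}} \tag{5.71}$$

$$\phantom{L(y)} = \frac{4}{3^{2-y}} \tag{5.72}$$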

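The LRT for this problem is then given by:

$$\frac{4}{3^{2-y}} \overset{H_1}{\underset{H_0}{\gtrless}} \eta \tag{5.73}$$

Now note that the likelihood ratio can only take the values:

$$L(y) = \begin{cases} 4/9 & \text{if } y = 0 \\ 4/3 & \text{if } y = 1 \\ 4 & \text{if } y = 2 \end{cases} \tag{5.74}$$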

Now let us examine how PD and PF vary as we change η. If η > 4, hypothesis H0 is always selected so that PD = 0 and PF = 0. Thus, these values of η correspond to the point (PF,PD) = (0, 0) on the ROC. As η is reduced so that 4/3 < η < 4, hypothesis H1 is selected only when y = 2. The probability of detection is PD = P(y = 2 | H1) = 1/4, whereas the probability of false alarm is PF = P(y = 2 | H0) = 1/16. Note that PD and PF will have these values for any value of η in the range 4/3 < η < 4. Thus, this entire range of η corresponds to the (isolated) point (PF,PD) = (1/16, 1/4) on the ROC. Further reducing the threshold η so that 4/9 < η < 4/3 implies that H0 is selected only when y = 0. In this case, the probability of detection is PD = 1 − P(y = 0 | H1) = 3/4, and the probability of false alarm is PF = 1 − P(y = 0 | H0) = 7/16. Again, note that PD and PF will have these values for any value of η in the range 4/9 < η < 4/3. Again, this entire range of η thus corresponds to the (isolated) point (PF,PD) = (7/16, 3/4) on the ROC. Finally, as the threshold is lowered so that η < 4/9, hypothesis H0 is never selected, so that PD = 1 and PF = 1. These values of η therefore correspond to the point (PF,PD) = (1, 1) on the ROC. In summary, varying the threshold η produces 4 isolated points for the ROC curve for this problem, as illustrated in Figure 5.11.
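The four isolated ROC points of this example can be reproduced by direct enumeration. The Python sketch below uses exact rational arithmetic from the standard library; the function names are hypothetical, and the rule implemented is the deterministic LRT that declares H1 when L(y) > η.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(y, n, p):
    """Pr(y successes in n Bernoulli(p) trials)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

p0, p1 = Fraction(1, 4), Fraction(1, 2)  # Pr(x_i = 1) under H0 and H1

# Likelihood ratio (5.74) for y = 0, 1, 2.
L = {y: binomial_pmf(y, 2, p1) / binomial_pmf(y, 2, p0) for y in range(3)}

def operating_point(eta):
    """(PF, PD) of the deterministic LRT that declares H1 when L(y) > eta."""
    chosen = [y for y in range(3) if L[y] > eta]
    PF = sum(binomial_pmf(y, 2, p0) for y in chosen)
    PD = sum(binomial_pmf(y, 2, p1) for y in chosen)
    return PF, PD

# One threshold from each of the four ranges discussed in the text.
points = [operating_point(eta) for eta in (5, 2, 1, Fraction(1, 3))]
```

Evaluating `points` gives (0, 0), (1/16, 1/4), (7/16, 3/4) and (1, 1), matching the values derived above.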

Figure 5.11: Illustration of the ROC for the discrete valued problem of Example 5.11. (Axes: PF horizontal, PD vertical; the isolated points are (0, 0) for 4 < η, (1/16, 1/4) for 4/3 < η ≤ 4, (7/16, 3/4) for 4/9 < η ≤ 4/3, and (1, 1) for η ≤ 4/9.)

Let us consider another discrete valued example, this time involving Poisson random variables.

Example 5.12 Consider observing a scalar value y, which is Poisson distributed under H0 with mean $m_0$, and Poisson distributed under H1 with mean $m_1$. This situation was considered in Example 5.7. The optimal decision rule for this problem was found in (5.34) to be given by:

$$y \overset{H_1}{\underset{H_0}{\gtrless}} \frac{(m_1 - m_0) + \ln(\eta)}{\ln\left(\frac{m_1}{m_0}\right)} = \Gamma \tag{5.75}$$

Since y is discrete-valued, fractional parts of the effective threshold Γ on the right hand side of (5.75) will have no effect, and the ROC will again have a countable number of points.

The probability of false alarm is thus a function of the integer part of the threshold Γ, and is given by:

$$P_F(\Gamma) = \sum_{y=\lceil\Gamma\rceil}^{\infty} \frac{m_0^y}{y!}\, e^{-m_0} \tag{5.76}$$

where $\lceil\Gamma\rceil$ denotes the smallest integer greater than or equal to Γ. Similarly, the probability of detection is given by:

$$P_D(\Gamma) = \sum_{y=\lceil\Gamma\rceil}^{\infty} \frac{m_1^y}{y!}\, e^{-m_1} \tag{5.77}$$
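The tail sums (5.76) and (5.77) are easy to evaluate as one minus a finite sum. The Python sketch below does this with the standard library only; the function names and the means m0 = 2, m1 = 5 are hypothetical, and the lower summation limit is taken as the ceiling of Γ (with the whole range y ≥ 0 used when Γ ≤ 0).

```python
import math

def poisson_tail(k, m):
    """Pr(y >= k) for y ~ Poisson(m), computed as 1 - sum_{y < k} pmf(y)."""
    if k <= 0:
        return 1.0
    term = math.exp(-m)  # pmf at y = 0
    cdf = 0.0
    for y in range(k):
        cdf += term
        term *= m / (y + 1)  # pmf(y + 1) from pmf(y)
    return 1.0 - cdf

def poisson_operating_point(Gamma, m0, m1):
    """(PF, PD) from (5.76) and (5.77) for the effective threshold Gamma."""
    k = math.ceil(Gamma)
    return poisson_tail(k, m0), poisson_tail(k, m1)

# Thresholds 0.5 and 1.0 share the same ceiling, so they give the same point.
point_a = poisson_operating_point(0.5, 2.0, 5.0)
point_b = poisson_operating_point(1.0, 2.0, 5.0)
```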

The ROC for this problem is illustrated in Figure 5.12.

Figure 5.12: Illustration of the ROC for the discrete valued problem of Example 5.12. (Axes: PF horizontal, PD vertical; the isolated points are labeled by the threshold ranges Γ ≤ 0, 0 < Γ ≤ 1, 1 < Γ ≤ 2, and 2 < Γ ≤ 3.)

The discrete nature of the ROC when the observation is discrete-valued seems to suggest that we can only obtain detection performance at a finite number of (PF,PD) pairs. While this observation is true if we limit ourselves to deterministic decision rules, by introducing the concept of a randomized decision rule we can get a much wider set of detection performance points (i.e. (PF,PD) points).

To introduce the idea of a randomized decision rule, suppose we have a likelihood ratio L(y) for an arbitrary problem (i.e. not necessarily with discrete-valued observations) and two thresholds η0 and η1. We then essentially have two likelihood ratio decision rules. Assume the decision rule corresponding to η0 has performance (PF0,PD0) and the decision rule corresponding to η1 has performance (PF1,PD1). Suppose we now define a new (random) decision rule by deciding between H0 and H1 according to the following probabilistic scheme:

1. Select a Bernoulli random variable Z with Pr(Z = 1) = p. This is equivalent to flipping a biased coin with Pr(heads) = p.

2. If Z = 1 use a LRT with the threshold η = η1 to make the decision:

$$L(y) \overset{H_1}{\underset{H_0}{\gtrless}} \eta_1 \tag{5.78}$$

If Z = 0 use a LRT with the threshold η = η0 to make the decision:

$$L(y) \overset{H_1}{\underset{H_0}{\gtrless}} \eta_0 \tag{5.79}$$

Note that the resulting overall rule will result in a random decision. The PD(p), PF (p) performance of the overall new detection rule as a function of p can be found as:

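$$\begin{aligned}
P_D(p) &= \Pr(\text{Decide } H_1 \mid H_1) \\
&= \Pr(\text{Decide } H_1 \mid H_1, Z = 1)\Pr(Z = 1) + \Pr(\text{Decide } H_1 \mid H_1, Z = 0)\Pr(Z = 0) \\
&= p\,P_{D1} + (1-p)\,P_{D0}
\end{aligned} \tag{5.80}$$

$$\begin{aligned}
P_F(p) &= \Pr(\text{Decide } H_1 \mid H_0) \\
&= \Pr(\text{Decide } H_1 \mid H_0, Z = 1)\Pr(Z = 1) + \Pr(\text{Decide } H_1 \mid H_0, Z = 0)\Pr(Z = 0) \\
&= p\,P_{F1} + (1-p)\,P_{F0}
\end{aligned} \tag{5.81}$$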

Thus, the performance of the randomized decision rule is on the line connecting the points (PF1,PD1 ) and (PF0,PD0 ). These ideas are illustrated for a generic decision problem in Figure 5.13. By varying p we can obtain a decision rule with performance given by any (PF ,PD) pair on the line connecting the points (PF1,PD1 ) and (PF0,PD0 ).
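As a worked illustration of (5.80) and (5.81), the Python sketch below picks the mixing probability p needed to hit a desired PF between two deterministic operating points and then reports the resulting PD. The function names are hypothetical, and the endpoints are the two interior ROC points of Example 5.11.

```python
from fractions import Fraction

def randomized_point(p, point1, point0):
    """(PF(p), PD(p)) from (5.81) and (5.80); point1 is used with probability p."""
    PF1, PD1 = point1
    PF0, PD0 = point0
    return p * PF1 + (1 - p) * PF0, p * PD1 + (1 - p) * PD0

def mixing_probability(target_PF, point1, point0):
    """Solve (5.81) for p so that PF(p) equals target_PF (requires PF1 != PF0)."""
    PF1, _ = point1
    PF0, _ = point0
    return (target_PF - PF0) / (PF1 - PF0)

# Endpoints taken from Example 5.11: (PF, PD) = (7/16, 3/4) and (1/16, 1/4).
point1 = (Fraction(7, 16), Fraction(3, 4))
point0 = (Fraction(1, 16), Fraction(1, 4))
p = mixing_probability(Fraction(1, 4), point1, point0)
mixed = randomized_point(p, point1, point0)
```

Here a target of PF = 1/4 requires p = 1/2 and yields PD = 1/2, a point that no deterministic threshold achieves in that example.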

Now, let us return to the discrete-valued observation case. By using such randomized decision rules with the isolated points of the ROC of the deterministic decision rule as endpoints, we can obtain any (PF ,PD) performance on the lines connecting these points. For example, the resulting ROC for Example 5.11 would be as shown in Figure 5.14. In general, we can obtain an ROC curve which is a piecewise-linear concave curve connecting the isolated points of the deterministic decision rule ROC. Further, c.f. ROC Property 2, it is impossible to get performance that is above this piecewise-linear curve (or below its mirror image).

Finally, note that ROC Property 3 can also be extended to discrete-valued random variables. Note that in this case the ROC curve is not differentiable at the discrete-valued points so the slope of the ROC curve is not defined at these points. At such points of non-differentiability, there is a range of possible slopes, defined by the slopes of the straight lines to the right and to the left of the isolated points. At these points, the value of η must be included in this range of possible slopes.

Figure 5.13: Illustration of the performance of a randomized decision rule. (Axes: PF horizontal, PD vertical, each running from 0 to 1; two endpoints are marked at PF1 and PF2, and the point (PF(p), PD(p)) lies on the straight line joining them.)

Figure 5.14: Illustration of the overall ROC obtained for a discrete valued observation problem using randomized rules. (Axes: PF horizontal with ticks at 0, 1/16, 7/16, and 1; PD vertical with ticks at 1/4 and 3/4.)

5.3 Other Threshold Strategies

We have now determined that the form of the optimal Bayes risk test is the likelihood ratio test and have studied the performance of decision rules through use of the ROC. We have seen that the ROC compactly represents the performance of the LRT for all choices of the threshold η. In the general Bayes formulation the specific threshold η used for a given detection problem (and thus the specific operating point chosen on the ROC) is a function of the prior probabilities Pi = Pr(Hi) and the cost assignment Cij:

$$\eta \equiv \frac{(C_{10} - C_{00})\,P_0}{(C_{01} - C_{11})\,P_1} \tag{5.82}$$
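As a minimal sketch of (5.82), the following Python function (a hypothetical name) computes the Bayes threshold from the costs and priors. For the minimum probability of error cost assignment the threshold reduces to the ratio of the priors, P0/P1.

```python
def bayes_threshold(C00, C01, C10, C11, P0, P1):
    """LRT threshold (5.82) from the cost assignment and the prior probabilities."""
    return (C10 - C00) * P0 / ((C01 - C11) * P1)

# Minimum probability of error costs with two hypothetical sets of priors.
eta_equal = bayes_threshold(0, 1, 1, 0, 0.5, 0.5)
eta_skewed = bayes_threshold(0, 1, 1, 0, 0.75, 0.25)
```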

If we have knowledge of all these elements, then this is obviously the right (and easy) thing to do. Often, however, determining either the Pi or the Cij is fraught with difficulties and an alternative strategy for picking the operating point is used. We discuss two such alternatives next.

5.3.1 Minimax Hypothesis Testing

For a given detection problem suppose that we have a cost assignment Cij we believe in, but are unsure of the true prior probabilities used by nature, which are $P_1^*$ and $P_0^*$. Now suppose we design a decision rule (i.e., choose a threshold) based on the costs Cij and a set of assumed (but possibly incorrect) prior probabilities P1 and P0 = 1 − P1. Let the performance of the resulting decision rule be given by the operating point (PF(P1),PD(P1)), which, as we have indicated, will be a function of our choice of P1. Since, in general, the Pi we use to design our decision rule will be different from the true underlying $P_i^*$, our test will not have the minimum cost or Bayes risk for this problem. One reasonable approach in such a situation is to assume that nature will do the worst thing possible and to choose our design values of Pi (i.e. choose our threshold η) to minimize the maximum value of the cost or Bayes risk as a function of the true values $P_i^*$. Such a strategy leads to the minimax decision rule.

Now, from (5.35), the resulting cost (i.e. the Bayes risk) of a decision rule using assumed values Pi when truth is $P_i^*$ is given by:

$$\begin{aligned}
E(C, P_1, P_1^*) &= C_{00}P_0^* + C_{01}P_1^* + (C_{10} - C_{00})\,P_0^*\,P_F(P_1) - (C_{01} - C_{11})\,P_1^*\,P_D(P_1) \\
&= \left[(C_{01} - C_{00}) - (C_{10} - C_{00})\,P_F - (C_{01} - C_{11})\,P_D\right] P_1^* + C_{00} + (C_{10} - C_{00})\,P_F
\end{aligned} \tag{5.83}$$

where we have used the fact that $P_0^* = 1 - P_1^*$. On the left in Figure 5.15 we illustrate how the expected cost changes as the true prior probability $P_1^*$ is varied. When an arbitrary fixed value of P1 is used, the threshold is fixed, so the corresponding values of PF and PD are fixed. In this case we see from (5.83) that E(C) will be a linear function of the true prior probability $P_1^*$. This is plotted as the upper curve in Figure 5.15 (left). Now if we knew $P_1^*$ we could design an optimal LRT using an optimal threshold. In this case the threshold would change as $P_1^*$ varied and thus so would PF and PD and the resulting cost. The cost of this optimal decision rule is the lower curve in Figure 5.15 (left). The two curves touch when the design value of P1 matches the true value of $P_1^*$. Thus, they will always be tangent at this matched point. For the example in the figure, the maximum value of the expected cost for the non-optimal rule is obtained at the left endpoint of the curve.

Figure 5.15: Left: Illustration of the expected cost of a decision rule using an arbitrary fixed threshold as a function of the true prior probability $P_1^*$. The maximum cost of this decision rule is at the left endpoint. The lower curve is the corresponding expected cost of the optimal LRT. Right: The expected cost of the minimax decision rule as a function of the true prior probability $P_1^*$. (Both panels plot E(C) against P1 from 0 to 1, with the values C00 and C11 marked at the ends. Left panel labels: “Cost if fixed η used”, “Maximum cost of this decision rule”, “Optimum LRT decision rule if $P_1^*$ known”, and the matched point $P_1^* = P_1$. Right panel labels: “This choice minimizes the maximum cost” and the minimax design value of P1.)

In general, we would like to minimize the maximum value of (5.83). Examining Figure 5.15, we can see that this goal is accomplished if we choose our value of P1 (or equivalently, our operating point on the ROC) so that the line (5.83) is tangent to the optimal Bayes risk curve at its maximum, as shown on the right in the figure. This happens when the slope of the curve is zero, i.e. when:

$$\left[(C_{01} - C_{00}) - (C_{10} - C_{00})\,P_F - (C_{01} - C_{11})\,P_D\right] = 0 \tag{5.84}$$

This result is valid as long as the maximum of the optimal Bayes cost curve is interior to the interval. When the maximum is at the boundary of the interval, then that is the value of P1 to choose.

Equation (5.84) is sometimes termed the minimax equation and defines the general minimax operating point. We can rewrite (5.84) in the following form:

$$P_D = \left(\frac{C_{01} - C_{00}}{C_{01} - C_{11}}\right) - \left(\frac{C_{10} - C_{00}}{C_{01} - C_{11}}\right) P_F \tag{5.85}$$

which is just a line in (PF,PD) space. Thus the minimax choice of operating point can be found as the intersection of the straight line (5.85) with the ROC for the optimal LRT, as shown in Figure 5.16. For example, if we use the MPE cost assignment, C01 = C10 = 1, C00 = C11 = 0, then (5.85) reduces to PD = 1 − PF and the minimax line is just the −45 degree line.
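For the scalar Gaussian problem of Example 5.8, the intersection of (5.85) with the ROC can be found numerically. The Python sketch below applies bisection to the minimax equation (5.84) as a function of the threshold Γ. It is only a sketch under stated assumptions: the function names are hypothetical, the costs satisfy C10 > C00 and C01 > C11 (so the left side of (5.84) increases with Γ), and the root is assumed to lie in the search interval.

```python
import math

def Q(x):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def minimax_equation(Gamma, E, sigma, C00, C01, C10, C11):
    """Left-hand side of (5.84) at the Gaussian operating point for Gamma."""
    PF = Q(Gamma / sigma)
    PD = Q((Gamma - E) / sigma)
    return (C01 - C00) - (C10 - C00) * PF - (C01 - C11) * PD

def minimax_threshold(E, sigma, C00, C01, C10, C11, lo=-20.0, hi=20.0, iters=200):
    """Bisection for the Gamma that satisfies (5.84); assumes a root in [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if minimax_equation(mid, E, sigma, C00, C01, C10, C11) < 0.0:
            lo = mid  # PF and PD still too large: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

# MPE cost assignment with hypothetical E = 1 and sigma = 1.
gamma_mm = minimax_threshold(1.0, 1.0, 0, 1, 1, 0)
```

With the MPE costs the symmetry of the two Gaussian densities places the minimax threshold at Γ = E/2, which is where PD = 1 − PF.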

Figure 5.16: Finding the minimax operating point by intersecting (5.85) with the ROC for the optimal LRT. (Axes: PF horizontal, PD vertical, each running from 0 to 1; the line $P_D = \left(\frac{C_{01}-C_{00}}{C_{01}-C_{11}}\right) - \left(\frac{C_{10}-C_{00}}{C_{01}-C_{11}}\right)P_F$ crosses the ROC of the optimal LRT at the point labeled η minimax.)

5.3.2 Neyman-Pearson Hypothesis Testing

In the minimax case we assume that the costs Cij can be meaningfully assigned, but that we do not know the prior probabilities. In many cases, finding such meaningful cost assignments can be difficult. This raises the question of how to choose an operating point when neither the prior probabilities Pi nor the costs Cij can be found. In general, we would like to make PF as small as possible and PD as large as possible. As the ROC shows, these two desires work in opposition to each other. What is often done in practice is to constrain PF and then to maximize PD subject to this constraint. Mathematically, one wants to solve:

$$\max P_D \quad \text{subject to} \quad P_F \leq \alpha \tag{5.86}$$

The solution of this problem is called a Neyman-Pearson detection rule or “NP rule”. Note that the optimal Bayes LRT has the highest PD for any PF, and thus the solution of the Neyman-Pearson problem must be an optimal LRT for some choice of threshold η. So we are again in the position of needing to find an appropriate operating point on the optimal ROC. Since the ROC of the optimal LRT has PD as a monotonically non-decreasing function of PF, the solution of the NP problem must correspond to the point (α,PD(α)). In the continuous-observation case, the corresponding optimal threshold η is then the slope of the ROC at this point. When the observations are discrete, we can use randomized decision rules to obtain the best PD for any PF = α and the corresponding threshold η can be found from the thresholds of the endpoints. Indeed, the desire to perform NP decision rules is one motivation for randomized decision rules in the discrete case!

Example 5.13 (Neyman-Pearson) Suppose that the likelihoods under each hypothesis for a binary detection problem are as given in Figure 5.17. We want to find the decision rule that maximizes PD subject to PF ≤ 1/2.

This decision rule will be a Neyman-Pearson rule. The observation is continuous valued, so the ROC will be continuous as well. Thus the optimal NP rule will be an LRT with threshold η chosen so that PF = 1/2. We can write this rule as follows:

$$ p_{Y|H}(y \mid H_1) \underset{H_0}{\overset{H_1}{\gtrless}} \eta \, p_{Y|H}(y \mid H_0) \tag{5.87} $$

[Figure 5.17 graphic: plots of the two likelihoods pY|H(y | H0) and pY|H(y | H1) versus y.]

Figure 5.17: Likelihoods for a Neyman-Pearson problem.

Figure 5.18 shows pY|H(y | H1) and η pY|H(y | H0) on the same axes when η < 1. The corresponding decision regions are also shown. On the right of Figure 5.18 the corresponding value of PF = (1 − η) is shown. Now we want PF = 1/2, thus we have:

η = 1 − 1/2 = 1/2 (5.88)

The resulting decision rule is given by:

$$ \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{1}{2} \tag{5.89} $$

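[Figure 5.18 graphic: left panel, pY|H(y | H1) and the scaled density η pY|H(y | H0) on the same axes, with the H0 and H1 decision regions marked; right panel, pY|H(y | H0) with the area over the H1 region, which equals PF.]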

Figure 5.18: Scaled densities, decision regions and PF for the problem of Example 5.13.

In practical problems, the bound α on PF is determined by engineering considerations, and includes such constraints as the amount of computing power or other resources available to process false alarms. For example, a common situation we have all experienced relating to false alarm rate is in connection with car alarms. If the threshold of the car alarm is set too high, it will not trigger when the car is attacked by thieves. On the other hand, if the threshold is set too low, the alarm will often go off even when no thief is present, creating a false alarm. If too many false alarms are generated, people become exhausted and cease to check them out.

5.4 M-ary Hypothesis Testing

The exposition so far has focused on binary hypothesis testing problems. When there are M possibilities or hypotheses, we term the problem an M-ary detection or hypothesis testing problem. We can again take a minimum Bayes risk approach, with the same three problem elements we had in the binary case:

1. Model of Nature: In the M-ary case there are M possibilities, denoted as Hi, i = 0, · · · , M − 1. Our knowledge of these possibilities is captured by the prior probabilities Pi = Pr(H = Hi), i = 0, · · · , M − 1. Note that ∑i Pi = 1.

2. Observation Model: This relationship is given in the M-ary case by the M conditional densities pY|H(y | Hi).

3. Decision Rule: Our decision rule D(y) will again be obtained by minimizing the average cost or Bayes risk. Again, Cij denotes the cost of deciding hypothesis D(y) = Hi when hypothesis Hj is true, and the Bayes risk is given by $E[C_{D(y),H}]$.

Note that in the M-ary case, the decision rule D(y) is nothing more than a labeling of each point in the observation space with one of the corresponding possible decision outcomes Hi.

By an argument identical to the binary case, the expected value of the cost is given by:

$$ E\left[C_{D(y)}\right] = \int E\left[C_{D(y)} \mid y\right] p_Y(y)\, dy \tag{5.90} $$

and as before the expression is minimized by minimizing $E[C_{D(y)} \mid y]$. In particular, we should choose the decision resulting in the smallest value of this quantity. Now the expected cost of deciding Hk given y is:

$$ E\left[C_{D(y)=H_k} \mid y\right] = \sum_{j=0}^{M-1} C_{kj}\, p_{H|Y}(H_j \mid y) \tag{5.91} $$

Thus the optimal decision rule is to choose hypothesis Hk given the observation y if:

$$ \sum_{j=0}^{M-1} C_{kj}\, p_{H|Y}(H_j \mid y) \le \sum_{j=0}^{M-1} C_{ij}\, p_{H|Y}(H_j \mid y) \quad \forall i \tag{5.92} $$

The left hand side of (5.92) is the conditional cost of assigning y to the Hk decision region and the right hand side of (5.92) is the conditional cost of assigning y to the Hi decision region. Note that if the left hand side is the smallest, then assigning the given observation y to Hk is the best thing to do. Unlike the binary case, however, if the left hand side is not the smallest, we do not immediately know what the optimal hypothesis assignment is. All we know is that it is not Hk. Using this insight we can recast (5.92) in the following form, which is similar in spirit to (5.6):

$$ \sum_{j=0}^{M-1} C_{kj}\, p_{H|Y}(H_j \mid y) \underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}} \sum_{j=0}^{M-1} C_{ij}\, p_{H|Y}(H_j \mid y) \quad \forall \text{ unique } i,k \text{ pairs} \tag{5.93} $$

where the symbol $\underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}}$ denotes eliminating hypothesis Hk if the inequality is > and eliminating hypothesis Hi if the inequality is <. In the binary case there is only one comparison needed to define the optimal decision rule. In contrast, in the M-ary case, we need M(M − 1)/2 comparisons to completely define the optimal decision rule, that is, to unambiguously label all points in the observation space. Each such comparison eliminates one of the hypotheses. Note, however, that given a particular observation, its label can be determined using a series of only (M − 1) sequential tests. In particular, once a hypothesis has been ruled out, further tests involving that hypothesis can be ignored.
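The sequential elimination described above can be sketched as follows. The cost matrix and posterior values are made-up numbers for illustration, and the function names are hypothetical. `bayes_decision` picks the smallest conditional cost (5.91) directly, while `sequential_decision` reaches the same label with M − 1 pairwise comparisons of the form (5.93).

```python
def conditional_costs(cost, posterior):
    """Conditional cost of each decision k: sum_j C[k][j] * p(H_j | y)."""
    return [sum(c_kj * p_j for c_kj, p_j in zip(row, posterior)) for row in cost]


def bayes_decision(cost, posterior):
    """Index k of the hypothesis with the smallest conditional cost."""
    costs = conditional_costs(cost, posterior)
    return min(range(len(costs)), key=costs.__getitem__)


def sequential_decision(cost, posterior):
    """Same decision via M - 1 pairwise eliminations."""
    costs = conditional_costs(cost, posterior)
    survivor, comparisons = 0, 0
    for challenger in range(1, len(costs)):
        comparisons += 1
        if costs[survivor] > costs[challenger]:   # outcome "Not H_survivor"
            survivor = challenger
    return survivor, comparisons


posterior = [0.2, 0.5, 0.3]                        # hypothetical p(H_j | y)
mpe_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]       # zero-one costs
skewed_cost = [[0, 1, 10], [1, 0, 10], [1, 1, 0]]  # missing H2 is expensive

decision_mpe = bayes_decision(mpe_cost, posterior)
decision_skewed, n_tests = sequential_decision(skewed_cost, posterior)
```

With the zero-one costs the decision is simply the most probable hypothesis; the skewed costs move the decision to H2 even though its posterior is not the largest.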

We can make (5.93) more similar to the binary case through some manipulations. In analogy with (5.11), let us define the following set of likelihood ratios:

$$ L_j(y) = \frac{p_{Y|H}(y \mid H_j)}{p_{Y|H}(y \mid H_0)}, \qquad j = 0, \ldots, M-1 \tag{5.94} $$

where we take L0(y) = 1. Then, combining these likelihood ratios with Bayes rule (5.7) we have the following form for the optimal Bayes M-ary decision rule:

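$$ \sum_{j=0}^{M-1} C_{kj} P_j L_j(y) \underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}} \sum_{j=0}^{M-1} C_{ij} P_j L_j(y) \quad \forall \text{ unique } i,k \text{ pairs} \tag{5.95} $$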

Note that the quantities Lj(y) form a set of sufficient statistics for the M-ary detection problem. Further, this set of inequalities defines M(M − 1)/2 linear decision boundaries in the space of the sufficient statistics Li(y).


For example, consider the three-hypothesis case where M = 3. In this case, there are three comparisons that need to be performed:

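$$ k = 0,\ i = 1: \quad P_1 (C_{01} - C_{11}) L_1(y) \underset{\text{Not } H_1}{\overset{\text{Not } H_0}{\gtrless}} P_0 (C_{10} - C_{00}) + P_2 (C_{12} - C_{02}) L_2(y) \tag{5.96} $$

$$ k = 1,\ i = 2: \quad P_1 (C_{11} - C_{21}) L_1(y) \underset{\text{Not } H_2}{\overset{\text{Not } H_1}{\gtrless}} P_0 (C_{20} - C_{10}) + P_2 (C_{22} - C_{12}) L_2(y) \tag{5.97} $$

$$ k = 2,\ i = 0: \quad P_1 (C_{21} - C_{01}) L_1(y) \underset{\text{Not } H_0}{\overset{\text{Not } H_2}{\gtrless}} P_0 (C_{00} - C_{20}) + P_2 (C_{02} - C_{22}) L_2(y) \tag{5.98} $$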

These comparisons are shown for a generic case in Figure 5.19. Each comparison eliminates one hypothesis. Taken together, the set of comparisons labels the space of the test statistics. Note that in the space of the likelihood ratio test statistics the decision boundaries are always linear. In the space of the observations y this will not be true in general. Further, the dimension of the “likelihood space” depends on the number of hypotheses, not on the dimension of the observation, which may be greater than or less than the likelihood dimension.

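[Figure 5.19 graphic: the (L1(y), L2(y)) plane divided by three straight decision boundaries, each labeled with the hypothesis it eliminates (Not H0, Not H1, Not H2), giving the decision regions H0, H1, and H2.]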

Figure 5.19: Decision boundaries in the space of the likelihoods for an M-ary problem.

5.4.1 Special Cases

Let us now consider some common special cases of the Bayes risk and the associated decision rules corresponding to them for the M-ary case.

MPE cost assignment and the MAP rule

Suppose we use the following “zero-one” cost assignment for an M-ary problem:

Cij = 1 −δij (5.99)

where δij = 1 if i = j and δij = 0 if i ≠ j. Then all errors cost the same (Cij = 1 for i ≠ j, e.g. C10 = C01 = 1) and there is no cost for correct decisions (Cii = 0, e.g. C00 = C11 = 0). As in the binary case, this cost assignment results in the Bayes risk also equaling the probability of error:

$$ E\left[C_{D(y)}\right] = \sum_{j=0}^{M-1} \sum_{\substack{i=0 \\ i \ne j}}^{M-1} \Pr[\text{Decide } H_i,\ H_j \text{ true}] = \Pr[\text{Error}] \tag{5.100} $$


Thus the optimal decision rule for this cost assignment in the M-ary case also minimizes the probability of error. The corresponding decision rule (again termed the minimum probability of error (MPE) decision rule) is to choose hypothesis Hk given the observation y if:

pH|Y (Hk | y) ≥ pH|Y (Hi | y) ∀i (5.101)

This decision rule says that for minimum probability of error we choose the hypothesis with the highest posterior probability. As in the binary case, this is termed the maximum a posteriori probability or MAP rule. So again, the MPE cost assignment results in the MAP rule (regardless of the values of the prior probabilities).

The MAP decision rule can also be expressed in terms of a series of comparisons of likelihood ratios, as in (5.95). By substituting the MPE cost structure into (5.95) and simplifying we obtain the following equivalent expression of the Bayes optimal M-ary MAP rule:

$$ P_i L_i(y) \underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}} P_k L_k(y) \quad \forall \text{ unique } i,k \text{ pairs} \tag{5.102} $$

Note that the details of the densities are hidden in the expressions for the likelihood ratios Li(y).
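A small sketch can confirm that the pairwise form (5.102) and the posterior form (5.101) give the same decision. It assumes scalar unit-variance Gaussian likelihoods with made-up means and priors; none of these numbers or function names come from the notes.

```python
from math import exp, pi, sqrt


def gauss_pdf(y, mean, var=1.0):
    """Density of N(mean, var) evaluated at y."""
    return exp(-(y - mean) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def likelihood_ratios(y, means):
    """L_j(y) = p(y | H_j) / p(y | H_0), so that L_0(y) = 1."""
    p0 = gauss_pdf(y, means[0])
    return [gauss_pdf(y, m) / p0 for m in means]


def map_by_elimination(priors, ratios):
    """Apply P_i L_i versus P_k L_k comparisons, keeping the survivor."""
    k = 0
    for i in range(1, len(priors)):
        if priors[i] * ratios[i] > priors[k] * ratios[k]:   # "Not H_k"
            k = i
    return k


def map_by_posterior(y, means, priors):
    """Choose the hypothesis with the largest posterior probability."""
    joint = [p * gauss_pdf(y, m) for p, m in zip(priors, means)]
    total = sum(joint)
    posterior = [v / total for v in joint]
    return max(range(len(posterior)), key=posterior.__getitem__)


means, priors = (0.0, 1.0, 3.0), (0.6, 0.3, 0.1)   # hypothetical values
decisions = [map_by_elimination(priors, likelihood_ratios(y, means))
             for y in (0.2, 2.0, 2.8)]
```

The elimination version never forms the normalizing density pY(y), which is why the likelihood-ratio form is convenient in practice.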

The ML rule

Now suppose we again use the MPE cost criterion with Cij = 1 − δij, but also have all hypotheses equally likely a priori so that Pi = 1/M. In this case we essentially have no prior preference for one hypothesis over another. Applying these conditions together with Bayes rule to (5.101), the decision rule is to choose hypothesis Hk given the observation y if:

pY |H (y | Hk) ≥ pY |H (y | Hi) ∀i (5.103)

In this case the decision rule is to choose the hypothesis that gives the highest likelihood of the observation, which is again the maximum likelihood or ML rule.

As for the MAP rule, the ML decision rule can also be expressed in terms of a series of comparisons of likelihood ratios, as in (5.102). Note that the expression (5.102) already reflects the impact of the MPE cost structure. If we further incorporate the fact that Pi = Pj into (5.102), we obtain the following equivalent expression of the Bayes optimal M-ary ML rule:

$$ L_i(y) \underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}} L_k(y) \quad \forall \text{ unique } i,k \text{ pairs} \tag{5.104} $$

5.4.2 Examples

Let us now consider some examples.

Example 5.14 (Known means in White Gaussian Noise) Suppose we want to detect which of three possible N-dimensional signals is being received in the presence of noise. In particular, suppose that under hypothesis Hk the observation is given by:

Under Hk: y = mk + w k = 0, 1, 2 (5.105)

where w ∼ N(0,I). Note that this implies that the observation densities under the different hypotheses are Gaussian, given by:

$$ p_{Y|H}(y \mid H_k) = N(y;\, m_k, I), \qquad k = 0, 1, 2 \tag{5.106} $$

Assume that we want a minimum probability of error decision rule, which means we want the cost assignment Cij = 1 − δij and results in the MAP rule (5.101). We can also express this rule in the form (5.95). Substituting the densities given in (5.106) and simplifying, we obtain for the optimal decision rule for this example:

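$$ \ell_{ik}(y) = y^T \left( \frac{m_k - m_i}{\|m_k - m_i\|} \right) \underset{\text{Not } H_k}{\overset{\text{Not } H_i}{\gtrless}} \frac{1}{\|m_k - m_i\|} \left[ \frac{m_k^T m_k - m_i^T m_i}{2} + \ln\left( \frac{P_i}{P_k} \right) \right] = \Gamma_{ik} \tag{5.107} $$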

where we perform the comparisons over all unique i,k pairs.

Note a number of things. First, the ℓik(y) form a set of sufficient statistics for the problem (as do the likelihood ratios Li(y)). In addition, the computation of these sufficient statistics (i.e. the processing of the data) consists of projecting the data vector onto the line between the different means and then comparing the result to a threshold. These ideas are illustrated in Figure 5.20 for a two-dimensional case. Note that the dimension of the space of the observation is independent of the number of hypotheses. Further, in this example of Gaussian densities with identical covariance matrices but different means, the decision boundaries of each comparison in (5.107) are lines (or hyperplanes, when the observations are higher dimensional). In general (e.g. when the likelihood densities are not Gaussian, or are Gaussian with different covariance matrices), these decision boundaries will not be simple linear/planar shapes.

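[Figure 5.20 graphic: the (y1, y2) observation plane showing the means mi and mk, the projection of the observation onto the unit direction (mk − mi)/‖mk − mi‖, the threshold along that direction, and the resulting half-planes Not Hi and Not Hk.]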

Figure 5.20: Illustration of the decision rule in the original data space.

Of course, we can also depict the decision rule for the MAP decision problem in the space of the likelihood ratios Li, as was done in Figure 5.19. In particular, if we express the MAP rule in the space of the original likelihood ratios for this 3 hypothesis case (i.e. by specializing (5.102) to the three hypotheses) we can express this rule as:

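$$ k = 0,\ i = 1: \quad L_1(y) \underset{\text{Not } H_1}{\overset{\text{Not } H_0}{\gtrless}} \frac{P_0}{P_1} \tag{5.108} $$

$$ k = 1,\ i = 2: \quad L_2(y) \underset{\text{Not } H_2}{\overset{\text{Not } H_1}{\gtrless}} \left( \frac{P_1}{P_2} \right) L_1(y) \tag{5.109} $$

$$ k = 2,\ i = 0: \quad L_2(y) \underset{\text{Not } H_2}{\overset{\text{Not } H_0}{\gtrless}} \frac{P_0}{P_2} \tag{5.110} $$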

In Figure 5.21 we show this decision rule in the likelihood space. When expressed in this way, the decision boundaries are independent of the specific likelihoods of the problem! That is, the decision regions of the MAP rule for any 3-ary decision problem are as given in Figure 5.21. What has happened is that the details of the likelihoods have been hidden in the likelihood ratios Li(y).

Continuing with this example, suppose we additionally believe that each hypothesis is equally likely, so that Pi = Pj = 1/3. In this case, the decision rule will be the ML rule (5.103). Examining (5.107) and Figure 5.20, we can see that for our Gaussian example the ML rule puts the decision boundaries in the observation space halfway between each pair of means. Overall, the ML decision rule for this example becomes: choose Hk if, for all i:

$$ \|y - m_k\| \le \|y - m_i\| \tag{5.111} $$

In particular, the decision rule chooses the hypothesis whose mean is closest to the given observation, resulting in the decision regions in the observation space shown in Figure 5.22 for a two-dimensional case. The decision boundaries are the perpendicular bisectors of the line segments connecting the means under the different hypotheses. In general, this type of decision strategy is called a nearest neighbor classifier or a minimum distance receiver in the literature. It is a strategy that is used rather widely in practice, even when it is not the optimum detector, due to its ease of implementation and understanding.
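The equivalence between the minimum distance rule and the pairwise projection tests can be checked with a short sketch. The three means below are hypothetical, and the projection test is written from the MAP comparison for identity-covariance Gaussians (project y onto the unit vector between two means and compare with a threshold), so it should be read as an illustration rather than as the notes' own implementation.

```python
from math import log, sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_mean(y, means):
    """Minimum distance receiver: index of the mean closest to y."""
    dists = [sum((yj - mj) ** 2 for yj, mj in zip(y, m)) for m in means]
    return min(range(len(means)), key=dists.__getitem__)


def eliminates_i(y, m_k, m_i, p_k, p_i):
    """True when the projection test rules out H_i in favor of H_k."""
    diff = [a - b for a, b in zip(m_k, m_i)]
    norm = sqrt(dot(diff, diff))
    l_ik = dot(y, diff) / norm
    gamma_ik = ((dot(m_k, m_k) - dot(m_i, m_i)) / 2.0 + log(p_i / p_k)) / norm
    return l_ik > gamma_ik


def decide_by_projection(y, means, priors):
    """Run M - 1 pairwise projection tests, keeping the surviving hypothesis."""
    k = 0
    for i in range(1, len(means)):
        if not eliminates_i(y, means[k], means[i], priors[k], priors[i]):
            k = i
    return k


means = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]   # hypothetical m0, m1, m2
equal_priors = [1 / 3, 1 / 3, 1 / 3]
points = [(1.0, 0.5), (3.0, 1.0), (1.0, 3.5)]
labels = [nearest_mean(y, means) for y in points]
```

With unequal priors the projection thresholds shift toward the less likely mean, and the two functions no longer agree; that shift is the only difference between the MAP and ML rules in this example.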

[Figure 5.21 graphic: the (L1(y), L2(y)) plane with the boundaries L1(y) = P0/P1 and L2(y) = P0/P2 and a third boundary through the origin, each labeled with the hypothesis it eliminates, giving the decision regions H0, H1, and H2.]

Figure 5.21: Illustration of the decision rule in the likelihood space.

[Figure 5.22 graphic: the (y1, y2) observation plane with the means m0, m1, m2 and the decision regions H0, H1, H2 around them.]

Figure 5.22: Illustration of the ML decision rule in the observation space.

Example 5.15 (Gaussians with different variances) In this example, suppose we observe a one-dimensional random variable y and wish to determine which one of three possible densities it could have come from. Under each of the three hypotheses the likelihoods are given by:

$$ p_{Y|H}(y \mid H_i) = N(y;\, 0, \sigma_i^2), \qquad i = 0, 1, 2 \tag{5.112} $$

where σ0 < σ1 < σ2. Further, suppose the hypotheses are equally likely and we wish to minimize the probability of error. In this case the decision rule will be the ML rule. Applying (5.104) and simplifying we obtain the following decision rule for this case:

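$$ y^2 \underset{\text{Not } H_i}{\overset{\text{Not } H_k}{\gtrless}} \left( \frac{2\,\sigma_i^2 \sigma_k^2}{\sigma_i^2 - \sigma_k^2} \right) \ln\left( \frac{\sigma_i}{\sigma_k} \right) = \Gamma_{ik} \quad \forall \text{ unique } i,k \text{ pairs} \tag{5.113} $$

where each pair is ordered so that σi > σk (dividing by σi² − σk² then preserves the direction of the inequality).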

This decision rule is shown in Figure 5.23. The decision rule in the space of the likelihoods is essentially the same as that in Figure 5.21 with Pi/Pj = 1.
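The thresholds for this example can be checked numerically. The sketch below derives the threshold on y² directly by equating the log-densities of N(0, σi²) and N(0, σk²), which gives y² = 2σi²σk² ln(σi/σk)/(σi² − σk²) for σi > σk, and compares the resulting elimination rule with a direct maximization of the likelihood. The standard deviations 1, 2, 4 and the function names are assumptions made for illustration.

```python
from math import exp, log, pi, sqrt


def threshold(sigma_i: float, sigma_k: float) -> float:
    """Value of y**2 at which N(0, sigma_i^2) and N(0, sigma_k^2) are equal."""
    s2i, s2k = sigma_i ** 2, sigma_k ** 2
    return 2.0 * s2i * s2k / (s2i - s2k) * log(sigma_i / sigma_k)


def decide_by_thresholds(y: float, sigmas) -> int:
    """Sequential elimination; sigmas must be sorted in increasing order."""
    k = 0
    for i in range(1, len(sigmas)):
        if y * y > threshold(sigmas[i], sigmas[k]):   # "Not H_k"
            k = i
    return k


def decide_by_likelihood(y: float, sigmas) -> int:
    """Direct ML rule: pick the density with the largest value at y."""
    values = [exp(-y * y / (2.0 * s * s)) / (sqrt(2.0 * pi) * s) for s in sigmas]
    return max(range(len(sigmas)), key=values.__getitem__)


sigmas = (1.0, 2.0, 4.0)   # hypothetical sigma_0 < sigma_1 < sigma_2
labels = [decide_by_thresholds(y, sigmas) for y in (1.0, 2.0, 3.0)]
```

Small |y| favors the narrowest density and large |y| the widest, so the decision regions are nested intervals symmetric about the origin.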

5.4.3 M-Ary Performance Calculations

The two performance metrics of the binary hypothesis testing problem were the expected value of the cost E[CD(y)] and the probability of error Pr(Error). Both these criteria still make sense in the M-ary case, though the expressions are a bit different. In particular, whereas in the binary case we could express both the metrics in terms of only two conditional probabilities (PD and PF), in the M-ary case we need M(M − 1) conditional probabilities to express them.

[Figure 5.23 graphic: the three densities pY|H(y | H0), pY|H(y | H1), pY|H(y | H2) plotted against y, with the pairwise elimination thresholds marked and the resulting decision regions, from left to right, H2, H1, H0, H1, H2.]

Figure 5.23: Illustration of decision rule in the observation space.

First let us consider the expected value of the cost:

$$ E\left[C_{D(y)}\right] = \sum_{i=0}^{M-1} \sum_{j=0}^{M-1} C_{ij} \Pr(\text{Decide } H_i \mid H_j)\, P_j \tag{5.114} $$

Thus, we now need M(M − 1) conditional probabilities to express the expected cost or Bayes risk, versus the two needed in the binary case (i.e. PD and PF). So the situation is more complicated, but the idea is the same. To find the expected value of the cost (that is, the Bayes risk), we have to find a set of conditional probabilities, as before.

Consider the problem of Example 5.14 with the ML decision rule, shown in Figure 5.22. To find Pr (Decide H0 | H1) in the observation space we need to integrate the conditional density pY|H(y | H1) over the region of the space where we would choose hypothesis H0. The density pY|H(y | H1) is a circularly symmetric Gaussian centered at the mean m1. Referring to Figure 5.22, the H0 region of the space is the shaded region on the left. The term Pr (Decide H0 | H1) is thus the area of the Gaussian in the H0 part of the space, as shown in Figure 5.24. The calculation of the other conditional probabilities is similar, where, in general, both the region of integration and the density being integrated change.

[Figure 5.24 graphic: the (y1, y2) observation plane with the means m0, m1, m2 and the decision regions H0, H1, H2.]

Figure 5.24: Illustration of the calculation of Pr (Decide H0 | H1) in the observation space.

Of course, if it is more convenient, we can also find these conditional probabilities in the space of a sufficient statistic. The basic idea is the same. Consider again Example 5.14 with the ML decision rule, shown this time in the space of the sufficient statistic provided by the likelihood ratios Li(y) in Figure 5.21. To find Pr (Decide H0 | H1) we need to integrate the joint conditional density of the likelihood ratio sufficient statistics, $p_{L_1,L_2|H}(l_1, l_2 \mid H_1)$, over that part of the space of the likelihood ratios where we decide H0. While the region of the likelihood space is simply determined in this case, the required density may not be. In Example 5.14, even though the observations are Gaussian under any hypothesis, the likelihood ratios, being exponentials of linear functions of y, will not be Gaussian random variables! Not all sufficient statistics are equal, however, and a different choice of sufficient statistic may make the problem easier. Note for this example that the sufficient statistics ℓik(y) defined in (5.107) are simply linear functions of the observations, and thus are themselves Gaussian random variables under any hypothesis. The decision regions are also relatively simple for these particular sufficient statistics. This discussion illustrates the issues we face in general when performing such calculations. The challenge is to find a sufficient statistic whose combination of decision regions and densities leads to a tractable set of calculations.

Our other performance metric was the probability of error Pr [Error]. In the M-ary case this is given as:

$$ \Pr[\text{Error}] = \sum_{j=0}^{M-1} \sum_{\substack{i=0 \\ i \ne j}}^{M-1} \Pr[\text{Decide } H_i \mid H_j \text{ true}]\, P_j \tag{5.115} $$

As in the calculation of the expected cost, the key is again the calculation of the conditional probabilities Pr [Decide Hi | Hj true]. These probabilities can be calculated as illustrated in Figure 5.24 for Example 5.14. In the case of the Pr [Error] calculation there is an alternative form of the expression that is sometimes useful. It is based on the fact that the sum in (5.115) includes all the conditional probabilities except the “self terms” Pr [Decide Hj | Hj true], and that for each j the probabilities Pr [Decide Hi | Hj true] sum to one over i. As a result we may rewrite (5.115) as follows:

$$ \Pr[\text{Error}] = \sum_{j=0}^{M-1} \left( 1 - \Pr[\text{Decide } H_j \mid H_j \text{ true}] \right) P_j = 1 - \sum_{j=0}^{M-1} \Pr[\text{Decide } H_j \mid H_j \text{ true}]\, P_j \tag{5.116} $$

Consider again the problem of Example 5.14 with the ML decision rule, shown in Figure 5.22. To find a self term, for example Pr (Decide H1 | H1), in the observation space we need to integrate the conditional density pY|H(y | H1) over the region of the space where we would choose hypothesis H1. The term Pr (Decide H1 | H1) is thus the area of the Gaussian in the H1 part of the space, as shown in Figure 5.25. The calculation of the other terms is similar. As in the case of the expected cost calculation, we may also perform such calculations in the space of a sufficient statistic if that is more convenient.
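The two ways of evaluating Pr [Error] can be compared on a case where the integrals have closed forms. The sketch below uses the one-dimensional setting of Example 5.15 rather than Example 5.14, because there each conditional probability reduces to values of the Gaussian CDF. It assumes equal priors, hypothetical standard deviations 1, 2, 4, and ML decision regions of the nested form |y| < a, a < |y| < b, |y| > b, with a and b obtained by equating the neighboring log-densities.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf   # standard normal CDF


def threshold(sigma_i: float, sigma_k: float) -> float:
    """Value of y**2 where N(0, sigma_i^2) and N(0, sigma_k^2) cross."""
    s2i, s2k = sigma_i ** 2, sigma_k ** 2
    return 2.0 * s2i * s2k / (s2i - s2k) * log(sigma_i / sigma_k)


def decision_probabilities(sigmas):
    """prob[i][j] = Pr[Decide H_i | H_j] for three zero-mean Gaussians."""
    s0, s1, s2 = sigmas                      # assumed sorted: s0 < s1 < s2
    a = sqrt(threshold(s1, s0))              # |y| < a      -> decide H0
    b = sqrt(threshold(s2, s1))              # a < |y| < b  -> decide H1
    prob = [[0.0] * 3 for _ in range(3)]
    for j, s in enumerate(sigmas):
        inner = 2.0 * PHI(a / s) - 1.0       # Pr[|y| < a | H_j]
        upto_b = 2.0 * PHI(b / s) - 1.0      # Pr[|y| < b | H_j]
        prob[0][j] = inner
        prob[1][j] = upto_b - inner
        prob[2][j] = 1.0 - upto_b
    return prob


def error_probability(prob, priors):
    """Pr[Error] from the off-diagonal sum and from the self terms."""
    m = len(priors)
    off_diag = sum(prob[i][j] * priors[j]
                   for j in range(m) for i in range(m) if i != j)
    self_terms = 1.0 - sum(prob[j][j] * priors[j] for j in range(m))
    return off_diag, self_terms


probs = decision_probabilities((1.0, 2.0, 4.0))
err_a, err_b = error_probability(probs, (1 / 3, 1 / 3, 1 / 3))
```

The self-term form needs only M conditional probabilities instead of M(M − 1), which is why it is often the more convenient of the two.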

[Figure 5.25 graphic: the (y1, y2) observation plane with the means m0, m1, m2 and the decision regions H0, H1, H2.]

Figure 5.25: Illustration of the calculation of Pr (Decide H1 | H1) in the observation space.


5.5 Gaussian Examples

Gaussian detection problems are of general interest in many applications. In this section, several additional examples are discussed.

The general Gaussian likelihood ratio test is straightforward to compute. Let y be the N-dimensional observation vector, with hypothesized density pY|H(y | H0) ∼ N(m0, Σ0) under H0 and density pY|H(y | H1) ∼ N(m1, Σ1) under H1. Then the likelihood ratio test for the general Gaussian case is given by:

$$ L(y) = \frac{p_{Y|H}(y \mid H_1)}{p_{Y|H}(y \mid H_0)} = \frac{\dfrac{1}{\sqrt{(2\pi)^N |\Sigma_1|}}\, e^{-\frac{1}{2}(y-m_1)^T \Sigma_1^{-1} (y-m_1)}}{\dfrac{1}{\sqrt{(2\pi)^N |\Sigma_0|}}\, e^{-\frac{1}{2}(y-m_0)^T \Sigma_0^{-1} (y-m_0)}} \underset{H_0}{\overset{H_1}{\gtrless}} \eta \tag{5.117} $$

where |Σi| is the determinant of Σi. Taking logarithms of both sides and clearing out factors of 1/2, one obtains the following form of the LRT:

$$ \ell(y) = -(y-m_1)^T \Sigma_1^{-1} (y-m_1) + (y-m_0)^T \Sigma_0^{-1} (y-m_0) \underset{H_0}{\overset{H_1}{\gtrless}} 2\ln(\eta) + \ln(|\Sigma_1|) - \ln(|\Sigma_0|) \tag{5.118} $$

The above expression indicates that a sufficient statistic is $\ell(y) = (y-m_0)^T \Sigma_0^{-1} (y-m_0) - (y-m_1)^T \Sigma_1^{-1} (y-m_1)$.

Example 5.16 Consider now the detection of known signals in additive Gaussian noise, where the elements yj of the observation vector y are given by the following expression under each hypothesis:

Hi : yj = mij + wj (5.119)

where the values of mij are known, and wj is an independent, identically distributed sequence of Gaussian random variables with distribution N(0, σ²). Note that, in this case, Σ1 = Σ0 = σ²I, so that the sufficient statistic becomes:

$$ \ell(y) = \frac{2}{\sigma^2} (m_1 - m_0)^T y + \frac{m_0^T m_0 - m_1^T m_1}{\sigma^2} \tag{5.120} $$

The optimal detector can be written as

$$ y^T (m_1 - m_0) \underset{H_0}{\overset{H_1}{\gtrless}} \frac{m_1^T m_1 - m_0^T m_0}{2} + \sigma^2 \ln(\eta) \tag{5.121} $$
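The detector (5.121) is a correlator: it correlates y with the difference of the known signals and compares the result with a threshold. A minimal sketch follows, with made-up signal values, noise variance, and threshold η; the second function evaluates the log-likelihood ratio directly from the two Gaussian densities so that the two forms can be compared.

```python
from math import log


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def correlator_decision(y, m0, m1, sigma2, eta=1.0):
    """Return 1 (decide H1) or 0 (decide H0) using the correlator form."""
    statistic = dot(y, [a - b for a, b in zip(m1, m0)])
    thresh = (dot(m1, m1) - dot(m0, m0)) / 2.0 + sigma2 * log(eta)
    return 1 if statistic > thresh else 0


def log_likelihood_ratio(y, m0, m1, sigma2):
    """ln L(y) for i.i.d. N(0, sigma2) noise, from the densities directly."""
    d0 = sum((yj - m) ** 2 for yj, m in zip(y, m0))
    d1 = sum((yj - m) ** 2 for yj, m in zip(y, m1))
    return (d0 - d1) / (2.0 * sigma2)


m0, m1 = (0.0, 0.0, 0.0), (1.0, -1.0, 2.0)   # hypothetical known signals
y = (0.9, -0.2, 1.4)                         # hypothetical observation
decision = correlator_decision(y, m0, m1, sigma2=0.5, eta=2.0)
```

Only the inner product with m1 − m0 depends on the data, so the threshold can be computed once in advance.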

Example 5.17 (Uniformly Most Powerful Test) One can also detect unknown signals in Gaussian noise, as follows: Assume that the observations are distributed as

$$ y_j = \begin{cases} w_j & \text{if } H_0 \text{ is true} \\ x_j + w_j & \text{otherwise} \end{cases} \tag{5.122} $$

where xj is the j-th coefficient of a Gaussian vector x which is independent of w, with distribution N(mx, Σx). Again, this is a Gaussian detection problem, with m1 = mx, Σ1 = Σx + σ²I, Σ0 = σ²I, m0 = 0. In this case, the sufficient statistic becomes

$$ \ell(y) = \sigma^{-2} (y^T y) - (y - m_x)^T \Sigma_1^{-1} (y - m_x) = y^T \left[ \sigma^{-2} y - \Sigma_1^{-1} (y - m_x) + \Sigma_1^{-1} m_x \right] - m_x^T \Sigma_1^{-1} m_x \tag{5.123} $$

Thus, the optimal detector is to declare H1 whenever

$$ y^T \left[ \sigma^{-2} y - \Sigma_1^{-1} (y - m_x) + \Sigma_1^{-1} m_x \right] \underset{H_0}{\overset{H_1}{\gtrless}} 2\ln(\eta) + m_x^T \Sigma_1^{-1} m_x + \ln(|\det(\Sigma_1)|) - \ln(|\det(\Sigma_0)|) \tag{5.124} $$

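It is interesting to examine the term on the left-hand side. In particular, note the following relationships which hold true under H1:

$$ \Sigma_{yy} = E\left[ (y - m_x)(y - m_x)^T \mid H_1 \right] = \Sigma_1 = \Sigma_x + \sigma^2 I \tag{5.125} $$

$$ \Sigma_{xy} = E\left[ (x - m_x)(y - m_x)^T \mid H_1 \right] = \Sigma_x \tag{5.126} $$

Thus,

$$ \sigma^{-2} y - \Sigma_1^{-1} y = \Sigma_1^{-1} \left( \sigma^{-2} (\Sigma_x + \sigma^2 I) - I \right) y = \sigma^{-2} \Sigma_1^{-1} \Sigma_x\, y $$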

The above expression can be given an interesting interpretation. Consider the case where mx = 0. Then, using the expression for Gaussian estimation,

$$ E[x \mid y, H_1] = \Sigma_{xy} \Sigma_{yy}^{-1} y = \Sigma_1^{-1} \Sigma_x\, y \tag{5.127} $$

(Σx and Σ1⁻¹ commute because Σ1 = Σx + σ²I), and the optimal detection rule selects H1 whenever

$$ y^T E[x \mid y, H_1] > \sigma^2 \left( 2\ln(\eta) + \ln(|\det(\Sigma_1)|) - \ln(|\det(\Sigma_0)|) \right) $$

In particular, this decision rule is similar to the known signal case, except that the known difference in the means is replaced by E[x | y,H1].

Chapter 6

Stochastic Processes and their Characterization

6.1 Introduction

Consider a random experiment in a probability space (Ω,F,P) where, for every outcome ω ∈ Ω, we assign a real-valued function of time X(t,ω), t ∈ I according to some rule, for t in some totally ordered index set I. For most of our applications, the set I will either be the set of integers, or the set of real numbers. This collection of functions, indexed by outcomes of a probability space, is called a stochastic process or a random process. The index set can be continuous (e.g. the real numbers), in which case we say it is a continuous-time process, or discrete (e.g. the integers), in which case it is a discrete-time process. For a particular outcome ω, the function X(t,ω), t ∈ I can be thought of as a deterministic signal, and it is called a realization of the process.

We get a different view of stochastic processes if we fix a particular time index t, and look at the collection X(t,ω),ω ∈ Ω. For each t ∈ I,X(t,ω) is a random variable. Thus, a stochastic process can also be viewed as an indexed collection of random variables, where the index corresponding to time is either discrete or continuous.

Example 6.1 Let (Ω,F,P) be a probability space where Ω = [0, 1), F is the Borel Field over [0, 1), and P is given by a uniform density function. For each ω ∈ Ω, let X(n,ω) be the n-th bit in the binary expansion of ω. For example, X(1) is a random variable that takes on value 0 when ω ∈ [0, 0.5) and value 1 otherwise; X(2) is a random variable that takes on value 0 when ω ∈ [0, 0.25) ∪ [0.5, 0.75) and value 1 otherwise. X(·) is a discrete-time random process.
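A realization of this process is easy to generate once an outcome ω is fixed. The sketch below (the function name `bit` is illustrative) extracts the n-th binary digit by scaling ω by 2ⁿ and keeping the parity of the integer part.

```python
def bit(n: int, omega: float) -> int:
    """n-th bit (n >= 1) of the binary expansion of omega in [0, 1)."""
    return int(omega * 2 ** n) % 2


# One realization: the first four samples X(1), ..., X(4) for omega = 0.625,
# whose binary expansion is 0.101
realization = [bit(n, 0.625) for n in range(1, 5)]
```

Fixing n and letting ω vary gives the random variable X(n); fixing ω and letting n vary gives a realization, which matches the two views of a stochastic process described above.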

The random process can also be defined in terms of another random variable, as in the example below. Note that we will frequently drop the ω (or other variable) dependence in describing the random process, just as we did for random variables.

Example 6.2 Let Y be selected at random according to an exponential distribution with parameter α. Define the continuous-time random process

$$ X(t) = Y \cos(t), \qquad -\infty < t < \infty. $$

Note that X(0) is a random variable described by an exponential distribution with parameter α, and X(π/3) is a random variable described by an exponential distribution with parameter 2α.
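The claim about X(π/3) can be checked with a short derivation, assuming that "parameter α" means the rate, i.e. that Y has density α e^{−αy} for y ≥ 0. Since cos(π/3) = 1/2, for x ≥ 0:

```latex
\begin{aligned}
\Pr[X(\pi/3) \le x] &= \Pr[\tfrac{1}{2} Y \le x] = \Pr[Y \le 2x] = 1 - e^{-2\alpha x}, \\
p_{X}(x; \pi/3) &= \frac{d}{dx}\left(1 - e^{-2\alpha x}\right) = 2\alpha\, e^{-2\alpha x}, \qquad x \ge 0,
\end{aligned}
```

which is the exponential density with rate 2α. The same argument with cos(0) = 1 gives rate α for X(0).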

Although the majority of this course is concerned with the above (scalar) definition of a stochastic process, it is easy to generalize the definition to the vector case: a vector stochastic process is a collection of random vectors, with values in Rn, indexed by t ∈ I.

6.2 Complete Characterization of Stochastic Processes

We have already considered the characterization and properties of a finite collection of random variables, which are essentially random vectors. In particular, we characterized random vectors in terms of their joint probability distribution function. Our approach to obtaining a complete characterization of stochastic processes will build on this approach.

Consider a set of sampling times t1, t2, . . . , tk ∈ I, and let Xi = X(ti) denote the random variable obtained by fixing the value of the process at each time ti, i = 1, . . . ,k. For any finite value k, we have a vector of random variables X = [X1 · · ·Xk]T , and we can completely specify this vector through specification of its joint probability distribution function:

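$$ \begin{aligned} P_X(x_1, \ldots, x_k) &= P(\{\omega : X_1(\omega) \le x_1, \ldots, X_k(\omega) \le x_k\}) \\ &= P(\{\omega : X(t_1,\omega) \le x_1, \ldots, X(t_k,\omega) \le x_k\}) \\ &= P_{X(t^k)}(x_1, x_2, \ldots, x_k), \end{aligned} \tag{6.1} $$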

where we use the notational abbreviation $X(t^k) = [X(t_1) \cdots X(t_k)]^T$. Note that this joint probability distribution function is defined for any valid finite collection of time indices t1, t2, . . . , tk ∈ I. If the stochastic process is continuous-valued, it is often easier to speak of the joint probability density function $p_{X(t^k)}(x_1, x_2, \ldots, x_k)$. We also use the notation pX(x1, x2, . . . , xn; t1, . . . , tn) to denote a joint density of random variables sampled at specific times. Similarly, for discrete-valued stochastic processes, it may be more convenient to work with the joint probability mass function. Now a complete characterization of a random process can be obtained through the specification of the complete set of k-th order finite-dimensional distributions in (6.1). That is, specification of (6.1) for all orders k and all possible sets of sampling points tj is a complete characterization of a stochastic process.

At first glance, the specification of the complete set of joint probability distribution functions seems like an enormous task, as we must specify the properties of all possible subsets of time indices. However, the mechanism for generating different random processes often makes it simple to specify the set of probability distribution functions of interest. Furthermore, we often care only about first- and second-order moments; we define these in the next subsection.

6.3 First and Second-Order Moments of Stochastic Processes

The moments of time samples of a random process can be used to summarize the information in the joint probability distribution function. We define these moments as follows.

First, a definition. A random process is called a second order random process if E[X(t)²] < ∞ for all t ∈ I. For such processes, we define the following statistics:

The mean of a random process X(·), denoted by mX(t), is the time function defined by

$$ m_X(t) = E[X(t)] = \int_{-\infty}^{\infty} x\, p_X(x; t)\, dx $$

for a continuous-valued process, where pX(x; t) is the density of the random variable X(t), or

$$ m_X(t) = E[X(t)] = \sum_{x=-\infty}^{\infty} x\, p_X(x; t) $$

for a discrete-valued process. (Note that the use of summation vs. integral in computing an expectation for a fixed time (or set of times) in a random process depends on the values that the associated variables take on, not whether the time index is discrete vs. continuous.)

The autocorrelation function of a random process X(·), denoted by RX(s,t), is defined as the joint moment of X(s) and X(t):

$$ R_X(s,t) = E[X(s)X(t)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 x_2\, p_X(x_1, x_2; s, t)\, dx_1\, dx_2 $$

For a discrete-valued random process, the double integral would be replaced by a double sum.

The autocovariance function of a random process X(·), denoted by KX(s,t), is defined as the covariance of X(s) and X(t):

$$ K_X(s,t) = E[(X(s) - m_X(s))(X(t) - m_X(t))] = R_X(s,t) - m_X(s)\, m_X(t) \tag{6.2} $$

Note that the variance of X(t) is given by KX(t,t).

The mean, autocovariance and autocorrelation functions of a random process are only partial descriptions of the process. There can be many different random processes with the same mean and autocorrelation functions.
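These moments can be computed exactly for the binary-expansion process of Example 6.1. In the sketch below, the expectation over ω is an average over the midpoints of the 2^depth dyadic subintervals of [0, 1); this is exact for time indices n ≤ depth because each X(n, ω) is constant on those subintervals and P is uniform. The function names and the choice depth = 8 are illustrative.

```python
def bit(n: int, omega: float) -> int:
    """n-th bit (n >= 1) of the binary expansion of omega in [0, 1)."""
    return int(omega * 2 ** n) % 2


def ensemble_moments(n_max: int, depth: int = 8):
    """Mean, autocorrelation, and autocovariance of X(1), ..., X(n_max)."""
    grid = [(k + 0.5) / 2 ** depth for k in range(2 ** depth)]
    times = range(1, n_max + 1)
    mean = {n: sum(bit(n, w) for w in grid) / len(grid) for n in times}
    corr = {(s, t): sum(bit(s, w) * bit(t, w) for w in grid) / len(grid)
            for s in times for t in times}
    cov = {(s, t): corr[s, t] - mean[s] * mean[t]
           for s in times for t in times}
    return mean, corr, cov


mean, corr, cov = ensemble_moments(n_max=3)
```

The result is mX(n) = 1/2, KX(n,n) = 1/4, and KX(s,t) = 0 for s ≠ t, consistent with the bits being independent fair coin flips.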

It is often useful to characterize the relationship between two random processes. The cross-correlation function of random processes X(·) and Y (·), denoted by RXY (s,t) is defined as the joint moment of X(s) and Y (t):

$$ R_{XY}(s,t) = E[X(s)Y(t)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x y\, p_{XY}(x, y; s, t)\, dx\, dy $$

Again, for the discrete-valued random process, the double integral would be replaced by a double sum. Similarly, the cross-covariance function of random processes X(·) and Y (·), denoted by KXY (s,t), is defined as the covariance of X(s) and Y (t) as

KXY (s,t) = E[(X(s) −mX(s))(Y (t) −mY (t))] = RXY (s,t) −mX(s)mY (t) (6.3)

6.4 Special Classes of Stochastic Processes

In this subsection, we discuss special classes of random processes for which it is relatively simple to specify the joint probability distribution function for any set of times.

Definition 6.1 (Independent and Identically Distributed Process) A discrete-time stochastic process is said to be independent and identically distributed (i.i.d.) if the joint distribution for any distinct sampling times n1, . . . , nk can be expressed as the product of the first order marginal distributions:

$$ p_X(x_1, \ldots, x_k; n_1, \ldots, n_k) = \prod_{i=1}^{k} p_X(x_i), $$

where the first order marginal pX(x; n) = pX(x) is independent of time.

The i.i.d. process is perhaps the simplest possible class, since it can be specified completely in terms of a scalar density or mass function.

Definition 6.2 (Gaussian Stochastic Process) A stochastic process is said to be Gaussian if the samples X(t1), . . . , X(tk) are jointly Gaussian random variables for any sampling times t1, . . . , tk.

Recall that the probability density function of jointly Gaussian random variables is determined by the vector of means and by the covariance matrix, as

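$$ p_X(x_1, \ldots, x_k; t_1, \ldots, t_k) = \frac{e^{-\frac{1}{2}(x - m_X)^T \Sigma_X^{-1} (x - m_X)}}{(2\pi)^{k/2}\, |\det \Sigma_X|^{1/2}}, $$

where

$$ x = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix}, \qquad m_X = \begin{bmatrix} m_X(t_1) \\ \vdots \\ m_X(t_k) \end{bmatrix}, \qquad \Sigma_X = \begin{bmatrix} K_X(t_1,t_1) & K_X(t_1,t_2) & \cdots & K_X(t_1,t_k) \\ \vdots & \vdots & & \vdots \\ K_X(t_k,t_1) & K_X(t_k,t_2) & \cdots & K_X(t_k,t_k) \end{bmatrix}. $$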

Thus, Gaussian random processes are completely specified by the process mean mX(t) and autocovariance function KX(t1, t2). Furthermore, Gaussian processes have additional properties which make them particularly useful in the analysis of linear systems driven by stochastic processes, as we will see in later sections.
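To make this concrete, the sketch below builds mX and ΣX for a chosen set of sampling times from a mean function and an autocovariance function, and evaluates the second-order (k = 2) Gaussian density using the explicit 2×2 inverse. The zero mean and the autocovariance KX(s,t) = min(s,t) are assumptions picked for illustration, as are the function names.

```python
from math import exp, pi, sqrt


def mean_vector(m_fn, times):
    """Stack m_X(t_i) for the chosen sampling times."""
    return [m_fn(t) for t in times]


def covariance_matrix(k_fn, times):
    """Matrix with entries K_X(t_i, t_j)."""
    return [[k_fn(s, t) for t in times] for s in times]


def gaussian_pdf_2d(x, mean, cov):
    """Jointly Gaussian density for k = 2, using the 2x2 inverse explicitly."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return exp(-0.5 * quad) / (2.0 * pi * sqrt(det))


times = (1.0, 2.0)
m = mean_vector(lambda t: 0.0, times)                    # assumed zero mean
sigma = covariance_matrix(lambda s, t: min(s, t), times)  # assumed K_X
density_at_mean = gaussian_pdf_2d((0.0, 0.0), m, sigma)
```

Any other set of sampling times only changes the entries of mX and ΣX; no further modeling information is needed, which is the sense in which the two functions specify the process.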

Definition 6.3 (Independent Increments) A stochastic process X(t) is an independent increments process if for all s < t the random variables X(t) − X(s) and X(τ) are independent for any τ ≤ s.

Note, in particular, that this implies that if t1 ≤ t2 ≤ ··· then the increments of the process X(t2) −X(t1), X(t3) − X(t2), · · · are independent. This property makes it easier to compute the joint probability mass function, as follows:

\[ \begin{aligned}
p_X(x_1,\ldots,x_k;\, t_1,\ldots,t_k) &= \Pr[X(t_1) = x_1, \ldots, X(t_k) = x_k] \\
&= \Pr[X(t_1) = x_1,\, X(t_2) - X(t_1) = x_2 - x_1,\, \ldots,\, X(t_k) - X(t_{k-1}) = x_k - x_{k-1}] \\
&= \Pr[X(t_1) = x_1]\,\Pr[X(t_2) - X(t_1) = x_2 - x_1] \cdots \Pr[X(t_k) - X(t_{k-1}) = x_k - x_{k-1}]
\end{aligned} \]

due to the independent increments property. Similarly, joint density functions or distribution functions can be computed in terms of a first-order marginal density or distribution function and the distributions of the independent increments.
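The factorization into increment probabilities can be checked numerically. The following is a minimal Python sketch, assuming the ±1 random walk of Section 6.5.1 started at X(0) = 0; the function names (`step_sum_pmf`, `joint_pmf`, `joint_pmf_brute_force`) are illustrative choices, not part of the notes. It compares the product of increment probabilities with a brute-force sum over all step sequences.

```python
from itertools import product
from math import comb

def step_sum_pmf(n_steps, total, p):
    """P[sum of n_steps i.i.d. +/-1 steps = total], with P[step = +1] = p."""
    ups, remainder = divmod(n_steps + total, 2)
    if remainder or ups < 0 or ups > n_steps:
        return 0.0
    return comb(n_steps, ups) * p**ups * (1 - p) ** (n_steps - ups)

def joint_pmf(times, values, p):
    """Joint pmf of X(t1), ..., X(tk) as a product of increment probabilities.

    Assumes X(0) = 0 and increasing integer sampling times.
    """
    prob, prev_t, prev_x = 1.0, 0, 0
    for t, x in zip(times, values):
        prob *= step_sum_pmf(t - prev_t, x - prev_x, p)
        prev_t, prev_x = t, x
    return prob

def joint_pmf_brute_force(times, values, p):
    """Same joint pmf, summing over every step sequence up to the last time."""
    total = 0.0
    for steps in product((1, -1), repeat=times[-1]):
        path = [0]
        for w in steps:
            path.append(path[-1] + w)
        if all(path[t] == x for t, x in zip(times, values)):
            ups = steps.count(1)
            total += p**ups * (1 - p) ** (len(steps) - ups)
    return total

print(joint_pmf((2, 5, 6), (0, 1, 2), 0.3))
print(joint_pmf_brute_force((2, 5, 6), (0, 1, 2), 0.3))
```

The product form needs only k binomial evaluations, while the brute-force sum grows exponentially with the last sampling time.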

Definition 6.4 (Markov Process) A stochastic process X(·) is said to be Markov if the future of the process is independent of its past, conditioned on the present value of the process. That is, for any choice of sampling instances t1 < t2 < ... < tk,

Pr(X(tk) = xk|X(t1) = x1, . . . ,X(tk−1) = xk−1) = Pr(X(tk) = xk | X(tk−1) = xk−1)

The above equation states that X(tk) is independent of X(t1), . . . ,X(tk−2), conditioned on knowing the value of X(tk−1). Note that an independent increments process is necessarily Markov; however, Markov processes are not necessarily independent increments processes.

The Markov property makes it simpler to compute joint probability density (or mass) functions, as follows:

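\[ \begin{aligned}
p_X(x_1,\ldots,x_k;\, t_1,\ldots,t_k) &= p_X(x_1; t_1)\, p_X(x_2; t_2 \mid x_1; t_1) \cdots p_X(x_k; t_k \mid x_1,\ldots,x_{k-1};\, t_1,\ldots,t_{k-1}) \\
&= p_X(x_1; t_1)\, p_X(x_2; t_2 \mid x_1; t_1) \cdots p_X(x_k; t_k \mid x_{k-1}; t_{k-1}) \\
&= p_X(x_1; t_1) \prod_{i=2}^{k} p_X(x_i; t_i \mid x_{i-1}; t_{i-1})
\end{aligned} \]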

Thus, Markov processes can be characterized by a marginal probability density pX(x1; t1) and transition probability densities pX(xk; tk|xk−1; tk−1) (or the equivalent pmfs for discrete-valued processes). We call pX(xk; tk|xk−1; tk−1) the transition probability density function (or probability mass function) of the Markov process. It is easy to establish that transition probability densities satisfy the Chapman-Kolmogorov equation. Let t1 < t2 < t3; then we have:

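\[ \begin{aligned}
p_X(x_3; t_3 \mid x_1; t_1) &= \int_{-\infty}^{\infty} p_X(x_3, x_2;\, t_3, t_2 \mid x_1; t_1)\, dx_2 \\
&= \int_{-\infty}^{\infty} p_X(x_3; t_3 \mid x_2, x_1;\, t_2, t_1)\, p_X(x_2; t_2 \mid x_1; t_1)\, dx_2 \\
&= \int_{-\infty}^{\infty} p_X(x_3; t_3 \mid x_2; t_2)\, p_X(x_2; t_2 \mid x_1; t_1)\, dx_2,
\end{aligned} \tag{6.4} \]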

where the second equality follows from the definition of conditional distributions, and the final equality follows from the Markov property.
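For a discrete-valued process the integral in (6.4) becomes a sum over the intermediate state. The sketch below illustrates this under assumptions that are not in the notes: a time-homogeneous, finite-state Markov chain with a hypothetical one-step transition matrix `P`, and illustrative function names. The left side is computed by summing over all intermediate paths, and the right side by the Chapman-Kolmogorov sum.

```python
from itertools import product

def transition_prob(P, steps, start, end):
    """Pr[X(t + steps) = end | X(t) = start], summed over all intermediate paths.

    P[a][b] is the one-step transition probability from state a to state b;
    steps must be at least 1.
    """
    states = range(len(P))
    total = 0.0
    for middle in product(states, repeat=steps - 1):
        path = (start, *middle, end)
        prob = 1.0
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]
        total += prob
    return total

def chapman_kolmogorov_rhs(P, t1, t2, t3, x1, x3):
    """Sum over the intermediate state x2, the discrete analogue of (6.4)."""
    return sum(
        transition_prob(P, t3 - t2, x2, x3) * transition_prob(P, t2 - t1, x1, x2)
        for x2 in range(len(P))
    )

# Hypothetical three-state chain used only for illustration.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]
print(transition_prob(P, 5, 0, 2))
print(chapman_kolmogorov_rhs(P, 0, 2, 5, 0, 2))
```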

In order to exploit the properties of Markov processes, it is useful to generalize the definition to vector-valued processes. A vector-valued stochastic process X(·) is said to be Markov if, for any choice of sampling instances t1 < t2 < ... < tk,

pX(xk; tk|x1, . . . ,xk−1; t1, . . . , tk−1) = pX(xk; tk|xk−1; tk−1).

In many cases, one can transform a non-Markov scalar process into a vector Markov process. We illustrate this below for both discrete and continuous time.

Example 6.3 Consider a discrete-time scalar-valued random process X(n), where

pX(xk; tk|x1, . . . ,xk−1; t1, . . . , tk−1) = pX(xk; tk|xk−1,xk−2; tk−1, tk−2).

Define a new process

\[ Z(n) = \begin{bmatrix} X(n) \\ X(n-1) \end{bmatrix}. \]

It is easy to see that Z(n) is Markov, as follows:

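\[ \begin{aligned}
p_Z(z_k; t_k \mid z_1,\ldots,z_{k-1};\, t_1,\ldots,t_{k-1}) &= p_X(x_k, x_{k-1};\, t_k, t_{k-1} \mid x_0,\ldots,x_{k-1};\, t_0,\ldots,t_{k-1}) \\
&= p_X(x_k, x_{k-1};\, t_k, t_{k-1} \mid x_{k-1}, x_{k-2};\, t_{k-1}, t_{k-2}) \\
&= p_Z(z_k; t_k \mid z_{k-1}; t_{k-1}).
\end{aligned} \tag{6.5} \]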

Example 6.4 Consider an independent increments process U(·), with enough assumptions to be integrable (a topic which we will discuss later in the notes). Define a new process

\[ Y(t) = \int_0^t \int_0^s U(\tau)\, d\tau\, ds \]

The process is not Markov, as we can see by considering P(y(3)|y(2),y(1)). Note that Y satisfies the following differential equation:

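\[ \frac{d^2}{dt^2} Y = U \]

Thus, we can write

\[ \begin{aligned}
Y(3) &= Y(2) + \left.\frac{d}{dt} Y(t)\right|_{t=2} + \int_2^3 \int_2^s U(\tau)\, d\tau\, ds \\
&= Y(2) + \int_0^2 U(s)\, ds + \int_2^3 \int_2^s U(\tau)\, d\tau\, ds
\end{aligned} \tag{6.6} \]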

It is clear from the above equation that the conditional density of Y (3) depends on Y (2) and Y (1), since the value of Y (1) will be highly correlated with the term $\int_0^2 U(s)\,ds$. Now, define the augmented state

\[ Z = \begin{bmatrix} Y \\ \frac{d}{dt} Y \end{bmatrix}. \]

The vector process Z is now Markov. To illustrate this, consider how Z(t) depends on previous values of Z(τ),τ < t:

\[ Z(t) = \begin{bmatrix} 1 & t-\tau \\ 0 & 1 \end{bmatrix} Z(\tau) + \int_\tau^t \begin{bmatrix} (t-s)\,U(s) \\ U(s) \end{bmatrix} ds. \]

Due to the independent increments property of U(·), it is clear that the integral on the right hand side is independent of Z(s) for any value s < τ, which establishes the vector Markov property.

Earlier, we defined an independent increments process as one where, given any ordered set of times t0 < t1 < ... < tn, the increments X(t1) −X(t0),X(t2) −X(t1), . . . ,X(tn) −X(tn−1) are mutually independent random variables. When the increments have zero mean, this results in a process with special properties, known as a martingale. The formal definition of a martingale is given below:

Definition 6.5 (Martingale) A stochastic process X(·) is a martingale if

1. E[|X(t)|] < ∞ for all t.

2. Given two times s < t, E[X(t)|{X(s1),s1 ≤ s}] = X(s).

Thus, martingale increments have zero-mean, because

E[X(t) −X(s)] = E[E[X(t) −X(s)|X(s)]] = E[X(s) −X(s)] = 0.

An example of a martingale is a random walk with equal probability of increasing or decreasing (p = 1/2). For such a process, we see that E[X(t)] = E[X(0)], and E[X(t)|X(s)] = X(s), establishing the martingale property. Similarly, an independent increments process with X(0) equal to a constant and with zero-mean increments is a martingale, as, for any t1 > t0,E[X(t1) −X(t0)|X(t0)] = E[X(t1) −X(t0)] = 0.

When the increments are not zero-mean but have a negative mean, we get processes that are called supermartingales.

Definition 6.6 (Supermartingale) A stochastic process X(·) is a supermartingale if

1. E[|X(t)|] < ∞ for all t.

2. Given two times s < t, E[X(t)|{X(s1),s1 ≤ s}] ≤ X(s).

The martingale property leads to many strong results that allow us to make statements concerning the probability of maximum values. The following result is due to Doob:

Theorem 6.1

• Let X(0),X(1), . . . be non-negative random variables such that E[X(k + 1)|X(0), . . . ,X(k)] ≤ X(k) (a non-negative supermartingale). Then, for any γ > 0,

\[ P\left[ \max_{0 \le k \le n} X(k) \ge \gamma \right] \le \frac{E[X(0)]}{\gamma} \]

• Let X(0),X(1), . . . be a martingale sequence with E[X(n)^2] < ∞ for some n. Then,

\[ E\left[ \left( \max_{0 \le k \le n} X(k) \right)^2 \right] \le 4E[X(n)^2] \]
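As a sanity check on the second inequality, the following Python sketch evaluates both sides exactly for a symmetric ±1 random walk started at X(0) = 0, which is a martingale. This is an illustration under that specific assumption, not a proof, and the function name `doob_l2_terms` is a hypothetical choice.

```python
from itertools import product

def doob_l2_terms(n):
    """Return (E[(max_{0<=k<=n} X(k))^2], 4 E[X(n)^2]) for the symmetric walk.

    Every one of the 2**n step sequences has probability 0.5**n, so both
    expectations are computed by exhaustive enumeration.
    """
    weight = 0.5**n
    lhs = rhs = 0.0
    for steps in product((1, -1), repeat=n):
        position = running_max = 0  # X(0) = 0 is included in the maximum
        for w in steps:
            position += w
            running_max = max(running_max, position)
        lhs += weight * running_max**2
        rhs += weight * 4 * position**2
    return lhs, rhs

for n in (1, 4, 8, 12):
    print(n, doob_l2_terms(n))
```

For this walk E[X(n)^2] = n, so the right side equals 4n, and the enumeration shows the left side staying below it for every n tried.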

6.5 Examples of Stochastic Processes

6.5.1 The Random Walk

Many of the special classes of processes described above arise from considering processes that are built from collections of independent random variables. That is, we start with an independent process W(t), t ∈ I, and then define a new process X(t) that is formed from elements of W(t). In this section we consider a process called the random walk. Let W(1),W(2), . . . be independent, identically distributed random variables that take values in {−1, 1}, with P[W(t) = 1] = p and P[W(t) = −1] = 1 − p for all t ≥ 1. Let X(0) denote an integer-valued random variable, independent of all the W(t) variables. The variable X(0) is the starting position of the random walk, and the variable W(t) denotes the random step between X(t− 1) and X(t). We define

X(t) = X(0) + W(1) + . . . + W(t)

Note that this can also be expressed recursively as

X(t) = X(t− 1) + W(t)

This model is similar in structure to the Gaussian models used in the Kalman filter for representing the state. However, the random variables X(t) are discrete-valued, not Gaussian. We call the random process X(t), t ∈ {0, 1, . . .}, a random walk.

We summarize some of the properties of random walks below: Let X(t), t ∈ {0, 1, . . . ,} be a random walk as described above. Then

• E[X(t)] = E[X(0)] + (2p− 1)t.

• Var[X(t)] = Var[X(0)] + 4tp(1 −p).

• $\lim_{n\to\infty} \frac{X(n)}{n} = 2p-1$ almost everywhere (strong law of large numbers).

• $\lim_{n\to\infty} \frac{X(n)}{n} = 2p-1$ in the mean-square sense (weak law of large numbers and finite second moments).

• Assume X(0) is known. Then $\lim_{t\to\infty} P\left( \frac{X(t) - X(0) - t(2p-1)}{\sqrt{4tp(1-p)}} \le a \right) = \Phi(a)$, where Φ(a) is the standard normal probability distribution function (per the Central Limit Theorem).

• Assume X(0) = 0. Then $P(X(t) = 2j - t) = \binom{t}{j} p^j (1-p)^{t-j}$ for j ∈ {0, 1, . . . , t}.

The last property follows because, when X(0) = 0, X(t) is the sum of Bernoulli random variables (scaled and shifted), and this sum has a binomial distribution.

From its construction, the random walk is seen to be an independent increments process (the increments are W(t), which are independent by assumption), and a Markov process. When p = 1/2, so that the steps have zero mean, it is also a martingale in the sense of Definition 6.5.
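The binomial form of the distribution makes the mean and variance expressions easy to verify numerically. Here is a small Python sketch, assuming X(0) = 0; the helper names `walk_pmf` and `mean_and_variance` are illustrative and not from the notes.

```python
from math import comb

def walk_pmf(t, p):
    """pmf of X(t) given X(0) = 0, as a dict {position: probability}.

    j counts the +1 steps, so the position after t steps is 2j - t.
    """
    return {2 * j - t: comb(t, j) * p**j * (1 - p) ** (t - j) for j in range(t + 1)}

def mean_and_variance(pmf):
    """Mean and variance of a pmf given as {value: probability}."""
    mean = sum(x * q for x, q in pmf.items())
    variance = sum((x - mean) ** 2 * q for x, q in pmf.items())
    return mean, variance

t, p = 12, 0.3
mean, variance = mean_and_variance(walk_pmf(t, p))
print(mean, (2 * p - 1) * t)          # numerical mean vs. (2p - 1) t
print(variance, 4 * t * p * (1 - p))  # numerical variance vs. 4 t p (1 - p)
```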

The independent increments property of random walks makes it easy to compute the second order properties of random walks. We know the mean of the process already, as

mX(t) = E[X(0)] + (2p− 1)t

The autocorrelation function, assuming n > m, is given by

\[ \begin{aligned}
R_X(m,n) &= E[X(n)X(m)] \\
&= E[(X(n) - X(m) + X(m))X(m)] \\
&= E[(X(n) - X(m))X(m)] + E[X(m)^2] \\
&= E[X(n) - X(m)]\,E[X(m)] + E[X(m)^2] && \text{by the independent increments property} \\
&= (2p-1)(n-m)\big(E[X(0)] + (2p-1)m\big) + \mathrm{Var}[X(0)] + 4mp(1-p) + \big(E[X(0)] + (2p-1)m\big)^2 && \text{substituting the mean and variance expressions above}
\end{aligned} \]

When m > n, the roles of m,n need to reverse in the expansion. We can write a combined expression by identifying the maximum and minimum of m,n, as follows:

\[ \begin{aligned}
R_X(m,n) ={}& (2p-1)\big(\max(m,n) - \min(m,n)\big)\big(E[X(0)] + (2p-1)\min(m,n)\big) \\
& + \mathrm{Var}[X(0)] + 4p(1-p)\min(m,n) + \big(E[X(0)] + (2p-1)\min(m,n)\big)^2
\end{aligned} \]

Similarly, the autocovariance is given by

\[ K_X(m,n) = \mathrm{Var}[X(0)] + 4p(1-p)\min(m,n) \]
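The min(m, n) dependence of the autocovariance can be confirmed by direct enumeration. The sketch below assumes a deterministic start X(0) = 0, so that Var[X(0)] = 0; the function name `autocovariance` is illustrative.

```python
from itertools import product

def autocovariance(m, n, p):
    """K_X(m, n) for the +/-1 random walk with X(0) = 0.

    Sums over all step sequences of length max(m, n), weighting each by its
    probability, and returns E[X(m)X(n)] - E[X(m)]E[X(n)].
    """
    horizon = max(m, n)
    e_m = e_n = e_mn = 0.0
    for steps in product((1, -1), repeat=horizon):
        ups = steps.count(1)
        weight = p**ups * (1 - p) ** (horizon - ups)
        x_m, x_n = sum(steps[:m]), sum(steps[:n])
        e_m += weight * x_m
        e_n += weight * x_n
        e_mn += weight * x_m * x_n
    return e_mn - e_m * e_n

p = 0.3
print(autocovariance(3, 7, p), 4 * p * (1 - p) * min(3, 7))
print(autocovariance(7, 3, p), 4 * p * (1 - p) * min(7, 3))
```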

The random walk process has several important properties. First, it is clearly an independent increments process, since the sequence {W(n)} is independent, identically distributed. Second, it is a Markov process (a consequence of the independent increments property). Third, when p = 1/2 it is a martingale, since the increments then have zero mean, so that, if n > m,

\[ E[X(n) \mid X(m)] = E\left[ X(m) + \sum_{i=m+1}^{n} W(i) \,\Big|\, X(m) \right] = X(m), \]

because of the independence between X(n) −X(m) and X(m).

One of the applications of random walks is in pricing of stock options in the stock market, where stock prices are modeled as performing a random walk. It is also used to model gambling problems that evolve over time, as follows: Let X(t) denote the amount of money a person has available; thus, X(0) is the amount of money one brings into the game. We assume that X(0) > 0. At each time, the gambler plays a game that can have payoff 1 with probability p, and loss 1 with probability 1 −p (the bet is always the same at each time). We can see that the process X(t) obeys a random walk model as before. However, we want to stop the game under two conditions: If X(t) = 0, then the game stops and X(s) = 0 for all s ≥ t, as the gambler is out of money (no credit here). Similarly, if X(t) = A > X(0), the gambler decides to retire with enough winnings, so there will be no further betting, and X(s) = A for all s ≥ t.

We are interested in computing the probability that the gambler will retire with enough winnings, given that we start with X(0) = k. To do this, we exploit the recursive nature of the random walk definition, as follows. Let v(k) denote the success probability for the gambler when his current amount of money is k. Clearly, v(k) = 1 for k ≥ A, and v(k) = 0 for k ≤ 0. The other values need to be computed. Now, assume there is one bet, for which there are two outcomes. The key observation is the following: based on this outcome, the gambler will have either k + 1 or k − 1 units of money. Furthermore, we know the probabilities of the outcomes, so conditioning on the outcome, we obtain the following:

v(k) = P(W(1) = 1)v(k + 1) + P(W(1) = −1)v(k − 1) = pv(k + 1) + (1 −p)v(k − 1)

Because the W(t) are i.i.d., the above equation does not depend on t. What results is a set of A− 1 equations (one for each integer 0 < k < A) in A− 1 unknowns, which can be solved to obtain the probability of success given the initial amount of money k. We can also solve the above equation by recognizing it as a linear, shift-invariant difference equation in the discrete index k. Such equations have solutions that are exponentials in k, of the form $v(k) = aR_1^k + bR_2^k$, where the roots R1,R2 must solve the characteristic equation

\[ pz^2 - z + (1-p) = 0 \]

Solving the quadratic, we get $R_1 = 1$, $R_2 = \frac{1-p}{p}$, which means the roots will be different as long as p ≠ 0.5. Thus,

\[ v(k) = a + b\left(\frac{1-p}{p}\right)^k \]

We select the constants a,b to match the boundary conditions v(0) = 0,v(A) = 1, which imply a = −b and $a + b\left(\frac{1-p}{p}\right)^A = 1$, so $a = \frac{1}{1 - \left(\frac{1-p}{p}\right)^A}$, and

\[ v(k) = \frac{1 - \left(\frac{1-p}{p}\right)^k}{1 - \left(\frac{1-p}{p}\right)^A} \]

which provides a complete solution to the problem when p ≠ 0.5. When p = 0.5 the root z = 1 is repeated, the general solution is v(k) = a + bk, and the boundary conditions give v(k) = k/A.
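The boundary-value problem v(k) = p v(k + 1) + (1 − p) v(k − 1), v(0) = 0, v(A) = 1 can also be solved numerically and compared with the closed form (1 − r^k)/(1 − r^A), r = (1 − p)/p (or k/A when p = 1/2). The sketch below uses a simple shooting approach, which is one possible method and not the one used in the notes: start from v(0) = 0 with a trial value v(1) = 1, run the recursion forward, then rescale by linearity. Function names are illustrative.

```python
def success_probabilities(p, A):
    """Solve v(k) = p v(k+1) + (1-p) v(k-1) with v(0) = 0, v(A) = 1.

    Rearranging gives v(k+1) = (v(k) - (1-p) v(k-1)) / p. Because the
    recursion is linear and v(0) = 0, any trial value for v(1) yields a
    multiple of the true solution, which is then rescaled so v(A) = 1.
    """
    v = [0.0, 1.0]  # v(0) = 0 and the trial value v(1) = 1
    for k in range(1, A):
        v.append((v[k] - (1 - p) * v[k - 1]) / p)
    return [x / v[A] for x in v]

def closed_form(p, A, k):
    """Closed-form success probability starting from k units of money."""
    if p == 0.5:
        return k / A
    r = (1 - p) / p
    return (1 - r**k) / (1 - r**A)

p, A = 0.45, 10
numeric = success_probabilities(p, A)
for k in range(A + 1):
    print(k, numeric[k], closed_form(p, A, k))
```

Even a slightly unfavorable game (p = 0.45) makes reaching A from a small stake unlikely, which the printed table makes visible.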

6.5.2 The Poisson Process

Also known as the Poisson counting process (PCP), the Poisson process is a popular model used in communications, manufacturing and other network applications to model the arrival stream of customers, calls or jobs. The Poisson process value N(t) is the total number of arrivals in the interval [0, t]. Thus the Poisson process counts up from 0 and takes on non-negative integer values. As we will see, the PCP is equivalently defined as an independent increments process whose increments are Poisson distributed. Construction of a Poisson process can be accomplished through a series of steps, which illuminate its connections to applications. We discuss these steps next.

Step 1) First we model the times τk between “arrivals” of the process as a sequence of independent, identically distributed exponential random variables with parameter λ. These times τk are called the “interarrival times” and are illustrated in Figure 6.1. The assumption of independent, identically distributed exponential interarrival times turns out to be a good model of many physical processes, such as subway arrivals, customers entering a line, etc. The exponential density of each interarrival time is given by:

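\[ p_{\tau_k}(t) = \lambda e^{-\lambda t} u(t), \qquad k = 1, 2, \cdots \tag{6.7} \]

where u(t) is the unit step function, defined by

\[ u(t) = \begin{cases} 0 & \text{if } t \le 0 \\ 1 & \text{otherwise} \end{cases} \tag{6.8} \]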

Figure 6.1: Interarrival Times τk.

Step 2) Now we can define the sequence of “event times” T(n), which are the times at which the arrivals or events happen. In terms of the interarrival times we have:

\[ T(n) = \sum_{k=1}^{n} \tau_k, \tag{6.9} \]

where the interarrival times τk are independent, identically distributed exponential random variables, as described above. Figure 6.2 shows the relationship between the τk and T(n). Now, note that T(n) is defined as the sum of a series of independent, identically distributed random variables. Thus its pdf can be obtained either by convolving the individual exponential pdfs or by finding the product of the corresponding characteristic functions. It is easiest to use the characteristic function approach:

\[ E\left[ e^{j\omega T(n)} \right] = E\left[ e^{j\omega \sum_{k=1}^{n} \tau_k} \right] = \left( E\left[ e^{j\omega \tau_k} \right] \right)^n \tag{6.10} \]

Using the definition of the characteristic function of the exponential distribution from Table 1.2 we obtain:

\[ E\left[ e^{j\omega T(n)} \right] = \frac{\lambda^n}{(\lambda - j\omega)^n}. \tag{6.11} \]

Taking inverse Fourier transforms, we get

\[ p_{T(n)}(t) = \frac{\lambda^n t^{n-1}}{(n-1)!}\, e^{-\lambda t} u(t). \tag{6.12} \]

Examining the form of pT(n)(t) and comparing to the table of common densities in Table 1.2, we can see it is an Erlang distribution. Note that this distribution has mean $m_{T(n)} = n/\lambda$ and variance $\sigma^2_{T(n)} = n/\lambda^2$.

Figure 6.2: Arrival times T(n) and interarrival times τk.

Step 3) Finally, suppose we let T(n) be the times where the Poisson process takes a unit step jump. Note that these are random times, and that by construction the time between jumps has an exponential distribution. Mathematically then, the Poisson counting process (sometimes just called the counting process) is finally described as the sum of these shifted step functions:

\[ N(t) = \sum_{n=1}^{\infty} u(t - T(n)), \tag{6.13} \]

so that, indeed, N(t) is the number of arrivals in the interval [0, t].

Now we need to construct the probability mass function pN(t)(m). Note that

pN(t)(m) = Pr(N(t) = m) = Pr (T(m) ≤ t,T(m + 1) > t) = Pr (T(m) ≤ t,τm+1 > t−T(m)) (6.14)

which is the probability of all possible ways we can have T(m) ≤ t and τm+1 > t−T(m) for that T(m). Now, by construction, T(m) and τm+1 are independent, so that:

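\[ \begin{aligned}
p_{N(t)}(m) &= \int_0^t p_{T(m)}(u) \left[ \int_{t-u}^{\infty} p_{\tau_{m+1}}(v)\, dv \right] du \\
&= \int_0^t \frac{\lambda^m u^{m-1}}{(m-1)!}\, e^{-\lambda u} \left[ \int_{t-u}^{\infty} \lambda e^{-\lambda v}\, dv \right] du \\
&= \int_0^t \frac{\lambda^m u^{m-1}}{(m-1)!}\, e^{-\lambda u} e^{-\lambda (t-u)}\, du \\
&= \frac{(\lambda t)^m}{m!}\, e^{-\lambda t}, \qquad t, m \ge 0
\end{aligned} \tag{6.15} \]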

which basically states that, at time t, the distribution of the Poisson process is a Poisson random variable with parameter λt. The relationship between τk, T(n), and N(t) is shown in Figure 6.3.
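The three construction steps translate directly into a simulation. The following Python sketch draws exponential interarrival times, accumulates them into arrival times, counts how many fall in [0, t], and compares the empirical distribution of N(t) with the Poisson pmf. The rate, horizon, trial count, seed, and function names are arbitrary illustrative choices.

```python
import random
from math import exp, factorial

def count_arrivals(rate, t, rng):
    """N(t): number of arrival times T(n) = tau_1 + ... + tau_n in [0, t]."""
    elapsed, count = rng.expovariate(rate), 0  # first arrival time T(1)
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(rate)  # add the next interarrival time
    return count

def empirical_pmf(rate, t, trials, seed=0):
    """Relative frequencies of N(t) over independent simulated sample paths."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        m = count_arrivals(rate, t, rng)
        counts[m] = counts.get(m, 0) + 1
    return {m: c / trials for m, c in counts.items()}

def poisson_pmf(m, mean):
    """Poisson pmf with the given mean, as in (6.15) with mean = rate * t."""
    return mean**m * exp(-mean) / factorial(m)

rate, t = 2.0, 1.5
estimate = empirical_pmf(rate, t, trials=20000)
for m in range(8):
    print(m, estimate.get(m, 0.0), poisson_pmf(m, rate * t))
```

The agreement is statistical rather than exact; with 20000 sample paths the empirical frequencies typically match the pmf to about two decimal places.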

Figure 6.3: The Poisson Counting Process (PCP) N(t) and the relationship between arrival times T(n) and interarrival times τk.

By construction, it appears that the Poisson process has independent increments. However, showing this is hard, since the intervals between events are not defined as set times. In particular, we want to establish that, if t2 > t1 > t0, then N(t2) − N(t1) is independent of N(t1) − N(t0). To do this requires exploiting the special structure of the exponential distribution, which was shown to be memoryless in Chapter 1. This structure enables us to show that when p(τ) = λe−λτu(τ) then

\[ p(\tau \mid \tau > T) = p(\tau - T) = \lambda e^{-\lambda(\tau - T)} u(\tau - T). \]

We will show that N(t2) −N(t1) is independent of N(t1), as follows. Note that

\[ \Pr[N(t_2) - N(t_1) = n,\, N(t_1) = m] = \Pr[T(m) \le t_1,\, T(m+1) > t_1,\, T(m+n) \le t_2,\, T(m+n+1) > t_2]. \]

Now, define a new random variable S = t1−T(m). Note that S < τm+1 = T(m+ 1)−T(m), by construction; thus, define a second random variable V = τm+1 −S. Due to the memoryless property of the exponential distribution, we have

pSV (s,v) = pV |S(v|s)pS(s) = pτ (v + s|τ > s)pS(s) = pτ (v)pS(s),

which shows that S and V are independent! Now, we can easily see that

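\[ \begin{aligned}
\Pr[N(t_2) - N(t_1) = n,\, N(t_1) = m] &= \Pr[T(m) \le t_1,\, T(m+1) > t_1,\, T(m+n) \le t_2,\, T(m+n+1) > t_2] \\
&= \Pr[T(m) \le t_1 < T(m+1),\; t_1 + V + \tau_{m+2} + \ldots + \tau_{m+n} \le t_2,\; t_1 + V + \tau_{m+2} + \ldots + \tau_{m+n+1} > t_2] \\
&= \Pr[T(m) \le t_1 < T(m+1)]\, \Pr[t_1 + V + \tau_{m+2} + \ldots + \tau_{m+n} \le t_2,\; t_1 + V + \tau_{m+2} + \ldots + \tau_{m+n+1} > t_2] \\
&= \Pr[N(t_1) = m]\, \Pr[N(t_2) - N(t_1) = n].
\end{aligned} \tag{6.16} \]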

The last equality follows from the independence of the interarrival times τi, and the decomposition of the interarrival time τm+1 into two independent components S and V , thanks to the memoryless property of the exponential distribution. Thus, we have shown that the Poisson counting process is also an independent increments process!

One of the properties of Poisson random variables is that the sum of two independent Poisson random variables is also a Poisson random variable! To see this, consider the probability generating function of the sum of two independent Poisson random variables N,M with rates λN,λM respectively, as

\[ E[z^{N+M}] = E[z^N]\,E[z^M] = e^{\lambda_N (z-1)} e^{\lambda_M (z-1)} = e^{(\lambda_N + \lambda_M)(z-1)}, \]

which is the probability generating function of a Poisson random variable with rate λN + λM ! Coupled with the independent increments property of Poisson processes, this allows us to make several statements:

1. For any interval [t1, t2], the number of events N(t2) −N(t1) that occur in that interval is Poisson distributed, with parameter λ(t2 − t1). That is, for n ≥ 0,

\[ \Pr(N(t_2) - N(t_1) = n) = \frac{(\lambda(t_2 - t_1))^n}{n!}\, e^{-\lambda(t_2 - t_1)}. \]

In other words, the increments of a PCP are themselves Poisson distributed random variables!

2. For any pair of disjoint intervals, the number of events which occur in those intervals is independent, and the average number of events which occur on equal-length intervals is the same.

3. The joint probability pN (n1,n2; t1, t2) for t2 > t1, in which case n2 ≥ n1 ≥ 0, is computed as

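\[ \begin{aligned}
p_N(n_1, n_2;\, t_1, t_2) &= p_N(n_2; t_2 \mid n_1; t_1)\, p_N(n_1; t_1) = p_N(n_2 - n_1;\, t_2 - t_1)\, p_N(n_1; t_1) \\
&= \frac{(\lambda(t_2 - t_1))^{n_2 - n_1}}{(n_2 - n_1)!}\, e^{-\lambda(t_2 - t_1)}\, \frac{(\lambda t_1)^{n_1}}{n_1!}\, e^{-\lambda t_1} \\
&= \frac{\lambda^{n_2}\, t_1^{n_1} (t_2 - t_1)^{n_2 - n_1}}{n_1!\,(n_2 - n_1)!}\, e^{-\lambda t_2}
\end{aligned} \tag{6.17} \]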

4. A Poisson process is an independent-increments process, where increments over an interval [t1, t2] are Poisson distributed with rate λ(t2 − t1).

5. The sum of two independent Poisson processes N1(t),N2(t) with intensities λ1,λ2 is a Poisson process with intensity λ1 + λ2.
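The closure of Poisson random variables under independent sums, which underlies properties 1 and 5, can be checked by convolving two Poisson pmfs directly. This is a minimal Python sketch; the means 1.2 and 2.3 and the function names are arbitrary illustrative choices.

```python
from math import exp, factorial

def poisson_pmf(n, mean):
    """Pr[N = n] for a Poisson random variable with the given mean."""
    return mean**n * exp(-mean) / factorial(n)

def sum_pmf(n, mean_a, mean_b):
    """Pr[N + M = n] for independent Poisson N, M, by discrete convolution."""
    return sum(
        poisson_pmf(k, mean_a) * poisson_pmf(n - k, mean_b) for k in range(n + 1)
    )

for n in range(6):
    print(n, sum_pmf(n, 1.2, 2.3), poisson_pmf(n, 1.2 + 2.3))
```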

Using the above properties, the first and second-order moment functions of Poisson processes are easy to establish. These are:

E[N(t)] = λt; E[N(t)2] = λt + (λt)2; σ2N(t) = λt (6.18)

and the autocorrelation function:

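\[ \begin{aligned}
R_N(t,s) &= E[N(t)N(s)] \\
&= E[(N(t) - N(s) + N(s))N(s)] && \text{assuming } t \ge s \text{ without loss of generality} \\
&= E[(N(t) - N(s))N(s)] + E[N(s)^2] \\
&= E[N(t) - N(s)]\,E[N(s)] + E[N(s)^2] \\
&= \lambda^2 (t - s)s + \lambda s + \lambda^2 s^2 \\
&= \lambda^2 ts + \lambda s
\end{aligned} \tag{6.19} \]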

Using symmetry, we can write the autocorrelation and autocovariance functions more generally as

\[ R_N(t,s) = \lambda^2 ts + \lambda \min(t,s); \qquad K_N(t,s) = \lambda \min(t,s) \]

Note the minimization which occurs due to the independent increments property. Also note that the Poisson process is not a stationary process; however, it will be a Markov process. The Poisson process is an independent increments process, but it is not a Martingale, because the expected value of the increments is positive. However, if one were to define a modified process Ñ(t) = N(t) −λt, this modified process would be a Martingale.
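The autocorrelation formula can be cross-checked against the joint pmf (6.17) by summing n1 n2 p_N(n1, n2; t1, t2) over the support. The sketch below assumes t1 < t2 and truncates the infinite sums at a hypothetical cutoff of 80, which is far into the tail for the parameter values used; function names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(n, mean):
    """Pr[N = n] for a Poisson random variable with the given mean."""
    return mean**n * exp(-mean) / factorial(n)

def joint_pmf(n1, n2, t1, t2, rate):
    """Pr[N(t1) = n1, N(t2) = n2] for t1 < t2 and n2 >= n1 >= 0, as in (6.17)."""
    return poisson_pmf(n1, rate * t1) * poisson_pmf(n2 - n1, rate * (t2 - t1))

def autocorrelation(t1, t2, rate, cutoff=80):
    """E[N(t1) N(t2)] by a truncated double sum over the joint pmf."""
    return sum(
        n1 * n2 * joint_pmf(n1, n2, t1, t2, rate)
        for n1 in range(cutoff)
        for n2 in range(n1, cutoff)
    )

rate, t1, t2 = 2.0, 1.0, 2.5
print(autocorrelation(t1, t2, rate))
print(rate**2 * t1 * t2 + rate * min(t1, t2))
```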

Example 6.5 When monitoring radioactivity or photoelectric intensity, most measuring processes are based on counting the number of particles or photons which are emitted. Due to the quantum nature of electromagnetic waves, the number of photons which are emitted in a given interval is random, and assumed to be independent over any pair of disjoint intervals; furthermore, the average number of photons emitted over any interval of a given length is constant, depending on the intensity of the source. This satisfies the assumptions of a Poisson process, which consists of independent increments and a constant rate. Indeed, one can show that these two properties, plus the fact that it is a counting process, imply that the number of photons generated is a Poisson process.

6.5.3 Digital Modulation: Phase-Shift Keying

A basic method for modulation of digital data is phase-shift keying (PSK). In this method, binary data, modeled by a stream of 0’s and 1’s, is coded onto a carrier frequency by a phase signal. Define the random phase θ(n) as follows:

\[ \theta(n) = \begin{cases} \pi/2 & \text{if the } n\text{-th bit is 1,} \\ -\pi/2 & \text{otherwise} \end{cases} \tag{6.20} \]

Let T denote the duration of the signal used for each bit. Typically, T is a multiple of the carrier period 1/fc, where fc is the carrier frequency; that is, T = m/fc for some integer m ≥ 1, so that one or more cycles are used per bit. Define the phase signal for the n-th bit as

Θ(t) = θ(n) for nT ≤ t < (n + 1)T (6.21)

The corresponding transmitted signal is given by

X(t) = cos (ωct + Θ(t)) (6.22)

where ωc = 2πfc is the carrier frequency in radians/sec.

Now, suppose that the phase process in (6.21) is generated from an independent, identically distributed random sequence, where each θ(n) is a binary-valued random variable with probability parameter p (i.e., pθ(n)(π/2) = p). Then, the resulting collection of transmitted signals obtained in (6.22) forms a continuous-time, continuous-valued random process. When the parameter p = 1/2, it is referred to as the PSK process.

What are the second-order moments of the PSK process? Remember the following trigonometric identity:

cos (ωct + Θ(t)) = cos (ωct) cos (Θ(t)) − sin (ωct) sin (Θ(t)) (6.23)

Using the above, we compute the mean of the process as

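\[ \begin{aligned}
m_X(t) &= E[\cos(\omega_c t)\cos(\Theta(t)) - \sin(\omega_c t)\sin(\Theta(t))] \\
&= \cos(\omega_c t)\, E[\cos(\Theta(t))] - \sin(\omega_c t)\, E[\sin(\Theta(t))] \\
&= 0 - \sin(\omega_c t)\,(p - (1-p)) \\
&= \sin(\omega_c t)\,(1 - 2p)
\end{aligned} \tag{6.24} \]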

Note that the third equality follows because cos (Θ(t)) = 0 and sin (Θ(t)) = ±1 by definition. When p = 0.5, the mean is zero.

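The autocorrelation (which coincides with the autocovariance when p = 0.5) is given by:

\[ \begin{aligned}
R_X(t_1, t_2) &= E[(\cos\omega_c t_1 \cos\Theta(t_1) - \sin\omega_c t_1 \sin\Theta(t_1))(\cos\omega_c t_2 \cos\Theta(t_2) - \sin\omega_c t_2 \sin\Theta(t_2))] \\
&= E[\sin(\omega_c t_1)\sin(\Theta(t_1))\sin(\omega_c t_2)\sin(\Theta(t_2))] \\
&= \sin\omega_c t_1\, \sin\omega_c t_2\; E[\sin\Theta(t_1)\sin\Theta(t_2)]
\end{aligned} \]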

To complete the computation, we must compute the correlation E[sin Θ(t1) sin Θ(t2)]. Note that, from (6.21), Θ(t1), Θ(t2) are independent unless nT ≤ t1, t2 < (n + 1)T for some n. Thus,

\[ E[\sin\Theta(t_1)\sin\Theta(t_2)] = \begin{cases} 1 & \text{if } nT \le t_1, t_2 < (n+1)T \text{ for some } n \\ E[\sin\Theta(t_1)]\,E[\sin\Theta(t_2)] & \text{otherwise} \end{cases} \tag{6.25} \]

and thus the autocorrelation function is given by

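\[ R_X(t_1, t_2) = \begin{cases} \sin\omega_c t_1\, \sin\omega_c t_2 & \text{if } nT \le t_1, t_2 < (n+1)T \text{ for some } n \\ \sin(\omega_c t_1)\sin(\omega_c t_2)(1 - 2p)^2 & \text{otherwise} \end{cases} \tag{6.26} \]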

and the autocovariance function is

\[ K_X(t_1, t_2) = \begin{cases} 4p(1-p)\,\sin\omega_c t_1\, \sin\omega_c t_2 & \text{if } nT \le t_1, t_2 < (n+1)T \text{ for some } n \\ 0 & \text{otherwise} \end{cases} \tag{6.27} \]

When p = 0.5 then the scaling factor is unity: 4p(1 −p) = 1.
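Because each bit takes only two values, the first and second moments of the PSK signal can be computed exactly by enumerating the phase values. The sketch below does this for a pair of times; the parameters `fc` (carrier frequency) and `cycles` (carrier cycles per bit, the integer m in T = m/fc) and the function name `psk_moments` are hypothetical choices made for illustration.

```python
from math import cos, pi, sin

def psk_moments(t1, t2, p, fc=1.0, cycles=1):
    """Return (mean at t1, autocorrelation, autocovariance) of the PSK signal.

    The phase is +pi/2 with probability p and -pi/2 with probability 1 - p,
    independently from bit to bit; each bit lasts T = cycles / fc.
    """
    T = cycles / fc
    wc = 2 * pi * fc
    phases = ((pi / 2, p), (-pi / 2, 1 - p))

    def mean_at(t):
        return sum(q * cos(wc * t + theta) for theta, q in phases)

    mean1, mean2 = mean_at(t1), mean_at(t2)
    if int(t1 // T) == int(t2 // T):
        # Same bit interval: both samples share one random phase.
        corr = sum(q * cos(wc * t1 + th) * cos(wc * t2 + th) for th, q in phases)
    else:
        # Different bit intervals: the phases are independent.
        corr = mean1 * mean2
    return mean1, corr, corr - mean1 * mean2

p = 0.3
print(psk_moments(0.2, 0.7, p))  # both times inside the first bit
print(psk_moments(0.2, 1.7, p))  # times in different bits
print(4 * p * (1 - p) * sin(2 * pi * 0.2) * sin(2 * pi * 0.7))
```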

6.5.4 The Random Telegraph Process

The random telegraph process (sometimes also known as the random binary sequence) is a discrete-valued process which is used often as input in model identification problems because it is easy to generate sample paths of this function. The process is generated in a manner which is very similar to a Poisson process; indeed, one way to define the random binary sequence is to switch values at the jump times of a Poisson process.

Let {Tn} denote the sequence of event times associated with a Poisson process, as in eq. (6.9). The random telegraph process is generated as follows: Let X(0) be a binary-valued random variable, with equal probability of achieving the values {−1, 1}. Define T0 = 0 for notation; then,

\[ X(t) = \begin{cases} X(T_n) & \text{if } T_n < t < T_{n+1} \\ -X(T_n) & \text{if } t = T_{n+1} \end{cases} \tag{6.28} \]

Due to its construction, the random telegraph process has properties which are similar to the Poisson process. In particular, the inter-event times τn = Tn − Tn−1 are independent, identically distributed exponential random variables with rate λ, and the numbers of events in disjoint intervals are independent random variables.

In order to understand better the relationship between the random telegraph process and the Poisson process, let N(t) denote the Poisson counting process with the same event times. If we assume that X(T0) is equally likely to be either +1 or -1, then the random telegraph process is clearly zero-mean. Assuming t2 ≥ t1, the autocorrelation (and autocovariance) are given by:

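\[ \begin{aligned}
R_X(t_1, t_2) &= E[X(t_1)X(t_2)] = (+1)\Pr[X(t_1) = X(t_2)] + (-1)\Pr[X(t_1) \ne X(t_2)] \\
&= \Pr[N(t_2) - N(t_1) = 2n \text{ for some } n \ge 0] - \Pr[N(t_2) - N(t_1) = 2n + 1 \text{ for some } n \ge 0] \\
&= \sum_{n=0}^{\infty} \frac{[\lambda(t_2 - t_1)]^{2n}}{(2n)!}\, e^{-\lambda(t_2 - t_1)} - \sum_{n=0}^{\infty} \frac{[\lambda(t_2 - t_1)]^{2n+1}}{(2n+1)!}\, e^{-\lambda(t_2 - t_1)} \\
&= \frac{1}{2}\left( 1 + e^{-2\lambda(t_2 - t_1)} \right) - \frac{1}{2}\left( 1 - e^{-2\lambda(t_2 - t_1)} \right) = e^{-2\lambda(t_2 - t_1)}
\end{aligned} \tag{6.29} \]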

More generally, using symmetry of RX, we have

\[ R_X(t_1, t_2) = e^{-2\lambda |t_2 - t_1|} \tag{6.30} \]

Thus, the random telegraph process is wide-sense stationary. Indeed, we can extend the definition to (−∞,∞) by defining event times Tn for negative integers n in an obvious manner using exponential independent, identically distributed random variables τn for negative integers also. Then, we can show that the random telegraph process is stationary in the strict sense, due to the stationary property of the increments of Poisson processes. That is,

\[ \begin{aligned}
p_X(x_1, x_2;\, t_1, t_2) &= p_X(x_2; t_2 \mid x_1; t_1)\, p_X(x_1; t_1) \\
&= \{\delta(x_1 - x_2)\Pr[N(t_2) - N(t_1) \text{ even} \mid X(t_1) = x_1] \\
&\qquad + \delta(x_1 + x_2)\Pr[N(t_2) - N(t_1) \text{ odd} \mid X(t_1) = x_1]\}\, p_X(x_1; t_1)
\end{aligned} \tag{6.31} \]

where δ(·) denotes the Dirac or impulse function. Now, the increments N(t2) − N(t1) are independent of X(t1), due to the independent increments property of the Poisson process and the fact that X(t1) depends only on events used to define N(t1). Thus,

\[ \begin{aligned}
p_X(x_1, x_2;\, t_1, t_2) &= \{\delta(x_1 - x_2)\Pr[N(t_2) - N(t_1) \text{ even}] + \delta(x_1 + x_2)\Pr[N(t_2) - N(t_1) \text{ odd}]\}\, p_X(x_1; t_1) \\
&= \{\delta(x_1 - x_2)\Pr[N(t_2 - t) - N(t_1 - t) \text{ even}] \\
&\qquad + \delta(x_1 + x_2)\Pr[N(t_2 - t) - N(t_1 - t) \text{ odd}]\}\, p_X(x_1; t_1 - t) && (6.32) \\
&= p_X(x_1, x_2;\, t_1 - t, t_2 - t) && (6.33)
\end{aligned} \]

since pX(x1; t1) is stationary, and the distribution of a Poisson increment Pr[N(t2)−N(t1)] is also stationary. The above argument can be generalized to an arbitrary finite number of process values X(t1), . . . ,X(tn).
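The series step in (6.29), where the even and odd Poisson probabilities combine into a single exponential, can be verified numerically. The following Python sketch sums the two series with a hypothetical truncation of 60 terms, which is ample for the small values of λ(t2 − t1) used here; function names are illustrative.

```python
from math import exp, factorial

def parity_probabilities(rate, lag, terms=60):
    """Return (Pr[even number of events], Pr[odd number]) in an interval of length lag."""
    mean = rate * lag
    even = exp(-mean) * sum(mean ** (2 * n) / factorial(2 * n) for n in range(terms))
    odd = exp(-mean) * sum(
        mean ** (2 * n + 1) / factorial(2 * n + 1) for n in range(terms)
    )
    return even, odd

def telegraph_autocorrelation(rate, lag):
    """R_X at the given lag: Pr[no net sign change] minus Pr[sign change]."""
    even, odd = parity_probabilities(rate, lag)
    return even - odd

rate = 1.5
for lag in (0.0, 0.2, 0.8, 2.0):
    print(lag, telegraph_autocorrelation(rate, lag), exp(-2 * rate * lag))
```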

Does the random telegraph process have independent increments? No, it does not. To see this consider three ordered times t1 < t2 < t3 and the corresponding increments X(t2) −X(t1) and X(t3) −X(t2). If we know nothing about the first increment X(t2)−X(t1), then the second increment could either be +2, −2 or 0 (i.e. there are three possibilities, each with some probability). But if we know that X(t2) −X(t1) = +2 then X(t3) −X(t2) can only be −2 or 0, so knowledge of X(t2) −X(t1) is clearly affecting the uncertainty in X(t3) −X(t2) and they cannot be independent.

The above discussion illustrates why Poisson processes are so important for the understanding of stochastic processes. We can define other processes based on the transition times of a Poisson process, and these processes inherit many of the fundamental properties of the Poisson processes.

6.5.5 The Wiener Process and Brownian Motion

Suppose that we have defined a discrete-time random walk process Y (·), as in section 6.5.1, indexed by a discrete time n. We can embed this process into continuous time by interpolating between times, as:

\[ Y_T(t) = \begin{cases} 0 & \text{if } t = 0 \\ Y(n)\,d & \text{if } (n-1)T < t \le nT,\; n = 1, \ldots \end{cases} \tag{6.34} \]

for some sampling time T and some jump size d. Based on the properties of the discrete-time random walk process Y (·), we have the following properties for the continuous-time process YT (·):

\[ E[Y_T(t)] = 0; \qquad E[Y_T(t)^2] = E[d^2 Y^2(n(t))] = d^2 n(t) = d^2 \min\{n : nT \ge t\} \]

Note that n(t) ≈ t/T in the above equation. Our goal is to define a process which is the limit of the YT (t) process, as we let the sampling time T → 0 in a manner which makes the limit a random variable. In particular, define a monotone decreasing sequence of times Tn = 1/n, so that limn→∞Tn = 0, and define the corresponding sequence of processes as YTn(t) ≡ Yn(t). As n increases, the value of the process at each time, Yn(t), is the sum of nt independent, identically distributed, zero-mean random variables, and the variance of Yn(t) is $d^2 nt$. By the strong law of large numbers, for a fixed step size d we have that

\[ \lim_{n\to\infty} \frac{Y_n(t)}{n} \overset{a.e.}{=} 0 \tag{6.35} \]

The idea for the construction of the Brownian motion process is to decrease the step size d as n increases, so that the overall variance of Yn(t) remains constant. Thus, define the sequence dn such that $d_n^2 n = 1$. Then, by the Central Limit Theorem, since Yn(t) is the sum of an increasing number of independent, identically distributed random variables,

\[ \lim_{n\to\infty} \frac{Y_n(t)}{\sqrt{t}} \overset{d}{=} N(0, 1) \tag{6.36} \]

That is, the limit of the normalized sequence converges in distribution to a unit variance, zero-mean Gaussian random variable. Alternatively, since the normalizing factor does not depend on n, then the sequence of random variables Yn(t) converges in distribution to a Gaussian random variable B(t), with zero-mean, and variance t. We define the Brownian motion process B(t) to be the limit, for each t, of the sequence of random variables Yn(t).

What properties does the process B(t) have? We know that, for each n, the process Yn(t) has almost independent increments, in the sense that, for t1 < t2 < t3 < t4, the increments Yn(t2) − Yn(t1) and Yn(t4)−Yn(t3) are independent provided they are constructed from independent increments in the underlying random walk. That is, they are independent provided t3 − t2 ≥ 1/n. As n →∞, the limit process B(t) will have independent increments. As established before, the limit process B(t) is also a Gaussian process.

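Note also that, as n → ∞, the size of the jumps in the process derived from the random walk, $d_n$, is getting smaller. Indeed, we have the following property:

$$\begin{aligned}
\lim_{\epsilon\to 0} B(t+\epsilon) &= \lim_{\epsilon\to 0}\lim_{n\to\infty} Y_n(t+\epsilon) = \lim_{n\to\infty}\lim_{\epsilon\to 0} Y_n(t+\epsilon) \\
&= \lim_{\epsilon\to 0}\lim_{n\to\infty} \left[Y_n(t) + (Y_n(t+\epsilon) - Y_n(t))\right] \\
&= \lim_{n\to\infty}\left[Y_n(t) + \lim_{\epsilon\to 0}(Y_n(t+\epsilon) - Y_n(t))\right] \approx \lim_{n\to\infty}\left[Y_n(t) + d_n\right] = B(t)
\end{aligned} \tag{6.37}$$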

where the argument is somewhat imprecise because we are dealing with random variables, and we have not defined in what sense the limit is valid. The limit indicates that B(t) should be a continuous function of time. Later, we shall define more formally what we mean by continuous random processes, and show that, almost surely, the sample functions B(t) are continuous.

In sum, the Brownian motion B(t) (also known as the standard Wiener process) is a Gaussian, zero-mean, continuous-time random process with independent increments, and with variance $E[B^2(t)] = t$. Based on this definition, we summarize the properties of Brownian motion:

1. B(0) = 0.

2. The sample functions B(t) are almost surely continuous.

3. $E[B(t)] = 0$, $E[B^2(t)] = t$.

4. B(t) is Gaussian.

5. B(t) is an independent increments process; the increments B(t)−B(s) are Gaussian, zero-mean random variables with variance t−s for t > s.

6. The autocovariance and autocorrelation functions are given by

RB(t,s) = KB(t,s) = min(t,s) (6.38)
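The limiting argument behind these properties can be checked numerically. The sketch below is only an illustration under stated assumptions: the steps are taken to be ±d_n with equal probability (one possible zero-mean step distribution), nt is an integer, and the names `scaled_walk_distribution` and `normal_cdf` are hypothetical helpers. It computes the exact distribution of Y_n(t) from the binomial law and compares its variance and one value of its distribution function with those of a Gaussian of variance t.

```python
import math

def scaled_walk_distribution(n, t):
    """Exact distribution of Y_n(t): a sum of n*t steps of size +/- d_n,
    each sign equally likely, with d_n**2 * n = 1 (assumed step model)."""
    steps = round(n * t)              # assumes n*t is an integer
    d_n = 1.0 / math.sqrt(n)
    total = 2 ** steps
    # k up-steps and (steps - k) down-steps give the value (2k - steps) * d_n
    return [((2 * k - steps) * d_n, math.comb(steps, k) / total)
            for k in range(steps + 1)]

def normal_cdf(x, variance):
    """Distribution function of a zero-mean Gaussian with the given variance."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0 * variance)))

n, t = 400, 1.0
dist = scaled_walk_distribution(n, t)
variance = sum(p * y * y for y, p in dist)            # should equal t
walk_cdf_at_1 = sum(p for y, p in dist if y <= 1.0)   # P(Y_n(t) <= 1)
gauss_cdf_at_1 = normal_cdf(1.0, t)                   # P(B(t) <= 1)
print(variance, walk_cdf_at_1, gauss_cdf_at_1)
```

The variance matches t for every n because of the normalization d_n²n = 1; only the shape of the distribution changes with n, approaching the Gaussian shape as n grows.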


In addition to the standard Wiener process or Brownian motion, it is possible to define other Gaussian processes which are very similar. In particular, a generalized Wiener process is a Gaussian, zero-mean, independent-increments process with covariance f(t) for some nondecreasing function f(t). For example, we can define a Wiener process with covariance αt for some positive constant α, and interpret it simply as the limit of a random walk where the step size satisfies $d_n^2 n = \alpha$.

In terms of the classes of processes described previously, generalized Wiener processes are independent-increment processes, Markov processes and also martingales. Remarkably, a generalized Wiener process with covariance αt has the same mean and autocovariance functions as the normalized Poisson process Ñ(t) = N(t)−λt, where the rate λ equals the covariance intensity α. However, the sample functions of the Wiener process are almost surely continuous, in contrast with the discontinuous sample functions of Ñ(t). This highlights the limitations of using only the first and second order statistics to understand how a process behaves.

6.6 Stationarity of Stochastic Processes

There are several properties of random processes that make it easier to specify the joint distribution for arbitrary times {t1, . . . , tk} and/or simplify the first and second-order moments. Of particular importance are properties describing conditions where the nature of the process randomness does not change with time. In other words, an observation of the process on some time interval (s,t) displays the same random behavior as over the time interval (s + τ,t + τ). This and other properties can be defined in terms of distributions or moments, corresponding to strong vs. weak conditions, respectively. Both types are included among the definitions below.

Definition 6.7 (Strict-Sense Stationary) The stochastic process X(·) is called stationary (or strict-sense stationary (SSS), or strictly stationary) if the joint distribution of any collection of samples depends only on their relative time. That is, for any k and any t1, t2, . . . , tk and any τ, we have

pX(x1, . . . ,xk; t1, . . . , tk) = pX(x1, . . . ,xk; t1 − τ, . . . , tk − τ).

Clearly, the concept of stationary processes is easily extended to vector-valued processes, whereby the components are said to be jointly stationary. There are several consequences of stationarity which are useful to exploit. First, note that the mean of the process must be independent of time, since pX(x; t) = pX(x; τ) for all t,τ. Thus, mX(t) = mX for all t. Similarly, the variance $\sigma_X^2(t) = \sigma_X^2$. Second, the second-order joint density functions depend only on the difference of the time indices; that is,

pX(x1,x2; t1, t2) = pX(x1,x2; 0, t2 − t1).

Thus, second-order moment functions such as autocorrelations and autocovariances depend only on the differences in the times! That is,

RX(t1, t2) = RX(0, t2 − t1) ≡ RX(t2 − t1); KX(t1, t2) = KX(0, t2 − t1) ≡ KX(t2 − t1),

where we have indulged in a standard abuse of notation with the use of a single argument for time difference.

There are many processes for which we can only establish the weaker condition of stationarity of the first and second-order moment functions. We define this class of processes below:

Definition 6.8 (Wide-Sense Stationary) The process X(·) is said to be wide-sense stationary (WSS) (or weakly stationary) if the mean of the process does not depend on time, and the autocorrelation function depends only on the time difference of the two samples. That is,

mX(t) = mX; RX(t1, t2) ≡ RX(t2 − t1)

Another class of random processes of interest are processes whose description exhibits periodic behavior. These processes arise in many communications applications, where operations must be repeated periodically. We define two special classes of processes:

Definition 6.9 (Periodic) A stochastic process X(·) is said to be periodic if the joint probability density (or mass) function is invariant when the time of any of the variables is shifted by integer multiples of some period T . That is, for any k, for any integers mi and sampling times t1, . . . , tk, we have

pX(x1, . . . ,xk; t1, . . . , tk) = pX(x1, . . . ,xk; t1 −m1T,. . . , tk −mkT)


Definition 6.10 (Cyclostationary) A stochastic process X(·) is said to be cyclostationary if the joint probability density (or mass) function is invariant when the time origin is shifted by integer multiples of some period T . That is, for any k, for any integer m and sampling times t1, . . . , tk, we have

pX(x1, . . . ,xk; t1, . . . , tk) = pX(x1, . . . ,xk; t1 −mT,.. . , tk −mT)

The difference between periodic and cyclostationary is that, in periodic processes, each time index can be shifted by a different multiple of the period, while in cyclostationary processes, all time indices must receive the same shift. Note that periodic implies cyclostationary, but not vice versa. Again, when we care only about second-order moment functions, we have weaker definitions.

Definition 6.11 (Wide-Sense Periodic) A stochastic process X(·) is said to be wide-sense periodic if the mean and autocorrelation of the process are invariant when the time of any of the variables is shifted by integer multiples of some period T . That is, for any k, for any integers m1,m2 and sampling times t1, t2, we have:

mX(t) = mX(t−mT); RX(t1, t2) = RX(t1 −m1T,t2 −m2T)

Definition 6.12 (Wide-Sense Cyclostationary) A stochastic process X(·) is said to be wide-sense cyclostationary if the mean and autocorrelation of the process are invariant when the time origin is shifted by integer multiples of some period T . That is, for any k, for any integer m and sampling times t1, t2, we have

mX(t) = mX(t−mT); RX(t1, t2) = RX(t1 −mT,t2 −mT)

Although the above definitions have been stated for scalar-valued processes, it is straightforward to extend the definitions to vector-valued processes X(t).

Example 6.6 Consider the random telegraph process defined previously. If we assume that X(T0) is equally likely to be either +1 or -1, then the random telegraph process is clearly zero-mean. The autocorrelation (and autocovariance) are given by:

$$K_X(t_1,t_2) = e^{-2\lambda|t_2-t_1|} \tag{6.39}$$

The joint density of the samples at times $t_1 < t_2$ is

$$\begin{aligned}
p_X(x_1,x_2;t_1,t_2) &= p_X(x_2;t_2 \mid x_1;t_1)\,p_X(x_1;t_1) \\
&= \{\delta(x_1-x_2)\Pr[N(t_2)-N(t_1)\ \text{even} \mid X(t_1)=x_1] \\
&\qquad + \delta(x_1+x_2)\Pr[N(t_2)-N(t_1)\ \text{odd} \mid X(t_1)=x_1]\}\,p_X(x_1;t_1)
\end{aligned} \tag{6.40}$$

where δ(·) denotes the Dirac or impulse function. Now, the increments N(t2) −N(t1) are independent of X(t1), due to the independent increments property of the Poisson process and the fact that X(t1) depends only on events used to define N(t1). Thus,

$$\begin{aligned}
p_X(x_1,x_2;t_1,t_2) &= \{\delta(x_1-x_2)\Pr[N(t_2)-N(t_1)\ \text{even}] + \delta(x_1+x_2)\Pr[N(t_2)-N(t_1)\ \text{odd}]\}\,p_X(x_1;t_1) \\
&= \{\delta(x_1-x_2)\Pr[N(t_2-t)-N(t_1-t)\ \text{even}] \\
&\qquad + \delta(x_1+x_2)\Pr[N(t_2-t)-N(t_1-t)\ \text{odd}]\}\,p_X(x_1;t_1-t) && (6.41) \\
&= p_X(x_1,x_2;t_1-t,t_2-t) && (6.42)
\end{aligned}$$
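The autocorrelation (6.39) can be recovered from the even/odd switching probabilities that appear in the joint density. The sketch below assumes that the switches of the telegraph process follow a Poisson process of rate λ, so that E[X(t)X(t+τ)] is the probability of an even number of switches in an interval of length |τ| minus the probability of an odd number. The function names are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """P(N = k) for a Poisson count with the given mean (mean > 0),
    evaluated in log-space to avoid overflow for large k."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def telegraph_autocorrelation(lam, tau, terms=200):
    """P(even number of switches) - P(odd number of switches) over |tau|."""
    mean = lam * abs(tau)
    if mean == 0.0:
        return 1.0
    return sum((-1) ** k * poisson_pmf(k, mean) for k in range(terms))

lam, tau = 1.5, 0.8
numeric = telegraph_autocorrelation(lam, tau)
closed_form = math.exp(-2.0 * lam * abs(tau))
print(numeric, closed_form)
```

The result depends on the two sample times only through |τ|, which is the second-order counterpart of the shift invariance shown in (6.42).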

Example 6.7 Consider the phase shift keying process X(t) defined previously. The statistics of X were:

$$m_X(t) = \sin(\omega_c t)(1-2p)$$

$$K_X(t_1,t_2) = \begin{cases} 4p(1-p)\sin(\omega_c t_1)\sin(\omega_c t_2) & nT \le t_1, t_2 < (n+1)T \text{ for some } n \\ 0 & \text{otherwise} \end{cases}$$

When p = 0.5, the mean of the process is 0. This is an example of a wide-sense cyclostationary process; with some work we could show this process is actually cyclostationary, similar to what we saw in the previous example. Note, however, that the process is not periodic, because $K_X(t_1+T,t_2) \neq K_X(t_1,t_2)$.
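The distinction between cyclostationary and periodic behavior in this example can be checked directly from the autocovariance formula. The sketch below is illustrative only: it assumes that the carrier completes a whole number of cycles per symbol (ω_c T is a multiple of 2π), and the parameter values and the function name `psk_autocovariance` are hypothetical.

```python
import math

def psk_autocovariance(t1, t2, p, omega_c, T):
    """K_X(t1, t2) from Example 6.7: nonzero only when both sample
    times fall inside the same symbol interval [nT, (n+1)T)."""
    same_symbol = math.floor(t1 / T) == math.floor(t2 / T)
    if not same_symbol:
        return 0.0
    return 4.0 * p * (1.0 - p) * math.sin(omega_c * t1) * math.sin(omega_c * t2)

T = 1.0
omega_c = 2.0 * math.pi * 3 / T   # assumption: three carrier cycles per symbol
p = 0.5
t1, t2 = 0.30, 0.45

base = psk_autocovariance(t1, t2, p, omega_c, T)
both_shifted = psk_autocovariance(t1 + T, t2 + T, p, omega_c, T)  # common shift
one_shifted = psk_autocovariance(t1 + T, t2, p, omega_c, T)       # single shift
print(base, both_shifted, one_shifted)
```

Shifting both times by T leaves the autocovariance unchanged, while shifting only one of them moves the samples into different symbols and gives zero.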


6.7 Moment Functions of Vector Processes

Suppose that we have a vector-valued stochastic process X(t). We define the mean and autocorrelation functions of the vector process as

$$m_X(t) = E[X(t)]; \quad R_X(s,t) = E[X(s)X(t)^T] \tag{6.43}$$

Similarly, the autocovariance function is defined as

$$K_X(s,t) = R_X(s,t) - m_X(s)m_X(t)^T \tag{6.44}$$

If we have two vector-valued processes X(t), Y(t) defined on the same probability space, we define the cross-correlation function $R_{XY}$ as

$$R_{XY}(s,t) = E[X(s)Y(t)^T] \tag{6.45}$$

For complex-valued vectors (or matrices) M, we define the Hermitian adjoint of M as

$$M^H = [M^T]^*$$

where ∗ denotes complex conjugation, element by element. For complex-valued vector processes, the above definitions can be extended as:

$$R_X(s,t) = E[X(s)X(t)^H]; \quad K_X(s,t) = R_X(s,t) - m_X(s)m_X(t)^H \tag{6.46}$$

Based on the above definitions, autocorrelation (also autocovariance) functions have the following properties:

1. $R_X(s,t) = R_X(t,s)^H$

2. Using the Cauchy-Schwarz inequality for random variables, we get, for any appropriately-dimensioned vectors a, b,

$$|a^H R_X(s,t) b|^2 = |E[a^H X(s)X(t)^H b]|^2 \le E[|a^H X(s)|^2]\,E[|b^H X(t)|^2] = [a^H R_X(s,s) a][b^H R_X(t,t) b] \tag{6.47}$$

In particular, when X(·) is a scalar-valued random process, we have

$$|R_X(s,t)|^2 \le R_X(s,s)R_X(t,t)$$

3. $R_{XY}(s,t) = R_{YX}(t,s)^H$

6.8 Moments of Wide-sense Stationary Processes

In this section, we restrict our attention initially to scalar, real-valued wide-sense stationary random processes. When the processes are wide-sense stationary, the autocorrelation and autocovariance functions can be expressed simply as $R_X(s,t) = E[X(s)X(t)] = E[X(0)X(t-s)] \equiv R_X(t-s)$. Letting τ = t−s, we use the notation $R_X(\tau) \equiv E[X(t)X(t+\tau)]$. Assuming that X(·) is real-valued, we have the following properties:

1. $R_X(0) = E[X(t)^2] \ge 0$.

2. $R_X(\tau) = E[X(t)X(t+\tau)] = E[X(t-\tau)X(t)] = R_X(-\tau)$, because X(·) is wide-sense stationary. Thus, the autocorrelation is an even function of τ.

3. $|R_X(\tau)|^2 \le R_X(0)^2$, based on the Cauchy-Schwarz inequality.

4. If $R_X(T) = R_X(0)$ for some T, then $R_X$ is periodic, with period T. That is, $R_X(\tau+T) = R_X(\tau)$ for all τ. This also follows from the Cauchy-Schwarz inequality, as

$$\begin{aligned}
(R_X(\tau+T) - R_X(\tau))^2 &= |E[(X(\tau+T) - X(\tau))X(0)]|^2 \\
&\le E[(X(\tau+T) - X(\tau))^2]\,E[X(0)^2] \\
&= (E[X(\tau+T)^2] + E[X(\tau)^2] - 2E[X(\tau+T)X(\tau)])\,R_X(0) \\
&= (2R_X(0) - 2R_X(T))\,R_X(0) = 0
\end{aligned} \tag{6.48}$$


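5. If $R_X(\tau)$ is continuous at τ = 0, then $R_X(\tau)$ is continuous everywhere. This follows again from the Cauchy-Schwarz inequality:

$$\begin{aligned}
\lim_{\epsilon\to 0} |R_X(\tau+\epsilon) - R_X(\tau)|^2 &= \lim_{\epsilon\to 0} |E[(X(\tau+\epsilon) - X(\tau))X(0)]|^2 \\
&\le \lim_{\epsilon\to 0} E[(X(\tau+\epsilon) - X(\tau))^2]\,E[X(0)^2] \\
&= \lim_{\epsilon\to 0} 2(R_X(0) - R_X(\epsilon))\,R_X(0) = 0
\end{aligned} \tag{6.49}$$

because $R_X(\tau)$ is continuous at 0.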

Note that all these properties also hold for $K_X(\tau)$, which can be thought of as a special case of an autocorrelation function for the random process $\tilde{X}(t) = X(t) - m_X$.

The concept of wide-sense stationarity can also be extended to vector processes. A vector process is wide-sense stationary if its autocorrelation satisfies RX(s,t) = RX(0, t−s) = RX(v,v +t−s) for any s,t and v. We write RX(s,t) ≡ RX(τ). For vector real-valued processes, we have the following natural extensions of the above results:

1. The autocorrelation matrix $R_X(0)$ is positive semidefinite; that is, $a^T E[X(t)X(t)^T] a \ge 0$ for any vector a.

2. $R_X(\tau) = E[X(t)X(t+\tau)^T] = E[X(t-\tau)X(t)^T] = R_X(-\tau)^T$, because X(·) is wide-sense stationary.

3. If $R_X(\tau)$ is continuous at τ = 0, then $R_X(\tau)$ is continuous everywhere.

This follows again from the Cauchy-Schwarz inequality: for any vectors a, b,

$$\begin{aligned}
\lim_{\epsilon\to 0} |a^T (R_X(\tau+\epsilon) - R_X(\tau)) b|^2 &= \lim_{\epsilon\to 0} |E[a^T (X(\tau+\epsilon) - X(\tau)) X(0)^T b]|^2 \\
&\le \lim_{\epsilon\to 0} E[|a^T (X(\tau+\epsilon) - X(\tau))|^2]\,E[|X(0)^T b|^2] \\
&= \lim_{\epsilon\to 0} 2 a^T (R_X(0) - R_X(\epsilon)) a \; b^T R_X(0) b = 0
\end{aligned} \tag{6.50}$$

because $R_X(\tau)$ is continuous at 0. By properly selecting the vectors, we can show that each entry in $R_X$ must be continuous.

Let X(·) and Y(·) denote two scalar processes defined on the same probability space. Then, the cross-correlation function $R_{XY}$ satisfies the following:

1. $R_{XY}(\tau) = R_{YX}(-\tau)$.

2. By the Cauchy-Schwarz inequality, $R_{XY}(\tau)^2 \le E[X(0)^2]\,E[Y(\tau)^2] = R_X(0)R_Y(0)$.

3. $|R_{XY}(\tau)| \le \frac{1}{2}(R_X(0) + R_Y(0))$. To show this, remember the following moment inequality, which was derived from Jensen's inequality:

$$E[|X+Y|^r] \le c_r\,(E[|X|^r] + E[|Y|^r])$$

where

$$c_r = \begin{cases} 1 & \text{if } r \le 1 \\ 2^{r-1} & \text{if } r > 1 \end{cases}$$

In particular, if r = 2, we have

$$R_X(0) + R_Y(0) + 2|R_{XY}(\tau)| \le E[(|X(0)| + |Y(\tau)|)^2] \le 2(R_X(0) + R_Y(0)) \tag{6.51}$$

which establishes the inequality.


6.9 Power Spectral Density of Wide-Sense Stationary Processes

In this section, we concentrate on describing the properties of scalar wide-sense stationary processes in the frequency domain. As one might imagine, such “frequency domain” characterization will turn out to be particularly convenient when we consider the interaction of wide-sense stationary processes and linear time invariant systems, which we discuss in Section 9.

Let us assume that we have a wide-sense stationary process x(t) with mean mx and autocovariance function Kx(τ) ≡ Kx(t,t + τ) for all t,τ. We assume also that the random process y(t) is also wide-sense stationary, and that x,y are jointly wide-sense stationary. As we have discussed previously in the properties of autocovariance and autocorrelation functions, we know:

1. |Rx(τ)| ≤ Rx(0)

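2. $|R_{xy}(\tau)| \le \sqrt{R_x(0)R_y(0)}$

3. $R_x(\tau) = R_x^*(-\tau)$

4. For all N > 0, all $t_1 < t_2 < \ldots < t_N$, all complex $a_1, \ldots, a_N$, we have

$$\begin{bmatrix} a_1^* & \cdots & a_N^* \end{bmatrix}
\begin{bmatrix}
R_x(t_1-t_1) & R_x(t_1-t_2) & \cdots & R_x(t_1-t_N) \\
R_x(t_2-t_1) & R_x(t_2-t_2) & \cdots & R_x(t_2-t_N) \\
\vdots & \vdots & & \vdots \\
R_x(t_N-t_1) & R_x(t_N-t_2) & \cdots & R_x(t_N-t_N)
\end{bmatrix}
\begin{bmatrix} a_1 \\ \vdots \\ a_N \end{bmatrix} \ge 0$$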

5. If x(t) is real, then Rx(τ) is an even function.

Notice that the autocorrelation function Rx(τ) can be viewed as a deterministic function of time, which is bounded by Rx(0) in magnitude. Thus, we can define its Fourier transform (if it exists), as follows:

Definition 6.13 (Power Spectral Density) Let Rx(τ) be the autocorrelation function of the wide-sense stationary process x(t). Then, the power spectral density Sx(ω) of x(t) is defined as the Fourier transform of Rx(τ). That is,

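$$S_x(\omega) = \int_{-\infty}^{\infty} R_x(t)\,e^{-j\omega t}\,dt$$

If the Fourier transform exists, we can define the inverse Fourier transform as

$$R_x(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_x(\omega)\,e^{j\omega\tau}\,d\omega$$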

Definition 6.14 (Cross-Power Spectral Density) For two jointly wide-sense stationary processes x(t),y(t), we define the cross power spectral density to be

$$S_{xy}(\omega) = \int_{-\infty}^{\infty} R_{xy}(t)\,e^{-j\omega t}\,dt$$

Given the properties of the autocorrelation function Rx(τ), we can establish the following properties of the power spectral density function:

1. Sx(ω) is real-valued, even if the process x(t) is complex-valued.

2. If x(t) is real-valued, then Sx(ω) is an even function, since Rx(τ) is an even function.

3. Sx(ω) is nonnegative.


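Proving the first two properties is simple; the third property will be shown later when discussing linear systems. Consider the first property: note that $R_x(\tau) = R_x^*(-\tau)$. Thus,

$$\begin{aligned}
\int_{-\infty}^{\infty} R_x(t)e^{-j\omega t}\,dt &= \int_0^{\infty} R_x(t)e^{-j\omega t}\,dt + \int_{-\infty}^{0} R_x(t)e^{-j\omega t}\,dt \\
&= \int_0^{\infty} R_x(t)e^{-j\omega t}\,dt + \int_0^{\infty} R_x(-t)e^{j\omega t}\,dt \\
&= \int_0^{\infty} \left(R_x(t)e^{-j\omega t} + (R_x(t)e^{-j\omega t})^*\right)dt
\end{aligned}$$

which is clearly real-valued, since the integrand is real-valued. If x(t) is real-valued, then $R_x$ is real and even, so the integral is unchanged when ω is replaced by −ω, and so $S_x(\omega)$ is an even function.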

The power spectral density has the interpretation of a density function for average power in the random process x(t) per unit frequency. For instance, consider a wide-sense stationary process x(t), and consider the random variable at frequency ω, defined by the following integral:

$$F_T(\omega) = \int_{-T}^{T} x(t)e^{-j\omega t}\,dt$$

This is a complex-valued random variable, and, as T → ∞, looks like the Fourier transform of the sample path of x(t). Note that the above integral exists as long as x(t) has reasonable sample paths (e.g. bounded). The square of the magnitude of this random variable is given by

$$F_T(\omega)F_T^*(\omega) = \int_{-T}^{T}\int_{-T}^{T} x(t)x^*(s)e^{-j\omega(t-s)}\,ds\,dt$$

Thus,

$$E\left[|F_T(\omega)|^2\right] = \int_{-T}^{T}\int_{-T}^{T} R_x(t-s)e^{-j\omega(t-s)}\,dt\,ds = 2T\int_{-2T}^{2T}\left(1 - \frac{|\tau|}{2T}\right)R_x(\tau)e^{-j\omega\tau}\,d\tau \tag{6.52}$$

Thus, the average power in the random variable $F_T(\omega)$ is given by

$$\frac{1}{2T}E\left[|F_T(\omega)|^2\right] = \int_{-2T}^{2T}\left(1 - \frac{|\tau|}{2T}\right)R_x(\tau)e^{-j\omega\tau}\,d\tau$$

As T → ∞, we see that this integral converges to $S_x(\omega)$. Thus,

$$S_x(\omega) = \lim_{T\to\infty}\frac{1}{2T}E\left[|F_T(\omega)|^2\right]$$

This provides the interpretation that $S_x(\omega)$ is the average power in a sample of x(t) at frequency ω. This also establishes that $S_x(\omega)$ must be nonnegative.

Let’s consider several examples of autocorrelation functions and compute their corresponding power spectral density functions.

Example 6.8 Consider the white noise process w(t), with autocorrelation function Rw(τ) = δ(τ). As we discussed previously, we can always compute integrals of these functions. Thus, we can take Fourier transforms to obtain Sw(ω) = 1. This is why the name “white noise” is used: the white noise process contains every frequency with uniform intensity.

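Example 6.9 Consider a wide-sense stationary process x(t) with autocorrelation function $R_x(\tau) = e^{-a|\tau|}$, where a > 0. Then,

$$\begin{aligned}
S_x(\omega) &= \int_{-\infty}^{\infty} e^{-a|t|}e^{-j\omega t}\,dt = \int_{-\infty}^{0} e^{(-j\omega+a)t}\,dt + \int_{0}^{\infty} e^{(-j\omega-a)t}\,dt \\
&= \frac{1}{-j\omega+a} - \frac{1}{-j\omega-a} = \frac{2a}{a^2+\omega^2}
\end{aligned} \tag{6.53}$$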
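The closed form (6.53) can be checked by numerical integration. The sketch below is a minimal illustration, assuming a real and even autocorrelation so that the transform reduces to twice a cosine integral over τ ≥ 0; the truncation point, step count and function name are arbitrary choices made for this example.

```python
import math

def psd_from_autocorrelation(R, omega, upper, steps=200_000):
    """Trapezoid-rule estimate of
    S(omega) = 2 * integral_0^upper R(tau) * cos(omega * tau) dtau,
    which equals the Fourier transform of R when R is real and even
    and negligible beyond tau = upper."""
    h = upper / steps
    total = 0.5 * (R(0.0) + R(upper) * math.cos(omega * upper))
    for i in range(1, steps):
        tau = i * h
        total += R(tau) * math.cos(omega * tau)
    return 2.0 * h * total

a, omega = 1.0, 2.0
numeric = psd_from_autocorrelation(lambda tau: math.exp(-a * abs(tau)),
                                   omega, upper=40.0 / a)
closed_form = 2.0 * a / (a * a + omega * omega)
print(numeric, closed_form)
```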


Example 6.10 Consider a wide-sense stationary process x(t) with autocorrelation function $R_x(\tau) = \max(1 - |\tau|/T, 0)$. Since the triangular shape is the convolution of two rectangular pulses of width T and height $\sqrt{1/T}$, the power spectral density is the square of the Fourier transform of the rectangular pulse. This transform is given by:

$$S_p(\omega) = \int_{-T/2}^{T/2} \frac{1}{\sqrt{T}}\,e^{-j\omega t}\,dt = \frac{1}{j\omega\sqrt{T}}\left(-e^{-j\omega T/2} + e^{j\omega T/2}\right) = \sqrt{T}\,\frac{\sin(\omega T/2)}{\omega T/2} \tag{6.54}$$

so that the power spectral density of x(t) is given by

$$S_x(\omega) = T\left(\frac{\sin(\omega T/2)}{\omega T/2}\right)^2$$
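As a sanity check on the convolution argument, the transform of the triangular autocorrelation can be integrated numerically and compared with the squared-sinc expression. This is a sketch only; the values of T and ω, the step count and the helper names are assumptions made for the illustration.

```python
import math

def triangle(tau, T):
    """R_x(tau) = max(1 - |tau|/T, 0)."""
    return max(1.0 - abs(tau) / T, 0.0)

def psd_triangle_numeric(omega, T, steps=20_000):
    """Trapezoid rule for 2 * integral_0^T R_x(tau) cos(omega tau) dtau
    (R_x is real, even, and zero outside [-T, T])."""
    h = T / steps
    total = 0.5 * (triangle(0.0, T) + triangle(T, T) * math.cos(omega * T))
    for i in range(1, steps):
        tau = i * h
        total += triangle(tau, T) * math.cos(omega * tau)
    return 2.0 * h * total

def psd_triangle_closed_form(omega, T):
    """T * (sin(omega T / 2) / (omega T / 2))**2, for omega != 0."""
    x = omega * T / 2.0
    return T * (math.sin(x) / x) ** 2

T, omega = 2.0, 1.5
numeric = psd_triangle_numeric(omega, T)
closed_form = psd_triangle_closed_form(omega, T)
print(numeric, closed_form)
```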

Below is a list of properties of autocorrelation functions of wide-sense stationary processes and their corresponding power spectral density functions. Let x(t) be wide-sense stationary with autocorrelation function Rx(τ) and power spectral density Sx(ω). Then,

1. For any constant a, the process ax(t) has autocorrelation function |a|2Rx(τ) and power spectral density |a|2Sx(ω).

2. Let x(t), y(t) be orthogonal, jointly wide-sense stationary processes (that is, $R_{xy}(\tau) = 0$). Define the process z(t) = x(t) + y(t). Then $R_z(\tau) = R_x(\tau) + R_y(\tau)$, and $S_z(\omega) = S_x(\omega) + S_y(\omega)$.

3. The autocorrelation function of $\frac{d}{dt}x(t)$ is given by $-\frac{d^2}{d\tau^2}R_x(\tau)$, and the power spectral density by $\omega^2 S_x(\omega)$.

4. The autocorrelation of $x(t)e^{j\omega_0 t}$ is $R_x(\tau)e^{j\omega_0\tau}$ and its power spectral density is $S_x(\omega - \omega_0)$.

5. If the mean function of x(t) is zero, i.e. $\mu_x = 0$, then the autocorrelation function of x(t) + b is $R_x(\tau) + |b|^2$, and its power spectral density function is $S_x(\omega) + 2\pi|b|^2\delta(\omega)$.

6. The autocorrelation of $x(t)\cos(\omega_0 t + \theta)$, where θ is uniformly distributed in [−π,π] and independent of x(t), is given by $\frac{1}{2}R_x(\tau)\cos\omega_0\tau$, and its power spectral density by $\frac{1}{4}S_x(\omega-\omega_0) + \frac{1}{4}S_x(\omega+\omega_0)$.

To conclude this section, consider now vector-valued wide-sense stationary processes. The concept of autocorrelation functions extends naturally to this setting, leading to matrix-valued autocorrelation functions. Clearly, a similar extension is easy to establish for power spectral density functions; for vector-valued processes, these will be matrix-valued functions of frequency ω, defined as the matrix of Fourier transforms of the elements of the autocorrelation matrix function. Properties of this matrix-valued power spectral density function follow naturally from the definition of the Fourier transform and the properties of the autocorrelation functions for vector-valued processes.


Chapter 7

Discrete State Markov Processes

Previously, we have discussed the concept of Markov processes. In this section, we want to focus on discrete-valued Markov processes, in both continuous and discrete time. Discrete state Markov processes, also known as Markov chains, were introduced by Andrei Markov in 1906 to study extensions of the Central Limit Theorem to sequences where the random variables were not independent and identically distributed. Such models form the basis for many interesting applications such as speech recognition, communications networks analysis and stochastic automata. Special cases of these processes include Poisson processes, random telegraph processes, random walks and some of the other processes we have discussed.

Let S be a finite or countably infinite set of possible values, which we call the state space. For random variables, this set S ⊂ ℝ. Let S = {a1, a2, . . .}. Consider a stochastic process x(t) with values in S. We can define the marginal probability at time t as

$$p(t) \equiv \begin{bmatrix} p(x(t) = a_1) \\ p(x(t) = a_2) \\ \vdots \end{bmatrix}$$

to be an infinite-dimensional sequence of values, corresponding to a probability mass function. If the set S were finite, this would be a finite vector.

Let x(t) be a Markov process with values in state space S. Assume we have a set of times t1 < t2 < ... < tn < tn+1. The Markov property requires that

$$P(X(t_{n+1}) \mid X(t_n), X(t_{n-1}), \ldots, X(t_1)) = P(X(t_{n+1}) \mid X(t_n))$$

The Markov structure allows us to determine the joint probability mass functions of the process from the transition probability function described above, as follows. Given times t1 < t2 < ... < tn and states α1, . . . ,αn ∈S, we have:

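$$\begin{aligned}
P(x(t_1) = \alpha_1, \ldots, x(t_n) = \alpha_n) &= P(x(t_n) = \alpha_n \mid x(t_1) = \alpha_1, \ldots, x(t_{n-1}) = \alpha_{n-1})\,P(x(t_1) = \alpha_1, \ldots, x(t_{n-1}) = \alpha_{n-1}) \\
&= P(x(t_n) = \alpha_n \mid x(t_{n-1}) = \alpha_{n-1})\,P(x(t_1) = \alpha_1, \ldots, x(t_{n-1}) = \alpha_{n-1}) \\
&\;\;\vdots \\
&= P(x(t_n) = \alpha_n \mid x(t_{n-1}) = \alpha_{n-1}) \cdots P(x(t_2) = \alpha_2 \mid x(t_1) = \alpha_1)\,P(x(t_1) = \alpha_1)
\end{aligned}$$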

Let t1 < t2. We can then think of the transition probability distribution between two times as a matrix (perhaps infinite-dimensional), where

P(x(t2) = ak|x(t1) = aj) ≡ Pjk(t1, t2)

A Markov process is homogeneous if the transition probabilities Pjk(t1, t2) depend only on the difference t2 − t1, so we write Pjk(t2 − t1).

In the remainder of this chapter, we discuss special properties of discrete state Markov processes. First, we will introduce discrete-time, discrete-space Markov processes, and discuss interesting aspects of such processes, including conditions for approaching a stationary distribution and ergodicity. Following this, we will introduce the continuous time Markov chains by means of a limiting argument. We then analyze a special class of processes, known as birth-death processes, that have a nearest neighbor transition property that facilitates analysis. We conclude this chapter with an introduction to queuing theory using applications of the theory of birth-death processes.

7.1 Discrete-time, Discrete Valued Markov Processes

7.1.1 Process Description

Let x(n),n = 0, 1, . . . be a discrete time, discrete-valued Markov process with possible values in the state space S = {a1,a2, . . .}. We assume that the process can have an infinite number of states; the case where there are only a finite number of states is a special case, and is often referred to as a Markov Chain. The Markov process is characterized by its initial distribution

$$p(0) = \begin{bmatrix} p(x(0) = a_1) \\ p(x(0) = a_2) \\ \vdots \end{bmatrix}$$

and the one-step transition matrices (with an infinite number of rows and columns, potentially) with elements $p_{ij}(n)$, where

$$p_{ij}(n) \equiv P(x(n+1) = a_j \mid x(n) = a_i) \tag{7.1}$$

is a conditional probability kernel over the state space S. Note that the transition probabilities are time-dependent, so that this is an inhomogeneous Markov process. Let P(n) denote the transition matrix at time n, and let p(n) be the probability mass function of the variable x(n).

The transition matrices can be used to compute the evolution of the probability distributions p(n) over time, as follows: Note that, at time 1,

$$p(x(0) = i, x(1) = j) = p(x(0) = i)\,P_{ij}(0) \tag{7.2}$$

Hence, the marginal probability at time 1 is given by

$$p(x(1) = j) = \sum_i p(x(0) = i)\,P_{ij}(0)$$

which can be written in terms of matrix operations as

$$p(1) = P(0)^T p(0)$$

Extending the above argument inductively yields the following recursion:

$$p(n) \equiv \begin{bmatrix} p(x(n) = a_1) \\ p(x(n) = a_2) \\ \vdots \end{bmatrix} = P(0,n)^T p(0) \tag{7.3}$$

where $P(m,n) \equiv P(m)P(m+1)\cdots P(n-1)$ for 0 ≤ m < n. The multistep transition matrix satisfies the Chapman-Kolmogorov equation

$$P(k,n) = P(k,m)P(m,n) \quad \text{for } k \le m < n$$

Note that P(k,k) = I.

Discrete-time one-step (P(n)) and multi-time step (P(k,n)) transition probability matrices must satisfy the laws of conservation of probability. That is, for any i, we must have

$$\sum_{j=1}^{\infty} P_{ij}(n) = 1; \quad \sum_{j=1}^{\infty} P_{ij}(m,n) = 1 \tag{7.4}$$
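The recursion (7.3), the Chapman-Kolmogorov equation and the conservation property (7.4) can be illustrated on a small inhomogeneous chain. In the sketch below, the two-state transition matrices, the initial distribution and all helper names are hypothetical values chosen only for the illustration.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def multistep(P_list, m, n):
    """P(m, n) = P(m) P(m+1) ... P(n-1), with P(m, m) = I."""
    result = identity(len(P_list[0]))
    for step in range(m, n):
        result = mat_mul(result, P_list[step])
    return result

def propagate(P, p):
    """One application of p_next = P^T p, i.e. p_next[j] = sum_i p[i] P[i][j]."""
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]

# Hypothetical time-varying one-step matrices P(0), P(1), P(2)
P_list = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
    [[1.0, 0.0], [0.6, 0.4]],
]
p0 = [0.25, 0.75]

P03 = multistep(P_list, 0, 3)
p3_direct = propagate(P03, p0)            # p(3) = P(0,3)^T p(0)

p3_stepwise = p0
for P in P_list:                          # p(n+1) = P(n)^T p(n)
    p3_stepwise = propagate(P, p3_stepwise)

# Chapman-Kolmogorov: P(0,3) = P(0,1) P(1,3)
ck = mat_mul(multistep(P_list, 0, 1), multistep(P_list, 1, 3))
print(p3_direct, p3_stepwise)
```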


When the number of states is finite and equal to N, the state transition matrix will be an N ×N matrix P(n), where P(n) is such that all of its rows sum up to 1. Matrices with the property that the rows sum up to 1 are known as stochastic matrices. The other key property of transition probability matrices is that all of their elements are nonnegative: pij ≥ 0. Combined with the fact that the rows add up to 1, one can use the following theorem from linear algebra:

Theorem 7.1 (Gershgorin's Theorem) Consider a square matrix A of dimension n×n. Define distances $d_i = \sum_{j=1, j\neq i}^{n} |A_{ij}|$. Define the set $L = \cup_{i=1}^{n}\{\lambda : |\lambda - A_{ii}| \le d_i\}$. Then, all of the eigenvalues of A are contained in the set L.

In other words, the magnitudes of the off-diagonal elements in each row can be summed up, and provide a bound for how far away from the diagonal elements the eigenvalues are. The set L consists of the union of circles of radius di centered around each of the diagonal elements Aii. Figure 7.1 illustrates the implications of Gershgorin's theorem for the matrix $A = \begin{bmatrix} 3 & 2 \\ 1 & 1 \end{bmatrix}$. The eigenvalues must lie in the union of two circles in the complex plane, centered at the diagonal elements (3,0) and (1,0), with radii 2 and 1, respectively.

Figure 7.1: Illustration of Gershgorin's Theorem.
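The bound for the 2×2 example above can be verified directly. The sketch below computes the discs from the row sums and obtains the eigenvalues of the 2×2 matrix from its characteristic polynomial; the function names are hypothetical, and the closed-form eigenvalue routine applies only to 2×2 matrices.

```python
import cmath

def gershgorin_discs(A):
    """Return (center, radius) for each row: center A[i][i],
    radius = sum of |A[i][j]| over the off-diagonal entries of row i."""
    n = len(A)
    return [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]

def eigenvalues_2x2(A):
    """Roots of lambda**2 - trace * lambda + det = 0 for a 2x2 matrix."""
    trace = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return [(trace + root) / 2.0, (trace - root) / 2.0]

A = [[3.0, 2.0], [1.0, 1.0]]
discs = gershgorin_discs(A)        # centers 3 and 1, radii 2 and 1
eigs = eigenvalues_2x2(A)          # 2 + sqrt(3) and 2 - sqrt(3)
inside = [any(abs(lam - c) <= r for c, r in discs) for lam in eigs]
print(discs, eigs, inside)
```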

The important implication of Gershgorin's theorem for stochastic matrices is that, since the rows must add up to 1 and all of the elements are non-negative, all of the circles are inside the unit circle, so that all of the eigenvalues of stochastic matrices have magnitude less than or equal to 1. Figure 7.2 illustrates Gershgorin's theorem for stochastic matrices.

Furthermore, we know that at least one eigenvalue has value equal to 1, because

$$P \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$

since each row must add up to 1.

A special class of discrete-time, discrete-valued Markov processes occurs when the transition probability matrix P(n) does not depend on time. In this case, this class of Markov processes is called homogeneous, and has the property that P(n) ≡ P for all n ≥ 0. In this case, $P(0,n) = P^n$, and $P(m,n) = P^{n-m}$ for n ≥ m. In case of finite number of states, P is an N ×N matrix, and can be best understood in terms of a graph where nodes represent the possible states, and arcs represent non-zero elements of P , corresponding to possible transitions in the Markov chain. Figure 7.3 shows the graph for a four state Markov transition


Figure 7.2: Illustration of Gershgorin's Theorem for stochastic matrices.

matrix

$$P = \begin{bmatrix} P_{11} & P_{12} & 0 & 0 \\ 0 & 0 & P_{23} & P_{24} \\ P_{31} & 0 & 0 & 0 \\ 0 & 0 & P_{43} & P_{44} \end{bmatrix}$$

Figure 7.3: Graph of Markov Chain transition probabilities.

To illustrate the graphical depiction of a Markov chain, consider the following example. We want to consider a biased, reflected random walk on the integers {1, 2, . . . , 10}. The process starts with state x(0) = 5. At each time interval, the process moves one step right with probability p, or one step left with probability 1 − p, except at the end states 1 and 10. In state 10, if the process attempts to move right, it is reflected back to state 10. Hence, at state 10, the process stays in state 10 with probability p, and moves to state 9 with probability 1 −p. Similarly, when the process is in state 1, it is reflected back to state 1 if it tries to move left. The diagram for this Markov chain is shown in Figure 7.4.


Figure 7.4: Graph of a reflected random walk.

The corresponding transition matrix is

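$$P = \begin{bmatrix}
1-p & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1-p & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1-p & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1-p & 0 & p & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1-p & 0 & p & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1-p & 0 & p & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1-p & 0 & p & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1-p & 0 & p & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p & 0 & p \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-p & p
\end{bmatrix}$$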

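The transition matrix of the reflected random walk can also be built programmatically from the rule "move right with probability p, left with probability 1 − p, and stay put when a move would leave the state space". The sketch below uses an illustrative value p = 0.6 and hypothetical function names; states 1 to 10 are stored at indices 0 to 9.

```python
def reflected_walk_matrix(n_states, p):
    """Transition matrix of the biased random walk on {1, ..., n_states}
    with reflection at both ends (index i holds state i + 1)."""
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        right = min(i + 1, n_states - 1)   # a right move from the last state is reflected
        left = max(i - 1, 0)               # a left move from the first state is reflected
        P[i][right] += p
        P[i][left] += 1.0 - p
    return P

def step(P, dist):
    """One step of p(n+1) = P^T p(n)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

p = 0.6                       # illustrative value
P = reflected_walk_matrix(10, p)

dist = [0.0] * 10
dist[4] = 1.0                 # the process starts in state x(0) = 5
for _ in range(3):
    dist = step(P, dist)
print(dist)
```

After three steps from state 5 the walk cannot have reached either boundary, so the distribution is the ordinary binomial one: state 8 has probability p³ and state 2 has probability (1 − p)³.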
7.1.2 Hitting probabilities and mean hitting times

Let x(n) be a homogeneous, discrete-time Markov process with transition probability matrix P , taking values in a discrete state space S. Suppose we have a subset of states A ⊂ S. The first hitting time of the subset A starting from a state x(0) = i is a random variable defined as:

$$H_i^A(\omega) = \inf\{n \ge 0 \mid x(n,\omega) \in A\}$$

Note that it is possible for $H_i^A(\omega)$ to take the value +∞, in case the process trajectory, for the experiment outcome ω, never reaches any of the states in A. The probability that the process hits A at all when it starts at state i is given by:

$$h_i^A = P(H_i^A(\omega) < \infty)$$

In many problems of interest, we want to compute expected hitting times and hitting probabilities given where we are. Such hitting times can indicate successful completion of events and reaching of milestones. What is surprising is that we will be able to do these computations using simple linear algebra techniques, as described below.

Let’s first consider an example. Suppose we have a four state Markov chain, with transition probability matrix P given by:

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Note that this system, once it reaches states 1 or 4, stays in those states. Suppose we start in state 2. We would like to compute the expected number of steps required to reach states 1 or 4.

We can compute this as follows: Let ki denote the expected time to reach states 1 or 4 starting from state i. Then, observe the following relationships:

k1 = 0; k4 = 0

What about k2 and k3? By the Markov nature of the process, observe that

k2 = 1 + 0.5k1 + 0.5k3; k3 = 1 + 0.5k2 + 0.5k4

because the expected hitting time from state i has to average over the expected hitting times from the value of the state at the next time. Basically, any trajectory that starts at i and hits the set A = {1, 4} has to take the first step to a state that is connected to i. From that next state, by time invariance, the expected hitting time is the same as that of trajectories that start at that state. These last two equations are easily solved to get k2 = k3 = 2.
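The same expected hitting times can be obtained by solving the linear equations mechanically. The sketch below uses exact rational arithmetic and a small Gauss-Jordan elimination; it assumes that every state outside the target set reaches the target set with probability 1, so that the linear system has a unique solution. The function names are hypothetical, and states 1 to 4 are stored at indices 0 to 3.

```python
from fractions import Fraction as F

def solve_linear(M, b):
    """Gauss-Jordan elimination with exact rational arithmetic
    (assumes M is square and nonsingular)."""
    n = len(M)
    aug = [list(row) + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def expected_hitting_times(P, targets):
    """Solve k_i = 0 for i in targets, k_i = 1 + sum_j P[i][j] k_j otherwise."""
    n = len(P)
    others = [i for i in range(n) if i not in targets]
    # Rearranged for the non-target states: (I - Q) k = 1, with Q = P restricted to them
    M = [[(F(1) if i == j else F(0)) - P[i][j] for j in others] for i in others]
    solution = solve_linear(M, [F(1)] * len(others))
    k = [F(0)] * n
    for i, value in zip(others, solution):
        k[i] = value
    return k

half = F(1, 2)
P = [[F(1), 0, 0, 0],
     [half, 0, half, 0],
     [0, half, 0, half],
     [0, 0, 0, F(1)]]
k = expected_hitting_times(P, targets={0, 3})   # states 1 and 4
print(k)
```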

What about a hitting probability? Let the set A = {4}. Then, reasoning along the same lines,

$$h_4^A = 1; \quad h_3^A = 0.5h_2^A + 0.5h_4^A; \quad h_2^A = 0.5h_1^A + 0.5h_3^A; \quad h_1^A = 0$$

Solving these, we get $h_1^A = 0$, $h_2^A = 1/3$, $h_3^A = 2/3$, $h_4^A = 1$.

Note that we have seen examples of this analysis in Chapter 5, where we studied the gambler’s ruin problem associated with a random walk. The hitting probability analysis is the same as the probability that the gambler will go broke starting with a certain amount of money.

Can we generalize this to arbitrary Markov chains? Let’s first assume that the state space S is finite. The result below characterizes the general solution:

Theorem 7.2 Let $h^A$ denote the vector of hitting probabilities for a subset A of the finite state space S. Then, $h^A$ is the smallest non-negative solution of the following set of linear equations:

$$\begin{cases} h_i^A = 1 & i \in A \\ h_i^A = \sum_j p_{ij} h_j^A & i \notin A \end{cases}$$

In vector form, $h^A = \hat{P} h^A$ for the entries with $i \notin A$, where $\hat{P}$ is the matrix P with the rows corresponding to i ∈ A deleted. By smallest solution we mean that, if x is another non-negative solution, then $x_i \ge h_i^A$ for all i.

Note that we have the same number of equations and unknowns, as there is one equation for each i ∉ A.

Let's prove the above. First, let's show that $h^A$ satisfies the equations. Assume x(0) = i ∈ A. Then, the hitting time $H_i^A = 0$, and the hitting probability $h_i^A = 1$, which the theorem guarantees by construction.

Now, assume that x(0) = i ∉ A. Then, $H_i^A > 0$, as it will take at least one step to reach a state in A. By the Markov property of the process,

$$h_i^A = P(H_i^A < \infty) = \sum_j P(H_i^A < \infty, x(1) = j) = \sum_j P(H_i^A < \infty \mid x(1) = j)\,P(x(1) = j \mid x(0) = i) = \sum_j h_j^A P_{ij}$$

which shows that $h^A$ satisfies the equations of Theorem 7.2.

Now, suppose we have a non-negative solution g to the equations in Theorem 7.2. We want to show that its entries must be greater than or equal to the hitting probabilities. We know that $h_i^A = g_i$ for i ∈ A, as they are set to 1. Suppose i ∉ A. Then,

$$g_i = \sum_j P_{ij} g_j = \sum_{j \in A} P_{ij} g_j + \sum_{j \notin A} P_{ij} g_j = \sum_{j \in A} P_{ij} + \sum_{j \notin A} P_{ij} g_j$$

Now, substitute for gj in the last term, to get:

$$g_i = \sum_{j \in A} P_{ij} + \sum_{j \notin A} P_{ij}\Big(\sum_{k \in A} P_{jk} + \sum_{k \notin A} P_{jk} g_k\Big) = P(x(1) \in A) + P(x(1) \notin A, x(2) \in A) + \sum_{j,k \notin A} P_{ij} P_{jk} g_k$$

By repeated substitution, we get

$$g_i = P(x(1) \in A) + P(x(1) \notin A, x(2) \in A) + P(x(1), x(2) \notin A, x(3) \in A) + \dots + P(x(1), \dots, x(n-1) \notin A, x(n) \in A) + \sum_{j_1, \dots, j_n \notin A} P_{ij_1} P_{j_1 j_2} \cdots P_{j_{n-1} j_n} g_{j_n}$$

Note that the terms other than the last sum add up to $P(H_i^A \le n)$. Thus, $g_i \ge P(H_i^A \le n)$ for any n, because the last term is non-negative ($g_k \ge 0$ for all k). Thus,

$$g_i \ge \lim_{n \to \infty} P(H_i^A \le n) = P(H_i^A < \infty) = h_i^A$$

which shows that $h^A$ is the smallest non-negative solution.

Can there be multiple solutions? Consider the Markov chain with transition probability matrix

$$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1/2 & 1/2 \\ 0 & 1/2 & 1/2 \end{bmatrix}$$

and let A = {1}. Clearly, hA1 = 1, and hA2 = hA3 = 0, because starting at either state 2 or 3, one cannot reach state 1 at all. Note, however that the equations in theorem 7.2 can be solved by any vector g = (1,k,k)T .

Of course, the smallest non-negative solution among these is $(1, 0, 0)^T$, which is the vector of hitting probabilities.

The above theorem is true even if the state space S is infinite. However, we now have an infinite number of equations to consider, which makes numerical computation harder.
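The proof of Theorem 7.2 suggests one way to pick out the smallest non-negative solution numerically: start from zero outside A and apply the equations repeatedly, so that the n-th iterate equals $P(H_i^A \le n)$. The sketch below assumes NumPy; the function name `hitting_probabilities` and the iteration count are illustrative choices, not something defined in the notes.

```python
import numpy as np

def hitting_probabilities(P, target, n_iter=2000):
    """Approximate the minimal non-negative solution of Theorem 7.2.

    Starts from h = 0 outside the target set and repeatedly applies
    h_i <- sum_j P_ij h_j for i outside the target set, keeping h_i = 1 on it.
    After n sweeps, h_i equals P(H_i^A <= n), which increases to h_i^A.
    """
    P = np.asarray(P, dtype=float)
    in_target = np.zeros(P.shape[0], dtype=bool)
    in_target[list(target)] = True
    h = in_target.astype(float)
    for _ in range(n_iter):
        h = np.where(in_target, 1.0, P @ h)
    return h

# Three-state chain with non-unique solutions (1, k, k); A = {1} is index 0.
P3 = np.array([[1.0, 0.0, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.5, 0.5]])
h3 = hitting_probabilities(P3, target=[0])

# Four-state chain from the earlier example; A = {4} is index 3.
P4 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.5, 0.0, 0.5, 0.0],
               [0.0, 0.5, 0.0, 0.5],
               [0.0, 0.0, 0.0, 1.0]])
h4 = hitting_probabilities(P4, target=[3])
print(h3, h4)
```

Because the iteration starts at zero, it never overshoots the minimal solution, which is why it returns (1, 0, 0) rather than some other vector of the form (1, k, k).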

Example 7.1 Consider a random walk on {0, 1, 2, . . .}, where P00 = 1, and Pi(i+1) = Pi(i−1) = 1/2 for i ≥ 1, Pij = 0, |i − j| ≥ 2. This corresponds to an infinite gambler’s ruin problem where the gambler never leaves until he is broke. We would like to compute the hitting probability for the set A = {0}, corresponding to the gambler leaving broke. Here are the relevant equations for the hitting probability hAi :

$$h_0^A = 1$$
$$h_1^A = 0.5h_0^A + 0.5h_2^A$$
$$\vdots$$
$$h_n^A = 0.5h_{n-1}^A + 0.5h_{n+1}^A$$
$$\vdots$$

We can solve this via z-transforms, as follows: the characteristic equation of the recursion is

$$0.5z^2 - z + 0.5 = 0$$

By inspection, this has a repeated root at z = 1. Thus, the recursion admits solutions of the form $h_n^A = C + Dn$ for some constants C, D. To match the initial condition $h_0^A = 1$, we get C = 1. The second equation yields $C + D = 0.5 + 0.5C + D$, which is true for all D. Thus, any value of D ≥ 0 will yield a valid non-negative solution! However, $h_n^A$ is a probability, and as such, it cannot exceed 1. Indeed, the smallest non-negative solution has D = 0, so $h_n^A = 1$ for all n!

What if we change the problem so that Pi(i+1) = 3/4,Pi(i−1) = 1/4? In this case, the main recursion yields

$$h_n^A = 0.25h_{n-1}^A + 0.75h_{n+1}^A$$

with characteristic equation $1 - 4z + 3z^2 = 0$, whose roots are z = 1 and z = 1/3, which yields solutions of the form $h_n^A = C + D(1/3)^n$. To fit the initial condition $h_0^A = 1$, we have C + D = 1, or D = 1 − C. Thus, the general form of the solution is

$$h_n^A = (1 - C)\left(\frac{1}{3}\right)^n + C = \left(\frac{1}{3}\right)^n + C\left(1 - \left(\frac{1}{3}\right)^n\right)$$

Note that, for any C ≥ 0, this remains non-negative. The smallest non-negative solution is given by C = 0, which is $h_n^A = (1/3)^n$.
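The closed form $(1/3)^n$ can be sanity-checked by truncating the walk. The sketch below is an illustration under stated assumptions: it adds a hypothetical second absorbing barrier at level N = 60 (treated as "never ruined"), solves the resulting finite linear system with NumPy, and compares the first few values with $(1/3)^n$. The truncation level and variable names are not part of the example.

```python
import numpy as np

p_up, p_down = 0.75, 0.25
N = 60  # truncation level (an assumption of this sketch, not part of the example)

# Unknowns h_1..h_{N-1}; boundary values h_0 = 1 (ruin) and h_N = 0 (treated as escape).
M = np.zeros((N - 1, N - 1))
b = np.zeros(N - 1)
for n in range(1, N):
    row = n - 1
    M[row, row] = 1.0
    if n - 1 >= 1:
        M[row, row - 1] = -p_down
    else:
        b[row] += p_down * 1.0   # contribution of the boundary value h_0 = 1
    if n + 1 <= N - 1:
        M[row, row + 1] = -p_up  # the boundary value h_N = 0 contributes nothing
h = np.linalg.solve(M, b)

closed_form = np.array([(1.0 / 3.0) ** n for n in range(1, 11)])
print(np.max(np.abs(h[:10] - closed_form)))  # gap between truncated and closed-form values
```

As N grows, the truncated solution approaches the minimal solution of the infinite system, which is the sense in which C = 0 is the right choice.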

Theorem 7.2 deals with hitting probabilities. We can develop a similar result for hitting times.

Theorem 7.3 Let $k^A$ denote the vector of expected hitting times for a subset A of the finite state space S, where these values could be infinite. Then, $k^A$ is the smallest non-negative solution of the following set of linear equations:

$$\begin{cases} k_i^A = 0 & i \in A \\ k_i^A = 1 + \sum_{j \in S} p_{ij} k_j^A & i \notin A \end{cases}$$

In vector form, $k^A = \mathbf{1} + \hat{P} k^A$ for the entries with $i \notin A$, where $\hat{P}$ is the state transition matrix with the rows for i ∈ A removed.

To show this, we proceed as before. We show that $k^A$ satisfies the equations in Theorem 7.3. If x(0) = i ∈ A, then $H_i^A = 0$, so $k_i^A = 0$. If x(0) = i ∉ A, then $H_i^A \ge 1$. By the Markov property,

$$E[H_i^A \mid x(1) = j] = 1 + k_j^A$$


Thus,

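$$\begin{aligned}
k_i^A &= \sum_{n=1}^{\infty} n P(H_i^A = n) + \infty \cdot P(H_i^A = \infty) = \sum_{n=1}^{\infty} P(H_i^A \ge n) \\
&= \sum_{n=1}^{\infty} \sum_{j \in S} P(H_i^A \ge n, x(1) = j) = \sum_{n=1}^{\infty} \sum_{j \in S} P(H_i^A \ge n \mid x(1) = j)\,P(x(1) = j \mid x(0) = i) \\
&= \sum_{j \in S} \sum_{n=1}^{\infty} P(H_i^A \ge n \mid x(1) = j)\,P(x(1) = j \mid x(0) = i) \\
&= \sum_{j \in S} P_{ij}\,(1 + E[H_j^A]) = 1 + \sum_{j \in S} P_{ij} k_j^A
\end{aligned}$$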

which shows that the expected hitting times satisfy the equations of Theorem 7.3.

Let g be any non-negative solution of the linear equations in the theorem. Then, $g_i = k_i^A = 0$ for i ∈ A. Suppose i ∉ A. Then,

$$g_i = 1 + \sum_{j \notin A} P_{ij} g_j = 1 + \sum_{j \notin A} P_{ij}\Big(1 + \sum_{k \notin A} P_{jk} g_k\Big) = P(H_i^A \ge 1) + P(H_i^A \ge 2) + \sum_{j,k \notin A} P_{ij} P_{jk} g_k$$

Continuing the substitutions, we get

$$g_i = P(H_i^A \ge 1) + P(H_i^A \ge 2) + \cdots + P(H_i^A \ge n) + \sum_{j_1, \dots, j_n \notin A} P_{ij_1} P_{j_1 j_2} \cdots P_{j_{n-1} j_n} g_{j_n}$$

Noting that gj ≥ 0, we have

$$g_i \ge \lim_{n \to \infty} \left[P(H_i^A \ge 1) + P(H_i^A \ge 2) + \cdots + P(H_i^A \ge n)\right] = E[H_i^A] = k_i^A$$

which shows that kA is the smallest nonnegative solution.

Example 7.2 Consider the previous Example 7.1, where we set $P_{i(i+1)} = 1/4$, $P_{i(i-1)} = 3/4$. Note that, on average, we are headed towards 0. We want to compute the expected time to reach state 0 from any state n. The relevant equations from Theorem 7.3 are:

$$k_0^0 = 0$$
$$k_1^0 = 1 + 0.75k_0^0 + 0.25k_2^0$$
$$\vdots$$
$$k_n^0 = 1 + 0.75k_{n-1}^0 + 0.25k_{n+1}^0$$
$$\vdots$$

Note that this set of linear equations has a constant input on the right hand side, corresponding to a pole at z = 1. The homogeneous recursion has roots z = 1 and z = 3, so the solution is of the form

$$k_n^0 = 2n + A + B\,3^n$$

The initial condition $k_0^0 = 0$ means A = −B. Note that B ≥ 0 is required for the solution to stay non-negative. The smallest non-negative solution is B = 0, which yields $k_n^0 = 2n$.
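As a rough numerical check of $k_n^0 = 2n$, one can truncate the state space. The sketch below assumes NumPy and places a hypothetical reflecting barrier at level N = 40 (an up-move at N is replaced by staying put); this truncation is an assumption of the illustration, not part of the example. For states far below N the truncated answer should be very close to 2n.

```python
import numpy as np

p_up, p_down = 0.25, 0.75
N = 40  # truncation level with a reflecting barrier (an assumption of this sketch)

# Unknowns k_1..k_N with k_0 = 0:
#   k_n = 1 + p_down*k_{n-1} + p_up*k_{n+1}   for 1 <= n < N
#   k_N = 1 + p_down*k_{N-1} + p_up*k_N       (up-move at N replaced by staying put)
M = np.zeros((N, N))
for n in range(1, N + 1):
    row = n - 1
    M[row, row] = 1.0
    if n >= 2:
        M[row, row - 1] = -p_down
    if n < N:
        M[row, row + 1] = -p_up
    else:
        M[row, row] -= p_up
k = np.linalg.solve(M, np.ones(N))

print(k[:5])  # compare with 2, 4, 6, 8, 10
```

The finite system has a unique solution, so no minimality argument is needed here; the truncation plays the role that choosing B = 0 plays in the infinite system.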

Figure 7.5: Markov chain with two limits

Figure 7.6: Period 2 Markov chain

7.1.3 Steady state behavior of discrete time Markov chains

As discussed previously, the marginal probability mass function p(n) evolves according to a linear system:

$$p(n + 1) = P(n)^T p(n)$$

For homogeneous Markov chains in discrete time, this equation may have a limit as n → ∞, as all the eigenvalues of P will have magnitude less than or equal to 1. We are interested in providing conditions where

$$\lim_{n \to \infty} P^n = P_\infty \qquad (7.5)$$

and

$$\lim_{n \to \infty} p(n) = \lim_{n \to \infty} (P^n)^T p(0) = P_\infty^T p(0) = \pi \qquad (7.6)$$

To illustrate issues that can arise, consider the two graphs illustrated in Figures 7.5 and 7.6. The first graph shows that, after starting in state 2, one can either go to state 1 or to states 3 and 4. Depending on which transition is used, the limit will be different. It is clear that this Markov chain may have multiple limiting distributions. The second figure illustrates a more complex case. If one starts in state 1 at time 0, note that one can only be in an odd-valued state at even times! This Markov chain will not approach a limit, but rather will oscillate between two limits!

Figure 7.7: Example of Markov Chain with inaccessible states

For finite state Markov chains, one can define regularity conditions that guarantee that there is a unique eigenvalue of P with magnitude 1, so that there are unique limits in eqs. (7.5) and (7.6). Furthermore, these conditions can be established from the transition diagram of the Markov chain! We discuss these next.

Consider two states i,j of the Markov chain. State j is said to be accessible from state i if there exists a time n such that (Pn)ij > 0. An equivalent graphical condition is that there exists a directed path with positive probability arcs from node i to node j in the Markov Chain graph. In the reflected random walk diagram of Figure 7.4, every state is accessible from every other state. However, consider the minor variation shown in Figure 7.7, where one of the feasible arcs has been removed. In this case, state 7 is accessible from state 6, but state 6 is not accessible from state 7.

Two states i,j are said to communicate if i is accessible from j and j is accessible from i; by convention, every state is said to communicate with itself. Communication is a transitive, symmetric and reflexive binary relationship, hence it is an equivalence relationship. A communicating class is a non-empty set of states that communicate with each other, and no state in the class communicates with any state outside the class. The set of possible states of a finite-valued Markov Chain can be partitioned into disjoint communicating classes. For instance, the Markov Chain illustrated in Figure 7.7 has 2 communicating classes: {1, 2, 3, 4, 5, 6} and {7, 8, 9, 10}.

When a Markov Chain has only one communicating class, it is said to be irreducible. In irreducible Markov Chains, every state communicates with every other state, as in Fig. 7.4.

A state i in a homogeneous Markov Chain is said to be transient if, given that the Markov Chain starts at state i, there is a non-zero probability that the state never returns to state i. Formally, assume x(0) = i, and define the random time T = min[t > 0 : x(t∆) = i]. Then, state i is transient if P(T = ∞) > 0. For finite-state Markov Chains, there is a graphical way of identifying a transient state: A state i is transient if and only if there is a second state j such that j is accessible from i, but i is not accessible from j. In Figure 7.7, states 1, 2, 3, 4, 5 and 6 are transient states: they can each access state 7, but cannot be accessed from state 7. When a state is not transient, it is called recurrent. In a finite-state chain, recurrent states have the property that the expected time to return to the state, given that the Markov Chain starts in that state, is finite. In terms of the random time T defined previously, E[T] < ∞ for recurrent states. In Fig. 7.7, states 7, 8, 9 and 10 are recurrent states. Note that, for finite state Markov Chains, we can label each communicating class as either recurrent or transient.

The meaning of transient states is that, as time grows, the probability of being in a transient state decays to zero. If there is a limiting probability distribution π, then πi = 0.

Note the following: If a finite state Markov chain has more than one recurrent communicating class, there will be more than one limiting distribution for p(n), and the limit will depend on the initial distribution p(0). The matrix P will have more than one eigenvalue equal to 1. This is the case in the Markov Chain in Fig. 7.5, where state 1 is one recurrent communicating class, and states 3, 4 are the other communicating class.

When there is only one recurrent communicating class, there is a unique stationary probability distribution π such that

$$P^T \pi = \pi \qquad (7.7)$$

Specifically, the matrix P will have a single eigenvalue with value 1. However, this condition is insufficient to guarantee that this stationary probability distribution will be the limit distribution for arbitrary initial probability distributions. Specifically, consider Fig. 7.6. It is easy to verify that all states belong to a single communicating class, which is recurrent. However, we have already established that, starting from the initial condition x(0) = 1, the probabilities p(n) do not approach a limit! Indeed, they will approach a limit cycle where they will shift among two different limits for odd and even values of n. In this case, there is a second eigenvalue of P at value -1.

For a finite state Markov Chain, we define the period of state j as the greatest common divisor of the lengths of all the cycles from state j to itself in the graph of the Markov Chain. A more mathematical definition is that the period d is the largest integer d such that $(P^n)_{jj} = 0$ unless n is divisible by d. A state with period 1 is said to be aperiodic. A Markov Chain is periodic if every state has the same period, greater than 1.
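The definition of the period translates directly into a small computation on the support of the powers of P. The following sketch assumes NumPy; the function name `period`, the search horizon `max_steps`, and the two test chains (a four-state ring and a variant with one self-loop) are hypothetical illustrations, not chains taken from the figures.

```python
from math import gcd
import numpy as np

def period(P, j, max_steps=None):
    """GCD of all n <= max_steps with (P^n)_jj > 0 (returns 0 if no return is found)."""
    P = np.asarray(P, dtype=float)
    n_states = P.shape[0]
    if max_steps is None:
        max_steps = 2 * n_states * n_states  # generous horizon for small chains (assumption)
    reach = (P > 0).astype(int)   # support of P^1
    step = reach.copy()
    d = 0
    for n in range(1, max_steps + 1):
        if step[j, j] > 0:
            d = gcd(d, n)
        step = ((step @ reach) > 0).astype(int)  # support of P^(n+1)
    return d

# Hypothetical ring of 4 states: move to either neighbour with probability 1/2.
ring = np.zeros((4, 4))
for i in range(4):
    ring[i, (i + 1) % 4] = 0.5
    ring[i, (i - 1) % 4] = 0.5

# Same ring, but state 0 may also stay put, which creates a cycle of length 1.
lazy = ring.copy()
lazy[0] = [0.5, 0.25, 0.0, 0.25]

print(period(ring, 0), period(lazy, 0))
print(np.sort(np.linalg.eigvals(ring).real))  # the periodic ring has an eigenvalue at -1
```

Working with the zero/non-zero pattern rather than the numeric powers avoids underflow and reflects the fact that the period depends only on the graph.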

We can now give conditions for a finite state Markov Chain to have a unique limiting probability distribution π, which is approached from any initial probability distribution p(0). We state this below as a theorem.

Theorem 7.4 Assume that x(n) is a finite state homogeneous Markov chain with transition probability matrix P . If the Markov Chain is irreducible and aperiodic, then there exists a unique limit distribution π. Furthermore, this limit has the property that πj > 0 for all states j. Such a Markov Chain is called ergodic.

The limit distribution π is the unique eigenvector of the matrix PT corresponding to the eigenvalue 1, with entries that sum up to 1.

What happens if the Markov Chain is aperiodic and has only one recurrent communicating class, but has transient states? In this case, there is still a unique limit that all initial probability distributions converge to, but the limit will have the property that πi = 0 if i is a transient state.

Why do we use the name ergodic in the above theorem? It turns out that such Markov chains are completely ergodic processes when the initial distribution for x(0) starts at the limiting distribution π. We can show then that the process is strict sense stationary (by the Markov property), and that it will be strictly ergodic. That is, for any bounded real-valued function f : S →R, we have

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} f(x(n)) \overset{a.e.}{=} E[f(x)]$$

The convergence above is stronger than mean-square because of the finite nature of the process. Furthermore, even if the initial distribution is not π, the above equation is valid because the distribution of the x(n) will converge to π fast enough!

An important problem in the analysis of Markov Chains is computing the stationary probability distribution π. The algebraic characterization is $P^T \pi = \pi$, where P is the transition probability matrix for both finite and infinite number of states. This can be a cumbersome set of equations to solve. There is another set of equations based on the graphical representation of the Markov Chain transitions, that can be easier to analyze. A cut of a directed graph is a set of arcs such that, when the arcs are removed from the graph, the graph is divided into two disjoint sets of nodes with no arcs in between them.

The useful property of cuts is that, given any cut of the Markov Chain graph, the net probability flow across that cut must equal zero once the system reaches the stationary distribution. A cut C specifies a subset A ⊂ S and its complement $A^c$ in S, and consists of the arcs going from A to $A^c$, and from $A^c$ to A. Given a distribution π on the states of the Markov Chain, the net probability flow on a cut C is defined as

$$F(A, A^c) = \sum_{i \in A} \sum_{j \in A^c} P_{ij} \pi_i - \sum_{j \in A^c} \sum_{i \in A} P_{ji} \pi_j$$

The main result is that, if π is a stationary distribution of a Markov chain, then the net probability flow along any cut must be zero! This is summarized in the theorem below:

Theorem 7.5 π is a stationary distribution of a Markov chain if and only if $\sum_i \pi_i = 1$ and the net probability flow on any cut in the Markov chain graph is zero. That is, for any A ⊂ S, we have

$$\sum_{i \in A} \sum_{j \in A^c} P_{ij} \pi_i - \sum_{j \in A^c} \sum_{i \in A} P_{ji} \pi_j = 0 \qquad (7.8)$$

This property is referred to as probability balance.

To see that the theorem is equivalent to stationarity, note that if we select A = {i}, we get exactly the balance equations for the eigenvector, as $\pi_i = \sum_{j \in S} P_{ji} \pi_j$. It is also easy to show the converse, so that starting from the balance equations, one can show that the net flow in and out of any group of states is zero for stationary distributions.
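Probability balance is easy to verify numerically on a small chain. The sketch below assumes NumPy and uses a hypothetical three-state chain chosen for illustration; the helper names `stationary_distribution` and `net_flow` are not from the notes. It computes π as the eigenvector of $P^T$ for the eigenvalue 1 and then evaluates the net flow of equation (7.8) over every non-trivial subset A.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Eigenvector of P^T for the eigenvalue closest to 1, normalized to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    v = np.real(eigvecs[:, idx])
    return v / v.sum()

def net_flow(P, pi, A):
    """Net probability flow F(A, A^c) of equation (7.8)."""
    A = list(A)
    Ac = [s for s in range(P.shape[0]) if s not in A]
    out_flow = sum(P[i, j] * pi[i] for i in A for j in Ac)
    in_flow = sum(P[j, i] * pi[j] for j in Ac for i in A)
    return out_flow - in_flow

# Hypothetical three-state chain (not one of the chains in the figures).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = stationary_distribution(P)

flows = [net_flow(P, pi, A)
         for size in (1, 2)
         for A in itertools.combinations(range(3), size)]
print(pi, flows)
```

For this chain the cut between states 1 and 2 alone gives $0.5\pi_1 = 0.25\pi_2$, which is simpler than any of the full balance equations.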

So why is this useful? Sometimes, it is easy to identify cuts that yield equations that are simpler than the balance equations. To illustrate how to use probability balance to compute stationary distributions, consider the example in Figure 7.8. The example shows three different cuts, each of which separates the graph into two disconnected sets of nodes with no arcs across them. Applying (7.8) to each of these cuts yields the equations:

$$P_{14}\pi_1 + P_{12}\pi_1 - P_{51}\pi_5 = 0$$
$$P_{23}\pi_2 - P_{32}\pi_3 = 0$$
$$P_{24}\pi_2 + P_{14}\pi_1 - P_{45}\pi_4 = 0$$

Other cuts are possible, yielding additional equations. Furthermore, we know that π is a probability distribution, with entries that sum to one, which yields another equation that can be used to solve for the entries of π.

Figure 7.8: Illustration of probability balance

Figure 7.9: Diagram of the Markov Chain for the example

Example 7.3 Consider a 4-state discrete time Markov chain, with transition probability matrix described below:

$$P = \begin{bmatrix} 0.2 & 0.2 & 0.2 & 0.4 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0.1 & 0.9 \\ 0.2 & 0 & 0 & 0.8 \end{bmatrix}$$

The graph illustrating the transitions of this Markov chain is shown in Fig. 7.9. Looking at the diagram, it is easy to see that all 4 states are recurrent, as there are directed paths from any one state to any other state. Thus, the chain has a single recurrent communicating class, and is therefore irreducible. One can also determine that the Markov chain is aperiodic, because some states have self-loops (cycles of length 1). Thus, the Markov chain has a unique steady state distribution. To compute it, we need 4 equations. One of them is:

π1 + π2 + π3 + π4 = 1

To find 3 others, cut node 2 away from the graph. The flow on that cut yields:

0.2π1 = π2

Cut node 3 away from the graph, to get: 0.2π1 = 0.9π3

To get the last equation, we can cut around node 1 to get:

0.8π1 = 0.2π4

Using the last 3 equations, we get: π2 = π1/5; π3 = 2π1/9; π4 = 4π1

Substituting into the first equation yields:

$$\pi_1(1 + 1/5 + 2/9 + 4) = 1 \Rightarrow \pi_1 = \frac{45}{244}$$

$$\pi_2 = \frac{9}{244}; \quad \pi_3 = \frac{10}{244}; \quad \pi_4 = \frac{180}{244}$$
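The result of Example 7.3 can be cross-checked against the algebraic characterization $P^T \pi = \pi$. The sketch below assumes NumPy; replacing the last balance equation with the normalization condition is one common way to obtain a non-singular system, and is a choice made for this illustration rather than the method used in the example.

```python
import numpy as np

P = np.array([
    [0.2, 0.2, 0.2, 0.4],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.1, 0.9],
    [0.2, 0.0, 0.0, 0.8],
])

n = P.shape[0]
# The balance equations (P^T - I) pi = 0 are linearly dependent, so replace the
# last one with the normalization sum(pi) = 1.
M = P.T - np.eye(n)
M[-1, :] = 1.0
rhs = np.zeros(n)
rhs[-1] = 1.0
pi = np.linalg.solve(M, rhs)

print(pi * 244)                         # compare with (45, 9, 10, 180)
P_limit = np.linalg.matrix_power(P, 200)
print(P_limit)                          # each row should be close to pi
```

The second printout illustrates Theorem 7.4: because the chain is irreducible and aperiodic, every row of $P^n$ approaches π regardless of the starting state.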

The above discussion focused on finite state Markov chains, where the state space S has a finite number of states. What changes when the state space is infinite? We can no longer use linear algebra to establish our results, as the transition probability function Pij does not have a convenient representation as a finite matrix. We highlight some of the key issues and differences below.


Example 7.4 Consider a random walk with probability 1/2 of going up or down at each time. It is easy to see that every state communicates with every other state. However, by symmetry, at equilibrium every state should have the same probability, which of course must be 0 since there are an infinite number of states! Note that this Markov chain has period 2, and has a single communicating class.

Example 7.5 Consider a Markov chain defined on the non-negative integers as follows: $P_{00} = 1/2$, $P_{01} = 1/2$. For k > 0, $P_{k(k-1)} = P_{k(k+1)} = 1/2$. All other $P_{ij} = 0$, |i − j| ≥ 2. This chain is aperiodic (state 0 has a self-transition, so it has period 1) and has a single communicating class. However, this chain will not have an equilibrium distribution. Looking at balance equations, cutting between states k and k + 1, we get the relation:

$$\pi_k = \pi_{k+1}, \quad k = 0, 1, \dots$$

Hence, every state would have the same steady state probability, but with an infinite number of states, they would all be zero, a contradiction!

One way of seeing this is to look at the expected hitting time to hit 0 from state n. As in the previous section, this satisfies

$$k_n^0 = 1 + 0.5k_{n-1}^0 + 0.5k_{n+1}^0$$

With $k_0^0 = 0$, the form of the solution is $k_n^0 = Bn - n^2$. There is no value of B for which $k_n^0$ is non-negative for all n, so $k_n^0 = \infty$ for all n ≥ 1. Thus, although we visit state 0 with probability 1 from any starting state (as can be seen by solving for $h_n^0 = 1$), it takes on average an infinite number of steps to do so!

The first important difference is in the concept of recurrence. When the transition graph was connected and the state space was finite, we could guarantee that $(P^n)_{ij} > 0$ for some n, and with probability 1, we would visit state j in finite expected time. When the state space is infinite, this condition is no longer sufficient.

Let x(n) be a time-homogeneous Markov chain with transition probability P . Note that the state space S may be infinite. Define the following quantities:

$$H_i = \inf\{n \ge 0 : x(n) = i\} = \text{hitting time for state } i$$
$$T_i = \inf\{n \ge 1 : x(n) = i\} = \text{first passage time for state } i$$

Notice that $H_i = T_i$ as long as x(0) ≠ i. When x(0) = i, then $T_i$ is the revisit time for state i. We can now define some useful quantities relating how x(n) visits a particular state i, as:

$$V_i = \sum_{n=0}^{\infty} I[x(n) = i] \quad \text{is the number of visits to state } i$$
$$f_i = P(T_i < \infty \mid x(0) = i) \quad \text{is the probability that the chain revisits state } i$$
$$m_i = E[T_i \mid x(0) = i] \quad \text{is the expected return time to state } i$$

Consider the case of a finite-state aperiodic Markov chain with a single recurrent communicating class, but with some transient states. Let i be a transient state. Then, Vi is finite, and fi < 1. However, if i is a recurrent state, we get that Vi = ∞ with probability 1, fi = 1 and mi < ∞, so that the chain continues to revisit state i. We use these concepts to extend the definition of recurrence to infinite state Markov chains:

Definition 7.1 A state i of a homogeneous Markov chain x(n) is recurrent if

P(Vi = ∞|x(0) = i) = 1

A recurrent state is one that you return to an infinite number of times. Indeed, we can characterize a recurrent state as one for which fi = 1, and a transient state as one for which fi < 1. When the state space is not finite, we don’t have simple characterizations of what recurrent states and transient states are. However, we can use the transition probabilities to get equivalent definitions:

Theorem 7.6 State i in a homogeneous Markov chain is recurrent if and only if

$$\sum_{n=0}^{\infty} (P^n)_{ii} = \infty$$


To show this, note that for recurrent i, one has P(Vi = ∞|x(0) = i) = 1. Note also the following interpretation:

$$(P^n)_{ii} = P(x(n) = i \mid x(0) = i)$$

where $P^n$ is the n-step transition probability kernel P(x(n) = j | x(0) = i), which can be obtained through direct application of the one-step kernel n times. Thus,

$$\sum_{n=0}^{\infty} (P^n)_{ii} = \sum_{n=0}^{\infty} E[I\{x(n) = i\} \mid x(0) = i] = E\Big[\sum_{n=0}^{\infty} I\{x(n) = i\} \,\Big|\, x(0) = i\Big] = E[V_i \mid x(0) = i] = \infty$$
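The identity $\sum_n (P^n)_{ii} = E[V_i \mid x(0) = i]$ can be explored numerically through partial sums. The sketch below assumes NumPy and reuses the four-state absorbing chain from the hitting-time example earlier in this chapter; the helper name `partial_visit_sum` is illustrative. For the absorbing state the partial sums grow without bound, while for a transient state they settle at a finite value.

```python
import numpy as np

P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])

def partial_visit_sum(P, i, n_terms):
    """Partial sum of (P^n)_ii for n = 0 .. n_terms - 1."""
    total, Pn = 0.0, np.eye(P.shape[0])
    for _ in range(n_terms):
        total += Pn[i, i]
        Pn = Pn @ P
    return total

transient_sum = partial_visit_sum(P, 1, 200)  # state 2: return probability is 1/4
absorbing_sum = partial_visit_sum(P, 0, 200)  # state 1: every term equals 1
print(transient_sum, absorbing_sum)
```

For state 2 the chain returns only by stepping to state 3 and back, so the return probability is 1/4 and the sum converges to 1/(1 − 1/4) = 4/3.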

If i were a transient state, then $f_i < 1$. Note that we can view the return process as a geometric random variable because of the Markov nature of the process x(n). The first return occurs with probability $f_i$, the second return with probability $f_i^2$, etc. Thus, the expected number of visits to i, counting the initial one at time 0, is $\frac{1}{1 - f_i}$, which is finite. By the above argument, for transient states i, $\sum_{n=0}^{\infty} (P^n)_{ii} < \infty$.

We can now use the same definitions we had previously for communicating classes. State j is accessible from state i if $(P^n)_{ij} > 0$ for some n ≥ 0, and two states belong to the same communicating class if each is accessible from the other. We use these definitions to establish the following:

Theorem 7.7 Let C be a communicating class in the homogeneous Markov chain x(n). Then, either all states in C are recurrent or all states in C are transient.

To see this, take any pair of states i, j ∈ C and suppose that i is transient. Since i, j communicate, there exist n, m ≥ 0 with $(P^n)_{ij} > 0$, $(P^m)_{ji} > 0$. Then, for any r ≥ 0,

$$(P^{n+m+r})_{ii} \ge (P^n)_{ij} (P^r)_{jj} (P^m)_{ji}$$

So,

$$(P^r)_{jj} \le \frac{1}{(P^n)_{ij} (P^m)_{ji}} (P^{n+m+r})_{ii}$$

Summing over all r ≥ 0 yields

$$\sum_{r=0}^{\infty} (P^r)_{jj} \le \frac{1}{(P^n)_{ij} (P^m)_{ji}} \sum_{r=0}^{\infty} (P^{n+m+r})_{ii}$$

The last sum is finite since i is transient, so the left hand side is also finite, indicating that j is also transient.

As was the case for finite state Markov chains, every recurrent class will be closed, so that once a Markov chain enters a state in a recurrent class, the future states in the chain must belong to the same recurrent class. Otherwise, there would be a state i in the recurrent class from which a state j outside the class is accessible (so $(P^n)_{ij} > 0$ for some n ≥ 1), while i is not accessible from j. One can show that this contradicts $P(V_i = \infty \mid x(0) = i) = 1$, since with positive probability the chain would reach j and never return to i.

However, the converse is not true. If we have a closed class, it may not be recurrent! We do have the following result: if a closed communicating class has a finite number of states, it must be recurrent. However, there will be examples of closed communicating classes that won’t be recurrent.

Recurrence is the key property for extending our previous results to infinite Markov chains. The implications of recurrence are summarized below:

Theorem 7.8 Suppose P has a single communicating class C, which is recurrent. Then, for every state j ∈ C, P(Tj < ∞) = 1.

We now focus on the steady state behavior. Does a steady state distribution exist? Can there be more than one? How can one calculate it? We define a couple of useful variables to help understand this behavior. Remember that Tk is the first return time for state k. Let

$$V_i^k = \sum_{n=0}^{T_k - 1} I\{x(n) = i\} = \text{number of visits to state } i \text{ before revisiting state } k$$
$$\gamma_i^k = E[V_i^k \mid x(0) = k] = \text{expected number of visits to } i \text{ before revisiting } k$$
$$V_i(n) = \sum_{k=0}^{n} I\{x(k) = i\} = \text{number of visits to state } i \text{ up to time } n$$


If there were an invariant distribution $\pi_i$, i ∈ S, then one would like to show

$$E[T_i] = \frac{1}{\pi_i}, \qquad \gamma_i^k = \frac{\pi_i}{\pi_k}$$

and

$$\lim_{n \to \infty} \frac{V_i(n)}{n} \overset{a.e.}{=} \pi_i$$

so that the fraction of time spent in state i converges almost surely to $\pi_i$.

The main result for existence and uniqueness of steady state distributions for general Markov chains requires two items: First, one must have recurrent states. Second, one must have the property that, for a recurrent state, the expected return time is finite. We call a state i positive recurrent if it is recurrent and $m_i = E[T_i] < \infty$. When a recurrent state has infinite expected return time, we call it null recurrent.

Theorem 7.9 Let P be the state transition kernel of an irreducible Markov chain. Then, the Markov chain has a positive recurrent state i if and only if it has an invariant distribution π. Furthermore, if it has an invariant distribution, then all states are positive recurrent, and $E[T_i] = \frac{1}{\pi_i}$ for all states i.

Note that this does not guarantee that all initial distributions approach the invariant distribution π. The problem is that we can still have periodic chains! Here is the final extension that we need:

Theorem 7.10 Let P be the transition probability kernel of an irreducible, aperiodic, positive recurrent Markov chain (also called ergodic), with invariant distribution π. Then, for any initial distribution, $P(x(n) = j) \to \pi_j$ as n → ∞ for all j. In particular,

$$\lim_{n \to \infty} (P^n)_{ij} = \pi_j$$

Ergodic Markov chains are completely ergodic. That is to say, for any reasonable function f mapping states S into real numbers,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} f(x(k)) \overset{a.e.}{=} E[f(x)]$$

where the expectation on the right hand side is taken with respect to the limit probability π. Note that this is a much stronger version of the law of large numbers for discrete-valued random variables. The strong law of large numbers requires independent, identically distributed random variables. However, the Markov chain states are not independent over time. Nevertheless, the mixing conditions are sufficient to guarantee ergodicity. The mixing of an ergodic Markov chain is similar to some of the extensions to the law of large numbers discussed earlier in Chapter 2.

Computing the stationary distribution of ergodic Markov chains when the state space is infinite can be done using the balance equations $\pi = P^T \pi$, where the vector notation is extended to infinite dimensions. This will now require solution of an infinite number of linear equations. The characterization of Theorem 7.5 is helpful in getting these equations into simple form, as illustrated below.

Example 7.6 Consider a Markov chain defined on the non-negative integers as follows: $P_{00} = 1/2$, $P_{01} = 1/2$. For k > 0, $P_{k(k-1)} = 3/5$, $P_{k(k+1)} = 2/5$. All other $P_{ij} = 0$, |i − j| ≥ 2. This chain is aperiodic (state 0 has a self-transition, so it has period 1) and has a single communicating class. It is also easy to see that the mean revisit time for state 0 is finite, so the states are positive recurrent, and the chain will be ergodic. To get equations characterizing the limit distribution, let's use cuts between every pair of consecutive states. This yields the equations:

$$\frac{1}{2}\pi_0 = \frac{3}{5}\pi_1$$
$$\frac{2}{5}\pi_1 = \frac{3}{5}\pi_2$$
$$\vdots$$
$$\frac{2}{5}\pi_n = \frac{3}{5}\pi_{n+1}$$
$$\vdots$$


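We also have the probability mass equation

$$\sum_{i \in S} \pi_i = 1$$

Solving the equations starting from $\pi_1$, we get

$$\pi_n = \left(\frac{2}{3}\right)^{n-1} \pi_1 \ (n \ge 1); \qquad \pi_0 = \frac{6}{5}\pi_1$$

Substituting into the probability mass equation yields:

$$\left(\frac{6}{5} + \sum_{k=0}^{\infty} \left(\frac{2}{3}\right)^k\right)\pi_1 = 1$$

This implies $\pi_1 = \frac{5}{21}$, which means $\pi_0 = \frac{6}{21}$, and all the other $\pi_i$ can be computed.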
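The closed-form distribution of Example 7.6 can be checked against the cut equations directly. The sketch below assumes NumPy and reads the transition probabilities as 2/5 for a step up and 3/5 for a step down from any state k > 0; it keeps only the first N = 200 states, which is a truncation chosen for this illustration since the neglected tail mass is far below rounding error.

```python
import numpy as np

N = 200  # number of states kept in the check (truncation is an assumption of this sketch)
pi = np.empty(N)
pi[0] = 6.0 / 21.0
pi[1:] = (5.0 / 21.0) * (2.0 / 3.0) ** np.arange(0, N - 1)

# Cut between states n and n+1: upward flow must equal downward flow.
up = np.full(N - 1, 2.0 / 5.0)
up[0] = 1.0 / 2.0                  # P_01 = 1/2
down = np.full(N - 1, 3.0 / 5.0)   # step down from state n+1 to state n
flow_gap = up * pi[:-1] - down * pi[1:]

max_gap = np.max(np.abs(flow_gap))
total_mass = pi.sum()
print(max_gap, total_mass)  # gap should be at rounding level, mass close to 1
```

Each cut involves only two neighbouring states, which is what turns an infinite system of balance equations into a simple geometric recursion.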

7.2 Continuous-Time, Finite Valued Markov Processes

7.2.1 Process Description

To extend the discussion of the previous section to the continuous-time case, let’s consider the case where the state space S is finite and has n elements. We assume we have a discrete time Markov chain, indexed by time steps of size ∆, so we have x(∆),x(2∆),x(3∆), . . . ,. We want to consider the limit as the time step size ∆ → 0. In order for this limit to be well-behaved, we also need that the probability of changing states over an interval of length ∆ should be proportional to ∆, so it also goes to zero as ∆ → 0; otherwise, we have a limiting process that jumps infinitely often in an infinitesimal interval, and the limit will not be defined in any meaningful sense. Suppose that, for very small ∆, we have

$$p_{ij}(n\Delta) = \begin{cases} q_{ij}(n\Delta)\Delta + o(\Delta) & \text{if } i \ne j \\ 1 - \sum_{k \ne i} q_{ik}(n\Delta)\Delta + o(\Delta) & \text{if } i = j \end{cases}$$

where o(∆) denotes a term for which $\lim_{\Delta \to 0} \frac{o(\Delta)}{\Delta} = 0$. The above equation should be interpreted as indicating that the probability that there is a transition in the value of the state from i to j is linearly proportional to ∆ for small ∆; with probability close to 1, the process will not undergo any transitions. The quantity $q_{ij}(n\Delta)$ is called the transition rate from i to j at time n∆.

We would like to take the limit of the above equation as ∆ → 0, while holding the product n∆ = t. Define as before the multistage transition matrix P(s,t) = [pij(s,t)], where

pij(s,t) = P(x(t) = j|x(s) = i)

From eq. (7.3), we have

$$p(t) \equiv \begin{bmatrix} P(x(t) = a_1) \\ P(x(t) = a_2) \\ \vdots \\ P(x(t) = a_n) \end{bmatrix} = P(0, t)^T p(0)$$

What is missing to completely specify the continuous-time limit is to determine P(s,t) for any s ≤ t. In order to do this, we use the limit process. For a Markov process, we know that the Chapman-Kolmogorov equation holds:

P(s,t) = P(s,t− ∆)P(t− ∆, t)

Substituting the definition of the one-step transition probability from eq. (7.1), we obtain

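$$P(s,t) = P(s,t-\Delta) \begin{bmatrix} 1 - \sum_{j \neq 1} q_{1j}(t-\Delta)\Delta + o(\Delta) & q_{12}(t-\Delta)\Delta + o(\Delta) & \cdots & q_{1n}(t-\Delta)\Delta + o(\Delta) \\ q_{21}(t-\Delta)\Delta + o(\Delta) & 1 - \sum_{j \neq 2} q_{2j}(t-\Delta)\Delta + o(\Delta) & \cdots & q_{2n}(t-\Delta)\Delta + o(\Delta) \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1}(t-\Delta)\Delta + o(\Delta) & q_{n2}(t-\Delta)\Delta + o(\Delta) & \cdots & 1 - \sum_{j \neq n} q_{nj}(t-\Delta)\Delta + o(\Delta) \end{bmatrix} \tag{7.9}$$

$$= P(s,t-\Delta)\left(I + Q(t-\Delta)\Delta + o(\Delta)\right) \tag{7.10}$$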


where

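$$Q(t-\Delta) = \begin{bmatrix} -\sum_{j \neq 1} q_{1j}(t-\Delta) & q_{12}(t-\Delta) & \cdots & q_{1n}(t-\Delta) \\ q_{21}(t-\Delta) & -\sum_{j \neq 2} q_{2j}(t-\Delta) & \cdots & q_{2n}(t-\Delta) \\ \vdots & \vdots & \ddots & \vdots \\ q_{n1}(t-\Delta) & q_{n2}(t-\Delta) & \cdots & -\sum_{j \neq n} q_{nj}(t-\Delta) \end{bmatrix} \tag{7.11}$$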

Thus, rearranging terms and dividing by ∆, we get

$$\frac{P(s,t) - P(s,t-\Delta)}{\Delta} = P(s,t-\Delta)Q(t-\Delta) + P(s,t-\Delta)\frac{o(\Delta)}{\Delta} \tag{7.12}$$

Taking limits as ∆ → 0 gives

$$\frac{\partial}{\partial t} P(s,t) = P(s,t)Q(t) \quad \text{for } t \geq s \geq 0 \tag{7.13}$$

with the initial condition P(s,s) = I. Note also that p(t) = P(s,t)Tp(s); thus, taking the transpose of the above equation and multiplying by p(s) on the right gives

$$\frac{\partial}{\partial t} P(s,t)^T p(s) = \frac{d}{dt} p(t) = Q(t)^T P(s,t)^T p(s) = Q(t)^T p(t) \tag{7.14}$$

subject to the initial condition p(0). The above equation governs how the probability mass function of x(t) evolves over time.

The matrix Q(t) is known as the infinitesimal generator of the Markov chain. The matrix Q(t) has some special properties: the diagonal elements are non-positive, and all the off-diagonal elements are non-negative. Also, note that the rows sum up to 0! This implies that Q(t) has at least one zero eigenvalue. Indeed, Gershgorin’s theorem implies that all of the eigenvalues of Q(t) have non-positive real part, and that no nonzero eigenvalue can be purely imaginary. Hence, there can be no periodic finite-valued Markov processes in continuous time.

As in the discrete-time case, one can consider the special case of homogeneous Markov processes where the transition rates qij(t) ≡ qij are independent of time. In this case, equation (7.14) becomes

$$\frac{d}{dt} p(t) = Q^T p(t) \tag{7.15}$$

This can be solved explicitly as

$$p(t) = e^{Q^T t} p(0)$$

where the matrix exponential of a square matrix A is defined as

$$e^{A} = \sum_{n=0}^{\infty} \frac{A^n}{n!}$$

so that

$$e^{Q^T t} = \sum_{n=0}^{\infty} \frac{t^n (Q^T)^n}{n!}$$

Example 7.7 The simplest case of a homogeneous, discrete-valued, continuous time Markov process is the random telegraph process x(t) with values in S = {−1, 1}. Let λ be the rate of the exponentially distributed switching times. For this process, the transition rate matrix Q is given by:

$$Q = \begin{bmatrix} -\lambda & \lambda \\ \lambda & -\lambda \end{bmatrix}$$

We also know from our previous discussion that $x(t) = (-1)^{N(t)} x(0)$, where N(t) is the number of jumps in a Poisson process with rate λ. Assuming the process starts with x(0) = 1, then P(x(t) = 1) = P(N(t) is even). Also, we know N(t) has Poisson distribution with parameter λt, so $P(N(t) = k) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}$. Thus,

$$P(x(t) = 1) = \sum_{k=0}^{\infty} \frac{(\lambda t)^{2k}}{(2k)!} e^{-\lambda t},$$

where we have summed the probabilities that N(t) is even (equal to 2k for some nonnegative integer k).


Figure 7.10: Graph of Poisson Process

Let’s see what we could learn from using the information from the infinitesimal generator of the Markov chain. Letting p1(t) = P(x(t) = 1),p2(t) = P(x(t) = −1), we get the equations:

$$\frac{d}{dt} p_1(t) = -\lambda p_1(t) + \lambda p_2(t); \qquad \frac{d}{dt} p_2(t) = -\lambda p_2(t) + \lambda p_1(t)$$

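We can easily compute the transition matrix $e^{Q^T t}$ (using, for example, MATLAB’s expm() function), to get

$$e^{Q^T t} = \begin{bmatrix} \frac{1+e^{-2\lambda t}}{2} & \frac{1-e^{-2\lambda t}}{2} \\ \frac{1-e^{-2\lambda t}}{2} & \frac{1+e^{-2\lambda t}}{2} \end{bmatrix}$$

which, with $p(0) = \begin{bmatrix} 1 & 0 \end{bmatrix}^T$, gives the much simpler solution $P(x(t) = 1) = p_1(t) = \begin{bmatrix} 1 & 0 \end{bmatrix} e^{Q^T t} p(0) = \frac{1+e^{-2\lambda t}}{2}$.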

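We can verify that these solutions are the same, since

$$\sum_{k=0}^{\infty} \frac{(\lambda t)^{2k}}{(2k)!} = \frac{e^{\lambda t} + e^{-\lambda t}}{2}$$

because the odd-order terms in the series expansions of the two exponentials cancel out. Thus,

$$P(x(t) = 1) = \sum_{k=0}^{\infty} \frac{(\lambda t)^{2k}}{(2k)!} e^{-\lambda t} = \frac{e^{\lambda t} + e^{-\lambda t}}{2} e^{-\lambda t} = \frac{1 + e^{-2\lambda t}}{2}$$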
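The agreement between the two expressions for the random telegraph process can also be checked numerically. The sketch below approximates the matrix exponential by truncating its power series and compares the result with the closed form and with the sum over even Poisson counts. The helper names, the 60-term truncation, and the values λ = 1.5, t = 0.8 are illustrative assumptions, not part of the notes.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(a, terms=60):
    """Approximate e^A by the truncated series sum_{k < terms} A^k / k!."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term (identity)
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, a)
        term = [[x / k for x in row] for row in term]  # term now holds A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

lam, t = 1.5, 0.8  # assumed illustrative values
# Q^T t for the random telegraph process (Q is symmetric, so Q^T = Q)
qt_t = [[-lam * t, lam * t], [lam * t, -lam * t]]
p0 = [1.0, 0.0]  # start in the state x(0) = 1
exp_qt = expm_series(qt_t)
p1_matrix = exp_qt[0][0] * p0[0] + exp_qt[0][1] * p0[1]

p1_closed = (1 + math.exp(-2 * lam * t)) / 2
p1_poisson = math.exp(-lam * t) * sum(
    (lam * t) ** (2 * k) / math.factorial(2 * k) for k in range(40))

print(p1_matrix, p1_closed, p1_poisson)  # the three values should agree closely
```

A truncated series is adequate here only because the entries of Q^T t are small; for larger or stiff generators a dedicated routine such as the expm() function mentioned above is the safer choice.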

As in the discrete time case, it is useful to depict the dynamics of a homogeneous discrete space Markov process in terms of a graph. Figure 7.10 shows the graph of a Poisson process. In this case, the arcs indicate the non-zero off-diagonal elements of the matrix Q. There is no need to draw a self-transition between nodes, as this rate is implicitly defined by the other transitions.

There is one key property of continuous time homogeneous Markov chains which is worth noting, which will help us exploit the results we already have for discrete-time Markov chains. In essence, we can view continuous time Markov chains as discrete time Markov chains with a random clock!

Define the occupancy time Ti of state i as

$$T_i = \min\{t > 0 \mid x(t) \neq i, \text{ where } x(0) = i\} \tag{7.16}$$

That is, Ti is the time until the process which starts at state i leaves state i, the first exit time from i. Then, Ti is an exponentially distributed random variable with rate $-q_{ii} = \sum_{j \neq i} q_{ij}$. An equivalent statement is

$$P(T_i \leq t) = 1 - e^{q_{ii} t}$$

A simple way to see this is to consider a modified Markov chain where we eliminate all the transitions in the graph except those out of state i. That is, the Markov process will start in state i, and will leave state i once, and stay wherever it lands for the rest of time. The time to exit state i is the same here as in the original Markov process. For this modified chain, the i-th element of equation (7.15) becomes

$$\frac{d}{dt} p_i(t) = q_{ii}\, p_i(t) \tag{7.17}$$

where qii < 0, with the initial condition pi(0) = 1. In this case, we get

$$p_i(t) = e^{q_{ii} t} \tag{7.18}$$

Note that pi(t) ≡ P(Ti > t), since if the state is still i at time t, then the exit time Ti must be larger than t. This yields the previous equation for P(Ti ≤ t).


Another interesting fact is that one can compute the probability distribution of where the process goes when it leaves state i. Consider the same modified graph where the only arcs are those out of state i. For any j ≠ i, the probability mass equation is

$$\frac{d}{dt} p_j(t) = q_{ij}\, p_i(t)$$

with the initial condition pj(0) = 0. Substituting the solution of (7.18) yields

$$p_j(t) = \frac{q_{ij}}{-q_{ii}} \left(1 - e^{q_{ii} t}\right), \qquad \lim_{t \to \infty} p_j(t) = \frac{q_{ij}}{-q_{ii}}$$

Since the modified Markov chain stays in the same state once it leaves state i, we see that the probability of landing in state j when leaving state i is $\frac{q_{ij}}{-q_{ii}}$.

This leads to the interesting alternative view of continuous Markov chains, as discrete transitions between states occurring at random times. We will exploit this representation to generalize our previous results.
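The "random clock" view translates directly into a simulation recipe: hold in state i for an exponential time with rate −qii, then jump to state j with probability qij/(−qii). The following sketch illustrates this under stated assumptions: the function names, the zero-based state indices, the seed, and the small three-state generator are chosen only for illustration.

```python
import random

def jump_chain(q):
    """Return (exit_rates, jump_probs) for a generator matrix q given as a list of rows."""
    n = len(q)
    rates = [-q[i][i] for i in range(n)]
    probs = []
    for i in range(n):
        if rates[i] == 0:  # absorbing state: the process never leaves
            probs.append([0.0] * n)
        else:
            probs.append([0.0 if j == i else q[i][j] / rates[i] for j in range(n)])
    return rates, probs

def simulate(q, start, t_end, rng):
    """Simulate one sample path up to time t_end as a list of (jump_time, new_state)."""
    rates, probs = jump_chain(q)
    t, state, path = 0.0, start, [(0.0, start)]
    while rates[state] > 0:
        t += rng.expovariate(rates[state])  # exponential holding time, rate -q_ii
        if t > t_end:
            break
        state = rng.choices(range(len(q)), weights=probs[state])[0]
        path.append((t, state))
    return path

# Illustrative three-state generator (rows sum to zero); states are indexed 0, 1, 2
Q = [[-6.0, 5.0, 1.0],
     [3.0, -3.0, 0.0],
     [0.0, 2.0, -2.0]]
rates, probs = jump_chain(Q)
path = simulate(Q, start=0, t_end=5.0, rng=random.Random(0))
print(rates)     # exit rates -q_ii
print(probs[0])  # jump probabilities out of the first state
print(path[:4])  # first few (time, state) pairs of one sample path
```

This is the same bookkeeping a discrete event simulation performs: only the jump times and the states entered at those times are stored.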

7.2.2 Hitting probabilities and mean hitting times

We can compute hitting probabilities as well as mean hitting times for continuous time homogeneous Markov chains, in a manner similar to what we did for discrete time Markov chains. One way to see this is to view the continuous time Markov chain as a discrete time chain with a random time between jumps, with an exponential clock. That is, focus on the Markov chain behavior at the jump times, where the time between jumps is random. This is what is done in discrete event simulations, where one executes the events, and keeps track of the time between events.

When the Markov chain is in state i, the time to leave state i will be an exponential random variable with rate −qii. When the Markov chain leaves state i, the process will transition from state i to state j ≠ i with probability $\frac{q_{ij}}{-q_{ii}}$. With this analogy, the probability that the state x(t) at some future time t will hit a subset of states A ⊂ S is the same as the probability that this “discretized” Markov chain will hit that subset.

Define as before

$$H_i^A = \inf\{t \geq 0 \mid x(t) \in A,\ x(0) = i\}$$

$$h_i^A = E[I(H_i^A < \infty)] = P(H_i^A < \infty)$$

Then, using the above arguments, the vector of hitting probabilities hA satisfies:

$$h_i^A = 1, \quad i \in A$$

$$h_i^A = \sum_{j \neq i,\, j \in S} \frac{q_{ij}}{-q_{ii}}\, h_j^A, \quad i \notin A$$

Multiplying by the denominator −qii allows us to rewrite the equation as

$$\sum_{j \in S} q_{ij}\, h_j^A = 0, \quad i \notin A$$

where the sum now includes the term $q_{ii} h_i^A$. In vector form, we can write:

$$Q^A h^A = 0$$

where $Q^A$ is the infinitesimal generator with the rows corresponding to i ∈ A deleted.

As was the case for discrete time Markov chains, $h^A$ will be the smallest non-negative solution to these equations. Note that, for irreducible recurrent Markov chains, the solution will always be $h_i^A = 1$, as every state can be reached in finite time from every other state. As a matter of fact, since the rows of Q sum up to zero, $h_i^A = 1$ for all i is always a solution to the above equations, but it may not be the smallest.

Example 7.8 Consider a Markov chain with state space the non-negative integers S = {0, 1, 2, . . .}. Define the infinitesimal generator Q of this Markov chain as follows: $Q_{i(i+1)} = 2$, $Q_{(i+1)i} = 1$ for i ≥ 0, and $Q_{ij} = 0$ for |i − j| ≥ 2. The diagonal elements of Q are defined by the rates in the off-diagonal elements. The generator is illustrated below:

$$Q = \begin{bmatrix} -2 & 2 & 0 & 0 & \cdots \\ 1 & -3 & 2 & 0 & \cdots \\ 0 & 1 & -3 & 2 & \cdots \\ \vdots & \vdots & \vdots & \ddots & \vdots \end{bmatrix}$$

This corresponds to a queue with arrival rate 2 and departure rate 1. Select A = {0}, which represents the event that the queue is empty. Hence, we are computing the probability that, if the initial queue has i elements, it will become empty at some future time. Then, the above equations become

$$h_0^A = 1$$

$$h_{i-1}^A - 3h_i^A + 2h_{i+1}^A = 0, \quad i \geq 1$$

This is again a difference equation with characteristic equation $2z^2 - 3z + 1 = 0$, whose roots are z = 1 and z = 1/2, so the general solutions are of the form $h_n^A = C\left(\frac{1}{2}\right)^n + D$. The initial condition is $h_0^A = 1$, so C + D = 1. The smallest nonnegative solution has C = 1, D = 0: the first term decays, so C should be as large as possible, while D ≥ 0 is needed or else the solution becomes negative for large n. Thus, the minimal non-negative solution is $h_n^A = \left(\frac{1}{2}\right)^n$.

Note that C = 0, D = 1 is also a possible solution, but it is not the minimal one.
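One way to approximate the minimal solution numerically is to truncate the chain: compute the probability of reaching state 0 before some large level N, which increases to the probability of ever reaching 0 as N grows. The sketch below solves the resulting tridiagonal system with the Thomas algorithm; the truncation level 60 and the function name are assumptions made for illustration, not part of the notes.

```python
def hitting_probabilities(up, down, n_max):
    """Probability of reaching state 0 before state n_max, for each start 0..n_max,
    in a birth-death chain with constant up-rate `up` and down-rate `down`."""
    # Unknowns h_1..h_{n_max-1} satisfy down*h_{i-1} - (up+down)*h_i + up*h_{i+1} = 0
    # with boundary values h_0 = 1 and h_{n_max} = 0 (the truncation assumption).
    m = n_max - 1
    a = [down] * m            # sub-diagonal
    b = [-(up + down)] * m    # diagonal
    c = [up] * m              # super-diagonal
    d = [0.0] * m
    d[0] = -down * 1.0        # known h_0 = 1 moved to the right-hand side
    # Thomas algorithm: forward elimination
    for i in range(1, m):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    # Back substitution
    h = [0.0] * m
    h[-1] = d[-1] / b[-1]
    for i in range(m - 2, -1, -1):
        h[i] = (d[i] - c[i] * h[i + 1]) / b[i]
    return [1.0] + h + [0.0]

h = hitting_probabilities(up=2.0, down=1.0, n_max=60)
print(h[1], h[2], h[3])  # compare with (1/2)^n for n = 1, 2, 3
```

With arrival rate 2 and departure rate 1, the computed values for small n match (1/2)^n to within rounding, since the correction due to truncation is of order (1/2)^60.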

We can do a similar analysis for the expected hitting times. The biggest difference is that, in discrete time, the expected duration of a single step is 1. In continuous time, the “random clock” expected duration to leave a state i is $\frac{-1}{q_{ii}}$. Let $k_i^A$ denote the expected hitting time for set A, starting from state i. Using the results for discrete time chains, we obtain the following equations:

$$k_i^A = 0, \quad i \in A$$

$$k_i^A = \frac{-1}{q_{ii}} + \sum_{j \neq i,\, j \in S} \frac{q_{ij}}{-q_{ii}}\, k_j^A, \quad i \notin A$$

Multiplying the last equation through by qii and collecting terms, we get

$$\sum_{j \in S} q_{ij}\, k_j^A = -1, \quad i \notin A$$

As was the case for discrete time Markov chains, $k_i^A$ is the smallest nonnegative solution of these equations.

Example 7.9 Consider a Markov chain with state space the non-negative integers S = {0, 1, 2, . . .}. Define the infinitesimal generator Q of this Markov chain as follows: $Q_{i(i+1)} = 2$, $Q_{(i+1)i} = 3$ for i ≥ 0, and $Q_{ij} = 0$ for |i − j| ≥ 2. The diagonal elements of Q are defined by the rates in the off-diagonal elements. The generator is illustrated below:

$$Q = \begin{bmatrix} -2 & 2 & 0 & 0 & \cdots \\ 3 & -5 & 2 & 0 & \cdots \\ 0 & 3 & -5 & 2 & \cdots \\ \vdots & \vdots & \vdots & \ddots & \vdots \end{bmatrix}$$

This corresponds to a queue with arrival rate 2 and departure rate 3. Select A = {0}, which represents the event that the queue is empty. Hence, we are computing the expected time until the queue becomes empty, if the initial queue has i elements. Then, the above equations become

$$k_0^A = 0$$

$$3k_{i-1}^A - 5k_i^A + 2k_{i+1}^A = -1, \quad i \geq 1$$

This is again a difference equation with characteristic equation $2z^2 - 5z + 3 = 0$. This has solutions z = 1 and z = 3/2. There is also a constant right hand side input, which overlaps the pole at z = 1, leading to the particular solution $k_i^A = i$.

The general solution is of the form $k_n^A = n + C + D\left(\frac{3}{2}\right)^n$. Since $k_0^A = 0$, we must have C + D = 0. Note that D ≥ 0 is required for the solution to be nonnegative for large n. Thus, the smallest solution will have D = 0, which requires C = 0 and leaves the final answer: $k_n^A = n$.
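The result can be checked by simulating the queue with the random clock described earlier: in every state i ≥ 1 the holding time is exponential with rate 2 + 3 = 5, and the next jump is an arrival with probability 2/5. The sketch below estimates the expected time to empty the queue starting from 3 customers; the function name, the seed, and the number of runs are arbitrary assumptions.

```python
import random

def time_to_empty(n0, arrival, service, rng):
    """Simulate the time for a queue holding n0 customers to first become empty."""
    t, n = 0.0, n0
    total = arrival + service            # exit rate -q_ii of every state i >= 1
    while n > 0:
        t += rng.expovariate(total)      # exponential holding time, mean 1 / total
        if rng.random() < arrival / total:
            n += 1                       # arrival: probability q_{i,i+1} / (-q_ii)
        else:
            n -= 1                       # departure: probability q_{i,i-1} / (-q_ii)
    return t

rng = random.Random(2019)                # fixed seed, chosen arbitrarily
runs = 20000
estimate = sum(time_to_empty(3, 2.0, 3.0, rng) for _ in range(runs)) / runs
print(estimate)                          # should be close to the analytical value 3
```

The estimate fluctuates from seed to seed, but with this many runs it typically lands within a few hundredths of the value obtained from the difference equation.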


7.2.3 Steady state behavior of continuous time Markov chains.

Consider the transition graph of a Poisson process shown in Figure 7.10, where the arcs indicate the non-zero off-diagonal elements of the matrix Q. In this graph, we have the same definitions for communicating states and communicating classes as in the discrete time case. A key question is to determine when eq. (7.15) will approach a limit distribution. Following the analogy with the discrete-time case, in case that Q has a single zero eigenvalue, the above equation will converge as t →∞ to

$$\pi = \lim_{t \to \infty} p(t) \tag{7.19}$$

where π will be the unique eigenvector satisfying

$$Q^T \pi = 0$$

with nonnegative elements summing up to 1.

Conditions that guarantee that Q will have a single zero eigenvalue are similar to those in the discrete time case: for a finite number of states, as long as the graph has a single recurrent communicating class, there will be a single stationary distribution, and since the process cannot be periodic, every initial distribution will converge to this limiting distribution.

When the number of states is infinite, we need additional conditions besides recurrence. The definition of positive recurrent state is the same, where the state is recurrent and the expected revisit time is finite.

Specifically, we need that the Markov chain have a single positive recurrent class. The following theorem summarizes the main results:

Theorem 7.11 Let x(t) be a homogeneous continuous time Markov Chain with infinitesimal generator Q, and assume the Markov chain is irreducible. Then, the Markov chain is positive recurrent if and only if there exists a limit distribution π such that

$$Q^T \pi = 0, \qquad \sum_{i \in S} \pi_i = 1.$$

Furthermore,

$$\lim_{t \to \infty} p(t) = \pi$$

for every initial probability distribution on the initial state x(0).

As a final observation in this section, we observe that the concept of probability balance will again apply as a characterization of the stationary probability distribution function π. Specifically, across any cut of the Markov process graph, the instantaneous net flow of probability must be zero once the steady state distribution has been reached.

Specifically, let C = {(i,j)} be a cut of a continuous time Markov Chain graph with a single positive recurrent communicating class, where cut C separates a set of states A from its complement $A^c = S - A$. Then, the stationary probability distribution π satisfies

$$\sum_{i \in A} \sum_{j \in A^c} Q_{ij}\, \pi_i = \sum_{k \in A} \sum_{j \in A^c} Q_{jk}\, \pi_j$$

Figure 7.11 shows a 3 state chain with transition rate matrix

$$Q = \begin{bmatrix} -6 & 5 & 1 \\ 3 & -3 & 0 \\ 0 & 2 & -2 \end{bmatrix}$$

Probability balance in this graph leads to the following conditions:

$$\pi_1 - 2\pi_3 = 0$$

$$\pi_1 + 5\pi_1 - 3\pi_2 = 0$$

Coupling this with the additional equation π1 + π2 + π3 = 1 yields a system that can be solved to yield π1 = 2/7, π2 = 4/7, π3 = 1/7.
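The same stationary distribution can be obtained mechanically by solving $Q^T \pi = 0$ together with the normalization. The sketch below does this with exact rational arithmetic and Gauss-Jordan elimination; the function name is an assumption, and replacing the last balance equation by the normalization relies on the chain having a single recurrent class (the balance equations are then linearly dependent, so one of them is redundant).

```python
from fractions import Fraction

def stationary_distribution(q):
    """Solve Q^T pi = 0 with sum(pi) = 1 by Gauss-Jordan elimination over fractions."""
    n = len(q)
    # Augmented rows of Q^T pi = 0; the last balance equation is replaced
    # by the normalization pi_1 + ... + pi_n = 1.
    a = [[Fraction(q[i][j]) for i in range(n)] + [Fraction(0)] for j in range(n)]
    a[-1] = [Fraction(1)] * n + [Fraction(1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]       # scale pivot row to 1
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]

Q = [[-6, 5, 1],
     [3, -3, 0],
     [0, 2, -2]]
pi = stationary_distribution(Q)
print(pi)  # [Fraction(2, 7), Fraction(4, 7), Fraction(1, 7)]
```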


Figure 7.11: Illustration of probability balance for continuous time Markov chains

A basic example of a homogeneous, discrete-valued, continuous time Markov process is the Poisson process, with values in {0, 1, . . .}. Let λ denote the rate of the Poisson process. For this process, we know

$$p_n(t) \equiv P(N(t) = n) = \frac{(\lambda t)^n}{n!} e^{-\lambda t}$$

The corresponding transition rate matrix Q is given by

$$Q = \begin{bmatrix} -\lambda & \lambda & 0 & 0 & \cdots \\ 0 & -\lambda & \lambda & 0 & \cdots \\ \vdots & \ddots & \ddots & \ddots & \ddots \end{bmatrix}$$

The temporal evolution of the probability mass function of the Poisson process satisfies

$$\frac{d}{dt} p_n(t) = -\lambda p_n(t) + \lambda p_{n-1}(t)$$

with the convention $p_{-1}(t) \equiv 0$.

7.3 Birth-Death Processes

Rather than continuing to focus on general discrete-valued, continuous-time Markov processes, let’s examine first a simpler case which serves as the foundation for queuing theory: the case of continuous-time birth-death processes. A continuous-time birth-death process x(t) is a continuous-time, discrete-state Markov process which has the special transition rate matrix defined as

$$[Q(t)]_{ij} = \begin{cases} \lambda_i(t) & \text{if } i = j - 1 \\ \mu_i(t) & \text{if } i = j + 1 \\ -(\lambda_i(t) + \mu_i(t)) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{7.20}$$

where we define µ1(t) = 0 to indicate that there can be no transitions to a value below 1. In other words, over a small interval ∆, the process x(t) = i has a small probability λi(t)∆ that its value increases by 1, a small probability µi(t)∆ that its value decreases by 1, and a probability 1 − (λi(t) + µi(t))∆ that its value stays the same. The parameters λi(t) are the birth rates, and the parameters µi(t) are the death rates.

A birth-death process is said to be homogeneous if the birth rates and death rates are independent of time. For a homogeneous birth-death process, we have P(s,t) = P(0, t−s) for 0 ≤ s ≤ t.

Example 7.10 Assume that x(t) is a homogeneous birth-death process with birth rates

$$\lambda_i = \begin{cases} \lambda & \text{if } i = 1 \\ 0 & \text{otherwise} \end{cases}$$

and death rates

$$\mu_i = \begin{cases} \lambda & \text{if } i = 2 \\ 0 & \text{otherwise} \end{cases}$$

Then, the process x(t) can only take values in {1, 2}. This is known as a finite-state Markov process. Define a new process y(t) = 2x(t) − 3; then, y(t) looks like the random telegraph process. Indeed, we have constructed the random telegraph process, as we can verify below! The transition probability matrix P(s,t) satisfies the equation

$$\frac{\partial}{\partial t} P(s,t) = P(s,t) \begin{bmatrix} -\lambda & \lambda \\ \lambda & -\lambda \end{bmatrix} \equiv P(s,t)Q$$

subject to the initial condition P(s,s) = I. The solution of this equation is given by

$$P(s,t) = e^{Q(t-s)} = I + Q(t-s) + \frac{Q^2}{2!}(t-s)^2 + \ldots$$

Evaluating the exponential, we can compute the probability that y(t) = 1, conditioned on y(0) = −1, as $P(0,t)_{12} = e^{-\lambda t} \sinh(\lambda t)$, which is the same probability distribution as was computed for the random telegraph process.

Example 7.11 Suppose x(t) is a homogeneous birth process that is 1 with certainty at t = 0, and has constant birth rates λi(t) = λ. The resulting process is the same as N(t) + 1, where N(t) is a standard Poisson process with rate λ. To see this, since the process is homogeneous,

P(s,t) = P(0, t−s) for 0 ≤ s ≤ t

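Furthermore, since the initial condition is $p(0) = \begin{bmatrix} 1 & 0 & 0 & \cdots \end{bmatrix}^T$, we have

$$p(t) = \begin{bmatrix} P(0,t)_{11} \\ P(0,t)_{12} \\ \vdots \end{bmatrix}$$

Furthermore, the Chapman-Kolmogorov equation becomes:

$$\frac{d}{dt} P(0,t) = P(0,t) \begin{bmatrix} -\lambda & \lambda & 0 & \cdots \\ 0 & -\lambda & \lambda & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$

In particular, looking at the first row of the matrix P(0, t), we find

$$\frac{d}{dt} P(0,t)_{11} = -\lambda P(0,t)_{11}$$

$$\frac{d}{dt} P(0,t)_{12} = \lambda P(0,t)_{11} - \lambda P(0,t)_{12}$$

$$\frac{d}{dt} P(0,t)_{1n} = \lambda P(0,t)_{1(n-1)} - \lambda P(0,t)_{1n}$$

Solving recursively, we find

$$P(0,t)_{11} = e^{-\lambda t}$$

$$P(0,t)_{12} = \int_0^t e^{-\lambda(t-s)}\, e^{-\lambda s}\, \lambda\, ds = \lambda t\, e^{-\lambda t}$$

$$P(0,t)_{13} = \int_0^t e^{-\lambda(t-s)}\, \lambda^2 s\, e^{-\lambda s}\, ds = \frac{(\lambda t)^2}{2!} e^{-\lambda t}$$

$$P(0,t)_{1n} = \frac{(\lambda t)^{n-1}}{(n-1)!} e^{-\lambda t}$$

which is clearly the Poisson distribution associated with the Poisson process, since x(t) = n corresponds to N(t) = n − 1.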

One nice property of the birth-death process is that it is straightforward to compute the steady-state distribution π, since the matrix Q is tri-diagonal! Thus, starting with the first equation, one obtains:

$$\begin{aligned} \mu_2\pi_2 - \lambda_1\pi_1 &= 0 \\ \lambda_1\pi_1 + \mu_3\pi_3 - (\mu_2 + \lambda_2)\pi_2 &= 0 \\ \lambda_2\pi_2 + \mu_4\pi_4 - (\mu_3 + \lambda_3)\pi_3 &= 0 \\ &\ \ \vdots \end{aligned} \tag{7.21}$$


Solving recursively, one obtains:

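$$\begin{aligned} \mu_2\pi_2 &= \lambda_1\pi_1 \\ \mu_3\pi_3 &= \lambda_2\pi_2 \\ \mu_4\pi_4 &= \lambda_3\pi_3 \\ &\ \ \vdots \end{aligned} \tag{7.22}$$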

The above equations imply that the “probability flow” is balanced between each pair of states at steady state (this is known as detailed balance). Thus, we can express all of the probabilities in terms of the first value π1, to obtain

$$\pi_n = \left( \prod_{j=2}^{n} \frac{\lambda_{j-1}}{\mu_j} \right) \pi_1 \tag{7.23}$$

Since the probabilities must add up to 1, one gets the following expression for π1:

$$\pi_1 = \frac{1}{1 + \sum_{n=2}^{\infty} \left( \prod_{j=2}^{n} \frac{\lambda_{j-1}}{\mu_j} \right)} \tag{7.24}$$

In some birth-death processes, the birth rate is zero after a certain value, so that there are only a finite number of possible states. These are special cases where the above summation is easy to compute. Another special case is considered in a later section on queuing systems, where the birth rates λi and the death rates µi are independent of i, except for µ1 = 0.
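For a finite number of states, formulas (7.23) and (7.24) reduce to a running product followed by a normalization. A minimal sketch with exact fractions follows; the function name and the four-state rates are hypothetical and serve only to illustrate the recursion.

```python
from fractions import Fraction

def birth_death_stationary(birth, death):
    """Stationary distribution of a finite birth-death chain on states 1..n.

    birth = [lambda_1, ..., lambda_{n-1}] and death = [mu_2, ..., mu_n].
    """
    weights = [Fraction(1)]  # unnormalized pi_1
    for lam, mu in zip(birth, death):
        # detailed balance: mu_{j} * pi_{j} = lambda_{j-1} * pi_{j-1}
        weights.append(weights[-1] * Fraction(lam) / Fraction(mu))
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 4-state chain: lambda_1..lambda_3 = 2, 2, 1 and mu_2..mu_4 = 1, 4, 2
pi = birth_death_stationary([2, 2, 1], [1, 4, 2])
print(pi)  # [Fraction(2, 9), Fraction(4, 9), Fraction(2, 9), Fraction(1, 9)]
```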

7.4 Queuing Systems

A queuing system is an example of a birth-death process, where arrivals to a queue are modeled as Poisson processes, and departures from a queue are also modeled as a separate independent Poisson process. The simplest case of a queuing system is the M/M/1 (the notation stands for Markov arrivals, Markov departures, 1 server): in this system, arrival of customers to a queue is modeled by a Poisson process N(t) with constant arrival rate λ. Thus, it is assumed that, with probability 1, there is at most one arrival at a particular time t. As discussed in the previous section, this corresponds to a birth process, with constant transition rate λ, where the states are now numbered 0, 1, . . . to correspond to the number of customers in a queue. In addition to arrivals, there is a separate independent Poisson process D(t) representing the departure process, which has constant rate µ as long as there are customers in the queue, and rate 0 otherwise. Again, with probability 1, there is a maximum of one departure at each time. Thus, we see that the M/M/1 queue is a birth-death process with parameters

λi = λ,i = 0, 1, . . . ; µ0 = 0, µi = µ,i = 1, 2, . . . (7.25)

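Using formulas (7.23) and (7.24), with the states renumbered to start at 0, we obtain the steady state distribution for the M/M/1 queue:

$$\pi_n = \left( \prod_{j=1}^{n} \frac{\lambda}{\mu} \right) \pi_0, \quad n = 1, 2, \ldots \tag{7.26}$$

$$\pi_0 = \frac{1}{1 + \sum_{n=1}^{\infty} \left( \prod_{j=1}^{n} \frac{\lambda}{\mu} \right)} = 1 - \frac{\lambda}{\mu} \tag{7.27}$$

The factor $\frac{\lambda}{\mu} = \rho$ is called the utilization factor. In terms of this factor, the steady-state distribution becomes

$$\pi_n = \rho^n (1 - \rho) = \left( \frac{\lambda}{\mu} \right)^n \frac{\mu - \lambda}{\mu}, \quad n = 0, 1, \ldots \tag{7.28}$$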

Note that the utilization ρ must be less than 1 for a steady-state distribution to exist. Using this distribution, we can compute several key properties of M/M/1 queues, including the expected number of customers in the system, as well as the average waiting time. The expected number of customers is given by:

$$E[N] = \sum_{n=0}^{\infty} n \pi_n = \sum_{n=0}^{\infty} n \rho^n (1 - \rho) = \frac{\rho}{1 - \rho} \tag{7.29}$$

Using this relationship, we compute the expected waiting time of a newly-arrived customer to begin service; this requires that the customers already in the system be served first. Thus, the waiting time for service is computed as the product of the average number of customers in the system and the average service time per customer. The average service time per customer is 1/µ, using the exponential distribution of the time between departures. Thus, the average waiting time is

$$E[W] = E[N] \frac{1}{\mu} = \frac{\rho}{\mu - \lambda} \tag{7.30}$$

Thus, the larger the utilization, the longer the waiting time. A final relation is the average amount of time to finish service, which is

$$E[T] = \frac{1}{\mu} + E[W] = \frac{1}{\mu - \lambda}$$

Similar formulas can be developed for queues with more than one server. We provide some examples below.

Example 7.12 Consider the following discrete-space problem: Baybanks has one ATM (automatic teller machine). Customers arrive to use the machine randomly, as a Poisson process, at the rate of 10 customers/hour, and perform one transaction each. Assume that the duration of each transaction is an exponential random variable, independent from transaction to transaction, and independent of the arrival process. The average transaction duration is 5 minutes, so the service rate is 12 transactions/hour, as long as there are enough customers.

1. To formulate this model as an M/M/1 queueing model, we can draw the state transition diagram as in Figure 7.12:

2. Assume that the process is in steady state. What is the probability that there is no one using the ATM when you arrive? We can compute this as the steady state probability π0 = 1 − λ/µ, where µ is the departure rate of 12, and λ is the arrival rate of 10, so it is 1/6.

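3. What is the steady state probability that at least 5 people are waiting in line, including the one being served, when you arrive? The answer is given by

$$1 - \pi_0 - \pi_1 - \pi_2 - \pi_3 - \pi_4 = 1 - \frac{1}{6}\left(1 + \frac{5}{6} + \left(\frac{5}{6}\right)^2 + \left(\frac{5}{6}\right)^3 + \left(\frac{5}{6}\right)^4\right) = 1 - \frac{1}{6} \cdot \frac{1 - (5/6)^5}{1 - 5/6} = \left(\frac{5}{6}\right)^5$$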

4. If you get there and there are 3 people ahead of you, how long do you expect to wait until you begin to use the ATM? This is a simple question, since you have to wait until the persons in front of you are served. Thus, the waiting time is 3 times 5 = 15 minutes, since the durations are independent.

5. What is the expected waiting time for a new arrival? Using the queuing formula (7.30), it becomes (5/6)/(12 − 10) = 5/12 hour, or 25 minutes. In essence, the expected number in the system is ρ/(1 − ρ) = 5 customers, each requiring an average of 5 minutes of service.

Figure 7.12: Diagram for example (states 0, 1, 2, . . . with arrival rate 10 and departure rate 12 between adjacent states)
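The M/M/1 quantities used in this example follow directly from the utilization ρ and relations (7.27) to (7.30). The sketch below evaluates them with exact fractions for the ATM rates; the function and dictionary key names are illustrative assumptions.

```python
from fractions import Fraction

def mm1_metrics(arrival_rate, service_rate):
    """Steady-state quantities of an M/M/1 queue from formulas (7.27)-(7.30)."""
    lam, mu = Fraction(arrival_rate), Fraction(service_rate)
    if lam >= mu:
        raise ValueError("no steady state: utilization must be below 1")
    rho = lam / mu
    return {
        "rho": rho,
        "p_empty": 1 - rho,                     # pi_0
        "mean_in_system": rho / (1 - rho),      # E[N]
        "mean_wait": rho / (mu - lam),          # E[W] = E[N] / mu
        "mean_time_in_system": 1 / (mu - lam),  # E[T] = 1/mu + E[W]
    }

def prob_at_least(n, rho):
    """P(at least n customers in the system) = rho^n, the tail of pi_k = rho^k (1 - rho)."""
    return rho ** n

atm = mm1_metrics(10, 12)   # rates in customers per hour
for name, value in atm.items():
    print(name, value)
print(prob_at_least(5, atm["rho"]))
```

Times are returned in hours because the rates are per hour; multiplying by 60 converts them to minutes.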


Example 7.13 Consider the following problem: There is a bank with 2 tellers. Customers arrive at the bank and join a single line (queue); whenever a teller is free, the next customer in line will go to that teller and be served. Assume that the arrival process of customers to the bank is modeled as a Poisson process, with arrival rate 8 customers/hour. Assume that the service time of each teller for each customer is independent, exponentially distributed, and with average service time of 1/8 hour, so that the service rate of each teller is 8 customers per hour. Note that there is zero probability that two customers arrive simultaneously, or that both tellers finish simultaneously, so this is a birth-death process! Note also that, when both tellers are busy, the service rate is 16 customers/hour, whereas when only one teller is busy, the service rate is only 8 per hour.

1. First, we formulate a continuous time, discrete space Markov chain model of the above problem, and draw the state transition diagram, showing the transition rates. Note that the service rate when both tellers are busy (i.e., n > 1 customers in the bank) is 16/hour, whereas the service rate when n = 0 is zero, and when n = 1 it is 8/hour. This will be a standard diagram for a birth-death process, with arrival rates always 8. However, when there are no customers in the bank, the departure rate is 0; when there is only 1 customer in the bank, the departure rate will be that of only one server, which is 8. As long as there are 2 or more customers, both servers will be busy and the departure rate will be 16.

2. Next, let’s solve for the steady-state probability of this birth-death process by writing an expression for πn in terms of π1, and then writing an expression relating π0 and π1. Note that conservation of probability flow at state zero yields the equation π0 = π1; the birth-death conservation equations (detailed balance) for every other node yield 16πn+1 = 8πn. Thus, for n > 0, we have $\pi_{n+1} = \frac{1}{2}\pi_n = \left(\frac{1}{2}\right)^n \pi_1$. Conservation of probability yields:

$$1 = \pi_0 + \pi_1 + \frac{1}{2}\pi_1 + \left(\frac{1}{2}\right)^2 \pi_1 + \ldots = \pi_1 \left( 1 + \sum_{n=0}^{\infty} \left(\frac{1}{2}\right)^n \right) = 3\pi_1$$

Thus, π0 = π1 = 1/3, and $\pi_{n+1} = \left(\frac{1}{2}\right)^n \frac{1}{3}$ for n ≥ 0.

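3. Now, let’s use the formula $\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1-x)^2}$ to obtain the expected number of customers in the bank at steady state (including those being served).

$$E[n] = \sum_{n=1}^{\infty} n \pi_n = \sum_{n=1}^{\infty} \pi_1\, n \left(\frac{1}{2}\right)^{n-1} = \frac{1}{3} \cdot \frac{1}{(1 - 1/2)^2} = \frac{4}{3}$$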

4. Now, consider the same bank, with only a single teller with rate 8 customers/hour, but with half the arrival rate (that is, arrival rate of 4 customers/hour). The expected number of customers in the bank at steady state is given by a standard queue, with utilization ρ = 1/2. Thus, the expected number of customers in the bank in steady state is $\frac{\rho}{1-\rho} = 1$.

5. Given the answers to items 3 and 4 above, is it better to have 2 small banks with one teller each, splitting the arrival traffic in half, or to have one larger bank with two tellers sharing a single queue? Clearly, it is better to have one large bank, since the average number of customers is 4/3, whereas for two banks, the average number of customers is 1 + 1 = 2. This is why you see those very long single lines at airports or at banks.
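The comparison in this example can be reproduced numerically from detailed balance alone. The sketch below builds the unnormalized steady-state weights of a birth-death queue whose total departure rate depends on the number of customers present, truncating the state space at an assumed 200 states (the neglected tail is negligible at these utilizations). The function name and the truncation level are illustrative choices.

```python
def birth_death_mean(arrival, service_rate, n_states=200):
    """Mean number in system for a birth-death queue truncated at n_states states.

    service_rate(n) is the total departure rate when n customers are present.
    """
    weights = [1.0]  # unnormalized pi_0
    for n in range(1, n_states):
        # detailed balance: service_rate(n) * pi_n = arrival * pi_{n-1}
        weights.append(weights[-1] * arrival / service_rate(n))
    total = sum(weights)
    return sum(n * w for n, w in enumerate(weights)) / total

# One bank, two tellers sharing a queue: departure rate 8 per busy teller
two_tellers = birth_death_mean(8.0, lambda n: 8.0 * min(n, 2))
# One of two small banks: a single teller receiving half of the arrivals
one_teller = birth_death_mean(4.0, lambda n: 8.0)
print(two_tellers, 2 * one_teller)  # shared queue versus two separate banks
```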

7.5 Inhomogeneous Poisson Processes

In this section, we extend the concept of Poisson processes to processes where the process rate can depend on time. Recall that the construction of the Poisson process began with the definition of a set of independent, identically distributed, exponentially-distributed random variables which represented the interarrival times of the Poisson process. This resulted in an independent increments process, with the property that the event of a jump in any interval (t,t + ∆] was independent from that of a jump in any other disjoint interval (t1, t1 + ∆], and the probability of this event was defined in terms of the jump rate as:

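$$P[N(t + \Delta) - N(t) = k] = \begin{cases} 1 - \lambda\Delta + o(\Delta) & \text{if } k = 0 \\ \lambda\Delta & \text{if } k = 1 \\ o(\Delta) & \text{if } k > 1 \end{cases} \tag{7.31}$$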

In order to generalize this construction, we want to maintain this independent increments property, but to allow the instantaneous jump rate to be time-dependent.


In essence, our construction of an inhomogeneous Poisson process is based on letting the instantaneous jump rate be a time-dependent function λ(t); thus, we have

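$$P[N(t + \Delta) - N(t) = k] = \begin{cases} 1 - \lambda(t)\Delta + o(\Delta) & \text{if } k = 0 \\ \lambda(t)\Delta & \text{if } k = 1 \\ o(\Delta) & \text{if } k > 1 \end{cases} \tag{7.32}$$

where the notation o(∆) is used to denote a term such that $\lim_{\Delta \to 0} o(\Delta)/\Delta = 0$. Consider, therefore, the probability density associated with the first jump time T1. By a limiting argument as ∆ → 0, we obtain

$$\begin{aligned} P[T_1 \leq T] &= \lim_{\Delta \to 0} \left[ 1 - \prod_{k=0}^{T/\Delta} \left( 1 - \lambda(k\Delta)\Delta + o(\Delta) \right) \right] \\ &= 1 - \lim_{\Delta \to 0} \left[ \prod_{k=0}^{T/\Delta} e^{-\lambda(k\Delta)\Delta + o(\Delta)} \right] \\ &= 1 - \lim_{\Delta \to 0} e^{-\sum_{k=0}^{T/\Delta} \lambda(k\Delta)\Delta} \\ &= 1 - e^{-\int_0^T \lambda(t)\,dt} \end{aligned} \tag{7.33}$$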

Hence, the probability density is given by

$$p(T) = \frac{d}{dT} P[T_1 \leq T] = \lambda(T)\, e^{-\int_0^T \lambda(t)\,dt} \tag{7.34}$$
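The limiting argument behind (7.33) can be observed numerically: for a small step ∆, the product of the per-step "no jump" probabilities is already close to the exponential of the integrated rate. The sketch below uses an assumed rate function λ(t) = 1 + t and horizon T = 1, both chosen only for illustration, and drops the o(∆) terms.

```python
import math

def first_jump_cdf_product(rate, horizon, step):
    """Approximate P[T1 <= horizon] from the product of per-step no-jump probabilities."""
    n_steps = int(round(horizon / step))
    no_jump = 1.0
    for k in range(n_steps):
        no_jump *= 1.0 - rate(k * step) * step  # probability of no jump in step k
    return 1.0 - no_jump

def rate(t):
    """Assumed time-dependent jump rate lambda(t) = 1 + t."""
    return 1.0 + t

horizon = 1.0
approx = first_jump_cdf_product(rate, horizon, step=1e-4)
# The integral of 1 + t over [0, T] is T + T^2 / 2
exact = 1.0 - math.exp(-(horizon + horizon ** 2 / 2))
print(approx, exact)  # the two values should agree to several decimal places
```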

Given T1, we can construct the conditional probability density of T2 in an identical manner, using the independence of the probability that jumps occur in disjoint intervals, as:

$$p(T_2 = t \mid T_1) = \begin{cases} \lambda(t)\, e^{-\int_{T_1}^{t} \lambda(s)\,ds} & \text{for } t \geq T_1 \\ 0 & \text{otherwise} \end{cases} \tag{7.35}$$

Generalizing, we can obtain the following formula for the conditional density of Tn, the n-th jump time:

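$$p(T_n = t \mid T_1, \ldots, T_{n-1}) = p(T_n = t \mid T_{n-1}) = \begin{cases} \lambda(t)\, e^{-\int_{T_{n-1}}^{t} \lambda(s)\,ds} & \text{for } t \geq T_{n-1} \\ 0 & \text{otherwise} \end{cases} \tag{7.36}$$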

Combining the above equations, we obtain an expression for the joint probability density of T1,T2, . . . ,Tn:

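$$\begin{aligned} p(T_1 = t_1, \ldots, T_n = t_n) &= p(T_1 = t_1) \prod_{k=2}^{n} p(T_k = t_k \mid T_1 = t_1, \ldots, T_{k-1} = t_{k-1}) \\ &= p(T_1 = t_1) \prod_{k=2}^{n} p(T_k = t_k \mid T_{k-1} = t_{k-1}) \\ &= \begin{cases} \prod_{k=1}^{n} \lambda(t_k)\, e^{-\int_{t_{k-1}}^{t_k} \lambda(s)\,ds} & \text{if } t_0 = 0 \leq t_1 \leq \ldots \leq t_n \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \left( \prod_{k=1}^{n} \lambda(t_k) \right) e^{-\int_0^{t_n} \lambda(s)\,ds} & \text{if } t_0 = 0 \leq t_1 \leq \ldots \leq t_n \\ 0 & \text{otherwise} \end{cases} \end{aligned} \tag{7.37}$$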

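Furthermore, from this probability, we can compute the conditional probability that $T_{k+1} > T$, given the value of $T_k \leq T$, as

$$P[T_{k+1} > T \mid T_k] = e^{-\int_{T_k}^{T} \lambda(s)\,ds} \tag{7.38}$$

Note that the event $\{T_k \leq T < T_{k+1}\}$ is the same as the event N(T) = k. Thus, we can compute the joint probability density of T1, . . . , Tn and N(T) = n as

$$p(T_1 = t_1, \ldots, T_n = t_n, N(T) = n) = \begin{cases} \left( \prod_{k=1}^{n} \lambda(t_k) \right) e^{-\int_0^{T} \lambda(s)\,ds} & \text{if } t_0 = 0 \leq t_1 \leq \ldots \leq t_n \leq T \\ 0 & \text{otherwise} \end{cases}$$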


As a final note, we need to compute the probability distribution of N(t), and of the increment N(t)−N(s). We can do this from the previous equation, as:

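$$P(N(t) = n) = \int_0^t \int_0^{t_n} \cdots \int_0^{t_2} \left( \prod_{k=1}^{n} \lambda(t_k) \right) e^{-\int_0^t \lambda(s)\,ds}\, dt_1 \cdots dt_n = e^{-\int_0^t \lambda(s)\,ds}\, \frac{\left( \int_0^t \lambda(s)\,ds \right)^n}{n!} \tag{7.39}$$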

To see how this last formula comes about, define the auxiliary function $F(t) = \int_0^t \lambda(s)\,ds$. Then, note the following identity:

$$\int_0^{t_2} \lambda(t_1) \int_0^{t_1} \lambda(t_0)\,dt_0\,dt_1 = \int_0^{t_2} F(t_1)\lambda(t_1)\,dt_1 = \int_0^{t_2} F(t_1)\,dF(t_1) = \frac{F^2(t_2) - F^2(0)}{2} = \frac{F^2(t_2)}{2}. \tag{7.40}$$

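Proceeding with the various integrals in the same fashion yields the result: each additional integration raises the power of F by one and contributes the next factor of the factorial. Note that we can perform a similar computation for the probability that N(t) − N(s) = k, to obtain

$$P(N(t) - N(s) = k) = e^{-\int_s^t \lambda(\sigma)\,d\sigma}\, \frac{\left( \int_s^t \lambda(\sigma)\,d\sigma \right)^k}{k!}$$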
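The increment distribution is straightforward to evaluate once the rate has been integrated over the interval. The sketch below uses a midpoint rule for the integral and an assumed rate λ(u) = 2 + sin(u) on the interval [1, 3]; the function names, the rate, and the number of quadrature points are illustrative choices. As a consistency check, the probabilities sum to about 1 and their mean equals the integrated rate.

```python
import math

def integrated_rate(rate, s, t, n=2000):
    """Midpoint-rule approximation of the integral of rate over [s, t]."""
    h = (t - s) / n
    return sum(rate(s + (i + 0.5) * h) for i in range(n)) * h

def increment_pmf(rate, s, t, k):
    """P(N(t) - N(s) = k) for an inhomogeneous Poisson process with the given rate."""
    m = integrated_rate(rate, s, t)
    return math.exp(-m) * m ** k / math.factorial(k)

def rate(u):
    """Assumed rate function lambda(u) = 2 + sin(u)."""
    return 2.0 + math.sin(u)

probs = [increment_pmf(rate, 1.0, 3.0, k) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probs))
print(sum(probs), mean, integrated_rate(rate, 1.0, 3.0))
```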

One of the important properties of inhomogeneous Poisson processes is that they are birth-death processes. In particular, the process x(t) = N(t) + 1 is a birth-death process with birth rate λ(t) and death rate 0 for all states.

Other properties of inhomogeneous Poisson processes are:

1. The mean of the process, $m_N(t)$, is given by

$$m_N(t) = \int_0^t \lambda(s)\,ds, \quad t \geq 0.$$

2. The covariance of the process, $K_N(t,s)$, is given by

$$K_N(t,s) = \int_0^{\min(t,s)} \lambda(\sigma)\,d\sigma, \quad t, s \geq 0.$$

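3. Consider the conditional probability density of the jump times T1, . . . , Tn, given the information that N(t) = n. This is given by:

$$\begin{aligned} p(T_1, \ldots, T_n \mid N(t) = n) &= \frac{p(T_1, \ldots, T_n, N(t) = n)}{P(N(t) = n)} \\ &= \frac{\left( \prod_{k=1}^{n} \lambda(T_k) \right) e^{-\int_0^t \lambda(s)\,ds}}{e^{-\int_0^t \lambda(s)\,ds} \left( \int_0^t \lambda(s)\,ds \right)^n / n!} \\ &= \frac{n! \left( \prod_{k=1}^{n} \lambda(T_k) \right)}{\left( \int_0^t \lambda(s)\,ds \right)^n} \\ &= n! \prod_{k=1}^{n} \frac{\lambda(T_k)}{\int_0^t \lambda(s)\,ds} \end{aligned} \tag{7.41}$$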

4. Now, consider the above ordered sequence of times T1, . . . , Tn, and apply a random permutation to obtain the unordered times U1, . . . , Un. Assume that each random permutation is equally likely, with probability 1/n!. Then,

$$p(U_1, \ldots, U_n \mid N(t) = n) = \prod_{k=1}^{n} \frac{\lambda(U_k)}{\int_0^t \lambda(s)\,ds}$$

since the sum over all permutations must equal the original probability p(T1, . . . , Tn | N(t) = n). The unusual fact is that, conditioned on knowing that there are exactly n transitions up to time t, the unordered event times U1, . . . , Un are conditionally independent and identically distributed with conditional probability density

$$p(U_k \mid N(t) = n) = \begin{cases} \frac{\lambda(U_k)}{\int_0^t \lambda(s)\,ds} & \text{if } 0 \leq U_k \leq t \text{ and } k \leq n \\ 0 & \text{otherwise} \end{cases}$$

5. Let N(t) be a homogeneous Poisson process, so that λ(t) ≡ λ. Let Tk denote the time of the k-th jump in the Poisson process. Then, the random variables τk = Tk−Tk−1 are exponential, independent, identically distributed with rate λ.
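Properties 1 and 4 above suggest one way to simulate an inhomogeneous Poisson process on [0, t]: draw the count N(t) from a Poisson distribution with mean $\int_0^t \lambda(s)\,ds$, then draw that many independent times with density $\lambda(u)/\int_0^t \lambda(s)\,ds$. The sketch below is only an illustration under assumptions that are not in the notes: the rate λ(s) = 1 + s, the horizon t = 2, and the helper names `poisson_sample` and `sample_event_times` are all hypothetical choices.

```python
import math
import random

def poisson_sample(mean, rng):
    """Draw a Poisson(mean) variate with Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_event_times(horizon, rng):
    """Sample the jump times on [0, horizon] for the assumed rate lambda(s) = 1 + s."""
    total = horizon + 0.5 * horizon ** 2   # F(horizon), the integral of lambda over [0, horizon]
    n = poisson_sample(total, rng)         # N(t) is Poisson with mean F(t), as in (7.39)
    # Property 4: given N(t) = n, the unordered times are i.i.d. with density lambda(u) / F(t).
    # Solving (u + u^2 / 2) / F(t) = p for u inverts the conditional CDF.
    times = [-1.0 + math.sqrt(1.0 + 2.0 * total * rng.random()) for _ in range(n)]
    return sorted(times)                   # sorting recovers the ordered times T_1 <= ... <= T_n

rng = random.Random(1)
horizon, trials = 2.0, 20000
paths = [sample_event_times(horizon, rng) for _ in range(trials)]

# Property 1: the mean count on [0, 2] should be near 4, and on [0, 1] near 1.5.
mean_total = sum(len(p) for p in paths) / trials
mean_first_half = sum(sum(1 for u in p if u <= 1.0) for p in paths) / trials
print(mean_total, "vs", 4.0)
print(mean_first_half, "vs", 1.5)
```

The count on the sub-interval [0, 1] is a useful check because it depends on both the Poisson total and the conditional density of the individual times.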

7.6 Applications of Poisson Processes

In this subsection, we describe various applications which can be analyzed using the properties of Poisson processes discussed previously.

Example 7.14 At a customer facility, customers arrive at a rate of 3 customers per hour, randomly distributed according to a Poisson process with constant rate 3. Assume that the doors open at 9:00 am. What is the expected time until the arrival of the 10-th customer? What is the probability that, if the doors close at 10:00 am for 15 minutes for a coffee break, one or more customers will arrive during the break? Suppose that, instead of taking a break at 10:00, the store waits until a customer arrives after 10:00 am (and is served instantaneously) before taking a break; what will the probability be that no customers arrive during the break?

All of the above questions can be posed in the context of a Poisson process N(t) with homogeneous rate λ = 3. The first question asks for $E[T_{10}]$; since $T_{10}$ is the sum of 10 independent, identically distributed random variables, each of which has mean 1/3, we have $E[T_{10}] = 10/3$ hours. The probability that any customers arrive from 10:00 to 10:15 is given by

$$1 - P(N(t + 1/4) - N(t) = 0) = 1 - e^{-0.75}$$

Since the process is homogeneous and has independent increments, the above probability is also the probability of any arrivals over any 1/4 hour interval, whether or not the interval starts at a previous arrival. Thus, for the third question, the probability of no arrivals during the break is $e^{-0.75}$.
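The numbers in this example can be checked with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative and not part of the notes.

```python
import math

rate = 3.0                               # arrivals per hour
expected_t10 = 10 / rate                 # E[T_10]: sum of 10 exponential gaps of mean 1/rate
break_length = 0.25                      # 15 minutes, in hours

p_none = math.exp(-rate * break_length)  # P(no arrivals in a 1/4 hour interval)
p_some = 1.0 - p_none                    # P(one or more arrivals during the break)

print("E[T10] in hours:", expected_t10)
print("P(at least one arrival during the break):", p_some)
print("P(no arrivals during the break):", p_none)
```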

Example 7.15 Consider an extension of the previous problem, where, at the end of a visit, a customer tips the store a random amount. Let $y_k$ denote the tip left by the k-th customer, where the sequence $y_k$ is independent, identically distributed with common density function p(y), and is independent of the arrival Poisson process. Define the process x(t) to be the amount of tips received up to time t. What are the mean $m_x(t)$ and variance $\sigma_x(t)^2$? To solve this problem, note that

$$x(t) = \begin{cases} 0 & \text{if } N(t) = 0 \\ \sum_{k=1}^{N(t)} y_k & \text{otherwise} \end{cases}$$

To compute the mean of x(t), we use the smoothing property of expectations, as follows:

$$E[x(t)] = E\left[E[x(t) \mid N(t) = n]\right] = E\left[E\left[\sum_{k=1}^{n} y_k \,\Big|\, N(t) = n\right]\right] = E[N(t)]\,m_y = 3t\,m_y \tag{7.42}$$

where we used the independent, identically distributed property of yk, and the independence of yk and N(t). Similarly,

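$$\begin{aligned} E[x(t)^2] &= E\left[E\left[\left(\sum_{k=1}^{n} y_k\right)^2 \,\Big|\, N(t) = n\right]\right] = E\left[E\left[\sum_{k=1}^{n}\sum_{j=1}^{n} y_k y_j \,\Big|\, N(t) = n\right]\right] \\ &= E\left[N(t)\,E[y_k^2] + N(t)(N(t) - 1)\,m_y^2\right] = 3t\,E[y_k^2] + m_y^2\left(E[N(t)^2] - 3t\right) \\ &= 3t\,E[y_k^2] + m_y^2\,(9t^2 + 3t - 3t) = 3t\,E[y_k^2] + (3t\,m_y)^2 \end{aligned} \tag{7.43}$$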

so that the variance is given by

$$\sigma_x(t)^2 = 3t\,E[y^2] = 3t\,m_y^2 + 3t\,\sigma_y^2$$
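A small Monte Carlo experiment can be used to sanity-check the mean (7.42) and the variance formula above. The sketch assumes a tip distribution that the notes do not specify: tips uniform on [0, 2], so that $m_y = 1$ and $E[y^2] = 4/3$. The function name `simulate_tips` and the horizon t = 2 hours are likewise illustrative.

```python
import random

def simulate_tips(rate, t, draw_tip, rng):
    """Total tips x(t) on one sample path: Poisson arrivals with i.i.d. tip amounts."""
    total = 0.0
    clock = rng.expovariate(rate)          # time of the first arrival
    while clock <= t:
        total += draw_tip(rng)             # each arrival adds an independent tip
        clock += rng.expovariate(rate)     # exponential gap to the next arrival
    return total

rng = random.Random(0)
rate, t = 3.0, 2.0
draw_tip = lambda r: r.uniform(0.0, 2.0)   # assumed tip density: m_y = 1, E[y^2] = 4/3

samples = [simulate_tips(rate, t, draw_tip, rng) for _ in range(20000)]
mean_hat = sum(samples) / len(samples)
var_hat = sum((x - mean_hat) ** 2 for x in samples) / (len(samples) - 1)

mean_theory = rate * t * 1.0               # 3 t m_y
var_theory = rate * t * (4.0 / 3.0)        # 3 t E[y^2]
print(mean_hat, "vs", mean_theory)
print(var_hat, "vs", var_theory)
```

Note that the variance depends on the second moment $E[y^2]$, not on $\sigma_y^2$ alone: even deterministic tips produce a random total, because the number of customers is random.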


Note: The above process x(t) is called a compound Poisson counting process, because the sample functions jump at Poisson times, but with a random amplitude. In general, a compound Poisson process has the following properties:

1. It is an independent increments process.

2. The characteristic function of each increment is given by

$$\Phi_{x(t)-x(s)}(jw) = \exp\left(\int_s^t \lambda(\tau)\,d\tau\,\left(\Phi_y(jw) - 1\right)\right)$$

where Φy(jw) is the characteristic function of the jump size.

3. The mean is given by

$$m_x(t) = m_y \int_0^t \lambda(\tau)\,d\tau \quad \text{for } t \ge 0$$

4. The autocovariance function is given by

$$K_x(t,s) = \left(m_y^2 + \sigma_y^2\right) \int_0^{\min(t,s)} \lambda(\tau)\,d\tau \quad \text{for } t,s \ge 0$$

Chapter 8

Mean-Square Calculus for Stochastic Processes

The purpose of this section is to allow us to define derivatives and integrals of stochastic processes. Consider, for example, a simple electrical circuit, composed of a voltage source V , a resistor R and a capacitor C in series. The equation for the capacitor voltage Vc(t) is given by

$$V(t) = RC\,\frac{d}{dt}V_c(t) + V_c(t)$$

and we can solve this in terms of an impulse response $h(t) = \frac{1}{RC}\,e^{-t/RC}\,u(t)$, where u(t) is the unit step function. Assuming the capacitor is initially uncharged, the resulting solution is

$$V_c(t) = \int_0^t h(t-s)\,V(s)\,ds$$

for t ≥ 0.

The question before us is: what happens when the input voltage V(t) is a stochastic process, rather than a deterministic time function? Clearly, we should expect that, in some well-defined sense, the output $V_c(t)$ is also a stochastic process. This forces us to define what we mean by an integral of a stochastic process, particularly when the random sample functions V(t,ω) may not be continuous as a function of time. Thus, it is difficult to define integrals or derivatives of random processes strictly in terms of the individual sample functions.

8.1 Continuity of Stochastic Processes

Since the definitions of integration and differentiation are based on limits, we want to define concepts of continuity of stochastic processes as a function of time, so that we can also define what we mean by a limit. Again, one concept would be that every sample function of the process would have to be continuous. However, this would severely restrict the class of stochastic processes for which the calculus would be defined.

Given the different concepts of convergence for sequences of random variables discussed previously, the most appropriate concepts are those of almost sure convergence and mean-square convergence. Thus, we have the following definition.

Definition 8.1 (Almost Sure Continuity) The stochastic process x(t) has almost surely continuous sample paths at time t if

$$\lim_{\epsilon \to 0} x(t + \epsilon) \overset{\text{a.e.}}{=} x(t)$$

where the limit is interpreted in the almost sure sense. If the sample paths are almost surely continuous at every t, the process is said to have continuous sample paths almost surely.

Definition 8.2 (Mean Square Continuity) The stochastic process x(t) is continuous in the mean-square sense at time t if

$$\lim_{\epsilon \to 0} x(t + \epsilon) \overset{\text{mss}}{=} x(t)$$

where the limit is interpreted in the mean-square sense. That is,

$$\lim_{\epsilon \to 0} E\left[(x(t + \epsilon) - x(t))^2\right] = 0$$

If it is mean-square continuous at every t, it is said to be mean-square continuous everywhere or simply mean-square continuous.

The advantage of mean-square continuity versus almost sure continuity of sample paths is that mean-square continuity can be verified in terms of the second-order properties of the process, using autocorrelation functions. Note that

$$E\left[(x(t + \epsilon) - x(t))^2\right] = E[x(t + \epsilon)^2] + E[x(t)^2] - 2E[x(t)x(t + \epsilon)] = R_x(t + \epsilon, t + \epsilon) - R_x(t, t + \epsilon) + R_x(t,t) - R_x(t, t + \epsilon) \tag{8.1}$$

Thus, if $R_x(t,s)$ is continuous at $s = t = t_0$, then x(t) is mean-square continuous at $t_0$. If the process is stationary, then continuity of $R_x(\tau)$ at τ = 0 is sufficient. Thus, questions of mean-square continuity of the process can be posed as questions of deterministic continuity of the corresponding autocorrelation function!

Let's consider two examples of stochastic processes: the Brownian motion process x(t), and the Poisson process n(t). For the Brownian motion process, the autocorrelation function is given by $R_x(t,s) = \min(t,s)$. This function is continuous at (t,t) for any t, and thus Brownian motion is continuous everywhere in the mean-square sense. As it turns out, the sample paths x(t,ω) can be shown to be continuous everywhere with probability one. Thus, Brownian motion is an example of a process with continuous sample functions.

What about Poisson processes? Their sample paths are discontinuous wherever there is a discrete change in value. The autocorrelation function of a Poisson process with rate λ is given by $R_n(t,s) = \lambda^2 st + \lambda \min(s,t)$. Again, this is continuous at (t,t) for any t, and thus Poisson processes are mean-square continuous everywhere!
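Equation (8.1) can be evaluated directly for these two autocorrelation functions. The sketch below is illustrative: the rate λ = 3, the evaluation time t = 1, and the function name `ms_increment` are assumptions made for the example.

```python
def ms_increment(R, t, eps):
    """E[(x(t+eps) - x(t))^2] computed from the autocorrelation R, as in (8.1)."""
    return R(t + eps, t + eps) + R(t, t) - 2.0 * R(t, t + eps)

lam = 3.0                                                   # assumed Poisson rate
R_brownian = lambda t, s: min(t, s)
R_poisson = lambda t, s: lam ** 2 * s * t + lam * min(s, t)

# Both mean-square increments shrink to zero with eps, even though
# every Poisson sample path has jumps.
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, ms_increment(R_brownian, 1.0, eps), ms_increment(R_poisson, 1.0, eps))
```

Working (8.1) out by hand gives ε for Brownian motion and $\lambda\epsilon + \lambda^2\epsilon^2$ for the Poisson process, which is what the loop prints up to rounding.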

There are some strong implications of mean-square continuity everywhere, which are summarized below:

1. If a stochastic process x is mean-square continuous, then $\lim_{\epsilon \to 0} E[g(x(t + \epsilon))] = E[g(x(t))]$ for every bounded continuous function g.

2. If x is stationary, then $R_x(\tau)$ is continuous at τ = 0 if and only if x is mean-square continuous everywhere.

The first property follows because mean-square convergence of random variables implies convergence in probability (and hence in distribution), so that expectations, being integrals of probabilities, will also converge. In addition, the mean and the second moment of the process will be continuous functions of time.

As a final note on continuity, we have said little on how to show almost sure sample continuity of a stochastic process, because it is difficult to establish sufficiency theorems. However, we want to mention some results which can be used to establish this property. We state these without proof; the interested reader is referred to Loeve’s book on probability theory, or other advanced probability texts.

Theorem 8.1 Let g(h) and q(h) be functions that are even, nondecreasing for h > 0, such that g(h) → 0, q(h) → 0 as h → 0, and such that

$$\sum_{n=1}^{\infty} g(2^{-n}) < \infty, \qquad \sum_{n=1}^{\infty} 2^n q(2^{-n}) < \infty$$

Then, x(t) has almost surely continuous sample paths if

$$P\left[\{\omega : |x(t + h) - x(t)| \ge g(h)\}\right] \le q(h)$$

for all t, h.

Theorem 8.2 Assume that one can find positive constants p < r, k such that, for all t, h,

$$E\left[|x(t + h) - x(t)|^p\right] \le \frac{k|h|}{|\log |h||^{1+r}}$$

Then, x(t) has almost surely continuous sample paths.


A sufficient condition for the above result can be obtained using p = 2, in which case autocorrelation functions can be used! This states:

Theorem 8.3 If, for all t, h, we have

$$R_x(t + h, t + h) + R_x(t,t) - R_x(t + h, t) - R_x(t, t + h) < \frac{k|h|}{|\log |h||^{s}}$$

where k > 0, s > 3, then x(t) has almost surely continuous sample paths.

In particular, the condition of the above result is satisfied if

$$\frac{R_x(t + h, t + h) + R_x(t,t) - R_x(t + h, t) - R_x(t, t + h)}{h^2}$$

is bounded for all sufficiently small h. A sufficient condition for this is stated in terms of the autocorrelation function, as

$$\left.\frac{\partial^2}{\partial u\,\partial v} R_x(u,v)\right|_{u=v=t} < \infty$$

As an example, consider the Brownian motion process, with autocorrelation function $R_x(t,s) = \min(t,s)$. Note that the second derivatives in the last result do not exist. However, we know that a Brownian increment is Gaussian, with variance proportional to the length of the interval. Let 0 < a < 1/2; then,

$$P\left[\{\omega : |x(t + h) - x(t)| \ge |h|^a\}\right] = \frac{2}{\sqrt{2\pi}} \int_{|h|^{a-1/2}}^{\infty} e^{-x^2/2}\,dx \le \frac{2}{\sqrt{2\pi}}\,|h|^{1/2-a}\,e^{-0.5|h|^{2a-1}} \tag{8.2}$$

where the last inequality follows from integration by parts. Now, use the first result quoted above (Theorem 8.1), by letting $g(h) = |h|^a$, and letting

$$q(h) = \frac{2}{\sqrt{2\pi}}\,|h|^{1/2-a}\,e^{-0.5|h|^{2a-1}}$$

Then,

$$\sum_{n=1}^{\infty} g(2^{-n}) = \sum_{n=1}^{\infty} (2^{-a})^n = \frac{2^{-a}}{1 - 2^{-a}} < \infty$$

$$\sum_{n=1}^{\infty} 2^n q(2^{-n}) = \frac{2}{\sqrt{2\pi}} \sum_{n=1}^{\infty} 2^n\, 2^{-n/2 + na}\, e^{-0.5 \cdot 2^{n(1-2a)}}$$

which converges because a < 0.5, so that the exponential factor decays faster than the powers of 2 grow. Thus, this establishes that Brownian motion has almost surely continuous sample paths.

8.2 Mean-Square Differentiation

The usual definition of the derivative of a deterministic function of time is

$$\frac{d}{dt}x(t) = \lim_{\epsilon \to 0} \frac{x(t + \epsilon) - x(t)}{\epsilon} \tag{8.3}$$

For stochastic processes, the issue is in what sense we define the limit. Again, we want to use the concept of mean-square convergence. The limit will also be a stochastic process.

For a stochastic process x(t), if the limit in (8.3) exists in the mean-square sense, then we say that the limit process $\frac{d}{dt}x(t)$ is the mean-square derivative of x(t). More formally,

Definition 8.3 The stochastic process x(t) has mean-square derivative $\frac{d}{dt}x(t)$ at t if

$$\lim_{\epsilon \to 0} \frac{x(t + \epsilon) - x(t)}{\epsilon} \overset{\text{mss}}{=} \frac{d}{dt}x(t)$$

That is,

$$\lim_{\epsilon \to 0} E\left[\left(\frac{x(t + \epsilon) - x(t)}{\epsilon} - \frac{d}{dt}x(t)\right)^2\right] = 0$$


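Given a stochastic process x(t), how can we determine easily if it is mean-square differentiable at a particular time t? First, we use the Cauchy criterion for convergence: the difference quotients converge in the mean-square sense if and only if the mean-square distance between any two of them converges to zero. In our case, this means

$$\begin{aligned} 0 &= \lim_{\epsilon \to 0,\, \delta \to 0} E\left[\left(\frac{x(t + \epsilon) - x(t)}{\epsilon} - \frac{x(t + \delta) - x(t)}{\delta}\right)^2\right] \\ &= \lim_{\epsilon \to 0,\, \delta \to 0} E\left[\left(\frac{x(t + \epsilon) - x(t)}{\epsilon}\right)^2 + \left(\frac{x(t + \delta) - x(t)}{\delta}\right)^2 - 2\,\frac{x(t + \delta) - x(t)}{\delta}\,\frac{x(t + \epsilon) - x(t)}{\epsilon}\right] \\ &= \lim_{\epsilon \to 0,\, \delta \to 0} \left[\frac{R_x(t + \epsilon, t + \epsilon) + R_x(t,t) - 2R_x(t + \epsilon, t)}{\epsilon^2} + \frac{R_x(t + \delta, t + \delta) + R_x(t,t) - 2R_x(t + \delta, t)}{\delta^2} \right. \\ &\qquad\qquad \left. - 2\,\frac{R_x(t + \epsilon, t + \delta) + R_x(t,t) - R_x(t + \epsilon, t) - R_x(t + \delta, t)}{\epsilon\delta}\right] \end{aligned} \tag{8.4}$$

Now, note that, if the autocorrelation function $R_x$ is twice differentiable at (t,t), so that

$$\begin{aligned} \left.\frac{\partial^2}{\partial u\,\partial v} R_x(u,v)\right|_{u=v=t} &= \lim_{\epsilon \to 0} \frac{R_x(t + \epsilon, t + \epsilon) + R_x(t,t) - 2R_x(t + \epsilon, t)}{\epsilon^2} \\ &= \lim_{\delta \to 0} \frac{R_x(t + \delta, t + \delta) + R_x(t,t) - 2R_x(t + \delta, t)}{\delta^2} \\ &= \lim_{\epsilon \to 0,\, \delta \to 0} \frac{R_x(t + \epsilon, t + \delta) + R_x(t,t) - R_x(t + \epsilon, t) - R_x(t + \delta, t)}{\epsilon\delta} \end{aligned} \tag{8.5}$$

exists, then the three terms in (8.4) share this common limit and enter with weights 1 + 1 − 2 = 0, so (8.4) is true! We summarize the above existence conditions in the following theorems.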

Theorem 8.4 A stochastic process x(t) is mean-square differentiable if $\left.\frac{\partial^2}{\partial u\,\partial v} R_x(u,v)\right|_{u=v=t}$ exists for all t.

Theorem 8.5 A stationary stochastic process x(t) is mean-square differentiable if and only if $\left.\frac{d^2}{ds^2} R_x(s)\right|_{s=0}$ exists.
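The difference quotient that appears as the first term of (8.4) is easy to evaluate numerically, which gives a quick feel for these conditions. The sketch below compares an assumed smooth stationary autocorrelation, $R_x(u,v) = e^{-(u-v)^2}$ (an illustrative choice, not taken from the notes), with the Brownian motion autocorrelation min(u,v); the function name `second_difference` is also hypothetical.

```python
import math

def second_difference(R, t, eps):
    """(R(t+eps, t+eps) + R(t, t) - 2 R(t+eps, t)) / eps^2, the first term in (8.4)."""
    return (R(t + eps, t + eps) + R(t, t) - 2.0 * R(t + eps, t)) / eps ** 2

R_smooth = lambda u, v: math.exp(-(u - v) ** 2)   # assumed example, twice differentiable
R_brownian = lambda u, v: min(u, v)               # Brownian motion

# The smooth case settles near 2 (minus the second derivative of exp(-s^2) at s = 0),
# while the Brownian quotient equals 1/eps and diverges as eps shrinks.
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, second_difference(R_smooth, 1.0, eps), second_difference(R_brownian, 1.0, eps))
```

A bounded quotient is consistent with mean-square differentiability; the 1/ε growth for min(u,v) is the numerical signature of a process whose mean-square derivative does not exist.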

Note the stronger conditions that we can provide for stationary processes. Mean-square differentiability also has several useful consequences for stationary processes. For any stationary process which is mean-square differentiable, we have

$$E\left[\frac{d}{dt}x(t)\right] = \lim_{\epsilon \to 0} E\left[\frac{x(t + \epsilon) - x(t)}{\epsilon}\right] = \lim_{\epsilon \to 0} \frac{m_x - m_x}{\epsilon} = 0$$

where the interchange of differentiation and expectation is possible due to the existence of the limit.

For general stochastic processes which are mean-square differentiable, we can compute the autocorrelation statistics of their derivatives based on the statistics of the original process, as follows. Define $y(t) = \frac{d}{dt}x(t)$ for a stochastic process x(t). Then,

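$$\begin{aligned} R_{xy}(s,t) = E[x(s)y(t)] &= E\left[x(s) \lim_{\epsilon \to 0} \left\{\frac{x(t + \epsilon) - x(t)}{\epsilon}\right\}\right] = \lim_{\epsilon \to 0} E\left[x(s) \left\{\frac{x(t + \epsilon) - x(t)}{\epsilon}\right\}\right] \\ &= \lim_{\epsilon \to 0} \frac{R_x(s, t + \epsilon) - R_x(s,t)}{\epsilon} = \frac{\partial}{\partial t} R_x(s,t) \end{aligned} \tag{8.6}$$

Using a similar argument, we establish that

$$R_y(s,t) = \frac{\partial^2}{\partial s\,\partial t} R_x(s,t) \tag{8.7}$$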


As before, the above equations simplify when x is stationary, to give

$$R_y(s,t) = R_y(t - s) = \frac{\partial}{\partial s}\frac{\partial}{\partial t} R_x(t - s) = -\frac{d^2}{d\tau^2} R_x(\tau) \tag{8.8}$$

where τ = t − s, and the negative sign arises due to the negative sign of s in the argument of $R_x(t - s)$.

To conclude this section, let's consider our canonical examples of Brownian motion and Poisson processes. We determined that they were mean-square continuous. Are they differentiable? Note that neither process is stationary. For Brownian motion, $R_x(t,s) = \min(t,s)$. As a function of s, this equals s for s < t and t for s > t, so we can compute its derivative as

$$\frac{\partial}{\partial s} \min(t,s) = u(t - s)$$

where u is the unit step function. Differentiating with respect to t, we get

$$\frac{\partial}{\partial t} u(t - s) = \delta(t - s) = \delta(s - t)$$

where δ is the generalized impulse function we use formally; in fact, the derivative is not defined in a regular form, and the limit does not exist, since the value of the impulse function is infinite when its argument is zero. Thus, Brownian motion is not a mean-square differentiable process; however, it is often useful to keep track of formal derivatives using generalized functions. Indeed, we will often use the concept of white noise as a process with autocorrelation function $R_x(t,s) = \delta(s - t)$ as an engineering concept; formally, our analysis above suggests that white noise can be interpreted as the "derivative" of Brownian motion. White noise is a very useful concept in the modeling of broadband noise in communications and radar systems, and will be studied in greater detail later in the course. However, there is a rich mathematical theory which focuses on the study of Brownian motion and white noise, which is beyond the scope of our course.

Similar computations for Poisson processes establish that $R_y(s,t) = \lambda^2 + \lambda\delta(s - t)$. Thus, Poisson processes are also not mean-square differentiable. Generalizing from these examples, most continuous-time processes with independent increments will not be mean-square differentiable, because their autocorrelation functions depend on min(t,s).

8.3 Mean-Square Integration

We can define mean-square integrals using a limiting process, as before. To begin with, consider an integral of a stochastic process x(t,ω) over the interval [s,t]. For each sample path ω, we can construct an integral exactly the way we would construct it for deterministic functions: sampling the sample path over a discrete grid, computing Riemann sums, and taking the limit as the grid gets finer. The key question is, in what sense is the limit interpreted? As before, we will use the mean-square sense.

Mathematically, consider a sample path x(t,ω) over the interval [s,t]. Let Δ = (t − s)/N be the increment in a regular discretization of the interval. We define the integral y(t,ω) as

$$y(t,\omega) = \int_s^t x(\tau,\omega)\,d\tau \overset{\text{mss}}{=} \lim_{N \to \infty} \sum_{i=1}^{N} x(s + i\Delta, \omega)\,\Delta \tag{8.9}$$

Interpreting the limit in the appropriate sense, this means

$$\lim_{N \to \infty} E\left[\left(y(t,\omega) - \sum_{i=1}^{N} x(s + i\Delta, \omega)\,\Delta\right)^2\right] = 0$$

As in the case of differentiation, we are interested in conditions which guarantee that a process is integrable. First, taking expectations in (8.9), we get

$$E[y(t)] = E\left[\lim_{N \to \infty} \sum_{i=1}^{N} x(s + i\Delta)\,\Delta\right] = \lim_{N \to \infty} \sum_{i=1}^{N} E[x(s + i\Delta)]\,\Delta = \int_s^t m_x(\tau)\,d\tau \tag{8.10}$$


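When does the integral exist? Applying the Cauchy criterion, we get

$$0 = \lim_{N,M \to \infty} E\left[\left(\sum_{i=1}^{N} x(s + i\Delta_N)\,\Delta_N - \sum_{j=1}^{M} x(s + j\Delta_M)\,\Delta_M\right)^2\right] \tag{8.11}$$

The above convergence is guaranteed if we can show that

$$\lim_{N \to \infty} E\left[\left(\sum_{i=1}^{N} x(s + i\Delta)\,\Delta\right)^2\right] < \infty$$

Expanding and taking expectations,

$$\lim_{N \to \infty} E\left[\left(\sum_{i=1}^{N} x(s + i\Delta)\,\Delta\right)^2\right] = \lim_{N \to \infty} \sum_{i=1}^{N} \sum_{j=1}^{N} R_x(s + i\Delta, s + j\Delta)\,\Delta^2 = \int_s^t \int_s^t R_x(\sigma,\tau)\,d\sigma\,d\tau \tag{8.12}$$

This is formalized in the following result.

Theorem 8.6 The mean-square integral of a stochastic process x(t) exists over an interval [s,t] if

$$\int_s^t \int_s^t R_x(\sigma,\tau)\,d\sigma\,d\tau < \infty$$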

Note that when the process is wide-sense stationary we can simplify the condition somewhat. In particular, we can perform the following change of variables:

$$u = \tau - \sigma, \qquad v = \tau + \sigma$$

The Jacobian of this transformation is 2, so that dσ dτ = (1/2) du dv. We also need to transform the limits of integration. Note that the original limits correspond to a square in the (σ,τ) plane and in the new coordinates (which are a 45 degree rotation), this region will be a diamond. Thus the transformed integral is given by:

$$\begin{aligned} \int_s^t \int_s^t R_x(\tau - \sigma)\,d\sigma\,d\tau &= \int_0^{t-s} \left(\int_{2s+u}^{2t-u} R_x(u)\,\frac{1}{2}\,dv\right) du + \int_{-(t-s)}^{0} \left(\int_{2s-u}^{2t+u} R_x(u)\,\frac{1}{2}\,dv\right) du \\ &= \int_{-(t-s)}^{t-s} R_x(u) \left(\int_{2s+|u|}^{2t-|u|} \frac{1}{2}\,dv\right) du \\ &= \int_{-(t-s)}^{t-s} \left((t - s) - |u|\right) R_x(u)\,du = 2\int_0^{t-s} \left((t - s) - \tau\right) R_x(\tau)\,d\tau \end{aligned}$$

For the mean-square integral of a wide-sense stationary process to exist we want the integral above to exist (i.e. be finite). Note, for example, that this will be the case if $R_x(\tau)$ is absolutely integrable. This observation yields the following result:

Theorem 8.7 The mean-square integral of a wide-sense stationary stochastic process x(t) exists over an interval [s,t] if

$$\int_0^{t-s} |R_x(\tau)|\,d\tau < \infty$$
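The reduction of the double integral to a single integral with the triangular weight (t − s) − |u| can be checked numerically. The sketch below uses a simple midpoint rule and an assumed autocorrelation $R_x(u) = e^{-2|u|}$ on the interval [0, 1.5]; the function names, the grid sizes, and the choice of $R_x$ are illustrative only.

```python
import math

def double_integral(R, s, t, n=400):
    """Midpoint-rule approximation of the double integral of R(tau - sigma) over [s, t]^2."""
    h = (t - s) / n
    pts = [s + (i + 0.5) * h for i in range(n)]
    return sum(R(tau - sigma) for tau in pts for sigma in pts) * h * h

def triangle_integral(R, s, t, n=4000):
    """Midpoint-rule approximation of the integral of ((t - s) - |u|) R(u) over [-(t-s), t-s]."""
    length = t - s
    h = 2.0 * length / n
    us = (-length + (i + 0.5) * h for i in range(n))
    return sum((length - abs(u)) * R(u) for u in us) * h

R = lambda u: math.exp(-2.0 * abs(u))     # assumed wide-sense stationary autocorrelation

lhs = double_integral(R, 0.0, 1.5)
rhs = triangle_integral(R, 0.0, 1.5)
print(lhs, rhs)                           # the two values should agree closely
```

The single integral needs only one pass over the lag u, which is why this form is the one used in the examples that follow.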


As in the case of differentiation, we are interested in the relationship between the autocorrelation of the integral process and the original process. If the integral exists, this is easily computed through exchange of expectation and integration, as follows. Let $y(t,\omega) = \int_s^t x(\tau,\omega)\,d\tau$. Then,

$$R_y(a,b) = E\left[\int_s^a x(\tau,\omega)\,d\tau \int_s^b x(\sigma,\omega)\,d\sigma\right] = \int_s^a \int_s^b E[x(\tau)x(\sigma)]\,d\tau\,d\sigma = \int_s^a \int_s^b R_x(\tau,\sigma)\,d\tau\,d\sigma \tag{8.13}$$

For wide-sense stationary processes, the above integral can be simplified, as follows: Assume a ≤ b; then

$$R_y(a,b) = \int_s^a \int_s^b R_x(\tau - \sigma)\,d\tau\,d\sigma$$

Make the variable substitution u = τ − σ, v = 0.5(τ + σ). This results in

$$\begin{aligned} R_y(a,b) &= \int_{s-a}^{0} \int_{s-u/2}^{a+u/2} R_x(u)\,dv\,du + \int_0^{b-a} \int_{s+u/2}^{a+u/2} R_x(u)\,dv\,du + \int_{b-a}^{b-s} \int_{s+u/2}^{b-u/2} R_x(u)\,dv\,du \\ &= \int_{s-a}^{0} (a - s + u)\,R_x(u)\,du + \int_0^{b-a} (a - s)\,R_x(u)\,du + \int_{b-a}^{b-s} (b - s - u)\,R_x(u)\,du \end{aligned} \tag{8.14}$$

In particular, if a = b, this simplifies to

$$R_y(a,a) = \int_{s-a}^{a-s} (a - s - |u|)\,R_x(u)\,du \tag{8.15}$$

Example 8.1 Let x(t) be a wide-sense stationary process. Consider the moving average process y(t) defined as

$$y(t) = \frac{1}{2T} \int_{t-T}^{t+T} x(s)\,ds$$

What are the mean and covariance statistics of y? We answer this question using the above properties of integration. First, we assume that the autocorrelation function $R_x(\tau)$ satisfies appropriate integrability conditions. Then, by (8.15), we have:

$$R_y(t,t) = \frac{1}{4T^2} E\left[\int_{t-T}^{t+T} \int_{t-T}^{t+T} x(\sigma)x(\tau)\,d\sigma\,d\tau\right] = \frac{1}{4T^2} \int_{t-T}^{t+T} \int_{t-T}^{t+T} R_x(\sigma - \tau)\,d\tau\,d\sigma = \frac{1}{4T^2} \int_{-2T}^{2T} R_x(u)\,(2T - |u|)\,du \tag{8.16}$$

By a similar computation, we compute $m_y$ as

$$m_y = \frac{1}{2T} \int_{t-T}^{t+T} E[x(s)]\,ds = m_x \tag{8.17}$$

Thus, the variance of y(t) is given by $\sigma_y^2 = R_y(t,t) - m_x^2$. If we let the autocorrelation function of x have a special form, such as $R_x(t) = e^{-2a|t|}$ with $m_x = 0$, then we can evaluate this as

$$\sigma_y^2 = \frac{1}{2Ta} - \frac{1}{8a^2T^2}\left(1 - e^{-4aT}\right)$$
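The closed form above can be compared against a direct numerical evaluation of (8.16). The sketch assumes $m_x = 0$ and $R_x(u) = e^{-2a|u|}$ as in the example; the parameter values a = 1 and T = 2, the midpoint rule, and the function names are illustrative choices.

```python
import math

def variance_closed_form(a, T):
    """Closed-form variance of the moving average for R_x(u) = exp(-2 a |u|)."""
    return 1.0 / (2.0 * T * a) - (1.0 - math.exp(-4.0 * a * T)) / (8.0 * a ** 2 * T ** 2)

def variance_numeric(a, T, n=20000):
    """Midpoint-rule evaluation of (1 / 4T^2) * integral of R_x(u) (2T - |u|) over [-2T, 2T]."""
    h = 4.0 * T / n
    total = 0.0
    for i in range(n):
        u = -2.0 * T + (i + 0.5) * h
        total += math.exp(-2.0 * a * abs(u)) * (2.0 * T - abs(u))
    return total * h / (4.0 * T ** 2)

a, T = 1.0, 2.0
print(variance_closed_form(a, T), variance_numeric(a, T))

# The variance of the time average shrinks as the averaging window grows.
for window in (1.0, 10.0, 100.0):
    print(window, variance_closed_form(a, window))
```

For large T the second term is negligible and the variance behaves like 1/(2Ta), so it decays like 1/T.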

The above example raises some interesting questions concerning the relationship between time averages of a process and statistics of a process. In particular, note that, as T → ∞, as long as $R_x(u)$ decays to 0 fast enough (as it has to, in order to be integrable), the variance of the time average y(t) must approach 0! Thus, the time average must be approaching a deterministic constant! Indeed, we can determine from the above example that the constant must be the process mean, $m_x$. Thus, we have an example where, if we take a long enough temporal average along a single sample trajectory, this average converges (in the mean-square sense) to the true mean of the process, as averaged over all possible sample paths. This type of property is called an ergodic property. We discuss ergodicity in greater detail later.

There are other sample statistics of interest, which can be defined for each sample trajectory of wide-sense stationary processes. For instance, instead of the sample mean, we can define the sample autocorrelation of a trajectory with itself as:

$$\langle x(t + \tau)x(t) \rangle_T = \frac{1}{2T} \int_{-T}^{T} x(t + \tau, \omega)\,x(t, \omega)\,dt \tag{8.18}$$

Since this is defined for each trajectory, the resulting time average is a random variable. The question of interest is determining conditions which guarantee that

$$\lim_{T \to \infty} \langle x(t + \tau)x(t) \rangle_T = R_x(\tau)$$

Note that, in general, the limit will exist, as long as x(t) is stationary; however, it may itself be random! In order to show that it is a constant, we must analyze its variance, and show that the variance goes to zero.

Example 8.2 (Strong Law of Large Numbers) Let y(n) be a sequence of independent, identically distributed, zero-mean random variables, and let s be a constant. Define x(n) = s + y(n). Define the sample average of x as

$$S_N = \frac{1}{N} \sum_{i=1}^{N} x(i)$$

Note that the mean is

$$m_{S_N} = \frac{1}{N} \sum_{i=1}^{N} E[x(i)] = s$$

and the variance is

$$\sigma_{S_N}^2 = \frac{1}{N^2} \sum_{i=1}^{N} E[y^2(i)] = \frac{1}{N}\,\sigma_y^2$$

Thus, as N → ∞, the variance goes to zero, which establishes convergence of $S_N$ to s in the mean-square sense. In fact, the stronger statement

$$\lim_{N \to \infty} S_N \overset{\text{a.e.}}{=} s$$

also holds. This is the strong law of large numbers; in essence, it says that the time average of x(n) converges to its expected value almost everywhere.
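The convergence of the sample average can be observed along a single simulated sequence. The sketch below assumes Gaussian noise y(n) with unit variance and the constant s = 2; both are illustrative choices, as is the function name `sample_average`.

```python
import random

def sample_average(s, n, rng):
    """S_N for x(i) = s + y(i), with y(i) i.i.d. zero-mean, unit-variance Gaussian (assumed)."""
    return sum(s + rng.gauss(0.0, 1.0) for _ in range(n)) / n

rng = random.Random(0)
s = 2.0
# The spread of S_N around s shrinks like 1 / sqrt(N), matching the variance sigma_y^2 / N.
for n in (10, 1000, 100000):
    print(n, sample_average(s, n, rng))
```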

8.4 Integration and Differentiation of Gaussian Processes

When the original stochastic process x(·) is a Gaussian random process, will its integral and derivative also be Gaussian random processes? The answer is in the affirmative, as we will show below.

The key observation is that convergence in the mean-square sense implies convergence in distribution. Thus, consider a sequence $x_n$ of jointly Gaussian random vectors, which satisfy the Cauchy criterion (i.e. a Cauchy sequence): For any ε > 0, there exists an N(ε) such that, for n, m > N(ε),

$$E\left[(x_n - x_m)^T (x_n - x_m)\right] < \epsilon$$

As we discussed in the convergence section, such a sequence is guaranteed to converge in the mean-square sense to a random vector x; furthermore, each member of the sequence has a Gaussian distribution. Since mean-square convergence implies convergence in distribution, this requires that x also have a vector Gaussian distribution.


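Now, consider a Gaussian stochastic process x(·). For any sampling times $t_1, \ldots, t_k$, the derivative of x will be the limit in the mean-square sense of the Gaussian vector

$$\begin{bmatrix} \dfrac{x(t_1 + \epsilon) - x(t_1)}{\epsilon} \\ \dfrac{x(t_2 + \epsilon) - x(t_2)}{\epsilon} \\ \vdots \\ \dfrac{x(t_k + \epsilon) - x(t_k)}{\epsilon} \end{bmatrix}$$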

By the above argument, this limit will also have a Gaussian probability density function for any sampling times t1, . . . , tk, and thus the derivative process will also be Gaussian.

Similarly, the integral of a Gaussian stochastic process will also be Gaussian. That follows because the integral is defined as a limit in mean-square sense of a sum of jointly Gaussian random variables, and sums of jointly Gaussian random variables are Gaussian.

8.5 Generalized Mean-Square Calculus

Since white noise is a very useful abstraction in the analysis of engineering systems, we want to understand in what sense white noise is defined. As we discussed before, if we define white noise w(t) as the formal derivative of Brownian motion b(t), then we should have as an autocorrelation function

$$R_w(t,s) = \delta(t - s)$$

The purpose of this subsection is to provide additional intuition into the construction of the white noise process.

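Consider the increment process of Brownian motion, defined as

$$w_\Delta(t) = \frac{b(t + \Delta) - b(t)}{\Delta}$$

By the properties of Brownian motion, this is a zero-mean, Gaussian process with autocorrelation

$$R_{w_\Delta}(t,s) = E\left[\frac{b(t + \Delta) - b(t)}{\Delta}\,\frac{b(s + \Delta) - b(s)}{\Delta}\right] = \begin{cases} 0 & \text{if } |t - s| > \Delta \\ \dfrac{\Delta - |t - s|}{\Delta^2} & \text{otherwise} \end{cases} \tag{8.19}$$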

Note that Rw∆ (t,s) is a function only of the difference τ = s − t, and thus the process w∆ is wide-sense stationary. Since the process is Gaussian, in addition, then it is also strict-sense stationary.

A graph of the autocorrelation function Rw∆ (τ) would show a triangle of height 1/∆ and base of length 2∆. Thus, the area under the graph equals 1. As we shrink the size of ∆, the autocorrelation function Rw∆ (τ) converges to an impulse function δ(τ), which corresponds to the white noise limit.
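The triangle described above can be checked numerically: its area stays equal to 1 for every Δ, and integrating it against a smooth test function approaches the value of that function at zero, which is the defining behavior of a delta function. In this sketch the test function cos(τ), the grid size, and the function names are assumptions made for illustration.

```python
import math

def R_w_delta(tau, delta):
    """Triangular autocorrelation (8.19), written as a function of the lag tau = s - t."""
    return max(delta - abs(tau), 0.0) / delta ** 2

def smooth_against(h, delta, n=2000):
    """Midpoint-rule approximation of the integral of h(tau) R_w_delta(tau) over [-delta, delta]."""
    step = 2.0 * delta / n
    taus = (-delta + (i + 0.5) * step for i in range(n))
    return sum(h(tau) * R_w_delta(tau, delta) for tau in taus) * step

for delta in (1.0, 0.1, 0.01):
    area = smooth_against(lambda tau: 1.0, delta)   # area under the triangle
    sifted = smooth_against(math.cos, delta)        # tends to cos(0) = 1 as delta shrinks
    print(delta, area, sifted)
```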

Note that, even though the white-noise limit is a process which is difficult to construct, or even imagine what a sample path would look like as a function, it is easy to derive its properties. In particular, in a manner completely analogous to the definition of a delta function as a generalized function, we can derive the properties of white noise from the properties of its integral. Formally, since white noise is the limit of Gaussian, stationary processes, it should also be Gaussian and stationary.

The more rigorous way of defining white noise is as follows: Consider any bounded function h(t) (that is, |h(s)| < C for s ∈ [0, t]). Then, define integrals with respect to the white noise process w(s) as the mean-square limits of integrals with respect to the processes $w_\Delta(s)$, as

$$y(t) \equiv \int_0^t h(s)\,w(s)\,ds \overset{\text{mss}}{=} \lim_{\Delta \to 0} \int_0^t h(s)\,w_\Delta(s)\,ds$$


Note that, in the limit, the autocorrelation of the right-hand side becomes

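$$\begin{aligned} R_y(t,s) &= \lim_{\Delta \to 0} \int_0^t \int_0^s h(\sigma)h(\tau)\,R_{w_\Delta}(\tau,\sigma)\,d\tau\,d\sigma \\ &= \lim_{\Delta \to 0} \int_0^{\min(t,s)} \int_0^{\max(t,s)} h(\tau)h(\sigma)\,\frac{\Delta - |\tau - \sigma|}{\Delta^2}\,I(|\tau - \sigma| < \Delta)\,d\tau\,d\sigma \\ &= \int_0^{\min(t,s)} \int_0^{\max(t,s)} h(\tau)h(\sigma)\,\delta(\tau - \sigma)\,d\tau\,d\sigma \end{aligned} \tag{8.20}$$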

where the last equality comes from the definition of a deterministic delta function in terms of limits of integrals. As long as the function h is bounded, the Cauchy criterion can be used to establish that the mean-square limit y will exist, and will have the above autocorrelation. Furthermore, the above expression for autocorrelation is identical to that which would arise from formally assuming that the white noise process existed as an input, with autocorrelation Rw(t,s) = δ(t−s).

As an example, suppose that we wanted to define the random process $y(t) = \int_0^t w(s)\,ds$, where w is white noise. Clearly, if white noise is to be viewed as the derivative of Brownian motion, then y(t) = b(t) − b(0). However, let's demonstrate this fact using the above construction. For any Δ > 0, the mean-square integral

$$y_\Delta(t) = \int_0^t w_\Delta(s)\,ds$$

exists. Define the process y(t) as the mean-square limit of $y_\Delta(t)$, to obtain

$$\begin{aligned} y(t) &\overset{\text{mss}}{=} \lim_{\Delta \to 0} \int_0^t w_\Delta(s)\,ds \overset{\text{mss}}{=} \lim_{\Delta \to 0} \frac{\int_0^t (b(s + \Delta) - b(s))\,ds}{\Delta} \\ &\overset{\text{mss}}{=} \lim_{\Delta \to 0} \frac{\int_0^t b(s + \Delta)\,ds - \int_0^t b(s)\,ds}{\Delta} \overset{\text{mss}}{=} \lim_{\Delta \to 0} \frac{\int_t^{t+\Delta} b(s)\,ds - \int_0^{\Delta} b(s)\,ds}{\Delta} \overset{\text{mss}}{=} b(t) - b(0) \end{aligned} \tag{8.21}$$

because the Brownian process is integrable, and the limit converges in mean-square to the derivative of the integral! This states that, in the mean-square sense, the processes y(t) and b(t) − b(0) are indistinguishable; in particular, they have the same autocorrelation and mean.

The important aspect of the above definition is that the statistical properties of integrals of generalized processes obey the rules of the mean-square calculus. Specifically, the autocorrelation of y(t) is the limit of the autocorrelation of $y_\Delta(t)$, which is given by:

$$R_{y_\Delta}(t,s) = \int_0^t \int_0^s R_{w_\Delta}(\tau,\sigma)\,d\tau\,d\sigma \tag{8.22}$$

so that

$$R_y(t,s) = \lim_{\Delta \to 0} \int_0^t \int_0^s R_{w_\Delta}(\tau,\sigma)\,d\tau\,d\sigma = \lim_{\Delta \to 0} \int_0^{\min(t,s)} \int_0^{\max(t,s)} \frac{\Delta - |\tau - \sigma|}{\Delta^2}\,I(|\tau - \sigma| < \Delta)\,d\tau\,d\sigma$$

Note that, for Δ very small, the integrand is a deterministic function which is approaching a delta function! Indeed, one can show that the above limit becomes $R_y(t,s) = \min(s,t)$. This is the same expression which would have been obtained from formally substituting $R_w(a,b) = \delta(a - b)$, and using the mean-square calculus to obtain

$$R_y(t,s) = \int_0^t \int_0^s R_w(a,b)\,da\,db = \int_0^t \int_0^s \delta(a - b)\,da\,db = \int_0^t u(s - b)\,db = \min(s,t) \tag{8.23}$$


which is indeed the autocorrelation function for the Brownian increment b(t) − b(0). In the above equation, u(t) is the unit step function, which is the integral of the delta function. In a similar manner, the mean of y(t) will be the limit of the means of $y_\Delta(t)$, which are computed as

$$m_y(t) = \lim_{\Delta \to 0} \int_0^t m_{w_\Delta}(s)\,ds = \lim_{\Delta \to 0} 0 = 0$$

Thus, formally, we can define white noise as the generalized derivative of Brownian motion, and define the statistics of this derivative process as

$$m_w(t) = \frac{d}{dt} m_b(t)$$

$$R_w(t,s) = \frac{\partial^2}{\partial t\,\partial s} R_b(t,s)$$

where the derivatives are taken in a generalized sense, using delta functions. The above discussion shows that we can compute the statistics of the output process $y(t) = \int_0^t h(s)\,w(s)\,ds$ as:

$$m_y(t) = \int_0^t h(s)\,m_w(s)\,ds$$

$$R_y(t,s) = \int_0^t \int_0^s h(a)h(b)\,R_w(a,b)\,da\,db$$

and, even though the process w(t) does not exist as a mean-square derivative, integrals of the process can be defined, and the mean-square calculus can be extended in a natural manner using generalized functions to obtain the properties of integrals of white noise.

Note that Brownian motion is not the only process which will have a generalized mean-square derivative. Indeed, mean-square continuous independent increment processes such as Poisson processes will also have generalized derivatives which include delta functions. The important item to remember is that the use of delta functions is justified as the limit of ordinary functions, and that integrals of delta functions are well-defined. Thus, for stochastic processes, integrals of generalized mean-square derivatives such as white noise will be well-defined also!

In sum, the standard mean-square calculus can be extended to derivatives of processes which are mean-square continuous, but not mean-square differentiable, by defining generalized processes such as "white noise", with autocorrelation functions which use generalized functions such as delta functions and which can be obtained as the generalized derivatives of the autocorrelation functions of the original process.

To conclude, consider the process $z(t) = \int_0^t f(s)\,w(s)\,ds$. It is clear that the mean of z(t) will be the integral of the mean of f(s)w(s), which is zero! Also, since w(s) is Gaussian, the resulting integral will also be Gaussian. Furthermore, the autocorrelation function will be given by

$$\begin{aligned}
R_z(t,s) &= \int_0^t \int_0^s f(a)\,R_w(a,b)\,f(b)\,da\,db \\
&= \int_0^t \int_0^s f(a)\,\delta(a-b)\,f(b)\,da\,db \\
&= \int_0^t f(b)\,u(s-b)\,f(b)\,db \\
&= \int_0^{\min(t,s)} f^2(b)\,db
\end{aligned} \tag{8.24}$$
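
The variance formula implied by (8.24), $R_z(t,t) = \int_0^t f^2(b)\,db$, can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the notes: it approximates $w(s)\,ds$ by independent Gaussian increments of variance `dt` (unit-intensity white noise), and the names `simulate_z` and `f`, as well as the choice $f(s) = e^{-s}$, are hypothetical.

```python
import numpy as np

def simulate_z(f, t_end, n_steps, n_paths, seed=0):
    """Draw samples of z(t_end) = int_0^t_end f(s) w(s) ds.

    Assumption: w(s) ds is approximated by independent N(0, dt) increments.
    """
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    s = (np.arange(n_steps) + 0.5) * dt            # midpoint of each step
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    return (f(s) * dW).sum(axis=1)                 # one sample of z per path

f = lambda s: np.exp(-s)                           # hypothetical integrand
z = simulate_z(f, t_end=1.0, n_steps=200, n_paths=20000)

var_theory = (1.0 - np.exp(-2.0)) / 2.0            # int_0^1 f(b)^2 db
print(z.mean(), z.var(), var_theory)
```

The sample mean should be close to zero and the sample variance close to `var_theory`, up to Monte Carlo error.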

Note that by defining f(s) = h(t−s) for some causal impulse response function h, we have an integral which looks like the response of a causal linear system to a “white noise” input!

Finally, note that z(t) will be an independent increments process! In particular, since the process is Gaussian, we only have to show that nonoverlapping increments are uncorrelated. Thus, for t > s > u, we have

$$\begin{aligned}
E[(z(t) - z(s))(z(s) - z(u))] &= E\left[\int_s^t f(a)\,w(a)\,da \int_u^s f(b)\,w(b)\,db\right] \\
&= \int_s^t \int_u^s f(a)\,f(b)\,E[w(a)w(b)]\,da\,db \\
&= \int_s^t \int_u^s f(a)\,f(b)\,\delta(a-b)\,da\,db = 0
\end{aligned} \tag{8.25}$$

because the intervals do not overlap!

8.6 Ergodicity of Stationary Random Processes

One of the important questions in the analysis of stochastic processes is the determination of the statistical properties of the process. Ideally, in order to compute statistics such as the mean and autocorrelation, we would have to repeat the same experiment many times, obtain many realizations of the required random variables, and average them, in the limit, to obtain an accurate estimate. The strong law of large numbers provides us with the necessary theory to establish that such a procedure works.

In practice, there are many situations where we do not have the flexibility to repeat an experiment! In particular, in many situations we can only gather a single sample path of the stochastic process. For instance, in stock market modeling, we only observe a single time history of the prices; we do not have the luxury of “repeating” the experiment by moving time backwards and reliving the experience. In process control or communications, we observe the noise which is present at a particular time, but again we cannot repeat that experiment at the exact same time.

For most applications, what is needed is some way of learning the needed statistical quantities from a single observed sample function of the process. This is possible primarily for stationary random processes. In particular, for stationary processes, the random variables x(t) and x(s) have identical distributions. Thus, if we were to observe a sample path of the process over a time interval [−T,T], we can generate an estimate of the mean of the process

$$\frac{1}{2T}\int_{-T}^{T} x(s)\,ds$$

The above integral is to be interpreted in the mean-square sense, as discussed previously. Intuitively, it seems that this would be a good estimate, since we are “averaging” many identically distributed random variables. The problem is that they are not independent, so that the convergence properties of the law of large numbers do not apply. Nevertheless, in many cases, the above estimate will converge to the true mean mx as T →∞. This property is called ergodicity in the mean.

In essence, ergodicity is a property which establishes that certain time averages of sample functions of stochastic processes converge to their corresponding ensemble averages. Although there is a general definition of what we mean by a general ergodic process, we will focus our attention on ergodicity of certain statistics, such as ergodicity of the mean and autocorrelation. In this section we define these concepts, and discuss conditions where we can establish that processes have ergodic properties.

Before discussing the theory, let’s discuss why we should expect convergence, in spite of the fact that the samples x(t),x(s) are correlated across time. In particular, in our earlier handout on convergence, we discussed that the law of large numbers can be extended to correlated random variables! What was really needed was the condition that the variance of the weighted sum $s(n) = \frac{1}{n}\sum_{i=1}^{n}(x_i - E[x_i])$ decrease to zero, since, by the Chebyshev inequality, this would imply convergence in probability to zero, and this convergence could be extended to almost-sure convergence. Thus, showing that a process is ergodic in a given statistic will correspond to showing that the variance of that statistic is converging to zero.

What is the mechanism that leads to ergodic processes? In essence, although the process is correlated with itself, what is needed is that the degree of correlation decreases sufficiently rapidly with the time increment, so that the time average in question looks like the average of many uncorrelated random variables, which by the many forms of the weak law of large numbers will converge to the appropriate expectation. Consider the 3 examples below.


Example 8.3 The most trivial example of a stationary process which is not ergodic is the constant process x(t) = A, where A is a random variable. Clearly, any average over time will not be a true reflection of the statistics of A, but would merely be a sample value for A.

Example 8.4 Define the stochastic process x(t) = A sin t + B cos t, where A,B are Gaussian, zero-mean, unit variance, independent random variables. The process has zero mean; to verify that it is wide-sense stationary, we compute the autocorrelation function:

$$\begin{aligned}
E[x(t)x(s)] &= E[(A\sin t + B\cos t)(A\sin s + B\cos s)] \\
&= \sin t\,\sin s + \cos t\,\cos s = \cos(t-s)
\end{aligned} \tag{8.26}$$

In this case, it is not clear that the process is ergodic in any statistic, since the random process at time t is strongly correlated with the value at any other time s. However, this correlation fluctuates in sign, so perhaps this can average out.

Example 8.5 Consider the stationary stochastic process x(t) with autocorrelation function $R_x(\tau) = e^{-|\tau|}$. In this case, we should expect that certain statistics would be ergodic, since the autocorrelation function indicates that the strength of the correlation decreases exponentially with the time difference in the two samples.
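
The contrast between Example 8.3 and Example 8.5 can be illustrated with a small simulation. The sketch below uses a discrete-time stand-in, which is an assumption and not the continuous-time process of the notes: a stationary first-order autoregressive sequence with autocovariance $\rho^{|k|}$ (playing the role of $e^{-|\tau|}$) around a mean of 3. The names `ar1_path`, `rho`, and `mean` are hypothetical.

```python
import numpy as np

def ar1_path(n, rho, mean, seed=0):
    """Stationary discrete-time path with autocovariance rho**|k| about `mean`."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n) * np.sqrt(1.0 - rho**2)
    x = np.empty(n)
    x[0] = rng.normal()                      # start in the stationary law
    for k in range(1, n):
        x[k] = rho * x[k - 1] + noise[k]
    return mean + x

path = ar1_path(n=200_000, rho=0.9, mean=3.0)
time_average = path.mean()                   # single-path estimate of the mean

A = np.random.default_rng(1).normal()        # one draw of the random constant
constant_path_average = np.full(200_000, A).mean()   # equals A, not E[A] = 0

print(time_average, constant_path_average, A)
```

The time average of the exponentially decorrelating path lands near the ensemble mean, whereas the time average of the constant process x(t) = A simply returns the sampled value of A.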

With those examples in mind, let’s proceed to define ergodicity of the different statistics of interest.

Definition 8.4 (Ergodic in the Mean) A wide-sense stationary process x(t) is ergodic in the mean if the time average of x(t) converges in mean-square sense to the ensemble average E[x(t)] = mx. That is,

$$\lim_{T\to\infty} \langle m_x \rangle_T \equiv \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} x(s)\,ds \overset{\text{mss}}{=} m_x$$

Note that, in the above equation, the sample average 〈mx〉T is a random variable which is defined in terms of the values of the random process x(t) over the interval [−T,T]. Thus, the limit, if it exists, is also a random variable defined in terms of the process samples x(t), t ∈ (−∞,∞). Ergodicity is a statement that says that these special random variables are equal to constants!

Since 〈mx〉T is a random variable defined in terms of an integral of a stochastic process, we can compute its mean and variance using the theory of mean-square integration, as:

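$$E[\langle m_x \rangle_T] = \frac{1}{2T}\int_{-T}^{T} E[x(s)]\,ds = m_x$$

$$\begin{aligned}
E\left[(\langle m_x \rangle_T - m_x)^2\right] &= E\left[\left(\frac{1}{2T}\int_{-T}^{T} [x(s) - m_x]\,ds\right)^2\right] \\
&= \frac{1}{4T^2}\int_{-T}^{T}\int_{-T}^{T} E[(x(s) - m_x)(x(t) - m_x)]\,dt\,ds \\
&= \frac{1}{4T^2}\int_{-T}^{T}\int_{-T}^{T} K_x(s,t)\,dt\,ds = \frac{1}{4T^2}\int_{-T}^{T}\int_{-T}^{T} K_x(s-t)\,dt\,ds
\end{aligned} \tag{8.27}$$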

Clearly, if this is to converge to a constant as T → ∞, then the variance must decrease to zero. Indeed, this is a necessary and sufficient condition for convergence in the mean-square sense. However, the clumsy part about the condition in eq. (8.27) is that, although the autocovariance function is only a function of the time difference τ = t−s, the integral is stated in terms of integration with respect to both t and s. We can remedy this by switching the variables of integration, using the following coordinate transformation:

τ = t−s; σ = t + s

The Jacobian of this transformation is 2, as verified by simple computation. Thus, ds dt = 0.5 dσ dτ. We also need to transform the limits of integration. Note that the original limits correspond to a square in the (s,t) plane; in the new coordinates (which are a 45 degree rotation), it will be a diamond. Thus, the transformed integral is given by:

$$\begin{aligned}
\int_{-T}^{T}\int_{-T}^{T} K_x(t-s)\,dt\,ds &= \int_{0}^{2T}\left(\int_{-2T+\tau}^{2T-\tau} K_x(\tau)\,d\sigma\right) 0.5\,d\tau + \int_{-2T}^{0}\left(\int_{-2T-\tau}^{2T+\tau} K_x(\tau)\,d\sigma\right) 0.5\,d\tau \\
&= \int_{-2T}^{2T}\left(\int_{-2T+|\tau|}^{2T-|\tau|} K_x(\tau)\,d\sigma\right) 0.5\,d\tau = \int_{-2T}^{2T} (2T - |\tau|)\,K_x(\tau)\,d\tau
\end{aligned} \tag{8.28}$$

Thus, the condition for ergodicity in the mean becomes

$$\lim_{T\to\infty} \frac{1}{4T^2}\int_{-2T}^{2T} (2T - |\tau|)\,K_x(\tau)\,d\tau = 0$$

We formalize the above in the following result:

Theorem 8.8 A wide-sense stationary process x(t) is ergodic in the mean if and only if the autocovariance function satisfies

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} \left(1 - \frac{|\tau|}{2T}\right) K_x(\tau)\,d\tau = 0$$

We can also obtain a sufficient condition for ergodicity in the mean. Note that $0 \le \left(1 - \frac{|\tau|}{2T}\right) \le 1$ in the range of integration. Thus, $\left|\left(1 - \frac{|\tau|}{2T}\right) K_x(\tau)\right| \le |K_x(\tau)|$ in the range of integration. This gives a sufficient condition which is easier to verify:

Theorem 8.9 A wide-sense stationary process x(t) is ergodic in the mean if the autocovariance function satisfies

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} |K_x(\tau)|\,d\tau = 0$$

In addition to the mean, there are other statistics which we use in characterizing wide-sense stationary processes. We provide definitions for ergodicity of these statistics below:

Definition 8.5 (Ergodic in Mean Square) A wide-sense stationary stochastic process x(t) is ergodic in mean square if

$$\lim_{T\to\infty} \langle R_x(0) \rangle_T \equiv \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} x^2(s)\,ds \overset{\text{mss}}{=} R_x(0)$$

Definition 8.6 (Ergodic in Autocorrelation) A wide-sense stationary stochastic process x(t) is ergodic in autocorrelation if, for any shift τ,

$$\lim_{T\to\infty} \langle R_x(\tau) \rangle_T \equiv \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} x(s+\tau)\,x(s)\,ds \overset{\text{mss}}{=} R_x(\tau)$$

As with ergodicity in the mean, we can develop conditions for verifying that a process is ergodic in mean square or in autocorrelation. We have to establish that the variance of the random variables 〈Rx(0)〉T ,〈Rx(τ)〉T decreases to zero in the limit. Unfortunately, this will usually require the computation of higher order (fourth-order) moments of x(t), and one must show that the process has stationarity of fourth-order moments, which is stronger than wide-sense stationarity. For instance, a necessary and sufficient condition for ergodicity in autocorrelation is given in the following result:

Theorem 8.10 A wide-sense stationary stochastic process x(t) is ergodic in autocorrelation if and only if

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} \left(1 - \frac{|\tau|}{2T}\right) K_{\Phi_s}(\tau)\,d\tau = 0$$

for all s, where

$$K_{\Phi_s}(\tau) = E[x(t+s+\tau)\,x(t+\tau)\,x(t+s)\,x(t)] - R_x^2(s)$$

is the autocovariance of the correlation process $\Phi_s(t) = x(t+s)\,x(t)$.


The key to the above result is that a new stationary process, Φs(t), is defined in terms of the original process. Then, ergodicity in the mean of this new process is equivalent to ergodicity in autocorrelation for the estimate 〈Rx(s)〉T for shift s. When this is true for all s, then we have ergodicity in autocorrelation of the original process.

Example 8.6 Consider the process x(t) = A cos(t + θ), where A ∼ N(0, 1), and θ is uniform in [0, 2π], and A,θ are independent. Note that this process has mean mx(t) = E[A]E[cos(t + θ)] = 0. The autocorrelation of this process is given by:

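$$\begin{aligned}
R_x(t,s) = E[x(t)x(s)] &= \frac{1}{2\pi}\int_0^{2\pi} \cos(t+\theta)\cos(s+\theta)\,d\theta \\
&= \frac{1}{4\pi}\int_0^{2\pi} [\cos(t+s+2\theta) + \cos(t-s)]\,d\theta \\
&= \frac{1}{4\pi}\int_0^{2\pi} \cos(t-s)\,d\theta = \frac{1}{2}\cos(t-s)
\end{aligned} \tag{8.29}$$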

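which establishes that the process is wide-sense stationary, with $K_x(\tau) = \frac{1}{2}\cos\tau$. To show that the process is ergodic in the mean, we have to show that the variance of the estimate goes to zero in the limit. Up to the constant factor 1/2 in $K_x(\tau)$, this variance is

$$\frac{1}{2T}\int_{-2T}^{2T} \left(1 - \frac{|\tau|}{2T}\right)\cos(\tau)\,d\tau$$

Indeed, with a lot of algebra and Fourier analysis, we can show

$$\frac{1}{2T}\int_{-2T}^{2T} \left(1 - \frac{|\tau|}{2T}\right)\cos\tau\,d\tau = \frac{\sin^2 T}{T^2} \to 0 \quad \text{as } T \to \infty.$$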
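
The closed form above can be confirmed numerically without the algebra. The following is a minimal sketch, assuming the normalization $\frac{1}{2T}\int_{-2T}^{2T}(1 - |\tau|/2T)\,K(\tau)\,d\tau$ used in Theorem 8.8; the function name `mean_ergodicity_integral` and the grid size are hypothetical choices.

```python
import numpy as np

def mean_ergodicity_integral(K, T, n=200_001):
    """Trapezoidal approximation of (1/2T) int_{-2T}^{2T} (1 - |tau|/2T) K(tau) dtau."""
    tau = np.linspace(-2.0 * T, 2.0 * T, n)
    y = (1.0 - np.abs(tau) / (2.0 * T)) * K(tau)
    integral = np.sum((y[1:] + y[:-1]) * np.diff(tau)) / 2.0
    return integral / (2.0 * T)

for T in (1.0, 5.0, 20.0):
    numeric = mean_ergodicity_integral(np.cos, T)
    closed_form = np.sin(T) ** 2 / T ** 2
    print(T, numeric, closed_form)
```

The same helper can be applied to other autocovariances, for example `lambda tau: np.exp(-np.abs(tau))` from Example 8.5, to observe the decay as T grows.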

Thus, the process is ergodic in the mean. To check whether the process is ergodic in autocorrelation, consider if it is ergodic in mean square. That is, define

$$\begin{aligned}
\langle R_x(0) \rangle_T &= \frac{1}{2T}\int_{-T}^{T} x^2(s)\,ds \\
&= \frac{1}{2T}\int_{-T}^{T} A^2 \cos^2(s+\theta)\,ds \\
&= A^2\,\frac{1}{2T}\int_{-T}^{T} \cos^2(s+\theta)\,ds = \frac{A^2}{2} + \frac{A^2}{2T}\,\epsilon
\end{aligned} \tag{8.30}$$

where ε is a small number, depending on how much of a period is integrated in the interval [−T,T]. Thus, the limit A²/2 is a random variable whose value depends on A, and the process is not ergodic in mean square.
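
A direct computation on individual sample paths makes the point concrete. The sketch below is an illustration with hypothetical sample values of A and θ: it evaluates the time averages of x(t) and x²(t) over [−T,T] by the trapezoidal rule. The helper name `time_average` is an assumption, not notation from the notes.

```python
import numpy as np

def time_average(values, grid):
    """Trapezoidal time average of sampled values over the span of `grid`."""
    integral = np.sum((values[1:] + values[:-1]) * np.diff(grid)) / 2.0
    return integral / (grid[-1] - grid[0])

T = 500.0
s = np.linspace(-T, T, 400_001)
for A, theta in [(2.0, 0.3), (0.5, 1.7)]:
    x = A * np.cos(s + theta)                # one sample path of x(t)
    print(A, time_average(x, s), time_average(x**2, s), A**2 / 2)
```

For each path the time average of x is near zero, matching the ensemble mean, while the time average of x² is near A²/2 for that path's own A rather than the ensemble value Rx(0) = 1/2.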

Example 8.7 Recall the example of the stochastic process x(t) = A sin t + B cos t, where A,B are Gaussian, zero-mean, unit variance, independent random variables. The autocorrelation was given by E[x(t)x(s)] = cos(t − s). By the discussion in the previous example, this process is clearly ergodic in the mean. In the mean-square, direct computation establishes that 〈Rx(0)〉T ≈ (A² + B²)/2, which has a variance which will not converge to zero.

Sometimes it is useful to describe the ergodic properties of the distribution of the process. Define the random variable

$$I_x(t) = \begin{cases} 1 & \text{if } x(t) < x \\ 0 & \text{otherwise} \end{cases}$$

Then, we can define the sample distribution function

$$\langle P_x(x) \rangle_T = \frac{1}{2T}\int_{-T}^{T} I_x(t)\,dt$$

and under certain conditions, this random variable converges to the true distribution function of x(t); that is,

$$\lim_{T\to\infty} \langle P_x(x) \rangle_T = P_x(x)$$

We have the following result:


Theorem 8.11 The wide-sense stationary random process x(t) is ergodic in distribution if and only if

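$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} \left[1 - \frac{|\tau|}{2T}\right] K_{I_x}(\tau)\,d\tau = 0$$

where

$$\begin{aligned}
K_{I_x}(\tau) &= E[I_x(t+\tau)\,I_x(t)] - E[I_x(t)]^2 \\
&= P_x(x,x;t,t+\tau) - P_x(x;t)^2
\end{aligned} \tag{8.31}$$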

Typically, the above condition would be met if, as τ →∞, the random values x(t),x(t+τ) were asymptotically independent.

Note that verifying ergodicity is often reduced to verifying a condition of the form

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} \left[1 - \frac{|\tau|}{2T}\right] K_a(\tau)\,d\tau = 0 \tag{8.32}$$

for some wide-sense stationary random process a(t), defined in terms of the original random process x(t). We now provide a sufficient condition so that the above limit is zero:

Theorem 8.12 Suppose that Ka(τ) has a limit as τ →∞. Then, eq. (8.32) is true if and only if limτ→∞ Ka(τ) = 0.

The proof of the above result goes as follows. Clearly, if the limit is not zero, then the ergodic condition should not hold. So the only part that needs proof is to show that, if the limit is zero, then the condition of eq. (8.32) is satisfied. Assume that the limit is zero. Then, for any ε > 0, there exists a $T_\epsilon$ such that $|K_a(\tau)| < \epsilon$ for $|\tau| > T_\epsilon$. Let $T > T_\epsilon$. Then,

$$\begin{aligned}
\frac{1}{2T}\int_{-2T}^{2T} \left[1 - \frac{|\tau|}{2T}\right] K_a(\tau)\,d\tau &\le \frac{1}{2T}\left[\int_{-2T_\epsilon}^{2T_\epsilon} \left[1 - \frac{|\tau|}{2T}\right] K_a(\tau)\,d\tau + 4T\epsilon\right] \\
&= \frac{1}{2T}\,[M(\epsilon) + 4T\epsilon]
\end{aligned} \tag{8.33}$$

Thus,

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} \left[1 - \frac{|\tau|}{2T}\right] K_a(\tau)\,d\tau \le \lim_{T\to\infty} \frac{1}{2T}\,[M(\epsilon) + 4T\epsilon] = 2\epsilon$$

which establishes that the limit must be zero, since it is less than any arbitrary positive 2ε.

As a final concept in ergodicity, for stationary processes in the strict sense we can define the concept of ergodicity in terms of random variables defined on the samples of the process x(t). Let X denote the space of all random variables which can be defined based on samples of x(t); that is, y ∈ X ⇒ y = f({x(t) : t ∈ (−∞,∞)}) for some function f. Define the shift operation $T_s y = f(\{x(t+s) : t \in (-\infty,\infty)\})$. Then, a process is said to be completely ergodic if every random variable in X with the property that $T_s y = y$ for all shifts s is almost surely a constant.

How does the above definition relate to the previous concepts in ergodicity? Note that the random variables

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} f(x(t))\,dt$$

are in X, and are also invariant with respect to shifts Ts! Thus, all of the statistics which we describe above can be computed as time averages of a single sample path for a completely ergodic process. In essence, for any function f such that E[|f(x(0))|] < ∞, we have

$$\lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} f(x(t))\,dt = E[f(x(t))]$$

In general, it is very difficult to obtain conditions for complete ergodicity. However, in the special case of Gaussian random variables, a simple sufficient condition is possible, since the Gaussian distributions are specified completely by the mean and autocorrelation function:


Theorem 8.13 A Gaussian process x(t) is completely ergodic (also referred to as ergodic) if

$$\int_{-\infty}^{\infty} |K_x(\tau)|\,d\tau < \infty$$


Chapter 9

Linear Systems and Stochastic Processes

9.1 Introduction

In this section, we discuss the analysis of linear systems with random processes as inputs. Although most of the analysis is focused on continuous-time linear systems, the notes include some material on the analysis of discrete-time linear systems driven by random sequences as inputs. To begin with, we review some concepts from linear system theory for deterministic inputs. Then, we extend these concepts to systems with stochastic processes as inputs.

9.2 Review of Continuous-time Linear Systems

A general linear system with input u(t) and output y(t) has the form

$$y(t) = \int_{-\infty}^{\infty} h(t,s)\,u(s)\,ds \tag{9.1}$$

where h(t,s) is referred to as the impulse response or weighting function of the system. That is, if u(t) = δ(t− t0) where δ is the unit impulse, then y(t) = h(t,t0). The system is said to be causal if

h(t,s) = 0 for s > t (9.2)

or equivalently

$$y(t) = \int_{-\infty}^{t} h(t,s)\,u(s)\,ds \tag{9.3}$$

The system is said to be time-invariant if h(t,s) = h(t−s, 0) ≡ h(t−s), using the short-hand notation similar to that of autocorrelation for wide-sense stationary processes. If the system is time-invariant, then y(t) is the convolution of h(t) and u(t). That is,

$$y(t) = \int_{-\infty}^{\infty} h(t-s)\,u(s)\,ds = \int_{-\infty}^{\infty} h(s)\,u(t-s)\,ds \tag{9.4}$$

A linear, time-invariant system is causal if and only if h(t) = 0 for t < 0.

The analysis of linear, time-invariant systems is often conducted in the frequency domain, using Laplace transforms. Denote by $X(s) \equiv \mathcal{L}[x(t)]$ the two-sided Laplace transform of the time signal x(t), defined as

$$\mathcal{L}[x(t)] = \int_{-\infty}^{\infty} x(t)\,e^{-st}\,dt \equiv X(s) \tag{9.5}$$

Note that this is different from the standard one-sided Laplace transform which may be used for causal systems with initial conditions. Then, for a linear, time-invariant system, we have

Y (s) = H(s)U(s) (9.6)


where $Y(s) = \mathcal{L}[y(t)]$, $H(s) = \mathcal{L}[h(t)]$, $U(s) = \mathcal{L}[u(t)]$. The Fourier transform of x(t) is simply X(jω), so that, for a linear, time-invariant system,

Y (jω) = H(jω)U(jω) (9.7)

The inverse Fourier transform is given by

$$x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(j\omega)\,e^{j\omega t}\,d\omega \tag{9.8}$$

There are several important properties of Fourier and two-sided Laplace transforms which are summarized below:

1. Assume that x(t) is real-valued. Then, X(−jω) = X∗(jω), where x∗ denotes the complex conjugate of x.

2. If x(t) is an even function (i.e., x(−t) = x(t)), then X(−s) = X(s).

3. If x(t) is real-valued and even, so is X(jω).

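4. If $x(t) = e^{j\omega_0 t}$, then $X(j\omega) = 2\pi\,\delta(\omega - \omega_0)$.

5. If $x(t) = \cos(\omega_0 t)$, then $X(j\omega) = \pi(\delta(\omega - \omega_0) + \delta(\omega + \omega_0))$.

6. The Laplace transform of $\frac{d}{dt}x(t)$ is $sX(s)$.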

See Appendix A for a summary of the definition and properties of continuous-time Fourier transforms.

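Example 9.1 Using the above results, we see that, if the input u(t) = A cos(ω0t), then the Fourier transform of the output, Y (jω), is given by

$$Y(j\omega) = A\pi\,H(j\omega)\,(\delta(\omega - \omega_0) + \delta(\omega + \omega_0))$$

Transforming back to the time domain, we get

$$y(t) = \frac{A}{2}\left(H(j\omega_0)\,e^{j\omega_0 t} + H(-j\omega_0)\,e^{-j\omega_0 t}\right)$$

Letting $H(j\omega) = |H(j\omega)|\,e^{j\theta(\omega)}$, and using $H(-j\omega_0) = H^*(j\omega_0)$ for a real-valued impulse response, we get

$$y(t) = A\,|H(j\omega_0)|\cos(\omega_0 t + \theta(\omega_0))$$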
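
The sinusoidal steady-state result can be checked by evaluating the convolution integral (9.4) numerically. The sketch below assumes a specific causal system chosen only for illustration, $h(t) = e^{-t}u_{-1}(t)$ with $H(j\omega) = 1/(1 + j\omega)$, and truncates the infinite integration range; `response_at`, `predicted`, and the parameter values are hypothetical.

```python
import numpy as np

def response_at(t, A, w0, s_max=40.0, n=400_001):
    """y(t) = int_0^inf h(s) u(t - s) ds for h(s) = e^{-s}, u(t) = A cos(w0 t).

    The infinite upper limit is truncated at s_max (an approximation).
    """
    s = np.linspace(0.0, s_max, n)
    integrand = np.exp(-s) * A * np.cos(w0 * (t - s))
    return np.sum((integrand[1:] + integrand[:-1]) * np.diff(s)) / 2.0

A, w0 = 2.0, 3.0
H = 1.0 / (1.0 + 1j * w0)                     # H(j w0) for h(t) = e^{-t} u_{-1}(t)

def predicted(t):
    """Amplitude-scaled, phase-shifted cosine from the frequency response."""
    return A * abs(H) * np.cos(w0 * t + np.angle(H))

for t in (0.0, 0.7, 2.5):
    print(t, response_at(t, A, w0), predicted(t))
```

The two columns should agree to within the quadrature error, which illustrates that the output is the input cosine scaled by |H(jω0)| and shifted in phase by θ(ω0).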

Example 9.2 Consider an ideal low-pass filter, so that the transfer function

$$H(j\omega) = \begin{cases} 1 & \text{if } |\omega| \le W, \\ 0 & \text{otherwise} \end{cases}$$

In the time domain, the impulse response of such a system is given by

$$h(t) = \frac{\sin(Wt)}{\pi t}$$

In the analysis of continuous-time linear systems, it is often useful to define two standard input signals: the unit step $u_{-1}(t)$ and its generalized derivative, the unit delta function δ(t). The unit step function is defined as

$$u_{-1}(t) = \begin{cases} 1 & \text{if } t \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

The unit delta function has the formal property that $\frac{d}{dt}u_{-1}(t) = \delta(t)$. Furthermore, we have, for any continuous function g(t),

$$\int_a^b g(t)\,\delta(t)\,dt = \begin{cases} 0 & \text{if } 0 \notin (a,b] \\ g(0) & \text{otherwise} \end{cases}$$


Example 9.3 Consider the signal $x_1(t) = e^{at}u_{-1}(t)$. Its Laplace transform is given by

$$X_1(s) = \frac{1}{s-a}$$

$X_1(s)$ is said to have a pole at s = a; this means that the denominator is 0 at that value, so that the magnitude of $X_1(s)$ is unbounded.

Example 9.4 Consider the function $x_2(t) = t\,e^{at}u_{-1}(t)$. Its Laplace transform is given by

$$X_2(s) = \frac{1}{(s-a)^2}$$

$X_2(s)$ is said to have a pole of order 2. More generally, if $x_n(t) = \frac{t^n}{n!}\,e^{at}u_{-1}(t)$, then $X_n(s) = \frac{1}{(s-a)^{n+1}}$ has a pole of order n + 1.

The above examples provide the basis for inverting Laplace transforms which can be written as ratios of polynomials (also known as rational transforms); these transforms have the form

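$$X(s) = \frac{b_{n-1}s^{n-1} + b_{n-2}s^{n-2} + \ldots + b_1 s + b_0}{s^n + a_{n-1}s^{n-1} + a_{n-2}s^{n-2} + \ldots + a_1 s + a_0} \equiv \frac{n(s)}{d(s)} \tag{9.9}$$

The denominator polynomial d(s) can be factored as

$$d(s) = (s-\lambda_1)^{k_1}(s-\lambda_2)^{k_2}\cdots(s-\lambda_m)^{k_m}$$

for some distinct roots $\lambda_1, \ldots, \lambda_m$. Then, X(s) can be written as a sum of simple factors using a partial-fraction expansion, as

$$\begin{aligned}
X(s) = {} & \frac{A_{11}}{(s-\lambda_1)} + \frac{A_{12}}{(s-\lambda_1)^2} + \ldots + \frac{A_{1k_1}}{(s-\lambda_1)^{k_1}} \\
& + \frac{A_{21}}{(s-\lambda_2)} + \frac{A_{22}}{(s-\lambda_2)^2} + \ldots + \frac{A_{2k_2}}{(s-\lambda_2)^{k_2}} + \ldots + \frac{A_{mk_m}}{(s-\lambda_m)^{k_m}}
\end{aligned} \tag{9.10}$$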

See Appendix B for more detail on partial-fraction expansions. The constant coefficients $A_{ij}$ can be obtained by comparing the two equations (9.9) and (9.10) and matching coefficients of equal powers of s in the numerator. There is an alternative closed-form expression, given by

$$A_{ij} = \frac{1}{(k_i - j)!}\left\{\frac{d^{(k_i - j)}}{ds^{(k_i - j)}}\left[(s-\lambda_i)^{k_i} X(s)\right]\right\}_{s=\lambda_i} \tag{9.11}$$

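Once the partial-fraction expansion is known, the time signal x(t) is easily determined from the previous examples, as

$$\begin{aligned}
x(t) = \Big( & A_{11}e^{\lambda_1 t} + A_{12}\,t\,e^{\lambda_1 t} + \ldots + A_{1k_1}\frac{t^{k_1-1}}{(k_1-1)!}\,e^{\lambda_1 t} + A_{21}e^{\lambda_2 t} + A_{22}\,t\,e^{\lambda_2 t} + \ldots \\
& + A_{2k_2}\frac{t^{k_2-1}}{(k_2-1)!}\,e^{\lambda_2 t} + \ldots + A_{mk_m}\frac{t^{k_m-1}}{(k_m-1)!}\,e^{\lambda_m t}\Big)\,u_{-1}(t)
\end{aligned} \tag{9.12}$$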
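
The inversion procedure can be sketched in code for the simplest case. The example below is a minimal sketch that assumes all poles are distinct ($k_i = 1$), in which case (9.11) reduces to $A_{i1} = n(\lambda_i)/d'(\lambda_i)$; repeated poles are not handled. The function names `simple_pole_expansion` and `inverse_transform` are hypothetical.

```python
import numpy as np

def simple_pole_expansion(num, den):
    """Poles and residues of n(s)/d(s), assuming all poles are distinct.

    `num` and `den` list polynomial coefficients in descending powers of s.
    """
    poles = np.roots(den)
    residues = np.polyval(num, poles) / np.polyval(np.polyder(den), poles)
    return poles, residues

def inverse_transform(poles, residues, t):
    """Causal time signal x(t) = sum_i A_i exp(lambda_i t) u_{-1}(t)."""
    t = np.asarray(t, dtype=float)
    x = sum(r * np.exp(p * t) for p, r in zip(poles, residues))
    return np.real(x) * (t >= 0)

# X(s) = (s + 3) / (s^2 + 3 s + 2) = 2/(s + 1) - 1/(s + 2)
poles, residues = simple_pole_expansion([1.0, 3.0], [1.0, 3.0, 2.0])
x = inverse_transform(poles, residues, [0.0, 1.0])
print(poles, residues, x)
```

For this transform the time signal is $x(t) = (2e^{-t} - e^{-2t})\,u_{-1}(t)$, which the code evaluates at t = 0 and t = 1.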

Example 9.5 One of the applications of rational transforms is in the solution of linear differential equations with constant coefficients. Consider a causal, linear, time-invariant system with input u(t), and output y(t) defined as the solution of the linear differential equation with constant coefficients

$$\frac{d^n}{dt^n}y(t) + a_{n-1}\frac{d^{(n-1)}}{dt^{(n-1)}}y(t) + \ldots + a_0\,y(t) = b_m\frac{d^{(m)}}{dt^{(m)}}u(t) + b_{m-1}\frac{d^{(m-1)}}{dt^{(m-1)}}u(t) + \ldots + b_0\,u(t) \tag{9.13}$$

where m < n. Then, the transfer function for this system is given by

$$H(s) = \frac{Y(s)}{U(s)} = \frac{b_m s^m + b_{m-1}s^{m-1} + \ldots + b_1 s + b_0}{s^n + a_{n-1}s^{n-1} + a_{n-2}s^{n-2} + \ldots + a_1 s + a_0}$$

so that h(t) can now be obtained using a partial fraction expansion. Note that h(t) will be the sum of terms $e^{a_i t}u_{-1}(t)$, $t\,e^{a_i t}u_{-1}(t)$, and so on, where $a_i$ is a pole of H(s). The poles of H(s) will be the roots of the denominator polynomial d(s).


An important concept in the analysis of linear systems is the concept of stability. In particular, a linear, time-invariant system is called bounded-input, bounded-output stable if, whenever |u(t)| ≤ K < ∞ for all t, then there exists a finite value M such that |y(t)| ≤ M for all t. Since y(t) can be written in terms of u(t) and the impulse response function h(t), a necessary and sufficient condition for this stability is

$$\int_{-\infty}^{\infty} |h(t)|\,dt < \infty$$

Note that, for a causal system such as that defined in (9.13), this will be true if and only if all of the poles have negative real parts, so that they lie inside the complex left-half plane.

A system is said to be anti-causal if h(t) = 0 for t > 0. Anti-causal systems can also have rational transfer functions H(s); however, for anti-causal systems, stability corresponds to all poles lying in the right-half plane! For instance, the impulse response $h(t) = e^{at}u_{-1}(-t)$ is anti-causal, and has Laplace transform $H(s) = -\frac{1}{s-a}$. Plotting h(t) for a positive and negative indicates that the system is stable only if the real part of a is positive.

9.3 Review of Discrete-time Linear Systems

As in the continuous-time case, discrete-time linear systems with input u(t) and output y(t) can be written as

$$y(t) = \sum_{s=-\infty}^{\infty} h(t,s)\,u(s)$$

where h(t,s) is the impulse response. In discrete time, the delta function δ(t) is defined as

$$\delta(t) = \begin{cases} 1 & \text{if } t = 0 \\ 0 & \text{otherwise} \end{cases} \tag{9.14}$$

The system is causal if h(t,s) = 0 whenever s > t, and anticausal if h(t,s) = 0 whenever t > s. The system is time-invariant if h(t,s) = h(t−s, 0) ≡ h(t−s). For linear, time-invariant systems, y(t) is the convolution of h(t) and u(t), as

$$y(t) = \sum_{s=-\infty}^{\infty} h(t-s)\,u(s) = \sum_{s=-\infty}^{\infty} h(s)\,u(t-s) \tag{9.15}$$

As in the continuous-time case, it is easiest to analyze linear, time-invariant systems in the transform domain. For discrete-time systems, we define the bilateral z-transform of x(t) as

$$X(z) = \sum_{t=-\infty}^{\infty} x(t)\,z^{-t} \tag{9.16}$$

For a linear, time-invariant system, we have

Y (z) = H(z)U(z) (9.17)

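The Fourier transform of x(t) is defined as $X(e^{j\omega})$, so that

$$X(e^{j\omega}) = \sum_{t=-\infty}^{\infty} x(t)\,e^{-j\omega t}$$

Note that $X(e^{j\omega})$ is periodic in ω, with period 2π. The inverse Fourier transform is given by

$$x(t) = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\,e^{j\omega t}\,d\omega \tag{9.18}$$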

Useful properties of z-transforms and Fourier transforms are:

1. If x(t) is real, then $X(e^{-j\omega}) = X^*(e^{j\omega})$, and $X(z^{-1}) = X^*(z)$.


Function x(t)                         z-Transform X(z)
$\delta(t)$                           $1$
$a^{t-1}u_{-1}(t-1)$                  $\frac{1}{z-a}$
$(t-1)\,a^{t-2}u_{-1}(t-1)$           $\frac{1}{(z-a)^2}$
$t\,x(t)$                             $-z\,\frac{d}{dz}X(z)$
$x(t+1)$                              $z\,X(z)$
$x(t-1)$                              $\frac{X(z)}{z}$

Table 9.1: Common Functions and their z-transforms.

2. If x(t) is even, then X(z−1) = X(z).

3. If x(t) is real and even, so is X(ejω) as a function of ω.

4. The transform of x(t + 1) is zX(z).

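5. If $x(t) = e^{j\omega_0 t}$, then $X(e^{j\omega}) = 2\pi\,\delta(\omega - \omega_0)$.

6. If the input is $u(t) = A\cos(\omega_0 t)$, then, letting $H(e^{j\omega}) = |H(e^{j\omega})|\,e^{j\Theta(\omega)}$, the output is

$$y(t) = A\,|H(e^{j\omega_0})|\cos(\omega_0 t + \Theta(\omega_0))$$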

See Appendix A for a summary of the definition and properties of discrete-time Fourier transforms.

Consider the signal $x_1(t) = a^t u_{-1}(t)$; its z-transform is

$$X_1(z) = \frac{1}{1 - az^{-1}} = \frac{z}{z-a}$$

Note the presence of a pole at a. As in the continuous-time case, let $x_2(t) = t\,a^t u_{-1}(t)$. Then,

$$X_2(z) = \frac{az^{-1}}{(1 - az^{-1})^2} = \frac{az}{(z-a)^2}$$

Most of these identities are derived by noting that

$$\frac{d}{dz}X(z) = \sum_{t=-\infty}^{\infty} -t\,x(t)\,z^{-t-1}$$

Thus,

$$X_2(z) = \sum_{t=-\infty}^{\infty} t\,x_1(t)\,z^{-t} = -z\,\frac{d}{dz}X_1(z) = \frac{az}{(z-a)^2}$$
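
These two transform pairs can be verified by summing the defining series (9.16) at a point inside the region of convergence |z| > |a|. The snippet below is a sketch with hypothetical names (`z_transform_partial_sum`) and arbitrarily chosen values of a and z; it truncates the series after a finite number of terms.

```python
import numpy as np

def z_transform_partial_sum(x, z, n_terms=200):
    """Partial sum of sum_{t>=0} x(t) z^{-t} for a causal sequence x."""
    t = np.arange(n_terms)
    return np.sum(x(t) * z ** (-t.astype(float)))

a = 0.5
z = 1.5 + 0.5j                                   # any point with |z| > |a|
X1_series = z_transform_partial_sum(lambda t: a ** t, z)
X2_series = z_transform_partial_sum(lambda t: t * a ** t, z)
X1_closed = z / (z - a)
X2_closed = a * z / (z - a) ** 2
print(abs(X1_series - X1_closed), abs(X2_series - X2_closed))
```

Both printed differences should be at the level of floating-point round-off, since the neglected tail of the series decays geometrically.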

If $x_n(t) = t^{n-1}a^t u_{-1}(t)$, then $X_n(z) = -z\,\frac{d}{dz}X_{n-1}(z)$.

Using the above properties, it is useful to know the relationship between some standard z-transforms and their time functions. We summarize these in Table 9.1.

The above discussion allows us to invert functions with rational z-transforms using partial fraction expansions in a manner identical to the continuous-time case. As in the continuous-time case, we can also define the concept of stability for discrete-time linear, time-invariant systems. Such systems will be stable if and only if

$$\sum_{t=-\infty}^{\infty} |h(t)| < \infty$$

A necessary and sufficient condition for a causal system described by a rational transfer function to be stable is that all of the poles of the transfer function have magnitude less than 1, so that they are strictly inside the unit circle. For anti-causal systems, the stability condition is reversed, so that the poles must have magnitude strictly greater than 1.
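
The pole condition for causal systems is straightforward to test numerically. The sketch below is an illustration only: the helper name `is_stable_causal` is hypothetical, and it assumes that the numerator and denominator of the rational transfer function share no common roots, so that every denominator root is a pole.

```python
import numpy as np

def is_stable_causal(den):
    """True if all roots of the denominator lie strictly inside the unit circle.

    `den` lists coefficients of the denominator in descending powers of z.
    Assumes numerator and denominator share no common roots.
    """
    return bool(np.all(np.abs(np.roots(den)) < 1.0))

print(is_stable_causal([1.0, -0.5]))        # pole at z = 0.5
print(is_stable_causal([1.0, -2.5, 1.0]))   # poles at z = 0.5 and z = 2
```

For an anti-causal system the same roots would be tested against the reversed condition, magnitude strictly greater than 1.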


9.4 Extensions to Multivariable Systems

All of the ideas in the previous subsections are easily extended to address linear systems with vector-valued outputs y(t) and vector-valued inputs u(t). The general form of such a multi-input, multi-output (MIMO) linear system is

$$y(t) = \int_{-\infty}^{\infty} H(t,s)\,u(s)\,ds \tag{9.19}$$

where H(t,s) is the impulse response matrix. In the notation, we use capitals to denote matrices, and underlining to denote column vectors. Causality can be defined again in terms of H(t,s) = 0 if s > t. Time invariance corresponds to H(t,s) = H(t−s, 0) ≡ H(t−s). For linear, time-invariant systems, we can represent the system as a convolution, with

$$y(t) = \int_{-\infty}^{\infty} H(t-s)\,u(s)\,ds = \int_{-\infty}^{\infty} H(s)\,u(t-s)\,ds \tag{9.20}$$

In order to avoid some confusion in notation, we denote the Laplace transform of the matrix H(t) as the matrix H(s). In the transform domain, this implies

Y (s) = H(s)U(s) (9.21)

Similar extensions exist for discrete-time systems, where

$$y(t) = \sum_{s=-\infty}^{\infty} H(t,s)\,u(s)$$

9.5 Second-order Statistics for Vector-Valued Wide-Sense Stationary Processes

Before proceeding with the theory of linear systems driven by random processes, it will be useful to review a few definitions. To be consistent throughout the notes, we use the convention that the argument of an autocorrelation of a wide-sense stationary process is added to the time argument of the second variable. However, you must be aware that this notation is not standard, and that the answer may depend on the precise convention used above. Thus, for complex-valued wide-sense stationary processes, we have

$$R_x(t-s) = E[x(s)\,x(t)^H] = E[x(t)\,x(s)^H]^H = R_x(s-t)^H$$

Similarly, the cross-correlation between two jointly wide-sense stationary processes x(t),y(t) is

$$R_{xy}(t-s) = E[x(s)\,y(t)^H] = E[y(t)\,x(s)^H]^H = R_{yx}(s-t)^H \tag{9.22}$$

For vector wide-sense stationary processes, we can define the power spectral density $S_x(\omega)$ and the cross-power spectral density $S_{xy}(\omega)$ as

$$S_x(\omega) = \int_{-\infty}^{\infty} R_x(\tau)\,e^{-j\omega\tau}\,d\tau; \qquad S_{xy}(\omega) = \int_{-\infty}^{\infty} R_{xy}(\tau)\,e^{-j\omega\tau}\,d\tau$$

Thus, we can derive the following relationship for the special case of real-valued processes:

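$$S_x(-\omega) = \int_{-\infty}^{\infty} R_x(\tau)\,e^{j\omega\tau}\,d\tau = \left(\int_{-\infty}^{\infty} e^{-j\omega(-\tau)}\,R_x(-\tau)\,d\tau\right)^T = S_x(\omega)^T \tag{9.23}$$

We also know that, for any vector a, the scalar process $a^H x(t)$ must have a nonnegative power spectral density. This means

$$a^H S_x(\omega)\,a \ge 0 \quad \text{for any vector } a$$

which implies $S_x(\omega)$ is a positive semidefinite matrix for any ω.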


Consider now modulation of a wide-sense stationary process x(t) by a cosine with uniformly distributed phase. In particular, define the process y(t) = 2x(t) cos(ω0t + θ), where θ is uniformly distributed in [0, 2π], independent of x(t) for all t. Then, the autocorrelation of this process is given by:

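$$\begin{aligned}
R_y(t,s) &= 4E\left[x(t)\,x(s)^H \cos(\omega_0 t + \theta)\cos(\omega_0 s + \theta)\right] \\
&= 4E\left[x(t)\,x(s)^H\right] E\left[\cos(\omega_0 t + \theta)\cos(\omega_0 s + \theta)\right] \\
&= 4R_x(t-s)\left(\frac{1}{2}\cos(\omega_0(t-s)) + E\left[\frac{1}{2}\cos(\omega_0(t+s) + 2\theta)\right]\right) \\
&= 2R_x(t-s)\cos(\omega_0(t-s))
\end{aligned} \tag{9.24}$$

and the mean is

$$m_y(t) = 2E[x(t)]\,E[\cos(\omega_0 t + \theta)] = 0 \tag{9.25}$$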

which shows that y(t) is also wide-sense stationary. The power spectral density of y(t) can be computed as

$$S_y(\omega) = \int_{-\infty}^{\infty} R_y(\tau)\,e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} R_x(\tau)\left(e^{-j\omega_0\tau} + e^{j\omega_0\tau}\right)e^{-j\omega\tau}\,d\tau = S_x(\omega - \omega_0) + S_x(\omega + \omega_0) \tag{9.26}$$

We have already shown that $S_y(\omega)$ must be positive semidefinite for any ω. In particular, for ω = 0, this states that

$$S_x(-\omega_0) + S_x(\omega_0) \ge 0$$

for any arbitrary $\omega_0$. If x(t) is real-valued, then this can be further simplified to obtain

$$S_x(\omega) + S_x(\omega)^T \ge 0$$

a condition which is referred to as positive real.

As a final note, the covariance of the random vector x(t) from a wide-sense stationary process can be obtained readily from the power spectral density of the process as follows:

$$E[x(t)\,x(t)^H] = R_x(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_x(\omega)\,e^{j\omega 0}\,d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_x(\omega)\,d\omega \tag{9.27}$$

9.6 Continuous-time Linear Systems with Random Inputs

For the purposes of this section, it makes little difference whether the processes are scalar-valued or vector-valued, real-valued or complex-valued. In order to introduce the most general form of the results, we will make no assumptions, and describe the results in the general case. Thus, assume that u(t) is a complex-valued, vector-valued random process with mean $m_u(t)$ and autocorrelation $R_u(t,s)$, defined as

$$m_u(t) = E[u(t)]; \qquad R_u(t,s) = E[u(t)\,u(s)^H]$$

where $a^H = (a^T)^*$ is the transpose of the complex conjugate of a. Consider the linear system with input u(t), described by

$$y(t) = \int_{-\infty}^{\infty} H(t,s)\,u(s)\,ds \tag{9.28}$$

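The issue is to relate the statistics of the output process y(t) to those of the input process u(t). Fortunately, we have already developed a theory of mean-square integration, which allows us to determine the properties of (9.28) for each t. In particular, we know

$$m_y(t) = E\left[\int_{-\infty}^{\infty} H(t,s)\,u(s)\,ds\right] = \int_{-\infty}^{\infty} H(t,s)\,E[u(s)]\,ds = \int_{-\infty}^{\infty} H(t,s)\,m_u(s)\,ds \tag{9.29}$$

Furthermore, if z(t) is any other process, we know

$$\begin{aligned}
R_{yz}(t,s) &\equiv E[y(t)\,z(s)^H] \\
&= E\left[\int_{-\infty}^{\infty} H(t,\sigma)\,u(\sigma)\,z(s)^H\,d\sigma\right] \\
&= \int_{-\infty}^{\infty} H(t,\sigma)\,E[u(\sigma)\,z(s)^H]\,d\sigma \\
&= \int_{-\infty}^{\infty} H(t,\sigma)\,R_{uz}(\sigma,s)\,d\sigma
\end{aligned} \tag{9.30}$$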


In particular, this leads to

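$$R_{yu}(t,s) = \int_{-\infty}^{\infty} H(t,\sigma)\,R_{uu}(\sigma,s)\,d\sigma \tag{9.31}$$

and

$$\begin{aligned}
R_{yy}(t,s) &= \int_{-\infty}^{\infty} H(t,\sigma)\,R_{uy}(\sigma,s)\,d\sigma \\
&= \int_{-\infty}^{\infty} H(t,\sigma)\,R_{yu}(s,\sigma)^H\,d\sigma \\
&= \int_{-\infty}^{\infty} H(t,\sigma)\left[\int_{-\infty}^{\infty} H(s,\tau)\,R_{uu}(\tau,\sigma)\,d\tau\right]^H d\sigma \\
&= \int_{-\infty}^{\infty} H(t,\sigma)\left[\int_{-\infty}^{\infty} R_{uu}(\tau,\sigma)^H\,H(s,\tau)^H\,d\tau\right] d\sigma \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)\,R_{uu}(\tau,\sigma)^H\,H(s,\tau)^H\,d\tau\,d\sigma \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)\,R_{uu}(\sigma,\tau)\,H(s,\tau)^H\,d\tau\,d\sigma
\end{aligned} \tag{9.32}$$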

There are similar expressions for the autocovariances and cross-covariances, based on the autocovariance of the input u(t). For instance,

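$$K_{yy}(t,s) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)K_{uu}(\sigma,\tau)H(s,\tau)^H\,d\tau\,d\sigma \tag{9.33}$$

and

$$K_{yu}(t,s) = \int_{-\infty}^{\infty} H(t,\sigma)K_{uu}(\sigma,s)\,d\sigma$$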

One of the important properties of linear systems is the property of superposition. In particular, if the input is u(t) = x(t) + z(t), where x(t) and z(t) are random processes, we would like to make some statements about the statistics of the output y(t). Let

$$y_1(t) = \int_{-\infty}^{\infty} H(t,s)x(s)\,ds$$

$$y_2(t) = \int_{-\infty}^{\infty} H(t,s)z(s)\,ds$$

The question is: how are the statistics of y(t) related to the statistics of y1(t) and y2(t)?

To answer this question, let’s analyze the mean of y(t), which by (9.29) is given by

$$m_y(t) = \int_{-\infty}^{\infty} H(t,s)m_u(s)\,ds = \int_{-\infty}^{\infty} H(t,s)\left(m_x(s) + m_z(s)\right)ds = m_{y_1}(t) + m_{y_2}(t) \tag{9.34}$$

Thus, the means of the processes satisfy the deterministic superposition law! That is, the mean of the output in response to two inputs is the sum of the means of the individual outputs for each input.

Will the autocorrelation statistic satisfy a similar relationship? Consider the relationship provided by (9.32):

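$$\begin{aligned}
R_{yy}(t,s) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)R_{uu}(\sigma,\tau)H(s,\tau)^H\,d\tau\,d\sigma \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)\left[R_{xx}(\sigma,\tau) + R_{xz}(\sigma,\tau) + R_{zx}(\sigma,\tau) + R_{zz}(\sigma,\tau)\right]H(s,\tau)^H\,d\tau\,d\sigma \\
&= R_{y_1y_1}(t,s) + R_{y_2y_2}(t,s) + \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)\left[R_{xz}(\sigma,\tau) + R_{zx}(\sigma,\tau)\right]H(s,\tau)^H\,d\tau\,d\sigma
\end{aligned} \tag{9.35}$$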


Thus, if Rxz(σ,τ) = 0 for all σ, τ (which means the processes x(t) and z(s) are orthogonal, so that Rzx(σ,τ) = 0 as well), then we have

$$R_{yy}(t,s) = R_{y_1y_1}(t,s) + R_{y_2y_2}(t,s) \tag{9.36}$$

Similarly, if the processes x(t) and z(s) are uncorrelated, then we have superposition of the autocovariances:

$$K_{yy}(t,s) = K_{y_1y_1}(t,s) + K_{y_2y_2}(t,s) \tag{9.37}$$

A useful method for analysis of stochastic systems driven by random processes is to decompose the random process into two processes: the mean process and a zero-mean process. That is, for an input random process u(t), define the decomposed process as u(t) = mu(t) + u′(t), where u′(t) is zero-mean. Then, by superposition, the mean of the output y(t) is given as the sum of the means of the individual outputs:

$$m_y(t) = \int_{-\infty}^{\infty} H(t,s)\left(m_u(s) + m_{u'}(s)\right)ds = \int_{-\infty}^{\infty} H(t,s)m_u(s)\,ds \tag{9.38}$$

so that analysis of the mean of the output depends only on the first input. Note also that the processes mu(t) and u′(t) are uncorrelated, so that the autocovariance is given by

$$K_{yy}(t,s) = K_{y_1y_1}(t,s) + K_{y_2y_2}(t,s) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(t,\sigma)K_{u'u'}(\sigma,\tau)H(s,\tau)^H\,d\tau\,d\sigma$$

since the process mu(t) is deterministic, and therefore has zero autocovariance. Thus, the covariance of the output depends only on the process u′(t). It will be useful in the analysis of linear systems driven by stochastic processes, to remember that we can separate the analysis of the mean and the analysis of the autocorrelation.

Suppose now that our system is linear, time-invariant and that u(t) is wide-sense stationary. Assume in addition that the system is bounded-input, bounded-output stable. For vector input systems, this means that

$$\int_{-\infty}^{\infty}\sum_{i,j}|H_{ij}(t)|\,dt < \infty$$

Under these conditions, u(t) and y(t) are jointly wide-sense stationary, which can be verified by direct calculation:

$$m_y(t) = \int_{-\infty}^{\infty} H(s)m_u(t-s)\,ds = \int_{-\infty}^{\infty} H(s)\,ds\; m_u = m_y \tag{9.39}$$

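$$R_{yu}(t,s) = \int_{-\infty}^{\infty} H(t-\sigma)E[u(\sigma)u(s)^H]\,d\sigma = \int_{-\infty}^{\infty} H(t-\sigma)R_u(s-\sigma)\,d\sigma = \int_{-\infty}^{\infty} H(-\tau)R_u(s-t-\tau)\,d\tau \equiv R_{yu}(s-t) \tag{9.40}$$

A simpler way of writing the above equation uses the convolution operator ∗:

$$R_{yu}(t) = H(-t) * R_u(t)$$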

Furthermore, if u(t) and z(t) are jointly wide-sense stationary, then so are y(t) and z(t), and

$$R_{yz}(t,s) = \int_{-\infty}^{\infty} H(-\tau)R_{uz}(s-t-\tau)\,d\tau \equiv R_{yz}(s-t) \tag{9.41}$$

or, in convolution terms,

$$R_{yz}(t) = H(-t) * R_{uz}(t)$$

Extending the above discussion to z(t) = y(t), we get

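$$\begin{aligned}
R_y(t,s) &= \int_{-\infty}^{\infty} H(-\tau)R_{uy}(s-t-\tau)\,d\tau \\
&= \int_{-\infty}^{\infty} H(-\tau)R_{yu}(-\tau+s-t)^H\,d\tau \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(-\tau)R_u(\tau+t-s-\sigma)^H H(-\sigma)^H\,d\sigma\,d\tau \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(-\tau)R_u(s-t-\tau-\sigma)H(\sigma)^H\,d\sigma\,d\tau = R_y(t-s)
\end{aligned} \tag{9.42}$$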


It will be useful to recognize the convolution form of the above operation. Let

$$G(t) = H(-t) * R_u(t)$$

Then,

$$G(s-t-\sigma) = \int_{-\infty}^{\infty} H(-\tau)R_u(s-t-\sigma-\tau)\,d\tau$$

and we can rewrite (9.42) as

$$R_y(t) = \int_{-\infty}^{\infty} G(t-\sigma)H(\sigma)^H\,d\sigma = G(t) * H(t)^H$$

Note that similar equations can be obtained for the autocovariance Ky(t−s) and the cross-covariance Kyz(t−s), simply by replacing all R with K.

Since both the inputs and outputs are wide-sense stationary, and the system is time-invariant, one can take Fourier transforms of (9.40), (9.41) and (9.42) to obtain:

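$$S_{yu}(\omega) = H(-j\omega)S_u(\omega) \tag{9.43}$$

$$S_{yz}(\omega) = H(-j\omega)S_{uz}(\omega) \tag{9.44}$$

Note also that the Fourier transform of H(t)^H is given by

$$\int_{-\infty}^{\infty} H(t)^H e^{-j\omega t}\,dt = \left(\int_{-\infty}^{\infty} H(t)e^{j\omega t}\,dt\right)^H = H(-j\omega)^H = H(j\omega)^T$$

Hence,

$$S_y(\omega) = H(-j\omega)S_u(\omega)H(j\omega)^T \tag{9.45}$$

Using a similar analysis technique, we obtain

$$S_{uy}(\omega) = S_u(\omega)H(j\omega)^T \tag{9.46}$$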

Example 9.6 One way of interpreting the power spectral density of a process is to consider what happens to that density when it is filtered under an ideal band-pass filter. Consider a wide-sense stationary scalar process u(t), which is used as an input into a linear, time-invariant system with transfer function described in the frequency domain as

$$H(j\omega) = \begin{cases} 1 & \text{if } \omega \in (\omega_1,\omega_2) \\ 0 & \text{otherwise} \end{cases}$$

Denote the output as y(t). Using the above relationships, we have

$$S_y(\omega) = |H(j\omega)|^2 S_u(\omega)$$

Suppose we wanted to compute the second moment of the process y(t); as we have seen before, this is equal to Ry(0). Then, we can use the formula relating the autocorrelation to the power spectral density, as follows:

$$R_y(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_y(\omega)e^{j\omega t}\,d\omega$$

so that, in particular,

$$R_y(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_y(\omega)\,d\omega = \frac{1}{2\pi}\int_{\omega_1}^{\omega_2} S_u(\omega)\,d\omega$$

The result is that the average power in the output is given as an integral of the power spectral density of the input. This is consistent with the interpretation of the power spectral density as a density, since, if one integrates the density across a frequency band, one gets the average power in that frequency band.


Note that, for scalar-valued processes, many of the matrix relationships described above simplify. We summarize these below:

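$$\begin{aligned}
S_{yu}(\omega) &= H(-j\omega)S_u(\omega) = H(j\omega)^* S_u(\omega) \\
S_{uy}(\omega) &= S_u(\omega)H(j\omega) = H(j\omega)S_u(\omega) \\
S_y(\omega) &= H(j\omega)S_u(\omega)H(j\omega)^* = H(j\omega)H(j\omega)^* S_u(\omega) = |H(j\omega)|^2 S_u(\omega)
\end{aligned} \tag{9.47}$$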

Example 9.7 Consider the causal linear, time-invariant system described by the differential equation

$$\frac{d}{dt}y(t) = -ay(t) + u(t)$$

where a > 0. The transfer function of this linear system is given by

$$H(s) = \frac{1}{s+a}$$

Assume that the input is standard white noise, with power spectral density Su(ω) = 1. Then, the power spectral density of the output is given by

$$S_y(\omega) = |H(j\omega)|^2 S_u(\omega) = \frac{1}{a+j\omega}\cdot\frac{1}{a-j\omega} = \frac{1}{a^2+\omega^2}$$

Taking inverse Fourier transforms, the autocorrelation is given by

$$R_y(\tau) = \frac{1}{2a}e^{-a|\tau|}$$
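The result of Example 9.7 can be checked numerically. The following is a minimal sketch, not part of the original notes: it approximates the inverse Fourier transform of Sy(ω) = 1/(a² + ω²) by a truncated trapezoidal sum and compares it with the closed form above. The function name `autocorr_from_psd`, the value a = 1.5, and the truncation frequency and step size are all illustrative assumptions.

```python
import math

def autocorr_from_psd(psd, tau, w_max=1000.0, n=200_000):
    """Approximate R(tau) = (1/2pi) * integral of S(w) e^{j w tau} dw.

    Assumes S(w) is real and even, so the integral reduces to
    (1/pi) * integral_0^inf S(w) cos(w tau) dw, truncated at w_max
    and evaluated with the trapezoidal rule.
    """
    dw = w_max / n
    total = 0.5 * (psd(0.0) + psd(w_max) * math.cos(w_max * tau))
    for k in range(1, n):
        w = k * dw
        total += psd(w) * math.cos(w * tau)
    return total * dw / math.pi

a = 1.5  # illustrative pole location (assumption), a > 0

def psd_y(w):
    # Output PSD of Example 9.7 with white-noise input, S_u = 1
    return 1.0 / (a * a + w * w)

for tau in (0.0, 0.5, 2.0):
    numeric = autocorr_from_psd(psd_y, tau)
    closed_form = math.exp(-a * abs(tau)) / (2 * a)
    print(f"tau={tau}: numeric={numeric:.5f}, closed form={closed_form:.5f}")
```

Because the spectrum decays only as 1/ω², the truncation at `w_max` limits the agreement to roughly three decimal places at τ = 0; a larger `w_max` tightens it.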

Example 9.8 Consider the causal, linear, time-invariant system described by the differential equation

$$\frac{d^2}{dt^2}y(t) = -4\frac{d}{dt}y(t) - 4y(t) + \frac{d}{dt}u(t) + u(t)$$

The transfer function of this system is

$$H(s) = \frac{s+1}{s^2+4s+4} = \frac{s+1}{(s+2)^2}$$

Assume that the input u(t) is the sum of a standard white noise and a second wide-sense stationary process u′(t), uncorrelated with the white noise, with zero mean and autocovariance Ku′(τ) = e^{−|τ|}. The problem is to determine the autocovariance of the output process y(t). First, note that the output process will be zero-mean, because the mean of the two input processes is zero. Second, obtain the power spectral density of the input u as

$$S_u(\omega) = 1 + S_{u'}(\omega) = 1 + \frac{2}{1+\omega^2}$$

because the white noise process and u′(t) are uncorrelated. Third, obtain the power spectral density of the output as:

$$S_y(\omega) = H(j\omega)H(-j\omega)S_u(\omega) = \frac{1+\omega^2}{(\omega^2+4)^2}\left(1 + \frac{2}{1+\omega^2}\right)$$

Fourth, compute the power spectral density due to the white noise input only as

$$S_{y_1}(\omega) = \frac{1+\omega^2}{(\omega^2+4)^2} = \frac{1}{\omega^2+4} + \frac{-3}{(\omega^2+4)^2}$$

and compute the corresponding autocorrelation as

$$R_{y_1}(\tau) = \frac{1}{4}e^{-2|\tau|} + \frac{-3}{16}\left(\frac{1}{2} + |\tau|\right)e^{-2|\tau|}$$

Similarly, the power spectral density due to the u′(t) input is

$$S_{y_2}(\omega) = \frac{2(1+\omega^2)}{(1+\omega^2)(\omega^2+4)^2} = \frac{2}{(\omega^2+4)^2}$$

and the autocorrelation is given by

$$R_{y_2}(\tau) = \frac{1}{8}\left(\frac{1}{2} + |\tau|\right)e^{-2|\tau|}$$

Combining these, we get

$$R_y(\tau) = \frac{1}{4}e^{-2|\tau|} - \frac{1}{16}\left(\frac{1}{2} + |\tau|\right)e^{-2|\tau|}$$
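As a sanity check on Example 9.8, one can transform the final autocorrelation back to the frequency domain and compare it with the output spectrum. The sketch below is an illustration under stated assumptions, not part of the original notes: `psd_from_autocorr` is a hypothetical helper that evaluates the Fourier integral with the trapezoidal rule, and the truncation time and step count are arbitrary choices. Since Ry(τ) decays exponentially, the truncated integral is accurate.

```python
import math

def r_y(tau):
    # Closed-form autocorrelation from Example 9.8
    t = abs(tau)
    return 0.25 * math.exp(-2 * t) - (0.5 + t) * math.exp(-2 * t) / 16

def s_y(w):
    # Output PSD from Example 9.8: |H(jw)|^2 * S_u(w)
    h_sq = (1 + w * w) / (w * w + 4) ** 2
    s_u = 1 + 2 / (1 + w * w)
    return h_sq * s_u

def psd_from_autocorr(r, w, t_max=20.0, n=40_000):
    """Approximate S(w) = integral of R(tau) e^{-j w tau} d tau.

    Assumes R is real and even, so S(w) = 2 * integral_0^inf R(tau) cos(w tau) d tau,
    truncated at t_max and evaluated with the trapezoidal rule.
    """
    dt = t_max / n
    total = 0.5 * (r(0.0) + r(t_max) * math.cos(w * t_max))
    for k in range(1, n):
        t = k * dt
        total += r(t) * math.cos(w * t)
    return 2 * total * dt

for w in (0.0, 1.0, 3.0):
    print(f"w={w}: from R_y={psd_from_autocorr(r_y, w):.6f}, direct={s_y(w):.6f}")
```

The two columns should agree to several decimal places, which confirms the partial-fraction step and the combination of the two output components.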


Chapter 10

LLSE Estimation of Stochastic Processes and Wiener Filtering

10.1 Introduction

In Chapter 3 we studied the estimation of random variables based on observation of other random variables. In this chapter we extend this work to study the estimation of stochastic processes. Our focus will be the problem of finding the best linear minimum mean square error estimate (i.e. the linear least squares estimate – LLSE) of the zero-mean random process X(t) based on observation of the zero-mean process Y (τ) for Ti ≤ τ ≤ Tf . If the process is not zero mean, then we estimate X̃(t) = X(t) − mx(t) based on Ỹ (τ) = Y (τ)−my(τ) and substitute the definitions of X̃(t) and Ỹ (τ) in at the end, as usual (if this bothers you see Appendix D). In our search for the best estimate we assume that the second order statistics of the processes KXX(t,s), KY X(t,s), and KY Y (t,s) are known. Note, since we are assuming the means are zero KXX(t,s) = RXX(t,s), etc. In the development to follow we will present the results for both the continuous and the discrete time cases concurrently, since they are so similar. In summary then, the problem setup is given by the following:

X(t) = Process to estimate (10.1)

Y (τ); Ti ≤ τ ≤ Tf = Observed Process (10.2)

Given: KXX(t,s), KY X(t,s), KY Y (t,s) (10.3)

x(t), y(τ) zero mean (10.4)

Since we want a linear estimate we know the estimate will be a linear combination of the points in the observation interval. For the discrete case, which we already have experience with, this corresponds to a simple weighted sum, while for the continuous case this operation corresponds to a weighted integral. In particular, the form of the estimator for each case will be:

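\begin{align}
\text{CT:}\quad \hat{x}(t) &= \int_{T_i}^{T_f} h(t,\sigma)\,y(\sigma)\,d\sigma \tag{10.5}\\
\text{DT:}\quad \hat{x}(t) &= \sum_{\sigma=T_i}^{T_f} h(t,\sigma)\,y(\sigma) \tag{10.6}
\end{align}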

Thus the estimate may be viewed as the output of a time-varying linear filter whose weighting pattern h(t,σ) defines the estimator, as shown in Figure 10.1. When estimating one stochastic process based on observation of another then, the estimator itself may be simply interpreted as a filter, and LLSE estimation viewed as a problem of filter design.

[Block diagram: y(t) → h(t,s) → x̂(t)]

Figure 10.1: Linear Estimator for a Stochastic Process.

There is some standard terminology that is used in describing such problems depending on the time t the estimate x̂(t) is generated relative to the time interval Ti ≤ τ ≤ Tf of the observation. These correspond to whether the estimate is on the boundary, the interior, or the exterior of the observation interval. The most common are:

Ti < t < Tf : Smoothing, Noncausal filtering

t = Tf : Filtering

Tf < t : Prediction

The relationship between the time the estimate is generated and the time interval of the observation is shown in Figure 10.2.

[Timeline: the data interval runs from Ti to Tf on the t axis; estimate times inside the interval correspond to smoothing, t = Tf to filtering, and t > Tf to prediction.]

Figure 10.2: Estimation Types Based on Relative Times of Observation and Estimate.

10.2 Historical Context

Before proceeding to the mathematical developments leading to the LLSE estimate, it is useful (and perhaps some will think more interesting!) to first consider the historical context for its development, and hence better understand the type of result we will obtain. The problem of linear least squares estimation of one stochastic process based on observation of another stochastic process was originally solved for the continuous case by Norbert Wiener and hence is often called “Wiener Filtering,” even when applied to the discrete-time situation. Actually, the discrete case was solved by the Russian mathematician Kolmogorov.

To understand the goal in the mind of these pioneers, let us start with a question. What do you think of when you think of a “filter”? In other words, if asked to design or implement a filter by e.g. your boss or instructor, what would you bring back as your result? (Please take a minute to think about this before reading on!). In the present day, when filter banks rule the DSP journals and wavelets are everywhere, most people would immediately think of a digital definition of a filter, and of a filter as being defined e.g. by its vectors of delay coefficients or some such – at least that is what I do. But such was not always the case, and we must think about how Wiener saw the world to understand the form of his solution. Let us begin with a brief history of Norbert Wiener.

Norbert Wiener was born November 26, 1894 in Columbia, MO and died March 18, 1964 in Stockholm, Sweden. He was a child prodigy, finishing high school at the age of 11 and getting an undergraduate degree from Tufts in mathematics at age 14. He went on to Harvard and obtained his Ph.D. in Mathematical Logic at age 18! Wiener then went abroad and studied under the great mathematicians Russell, Hardy, and Hilbert. Finally in 1919 he obtained a teaching appointment at MIT, where he remained for the rest of his life. Wiener contributed to many areas including cybernetics (a term he coined), stochastic processes, and quantum theory.

Thus we can see that Wiener was active during the period from the 1920’s to, say, the 1950’s. The major world event during this period was World War II, and this event formed the backdrop for Wiener’s work. In particular, Wiener worked on gun fire control at MIT, the goal being to direct a gun to shoot down an airplane. With this motivation, Wiener worked through the 1930’s on the problem of estimation and prediction of continuous-time processes, which is all that really mattered at the time! The first general purpose electromechanical digital computer, the “Mark I,” was built at Harvard in 1944 (a multiply took 4 seconds) and the first electronic digital computer was built in 1946, as the war drew to a close. The results of Wiener’s work on this problem (not declassified till the late 1940’s) were written up in an internal MIT technical report entitled “Extrapolation, Interpolation, and Smoothing of Stationary Time Series,” known popularly by engineers of the time as “the yellow peril”, due to the yellow color of its original cover.

For Wiener, working when he did, a filter was truly an analog device composed of capacitors, resistors, and the like. As such, specification of a realizable filter required specification of its poles and zeros. In other words, what was needed was a closed-form solution to the problem if it was to be implemented. We will contrast this to the recursive algorithm comprising the Kalman filter later. Given this view, the focus of work during Wiener’s time and for some time beyond was on a closed-form solution to the LLSE estimation problem and the explicit specification of the corresponding filter. Let us now proceed to find this solution.

10.3 LLSE Problem Solution: The Wiener-Hopf Equation

The solution to the problem described in (10.1)–(10.4) can be found through use of the orthogonality principle discussed previously. In particular, we know that the optimal LLSE estimate will have the property that it is unbiased and that the error is orthogonal to the data:

E [(x(t) − x̂(t)) y(τ)] = 0 ∀τ ∈ [Ti, Tf ] (10.7)

Expanding this expression, we obtain the following condition that the optimal LLSE estimate must satisfy:

KXY (t,τ) = KX̂Y (t,τ) ∀τ ∈ [Ti, Tf ] (10.8)

Note that this condition says that the optimal estimate x̂(t) has the same cross-correlation with the data as the true process x(t).

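We can find K_{X̂Y}(t,τ) using the definition of cross-correlation and (10.5) or (10.6). Working with the CT case:

\begin{align}
K_{\hat{X}Y}(t,\tau) &= E[\hat{x}(t)y(\tau)] \tag{10.9}\\
&= \int_{T_i}^{T_f} h(t,\sigma)\,E[y(\sigma)y(\tau)]\,d\sigma \tag{10.10}\\
&= \int_{T_i}^{T_f} h(t,\sigma)\,K_{YY}(\sigma,\tau)\,d\sigma \tag{10.11}
\end{align}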

Similarly in discrete time we have:

$$K_{\hat{X}Y}(t,\tau) = \sum_{\sigma=T_i}^{T_f} h(t,\sigma)\,K_{YY}(\sigma,\tau) \tag{10.12}$$

Now we can substitute (10.11) or (10.12) into (10.8) to obtain the following conditions that the optimal filter h(t,σ) must satisfy:

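\begin{align}
\text{CT:}\quad K_{XY}(t,\tau) &= \int_{T_i}^{T_f} h(t,\sigma)\,K_{YY}(\sigma,\tau)\,d\sigma & \forall\tau \in [T_i, T_f] \tag{10.13}\\
\text{DT:}\quad K_{XY}(t,\tau) &= \sum_{\sigma=T_i}^{T_f} h(t,\sigma)\,K_{YY}(\sigma,\tau) & \forall\tau \in [T_i, T_f] \tag{10.14}
\end{align}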

This equation is called the Wiener-Hopf equation, and captures the conditions that the optimal estimate must satisfy.

In addition to the Wiener-Hopf equation for the optimal estimate, we can also get expressions for the estimation error variance ΛLSE(t):

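\begin{align}
\Lambda_{LSE}(t) &= E\left[(x(t) - \hat{x}(t))^2\right] = E[(x(t) - \hat{x}(t))(x(t) - \hat{x}(t))] \tag{10.15}\\
&= E[x(t)(x(t) - \hat{x}(t))] - E[\hat{x}(t)(x(t) - \hat{x}(t))] \tag{10.16}\\
&= K_{XX}(t,t) - K_{X\hat{X}}(t,t) \tag{10.17}
\end{align}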


where we have used the fact that the second term in (10.16) is zero since the error is orthogonal to linear functions of the data, in this case the estimate itself. Now we can calculate K_{XX̂}(t,t) from the definition of covariance and (10.5) or (10.6). Doing this yields the following expressions for the error covariance:

\begin{align}
\text{CT:}\quad \Lambda_{LSE}(t) &= K_{XX}(t,t) - \int_{T_i}^{T_f} h(t,\sigma)\,K_{YX}(\sigma,t)\,d\sigma \tag{10.18}\\
\text{DT:}\quad \Lambda_{LSE}(t) &= K_{XX}(t,t) - \sum_{\sigma=T_i}^{T_f} h(t,\sigma)\,K_{YX}(\sigma,t) \tag{10.19}
\end{align}

Before we proceed to Wiener filtering, we stop to point out that while this development may seem new and intimidating, we have visited some of these issues before in our study of random variables. In particular, consider the discrete-time result (10.14) when the times involved are finite. We can collect the equations represented by this set of conditions into vector form to obtain a matrix equation capturing the entire set:

$$\begin{bmatrix} K_{XY}(t,T_i) & \cdots & K_{XY}(t,T_f) \end{bmatrix} = \begin{bmatrix} h(t,T_i) & \cdots & h(t,T_f) \end{bmatrix} \begin{bmatrix} K_{YY}(T_i,T_i) & K_{YY}(T_i,T_i+1) & \cdots & K_{YY}(T_i,T_f) \\ K_{YY}(T_i+1,T_i) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ K_{YY}(T_f,T_i) & K_{YY}(T_f,T_i+1) & \cdots & K_{YY}(T_f,T_f) \end{bmatrix} \tag{10.20}$$

$$\Longrightarrow\quad \Lambda_{XY} = h^T\Lambda_Y \tag{10.21}$$

where we have made the natural matrix/vector associations in the last equation. As can be seen (10.21) are nothing more than the familiar normal equations. Their solution is given by:

$$h^T = \Lambda_{XY}\Lambda_Y^{-1} \tag{10.22}$$

Recall that the associated error covariance for this case was given by

$$\Lambda_{LSE} = \Lambda_X - \Lambda_{XY}\Lambda_Y^{-1}\Lambda_{XY}^T = \Lambda_X - h^T\Lambda_{XY}^T \tag{10.23}$$

Notice the similarity between (10.23) and (10.19). Thus solving the Wiener-Hopf equations in the general (finite observation interval) discrete-time case is equivalent to solving the normal equations and, while straightforward, could be computationally challenging for large problem sizes. In our examination of the Kalman filter we will see that in certain situations we can be more computationally efficient. In the continuous time case we must solve the corresponding integral equation given by (10.13). Overall, solving the Wiener-Hopf equation is difficult in general and we must look at special cases to proceed further.
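To make the finite-interval discrete-time case concrete, the following sketch solves the normal equations (10.21)–(10.22) and evaluates the error (10.23) for a small illustrative problem. Everything specific here is an assumption chosen for the example and not taken from the notes: the signal covariance KXX(k) = Qρ^|k|, the observation model y = x + v with white noise of variance R, and the helper names `solve_linear`, `build_lam_y`, `build_lam_xy`, and `llse_weights`. A plain Gaussian elimination is used so that the block needs only the standard library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def build_lam_y(q, rho, r, n_obs):
    # Lambda_Y[i][j] = K_YY(i, j) = Q rho^|i-j| + R delta_ij (assumed model)
    return [[q * rho ** abs(i - j) + (r if i == j else 0.0)
             for j in range(n_obs)] for i in range(n_obs)]

def build_lam_xy(q, rho, n_obs, t):
    # Lambda_XY[j] = K_XY(t, j) = Q rho^|t-j| (signal and noise uncorrelated)
    return [q * rho ** abs(t - j) for j in range(n_obs)]

def llse_weights(q, rho, r, n_obs, t):
    """Return (h, error) with h^T = Lambda_XY Lambda_Y^{-1} and error from (10.23)."""
    lam_y = build_lam_y(q, rho, r, n_obs)
    lam_xy = build_lam_xy(q, rho, n_obs, t)
    # Lambda_Y is symmetric, so h^T Lambda_Y = Lambda_XY is Lambda_Y h = Lambda_XY^T
    h = solve_linear(lam_y, lam_xy)
    error = q - sum(hj * kj for hj, kj in zip(h, lam_xy))
    return h, error

h, error = llse_weights(q=2.0, rho=0.8, r=0.5, n_obs=6, t=3)
print("weights:", [round(v, 4) for v in h])
print("error variance:", round(error, 4))
```

With observation times 0 through 5 and t = 3 the estimate is a smoothing estimate; choosing t = 5 gives filtering and t > 5 prediction, in the terminology introduced earlier. The cost of the direct solve grows as the cube of the number of observations, which is the computational burden mentioned above.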

10.4 Wiener Filtering

In this section we discuss what is known as Wiener Filtering. Wiener filtering problems are a subclass of LLSE problems where additional assumptions are made. In particular, all Wiener filtering problems satisfy the following assumptions:

Definition 10.1 (Wiener Filtering) • Find LLSE estimate of x(t) based on y(τ), τ ∈ [Ti, Tf ]

• x(t), y(t) are jointly wide-sense stationary

• Ti = −∞

Thus the additional assumptions over and above the LLSE problem are the stationarity of the processes and the fact that we observe the data starting at Ti = −∞. These two additional assumptions assure that there are no transients in the estimate, in particular that the corresponding filter h(t,σ) will be time-invariant, i.e.:

h(t,σ) = h(t−σ) (10.24)


Substituting this into the expressions for the Wiener-Hopf equation (10.13) and (10.14) and making the changes of variables v = t−σ and u = t− τ yields the following expressions for the Wiener-Hopf equations for this case:

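\begin{align}
\text{CT:}\quad K_{YX}(u) &= \int_{t-T_f}^{\infty} h(v)\,K_{YY}(u-v)\,dv & t-T_f \le u \le \infty \tag{10.25}\\
\text{DT:}\quad K_{YX}(u) &= \sum_{v=t-T_f}^{\infty} h(v)\,K_{YY}(u-v) & t-T_f \le u \le \infty \tag{10.26}
\end{align}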

Note that these expressions are written in terms of KY X on the left hand side rather than KXY , which involves a change in the sign of the argument. In addition, we have changed the sign of the left hand side by exchanging the limits of the integral.

There are common subclasses of Wiener filtering problems based on the choice of the right end point of the observation interval Tf . The two cases we will look at in detail are when Tf = +∞, so we have observations for all of time, and when Tf = t so we produce an estimate based only on past observations.

10.4.1 Noncausal Wiener Filtering (Wiener Smoothing)

The first case we will examine is when we choose Tf = +∞, so that we base our estimate at time t on observations of y(τ) for all time: τ ∈ [−∞, ∞]. This case is referred to as Wiener smoothing or noncausal Wiener filtering, since the filter will necessarily be noncausal. For this reason the corresponding filter is sometimes referred to as the unrealizable Wiener filter.

Definition 10.2 (Noncausal Wiener Filtering or Wiener Smoothing) • Find LLSE estimate of x(t) based on y(τ), τ ∈ [Ti, Tf ]

• x(t), y(t) are jointly wide-sense stationary

• Ti = −∞

• Tf = +∞

In this case the Wiener-Hopf equations become:

\begin{align}
\text{CT:}\quad K_{YX}(u) &= \int_{-\infty}^{\infty} h_{nc}(v)\,K_{YY}(u-v)\,dv & -\infty \le u \le \infty \tag{10.27}\\
\text{DT:}\quad K_{YX}(u) &= \sum_{v=-\infty}^{\infty} h_{nc}(v)\,K_{YY}(u-v) & -\infty \le u \le \infty \tag{10.28}
\end{align}

Thus the noncausal Wiener filter is the time-invariant impulse response that satisfies (10.27) or (10.28). To solve these equations we only need to recognize the expression on the right as a convolution and use e.g. Fourier or Laplace (or Z) transforms. Taking Laplace transforms of (10.27) and z-transforms of (10.28) gives the following:

CT: SY X(s) = SY Y (s)Hnc(s) (10.29)

DT: SY X(z) = SY Y (z)Hnc(z) (10.30)

Solving yields for the optimal noncausal Wiener filter:

\begin{align}
\text{CT:}\quad H_{nc}(s) &= \frac{S_{YX}(s)}{S_{YY}(s)} \tag{10.31}\\
\text{DT:}\quad H_{nc}(z) &= \frac{S_{YX}(z)}{S_{YY}(z)} \tag{10.32}
\end{align}

Since we require stationarity we need the filters to be stable. The region of convergence must therefore include the jω axis in continuous time and the unit circle in discrete time. Also, if SY Y (jω) = 0 for some frequency ω, then (10.29) (or (10.30) in discrete time) shows that the corresponding value of Hnc(jω) or Hnc(e^{jω}) does not matter, and the estimate is indeterminate at this frequency.


The corresponding equation for the estimation error covariance for the Non-causal Wiener filter ΛNCWF can be obtained from the general formulas (10.18) or (10.19):

\begin{align}
\text{CT:}\quad \Lambda_{NCWF} &= K_{XX}(0) - \int_{-\infty}^{\infty} h(u)\,K_{YX}(u)\,du \tag{10.33}\\
\text{DT:}\quad \Lambda_{NCWF} &= K_{XX}(0) - \sum_{u=-\infty}^{\infty} h(u)\,K_{YX}(u) \tag{10.34}
\end{align}

Notice that these expressions are independent of time, as we might expect given the stationarity assumptions. Another expression for the estimation error covariance can be obtained by the following line of reasoning, which we elaborate for the continuous time case. First recall that the error is given by e(t) = x(t) − x̂(t), and that this error is uncorrelated with the data y(τ) and thus with the estimate x̂(t), which itself is a linear function of the data. Thus:

\begin{align}
x(t) &= \hat{x}(t) + e(t) \tag{10.35}\\
&\Downarrow \tag{10.36}\\
S_{XX}(s) &= S_{\hat{X}\hat{X}}(s) + S_{EE}(s) \tag{10.37}\\
&\Downarrow \tag{10.38}\\
S_{EE}(s) &= S_{XX}(s) - S_{\hat{X}\hat{X}}(s) \tag{10.39}
\end{align}

This expression is valid for any LLSE estimator for which the transforms are valid. Now for the case of the noncausal Wiener filter we have the following expression for the second term:

\begin{align}
S_{\hat{X}\hat{X}}(s) &= H_{nc}(s)H_{nc}(-s)S_{YY}(s) \tag{10.40}\\
&= \frac{S_{YX}(s)}{S_{YY}(s)}\,\frac{S_{YX}(-s)}{S_{YY}(s)}\,S_{YY}(s) \tag{10.41}\\
&= \frac{S_{YX}(s)S_{YX}(-s)}{S_{YY}(s)} \tag{10.42}
\end{align}

Thus, substituting (10.42) into (10.37) and solving for SEE(s) we have:

$$S_{EE}(s) = S_{XX}(s) - \frac{S_{YX}(s)S_{YX}(-s)}{S_{YY}(s)} \tag{10.43}$$

If we let s = jω we obtain:

$$S_{EE}(j\omega) = S_{XX}(j\omega) - \frac{|S_{YX}(j\omega)|^2}{S_{YY}(j\omega)} \tag{10.44}$$

This expression, though derived for the continuous time case, is also valid for the discrete time case, with appropriate adjustments to the transform definitions. Finally, we have that:

$$\Lambda_{NCWF} = R_{EE}(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} S_{EE}(j\omega)\,d\omega \tag{10.45}$$

The mean square error can thus be obtained either by finding REE(τ) as the inverse transform of SEE(jω) and then evaluating the result at τ = 0 or by directly evaluating the integral in (10.45).

Linear Observations and Additive Noise:

We now consider the important special case of a continuous-time process with linear observations and additive noise. Suppose:

y(t) = x(t) + v(t) (10.46)

where x(t) and v(t) are uncorrelated zero mean wide-sense stationary random processes. We wish to find the noncausal Wiener filter for this problem.


We start by finding the covariances KY X(t) and KY Y (t):

KY X(t) = E [y(τ) x(t + τ)] = E [( x(τ) + v(τ) ) x(t + τ)] = KXX(t) (10.47)

KY Y (t) = E [( x(τ) + v(τ) )( x(t + τ) + v(t + τ) )] = KXX(t) + KV V (t) (10.48)

Taking Laplace transforms of the expressions (10.47), (10.48) we obtain the following power spectral density relationships:

SY X(s) = SXX(s) (10.49)

SY Y (s) = SXX(s) + SV V (s) (10.50)

Using these expressions in the formula (10.31) for optimal noncausal Wiener filter yields the following filter:

$$H_{nc}(s) = \frac{S_{YX}(s)}{S_{YY}(s)} = \frac{S_{XX}(s)}{S_{XX}(s) + S_{VV}(s)} \tag{10.51}$$

Note that this expression for the filter is real, even and nonnegative, thus it indeed corresponds to a two-sided or noncausal filter impulse response h(t).

We can also obtain an expression for the power spectral density of the estimation error covariance using (10.43):

\begin{align}
S_{EE}(s) &= S_{XX}(s) - \frac{S_{YX}(s)S_{YX}(-s)}{S_{YY}(s)} = S_{XX}(s) - \frac{S_{XX}^2(s)}{S_{XX}(s) + S_{VV}(s)} = \frac{S_{XX}(s)S_{VV}(s)}{S_{XX}(s) + S_{VV}(s)} \tag{10.52}\\
&= H_{nc}(s)S_{VV}(s) \tag{10.53}
\end{align}

Before proceeding to an example let us interpret the behavior of the filter (10.51). From its form we can see that this filter does a reasonable thing. In particular, at frequencies where the power in the signal SXX(jω) is large relative to the power in the additive noise SV V (jω), the filter H(jω) ≈ 1, looks like an ideal all pass at that frequency, and thus allows the signal to pass unaltered. Conversely, at those frequencies where the power in the noise SV V (jω) is large relative to the power in the signal SXX(jω), the filter H(jω) ≈ 0 and attenuates both components. Similar results hold for the discrete-time case as well.

Example 10.1 (Single Pole Spectrum:) Let us apply the above results to a particular example. Specifically suppose x(t) and y(t) are related by (10.46) and

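\begin{align}
K_{XX}(t) &= Qe^{-\alpha|t|}, \quad \alpha > 0 \tag{10.54}\\
K_{VV}(t) &= R\,\delta(t) \tag{10.55}
\end{align}

and we want to find the noncausal Wiener filter. Taking transforms and applying the formulas (10.49), (10.50) we find:

\begin{align}
S_{XX}(s) &= \frac{2Q\alpha}{\alpha^2 - s^2} \tag{10.56}\\
S_{VV}(s) &= R \tag{10.57}\\
S_{YY}(s) &= S_{XX}(s) + S_{VV}(s) = \frac{R(\beta^2 - s^2)}{\alpha^2 - s^2}; \qquad \beta^2 = \frac{2Q\alpha}{R} + \alpha^2 \tag{10.58}
\end{align}

Now we apply (10.51) to find the optimal estimate:

\begin{align}
H_{nc}(s) &= \frac{2Q\alpha}{(\alpha^2 - s^2)}\,\frac{(\alpha^2 - s^2)}{R(\beta^2 - s^2)} = \frac{2Q\alpha}{R(\beta^2 - s^2)} \tag{10.59}\\
\Longrightarrow\quad H_{nc}(f) &= \frac{Q\alpha}{R\beta}\left[\frac{2/\beta}{1 + \left(\frac{2\pi f}{\beta}\right)^2}\right] \tag{10.60}
\end{align}

We can find the corresponding impulse response by taking the inverse transform of Hnc(s) in (10.59):

$$h_{nc}(t) = \frac{Q\alpha}{R\beta}e^{-\beta|t|} \tag{10.61}$$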

This impulse response is sketched in Figure 10.3.

[Plot: h(t)]

Figure 10.3: Impulse response of noncausal Wiener filter of example.

[Plot: Svv(jω) and Sxx(jω) versus frequency, with −α and α marked on the frequency axis]

Figure 10.4: Power spectra of signal and noise for example.

This is a single pole filter with its bandwidth equal to β. The filter looks at the power in the signal and the power in the noise at each frequency and adjusts its gain accordingly. In Figure 10.4 we show the power spectra of the signal and the noise. As R →∞, so the signal to noise ratio goes to zero, the bandwidth of Hnc(jω) approaches the bandwidth of SXX(jω) and the overall amplitude goes to zero. As R → 0, so the signal to noise ratio goes to infinity, the bandwidth of Hnc(jω) approaches infinity and the filter approaches an all pass.

Finally, we can obtain the error variance using any of the general expressions we have previously derived. For example using the time domain expression (10.33) we obtain:

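\begin{align}
\Lambda_{NCWF} &= K_{XX}(0) - \int_{-\infty}^{\infty} h_{nc}(u)\,K_{YX}(u)\,du = Q - \int_{-\infty}^{\infty} \underbrace{\frac{Q\alpha}{R\beta}e^{-\beta|u|}}_{h_{nc}(u)}\;\underbrace{Qe^{-\alpha|u|}}_{K_{YX}(u)}\,du \tag{10.62}\\
&= Q - 2Q\,\frac{Q\alpha}{R\beta}\int_{0}^{\infty} e^{-(\alpha+\beta)u}\,du \tag{10.63}\\
&= Q - \frac{2Q^2\alpha}{R\beta(\alpha+\beta)} \tag{10.64}
\end{align}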

where we have used the fact that KXX(0) = Q and KY X(u) = KXX(u). We could also use the frequency domain expression (10.43) as follows:

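\begin{align}
S_{EE}(s) &= S_{XX}(s) - \frac{S_{YX}(s)S_{YX}(-s)}{S_{YY}(s)} = S_{XX}(s) - \frac{S_{XX}^2(s)}{S_{XX}(s) + S_{VV}(s)} = H_{nc}(s)S_{VV}(s) \tag{10.65}\\
&= \frac{2Q\alpha}{\beta^2 - s^2} \tag{10.66}
\end{align}

Now taking the inverse transform we obtain:

$$R_{EE}(t) = \frac{Q\alpha}{\beta}e^{-\beta|t|} \tag{10.67}$$

Thus:

$$\Lambda_{NCWF} = R_{EE}(0) = \frac{Q\alpha}{\beta} \tag{10.68}$$

This expression looks different than (10.64), but recall that β² = 2Qα/R + α². Using this definition we can show that these two solutions are actually the same by showing their difference is zero:

\begin{align}
Q - \frac{2Q^2\alpha}{R\beta(\alpha+\beta)} - \frac{Q\alpha}{\beta} &= \frac{Q\left(R\beta(\alpha+\beta)\right) - 2Q^2\alpha}{R\beta(\alpha+\beta)} - \frac{Q\alpha\left(R(\alpha+\beta)\right)}{R\beta(\alpha+\beta)} \tag{10.69}\\
&= \frac{QR\alpha\beta + QR\beta^2 - 2Q^2\alpha - QR\alpha^2 - QR\alpha\beta}{R\beta(\alpha+\beta)} \tag{10.70}\\
&= \frac{Q\left(R\beta^2 - 2Q\alpha - R\alpha^2\right)}{R\beta(\alpha+\beta)} \tag{10.71}\\
&= \frac{Q\left(2Q\alpha + R\alpha^2 - 2Q\alpha - R\alpha^2\right)}{R\beta(\alpha+\beta)} = 0 \tag{10.72}
\end{align}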


where in the last equality we substituted in the definition for β2.
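The algebra of Example 10.1 is easy to confirm numerically. The sketch below is illustrative only: the function names `example_10_1` and `h_nc` and the parameter values Q = 2, α = 1, R = 0.5 are assumptions, not taken from the notes. It evaluates β from (10.58), the two error expressions (10.64) and (10.68), and compares the filter (10.51) on the jω axis with the closed form (10.59).

```python
import math

def example_10_1(q, alpha, r):
    """Return (beta, time-domain error (10.64), frequency-domain error (10.68))."""
    beta = math.sqrt(2 * q * alpha / r + alpha ** 2)
    err_time = q - 2 * q ** 2 * alpha / (r * beta * (alpha + beta))
    err_freq = q * alpha / beta
    return beta, err_time, err_freq

def h_nc(w, q, alpha, r):
    """Noncausal Wiener filter S_XX / (S_XX + S_VV) evaluated at s = jw."""
    s_xx = 2 * q * alpha / (alpha ** 2 + w ** 2)  # (10.56) with s = jw
    s_vv = r                                      # (10.57)
    return s_xx / (s_xx + s_vv)

q, alpha, r = 2.0, 1.0, 0.5  # illustrative values (assumptions)
beta, err_time, err_freq = example_10_1(q, alpha, r)
print("beta =", beta)
print("error from (10.64):", err_time)
print("error from (10.68):", err_freq)

# Compare (10.51) with the closed form (10.59) at a few frequencies
for w in (0.0, 1.0, 10.0):
    closed_form = 2 * q * alpha / (r * (beta ** 2 + w ** 2))
    print(f"w={w}: H_nc={h_nc(w, q, alpha, r):.6f}, closed form={closed_form:.6f}")
```

For these values β = 3 and both error expressions evaluate to 2/3. Decreasing `r` drives the gain toward 1 over a wider band and the error toward zero, while increasing `r` drives the gain toward zero and the error toward Q, consistent with the limiting behavior described for this example.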

10.4.2 Causal Wiener Filtering

The other case of Wiener filtering we will examine is when we choose the end of the observation interval to coincide with the time of the estimate Tf = t so that we produce the estimate based only on past observations y(τ), τ ∈ [−∞, t]. This case is referred to as causal Wiener filtering.

Definition 10.3 (Causal Wiener Filtering) • Find LLSE estimate of x(t) based on y(τ), τ ∈ [Ti, Tf ]

• x(t), y(t) are jointly wide-sense stationary

• Ti = −∞

• Tf = t

Since we are basing the estimate on only past values of y(τ) the filter in this case will be causal and thus realizable in real time, a property of considerable practical interest if we are to implement the filter. For this reason the corresponding filter is sometimes referred to as the realizable Wiener Filter. In particular, the optimal filter hc(t) is linear, time-invariant and has the property that hc(t) = 0 for t < 0. The optimal causal estimate will therefore be of the form:

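\begin{align}
\text{CT:}\quad \hat{x}(t) &= \int_{-\infty}^{t} h_c(t-\tau)\,y(\tau)\,d\tau \tag{10.73}\\
\text{DT:}\quad \hat{x}(t) &= \sum_{\tau=-\infty}^{t} h_c(t-\tau)\,y(\tau) \tag{10.74}
\end{align}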

Applying the general Wiener-Hopf equation (10.25) (or (10.26) in discrete time) to this case we find that the Wiener-Hopf equations for the impulse response of the optimal filter become:

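\begin{align}
\text{CT:}\quad K_{YX}(t-\tau) &= \int_{-\infty}^{t} h_c(t-\sigma)\,K_{YY}(\sigma-\tau)\,d\sigma & -\infty \le \tau \le t \tag{10.75}\\
\text{DT:}\quad K_{YX}(t-\tau) &= \sum_{\sigma=-\infty}^{t} h_c(t-\sigma)\,K_{YY}(\sigma-\tau) & -\infty \le \tau \le t \tag{10.76}
\end{align}

Now by making the change of variables t−τ = u and t−σ = v we obtain:

\begin{align}
\text{CT:}\quad K_{YX}(u) &= \int_{0}^{\infty} h_c(v)\,K_{YY}(u-v)\,dv & 0 \le u \le \infty \tag{10.77}\\
\text{DT:}\quad K_{YX}(u) &= \sum_{v=0}^{\infty} h_c(v)\,K_{YY}(u-v) & 0 \le u \le \infty \tag{10.78}
\end{align}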

From (10.77) or (10.78) we can see that the Wiener-Hopf equation is only enforced for u ≥ 0 and that it only constrains hc(t) for t ≥ 0. We have the additional constraint that hc(t) = 0 for t < 0.

These expressions seem quite similar to those obtained for the case of the noncausal Wiener filter (10.27) or (10.28) and thus we might think of using the same solution methods we used there (i.e. transform techniques). This is not the case, however, as the causality constraint makes things considerably more difficult. While the expressions on the right in (10.77) or (10.78) are still convolution integrals (or sums), KY X(u) need not equal this integral (or sum) for u < 0. Thus we cannot simply use bilateral Laplace transforms. We might think of using unilateral Laplace transforms, since the interval in question is unilateral. But KY Y (τ) in the integral (sum) is an auto-covariance, so it possesses even symmetry and thus cannot be represented using a unilateral Laplace transform. So transform techniques directly applied to (10.77) or (10.78) will not work and we are forced to seek other approaches. In the development to follow we focus on the continuous-time case. As usual, similar arguments may be made for the discrete-time case with the substitution of z-transform for Laplace transform and inside (outside) unit circle for left (right) half complex plane.

Examination of (10.77) or (10.78) shows that the Wiener-Hopf equation for this case is easy to solve if the data y(τ) are white, so that $K_{YY}(t) = \delta(t)$. We will see a similar situation in our treatment of the Kalman filter and sequential estimation. Our approach there will be to first whiten the data by finding the innovations ν(t) and then to generate an estimate based on the resulting whitened observations. This is the approach Bode and Shannon took to solving (10.77) and it is the approach we will take. The basic plan of attack is shown in Figure 10.5.

[Figure 10.5 block diagram: y(t) → Whitening Filter → ν(t), with $K_{\nu\nu}(t) = \delta(t)$ → Solution for White Noise → $\hat{x}(t)$]

Figure 10.5: Bode-Shannon Whitening Approach to Causal Wiener Filtering.

From experience we know (or might suspect) that the whitening filter will involve spectral factorization. For the overall filter to be causal each block must be causal. In particular, the choice we make in the spectral factorization step must yield a corresponding filter that is causal. Further, for there to be no loss of information the whitening filter must also be invertible and thus stable with a stable inverse. With these points in mind we will first find the causal Wiener filter for white noise (the second block in Figure 10.5), then we will find the appropriate whitening filter.

Causal Wiener Filter for White Noise

[Figure 10.6 block diagram: ν(t), with covariance $K_{\nu\nu}(t)$ → G(s) → $\hat{x}(t)$]

Figure 10.6: Wiener Filter for White Noise Observations.

Here we will consider the problem of designing the optimal causal Wiener filter for x(t) based on observation of a zero-mean, wide-sense stationary, unit-power white noise process, as depicted in Figure 10.6. To avoid confusion we will denote this white observation process as ν(t) and will denote the corresponding Wiener filter system function as G(s) and its impulse response as g(t). Since the observation process ν(t) is white we have:

Kνν(t) = δ(t) (10.79)

Substituting this expression into the Wiener-Hopf equation (10.77) (or (10.78) in discrete time) yields the following expression for the optimal filter impulse response g(t):

$K_{\nu X}(u) = \int_{0}^{\infty} g(v)\, K_{\nu\nu}(u-v)\, dv = \int_{0}^{\infty} g(v)\, \delta(u-v)\, dv = g(u), \quad 0 \le u < \infty$ (10.80)

The optimal causal Wiener filter for white noise is thus given by:

$g(t) = \begin{cases} K_{\nu X}(t), & t \ge 0 \\ 0, & t < 0 \end{cases}$ (10.81)

$\phantom{g(t)} = K_{\nu X}(t)\, u_{-1}(t)$ (10.82)

where u−1(t) is the unit step. The causal Wiener filter for white noise observations is given by the causal (i.e. nonnegative time) part of the cross-covariance between x(t) and the observations.

For its use later, let us introduce the following notation for the positive and negative time portions of signals:

$\{K_{\nu X}(t)\}_+ \equiv K_{\nu X}(t)\, u_{-1}(t)$ (10.83)

$\{K_{\nu X}(t)\}_- \equiv K_{\nu X}(t)\, u_{-1}(-t)$ (10.84)

Thus we can always decompose KνX(t) into its positive time and negative time components as:

KνX(t) = {KνX(t)}+ + {KνX(t)}− (10.85)


A similar decomposition can obviously be applied to any time function. Let us also introduce a similar notation for the bilateral Laplace transforms of these positive and negative time components:

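$\{S_{\nu X}(s)\}_+ \equiv \int_{0^-}^{\infty} K_{\nu X}(t)\, e^{-st}\, dt \;\longleftrightarrow\; K_{\nu X}(t)\, u_{-1}(t) \equiv \{K_{\nu X}(t)\}_+$ (10.86)

$\{S_{\nu X}(s)\}_- \equiv \int_{-\infty}^{0^-} K_{\nu X}(t)\, e^{-st}\, dt \;\longleftrightarrow\; K_{\nu X}(t)\, u_{-1}(-t) \equiv \{K_{\nu X}(t)\}_-$ (10.87)

$S_{\nu X}(s) = \int_{-\infty}^{\infty} K_{\nu X}(t)\, e^{-st}\, dt = \{S_{\nu X}(s)\}_+ + \{S_{\nu X}(s)\}_-$ (10.88)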

For example, {SνX(s)}+ is the bilateral Laplace transform of the positive time portion of KνX(t). Note that terms at t = 0 (e.g. impulses at t = 0) are included in the definition of {KνX(t)}+. Based on this notation the optimal causal Wiener filter for white noise observations is given by:

{KνX(t)}+ = g(t) ←→ G(s) = {SνX(s)}+ (10.89)

The relationship between these quantities is shown in Figure 10.7. If we possess an expression for $K_{\nu X}(t)$ then finding its positive time portion is straightforward, and the filter system function G(s) may then be obtained as the Laplace transform of this quantity, as indicated in the figure. A natural question is whether we may find G(s) directly from knowledge of the cross-power spectral density $S_{\nu X}(s)$. When $S_{\nu X}(s)$ is rational we can indeed find G(s) directly in the transform domain through use of a partial fraction expansion.

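[Figure 10.7 diagram: time-domain row $K_{\nu X}(t) \to g(t) = \{K_{\nu X}(t)\}_+$; transform-domain row $S_{\nu X}(s) \to G(s) = \{S_{\nu X}(s)\}_+$; the two rows are linked column by column through the bilateral Laplace transform (BLT)]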

Figure 10.7: Relationship between time domain and Laplace domain quantities for the Causal Wiener Filter for White Noise Observations.

To understand the relationship between the bilateral Laplace transform of a function and the bilateral Laplace transform of its positive time part when the transform is rational, consider a general time function f(t) with rational Laplace transform F(s). Being rational, F(s) possesses a partial fraction expansion of the form:

$F(s) = \sum_{i=1}^{m} \sum_{k=1}^{k_i} \frac{A_{ik}}{(s-s_i)^k}$ (10.90)

This partial fraction expansion represents F(s) as a sum of terms. The inverse Laplace transform of F(s) will thus be composed of the sum of the inverse transforms of each term. Recall that the expression (10.90) alone does not uniquely specify a time function; a corresponding region of convergence (ROC) must also always be specified. For stability of the associated time function, the ROC associated with each term must include the jω axis, as shown in Figure 10.8. The ROC depends on the poles, not the zeros.

Now recall that right going or positive time signals have right going ROCs and left going or negative time signals have left going ROCs. Thus the terms in the sum (10.90) corresponding to the positive-time portions of f(t) are those corresponding to the left-half plane poles, while the terms (10.90) corresponding to negative-time parts of f(t) are those corresponding to the right-half plane poles (note that the zeros play no direct role in this discussion). Thus we have:

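$\{F(s)\}_+ = \sum_{\substack{i=1 \\ \mathrm{Re}(s_i)<0}}^{m} \sum_{k=1}^{k_i} \frac{A_{ik}}{(s-s_i)^k}$ (10.91)

$\{F(s)\}_- = \sum_{\substack{i=1 \\ \mathrm{Re}(s_i)>0}}^{m} \sum_{k=1}^{k_i} \frac{A_{ik}}{(s-s_i)^k}$ (10.92)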


Figure 10.8: Pole-zero plot and associated regions of convergence.

If F(s) is not strictly proper, so the partial fraction expansion includes nonnegative powers of s, these are included in the definition of {F(s)}+ since, by definition, we include terms at t = 0 in the definition of {f(t)}+. In summary, we can directly find the bilateral Laplace transforms of the positive and negative time portions of the function f(t) from knowledge of its rational transform F(s) via partial fraction expansion. A similar argument can be made for the discrete time case with the inside and outside of the unit circle playing the role of the left and right half plane, respectively. We summarize these points for continuous signals and Laplace transforms below:

• A stable time function has an ROC that contains the jω axis and vice versa.

• A right-sided time function possesses a right-sided ROC and vice versa.

• A two-sided time function has a bounded ROC (i.e. the ROC is a strip) and vice versa.

• The right-sided part of a two-sided signal corresponds to the left-half plane poles.

• To get the right-sided part of a two-sided signal, perform partial fraction expansion of the transform and keep the poles with the right-sided ROCs together with any nonnegative powers of s (corresponding to impulses and their derivatives at t = 0).

• The ROC, and thus the stability and left/right sidedness of a signal, depends on the poles, not the zeros, of its transform.

For discrete-time signals, a similar set of conditions applies to the z-transform with the role of the jω axis replaced by the unit circle, and left and right s-plane replaced by inside and outside unit circle. Thus for discrete-time signals and z-transforms we have:

• A stable discrete-time function has an ROC that contains the unit circle and vice versa.

• A right-sided discrete-time function possesses an outward going ROC and vice versa.

• A two-sided discrete-time function has a bounded ROC (i.e. the ROC is an annulus) and vice versa.

• The right-sided part of a two-sided discrete-time signal corresponds to the poles inside the unit circle.

• To get the right-sided part of a two-sided discrete-time signal, perform partial fraction expansion of the transform and keep the poles with the outward-going ROCs.

• The ROC, and thus the stability and left/right sidedness of a signal, depends on the poles, not the zeros, of its transform.

Some examples serve to illustrate these developments for the continuous-time case.

Example 10.2 Suppose f(t) is given by:

$f(t) = e^{-at}\, u_{-1}(t)$ (10.93)

where a > 0. This is a right-going, stable signal. The bilateral Laplace transform is given by:

$F(s) = \frac{1}{s+a}$ (10.94)

with ROC Re(s) > −a. A sketch of f(t), the pole-zero plot, and the corresponding ROC is shown in Figure 10.9.

[Figure 10.9 sketch: $f(t) = e^{-at} u_{-1}(t)$, and the s-plane (axes Re(s), Im(s)) with a pole at s = −a]

Figure 10.9: Function f(t), the pole-zero plot, and the corresponding ROC.

Example 10.3 Suppose f(t) is now given by:

$f(t) = -e^{-at}\, u_{-1}(-t)$ (10.95)

where a > 0, so it is a left-going, unstable signal. The corresponding bilateral Laplace transform is given by:

$F(s) = \frac{1}{s+a}$ (10.96)

with ROC Re(s) < −a. A sketch of f(t), the pole-zero plot, and the corresponding ROC is shown in Figure 10.10.

[Figure 10.10 sketch: $f(t) = -e^{-at} u_{-1}(-t)$, and the s-plane (axes Re(s), Im(s)) with a pole at s = −a]

Figure 10.10: Function f(t), the pole-zero plot, and the corresponding ROC.

Example 10.4 Suppose f(t) is now given by:

$f(t) = \underbrace{A e^{-at}\, u_{-1}(t)}_{\{f(t)\}_+} \; \underbrace{-\, B e^{bt}\, u_{-1}(-t)}_{\{f(t)\}_-}$ (10.97)

where a > 0, b > 0, so it is a two-sided, stable signal. The corresponding bilateral Laplace transform is given by:

$F(s) = \underbrace{\frac{A}{s+a}}_{\{F(s)\}_+} + \underbrace{\frac{B}{s-b}}_{\{F(s)\}_-} = (A+B)\, \frac{s + (aB - Ab)/(A+B)}{(s+a)(s-b)}$ (10.98)

with ROC −a < Re(s) < b. A sketch of f(t), the pole-zero plot, and the corresponding ROC is shown in Figure 10.11.

Example 10.5 Suppose f(t) is given by:

$f(t) = e^{-a|t|} = \underbrace{e^{-at}\, u_{-1}(t)}_{\{f(t)\}_+} + \underbrace{e^{at}\, u_{-1}(-t)}_{\{f(t)\}_-}$ (10.99)

where a > 0. Now, starting with the bilateral Laplace transform and performing a partial fraction expansion:

$F(s) = \frac{2a}{a^2 - s^2} = \underbrace{\frac{1}{s+a}}_{\text{LHP pole}} - \underbrace{\frac{1}{s-a}}_{\text{RHP pole}}$ (10.100)

with ROC −a < Re(s) < a. A sketch of f(t), the pole-zero plot, and the corresponding ROC is shown in Figure 10.12. Now we have that:

$\{F(s)\}_+ = \frac{1}{s+a} \;\longleftrightarrow\; e^{-at}\, u_{-1}(t) = \{f(t)\}_+$ (10.101)

[Figure 10.11 sketch: $A e^{-at} u_{-1}(t)$ and $-B e^{bt} u_{-1}(-t)$, and the s-plane (axes Re(s), Im(s)) with poles at s = −a and s = b]

Figure 10.11: Function f(t), the pole-zero plot, and the corresponding ROC.

[Figure 10.12 sketch: $e^{-at} u_{-1}(t)$ and $e^{at} u_{-1}(-t)$, and the s-plane (axes Re(s), Im(s)) with poles at s = −a and s = a]

Figure 10.12: Function f(t), the pole-zero plot, and the corresponding ROC.
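The pole-selection rule in (10.91) can be checked numerically on the transform of Example 10.5. The sketch below is a minimal illustration that assumes a strictly proper F(s) with distinct simple poles, written as numerator(s) divided by the product of (s − p_i); the function names are hypothetical and repeated poles or polynomial terms are not handled.

```python
def simple_pole_residues(numerator, poles):
    """Residues of F(s) = numerator(s) / prod_i (s - p_i), distinct simple poles."""
    residues = []
    for i, p in enumerate(poles):
        denom = 1.0 + 0j
        for j, q in enumerate(poles):
            if j != i:
                denom *= (p - q)
        residues.append(numerator(p) / denom)
    return residues


def positive_time_part(numerator, poles):
    """Return {F(s)}+ as a callable plus the kept (residue, pole) terms.

    Only partial fraction terms with left-half-plane poles are kept,
    which is the stable, right-sided part of the time function.
    """
    residues = simple_pole_residues(numerator, poles)
    kept = [(r, p) for r, p in zip(residues, poles) if p.real < 0]

    def f_plus(s):
        return sum(r / (s - p) for r, p in kept)

    return f_plus, kept


a = 3.0
# F(s) = 2a / (a^2 - s^2) = -2a / ((s + a)(s - a)), poles at -a and +a
f_plus, kept_terms = positive_time_part(lambda s: -2.0 * a,
                                        [complex(-a), complex(a)])
print(kept_terms)
```

Only the term with the pole at s = −a survives, with unit residue, in agreement with (10.101).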

Example 10.6 Now we consider a case with terms at the origin. Suppose f(t) is given by:

$f(t) = \delta(t) + e^{-a|t|} = \underbrace{\delta(t) + e^{-at}\, u_{-1}(t)}_{\{f(t)\}_+} + \underbrace{e^{at}\, u_{-1}(-t)}_{\{f(t)\}_-}$ (10.102)

where again a > 0. Starting with the bilateral Laplace transform and performing a partial fraction expansion:

$F(s) = 1 + \frac{2a}{a^2 - s^2} = \underbrace{1 + \frac{1}{s+a}}_{\text{LHP pole plus nonnegative powers of } s} - \underbrace{\frac{1}{s-a}}_{\text{RHP pole}}$ (10.103)

$\{F(s)\}_+ = 1 + \frac{1}{s+a} \;\longleftrightarrow\; \delta(t) + e^{-at}\, u_{-1}(t) = \{f(t)\}_+$ (10.104)

Example 10.7 Next we consider a non-rational example. Suppose f(t) is given by:

$f(t) = e^{-a(t+T)}\, u_{-1}(t+T) \;\longleftrightarrow\; F(s) = \frac{e^{sT}}{s+a}$ (10.105)

where again a > 0. Unfortunately, F(s) is not a rational function of s, so partial fraction techniques will not help. Let us examine this example in more detail. There are two cases to consider, depending on the sign of T.

(a) Suppose T > 0. This case is shown in Figure 10.13(a). In this case:

$\{f(t)\}_+ = e^{-a(t+T)}\, u_{-1}(t+T)\, u_{-1}(t) = e^{-a(t+T)}\, u_{-1}(t) = e^{-aT} e^{-at}\, u_{-1}(t) \;\longleftrightarrow\; \frac{e^{-aT}}{s+a} = \{F(s)\}_+$ (10.106)

(b) Now suppose T < 0. This case is shown in Figure 10.13(b). In this case:

$\{f(t)\}_+ = e^{-a(t+T)}\, u_{-1}(t+T)\, u_{-1}(t) = e^{-a(t+T)}\, u_{-1}(t+T) = f(t) \;\longleftrightarrow\; F(s) = \frac{e^{sT}}{s+a} = \{F(s)\}_+$ (10.107)

Thus we see that there are no simple formulas relating F(s) and $\{F(s)\}_+$ in the non-rational case and we are forced to apply the definitions and work in the time domain.
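Because the non-rational case must be handled from the definition, a direct numerical evaluation of the positive-time transform is a useful cross-check of (10.106) and (10.107). The following is a minimal sketch using a midpoint rule at one real value of s; the parameter values a = 2, s = 1, T = ±0.5, the truncation point, and the function names are all assumptions chosen for illustration.

```python
import math


def positive_time_transform(f, s, t_max=20.0, step=1e-3):
    """Midpoint-rule approximation of integral_0^t_max f(t) exp(-s t) dt (real s)."""
    n = int(round(t_max / step))
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * step
        total += f(t) * math.exp(-s * t)
    return total * step


def shifted_exponential(a, shift):
    """f(t) = exp(-a (t + T)) u(t + T), with T = shift."""
    return lambda t: math.exp(-a * (t + shift)) if t + shift >= 0 else 0.0


a, s = 2.0, 1.0
advance = positive_time_transform(shifted_exponential(a, 0.5), s)   # T > 0
delay = positive_time_transform(shifted_exponential(a, -0.5), s)    # T < 0

closed_advance = math.exp(-a * 0.5) / (s + a)   # (10.106): e^{-aT} / (s + a)
closed_delay = math.exp(s * -0.5) / (s + a)     # (10.107): e^{sT} / (s + a)
print(advance, closed_advance)
print(delay, closed_delay)
```

For T > 0 the part of f(t) before t = 0 is discarded, so the result differs from F(s); for T < 0 nothing is discarded and the positive-time transform coincides with F(s).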

[Figure 10.13 sketch, two panels: (a) f(t) starting at t = −T, to the left of t = 0; (b) f(t) starting at t = −T, to the right of t = 0]

Figure 10.13: Plot of f(t) for T > 0 and T < 0.

Causal Whitening Filter

In the previous subsection we found the optimal causal Wiener filter for the case of white noise observations. In this white noise case the causal Wiener filter has a particularly simple form. Unfortunately, in practice our observations are usually not white and we must therefore whiten them. This is precisely the purpose of a whitening filter, as illustrated in Figure 10.14. We have already seen a closely related process of spectral shaping, wherein white noise is passed through a linear time-invariant system to achieve a desired spectral shape. Here we desire the reverse of this procedure.

[Figure 10.14 block diagram: y(t), with spectrum $S_{YY}(s)$ → W(s) → ν(t), with $S_{\nu\nu}(s) = W(s)W(-s)S_{YY}(s) = 1$]

Figure 10.14: Whitening Filter W(s).

In particular, we desire a linear time-invariant filter with system function W(s) that will take a zero-mean wide-sense stationary random process y(t) with spectrum $S_{YY}(s)$ and turn it into a zero-mean wide-sense stationary unit-spectrum white noise process ν(t), termed the innovations because it contains the unpredictable information in the data. Our idea is that if we can find such a filter, we can use it to whiten our observations and then pass the resulting whitened observations through G(s), the causal Wiener filter for white noise we found in the preceding subsection. To be useful for our estimation problem, however, we require a number of additional constraints on W(s) over and above its ability to generate uncorrelated outputs. First, if the processes are to be wide-sense stationary, W(s) must be stable. Next, if the overall cascade of W(s) and G(s) is to be causal, W(s) itself must be causal. Further, if the estimate based on the whitened observations ν(t) is to be the same as the estimate based on the original observations, then the whitening transformation must be invertible. This requires that W(s) be causally invertible as well.

From Figure 10.14 it is clear that the filter we seek must satisfy:

W(s)W(−s)SY Y (s) = 1 (10.108)

where W(s) is stable, causal, and causally invertible. For general processes and spectral density functions $S_{YY}(s)$, finding such a filter can be difficult. If $S_{YY}(s)$ is a rational power spectral density, however, then a straightforward approach exists based on spectral factorization of the spectrum of the observation process $S_{YY}(s)$. Let us thus focus on this case of rational $S_{YY}(s)$. Since $S_{YY}(s)$ is the spectral density of a real-valued random process it possesses certain symmetry properties. In particular, $S_{YY}(j\omega)$ must be a finite, real-valued, even function of ω. These symmetry properties constrain the behavior of the poles and zeros of $S_{YY}(s)$. In particular, for rational spectral densities, they imply that $S_{YY}(s)$ is the ratio of two polynomials in $s^2$. Thus if $s = \sigma_i$ is a zero of $S_{YY}(s)$ then $s = -\sigma_i$ must also be a zero, and if $s = p_i$ is a pole of $S_{YY}(s)$ then $s = -p_i$ must also be a pole. Further, since $S_{YY}(j\omega)$ is a real-valued function, these poles and zeros must occur in complex conjugate pairs. Finally, $S_{YY}(s)$ can have no poles on the jω axis, and any zero on the jω axis must occur with even multiplicity. These symmetry properties are illustrated in Figure 10.15.

Given these symmetry properties we can always write any rational system function SY Y (s) in the following form:

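$S_{YY}(s) = M\, \frac{\prod_i (s-\sigma_i)(-s-\sigma_i)}{\prod_i (s-p_i)(-s-p_i)} = \left[ \sqrt{M}\, \frac{\prod_i (s-\sigma_i)}{\prod_i (s-p_i)} \right] \left[ \sqrt{M}\, \frac{\prod_i (-s-\sigma_i)}{\prod_i (-s-p_i)} \right]$ (10.109)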


Figure 10.15: Illustration of pole-zero symmetry properties of SY Y (s).

where pi are the left-half plane poles and σi are the left-half plane zeros, and M is a positive constant. The first factor in (10.109) is composed of just the left-half poles and zeros while the second factor in (10.109) is composed of only the right-half poles and zeros. These two factors are the key to our solution so we define some special notation for them:

$S^+_{YY}(s) = \left[ \sqrt{M}\, \frac{\prod_i (s-\sigma_i)}{\prod_i (s-p_i)} \right]$ = left-half plane poles and zeros of $S_{YY}(s)$ (10.110)

$S^-_{YY}(s) = \left[ \sqrt{M}\, \frac{\prod_i (-s-\sigma_i)}{\prod_i (-s-p_i)} \right] = S^+_{YY}(-s)$ = right-half plane poles and zeros of $S_{YY}(s)$ (10.111)

$S_{YY}(s) = S^+_{YY}(s)\, S^-_{YY}(s)$ (10.112)

Notice that the $S^+_{YY}(s)$ term is causal and causally invertible, since both its poles and zeros are in the left half plane by construction. This decomposition of a rational spectrum, in this case $S_{YY}(s)$, into a causal and causally invertible factor $S^+_{YY}(s)$ and its mirror image $S^-_{YY}(s)$, related as above, is termed spectral factorization.

From the above discussion it should be clear that this process of spectral factorization will provide us with our desired whitening filter W(s). In particular, from (10.108) and (10.110)–(10.112), together with the properties of the factors involved, we can see that the desired causal and causally invertible whitening filter can be obtained as:

$W(s) = \frac{1}{S^+_{YY}(s)}$ (10.113)

The Overall Causal Wiener Filter

We are now ready to assemble the pieces of the general causal Wiener filter solution. To summarize, we have found a causal and causally invertible whitening filter, so the LLSE estimate based on y(τ) is equivalent to the one based on the whitened output ν(τ). In addition, we have found the optimal causal Wiener filter for white noise. The overall filter is found by combining these two pieces as shown in Figure 10.16.

[Figure 10.16 block diagram: y(t) → W(s) → ν(t), with $S_{\nu\nu}(s) = 1$ → G(s) → $\hat{x}(t)$; the cascade of W(s) and G(s) forms H(s)]

Figure 10.16: Overall causal Wiener Filter.

Now recall that the causal Wiener filter for white noise G(s) was given by:

G(s) = {SνX(s)}+ (10.114)


The innovation process ν(t) is the output of a linear time-invariant system with system function W(s) and input y(t), thus using our relationships for random processes through linear systems we find:

$S_{\nu X}(s) = W(-s)\, S_{YX}(s) = \frac{S_{YX}(s)}{S^+_{YY}(-s)} = \frac{S_{YX}(s)}{S^-_{YY}(s)}$ (10.115)

where in the last equality we have used the fact that $S^+_{YY}(-s) = S^-_{YY}(s)$. Now we can combine (10.115) and (10.114) to get the following expression for G(s):

$G(s) = \left\{ \frac{S_{YX}(s)}{S^-_{YY}(s)} \right\}_+$ (10.116)

Finally we can combine the whitening filter W(s) and the causal Wiener filter for the white noise case G(s) to obtain the overall causal Wiener filter Hc(s):

$H_c(s) = W(s)\, G(s) = \underbrace{\frac{1}{S^+_{YY}(s)}}_{\text{Whitening Filter}} \; \underbrace{\left\{ \frac{S_{YX}(s)}{S^-_{YY}(s)} \right\}_+}_{\text{CWF for Innovations}}$ (10.117)

We summarize this solution in Figure 10.17. Compare this result to that obtained for the noncausal Wiener filter (10.31). Before moving on, a word of caution is in order regarding the notation used in (10.117): the terms $S^+_{YY}(s)$ and $S^-_{YY}(s)$ are the spectral factors of $S_{YY}(s)$, while $\{S_{YX}(s)/S^-_{YY}(s)\}_+$ is the positive time part of $S_{YX}(s)/S^-_{YY}(s)$. Be careful not to confuse these! Terms like $S^+_{YY}(s)$ are found by spectral factorization, while terms like $\{S_{YX}(s)/S^-_{YY}(s)\}_+$ are found via partial fraction expansion.

[Figure 10.17 block diagram: y(t) → $1/S^+_{YY}(s)$ → ν(t), with $S_{\nu\nu}(s) = 1$ → $\{S_{YX}(s)/S^-_{YY}(s)\}_+$ → $\hat{x}(t)$; the cascade of the two blocks forms H(s)]

Figure 10.17: Summary of Causal Wiener Filter.

Finally we need to specify the corresponding estimation error variance or mean square error. We have three methods we can use. First we use a direct application of the general formula (10.18) to the case at hand to obtain:

$\Lambda_{CWF} = K_{XX}(0) - \int_{0}^{\infty} h_c(\tau)\, K_{YX}(\tau)\, d\tau$ (10.118)

where $h_c(\tau)$ is the impulse response corresponding to the overall causal Wiener filter $H_c(s)$ given in (10.117).

The second expression for the mean square error is based on the following reasoning. As we have argued, the optimal causal LLSE estimate $\hat{x}(t)$ based on either the original observations y(τ) or the innovations ν(τ) is the same. Thus the error variance based on either set of observations is the same, and we can equally well apply the formula (10.118) to the innovation-based filter g(τ) applied to the innovations ν(τ) to obtain:

$\Lambda_{CWF} = K_{XX}(0) - \int_{0}^{\infty} g(\tau)\, K_{\nu X}(\tau)\, d\tau$ (10.119)

where g(t) is the impulse response of the causal Wiener filter for the innovations. Now from (10.89) and (10.116), g(t) is given by:

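$g(t) = K_{\nu X}(t)\, u_{-1}(t) \;\longleftrightarrow\; G(s) = \{S_{\nu X}(s)\}_+ = \left\{ \frac{S_{YX}(s)}{S^-_{YY}(s)} \right\}_+$ (10.120)

Using these relationships we get the equivalent expression for the estimation error variance:

$\Lambda_{CWF} = K_{XX}(0) - \int_{0}^{\infty} K_{\nu X}^2(\tau)\, d\tau = K_{XX}(0) - \int_{0}^{\infty} g^2(\tau)\, d\tau$ (10.121)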

where the impulse response g(t) can be obtained as the inverse bilateral Laplace transform of G(s) as specified in (10.120).

Finally, we can obtain a frequency domain expression for the error following the line of argument associated with equations (10.35)-(10.37). In particular we have:

SEE(s) = SXX(s) −Sx̂x̂(s) (10.122)

But, for the causal Wiener filter we obtain for the second term:

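$S_{\hat{x}\hat{x}}(s) = H_c(s)\, H_c(-s)\, S_{YY}(s) = W(s)\, G(s)\, W(-s)\, G(-s)\, S_{YY}(s)$ (10.123)

$\phantom{S_{\hat{x}\hat{x}}(s)} = \frac{S_{YY}(s)}{S^+_{YY}(s)\, S^+_{YY}(-s)}\, G(s)\, G(-s) = \frac{1}{S^+_{YY}(s)\, S^-_{YY}(s)}\, S_{YY}(s)\, G(s)\, G(-s)$ (10.124)

$\phantom{S_{\hat{x}\hat{x}}(s)} = G(s)\, G(-s)$ (10.125)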

Thus, for the causal Wiener filter we get the following expression for the power spectral density of the error:

SEE(s) = SXX(s) −G(s)G(−s) (10.126)

If we let s = jω we obtain:

SEE(jω) = SXX(jω) −|G(jω)|2 (10.127)

This expression, though derived for the continuous time case, is also valid for the discrete time case, with appropriate adjustments to the transform definitions. Finally, we have that:

$\Lambda_{CWF} = R_{EE}(0) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S_{EE}(j\omega)\, d\omega$ (10.128)

The mean square error can thus be obtained either by finding REE(τ) as the inverse transform of SEE(jω) and then evaluating the result at τ = 0 or by directly evaluating the integral in (10.128).

Now let us consider some examples.

Example 10.8 Suppose the underlying process x(t) is zero mean, wide-sense stationary with $K_{XX}(t) = Q e^{-\alpha|t|}$ and suppose we observe:

y(t) = x(t) + v(t) (10.129)

where v(t) is a zero mean, wide-sense stationary white noise process, uncorrelated with x(t) and with covariance function KV V (t) = Rδ(t). We wish to find the causal Wiener filter for this problem. We first find SY Y (s) as a function of SXX(s) and SV V (s):

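$S_{XX}(s) = \frac{2\alpha Q}{\alpha^2 - s^2}$ (10.130)

$S_{VV}(s) = R$ (10.131)

$\Longrightarrow S_{YY}(s) = S_{XX}(s) + S_{VV}(s)$ (10.132)

$\phantom{\Longrightarrow S_{YY}(s)} = \frac{2\alpha Q}{\alpha^2 - s^2} + R = R \left( \frac{\beta^2 - s^2}{\alpha^2 - s^2} \right), \qquad \beta = \left( \frac{2Q\alpha}{R} + \alpha^2 \right)^{1/2}$ (10.133)

$\phantom{\Longrightarrow S_{YY}(s)} = \underbrace{\left\{ R^{1/2}\, \frac{s+\beta}{s+\alpha} \right\}}_{S^+_{YY}(s)} \; \underbrace{\left\{ R^{1/2}\, \frac{\beta-s}{\alpha-s} \right\}}_{S^-_{YY}(s)}$ (10.134)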

Thus we find for the whitening filter W(s):

$W(s) = \frac{1}{S^+_{YY}(s)} = R^{-1/2}\, \frac{s+\alpha}{s+\beta}$ (10.135)

Now we need to find G(s), the causal Wiener filter for the innovations. To do this we will need SY X(s):

$S_{YX}(s) = S_{XX}(s) = \frac{2Q\alpha}{\alpha^2 - s^2}$ (10.136)


where we have used the fact that x(t) and v(t) are uncorrelated so KY X(t) = KXX(t). Now we can find G(s):

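$G(s) = \left\{ \frac{S_{YX}(s)}{S^-_{YY}(s)} \right\}_+$ (10.137)

$\phantom{G(s)} = \left\{ \left( \frac{2Q\alpha}{\alpha^2 - s^2} \right) \left( R^{-1/2}\, \frac{\alpha - s}{\beta - s} \right) \right\}_+$ (10.138)

$\phantom{G(s)} = \left\{ \frac{a}{s+\alpha} + \frac{b}{s-\beta} \right\}_+, \qquad a = \frac{2Q\alpha R^{-1/2}}{\alpha+\beta}, \quad b = -\frac{2Q\alpha R^{-1/2}}{\alpha+\beta}$ (10.139)

$\phantom{G(s)} = \frac{a}{s+\alpha}$ (10.140)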

where we have identified the first term as the positive time part of the signal. Overall we obtain for Hc(s):

Hc(s) = W(s)G(s) (10.141)

$\phantom{H_c(s)} = \frac{2Q\alpha}{R(\alpha+\beta)} \left( \frac{1}{s+\beta} \right)$ (10.142)

Taking inverse transforms we obtain for the corresponding causal Wiener filter impulse response:

$h_c(t) = \frac{2Q\alpha}{R(\alpha+\beta)}\, e^{-\beta t}\, u_{-1}(t)$ (10.143)

Finally we find the associated estimation error covariance:

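$K_{XX}(\tau) = Q e^{-\alpha|\tau|} \;\longrightarrow\; K_{XX}(0) = Q$ (10.144)

$K_{YX}(\tau) = K_{XX}(\tau) = Q e^{-\alpha|\tau|}$ (10.145)

$\Longrightarrow \Lambda_{CWF} = K_{XX}(0) - \int_{0}^{\infty} h_c(\tau)\, K_{YX}(\tau)\, d\tau$ (10.146)

$\phantom{\Longrightarrow \Lambda_{CWF}} = Q - \int_{0}^{\infty} \frac{2Q\alpha}{R(\alpha+\beta)}\, e^{-\beta\tau}\, Q e^{-\alpha|\tau|}\, d\tau$ (10.147)

$\phantom{\Longrightarrow \Lambda_{CWF}} = Q - \frac{2Q^2\alpha}{R(\alpha+\beta)^2} \left. \left( -e^{-(\alpha+\beta)\tau} \right) \right|_{0}^{\infty}$ (10.148)

$\phantom{\Longrightarrow \Lambda_{CWF}} = Q - \frac{2Q^2\alpha}{R(\alpha+\beta)^2}$ (10.149)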

We can compare this mean square error to that obtained when the noncausal Wiener filter is used:

$\Lambda_{NCWF} = Q - \frac{2Q^2\alpha}{R\beta(\alpha+\beta)}$ (10.150)

Note that the noncausal Wiener filter achieves a lower error variance compared to the causal Wiener filter. Indeed this must be the case, since the noncausal Wiener filter uses all the data and the causal Wiener filter only part of the data to generate its estimate. In general, this observation is true.
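The closed-form results of Example 10.8 are easy to check numerically. The sketch below evaluates β, the filter gain, and both error variances for given Q, R, and α, and verifies the whitening condition (10.108) at a single frequency. The parameter values Q = R = α = 1 and the function names are assumptions made only for illustration.

```python
import math


def example_10_8(q, r, alpha):
    """Return beta, the gain of h_c(t), and the causal / noncausal MSE."""
    beta = math.sqrt(2.0 * q * alpha / r + alpha ** 2)
    gain = 2.0 * q * alpha / (r * (alpha + beta))  # h_c(t) = gain e^{-beta t} u(t)
    mse_causal = q - 2.0 * q ** 2 * alpha / (r * (alpha + beta) ** 2)
    mse_noncausal = q - 2.0 * q ** 2 * alpha / (r * beta * (alpha + beta))
    return beta, gain, mse_causal, mse_noncausal


def whitened_spectrum(q, r, alpha, omega):
    """W(jw) W(-jw) S_YY(jw), which should equal 1 at every frequency."""
    beta = example_10_8(q, r, alpha)[0]
    s = 1j * omega
    s_yy = 2.0 * alpha * q / (alpha ** 2 - s ** 2) + r
    w = (s + alpha) / (math.sqrt(r) * (s + beta))
    # For real coefficients W(-jw) is the complex conjugate of W(jw).
    return (w * w.conjugate() * s_yy).real


beta, gain, mse_causal, mse_noncausal = example_10_8(q=1.0, r=1.0, alpha=1.0)
print(beta, gain, mse_causal, mse_noncausal)
print(whitened_spectrum(1.0, 1.0, 1.0, omega=0.7))
```

Since β > α, the term subtracted in (10.150) is larger than the one in (10.149), so the noncausal error variance is the smaller of the two for any positive Q, R, and α.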

Example 10.9 Now let us consider a discrete-time example. Suppose the underlying process x(t) is a zero mean, wide-sense stationary first order autoregressive process:

x(t + 1) = 0.8x(t) + w(t), KWW (t) = 0.36δ(t) (10.151)

Using our results for first-order autoregressive processes we find that this process has the following autocovariance function:

$K_{XX}(t) = 0.8^{|t|}$ (10.152)

Suppose we observe:

y(t) = x(t) + v(t), KV V (t) = δ(t) (10.153)

where v(t) is a zero mean, wide-sense stationary white noise process, uncorrelated with x(t) and with the given covariance function. We wish to find the causal Wiener filter for this problem. First let us find the whitening filter $W(z) = 1/S^+_{YY}(z)$. We will need to find $S_{YY}(z)$ as a function of $S_{XX}(z)$ and $S_{VV}(z)$. Using the fact that x(t) and v(t) are uncorrelated:

$S_{YY}(z) = S_{XX}(z) + S_{VV}(z) = \frac{0.36}{(1 - 0.8z)(1 - 0.8z^{-1})} + 1$ (10.154)

$\phantom{S_{YY}(z)} = \underbrace{\left\{ \sqrt{8/5}\, \frac{1 - \frac{1}{2} z}{1 - 0.8z} \right\}}_{S^-_{YY}(z)} \; \underbrace{\left\{ \sqrt{8/5}\, \frac{1 - \frac{1}{2} z^{-1}}{1 - 0.8z^{-1}} \right\}}_{S^+_{YY}(z)}$ (10.155)


Thus we find for the whitening filter W(z):

$W(z) = \frac{1}{S^+_{YY}(z)} = \sqrt{5/8} \left( \frac{1 - 0.8z^{-1}}{1 - \frac{1}{2} z^{-1}} \right)$ (10.156)

Now we need to find G(z), the causal Wiener filter for the innovations. To do this we will need SY X(z):

$S_{YX}(z) = S_{XX}(z) = \frac{0.36}{(1 - 0.8z)(1 - 0.8z^{-1})}$ (10.157)

where we have used the fact that x(t) and v(t) are uncorrelated so KY X(t) = KXX(t). Now we can find G(z):

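$G(z) = \left\{ \frac{S_{YX}(z)}{S^-_{YY}(z)} \right\}_+$ (10.158)

$\phantom{G(z)} = \left\{ \left( \frac{0.36}{(1 - 0.8z)(1 - 0.8z^{-1})} \right) \left( \frac{\sqrt{5/8}\, (1 - 0.8z)}{1 - \frac{1}{2} z} \right) \right\}_+ = \left\{ \frac{0.36 \sqrt{5/8}}{\left(1 - \frac{1}{2} z\right)(1 - 0.8z^{-1})} \right\}_+$ (10.159)

$\phantom{G(z)} = \left\{ \frac{-0.72 \sqrt{5/8}\, z^{-1}}{(1 - 2z^{-1})(1 - 0.8z^{-1})} \right\}_+ = \left\{ \frac{-(18/25) \sqrt{5/8}\, z^{-1}}{(1 - 2z^{-1})(1 - 0.8z^{-1})} \right\}_+$ (10.160)

$\phantom{G(z)} = \left\{ \frac{A}{1 - 2z^{-1}} + \frac{B}{1 - 0.8z^{-1}} \right\}_+$ (10.161)

$\phantom{G(z)} = \frac{B}{1 - 0.8z^{-1}}$, (10.162)

$B = \left. \frac{-0.72 \sqrt{5/8}\, z^{-1}}{1 - 2z^{-1}} \right|_{z=0.8} = 0.6 \sqrt{5/8}, \qquad A = \left. \frac{-0.72 \sqrt{5/8}\, z^{-1}}{1 - 0.8z^{-1}} \right|_{z=2} = -0.6 \sqrt{5/8}$ (10.163)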

where we have identified the second term in the partial fraction expansion of (10.161) as the positive time part of the signal. Note that while we have provided the value of A in the partial fraction expansion, it is not needed. We obtain for G(z) and g(t):

$G(z) = \frac{0.6 \sqrt{5/8}}{1 - 0.8z^{-1}} \;\Longleftrightarrow\; g(t) = 0.6 \sqrt{5/8}\, (0.8)^t\, u_{-1}(t)$ (10.164)

Overall we obtain for Hc(z):

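$H_c(z) = G(z)\, W(z)$ (10.165)

$\phantom{H_c(z)} = \left( \frac{0.6 \sqrt{5/8}}{1 - 0.8z^{-1}} \right) \left( \frac{\sqrt{5/8}\, (1 - 0.8z^{-1})}{1 - \frac{1}{2} z^{-1}} \right)$ (10.166)

$\phantom{H_c(z)} = \frac{0.375}{1 - \frac{1}{2} z^{-1}}$ (10.167)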

Taking inverse transforms we obtain for the corresponding causal Wiener filter impulse response:

$h_c(t) = 0.375 \left( \frac{1}{2} \right)^t u_{-1}(t)$ (10.168)

Finally we find the associated estimation error covariance or mean square error:

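$\Lambda_{CWF} = K_{XX}(0) - \sum_{k=0}^{\infty} g^2(k)$ (10.169)

$\phantom{\Lambda_{CWF}} = 1 - \sum_{k=0}^{\infty} 0.36\, (5/8)\, (0.8)^{2k}$ (10.170)

$\phantom{\Lambda_{CWF}} = 1 - 0.36\, (5/8)\, \frac{1}{1 - (0.8)^2} = 1 - 5/8$ (10.171)

$\phantom{\Lambda_{CWF}} = 0.375$ (10.172)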

We will see in Example 4.1 that both the causal Wiener filter and its mean square error are the same as that obtained for the Kalman filter in steady state – i.e. when we apply the Kalman filter to a stationary problem over an infinite time interval, so transients have had a chance to die out.
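The numbers in Example 10.9 can be cross-checked with a few lines of Python. The sketch below verifies the spectral factorization (10.155) at a point on the unit circle, sums the series in (10.170), and iterates a scalar Riccati recursion to its steady state. The Riccati recursion is the standard scalar Kalman filter update, which is not derived in this section, so it and all function names should be read as assumptions used only for comparison.

```python
import cmath
import math


def s_yy(z):
    """Observation spectrum (10.154)."""
    return 0.36 / ((1 - 0.8 * z) * (1 - 0.8 / z)) + 1.0


def s_yy_plus(z):
    """Factor with pole and zero inside the unit circle, from (10.155)."""
    return math.sqrt(8 / 5) * (1 - 0.5 / z) / (1 - 0.8 / z)


def s_yy_minus(z):
    """Mirror-image factor with pole and zero outside the unit circle."""
    return math.sqrt(8 / 5) * (1 - 0.5 * z) / (1 - 0.8 * z)


def causal_mse(n_terms=200):
    """1 - sum_k g(k)^2 with g(k) = 0.6 sqrt(5/8) 0.8^k, as in (10.169)."""
    g0 = 0.6 * math.sqrt(5 / 8)
    return 1.0 - sum((g0 * 0.8 ** k) ** 2 for k in range(n_terms))


def steady_state_kalman(a=0.8, q=0.36, r=1.0, n_iter=100):
    """Scalar Riccati iteration (assumed standard Kalman recursion)."""
    p = q / (1 - a * a)            # start from the stationary variance of x(t)
    gain = 0.0
    for _ in range(n_iter):
        p_pred = a * a * p + q     # one-step prediction error variance
        gain = p_pred / (p_pred + r)
        p = (1 - gain) * p_pred    # filtered error variance
    return gain, p


z = cmath.exp(0.3j)
factor_error = abs(s_yy(z) - s_yy_plus(z) * s_yy_minus(z))
kalman_gain, kalman_mse = steady_state_kalman()
print(factor_error, causal_mse(), kalman_gain, kalman_mse)
```

Under these assumptions the steady-state gain and error variance of the recursion both settle at 0.375, matching $h_c(0)$ in (10.168) and the mean square error in (10.172).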


10.4.3 Summary

We close this discussion of Wiener filters for wide-sense stationary processes by presenting a table summarizing the causal and noncausal Wiener filters.

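Wiener Filter Type | Observation Interval | Optimal Filter | Estimation Error Variance/MSE

Noncausal | $[-\infty, +\infty]$ | $H_{nc}(s) = \frac{S_{YX}(s)}{S_{YY}(s)}$ | $\Lambda_{NCWF} = K_{XX}(0) - \int_{-\infty}^{\infty} h_{nc}(u)\, K_{YX}(u)\, du$

Causal | $[-\infty, t]$ | $H_c(s) = \frac{1}{S^+_{YY}(s)} \left\{ \frac{S_{YX}(s)}{S^-_{YY}(s)} \right\}_+$ | $\Lambda_{CWF} = K_{XX}(0) - \int_{0}^{\infty} g^2(\tau)\, d\tau$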

Figure 10.18: Summary of Wiener Filter Solutions

Because of its potentially confusing nature, we also give a summary of the notation used with respect to Wiener filter solutions. Suppose F(s) is the bilateral Laplace transform corresponding to the time function f(t), then the following apply:

Transform Domain | Time Domain

$F(s)$ | $f(t)$

$F^+(s)$: spectral factorization with LHP poles and zeros |

$F^-(s)$: spectral factorization with RHP poles and zeros |

$\{F(s)\}_+ \equiv \int_{0^-}^{\infty} f(t)\, e^{-st}\, dt$ (PFE terms with LHP poles) | $f(t)\, u_{-1}(t) \equiv \{f(t)\}_+$ (positive time part)

$\{F(s)\}_- \equiv \int_{-\infty}^{0^-} f(t)\, e^{-st}\, dt$ (PFE terms with RHP poles) | $f(t)\, u_{-1}(-t) \equiv \{f(t)\}_-$ (negative time part)


Chapter 11

Series Expansions and Detection of Stochastic Processes

In many cases, it is easier to view a continuous-time stochastic process x(t) defined over a finite interval [0,T] in terms of an infinite set of random coefficients. When the sample paths of x(t) are sufficiently regular (e.g. continuous), one can expand x(t) in a Fourier series, for example. In this section, the properties of such expansions are discussed. To begin the discussion, it is important to review the properties of series expansions of deterministic functions.

11.1 Deterministic Functions

Let x(t) be a deterministic function defined on a time interval [S,T]. Assume that one is interested in representing x(t) as

$x(t) = \sum_i x_i f_i(t)$ (11.1)

where $f_i(t)$ are a set of basis functions, and $x_i$ is a set of associated coefficients. Ideally, one would like to have a set of basis functions which are complete, in that every function x(t) can be represented as in (11.1), and which are orthonormal in some sense. For instance, consider the set of functions $f_i(t) = \frac{1}{\sqrt{T-S}}\, e^{j2\pi i t/(T-S)}$, $i = 0, \pm 1, \pm 2, \ldots$. This is the standard Fourier series basis over a finite interval. This basis has the property that

$\int_S^T f_i(t)\, f_k(t)^*\, dt = \frac{1}{T-S} \int_S^T e^{j2\pi(i-k)t/(T-S)}\, dt = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases}$ (11.2)

Indeed, one can show that, for any function x(t) such that $\int_S^T |x(t)|^2\, dt < \infty$, there is a set of coefficients $x_i$ such that

$\lim_{n \to \infty} \int_S^T \left| x(t) - \sum_{i=-n}^{n} x_i f_i(t) \right|^2 dt = 0$ (11.3)

The coefficients xi can be obtained from the orthonormal property of the fi(t), by

$x_i = \int_S^T x(t)\, f_i(t)^*\, dt$ (11.4)

By analogy with linear algebra, consider the space of all square-integrable, complex-valued functions, that is, functions such that $\int_S^T |x(t)|^2\, dt < \infty$. This is a linear space, in that scaled versions of these functions and sums of these functions are also square-integrable. On this space, define the inner product

$\langle x(t), y(t) \rangle = \int_S^T x(t)\, y(t)^*\, dt$ (11.5)


An orthonormal basis on this space is a set of functions $f_i(t)$ satisfying the property that

$\langle f_i, f_k \rangle = \int_S^T f_i(t)\, f_k(t)^*\, dt = \begin{cases} 1 & \text{if } i = k \\ 0 & \text{otherwise} \end{cases}$ (11.6)

A complete orthonormal basis is such that every element x(t) of the space can be expressed as a sum (11.1), where the coefficients are computed as (11.4).
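The orthonormality property (11.6) and the coefficient formula (11.4) can be illustrated numerically for the Fourier basis. The sketch below approximates the inner product (11.5) with a midpoint rule; the interval [1, 3], the test function cos(πt), the number of quadrature points, and the function names are assumptions chosen for illustration.

```python
import cmath
import math


def fourier_basis(i, lower, upper):
    """f_i(t) = exp(j 2 pi i t / (T - S)) / sqrt(T - S) on [S, T]."""
    length = upper - lower
    return lambda t: cmath.exp(2j * math.pi * i * t / length) / math.sqrt(length)


def inner_product(x, y, lower, upper, n=1000):
    """Midpoint-rule approximation of <x, y> = integral_S^T x(t) y(t)^* dt."""
    step = (upper - lower) / n
    total = 0j
    for k in range(n):
        t = lower + (k + 0.5) * step
        total += x(t) * complex(y(t)).conjugate()
    return total * step


def coefficients(x, lower, upper, order):
    """x_i = <x, f_i> for |i| <= order, as in (11.4)."""
    return {i: inner_product(x, fourier_basis(i, lower, upper), lower, upper)
            for i in range(-order, order + 1)}


S, T = 1.0, 3.0
gram = {(i, k): inner_product(fourier_basis(i, S, T), fourier_basis(k, S, T), S, T)
        for i in range(-2, 3) for k in range(-2, 3)}
coeffs = coefficients(lambda t: math.cos(math.pi * t), S, T, order=2)
print(abs(gram[(1, 1)]), abs(gram[(1, 2)]))
print({i: round(abs(c), 6) for i, c in coeffs.items()})
```

On this interval cos(πt) equals $(f_1(t) + f_{-1}(t))/\sqrt{2}$, so only the coefficients with i = ±1 are nonzero.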

11.2 Series Expansion of Stochastic Processes

Consider now a zero-mean, complex-valued stochastic process x(t), defined on the interval [S,T]. For stochastic processes, one can expand the process as the mean-square sense limit of an infinite series, as

$x(t) \overset{mss}{=} \sum_i x_i f_i(t)$ (11.7)

where the random coefficients $x_i$ are given by stochastic integrals, as

$x_i = \int_S^T x(t)\, f_i(t)^*\, dt$ (11.8)

Note the following properties:

$E[x_i] = \int_S^T E[x(t)]\, f_i(t)^*\, dt = 0$ (11.9)

$E[x_i x_j] = \int_S^T \int_S^T E[x(t)\, x(s)^*]\, f_i(t)\, f_j(s)^*\, ds\, dt = \int_S^T \int_S^T K_x(t,s)\, f_i(t)\, f_j(s)^*\, ds\, dt$ (11.10)

A particular case of interest is when x(t) is white noise, so that Kx(t,s) = δ(t−s). In this case, one sees that

$E[x_i x_j] = \int_S^T \int_S^T \delta(t-s)\, f_i(t)\, f_j(s)^*\, ds\, dt = \int_S^T f_i(s)\, f_j(s)^*\, ds = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$ (11.11)

Thus, series expansions of white noise using orthonormal functions are such that the coefficients are orthogonal. Furthermore, if the white noise is a Gaussian process, the coefficients are also Gaussian. The result is that expansion of Gaussian white noise in an orthonormal series yields an independent, identically distributed sequence of coefficients $x_i$, Gaussian with zero mean and unit variance.

Consider now the case where the process $x(t)$ is not white, but has a general autocovariance function $K_x(t,s)$. In this case, the coefficients may not be orthogonal, since in general

$$E[x_i x_j^*] = \int_S^T\!\!\int_S^T K_x(t,s)\, f_i(t)^* f_j(s)\, ds\, dt \neq 0 \quad \text{if } i \neq j \tag{11.12}$$

Thus, for an arbitrary complete orthonormal basis, the process $x(t)$ can be expanded in a series, but the coefficients will not necessarily be orthogonal. However, there is a special orthonormal basis for which the coefficients are orthogonal. Consider a basis where the basis functions satisfy the following integral equation:

$$\int_S^T K_x(t,s)\, f_i(s)\, ds = \lambda_i f_i(t) \tag{11.13}$$

Then, (11.12) becomes

$$E[x_i x_j^*] = \int_S^T\!\!\int_S^T K_x(t,s)\, f_i(t)^* f_j(s)\, ds\, dt = \int_S^T \lambda_j\, f_i(t)^* f_j(t)\, dt = \begin{cases} \lambda_j & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{11.14}$$

In this special basis, the coefficients $x_i$ are orthogonal. For general stochastic processes, an expansion in an orthonormal basis of the form

$$x(t) \overset{\text{mss}}{=} \sum_i x_i f_i(t) \tag{11.15}$$

where the basis functions $f_i(t)$ satisfy (11.13), and the coefficients $x_i$ are orthogonal, is known as a Karhunen-Loeve expansion.

The fact that a basis exists for Karhunen-Loeve expansions is a result from the theory of integral equations. The exposition below is a brief introduction to the subject. In order to best understand Karhunen-Loeve expansions, it is useful to first consider the case of a vector-valued random variable rather than a stochastic process. One can think of a vector-valued random variable as a discrete-time stochastic process where time has a finite range. Let $x$ be an $n$-dimensional vector-valued random variable, with covariance matrix $\Sigma_x$. The covariance $\Sigma_x$ is a positive semidefinite, symmetric (Hermitian in the complex case) matrix. The idea of a Karhunen-Loeve expansion in this problem is to seek a set of $n$ basis vectors $u_i$ which have the following properties:

$$\Sigma_x u_i = \lambda_i u_i \tag{11.16}$$

$$u_i^H u_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{11.17}$$

Note that (11.16) states that the basis vectors $u_i$ are eigenvectors of the matrix $\Sigma_x$ with eigenvalues $\lambda_i$. Since $\Sigma_x$ is positive semidefinite, its eigenvalues must be nonnegative. From the theory of symmetric matrices, one knows that a symmetric matrix has a complete set of orthonormal eigenvectors (i.e. there exists an orthonormal basis in which the matrix is diagonal). In particular, one way of finding such a basis is by solving the following problem:

$$\lambda_1 = \max_{\|u\| = 1} \|\Sigma_x u\| > 0 \tag{11.18}$$

and $u_1$ is the vector which achieves the maximum. Note that such a maximum must exist, since $\|\Sigma_x u\|$ is a continuous function of $u$, and the admissible vectors form a closed and bounded (i.e. compact) set, namely the unit sphere. Because of the symmetry of $\Sigma_x$, $u_1$ is also an eigenvector; that is,

$$\Sigma_x u_1 = \lambda_1 u_1, \qquad u_1^H \Sigma_x = \lambda_1 u_1^H \tag{11.19}$$

Once the first vector is found, one can form the reduced matrix

$$\Sigma_1 = \Sigma_x - \lambda_1 u_1 u_1^H$$

Theorem 11.1 The matrix Σ1 is also positive semidefinite.

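The proof of this result lies in defining the auxiliary random vector $x_1 = x - (u_1^H x) u_1$. The covariance of $x_1$ is

$$\begin{aligned} \Sigma &= E\big[(x - (u_1^H x)u_1)(x - (u_1^H x)u_1)^H\big] \\ &= \Sigma_x - E\big[(u_1^H x)\, u_1 x^H\big] - E\big[x\, ((u_1^H x) u_1)^H\big] + E\big[(u_1^H x)\, u_1 u_1^H\, (u_1^H x)^*\big] \end{aligned} \tag{11.20}$$

Note that $(u_1^H x) = (x^H u_1)^*$ is a scalar, and thus commutes with the matrices it is multiplied against. The above equation can be rearranged to obtain

$$\begin{aligned} \Sigma &= \Sigma_x - u_1 u_1^H E[x x^H] - E[x x^H]\, u_1 u_1^H + u_1 u_1^H \big(u_1^H E[x x^H] u_1\big) \\ &= \Sigma_x - 2\lambda_1 u_1 u_1^H + \lambda_1 u_1 u_1^H = \Sigma_x - \lambda_1 u_1 u_1^H = \Sigma_1 \end{aligned} \tag{11.21}$$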

Hence, since $\Sigma_1$ is also a covariance matrix, it is also positive semidefinite.

Assume now that $\Sigma_1 \neq 0$. In this case, the process can be repeated, based on the matrix $\Sigma_1$, to obtain $0 < \lambda_2 \leq \lambda_1$ and $u_2$. The following result holds:

Theorem 11.2 $u_2$ is an eigenvector of $\Sigma_x$, with eigenvalue $\lambda_2$. Furthermore, $u_2^H u_1 = 0$.

To show this, consider the definition of Σ1. Since u2 is an eigenvector of Σ1, the following equation holds:

$$(\Sigma_x - \lambda_1 u_1 u_1^H)\, u_2 = \lambda_2 u_2 \tag{11.22}$$

Multiply on the left by $u_1^H$ to obtain

$$u_1^H \Sigma_x u_2 - \lambda_1 (u_1^H u_1)\, u_1^H u_2 = \lambda_2\, u_1^H u_2 \tag{11.23}$$

Using the fact that $u_1^H u_1 = 1$, and that $u_1$ is an eigenvector of $\Sigma_x$ (so that $u_1^H \Sigma_x = \lambda_1 u_1^H$), this simplifies to

$$\lambda_1 (u_1^H u_2 - u_1^H u_2) = 0 = \lambda_2\, u_1^H u_2 \tag{11.24}$$

which, since $\lambda_2 > 0$, implies that $u_1^H u_2 = 0$. Substituting into (11.22) establishes that $u_2$ is indeed an eigenvector of $\Sigma_x$.

The procedure can be continued recursively, until the residual covariance $\Sigma_i = 0$. Once $\Sigma_i = 0$, the procedure can be stopped, and one has the expansion:

$$\Sigma_x = \sum_{j=1}^{i} \lambda_j u_j u_j^H \tag{11.25}$$

Note by necessity that $i \leq n$, since, after $n$ steps, it is impossible to find a nonzero vector which is orthogonal to the $n$ vectors already found. Furthermore, the above construction establishes that the vector $x - \sum_{j=1}^{i} u_j (u_j^H x)$ has zero covariance. This establishes the Karhunen-Loeve expansion

$$x \overset{\text{mss}}{=} \sum_{j=1}^{i} x_j u_j \tag{11.26}$$

with random, orthogonal coefficients $x_j = u_j^H x$.
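The recursive construction above can be sketched in a few lines for a real symmetric covariance matrix. This is only an illustration under stated assumptions: the maximization in (11.18) is replaced here by plain power iteration (a swapped-in numerical technique, not the method of the notes), the starting vector is taken as the largest column of the residual matrix, and the names `kl_basis`, `mat_vec`, and `norm` are hypothetical.

```python
import math

def mat_vec(A, v):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def kl_basis(cov, tol=1e-10, iters=1000):
    """Eigenpairs of a real symmetric PSD matrix by power iteration and deflation.

    Each pass finds one eigenpair (lam, u) of the residual matrix, then forms
    the reduced matrix residual - lam * u u^T, mirroring Sigma_1 in the text.
    """
    n = len(cov)
    residual = [row[:] for row in cov]
    pairs = []
    for _ in range(n):
        # Start from the largest column: it lies in the range of the residual.
        cols = [[residual[r][c] for r in range(n)] for c in range(n)]
        start = max(cols, key=norm)
        if norm(start) < tol:
            break  # residual covariance is (numerically) zero
        v = [x / norm(start) for x in start]
        for _ in range(iters):
            w = mat_vec(residual, v)
            nw = norm(w)
            if nw == 0.0:
                break
            v = [x / nw for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(residual, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        residual = [[residual[r][c] - lam * v[r] * v[c] for c in range(n)]
                    for r in range(n)]
    return sorted(pairs, key=lambda p: -p[0])

if __name__ == "__main__":
    for lam, u in kl_basis([[2.0, 1.0], [1.0, 2.0]]):
        print(round(lam, 6), [round(x, 6) for x in u])
```

In practice one would use a library eigensolver; the point of the sketch is only that deflation by $\lambda_1 u_1 u_1^H$ leaves a matrix whose next eigenvector is orthogonal to $u_1$.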

Consider now extending the above development to a stochastic process $x(t)$. The principal difference is that, instead of having a vector-valued random variable, the random variable takes the value of an entire function (an element of an infinite-dimensional space). However, the autocovariance function $K_x(t,s)$ is still a positive semidefinite operator, and there are equivalent results in functional analysis which enable the development to carry through. In particular, the following result is proven in many functional analysis texts such as Hille and Yosida, or Riesz and Sz-Nagy:

Theorem 11.3 Let $K_x(t,s)$ be continuous, Hermitian, nonzero and positive semidefinite on the interval $[S,T]$. Then, there exists at least one eigenfunction $f(t)$ and one positive eigenvalue $\lambda$ satisfying

$$\int_S^T K_x(t,s)\, f(s)\, ds = \lambda f(t) \tag{11.27}$$

$$\int_S^T f(s)\, f(s)^*\, ds < \infty \tag{11.28}$$

Using this result, a sequence of eigenvalues and eigenfunctions $\lambda_i$, $f_i(t)$ can be constructed as above, with the property that the eigenvalues $\lambda_i$ are positive, and the eigenfunctions $f_i(t)$ are orthonormal. After $n$ eigenfunctions are found, one has the following expression for the reduced autocovariance $K_n(t,s)$:

$$K_n(t,s) = K_x(t,s) - \sum_{i=1}^{n} \lambda_i f_i(t) f_i(s)^* \tag{11.29}$$

which is the autocovariance of the residual process $x(t) - \sum_{i=1}^{n} f_i(t) \int_S^T x(s) f_i(s)^*\, ds$. One needs to show that, as $n \to \infty$, the error $K_n(t,s) \to 0$. A simple way to show this is to note that

$$K_n(t,t) = K_x(t,t) - \sum_{i=1}^{n} \lambda_i |f_i(t)|^2 \geq 0 \tag{11.30}$$

so that $\sum_{i=1}^{n} \lambda_i |f_i(t)|^2$ is bounded above by $K_x(t,t)$ and monotone in $n$, so it converges to a limit $\sum_{i=1}^{\infty} \lambda_i |f_i(t)|^2$. This also establishes that $\sum_{i=1}^{n} \lambda_i f_i(t) f_i(s)^*$ converges to a limit, since

$$\begin{aligned} \left| \sum_{i=1}^{n} \lambda_i f_i(t) f_i(s)^* - \sum_{i=1}^{m} \lambda_i f_i(t) f_i(s)^* \right| &= \left| \sum_{i=m+1}^{n} \lambda_i f_i(t) f_i(s)^* \right| \\ &\leq \left( \sum_{i=m+1}^{n} \lambda_i |f_i(t)|^2 \right)^{1/2} \left( \sum_{i=m+1}^{n} \lambda_i |f_i(s)|^2 \right)^{1/2} \\ &\to 0 \text{ as } m,n \to \infty. \end{aligned} \tag{11.31}$$

Thus, the residual autocovariance approaches a limit $K_\infty(t,s) = K_x(t,s) - \sum_{i=1}^{\infty} \lambda_i f_i(t) f_i(s)^*$. If this limit is not identically zero, then there is a positive eigenvalue and eigenfunction which can be added to the sum, to change the value of $K_\infty$, which contradicts the assumed convergence.

The above argument establishes that the autocovariance $K_x(t,s)$ can be expanded as

$$K_x(t,s) = \sum_{i=1}^{\infty} \lambda_i f_i(t) f_i(s)^* \tag{11.32}$$

in terms of the orthonormal eigenfunctions $f_i(t)$. The convergence of the sum is uniform in $t,s \in [S,T]$, as implied by the bound in (11.31). This result is known as Mercer's Theorem. Mercer's Theorem implies that

$$E\left[ \int_S^T |x(t)|^2\, dt \right] = \int_S^T K_x(t,t)\, dt = \int_S^T \sum_{i=1}^{\infty} \lambda_i f_i(t) f_i(t)^*\, dt = \sum_{i=1}^{\infty} \lambda_i \tag{11.33}$$

Furthermore, the process $x(t)$ can be written as

$$x(t) \overset{\text{mss}}{=} \sum_{i=1}^{\infty} x_i f_i(t) \tag{11.34}$$

where the coefficients of the expansion $x_i$ are orthogonal random variables with $E[|x_i|^2] = \lambda_i$, as in (11.14), given as

$$x_i = \int_S^T x(t)\, f_i(t)^*\, dt \tag{11.35}$$

This expansion is known as the Karhunen-Loeve expansion.

The sequence of eigenfunctions $f_i(t)$ constructed above will form a complete orthonormal set if and only if $K_x(t,s)$ is positive definite. In this case, an arbitrary deterministic function $f(t)$ can be expanded using the series expansion in terms of $f_i(t)$. If $K_x(t,s)$ is only positive semidefinite, the construction above must be augmented with enough additional orthogonal functions, corresponding to the zero eigenvalues of $K_x(t,s)$, to form a complete orthonormal set. A simple way of constructing such a set is to construct the eigenfunctions for the autocovariance function $K_x(t,s) + I\delta(t-s)$, where $I$ is the identity matrix; this modified autocovariance function has the same eigenfunctions as $K_x(t,s)$, except that it is positive definite.

Example 11.1 (Karhunen-Loeve Expansion of Wiener Process) Let $b(t)$ be a Wiener process with rate $\sigma^2$, with autocovariance $K_b(t,s) = \sigma^2 \min(t,s)$ defined on $[0,T]$. The integral equation defining the eigenfunctions is:

$$\lambda f(t) = \sigma^2 \int_0^T \min(t,s)\, f(s)\, ds = \sigma^2 \left( \int_0^t s f(s)\, ds + t \int_t^T f(s)\, ds \right) \tag{11.36}$$

To obtain a differential equation for $f(t)$, differentiate with respect to $t$:

$$\lambda \frac{d}{dt} f(t) = \sigma^2 \left( t f(t) + \int_t^T f(s)\, ds - t f(t) \right) = \sigma^2 \int_t^T f(s)\, ds \tag{11.37}$$

Differentiating again yields

$$\frac{d^2}{dt^2} f(t) + \frac{\sigma^2}{\lambda} f(t) = 0 \equiv \frac{d^2}{dt^2} f(t) + a^2 f(t) \tag{11.38}$$

where $a = \sqrt{\sigma^2/\lambda}$. The solutions to this equation are sinusoids. From the above integral equations, the following boundary conditions are required (set $t = 0$ in (11.36) and $t = T$ in (11.37)):

$$f(0) = 0; \qquad \frac{d}{dt} f(T) = 0$$

These boundary conditions are sufficient to specify the values of $a$ for which a solution satisfying the boundary conditions exists. The condition $f(0) = 0$ implies $f(t) = K \sin(at)$; the other boundary condition requires $\cos(aT) = 0$, that is, $aT = (n - 1/2)\pi$ for some integer $n$. Negative values of $n$ only repeat the same eigenfunctions up to sign, so $n$ can be taken positive, and the eigenvalue $\lambda$ can take the values

$$\lambda_n = \frac{\sigma^2 T^2}{(n - 1/2)^2 \pi^2}, \quad n = 1, 2, \ldots \tag{11.39}$$

with corresponding eigenfunctions

$$f_n(t) = K_n \sin\big((n - 1/2)\pi t / T\big) \tag{11.40}$$

The constants $K_n$ are chosen to normalize the eigenfunctions, so that

$$\int_0^T K_n^2 \sin^2\big((n - 1/2)\pi t / T\big)\, dt = \tfrac{1}{2} K_n^2 T = 1$$

Thus, $K_n = \sqrt{2/T}$.
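A quick numerical check of Example 11.1 is sketched below, assuming $\sigma^2 = 1$ and $T = 1$ by default. It evaluates the left side of the integral equation with a midpoint rule and compares it with $\lambda_n f_n(t)$; it also sums the eigenvalues, which by (11.33) should approach $\int_0^T \sigma^2 t\, dt = \sigma^2 T^2/2$. The function names are hypothetical.

```python
import math

def wiener_eigenpair(n, sigma2=1.0, T=1.0):
    """Eigenvalue and eigenfunction n of sigma2 * min(t, s) on [0, T]."""
    a = (n - 0.5) * math.pi / T
    lam = sigma2 / (a * a)
    def f(t):
        return math.sqrt(2.0 / T) * math.sin(a * t)
    return lam, f

def apply_kernel(f, t, sigma2=1.0, T=1.0, steps=4000):
    """Midpoint-rule approximation of sigma2 * integral_0^T min(t, s) f(s) ds."""
    ds = T / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * ds
        total += min(t, s) * f(s) * ds
    return sigma2 * total

def eigenvalue_sum(count, sigma2=1.0, T=1.0):
    """Partial sum of the first `count` eigenvalues."""
    return sum(wiener_eigenpair(n, sigma2, T)[0] for n in range(1, count + 1))

if __name__ == "__main__":
    lam, f = wiener_eigenpair(2)
    print(apply_kernel(f, 0.3), lam * f(0.3))  # the two values should nearly agree
    print(eigenvalue_sum(10000))               # should be close to 0.5
```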

The main limitation in using Karhunen-Loeve expansions is in determining the eigenvalues and eigenfunctions. This limits the applicability of the method for practical problems. However, when the process of interest is white noise, any complete orthonormal set of functions can be used. Thus, solving detection problems involving white noise processes is straightforward, as discussed in the next section.

11.3 Detection of Known Signals in Additive White Noise

As an application of series expansions of random processes, consider the problem of observing a random process $y(t)$ in order to decide between the following two hypotheses. Under hypothesis $H_1$, the random process $y(t)$ is described as:

$$y(t) = s_1(t) + w(t) \tag{11.41}$$

where $w(t)$ is a white noise process, with autocorrelation $q\delta(t-s)$, and $s_1(t)$ is a known signal. Under hypothesis $H_0$, the random process $y(t)$ is given by

$$y(t) = s_0(t) + w(t) \tag{11.42}$$

where $s_0(t)$ is a known signal and $w(t)$ is again white noise with autocorrelation $q\delta(t-s)$.

Assume that the process $y(t)$ is observed over the interval $[0,T]$. The problem is to design an optimal detector, in terms of either a Bayes cost or a maximum a posteriori (MAP) criterion. The complicating factor in this problem is that it is difficult to write the likelihood ratio in terms of continuous functions (although it is possible to develop such an expression as a limit of the vector observation likelihood functions discussed previously).

Instead of dealing with the continuous-time observation process, one can use a series expansion to convert $y(t)$ to an infinite vector of coefficients $y$, where $y_i$ is the $i$-th coefficient in the series expansion. It is important to select the basis functions for the expansion carefully. In particular, consider the following basis function:

$$f_1(t) = \frac{s_1(t) - s_0(t)}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} \tag{11.43}$$

and select the rest of the basis functions to form a complete orthonormal set over $[0,T]$, orthogonal to $s_1(t) - s_0(t)$. Observe the following identities:

$$\begin{aligned} \int_0^T s_1(t) f_1(t)^*\, dt &= \frac{\int_0^T s_1(s)(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} \\ &= \frac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds + \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} \\ \int_0^T s_1(t) f_i(t)^*\, dt &= \int_0^T s_0(t) f_i(t)^*\, dt \quad \text{if } i > 1 \\ \int_0^T s_0(t) f_1(t)^*\, dt &= \frac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} \end{aligned} \tag{11.44}$$

One can now convert the waveform $y(t)$ into the corresponding coefficients. Under hypothesis $H_1$, the coefficients are independent of each other and become:

$$y_i = \begin{cases} \dfrac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds + \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} + w_1 & \text{if } i = 1 \\[2ex] \int_0^T s_0(t) f_i(t)^*\, dt + w_i & \text{if } i \neq 1 \end{cases} \tag{11.45}$$

due to the orthogonal construction of the basis functions $f_i(t)$. Under hypothesis $H_0$, the coefficients are again independent of each other and are given by:

$$y_i = \begin{cases} \dfrac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} + w_1 & \text{if } i = 1 \\[2ex] \int_0^T s_0(t) f_i(t)^*\, dt + w_i & \text{otherwise} \end{cases} \tag{11.46}$$

where the $w_i$ are independent, zero-mean Gaussian random variables with variance $q$, because they are the coefficients of a white noise expansion using orthonormal functions. What the above expansion shows is that the only coefficient whose distribution differs between the two hypotheses is $y_1$. Furthermore, since the coefficients $y_i$ are independent under each hypothesis, observation of any other coefficient $y_i$, $i > 1$, provides no information concerning the value of coefficient $y_1$. Thus, the rest of the coefficients contain no information which is useful for discriminating between the two hypotheses and can be ignored. The coefficient $y_1$ is a sufficient statistic for the detection problem; that is, for the purposes of solving the detection problem, it is sufficient to consider the one-dimensional Gaussian variable $y_1$. The reduced detection problem becomes:

$$p(y_1 | H_i) = \begin{cases} N(a_1, q) & \text{if } H_1 \text{ is true} \\ N(a_0, q) & \text{if } H_0 \text{ is true} \end{cases} \tag{11.47}$$

where

$$a_1 = \frac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds + \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}}$$

$$a_0 = \frac{\int_0^T s_0(s)(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}}$$

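From the previous analysis of Gaussian detection problems, the optimal detector is given by:

$$m(y_1) = \begin{cases} H_1 & \text{if } (a_1 - a_0)\, y_1 > q \ln(T) + a_1^2/2 - a_0^2/2 \\ H_0 & \text{otherwise} \end{cases} \tag{11.48}$$

Here the $T$ inside the logarithm is the threshold on the likelihood ratio, not the end of the observation interval. Simplifying, note that $a_1 - a_0$ is given by

$$a_1 - a_0 = \frac{\int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds}{\left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2}} = \left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2} \tag{11.49}$$

and

$$(a_1 - a_0)\, y_1 = \left( \int_0^T (s_1(s) - s_0(s))(s_1(s) - s_0(s))^*\, ds \right)^{1/2} \int_0^T y(t) f_1(t)^*\, dt = \int_0^T y(t)(s_1(t) - s_0(t))^*\, dt \tag{11.50}$$

Thus, an equivalent sufficient statistic is $S(y) = \int_0^T y(t)(s_1(t) - s_0(t))^*\, dt$. This statistic is formed by correlating the input $y(t)$ with the known difference between the means of the two hypotheses. This is known as a matched filter.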

Example 11.2 (Detection of Bits in Additive White Noise) Assume that one of two signals $s_0(t) = \sin(10t)$ or $s_1(t) = \sin(10t + \pi/2)$ is transmitted over an interval $t \in [0, 2\pi]$ through a noisy channel with output modeled as $y(t) = s(t) + n(t)$, where $n(t)$ is Gaussian white noise with autocovariance $\delta(t-s)$. The problem is to design the optimal decoder at the receiving end. This is a problem of binary detection. Let $H_1$ denote the hypothesis that $s_1(t)$ was the transmitted signal. The sufficient statistic can be found as:

$$S(y) = \int_0^{2\pi} (\sin(10t + \pi/2) - \sin(10t))\, y(t)\, dt$$

Since $y(t)$ is a Gaussian process under either hypothesis, the statistic $S(y)$ will also be Gaussian. The mean of $S(y)$ under the two hypotheses is given by

$$E[S(y) | H_i] = \int_0^{2\pi} (\sin(10t + \pi/2) - \sin(10t))\, s_i(t)\, dt = \begin{cases} \pi & \text{if } H_1 \\ -\pi & \text{if } H_0 \end{cases}$$

and the variance under both cases is given by

$$\int_0^{2\pi} (\sin(10t + \pi/2) - \sin(10t))^2\, dt = 2\pi$$

Using these statistics, the optimal detector can be constructed as in the scalar Gaussian case.
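The means and variance quoted in Example 11.2 can be checked numerically. The sketch below approximates the correlation integrals with a midpoint rule; the helper names and the zero threshold in `decide` are assumptions made here (a zero threshold corresponds to equal priors and equal error costs, since the two means are symmetric about zero).

```python
import math

def correlate(g, h, a=0.0, b=2.0 * math.pi, steps=20000):
    """Midpoint-rule approximation of the integral of g(t) h(t) over [a, b]."""
    dt = (b - a) / steps
    total = 0.0
    for k in range(steps):
        t = a + (k + 0.5) * dt
        total += g(t) * h(t) * dt
    return total

def s0(t):
    return math.sin(10.0 * t)

def s1(t):
    return math.sin(10.0 * t + math.pi / 2.0)

def difference(t):
    """Matched-filter template s1(t) - s0(t)."""
    return s1(t) - s0(t)

mean_h1 = correlate(difference, s1)           # E[S(y) | H1]
mean_h0 = correlate(difference, s0)           # E[S(y) | H0]
variance = correlate(difference, difference)  # Var[S(y)] for unit-intensity noise

def decide(statistic, threshold=0.0):
    """Return 1 for H1 if the statistic exceeds the (assumed) threshold, else 0."""
    return 1 if statistic > threshold else 0

if __name__ == "__main__":
    print(mean_h1, mean_h0, variance)  # approximately pi, -pi, 2*pi
```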


11.4 Detection of Unknown Signals in White Noise

Consider the following detection problem: The observations $y(t)$, $t \in [0,T]$ are given by:

$$y(t) = \begin{cases} x(t) + w(t) & \text{if } H_1 \text{ is true} \\ w(t) & \text{otherwise} \end{cases} \tag{11.51}$$

where $x(t)$ is a zero-mean, finite-variance Gaussian random process with autocovariance $K_x(t,s)$, which is independent of the additive white noise $w(t)$ with autocovariance $\delta(t-s)$. The nature of the optimal detector can again be determined using a series expansion. However, the nature of the sufficient statistic is harder to determine, as shown below.

The key to solving the above detection problem is to transform the observations using the Karhunen-Loeve expansion of $x(t)$, as follows: Let $f_i(t)$ and $\lambda_i$ denote the eigenfunctions and eigenvalues of the Karhunen-Loeve expansion of $x(t)$, with corresponding coefficients $x_i$ in the expansion. Then,

$$y_i = \begin{cases} x_i + w_i \sim N(0, 1 + \lambda_i) & \text{if } H_1 \text{ is true} \\ w_i \sim N(0, 1) & \text{if } H_0 \text{ is true} \end{cases} \tag{11.52}$$

where the coefficients $y_i$ are again independent. Unlike the previous section, every coefficient has some information concerning the difference between the hypotheses. Thus, one cannot find a single coefficient as a sufficient statistic. However, consider the finite vector $y^N = [\, y_1 \ y_2 \ \cdots \ y_N \,]^T$. A suboptimal hypothesis test can be designed using this finite vector, which only uses the first $N$ coefficients. Note that, as $i \to \infty$, the values of $\lambda_i \to 0$, so that the coefficients $y_i$ have less useful information about the difference in the hypotheses.

The optimal detector using $y^N$ can be computed using the likelihood ratio, which, since the $y_i$ are independent, takes the simple form:

$$L(y^N) = \prod_{i=1}^{N} L(y_i) = \prod_{i=1}^{N} \frac{e^{-y_i^2 / 2(1 + \lambda_i)}}{\sqrt{1 + \lambda_i}\; e^{-y_i^2/2}} \tag{11.53}$$

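Taking logarithms gives $\ln L(y^N) = \sum_{i=1}^{N} \left[ \frac{y_i^2 \lambda_i}{2(1 + \lambda_i)} - \frac{1}{2}\ln(1 + \lambda_i) \right]$. Discarding the factor $1/2$ and the terms that do not depend on the data, one obtains the following sufficient statistic:

$$S(y^N) = \sum_{i=1}^{N} \frac{y_i^2 \lambda_i}{1 + \lambda_i} \tag{11.54}$$

Indeed, taking limits as $N \to \infty$, one obtains the sufficient statistic for the original problem

$$S(y) = \sum_{i=1}^{\infty} \frac{y_i^2 \lambda_i}{1 + \lambda_i} \tag{11.55}$$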

Using this sufficient statistic, an optimal detection threshold can be designed as an extension of the vector case. In practice, it is not useful to evaluate all of the coefficients yi, as their discrimination value decreases, so that a suboptimal detector using only N coefficients is used.

Note that, since the statistic involves the square of the coefficients yi, the statistic will not be Gaussian. Thus, for the case of unknown signal in noise, it is useful to derive the appropriate threshold from the original likelihood ratio expression, rather than to compute it using the likelihood ratio of the sufficient statistic.
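As a small sketch of how the truncated statistic relates to the log-likelihood ratio it came from, the functions below compute both from a finite list of coefficients $y_i$ and eigenvalues $\lambda_i$. The function names and the sample numbers are hypothetical; the relation checked is $\ln L = \frac{1}{2} S - \frac{1}{2}\sum_i \ln(1 + \lambda_i)$, which follows from (11.53).

```python
import math

def sufficient_statistic(y, lam):
    """Truncated statistic S(y^N) = sum of y_i^2 * lam_i / (1 + lam_i), as in (11.54)."""
    return sum(yi * yi * li / (1.0 + li) for yi, li in zip(y, lam))

def log_likelihood_ratio(y, lam):
    """Logarithm of the likelihood ratio (11.53) for the first N coefficients."""
    return sum(0.5 * yi * yi * li / (1.0 + li) - 0.5 * math.log(1.0 + li)
               for yi, li in zip(y, lam))

if __name__ == "__main__":
    y = [1.0, 2.0]      # made-up coefficients, for illustration only
    lam = [1.0, 3.0]    # made-up eigenvalues, for illustration only
    s = sufficient_statistic(y, lam)
    offset = 0.5 * sum(math.log(1.0 + li) for li in lam)
    print(s, log_likelihood_ratio(y, lam), 0.5 * s - offset)
```

A threshold on the likelihood ratio therefore maps to a threshold on $S(y^N)$ by adding the data-independent offset and doubling.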

11.5 Detection of Known Signals in Colored Noise

Consider the problem of detecting the presence of a known signal $s(t)$ in the interval $[0,T]$, with observations

$$y(t) = \begin{cases} s(t) + n(t) + w(t) & \text{if } H_1 \text{ is true} \\ n(t) + w(t) & \text{otherwise} \end{cases} \tag{11.56}$$

where $n(t)$ is a zero-mean Gaussian random process with autocovariance $K_n(t,s)$, and $w(t)$ is white noise with autocovariance $\delta(t-s)$, independent of $n$.

Again, a solution is possible using series expansions. Let $\lambda_i$, $f_i(t)$, and $n_i$ denote the eigenvalues, eigenfunctions, and coefficients of the Karhunen-Loeve expansion of $n(t)$. Then, the observation function $y(t)$ can be transformed using this orthonormal basis into independent coefficients as

$$y_i = \begin{cases} s_i + n_i + w_i \sim N(s_i, 1 + \lambda_i) & \text{if } H_1 \text{ is true} \\ n_i + w_i \sim N(0, 1 + \lambda_i) & \text{if } H_0 \text{ is true} \end{cases} \tag{11.57}$$

where the coefficients $s_i$ are given by

$$s_i = \int_0^T s(t)\, f_i(t)^*\, dt$$

All of the coefficients $y_i$ contain information which is useful for discrimination. Indeed, as $i \to \infty$, the information in the coefficients may improve, since the variance of the noise in each coefficient becomes smaller.

The likelihood ratio for this case takes the simple form

$$L(y) = \prod_{i=1}^{\infty} \frac{e^{-(y_i - s_i)^2 / 2(1 + \lambda_i)}}{e^{-y_i^2 / 2(1 + \lambda_i)}}$$

Taking logarithms, one obtains the sufficient statistic

$$S(y) = \sum_{i=1}^{\infty} \frac{s_i y_i}{1 + \lambda_i} \sim \begin{cases} N\!\left( \sum_{i=1}^{\infty} \frac{s_i^2}{1 + \lambda_i},\ \sum_{i=1}^{\infty} \frac{s_i^2}{1 + \lambda_i} \right) & \text{if } H_1 \\ N\!\left( 0,\ \sum_{i=1}^{\infty} \frac{s_i^2}{1 + \lambda_i} \right) & \text{otherwise} \end{cases} \tag{11.58}$$

In practice, only a finite set of coefficients is used. However, care must be taken to include enough coefficients so that the approximation $s(t) \approx \sum_{i=1}^{N} s_i f_i(t)$ is accurate.
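For a finite set of real coefficients, the statistic in (11.58) and the quantity that serves as both its mean under $H_1$ and its variance can be computed directly. The following is a minimal sketch; the function names and the sample values are hypothetical.

```python
def colored_noise_statistic(y, s, lam):
    """Truncated statistic: sum of s_i * y_i / (1 + lam_i)."""
    return sum(si * yi / (1.0 + li) for si, yi, li in zip(s, y, lam))

def statistic_mean_and_variance(s, lam):
    """Sum of s_i^2 / (1 + lam_i): the mean of S(y) under H1 and its variance."""
    return sum(si * si / (1.0 + li) for si, li in zip(s, lam))

if __name__ == "__main__":
    s = [1.0, 2.0]      # made-up signal coefficients
    lam = [1.0, 3.0]    # made-up noise eigenvalues
    # With noise-free observations y = s, the statistic equals its H1 mean.
    print(colored_noise_statistic(s, s, lam), statistic_mean_and_variance(s, lam))
```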

Note the effect of the additive white noise is to guarantee that the variance $1 + \lambda_i$ of each coefficient $y_i$ is always at least 1. If the additive white noise were not present, the variance $\lambda_i$ would approach zero as $i \to \infty$. In this case, it is possible to obtain a perfect detector by observing enough coefficients. This singularity can be avoided by including a small additive white noise in the measurement.


Appendix A

Useful Transforms

The Fourier transform and inverse Fourier transform of aperiodic signals are defined in Table A.1. Notice that the discrete-time Fourier transform (DTFT) is periodic with period $2\pi$ in radian frequency or period 1 in cycles/sample. For periodic signals, we have the Fourier series and transform relationships shown in Table A.2. Note that the discrete-time Fourier series coefficients are periodic with period $N$, and thus $X(e^{j\omega})$ is also periodic. In Table A.6 we give useful Laplace transform pairs. In Table A.7 we give z-transform pairs.

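| | Continuous-Time | Discrete-Time |
|---|---|---|
| FT (radians) | $X(j\omega) = F[x(t)] = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt$ | $X(e^{j\omega}) = F[x(n)] = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\omega n}$ |
| FT (Hertz) | $X(j2\pi f) = F[x(t)] = \int_{-\infty}^{\infty} x(t)\, e^{-j2\pi f t}\, dt$ | $X(e^{j2\pi f}) = F[x(n)] = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j2\pi f n}$ |
| Inverse FT (radians) | $x(t) = F^{-1}[X(j\omega)] = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(j\omega)\, e^{j\omega t}\, d\omega$ | $x(n) = F^{-1}[X(e^{j\omega})] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$ |
| Inverse FT (Hertz) | $x(t) = F^{-1}[X(j2\pi f)] = \int_{-\infty}^{\infty} X(j2\pi f)\, e^{j2\pi f t}\, df$ | $x(n) = F^{-1}[X(e^{j2\pi f})] = \int_{-1/2}^{1/2} X(e^{j2\pi f})\, e^{j2\pi f n}\, df$ |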

Table A.1: Fourier transform and inverse Fourier transform definitions.

In Table A.3 we summarize some useful Fourier transform properties for both the continuous- and discrete-time cases. For compactness, we slightly bend our notation and represent either the continuous or discrete transform in Hertz as $X(f)$ and the transform in radians as $X(\omega)$. In Table A.4 we present useful continuous-time Fourier transform pairs, while in Table A.5 we present useful discrete-time Fourier transform pairs.


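| | Continuous-Time (period $T$, $f_0 = \frac{1}{T}$, $\omega_0 = \frac{2\pi}{T}$) | Discrete-Time (period $N$, $f_0 = \frac{1}{N}$, $\omega_0 = \frac{2\pi}{N}$) |
|---|---|---|
| Fourier Series Coefficients (radians) | $a_k = \frac{1}{T} \int_{t_0}^{t_0+T} x(t)\, e^{-j\omega_0 k t}\, dt$ | $a_k = \frac{1}{N} \sum_{n=n_0+1}^{n_0+N} x(n)\, e^{-j\omega_0 k n}$ |
| Fourier Series Coefficients (Hertz) | $a_k = \frac{1}{T} \int_{t_0}^{t_0+T} x(t)\, e^{-j2\pi k f_0 t}\, dt$ | $a_k = \frac{1}{N} \sum_{n=n_0+1}^{n_0+N} x(n)\, e^{-j2\pi k f_0 n}$ |
| Fourier Series Representation (radians) | $x(t) = \sum_{k=-\infty}^{\infty} a_k\, e^{j\omega_0 k t}$ | $x(n) = \sum_{k=k_0+1}^{k_0+N} a_k\, e^{j\omega_0 k n}$ |
| Fourier Series Representation (Hertz) | $x(t) = \sum_{k=-\infty}^{\infty} a_k\, e^{j2\pi f_0 k t}$ | $x(n) = \sum_{k=k_0+1}^{k_0+N} a_k\, e^{j2\pi f_0 k n}$ |
| Fourier Transform (radians) | $X(j\omega) = 2\pi \sum_{k=-\infty}^{\infty} a_k\, \delta(\omega - k\omega_0)$ | $X(e^{j\omega}) = 2\pi \sum_{k=-\infty}^{\infty} a_k\, \delta(\omega - k\omega_0)$ |
| Fourier Transform (Hertz) | $X(j2\pi f) = \sum_{k=-\infty}^{\infty} a_k\, \delta(f - k f_0)$ | $X(e^{j2\pi f}) = \sum_{k=-\infty}^{\infty} a_k\, \delta(f - k f_0)$ |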

Table A.2: Discrete-time Fourier series and transform relationships.

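| Property | $x(t)$ or $x(n)$ | $X(f)$ | $X(\omega)$ |
|---|---|---|---|
| Linearity | $ax(t) + by(t)$ | $aX(f) + bY(f)$ | $aX(\omega) + bY(\omega)$ |
| Time Shift | $x(t - t_0)$ | $e^{-j2\pi f t_0} X(f)$ | $e^{-j\omega t_0} X(\omega)$ |
| Modulation | $e^{j2\pi f_0 t} x(t)$ | $X(f - f_0)$ | $X(\omega - 2\pi f_0)$ |
| Modulation (alt.) | $e^{j\omega_0 t} x(t)$ | $X(f - \frac{\omega_0}{2\pi})$ | $X(\omega - \omega_0)$ |
| Time reversal | $x(-t)$ | $X(-f)$ | $X(-\omega)$ |
| Conjugate | $x(t)^*$ | $X^*(-f)$ | $X^*(-\omega)$ |
| Convolution | $x(t) * h(t)$ | $X(f) H(f)$ | $X(\omega) H(\omega)$ |
| Real functions | $x(t) = x^*(t)$ | $X(f) = X^*(-f)$ | $X(\omega) = X^*(-\omega)$ |
| **$x(t)$ Only** | | $X(f)$ | $X(\omega)$ |
| Multiplication | $x(t) y(t)$ | $X(f) * Y(f)$ | $\frac{1}{2\pi} X(\omega) * Y(\omega)$ |
| Time derivative | $\frac{d}{dt} x(t)$ | $j2\pi f X(f)$ | $j\omega X(\omega)$ |
| Freq derivative | $t\, x(t)$ | $\frac{j}{2\pi} \frac{d}{df} X(f)$ | $j \frac{d}{d\omega} X(\omega)$ |
| Scaling | $x(at)$ | $\frac{1}{\lvert a \rvert} X(\frac{f}{a})$ | $\frac{1}{\lvert a \rvert} X(\frac{\omega}{a})$ |
| Zero value | $\int_{-\infty}^{\infty} x(t)\, dt = X(0)$ | $\int_{-\infty}^{\infty} X(f)\, df = x(0)$ | $\frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, d\omega = x(0)$ |
| Parseval's Thm | $\int_{-\infty}^{\infty} \lvert x(t) \rvert^2\, dt$ | $= \int_{-\infty}^{\infty} \lvert X(f) \rvert^2\, df$ | $= \frac{1}{2\pi} \int_{-\infty}^{\infty} \lvert X(\omega) \rvert^2\, d\omega$ |
| **$x(n)$ Only** | | $X(f)$ | $X(\omega)$ |
| Multiplication | $x(n) y(n)$ | $\int_{f_0}^{f_0+1} X(v) Y(f - v)\, dv$ | $\frac{1}{2\pi} \int_{\omega_0}^{\omega_0+2\pi} X(v) Y(\omega - v)\, dv$ |
| Freq derivative | $n\, x(n)$ | $\frac{j}{2\pi} \frac{d}{df} X(f)$ | $j \frac{d}{d\omega} X(\omega)$ |
| Zero value | $X(0) = \sum_{n=-\infty}^{\infty} x(n)$ | $\int_{-1/2}^{1/2} X(f)\, df = x(0)$ | $\frac{1}{2\pi} \int_{-\pi}^{\pi} X(\omega)\, d\omega = x(0)$ |
| Parseval's Thm | $\sum_{n=-\infty}^{\infty} \lvert x(n) \rvert^2$ | $= \int_{-1/2}^{1/2} \lvert X(f) \rvert^2\, df$ | $= \frac{1}{2\pi} \int_{-\pi}^{\pi} \lvert X(\omega) \rvert^2\, d\omega$ |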

Table A.3: Fourier Transform Properties.


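| $x(t)$ | $X(f)$ | $X(\omega)$ |
|---|---|---|
| $\delta(t)$ | $1$ | $1$ |
| $\delta(t - t_0)$ | $e^{-j2\pi f t_0}$ | $e^{-j\omega t_0}$ |
| $1$ | $\delta(f)$ | $2\pi\delta(\omega)$ |
| $\cos(2\pi f_0 t)$ | $\frac{1}{2}\delta(f - f_0) + \frac{1}{2}\delta(f + f_0)$ | $\pi\delta(\omega - 2\pi f_0) + \pi\delta(\omega + 2\pi f_0)$ |
| $\cos(\omega_0 t)$ | $\frac{1}{2}\delta(f - \omega_0/2\pi) + \frac{1}{2}\delta(f + \omega_0/2\pi)$ | $\pi\delta(\omega - \omega_0) + \pi\delta(\omega + \omega_0)$ |
| $e^{-\alpha t} u(t)$, $\alpha > 0$ | $\frac{1}{\alpha + j2\pi f}$ | $\frac{1}{\alpha + j\omega}$ |
| $e^{-\alpha \lvert t \rvert}$, $\alpha > 0$ | $\frac{2\alpha}{\alpha^2 + (2\pi f)^2}$ | $\frac{2\alpha}{\alpha^2 + \omega^2}$ |
| $t e^{-\alpha t} u(t)$, $\alpha > 0$ | $\frac{1}{(\alpha + j2\pi f)^2}$ | $\frac{1}{(\alpha + j\omega)^2}$ |
| $\lvert t \rvert e^{-\alpha \lvert t \rvert}$, $\alpha > 0$ | $\frac{2[\alpha^2 - (2\pi f)^2]}{[\alpha^2 + (2\pi f)^2]^2}$ | $\frac{2[\alpha^2 - \omega^2]}{[\alpha^2 + \omega^2]^2}$ |
| $e^{-\pi t^2}$ | $e^{-\pi f^2}$ | $e^{-\omega^2/4\pi}$ |
| Box: 1 for $t \in [-T, T]$ | $2T \frac{\sin(2\pi f T)}{2\pi f T}$ | $2T \frac{\sin(\omega T)}{\omega T}$ |
| $2 f_c \frac{\sin(2\pi f_c t)}{2\pi f_c t}$ | Box: 1 for $f \in [-f_c, f_c]$ | Box: 1 for $\omega \in [-2\pi f_c, 2\pi f_c]$ |
| $\frac{\omega_c}{\pi} \frac{\sin(\omega_c t)}{\omega_c t}$ | Box: 1 for $f \in [-\omega_c/2\pi, \omega_c/2\pi]$ | Box: 1 for $\omega \in [-\omega_c, \omega_c]$ |
| Triangle: $1 - \frac{\lvert t \rvert}{2T}$, $t \in [-2T, 2T]$ | $2T \frac{\sin^2(2\pi f T)}{(2\pi f T)^2}$ | $2T \frac{\sin^2(\omega T)}{(\omega T)^2}$ |
| $\sum_m \delta(t - mT)$ | $\frac{1}{T} \sum_k \delta(f - k/T)$ | $\frac{2\pi}{T} \sum_k \delta(\omega - 2\pi k/T)$ |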

Table A.4: Useful Continuous-Time Fourier Transform Pairs

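| $x(n)$ | $X(f)$ | $X(\omega)$ |
|---|---|---|
| $\delta(n)$ | $1$ | $1$ |
| $\delta(n - n_0)$ | $e^{-j2\pi f n_0}$ | $e^{-j\omega n_0}$ |
| $1$ | $\sum_k \delta(f + k)$ | $2\pi \sum_k \delta(\omega + 2\pi k)$ |
| $e^{j\omega_0 n}$ | $\sum_k \delta(f - \omega_0/2\pi + k)$ | $2\pi \sum_k \delta(\omega - \omega_0 + 2\pi k)$ |
| $e^{j2\pi f_0 n}$ | $\sum_k \delta(f - f_0 + k)$ | $2\pi \sum_k \delta(\omega - 2\pi f_0 + 2\pi k)$ |
| $\cos(\omega_0 n + \phi)$ | $\frac{1}{2} \sum_k [e^{j\phi}\delta(f - \omega_0/2\pi + k) + e^{-j\phi}\delta(f + \omega_0/2\pi + k)]$ | $\pi \sum_k [e^{j\phi}\delta(\omega - \omega_0 + 2\pi k) + e^{-j\phi}\delta(\omega + \omega_0 + 2\pi k)]$ |
| $\cos(2\pi f_0 n + \phi)$ | $\frac{1}{2} \sum_k [e^{j\phi}\delta(f - f_0 + k) + e^{-j\phi}\delta(f + f_0 + k)]$ | $\pi \sum_k [e^{j\phi}\delta(\omega - 2\pi f_0 + 2\pi k) + e^{-j\phi}\delta(\omega + 2\pi f_0 + 2\pi k)]$ |
| $a^n u(n)$, $\lvert a \rvert < 1$ | $\frac{1}{1 - a e^{-j2\pi f}}$ | $\frac{1}{1 - a e^{-j\omega}}$ |
| $(n + 1) a^n u(n)$, $\lvert a \rvert < 1$ | $\frac{1}{(1 - a e^{-j2\pi f})^2}$ | $\frac{1}{(1 - a e^{-j\omega})^2}$ |
| $a^{\lvert n \rvert}$, $\lvert a \rvert < 1$ | $\frac{1 - a^2}{1 + a^2 - 2a\cos(2\pi f)}$ | $\frac{1 - a^2}{1 + a^2 - 2a\cos(\omega)}$ |
| Box: 1 for $n \in [-N, N]$ | $\frac{\sin(\pi f (2N+1))}{\sin(\pi f)}$ | $\frac{\sin(\omega(2N+1)/2)}{\sin(\omega/2)}$ |
| $\frac{\sin(Wn)}{\pi n}$; $W \in [0, \pi]$ | Periodic (1): 1 for $f \in [-W/2\pi, W/2\pi]$ | Periodic ($2\pi$): 1 for $\omega \in [-W, W]$ |
| $\sum_m \delta(n - mN)$ | $\frac{1}{N} \sum_k \delta(f - k/N)$ | $\frac{2\pi}{N} \sum_k \delta(\omega - 2\pi k/N)$ |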

Table A.5: Useful Discrete-Time Fourier Transform Pairs


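| $x(t)$ | $X(s)$ | ROC |
|---|---|---|
| $\delta(t)$ | $1$ | All $s$ |
| $u(t)$ | $\frac{1}{s}$ | $\mathrm{Re}(s) > 0$ |
| $-u(-t)$ | $\frac{1}{s}$ | $\mathrm{Re}(s) < 0$ |
| $e^{-\alpha t} u(t)$ | $\frac{1}{s + \alpha}$ | $\mathrm{Re}(s) > -\alpha$ |
| $-e^{-\alpha t} u(-t)$ | $\frac{1}{s + \alpha}$ | $\mathrm{Re}(s) < -\alpha$ |
| $e^{-\alpha \lvert t \rvert}$, $\alpha > 0$ | $\frac{2\alpha}{\alpha^2 - s^2}$ | $\lvert \mathrm{Re}(s) \rvert < \alpha$ |
| $\frac{t^{n-1}}{(n-1)!} e^{-\alpha t} u(t)$ | $\frac{1}{(s + \alpha)^n}$ | $\mathrm{Re}(s) > -\alpha$ |
| $-\frac{t^{n-1}}{(n-1)!} e^{-\alpha t} u(-t)$ | $\frac{1}{(s + \alpha)^n}$ | $\mathrm{Re}(s) < -\alpha$ |
| $\delta(t - T)$ | $e^{-sT}$ | All $s$ |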

Table A.6: Useful Laplace Transform Pairs

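| $f(k)$ | $F(z)$ | ROC |
|---|---|---|
| $\delta(k)$ | $1$ | All $z$ |
| $u(k)$ | $(1 - z^{-1})^{-1}$ | $1 < \lvert z \rvert$ |
| $k\, u(k)$ | $\frac{z^{-1}}{(1 - z^{-1})^2}$ | $1 < \lvert z \rvert$ |
| $k^n u(k)$ | $\left( -z \frac{d}{dz} \right)^n \frac{1}{1 - z^{-1}}$ | $1 < \lvert z \rvert$ |
| $\binom{k}{n}$, $n \leq k$ | $\frac{z^{-n}}{(1 - z^{-1})^{n+1}}$ | $1 < \lvert z \rvert$ |
| $\binom{n}{k}$, $0 \leq k \leq n$ | $(1 + z^{-1})^n$ | $0 < \lvert z \rvert$ |
| $\alpha^k u(k)$ | $\frac{1}{1 - \alpha z^{-1}}$ | $\lvert \alpha \rvert < \lvert z \rvert$ |
| $k^n \alpha^k u(k)$ | $\left( -z \frac{d}{dz} \right)^n \frac{1}{1 - \alpha z^{-1}}$ | $\lvert \alpha \rvert < \lvert z \rvert$ |
| $\alpha^k u(-k - 1)$ | $\frac{-1}{1 - \alpha z^{-1}}$ | $\lvert z \rvert < \lvert \alpha \rvert$ |
| $k^n \alpha^k u(-k - 1)$ | $-\left( -z \frac{d}{dz} \right)^n \frac{1}{1 - \alpha z^{-1}}$ | $\lvert z \rvert < \lvert \alpha \rvert$ |
| $\alpha^{\lvert k \rvert}$ | $\frac{1 - \alpha^2}{(1 - \alpha z)(1 - \alpha z^{-1})}$ | $\lvert \alpha \rvert < \lvert z \rvert < \lvert \frac{1}{\alpha} \rvert$ |
| $\frac{1}{k} u(k - 1)$ | $-\ln(1 - z^{-1})$ | $1 < \lvert z \rvert$ |
| $\cos(\alpha k) u(k)$ | $\frac{1 - \cos(\alpha) z^{-1}}{1 - 2\cos(\alpha) z^{-1} + z^{-2}}$ | $1 < \lvert z \rvert$ |
| $\sin(\alpha k) u(k)$ | $\frac{\sin(\alpha) z^{-1}}{1 - 2\cos(\alpha) z^{-1} + z^{-2}}$ | $1 < \lvert z \rvert$ |
| $(a\cos(\alpha k) + b\sin(\alpha k))\, u(k)$ | $\frac{a + (b\sin(\alpha) - a\cos(\alpha)) z^{-1}}{1 - 2\cos(\alpha) z^{-1} + z^{-2}}$ | $1 < \lvert z \rvert$ |
| $\cosh(\alpha k) u(k)$ | $\frac{1 - \cosh(\alpha) z^{-1}}{1 - 2\cosh(\alpha) z^{-1} + z^{-2}}$ | $\max\{\lvert e^{\alpha} \rvert, \lvert e^{-\alpha} \rvert\} < \lvert z \rvert$ |
| $\sinh(\alpha k) u(k)$ | $\frac{\sinh(\alpha) z^{-1}}{1 - 2\cosh(\alpha) z^{-1} + z^{-2}}$ | $\max\{\lvert e^{\alpha} \rvert, \lvert e^{-\alpha} \rvert\} < \lvert z \rvert$ |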

Table A.7: Useful Z-Transform Pairs
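Entries in a table of this kind can be spot-checked numerically by summing the defining series at a point inside the region of convergence. The sketch below does this for the pair $\alpha^k u(k) \leftrightarrow 1/(1 - \alpha z^{-1})$; the function names, the test point, and the number of terms are arbitrary choices made here.

```python
def partial_z_transform(alpha, z, terms=200):
    """Partial sum of sum_{k>=0} alpha^k z^(-k), the z-transform of alpha^k u(k)."""
    return sum((alpha ** k) * (z ** (-k)) for k in range(terms))

def closed_form(alpha, z):
    """Tabulated transform 1 / (1 - alpha z^-1), valid for |alpha| < |z|."""
    return 1.0 / (1.0 - alpha / z)

if __name__ == "__main__":
    z = complex(1.0, 1.0)   # a point with |z| > |alpha|
    print(partial_z_transform(0.5, z), closed_form(0.5, z))
```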

Appendix B

Partial-Fraction Expansions

In this appendix we examine the tool of partial-fraction expansions. Partial-fraction expansions provide a way of representing a rational function or transform as a sum of simpler terms. We first treat the continuous-time problem, then the discrete-time one.

B.1 Continuous-Time Signals

Consider the problem of inverting rational transforms, i.e. those of the form:

$$X(s) = \frac{a_m s^m + a_{m-1} s^{m-1} + \cdots + a_1 s + a_0}{s^n + d_{n-1} s^{n-1} + \cdots + d_1 s + d_0} \tag{B.1}$$

If $m \geq n$ we can use long division to reduce $X(s)$ to the sum of a polynomial in $s$ and a strictly proper rational function, as follows:

$$X(s) = c_{m-n} s^{m-n} + c_{m-n-1} s^{m-n-1} + \cdots + c_1 s + c_0 + X_p(s) \tag{B.2}$$

where $X_p(s)$ is a strictly proper rational function of $s$:

$$X_p(s) = \frac{\alpha_{n-1} s^{n-1} + \alpha_{n-2} s^{n-2} + \cdots + \alpha_1 s + \alpha_0}{s^n + d_{n-1} s^{n-1} + \cdots + d_1 s + d_0} \tag{B.3}$$

Thus, the inverse transform $x(t)$ of $X(s)$ is given by:

$$x(t) = c_{m-n} u_{m-n}(t) + c_{m-n-1} u_{m-n-1}(t) + \cdots + c_1 u_1(t) + c_0 u_0(t) + x_p(t) \tag{B.4}$$

where $x_p(t)$ is the inverse transform of $X_p(s)$ and $u_k(t)$ represents the generalized function of order $k$, i.e. $u_k(t) = \frac{d^k}{dt^k} \delta(t)$. The above is nothing more than a statement of the fact that we can always write a rational transform as the sum of a polynomial part, which is easy to invert, and a strictly proper part, whose inverse we discuss next.

To find $x_p(t)$ we may use partial fraction expansion, which allows us to write $X_p(s)$ as the sum of a number of simpler, single-pole components. This assumes that $X_p(s)$ is a strictly proper rational function. Suppose the denominator has $r$ distinct poles or roots $p_i$, each of multiplicity $k_i$, and that the denominator is factored as:

$$s^n + d_{n-1} s^{n-1} + \cdots + d_1 s + d_0 = (s - p_1)^{k_1} (s - p_2)^{k_2} \cdots (s - p_r)^{k_r} \tag{B.5}$$

Then $X_p(s)$ can always be rewritten as a partial fraction expansion as follows:

$$\begin{aligned} X_p(s) &= \frac{A_{11}}{(s - p_1)} + \frac{A_{12}}{(s - p_1)^2} + \cdots + \frac{A_{1k_1}}{(s - p_1)^{k_1}} \\ &\quad + \frac{A_{21}}{(s - p_2)} + \frac{A_{22}}{(s - p_2)^2} + \cdots + \frac{A_{2k_2}}{(s - p_2)^{k_2}} \\ &\quad \ \vdots \\ &\quad + \frac{A_{r1}}{(s - p_r)} + \frac{A_{r2}}{(s - p_r)^2} + \cdots + \frac{A_{rk_r}}{(s - p_r)^{k_r}} \\ &= \sum_{i=1}^{r} \sum_{j=1}^{k_i} \frac{A_{ij}}{(s - p_i)^j} \end{aligned} \tag{B.6}$$

The coefficients $A_{ij}$ can be obtained by equating the two expressions (B.6) and (B.3), clearing the denominators and matching like powers of $s$. Alternatively, a closed form expression for the coefficients is given by:

$$A_{ij} = \frac{1}{(k_i - j)!} \left[ \frac{d^{k_i - j}}{ds^{k_i - j}} (s - p_i)^{k_i} X_p(s) \right] \Bigg|_{s = p_i} \tag{B.7}$$

The inverse transform of (B.6) can then be obtained on a term-by-term basis, since we have split it into simpler terms.

Example B.1 Suppose the signal transform is given by:

$$X(s) = \frac{s + 2}{(s + 1)^2 (s + 3)} \tag{B.8}$$

The transform is already strictly proper so no long division is needed in this case. The partial fraction expansion is given by:

$$X(s) = \frac{A_{11}}{s + 1} + \frac{A_{12}}{(s + 1)^2} + \frac{A_{21}}{s + 3} \tag{B.9}$$

The coefficients are given by:

$$A_{11} = \frac{1}{(2 - 1)!} \frac{d}{ds} \left[ (s + 1)^2 X(s) \right] \Big|_{s=-1} = \frac{1}{4} \tag{B.10}$$

$$A_{12} = \left[ (s + 1)^2 X(s) \right] \Big|_{s=-1} = \frac{1}{2} \tag{B.11}$$

$$A_{21} = \left[ (s + 3) X(s) \right] \Big|_{s=-3} = -\frac{1}{4} \tag{B.12}$$

Therefore, we have that:

$$X(s) = \frac{s + 2}{(s + 1)^2 (s + 3)} = \frac{\frac{1}{4}}{s + 1} + \frac{\frac{1}{2}}{(s + 1)^2} - \frac{\frac{1}{4}}{s + 3} \tag{B.13}$$

Taking inverse transforms, we find:

$$x(t) = \left[ \frac{1}{4} e^{-t} + \frac{1}{2} t e^{-t} - \frac{1}{4} e^{-3t} \right] u_{-1}(t) \tag{B.14}$$
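The expansion in Example B.1 can be verified with exact rational arithmetic by evaluating both sides at a few test points. The sketch below uses Python's standard `fractions` module; the function names and test points are chosen here for illustration. After clearing denominators the two sides are polynomials of degree at most two, so agreement at three or more points away from the poles confirms the identity.

```python
from fractions import Fraction

def X(s):
    """Original transform (s + 2) / ((s + 1)^2 (s + 3))."""
    return (s + 2) / ((s + 1) ** 2 * (s + 3))

def X_expanded(s):
    """Partial fraction expansion with A11 = 1/4, A12 = 1/2, A21 = -1/4."""
    return (Fraction(1, 4) / (s + 1)
            + Fraction(1, 2) / (s + 1) ** 2
            - Fraction(1, 4) / (s + 3))

# Test points chosen away from the poles at s = -1 and s = -3.
points = [Fraction(0), Fraction(1), Fraction(5, 2), Fraction(-7)]
agree = all(X(p) == X_expanded(p) for p in points)

if __name__ == "__main__":
    print(agree)
```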

B.2 Discrete-Time Signals

We now turn our attention to the problem of inverting rational z-transforms, i.e. those of the form:

$$X(z) = \frac{a_m z^m + a_{m-1} z^{m-1} + \cdots + a_1 z + a_0}{z^q + d_{q-1} z^{q-1} + \cdots + d_1 z + d_0} \tag{B.15}$$

As in continuous time, if $m \geq q$ we can use long division to reduce $X(z)$ to the sum of a polynomial in $z$ and a strictly proper rational function of $z$, as follows:

$$X(z) = c_{m-q} z^{m-q} + c_{m-q-1} z^{m-q-1} + \cdots + c_1 z + c_0 + X_p(z) \tag{B.16}$$

where $X_p(z)$ is a strictly proper rational function of $z$:

$$X_p(z) = \frac{\alpha_{q-1} z^{q-1} + \alpha_{q-2} z^{q-2} + \cdots + \alpha_1 z + \alpha_0}{z^q + d_{q-1} z^{q-1} + \cdots + d_1 z + d_0} \tag{B.17}$$

Thus, the inverse transform $x(n)$ of $X(z)$ is given by:

$$x(n) = c_{m-q}\, \delta(n + m - q) + c_{m-q-1}\, \delta(n + m - q - 1) + \cdots + c_1\, \delta(n + 1) + c_0\, \delta(n) + x_p(n) \tag{B.18}$$

where xp(n) is the inverse transform of Xp(z). Notice that in the discrete case, the higher order generalized functions of the continuous case become positive time shifts.

Now to find $x_p(n)$ we may again use partial fraction expansion. This assumes that $X_p(z)$ is a strictly proper rational function, which it is by design. Again, suppose the denominator has $r$ distinct poles or roots $p_i$, each of multiplicity $k_i$, and that the denominator is factored as:

$$z^q + d_{q-1} z^{q-1} + \cdots + d_1 z + d_0 = (z - p_1)^{k_1} (z - p_2)^{k_2} \cdots (z - p_r)^{k_r} \tag{B.19}$$

Then $X_p(z)$ can always be rewritten as a partial fraction expansion as follows:

$$\begin{aligned} X_p(z) &= \frac{A_{11}}{(z - p_1)} + \frac{A_{12}}{(z - p_1)^2} + \cdots + \frac{A_{1k_1}}{(z - p_1)^{k_1}} \\ &\quad + \frac{A_{21}}{(z - p_2)} + \frac{A_{22}}{(z - p_2)^2} + \cdots + \frac{A_{2k_2}}{(z - p_2)^{k_2}} \\ &\quad \ \vdots \\ &\quad + \frac{A_{r1}}{(z - p_r)} + \frac{A_{r2}}{(z - p_r)^2} + \cdots + \frac{A_{rk_r}}{(z - p_r)^{k_r}} \\ &= \sum_{i=1}^{r} \sum_{j=1}^{k_i} \frac{A_{ij}}{(z - p_i)^j} \end{aligned} \tag{B.20}$$

As in the continuous case, the coefficients $A_{ij}$ can be obtained either by equating the two expressions (B.20) and (B.17), clearing the denominators and matching like powers of $z$, or by using the closed form expression for the coefficients given before:

$$A_{ij} = \frac{1}{(k_i - j)!} \left[ \frac{d^{k_i - j}}{dz^{k_i - j}} (z - p_i)^{k_i} X_p(z) \right] \Bigg|_{z = p_i} \tag{B.21}$$

The inverse transform of (B.20) can then be obtained on a term-by-term basis, since we have split it into simpler terms. The only additional tricky part is that discrete-time transform expressions are often given in terms of negative powers of z. For example, suppose m > q:

$$X(z) = \frac{a_m + a_{m-1} z^{-1} + \cdots + a_1 z^{1-m} + a_0 z^{-m}}{z^{q-m} + d_{q-1} z^{q-m-1} + \cdots + d_1 z^{1-m} + d_0 z^{-m}} \tag{B.22}$$

Clearly such expressions can be converted to the form in (B.15), and then the results above applied. Alternatively, the change of variables ν = z^{-1} can be made, the expansion done, then the variable changed back. We will illustrate both approaches below.

Example B.2 Suppose the signal transform is given by:

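$$X(z) = \frac{2}{1 - \frac{3}{4} z^{-1} + \frac{1}{8} z^{-2}} = \frac{2z^2}{z^2 - \frac{3}{4} z + \frac{1}{8}} = \frac{2z^2}{\left(z - \frac{1}{2}\right)\left(z - \frac{1}{4}\right)} \tag{B.23}$$

The partial fraction expansion of the term in brackets is given by:

$$X(z) = z^2 \left[ \frac{A_{11}}{z - \frac{1}{2}} + \frac{A_{21}}{z - \frac{1}{4}} \right] \tag{B.24}$$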


The coefficients are given by:

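$$A_{11} = \left. \left[ \left( z - \tfrac{1}{2} \right) \frac{X(z)}{z^2} \right] \right|_{z=\frac{1}{2}} = \frac{2}{\frac{1}{2} - \frac{1}{4}} = 8 \tag{B.25}$$

$$A_{21} = \left. \left[ \left( z - \tfrac{1}{4} \right) \frac{X(z)}{z^2} \right] \right|_{z=\frac{1}{4}} = \frac{2}{\frac{1}{4} - \frac{1}{2}} = -8 \tag{B.26}$$

Here X(z)/z^2 is the bracketed term of (B.24), i.e. the part of X(z) that remains after the factor z^2 has been pulled out.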

Therefore, we have that:

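$$X(z) = z^2 \left[ \frac{8}{z - \frac{1}{2}} - \frac{8}{z - \frac{1}{4}} \right] = z \left[ \frac{8z}{z - \frac{1}{2}} - \frac{8z}{z - \frac{1}{4}} \right] = 8z \left[ \frac{1}{1 - \frac{1}{2} z^{-1}} - \frac{1}{1 - \frac{1}{4} z^{-1}} \right] \tag{B.27}$$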

Taking inverse transforms, we find:

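$$x(n) = 8 \left[ \left( \tfrac{1}{2} \right)^{n+1} u_{-1}(n+1) - \left( \tfrac{1}{4} \right)^{n+1} u_{-1}(n+1) \right] = 4 \left( \tfrac{1}{2} \right)^{n} u_{-1}(n) - 2 \left( \tfrac{1}{4} \right)^{n} u_{-1}(n) \tag{B.28}$$

The second equality holds because the bracketed term vanishes at n = −1, so the step u−1(n + 1) can be replaced by u−1(n).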

Now the alternative way to solve this is to make the substitution ν = z−1 at the outset:

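$$X(\nu) = \frac{2}{1 - \frac{3}{4}\nu + \frac{1}{8}\nu^2} = \frac{2}{\left(1 - \frac{1}{2}\nu\right)\left(1 - \frac{1}{4}\nu\right)} = \frac{2}{\left(1 - \frac{1}{2} z^{-1}\right)\left(1 - \frac{1}{4} z^{-1}\right)} \tag{B.29}$$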

The partial fraction expansion is given by:

$$X(\nu) = \frac{B_{11}}{1 - \frac{1}{2}\nu} + \frac{B_{21}}{1 - \frac{1}{4}\nu} = \frac{B_{11}}{1 - \frac{1}{2} z^{-1}} + \frac{B_{21}}{1 - \frac{1}{4} z^{-1}} \tag{B.30}$$

The coefficients are given by:

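$$B_{11} = \left. \left[ \left( 1 - \tfrac{1}{2}\nu \right) X(\nu) \right] \right|_{\nu=2} = \frac{2}{1 - \frac{1}{2}} = 4 \tag{B.31}$$

$$B_{21} = \left. \left[ \left( 1 - \tfrac{1}{4}\nu \right) X(\nu) \right] \right|_{\nu=4} = \frac{2}{1 - 2} = -2 \tag{B.32}$$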

Therefore, we have that:

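$$X(\nu) = \frac{4}{1 - \frac{1}{2}\nu} - \frac{2}{1 - \frac{1}{4}\nu} \tag{B.33}$$

Equivalently, changing variables back via ν = z^{-1}:

$$X(z) = \frac{4}{1 - \frac{1}{2} z^{-1}} - \frac{2}{1 - \frac{1}{4} z^{-1}} \tag{B.34}$$

Taking inverse transforms, we find:

$$x(n) = 4 \left( \tfrac{1}{2} \right)^{n} u_{-1}(n) - 2 \left( \tfrac{1}{4} \right)^{n} u_{-1}(n) \tag{B.35}$$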

as before.
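
As a sanity check on Example B.2, the sketch below compares the closed-form answer with samples generated directly from the difference equation implied by the transform. It assumes the causal (right-sided) inverse, consistent with the unit steps u−1(n) used above; the function names are ours and are purely illustrative.

```python
from fractions import Fraction


def impulse_response(num_terms: int) -> list[Fraction]:
    """First samples of the causal x(n) with X(z) = 2 / (1 - 3/4 z^-1 + 1/8 z^-2).

    The transform corresponds to the recursion
    x(n) = 3/4 x(n-1) - 1/8 x(n-2) + 2 delta(n), with x(n) = 0 for n < 0.
    """
    x: list[Fraction] = []
    for n in range(num_terms):
        prev1 = x[n - 1] if n >= 1 else Fraction(0)
        prev2 = x[n - 2] if n >= 2 else Fraction(0)
        drive = Fraction(2) if n == 0 else Fraction(0)
        x.append(Fraction(3, 4) * prev1 - Fraction(1, 8) * prev2 + drive)
    return x


def closed_form(n: int) -> Fraction:
    """Partial-fraction result: 4 (1/2)^n - 2 (1/4)^n for n >= 0."""
    return 4 * Fraction(1, 2) ** n - 2 * Fraction(1, 4) ** n


samples = impulse_response(8)
matches = all(samples[n] == closed_form(n) for n in range(8))
print(matches)
```

Exact rational arithmetic is used so that the comparison is an equality test rather than a floating-point tolerance check.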

Appendix C

Summary of Linear Algebra

Linear algebra is concerned with the solution of systems of simultaneous linear equations. The linear nature of these equations leads naturally to both a convenient notation and a deep connection with the properties of vectors and matrices. These notes are intended to provide a summary and review of the important concepts and notation that arise.

C.1 Vectors and Matrices

A vector is just an array of numbers stacked together:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{C.1}$$

where x1,x2, . . . ,xn may be either real or complex numbers. We often denote such column vectors by underlined lowercase letters, as shown in (C.1). The set of all n-dimensional vectors of real numbers is usually denoted by Rn while the set of all n-dimensional vectors of complex numbers is denoted by Cn. The transpose of a column vector x is a row vector:

xT = [x1,x2, . . . ,xn] (C.2)

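The sum of two vectors is defined on a component-by-component basis:

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix} \tag{C.3}$$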

Similarly, the product of a vector and a scalar is defined componentwise as:

$$\alpha \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_n \end{bmatrix} \tag{C.4}$$

where α is a real or complex number. A set of vectors {x1, . . . ,xr} in Rn or Cn is termed linearly independent if and only if

$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_r x_r = 0 \tag{C.5}$$

implies that

$$\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0 \tag{C.6}$$


where 0 is the n-vector of zeros. Otherwise one of the xi can be written as a linear combination of the others and the set of vectors is termed linearly dependent. In Rn we can have at most n linearly independent vectors in any given set. Conversely, given any set of n linearly independent vectors {x1, . . . ,xn} in Rn, any other vector can be written as a linear combination of the vectors x1, . . . ,xn. Any such set is termed a basis for Rn.

Given two vectors of the same length, x, y ∈ Rn, we can define the dot or inner product between the vectors:

$$x^T y = \langle x, y \rangle = \sum_{i=1}^{n} x_i y_i = y^T x \in \mathbb{R} \tag{C.7}$$

Two n-vectors x and y are termed orthogonal, denoted x ⊥ y if

xTy = 0 (C.8)

A set of nonzero, mutually orthogonal vectors is always linearly independent. The length or standard norm of the vector x ∈ Rn is

$$\|x\| = \sqrt{x^T x} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \tag{C.9}$$

Note that the inner product provides information about the angle between the vectors. In particular, xTy = ||x|| ||y||cos(∠(x,y)).
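
The inner product, norm, and angle relations above can be illustrated with a few lines of code. This is only a minimal sketch using plain Python lists as vectors; the helper names `inner`, `norm`, and `angle` are our own.

```python
import math


def inner(x: list[float], y: list[float]) -> float:
    """Inner product x^T y of two equal-length real vectors, as in (C.7)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x: list[float]) -> float:
    """Standard norm ||x|| = sqrt(x^T x), as in (C.9)."""
    return math.sqrt(inner(x, x))


def angle(x: list[float], y: list[float]) -> float:
    """Angle in radians between nonzero x and y, from x^T y = ||x|| ||y|| cos(angle)."""
    c = inner(x, y) / (norm(x) * norm(y))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp to guard against rounding


x = [3.0, 4.0]
y = [-4.0, 3.0]
e1 = [1.0, 0.0]
w = [1.0, 1.0]

print(inner(x, y))   # zero inner product: x and y are orthogonal
print(norm(x))       # length of x
print(angle(e1, w))  # 45 degrees, expressed in radians
```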

As in the case of vectors, matrices are simply arrays of numbers in a regular grid:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \tag{C.10}$$

where a11, . . . ,amn may be either real or complex numbers. We can see that vectors are matrices with a special form – they only have a single column or row. Matrices are often denoted by capital letters. The element in the i-th row and j-th column of A will be denoted by aij, [A]ij, or (A)ij, depending on the situation. If A has m rows and n columns we say that A is an m×n matrix. The set of all m×n real-valued matrices is denoted Rm×n while the set of all m×n complex-valued matrices is denoted Cm×n.

If m = n, A is a square matrix. The transpose of an m×n matrix A is the n×m matrix AT whose (i, j) element is aji, i.e. rows are exchanged for columns and vice versa. With A defined as in (C.10) we have:

$$A^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix} \tag{C.11}$$

A square matrix is said to be symmetric if AT = A. A diagonal square matrix only has nonzero entries along its diagonal and is of the form

$$A = \begin{bmatrix} a_1 & 0 & 0 & \cdots & 0 \\ 0 & a_2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_n \end{bmatrix} \tag{C.12}$$

This is sometimes denoted as diag(a1, . . . ,an). The identity matrix is denoted by I and is the diagonal matrix with ones along its diagonal:

I = diag(1, ..., 1) (C.13)

On occasion we will write In to make clear the size (i.e. n×n) of the identity matrix. The trace of a square matrix A is the sum of its diagonal elements:

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} \tag{C.14}$$


In particular, we have for a square matrix A that tr(A) = tr(AT ). Note that

$$\|x\|^2 = x^T x = \mathrm{tr}\left( x^T x \right) = \mathrm{tr}\left( x x^T \right) \tag{C.15}$$

We now consider operations involving matrices. As with vectors we define addition between matrices and multiplication of a matrix by a scalar on a component by component basis:

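$$A + B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} \tag{C.16}$$

$$\phantom{A + B} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix} \tag{C.17}$$

$$\alpha A = \begin{bmatrix} \alpha a_{11} & \alpha a_{12} & \cdots & \alpha a_{1n} \\ \alpha a_{21} & \alpha a_{22} & \cdots & \alpha a_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha a_{m1} & \alpha a_{m2} & \cdots & \alpha a_{mn} \end{bmatrix} \tag{C.18}$$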

Let A be an m×n matrix and B an n×p matrix. Then the matrix product of A and B is denoted by C = AB where C is an m×p matrix whose elements are given by

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \tag{C.19}$$

Note that A and B must satisfy some dimensional constraints for the above expression to make any sense. In particular, the number of columns of A must equal the number of rows of B for AB to be defined. One implication is that BA may not be defined even if AB is (consider the case of m = 2, n = 3, p = 4). Note also that even if BA is defined it may not be of the same size as AB. For example, if A is 2 × 3 and B is 3 × 2, then AB is 2 × 2, but BA is 3 × 3. In general, AB ≠ BA, so the order of matrix multiplication is very important. Some other important relationships are:

AI = IA = A (C.20)

so the identity matrix is the identity element of matrix multiplication. It can be verified that the transpose operation behaves as follows:

(AB)T = BTAT (C.21)

Also, if A ∈ Rm×n and x ∈ Rn then Ax ∈ Rm. In addition, if both AB and BA are defined,

tr(AB) = tr(BA) (C.22)

Further, note that the tr is a linear operation, so that:

tr(A + B + C) = tr(A) + tr(B) + tr(C) (C.23)
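
The product rule (C.19), the size mismatch between AB and BA, and the identities (C.21) and (C.22) can be checked on a small example. The following is a sketch with matrices stored as lists of rows; `matmul`, `transpose`, and `trace` are illustrative helper names, not part of these notes.

```python
def matmul(a: list[list[int]], b: list[list[int]]) -> list[list[int]]:
    """Matrix product C = AB with c_ij = sum_k a_ik b_kj, as in (C.19)."""
    n = len(b)
    if any(len(row) != n for row in a):
        raise ValueError("columns of A must equal rows of B")
    p = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(a))]


def transpose(m: list[list[int]]) -> list[list[int]]:
    """Exchange rows and columns."""
    return [list(col) for col in zip(*m)]


def trace(m: list[list[int]]) -> int:
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]             # 3 x 2

AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3, so AB and BA cannot be equal

print(AB, BA)
print(trace(AB), trace(BA))                                   # equal, per (C.22)
print(transpose(AB) == matmul(transpose(B), transpose(A)))    # (C.21)
```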

Let x ∈ Rn and y ∈ Rm. Then the dyadic or outer product of the vectors x and y is the n×m matrix

$$x y^T = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_m \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_m \\ \vdots & \vdots & & \vdots \\ x_n y_1 & x_n y_2 & \cdots & x_n y_m \end{bmatrix} \tag{C.24}$$

On occasion we will find it useful to deal with matrices written in block form

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \tag{C.25}$$


where A11 is m1 × n1, A12 is m1 × n2, A21 is m2 × n1, and A22 is m2 × n2. The product of two matrices in block form is computed in a manner analogous to usual matrix multiplication, except that the blocks become the basic elements. For example:

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix} \tag{C.26}$$

where the two factors on the left-hand side must be partitioned in a compatible fashion, and the order of multiplication of the various terms on the right-hand side is important.

C.2 Matrix Inverses and Determinants

An n×n matrix A is invertible or nonsingular if the only solution of the equation Ax = 0 is x = 0. That is, the only vector producing zero output is the zero vector. If this is the case, then there exists another n×n matrix A−1, called the inverse of A, so that

AA−1 = A−1A = I (C.27)

If no such matrix exists A is termed non-invertible or singular. The property of invertibility is related to the solution of sets of equations. To see this, consider the set of equations

Ax = y (C.28)

where A is n × n. This equation has a unique solution x for any y if and only if A is invertible (in which case the solution is A−1y). Conversely, if A is singular, then there exists a non-zero vector xN such that AxN = 0. In this case, we can add any multiple of xN to a solution of (C.28) and produce another solution. Thus if A is singular, the system of equations will not have a unique solution.

The determinant of a square matrix A, denoted by |A| or det(A), can be defined recursively. If A is 1×1, then |A| = A. If A is n×n , then we can compute |A| by “expanding by minors” using any row or column. For example, using the i-th row:

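$$|A| = a_{i1}A_{i1} + a_{i2}A_{i2} + \cdots + a_{in}A_{in} \tag{C.29}$$

or using the j-th column

$$|A| = a_{1j}A_{1j} + a_{2j}A_{2j} + \cdots + a_{nj}A_{nj} \tag{C.30}$$

where the cofactor Aij is

$$A_{ij} = (-1)^{i+j} \det(M_{ij}) \tag{C.31}$$

and Mij is the (n − 1) × (n − 1) matrix obtained from A by deleting the i-th row and j-th column. For example

$$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21} \tag{C.32}$$

As a more complex example we compute, expanding along the first row,

$$\begin{aligned}
\begin{vmatrix} 2 & 0 & 0 & 3 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 5 & 1 & 1 & 9 \end{vmatrix}
&= 2(-1)^{1+1} \begin{vmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 9 \end{vmatrix}
+ 0(-1)^{1+2} \begin{vmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 5 & 1 & 9 \end{vmatrix}
+ 0(-1)^{1+3} \begin{vmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 5 & 1 & 9 \end{vmatrix}
+ 3(-1)^{1+4} \begin{vmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 5 & 1 & 1 \end{vmatrix} \\
&= 2(-1)^{1+1} \begin{vmatrix} 1 & 0 \\ 1 & 9 \end{vmatrix}
- 3(-1)^{1+1} \begin{vmatrix} 1 & 1 \\ 1 & 1 \end{vmatrix}
- 3(-1)^{1+2} \begin{vmatrix} 1 & 1 \\ 5 & 1 \end{vmatrix} \\
&= 2 \cdot 9 - 3 \cdot 0 + 3 \cdot (-4) = 6
\end{aligned} \tag{C.33}$$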
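
The recursive definition of the determinant translates almost directly into code. The sketch below always expands along the first row, which is one of the choices allowed by (C.29); it is meant to illustrate the definition, not to be efficient (the cost grows factorially with n). The names `minor` and `det` are ours.

```python
def minor(a: list[list[int]], i: int, j: int) -> list[list[int]]:
    """Matrix M_ij obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a: list[list[int]]) -> int:
    """Determinant by cofactor expansion along the first row, as in (C.29)-(C.31)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(n))


A = [[2, 0, 0, 3],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [5, 1, 1, 9]]

A_T = [list(col) for col in zip(*A)]

print(det(A))                 # the 4 x 4 example worked above
print(det(A) == det(A_T))     # |A^T| = |A|
```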

Several properties of determinants are

$$|AB| = |A|\,|B| \tag{C.34}$$

$$|\alpha A| = \alpha^n |A| \tag{C.35}$$

$$\left| A^T \right| = |A| \tag{C.36}$$

$$\left| A^{-1} \right| = \frac{1}{|A|} \tag{C.37}$$

The invertibility of a matrix A is equivalent to each of the following statements:

1. |A| ≠ 0

2. All of the columns of A are linearly independent.

3. All of the rows of A are linearly independent.

The inverse of A can be expressed as

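$$A^{-1} = \frac{1}{|A|} C^T \tag{C.38}$$

where Cij = Aij as defined in (C.31). For example

$$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \tag{C.39}$$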
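
Formula (C.38) can be exercised numerically. The following sketch builds the cofactor matrix C, transposes it, and divides by the determinant, using exact rational arithmetic so that A A−1 = I can be tested by equality. All function names are our own, and the approach is illustrative only; it is far slower than elimination-based methods for large n.

```python
from fractions import Fraction

Matrix = list[list[Fraction]]


def minor(a: Matrix, i: int, j: int) -> Matrix:
    """Delete row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]


def det(a: Matrix) -> Fraction:
    """Determinant by cofactor expansion along the first row."""
    n = len(a)
    if n == 0:
        return Fraction(1)  # convention so that the 1 x 1 cofactor equals 1
    if n == 1:
        return a[0][0]
    return sum(((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(n)),
               Fraction(0))


def inverse(a: Matrix) -> Matrix:
    """Inverse via A^{-1} = C^T / |A|, with C_ij = (-1)^(i+j) det(M_ij), as in (C.38)."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(a)
    cof = [[(-1) ** (i + j) * det(minor(a, i, j)) for j in range(n)]
           for i in range(n)]
    return [[cof[j][i] / d for j in range(n)] for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum((a[i][k] * b[k][j] for k in range(len(b))), Fraction(0))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n: int) -> Matrix:
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


A2 = [[Fraction(2), Fraction(1)], [Fraction(5), Fraction(3)]]
A3 = [[Fraction(2), Fraction(0), Fraction(1)],
      [Fraction(1), Fraction(1), Fraction(0)],
      [Fraction(0), Fraction(3), Fraction(1)]]

print(inverse(A2))                              # agrees with the 2 x 2 formula (C.39)
print(matmul(A3, inverse(A3)) == identity(3))   # A A^{-1} = I
```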

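Some properties of inverses are

$$\left( A^T \right)^{-1} = \left( A^{-1} \right)^T \tag{C.40}$$

$$(AB)^{-1} = B^{-1}A^{-1} \tag{C.41}$$

$$A = \mathrm{diag}(\mu_1, \ldots, \mu_n) \implies A^{-1} = \mathrm{diag}\left( \frac{1}{\mu_1}, \ldots, \frac{1}{\mu_n} \right) \tag{C.42}$$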

A matrix P is orthogonal if

$$P^{-1} = P^T \tag{C.43}$$

If we think of P as consisting of a set of columns, i.e.

$$P = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \tag{C.44}$$

then in general

$$P^T P = \begin{bmatrix} x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_n \\ x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_n \\ \vdots & \vdots & & \vdots \\ x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n \end{bmatrix} \tag{C.45}$$

Consequently, we see that P is orthogonal if and only if its columns are orthonormal, i.e. xi ⊥ xj for i ≠ j, and ||xi|| = 1.

There are also some useful results for block matrices. For example, for a block diagonal matrix

$$A = \mathrm{diag}(F_1, \ldots, F_r) \implies A^{-1} = \mathrm{diag}\left( F_1^{-1}, \ldots, F_r^{-1} \right) \tag{C.46}$$

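Also, the formulas

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \left( A_{11} - A_{12}A_{22}^{-1}A_{21} \right)^{-1} & -\left( A_{11} - A_{12}A_{22}^{-1}A_{21} \right)^{-1} A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}\left( A_{11} - A_{12}A_{22}^{-1}A_{21} \right)^{-1} & A_{22}^{-1} + A_{22}^{-1}A_{21}\left( A_{11} - A_{12}A_{22}^{-1}A_{21} \right)^{-1} A_{12}A_{22}^{-1} \end{bmatrix} \tag{C.47}$$

$$\det \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \left| A_{11} - A_{12}A_{22}^{-1}A_{21} \right| \, |A_{22}| \tag{C.48}$$

which are valid if A22 is nonsingular, are verified by noting that

$$\begin{bmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{bmatrix} = \begin{bmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0 & A_{22} \end{bmatrix} \tag{C.49}$$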

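Similarly, if A11 is nonsingular

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1}A_{12}\left( A_{22} - A_{21}A_{11}^{-1}A_{12} \right)^{-1} A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}\left( A_{22} - A_{21}A_{11}^{-1}A_{12} \right)^{-1} \\ -\left( A_{22} - A_{21}A_{11}^{-1}A_{12} \right)^{-1} A_{21}A_{11}^{-1} & \left( A_{22} - A_{21}A_{11}^{-1}A_{12} \right)^{-1} \end{bmatrix} \tag{C.50}$$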

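Comparison of the upper-left blocks of (C.47) and (C.50) yields the useful result

$$\left( A_{11} - A_{12}A_{22}^{-1}A_{21} \right)^{-1} = A_{11}^{-1} + A_{11}^{-1}A_{12}\left( A_{22} - A_{21}A_{11}^{-1}A_{12} \right)^{-1} A_{21}A_{11}^{-1} \tag{C.51}$$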

C.3 Eigenvalues and Eigenvectors

Let A be an n×n real matrix. A scalar λ is called an eigenvalue of A with associated nonzero eigenvector x if

Ax = λx (C.52)

The above equation can be rewritten as

(λI −A)x = 0 (C.53)

Thus λ is an eigenvalue of A if and only if (C.53) has a solution x ≠ 0. This will be the case if and only if λI − A is singular, i.e. if and only if λ is a solution of the characteristic equation

pA(λ) = |λI −A| = 0 (C.54)

Here pA(λ) is the characteristic polynomial of A and is of the form

$$p_A(\lambda) = \lambda^n + \alpha_{n-1}\lambda^{n-1} + \cdots + \alpha_1 \lambda + \alpha_0 = (\lambda - \lambda_1) \cdots (\lambda - \lambda_n) \tag{C.55}$$

Here λ1,λ2, . . . ,λn are the n eigenvalues, which may or may not be distinct. Some of the λi may in general be complex, in which case they occur in complex conjugate pairs. However, if A is symmetric, the λi are always real. Also note that

|A| = (−1)npA(0) = (−1)nα0 = λ1 · · ·λn (C.56)

so that A is invertible if and only if all of the eigenvalues of A are nonzero. In addition one can show that

tr(A) = −αn−1 = λ1 + λ2 + · · · + λn (C.57)

If λi is an eigenvalue of A, then we can determine an associated eigenvector by solving the set of linear equations

Ax = λix (C.58)

Note that if x is an eigenvector, so is αx for any nonzero scalar α. Consequently, we can always adjust the length of the eigenvectors arbitrarily. Note that eigenvectors associated with distinct eigenvalues are linearly independent, so each distinct λi has an eigenvector xi that is linearly independent of the eigenvectors of the other eigenvalues. If λi has multiplicity k > 1, i.e. if λi is a k-th order root of pA(λ), then there may be anywhere from 1 to k linearly independent eigenvectors associated with λi. If A is symmetric, however, there is always a full set of n linearly independent eigenvectors. Furthermore, these eigenvectors can be taken to be orthogonal and in fact orthonormal.
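
For a 2 × 2 symmetric matrix the characteristic equation is a quadratic, so the statements above can be checked by hand or with a short script. The sketch below assumes a matrix of the form [[a, b], [b, c]] with b ≠ 0 (a restriction we add so that the eigenvector formula is well defined); the function name `eig_sym_2x2` is hypothetical.

```python
import math


def eig_sym_2x2(a: float, b: float, c: float) -> list[tuple[float, tuple[float, float]]]:
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, c]], b != 0.

    The characteristic polynomial is lambda^2 - tr(A) lambda + |A|.
    """
    tr = a + c
    det = a * c - b * b
    disc = math.sqrt(tr * tr - 4.0 * det)  # equals sqrt((a-c)^2 + 4 b^2), always real
    pairs = []
    for lam in ((tr - disc) / 2.0, (tr + disc) / 2.0):
        v = (b, lam - a)                   # solves the first row of (A - lam I) v = 0
        length = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / length, v[1] / length)))
    return pairs


(lam1, v1), (lam2, v2) = eig_sym_2x2(2.0, 1.0, 2.0)

print(lam1, lam2)                        # both real, since A is symmetric
print(lam1 + lam2, lam1 * lam2)          # trace and determinant of A
print(v1[0] * v2[0] + v1[1] * v2[1])     # orthogonal eigenvectors
```

The sum and product of the eigenvalues reproduce tr(A) and |A|, in agreement with (C.57) and (C.56).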


C.4 Similarity Transformation

Let A be an n×n matrix, and let P be an invertible matrix of the same size. We can then define a similarity transformation of A

B = PAP−1 (C.59)

We sometimes say that “B is similar to A”. A similarity transformation corresponds essentially to a change of coordinates. Specifically, suppose

y = Ax (C.60)

and consider a change of coordinates u = Px, v = Py (C.61)

(so that each component of u, for example, is a weighted sum of components of x and vice versa, since x = P−1u). Then

v = Bu (C.62)

Note that

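$$p_B(\lambda) = |\lambda I - B| = \left| \lambda P P^{-1} - P A P^{-1} \right| = \left| P (\lambda I - A) P^{-1} \right| = |P| \, |\lambda I - A| \, \left| P^{-1} \right| = |\lambda I - A| = p_A(\lambda) \tag{C.63}$$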

so the eigenvalues of B and A are the same. Also

$$\mathrm{tr}(B) = \mathrm{tr}\left( P A P^{-1} \right) = \mathrm{tr}\left( P^{-1} P A \right) = \mathrm{tr}(A) \tag{C.64}$$

Suppose that the n×n matrix A has a full set of linearly independent eigenvectors x1, . . . ,xn, so that

Axi = λixi, i = 1, . . . ,n (C.65)

The existence of such a complete set of eigenvectors is guaranteed, for example, if the λi are all distinct or if A is symmetric.

We can rewrite (C.65) as one equation

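$$A \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & 0 & \cdots & 0 \\ 0 & \lambda_2 & 0 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \lambda_n \end{bmatrix} \tag{C.66}$$

Let P−1 = [ x1 x2 · · · xn ], which is invertible, since the columns x1, . . . ,xn are linearly independent. Then (C.66) implies that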

PAP−1 = diag(λ1,λ2, . . . ,λn) (C.67)

Note that if A is symmetric we can choose the xi to be orthonormal so that P −1 = PT .

C.5 Positive-Definite Matrices

A symmetric square matrix A is positive semidefinite, written A ≥ 0, if and only if

xTAx ≥ 0 (C.68)

for all vectors x. This matrix A is positive definite, written A > 0, if

xTAx > 0 for all x ≠ 0 (C.69)

It is not difficult to see that a positive semidefinite matrix is positive definite if and only if it is invertible. Some basic facts about positive semidefinite matrices are the following:


(i) If A ≥ 0 and B ≥ 0, then A + B ≥ 0, since

xT (A + B)x = xTAx + xTBx (C.70)

(ii) If either A or B in (i) is positive definite, then so is A + B. This again follows from (C.70).

(iii) If A > 0, then A−1 > 0, since

$$x^T A^{-1} x = \left( A^{-1} x \right)^T A \left( A^{-1} x \right) > 0 \quad \text{if } x \neq 0 \tag{C.71}$$

(iv) If Q ≥ 0 then FTQF ≥ 0 for any (not necessarily square) matrix F for which FTQF is defined. This follows from

xT (FTQF)x = (Fx)TQ(Fx) ≥ 0 (C.72)

(v) If Q > 0 and F is invertible, FTQF > 0. This also follows from (C.72).

One test for positive definiteness is Sylvester’s Test. Let

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{12} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{nn} \end{bmatrix} \tag{C.73}$$

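Then A is positive definite if and only if all of its leading principal minors are strictly positive:

$$a_{11} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{vmatrix} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{vmatrix} > 0, \qquad \text{etc.} \tag{C.74}$$

For positive semidefiniteness, the corresponding conditions with ≥ 0 are necessary but not sufficient. In that case A ≥ 0 if and only if every principal minor of A (the determinant of any submatrix formed from the same subset of rows and columns) is nonnegative. For example, diag(0, −1) has leading principal minors 0 and 0 but is not positive semidefinite.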

Let A = AT , and let P be the orthogonal matrix of eigenvectors so that

PAPT = diag(λ1,λ2, . . . ,λn) (C.75)

Then

$$x^T A x = x^T P^T \left( P A P^T \right) P x = \lambda_1 z_1^2 + \lambda_2 z_2^2 + \cdots + \lambda_n z_n^2 \tag{C.76}$$

where

$$z = P x \tag{C.77}$$

and we have used (C.75). From this we can conclude that

• A = AT is positive semidefinite if and only if all its eigenvalues are nonnegative.

• A = AT is positive definite if and only if all its eigenvalues are strictly positive.
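
The minor-based tests for definiteness can be sketched in code as follows. The positive-definite check uses the leading principal minors as in Sylvester's test. The semidefinite check uses all principal minors, which is a standard result that we assume here rather than one derived in these notes. Function names are illustrative.

```python
from itertools import combinations


def det(a: list[list[int]]) -> int:
    """Determinant by cofactor expansion along the first row."""
    n = len(a)
    if n == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j]
               * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(n))


def leading_minors(a: list[list[int]]) -> list[int]:
    """Determinants of the upper-left k x k blocks, k = 1..n."""
    return [det([row[:k] for row in a[:k]]) for k in range(1, len(a) + 1)]


def principal_minors(a: list[list[int]]) -> list[int]:
    """Determinants of all submatrices built from the same rows and columns."""
    n = len(a)
    return [det([[a[i][j] for j in idx] for i in idx])
            for k in range(1, n + 1)
            for idx in combinations(range(n), k)]


def is_positive_definite(a: list[list[int]]) -> bool:
    return all(m > 0 for m in leading_minors(a))


def is_positive_semidefinite(a: list[list[int]]) -> bool:
    return all(m >= 0 for m in principal_minors(a))


A1 = [[2, -1], [-1, 2]]   # positive definite (eigenvalues 1 and 3)
A2 = [[1, 1], [1, 1]]     # semidefinite but singular (eigenvalues 0 and 2)
A3 = [[0, 0], [0, -1]]    # leading minors are 0 and 0, yet x = (0, 1) gives x^T A x = -1

print(is_positive_definite(A1), is_positive_semidefinite(A1))
print(is_positive_definite(A2), is_positive_semidefinite(A2))
print(leading_minors(A3), is_positive_semidefinite(A3))
```

The matrix A3 shows why checking only the leading minors with ≥ 0 is not enough to conclude semidefiniteness.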

Note that this also shows that if A ≥ 0 then A has a square root matrix F so that

A = FTF (C.78)

and specifically from (C.75) we see that we can take

$$F = \mathrm{diag}\left( \sqrt{\lambda_1}, \sqrt{\lambda_2}, \ldots, \sqrt{\lambda_n} \right) P \tag{C.79}$$

Note that F in (C.79) is invertible if and only if A > 0. Also note that the square root matrix as defined in (C.78) is far from unique. Specifically, let Q be any orthogonal matrix, and let

F̂ = QF (C.80)

Then F̂TF̂ = FTQTQF = FTIF = FTF = A (C.81)


[Figure C.6: in the (x1, x2) plane, the vector d and the line of points x satisfying xTd = 0.]

C.6 Subspaces

A subset S ⊆ Rn is a subspace if S is closed under vector addition and scalar multiplication. Examples of subspaces of R2 are1

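$$S_1 = \left\{ \begin{bmatrix} a \\ 0 \end{bmatrix} \;\middle|\; a \in \mathbb{R} \right\} \tag{C.82}$$

$$S_2 = \left\{ \begin{bmatrix} a \\ 2a \end{bmatrix} \;\middle|\; a \in \mathbb{R} \right\} \tag{C.83}$$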

The dimension of a subspace S equals the maximum number of vectors in S that can form a linearly independent set.

Let K be any subset of Rn. The orthogonal complement of K is defined as follows:

$$K^{\perp} = \left\{ x \in \mathbb{R}^n \mid x \perp y \;\; \forall y \in K \right\} \tag{C.84}$$

K⊥ is a subspace whether or not K is, since if x1, x2 ∈ K⊥ and y ∈ K, then

$$(x_1 + x_2)^T y = x_1^T y + x_2^T y = 0 \tag{C.85}$$

$$(\alpha x_1)^T y = \alpha x_1^T y = 0 \tag{C.86}$$

so (x1 + x2) ∈ K⊥ and αx1 ∈ K⊥.

Let d be a single nonzero vector in Rn and consider {d}⊥. This is a subspace of dimension n − 1. For example, as illustrated in Figure C.6, when n = 2 the set of x such that dTx = 0 is a line through the origin perpendicular to d. In 3 dimensions this set is a plane through the origin, again perpendicular to d. Note that the subspace {d}⊥ splits Rn into two half-spaces, one corresponding to those x for which dTx > 0, the other to those x for which dTx < 0.

C.7 Vector Calculus

First consider a vector-valued function of a scalar real variable, denoted f(x). Calculus operations for functions of this type are defined component-wise, as follows:

$$\frac{d}{dx} f(x) = \begin{bmatrix} \frac{d}{dx} f_1(x) \\ \frac{d}{dx} f_2(x) \\ \vdots \\ \frac{d}{dx} f_M(x) \end{bmatrix} \tag{C.87}$$

Conversely, now consider a scalar valued function of a vector argument, i.e. a function of n-real variables

$$f(x) = f(x_1, \ldots, x_n) = f\left( \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \right) \tag{C.88}$$

1Here R denotes the set of real numbers.


Partial derivatives, integrals, etc., are defined in terms of the vector:

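$$\frac{\partial f}{\partial x}(x) = f_x(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix} \tag{C.89}$$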

The second-order derivative can also be defined

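$$\frac{\partial^2 f}{\partial x^2}(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1}(x) & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n}(x) \end{bmatrix} \tag{C.90}$$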

Finally, let f(x) be an m× 1 vector-valued function of the n× 1 vector variable x. We can define:

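$$\frac{\partial f}{\partial x}(x) = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}(x) & \cdots & \frac{\partial f_m}{\partial x_1}(x) \\ \vdots & & \vdots \\ \frac{\partial f_1}{\partial x_n}(x) & \cdots & \frac{\partial f_m}{\partial x_n}(x) \end{bmatrix} = f_x(x) \tag{C.91}$$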

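If f(X) is a scalar function of the m×n matrix X, then the derivative of f(X) with respect to X is given by:

$$\frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial X_{11}} & \cdots & \frac{\partial f}{\partial X_{1n}} \\ \vdots & & \vdots \\ \frac{\partial f}{\partial X_{m1}} & \cdots & \frac{\partial f}{\partial X_{mn}} \end{bmatrix} \tag{C.92}$$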

If F(X) is a p×q matrix and X is an m×n matrix, then the derivative of F(X) with respect to X is given by:

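$$\frac{\partial F}{\partial X} = \begin{bmatrix} \frac{\partial F}{\partial X_{11}} & \cdots & \frac{\partial F}{\partial X_{1n}} \\ \vdots & & \vdots \\ \frac{\partial F}{\partial X_{m1}} & \cdots & \frac{\partial F}{\partial X_{mn}} \end{bmatrix} \tag{C.93}$$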

Examples of common calculations involving a vector x and matrices A, B, and X include:

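$$\frac{\partial x}{\partial x} = I \tag{C.94}$$

$$\frac{\partial}{\partial x}(Ax) = A^T \tag{C.95}$$

$$\frac{\partial}{\partial x}\left( x^T A \right) = A \tag{C.96}$$

$$\frac{\partial}{\partial x}\left( x^T A x \right) = \left( A + A^T \right) x \tag{C.97}$$

$$\frac{\partial^2}{\partial x^2}\left( x^T A x \right) = A + A^T \tag{C.98}$$

$$\frac{\partial\, \mathrm{tr}(AX)}{\partial X} = A^T \tag{C.99}$$

$$\frac{\partial\, \mathrm{tr}\left( X^T A \right)}{\partial X} = A \tag{C.100}$$

$$\frac{\partial\, \mathrm{tr}\left( X^T A X B \right)}{\partial X} = A X B + A^T X B^T \tag{C.101}$$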
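
Identities of this kind are easy to spot-check numerically. As one illustration, the sketch below compares the gradient formula (C.97) for the quadratic form xTAx against a central finite-difference approximation. The step size h and the test matrix are arbitrary choices of ours, and a deliberately non-symmetric A is used so that the A + AT term matters.

```python
def quad_form(a: list[list[float]], x: list[float]) -> float:
    """Scalar function f(x) = x^T A x."""
    n = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(n) for j in range(n))


def analytic_gradient(a: list[list[float]], x: list[float]) -> list[float]:
    """Gradient (A + A^T) x from (C.97)."""
    n = len(x)
    return [sum((a[i][j] + a[j][i]) * x[j] for j in range(n)) for i in range(n)]


def numeric_gradient(f, x: list[float], h: float = 1e-5) -> list[float]:
    """Central-difference approximation of the gradient vector in (C.89)."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad


A = [[1.0, 2.0],
     [0.0, 3.0]]          # not symmetric on purpose
x0 = [1.0, -2.0]

g_exact = analytic_gradient(A, x0)
g_approx = numeric_gradient(lambda v: quad_form(A, v), x0)

print(g_exact)
print(g_approx)
```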

Appendix D

The non-zero mean case

Here we consider LLSE of stochastic processes in the non-zero mean case and show that the approach of estimating the mean-subtracted process X̃(t) = X(t) − mx(t) based on the mean-subtracted observation Ỹ (τ) = Y (τ) − my(τ) really does produce the correct results. For simplicity we consider the vector case. Suppose we want to estimate the non-zero mean vector X based on the non-zero mean vector Y. By assumption the estimate will be of the form:

x̂ = Ly + b (D.1)

and our task reduces to finding L and b. The solution is given by the orthogonality conditions:

$$E[\hat{x}] = E[x] \tag{D.2}$$

$$E\left[ (x - \hat{x})\, y^T \right] = 0 \tag{D.3}$$

Using the first equation we find that: b = mx −Lmy (D.4)

i.e. b depends only on the means and is zero in the zero-mean case. Further, the form of the estimate can now be seen to be:

$$\hat{x} = m_x + L\left( y - m_y \right) \tag{D.5}$$

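Now applying the second orthogonality constraint, we know that e ⊥ y, where e = (x − x̂). That is:

$$\begin{aligned}
0 &= E\left[ (x - \hat{x})\, y^T \right] && \text{(D.6)} \\
&= E\left[ \left\{ x - m_x - L(y - m_y) \right\} \left\{ (y - m_y) + m_y \right\}^T \right] && \text{(D.7)} \\
&= E\left[ \left\{ x - m_x - L(y - m_y) \right\} (y - m_y)^T \right] + \underbrace{E\left[ \left\{ x - m_x - L(y - m_y) \right\} \right]}_{=\,0} m_y^T && \text{(D.8)} \\
&= K_{xy} - L K_{yy} && \text{(D.9)} \\
\implies L &= K_{xy} K_{yy}^{-1} && \text{(D.10)}
\end{aligned}$$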

Thus we have that

$$\hat{x} = m_x + K_{xy} K_{yy}^{-1} (y - m_y) \tag{D.11}$$

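Now let us find the estimation error covariance. This is given by:

$$\begin{aligned}
E\left[ e e^T \right] &= E\left[ (x - \hat{x})(x - \hat{x})^T \right] = E\left[ (x - \hat{x}) \left\{ x - m_x - L(y - m_y) \right\}^T \right] && \text{(D.12)} \\
&= E\left[ (x - \hat{x})(x - m_x)^T \right] - \underbrace{E\left[ (x - \hat{x})\, y^T \right]}_{=\,0 \text{ since } e \perp y} L^T + \underbrace{E\left[ (x - \hat{x}) \right]}_{=\,0 \text{ since } m_x = E[\hat{x}]} m_y^T L^T && \text{(D.13)} \\
&= E\left[ \left\{ (x - m_x) - L(y - m_y) \right\} (x - m_x)^T \right] && \text{(D.14)} \\
&= K_{xx} - L K_{yx} = K_{xx} - K_{xy} K_{yy}^{-1} K_{xy}^T && \text{(D.15)}
\end{aligned}$$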


Note that this is the same result as for the zero-mean case.

Finally, let us compare these results to what we would find by estimating X̃(t) = X(t) − mx(t) based on Ỹ (τ) = Y (τ) − my(τ) and substituting the definitions of X̃(t) and Ỹ (τ) in at the end. Note that:

$$\hat{\tilde{x}} = K_{\tilde{x}\tilde{y}} K_{\tilde{y}\tilde{y}}^{-1}\, \tilde{y} \tag{D.16}$$

But

$$K_{\tilde{x}\tilde{y}} = K_{xy}, \qquad K_{\tilde{y}\tilde{y}} = K_{yy}, \qquad K_{\tilde{x}\tilde{x}} = K_{xx} \tag{D.17}$$

Thus

$$(\hat{x} - m_x) = K_{xy} K_{yy}^{-1} (y - m_y) \tag{D.18}$$

Note that this is the same estimate we obtained by direct calculation. Thus indeed estimating (x − mx) based on (y −my) then adding the means back in produces the same result. Finally, we can calculate the error covariance:

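$$\begin{aligned}
\Lambda_L = E\left[ \tilde{e} \tilde{e}^T \right] &= K_{\tilde{x}\tilde{x}} - K_{\tilde{x}\tilde{y}} K_{\tilde{y}\tilde{y}}^{-1} K_{\tilde{x}\tilde{y}}^T && \text{(D.19)} \\
&= K_{xx} - K_{xy} K_{yy}^{-1} K_{xy}^T && \text{(D.20)}
\end{aligned}$$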

Again this is the same result we obtained via direct calculation.
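
To make the result concrete, the sketch below works a scalar instance of the non-zero mean LLSE. The joint distribution of (x, y) is a small made-up probability mass function chosen only for illustration; it does not come from these notes. The script computes L and b from (D.10) and (D.4), then checks the orthogonality conditions (D.2) and (D.3) and the error variance formula (D.15) using exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of scalar (x, y): four equally likely points.
pmf = {(0, 0): F(1, 4), (1, 1): F(1, 4), (2, 1): F(1, 4), (3, 4): F(1, 4)}


def expect(g) -> F:
    """Expectation of g(x, y) under the joint pmf."""
    return sum((p * g(x, y) for (x, y), p in pmf.items()), F(0))


m_x = expect(lambda x, y: x)
m_y = expect(lambda x, y: y)
K_xy = expect(lambda x, y: (x - m_x) * (y - m_y))
K_yy = expect(lambda x, y: (y - m_y) ** 2)
K_xx = expect(lambda x, y: (x - m_x) ** 2)

L = K_xy / K_yy          # (D.10), scalar case
b = m_x - L * m_y        # (D.4)


def estimate(y) -> F:
    """LLSE x_hat = L y + b, equivalently m_x + L (y - m_y)."""
    return L * y + b


mean_error = expect(lambda x, y: x - estimate(y))            # (D.2): should be 0
cross_term = expect(lambda x, y: (x - estimate(y)) * y)      # (D.3): should be 0
err_var_direct = expect(lambda x, y: (x - estimate(y)) ** 2)
err_var_formula = K_xx - K_xy * K_xy / K_yy                  # (D.15), scalar case

print(L, b)
print(mean_error, cross_term)
print(err_var_direct, err_var_formula)
```

The directly computed mean-square error agrees with the covariance formula, as the derivation above predicts.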
