Probability


Statistical Inference and Useful Limits

Statistical Inference
■  Why? We need to draw conclusions from the information contained in the data we obtain in our experiments.
■  Often, we cannot consider all the data of the group at hand (called the population) and must draw conclusions from a sample of such data.
■  Hence, we are interested in sample statistics.
■  Statistical inference: To infer results for a population from its samples:
    ❐  Model of population is known: hypothesis testing, random-variable estimation, confidence interval estimation, significance testing
    ❐  Model of population is unknown: parameter estimation
■  Sampling: Process of obtaining meaningful samples.

Examples
■  Draw conclusions about the heights, weights, pizza eating preferences, or other interesting facts of the student population in all UC campuses by examining only hundreds of students.
■  Draw conclusions about the fairness of a coin by tossing it repeatedly. The population is the set of all possible tosses of the coin. A sample is the first 1000 tosses, for which we note the percentages of heads and tails.
■  Draw conclusions on the reliability of a link by transmitting the same message repeatedly. The population is the set of all possible transmissions. A sample is the first 10,000 transmissions, for which we note the percentages of successful and failed transmissions.

Sampling
■  We can do sampling with or without replacement.
■  Sampling without replacement: A member of a population can be selected only once.
■  Sampling with replacement: A member of a population can be selected more than once.
■  A finite population that is sampled with replacement can be considered infinite.
■  In most practical cases, sampling with or without replacement from a large but finite population can be approximated as sampling from an infinite population.
■  We seek random samples, where each member of the population has the same chance of being in the sample.
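The two sampling schemes described above can be sketched in Python with the standard `random` module. The population of ten labeled members and the fixed seed are assumptions made only for illustration.

```python
import random

def sample_without_replacement(population, n, rng):
    """Each member can be selected at most once (requires n <= len(population))."""
    return rng.sample(population, n)

def sample_with_replacement(population, n, rng):
    """Every draw is made from the full population, so members can repeat."""
    return rng.choices(population, k=n)

rng = random.Random(0)            # fixed seed only to make the illustration repeatable
population = list(range(1, 11))   # hypothetical population of 10 labeled members

without = sample_without_replacement(population, 5, rng)
with_repl = sample_with_replacement(population, 20, rng)

print("without replacement:", without)
print("with replacement:   ", with_repl)
```

Drawing 20 times from 10 members is only possible with replacement, which is why a finite population sampled this way behaves like an infinite one.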

Population
■  A population consists of the totality of the observations with which we are concerned.
■  A population is “known” when we know the probability distribution (meaning the cdf, pdf or pmf) of the random variable associated with the population.
■  Examples:
    ❐  If X is the r.v. whose values are the heights of UCSC students, we say that X has probability distribution f(x) or fX(x).
    ❐  If X is normally distributed, we say that the population is normally distributed, or that we have a normal population.
■  Population parameters: The quantities related to fX(x) (what are these?).

What Are Parameters?
■  We denote a parameter by θ
■  Parameters of well-known distributions we have studied:
    ❐  Bernoulli: θ = p, where p = probability of success in a trial
    ❐  Uniform: θ = (α, β), where α, β = boundaries of the interval on which the pdf > 0
    ❐  Binomial: θ = (n, p), where n = number of trials and p = success probability in a trial
    ❐  Geometric: θ = p, where p = success probability in a trial (k, the number of trials needed for the first success, is the value taken by the r.v., not a parameter)
    ❐  Poisson: θ = λ, where λ = arrival rate and mean
    ❐  Normal: θ = (µ, σ²), where µ = mean and σ² = variance
■  There are “parametric models.” Given a model, its parameters yield the actual distribution.

Sample Statistic
■  In most cases, it is impossible or impractical to observe an entire population.
■  Hence, we usually do not know the parameters of a population.
■  Examples: number of times a coin in your pocket comes up tails, number of people in the US that get up at 9am.
■  We must rely on a subset of observations from the population to make decisions about it.
■  Sample: A subset of observations selected from a population.
■  We take random samples from the population and use them to obtain values that estimate the population parameters.

Sample Statistic
■  For each population parameter, there is a statistic to be computed from the sample.
■  Sampling distribution of a statistic: The probability distribution (cdf, pmf) of the sample statistic
■  Estimates of parameters of the population enable us to:
    ❐  Better understand the process that produces the data
    ❐  Make predictions [using probability theory] based on the model using the data that have been collected
    ❐  Simulate the process that produces the data

Estimates
■  Consider n random variables {Xi}, i = 1, 2, …, n
■  From the observations of {Xi} we produce a sequence of estimates of a parameter θ denoted {Ri}, i = 1, 2, …, n
■  Each estimate Ri is a random variable itself and a function of the sequence {Xk}, k = 1, 2, …, i

Definition: The sequence of estimates {Ri}, i = 1, 2, …, n of the parameter θ is said to be consistent if

$$\lim_{n\to\infty} P\left[\,|R_n - \theta| \ge \varepsilon\,\right] = 0 \quad \text{for every } \varepsilon > 0$$

■  The estimates of a parameter θ are random variables that differ from the value of θ in different observations.

■  Although each estimate Ri is random, it is undesirable for estimates to be typically larger or typically smaller than θ

■  We seek unbiased estimates

Definition: An estimate Rn of parameter θ is said to be unbiased if E(Rn ) = θ ; otherwise, Rn is biased.


Covariance
■  Consider two random variables X and Y.
■  The covariance of X and Y provides a measure of how much the two vary together.
■  Definition: Cov(X, Y) = E[ (X – E[X]) (Y – E[Y]) ]

The covariance will be positive if both random variables attain large deviations from their means together, and will be negative if their deviations from their means are opposite.

Properties of Covariance
■  Consider two random variables X and Y.
    Cov(X, Y) = Cov(Y, X)
    Cov(X, X) = E[(X)(X)] – E[X]E[X] = Var(X)
    Cov(aX + b, Y) = a Cov(X, Y)
    Cov(X, Y) = E[XY] – E[X]E[Y]
■  Let {Xi}, i = 1, 2, …, n and {Yj}, j = 1, 2, …, m be random variables, then:

$$\operatorname{Cov}\left(\sum_{i=1}^{n} X_i,\ \sum_{j=1}^{m} Y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} \operatorname{Cov}(X_i, Y_j)$$

■  The covariance of two random variables X and Y may be zero, depending on the way in which the r.v.s deviate from their means.

■  However, if the random variables X and Y are independent, then their covariance must be 0: Cov (X, Y) = E[XY] – E[X]E[Y] = E[X]E[Y] – E[X]E[Y] = 0
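The identity Cov(X, Y) = E[XY] – E[X]E[Y] can be checked on small joint distributions. The following is a minimal sketch; the two dice examples and the helper names are assumptions chosen for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

def expectation(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def covariance(joint):
    """Cov(X, Y) = E[XY] - E[X]E[Y]."""
    e_xy = expectation(joint, lambda x, y: x * y)
    e_x = expectation(joint, lambda x, y: x)
    e_y = expectation(joint, lambda x, y: y)
    return e_xy - e_x * e_y

faces = range(1, 7)

# Two independent fair dice: every pair (x, y) has probability 1/36.
independent = {(x, y): Fraction(1, 36) for x, y in product(faces, faces)}

# Fully dependent case: Y always equals X, so Cov(X, Y) = Var(X).
dependent = {(x, x): Fraction(1, 6) for x in faces}

cov_independent = covariance(independent)
cov_dependent = covariance(dependent)
print(cov_independent, cov_dependent)
```

The independent case gives exactly 0, while the dependent case gives the variance of one fair die, 35/12.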

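Variance of Sum of Variables
Consider the sum of n random variables Xi such that all Xi and Xj are independent for i ≠ j:

$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i)$$

Proof: In general, it can be shown that

$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2\sum_{i=1}^{n}\sum_{j=i+1}^{n} \operatorname{Cov}(X_i, X_j)$$

Cov(Xi, Xj) = 0 because all Xi and Xj are independent, and hence

$$\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 0$$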
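As a sanity check of the result above, the variance of a sum of independent r.v.s can be computed from the exact pmf of the sum and compared with the sum of the individual variances. This is only a sketch: the choice of two fair dice and a fair coin is an assumption for illustration.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(p * v for v, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum(p * (v - m) ** 2 for v, p in pmf.items())

def pmf_of_sum(pmfs):
    """pmf of X1 + ... + Xn for independent Xi, built by enumerating outcomes."""
    total = {0: Fraction(1)}
    for pmf in pmfs:
        nxt = {}
        for (s, ps), (v, pv) in product(total.items(), pmf.items()):
            # Independence lets us multiply the probabilities.
            nxt[s + v] = nxt.get(s + v, 0) + ps * pv
        total = nxt
    return total

die = {face: Fraction(1, 6) for face in range(1, 7)}
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
variables = [die, die, coin]

var_of_sum = variance(pmf_of_sum(variables))
sum_of_vars = sum(variance(pmf) for pmf in variables)
print(var_of_sum, sum_of_vars)
```

Both quantities come out as 73/12 (35/12 for each die plus 1/4 for the coin).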

The Sample Mean and Its Properties
■  Consider n random variables {Xi}, i = 1, 2, …, n
■  The r.v.s are independent and identically distributed (i.i.d.)
■  They all have the same cdf F
■  A sequence of {Xi} is a sample of the cdf F with mean µ and variance σ².
■  The average of the values of this sample is called the sample mean and is defined by:

$$\bar{X} = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

■  The mean of the sample mean is $E(\bar{X}_n) = \mu$:

$$E(\bar{X}_n) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}E\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$$

Because $E(\bar{X}) = \mu$ we say that $\bar{X}$ is an unbiased estimate of the mean µ.

■  Given that the r.v.s are i.i.d. with variance σ²:

$$\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$$

■  The variance of the sample mean is 1/n times the variance of the underlying distribution of the r.v.s: $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$

Variance of the Sample: Sample Variance
■  Consider a sample of n i.i.d. random variables {Xi}, i = 1, 2, …, n.
■  Each Xi has a distribution F with E[Xi] = µ and Var(Xi) = σ²
■  The sample mean is $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $E(\bar{X}_n) = \mu$
■  The sample deviation for Xi is $\bar{X}_n - X_i$ for i = 1, 2, ..., n
■  The sample variance is defined by

$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$

For each i we have that $X_i - \bar{X}_n = X_i - \bar{X}_n - \mu + \mu = (X_i - \mu) - (\bar{X}_n - \mu)$. Hence,

$$(X_i - \bar{X}_n)^2 = [(X_i - \mu) - (\bar{X}_n - \mu)]^2 = (X_i - \mu)^2 - 2(X_i - \mu)(\bar{X}_n - \mu) + (\bar{X}_n - \mu)^2$$

$$\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X}_n - \mu)\sum_{i=1}^{n}(X_i - \mu) + \sum_{i=1}^{n}(\bar{X}_n - \mu)^2$$

$$= \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X}_n - \mu)[n(\bar{X}_n - \mu)] + n(\bar{X}_n - \mu)^2$$

$$= \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X}_n - \mu)^2 \qquad \text{(a)}$$

Variance of the Sample: Sample Variance
Taking the expected value of Equation (a), we have

$$E\left[\sum_{i=1}^{n} (X_i - \bar{X}_n)^2\right] = E\left[\sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X}_n - \mu)^2\right] = E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] - nE\left[(\bar{X}_n - \mu)^2\right]$$

$$= n\sigma^2 - n\left(\frac{\sigma^2}{n}\right) = (n-1)\sigma^2 \qquad \text{(b)}$$

Given that $S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$, we obtain from Eq. (b) that

$$E[S_n^2] = \frac{(n-1)}{n}\sigma^2$$

Note: Most textbooks use a sample variance equal to $n S_n^2/(n-1)$ so that its mean is just σ².

The mean of the sample variance nearly equals σ² [and hence $S_n^2$ is nearly an unbiased estimate] for large values of n. In practice, n ≥ 30 is large enough.
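The factor (n – 1)/n in the mean of the sample variance can be verified by brute force on a small population. The sketch below assumes a hypothetical population of five equally likely values (so σ² = 2) and enumerates every sample of size n drawn with replacement.

```python
from fractions import Fraction
from itertools import product

def expected_sample_variance(population, n):
    """E[S_n^2] with S_n^2 = (1/n) * sum (X_i - Xbar_n)^2, averaged over all
    equally likely samples of size n drawn with replacement."""
    samples = list(product(population, repeat=n))
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / n
    return total / len(samples)

population = [1, 2, 3, 4, 5]   # equally likely values, sigma^2 = 2
sigma2 = Fraction(2)

for n in (2, 3, 4):
    biased = expected_sample_variance(population, n)   # (n - 1)/n * sigma^2
    corrected = biased * n / (n - 1)                    # mean of n*S_n^2/(n - 1)
    print(n, biased, corrected)
```

For every n the corrected value equals σ² = 2 exactly, which is why most textbooks divide by n – 1.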

Correlation of Random Variables
■  We need a measure of association among random variables
■  We want our measure to be normalized
■  The correlation of two random variables X and Y is denoted and defined as follows:

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

where σX and σY are the standard deviations of X and Y.
■  This new measure ρ is called the correlation coefficient.
■  Alternatively: $\rho(X, Y) = \operatorname{Cov}(X, Y) / [\sigma_X^2 \sigma_Y^2]^{1/2}$
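A short sketch of the correlation coefficient computed from paired values, treating the listed pairs as equally likely outcomes. The data are hypothetical; exact linear relations are used so that the normalization to ±1 is visible.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Cov(X, Y) = E[(X - E[X])(Y - E[Y])] for equally likely paired values."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """rho(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)."""
    sigma_x = math.sqrt(covariance(xs, xs))
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]                          # hypothetical values
rho_up = correlation(xs, [2 * x + 1 for x in xs])       # increasing linear relation
rho_down = correlation(xs, [10 - 3 * x for x in xs])    # decreasing linear relation
print(rho_up, rho_down)
```

Rescaling or shifting either variable leaves the magnitude of ρ unchanged, which is the sense in which the measure is normalized.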

Correlation and Causation
Let X be the US annual spending on science, space and technology, and Y be the annual number of suicides by hanging, strangulation, and suffocation, between 1999 and 2009:

“The data clearly show that X and Y are correlated (> 99%).”

“Hence, either more spending on science, space and tech led to more suicides by hanging, strangulation and suffocation from 1999 to 2009, or more of such suicides led to more US spending on science, space and tech.” … The horror!

Correlation does not imply causality! Correlation may be just a coincidence.

Causation ⇒ correlation, but not the other way around! http://www.tylervigen.com/spurious-correlations


Why Limits Are Useful
■  In many practical cases we do not know the true form of a probability distribution
    ❐  Example: Nobody knows what the distribution of arrivals of UCSC students to different classrooms is for different days of the week.
■  We may know some information that can help us think about properties of the probability distributions of interest.
■  Limits allow us to gain some insight, even when we are missing key information, and obtain new results.
■  Some limits are better than others!
■  Some limits allow us to reason about approximations of the real distributions.

What Can We Say Knowing Only the Mean of a R.V. that Assumes Non-Negative Values?

■  Example: Let X be the random variable denoting the final grade in CMPE 107. Suppose that statistics from all previous offerings of CMPE 107 allow us to compute the sample mean of the final grades, which equals 86.7 out of 120. What is P[X ≥ 100]?

Guess: 5%? 10%? 20%? 30%? Our guesses would tend to assume a “bell curve,” right?

Markov’s Inequality
Markov’s Inequality: Let X be a random variable that takes non-negative values. Then

$$P(X \ge a) \le \frac{E[X]}{a} \quad \text{for all } a > 0$$

Proof: For a > 0, let

$$I = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{otherwise} \end{cases}$$

Note that, because X ≥ 0, X ≥ I × a and hence I ≤ X / a (1)
Taking expectations in Eq. (1) we obtain: E[I] ≤ E[X] / a (2)
Using the definition of the mean: E[I] = P[X ≥ a] · 1 + P[X < a] · 0 = P[X ≥ a] (3)
From Eqs. (2) and (3) we obtain: P(X ≥ a) = E[I] ≤ E[X] / a. Q.E.D.

From Markov’s inequality: Approximating the mean with the sample mean, we have P[X ≥ 100] ≤ 86.7/100 = 0.867
Did 86.7% of the CMPE 107 class actually score 100/120 or better? Possible but unlikely… This is a loose upper bound!
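A minimal sketch of how the Markov bound in the grade example behaves for several thresholds. The function name and the extra thresholds are assumptions for illustration; only the mean 86.7 and the threshold 100 come from the example.

```python
def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a non-negative r.v. with the given mean."""
    if a <= 0:
        raise ValueError("a must be positive")
    # A probability never exceeds 1, so the bound is capped there.
    return min(1.0, mean / a)

sample_mean = 86.7  # sample mean of the final grades, out of 120

for threshold in (90, 100, 110, 120):
    print(threshold, round(markov_bound(sample_mean, threshold), 3))
```

The bound only says something when a is larger than the mean, and even then it decays slowly, like 1/a, which is why it is loose.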

Chebyshev’s Inequality
Chebyshev’s Inequality: Let X be a random variable with mean µ and variance σ². Then, for any value k > 0,

$$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2} = \left(\frac{\sigma}{k}\right)^2$$

Proof: Given that Y = (X – µ)² is a non-negative random variable, we can use Markov’s inequality P[Y ≥ a] ≤ E[Y]/a with a = k². We obtain:

$$P[Y \ge a] = P[Y \ge k^2] = P[(X - \mu)^2 \ge k^2] \le \frac{E[(X - \mu)^2]}{k^2} = \frac{E(Y)}{a} \qquad (1)$$

Note that E[(X − µ)²] = σ², and substituting in (1):

$$P[(X - \mu)^2 \ge k^2] \le \frac{\sigma^2}{k^2} \qquad (2)$$

However, (X – µ)² ≥ k² if and only if |X – µ| ≥ k. Therefore, Eq. (2) is equivalent to P[|X − µ| ≥ k] ≤ (σ / k)². Q.E.D.

Example
■  Let X be a uniformly distributed random variable over the interval (0, 10). What is the probability that the difference between X and its mean is at least 4?
■  We have µ = 5 and σ² = 25/3 [check it out!], using the mean and variance of a uniform r.v. on (α, β):

$$E(X) = \frac{\beta + \alpha}{2}, \qquad \operatorname{Var}(X) = \frac{(\beta - \alpha)^2}{12}$$

■  From Chebyshev’s inequality, we then have: P[|X – 5| ≥ 4] ≤ (25/3)/4² = 25/48 ≈ 0.52
■  This is a loose bound compared with the exact result from the pdf:

$$P(|X - 5| \ge 4) = 1 - P[|X - 5| < 4] = 1 - P(-4 < X - 5 < 4) = 1 - P(1 < X < 9)$$

$$= 1 - \int_{1}^{9} \frac{1}{10}\,dx = 1 - \frac{1}{10}(9 - 1) = \frac{2}{10} = 0.2$$

These inequalities are useful tools in proving mathematical results, such as order results [in the limit] and the law of large numbers.
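The comparison in this example can be reproduced with exact fractions. This is a sketch under the example's own assumptions (X uniform on (0, 10), k = 4); the helper names are illustrative.

```python
from fractions import Fraction

def uniform_mean_var(alpha, beta):
    """Mean (alpha + beta)/2 and variance (beta - alpha)^2 / 12 of a uniform r.v."""
    return Fraction(alpha + beta, 2), Fraction((beta - alpha) ** 2, 12)

def chebyshev_bound(variance, k):
    """Upper bound sigma^2 / k^2 on P(|X - mu| >= k), capped at 1."""
    return min(Fraction(1), Fraction(variance) / k ** 2)

def uniform_exact_tail(alpha, beta, k):
    """Exact P(|X - mu| >= k) for X uniform on (alpha, beta)."""
    half_width = Fraction(beta - alpha, 2)
    if k >= half_width:
        return Fraction(0)
    # The interval (mu - k, mu + k) has length 2k and density 1/(beta - alpha).
    return 1 - Fraction(2 * k) / (beta - alpha)

mu, var = uniform_mean_var(0, 10)
bound = chebyshev_bound(var, 4)
exact = uniform_exact_tail(0, 10, 4)
print(mu, var, bound, exact)
```

The bound 25/48 is more than twice the exact value 1/5, since Chebyshev uses only the mean and variance and not the shape of the distribution.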

The Weak Law of Large Numbers
Weak Law of Large Numbers: Consider n i.i.d. random variables {Xi}, i = 1, 2, …, n, each having a finite mean µ. Then, for any value ε > 0,

$$P\left[\left|\frac{X_1 + \dots + X_n}{n} - \mu\right| \ge \varepsilon\right] \xrightarrow{\;n\to\infty\;} 0$$

Proof: We will assume that all r.v.s have a finite variance σ².

Note that: E[(X1+…+Xn) / n] = µ and Var[(X1+…+Xn) / n] = σ² / n

From Chebyshev’s inequality, we have:

$$P\left[\left|\frac{X_1 + \dots + X_n}{n} - \mu\right| \ge \varepsilon\right] \le \frac{\sigma^2/n}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}$$

The result follows by making n go to infinity. Q.E.D.
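The bound σ²/(nε²) used in the proof can also be turned around to ask how large n must be for a target probability. The sketch below uses hypothetical numbers: the indicator of a fair coin (σ² = 1/4), ε = 1/20 and a target of 1/20. Fractions keep the arithmetic exact.

```python
import math
from fractions import Fraction

def wlln_bound(sigma2, n, eps):
    """Chebyshev bound sigma^2 / (n * eps^2) on P(|Xbar_n - mu| >= eps), capped at 1."""
    return min(1, sigma2 / (n * eps ** 2))

def samples_needed(sigma2, eps, delta):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(sigma2 / (delta * eps ** 2))

# Hypothetical numbers: fair-coin indicator, tolerance 0.05, target probability 0.05.
sigma2, eps, delta = Fraction(1, 4), Fraction(1, 20), Fraction(1, 20)

for n in (100, 1_000, 10_000):
    print(n, wlln_bound(sigma2, n, eps))

n_min = samples_needed(sigma2, eps, delta)
print("n needed:", n_min)
```

Because the Chebyshev bound is loose, the n obtained this way is a sufficient sample size and is typically much larger than what is actually needed.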

The Weak Law of Large Numbers
Weak Law of Large Numbers: Consider n i.i.d. random variables {Xi}, i = 1, 2, …, n, each having a finite mean µ and variance σ².

Let $\bar{X}_n$ denote the sample mean of the random variables. Then, for any ε > 0,

$$P\left[\,|\bar{X}_n - \mu| \ge \varepsilon\,\right] \xrightarrow{\;n\to\infty\;} 0 \quad \text{or} \quad P\left[\,|\bar{X}_n - \mu| < \varepsilon\,\right] \xrightarrow{\;n\to\infty\;} 1$$

As the size n of the sample increases, the value of the sample mean approaches the value of the actual distribution mean.

The Strong Law of Large Numbers
Strong Law of Large Numbers: Consider n i.i.d. random variables {Xi}, i = 1, 2, …, n, each having a finite mean µ and finite variance. Then

$$P\left[\lim_{n\to\infty} \bar{X}_n = \mu\right] = 1$$

As the size n of the sample increases, the value of the sample mean will eventually approach and stay close to µ with probability 1.

The Central Limit Theorem
Central Limit Theorem: Consider n i.i.d. random variables {Xi}, i = 1, 2, …, n, each having a mean µ and variance σ². Then, for any value a,

$$P\left[\frac{(X_1 + \dots + X_n) - n\mu}{\sqrt{n\sigma^2}} \le a\right] \xrightarrow{\;n\to\infty\;} \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{a} e^{-x^2/2}\,dx$$

Proof: We skip the proof. It is rather involved, but based on our previous results on limits.
Note that the RHS of the equation corresponds to the CDF of N(0, 1).

The Central Limit Theorem
Central Limit Theorem (intuitive version): Consider n i.i.d. random variables {Xi}, i = 1, 2, …, n, each having a mean µ and variance σ². Let $\bar{X}_n$ denote the sample mean of the random variables. Then,

$$\bar{X}_n \xrightarrow{\;n\to\infty\;} N\left(\mu, \frac{\sigma^2}{n}\right) \qquad \text{and} \qquad \sum_{i=1}^{n} X_i \xrightarrow{\;n\to\infty\;} N(n\mu, n\sigma^2)$$

■  It provides a simple method for computing approximate probabilities for the aggregated effect of multiple independent r.v.s
■  Helps explain empirical evidence that so many natural phenomena exhibit bell-shaped (normal) distributions.

Simple To Use: Example 1
■  The mean age of a UCSC student [undergraduates and graduates considered] is 22.3 years and the standard deviation is 4 years.
■  A [small] random sample of 64 students is drawn. What is the probability that the average age of these students is greater than 23 years?

Solution: Xi denotes the age of the ith student. We need to find $P[\bar{X} > 23]$.
From the Central Limit Theorem: $\bar{X} \approx N(22.3, 16/64) = N(22.3, 0.25)$
We just assume that the aggregate of independent r.v.s is a normal r.v.

Let $Z = (\bar{X} - 22.3)/\sqrt{0.25}$. Then

$$P(\bar{X} > 23) = P\left(Z > \frac{23 - 22.3}{\sqrt{0.25}}\right) = P(Z > 1.4) = 1 - P(Z \le 1.4)$$

From tables of Φ(z) we have $P(\bar{X} > 23) = 1 - \Phi(1.4) = 0.081$
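The table lookup in this example can be replaced by a few lines of Python. The sketch below evaluates Φ through the error function in the standard `math` module; the function names are assumptions for illustration.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the cdf of N(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_sample_mean_exceeds(mu, sigma, n, threshold):
    """CLT approximation of P(Xbar_n > threshold), with Xbar_n ~ N(mu, sigma^2 / n)."""
    z = (threshold - mu) / (sigma / math.sqrt(n))
    return 1.0 - std_normal_cdf(z)

# Values from the example: mean 22.3, standard deviation 4, sample of 64 students.
p = prob_sample_mean_exceeds(mu=22.3, sigma=4.0, n=64, threshold=23.0)
print(round(p, 3))
```

Note that the standardization divides by the standard deviation of the sample mean, σ/√n = 0.5, not by its variance 0.25.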

Example 2: Useful Even When We Know The Distribution
The number of students who enroll in a course AWESOME 101 is a Poisson random variable X with mean 100. We do not have enough large classrooms in UCSC; hence, the department decides that, if the number of students enrolling for the course in Spring 18 is 120 or more, then two sections of the course will be offered. What is the probability that the department will have to offer two sections?

Exact Solution: We simply consider enrollments as arrivals and use the definition of the Poisson distribution:

$$P[X \ge 120] = \sum_{k=120}^{\infty} \frac{100^k}{k!}e^{-100} = e^{-100}\sum_{k=120}^{\infty} \frac{100^k}{k!}$$

or

$$P[X \ge 120] = 1 - P[X \le 119] = 1 - e^{-100}\sum_{k=0}^{119} \frac{100^k}{k!}$$

To avoid the computation we can use an approximation to the exact result…

Useful: Example 2 (cont.)
Approximate Solution:
■  We take advantage of the fact that a Poisson random variable with mean 100 is simply the sum of 100 independent Poisson random variables, each with mean and variance of 1.
■  We apply the Central Limit Theorem to the sum of 100 independent Poisson random variables, and approximate this sum with Y ~ N(100, 100).
■  We also use the “continuity correction” [not necessary] and compute P[Y ≥ 120 – ½] = P[Y ≥ 119.5]

$$P[Y \ge 119.5] = P\left(\frac{Y - 100}{\sqrt{100}} \ge \frac{119.5 - 100}{\sqrt{100}}\right) = P\left(\frac{Y - 100}{\sqrt{100}} \ge 1.95\right) = 1 - P\left(\frac{Y - 100}{\sqrt{100}} < 1.95\right)$$

$$P[Y \ge 119.5] = 1 - \Phi(1.95) \approx 1 - 0.9744 = 0.0256$$
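With a computer the exact Poisson sum is no longer hard to evaluate, so the two answers to this example can be compared directly. A minimal sketch, assuming the same mean of 100 and threshold of 120; the function names are illustrative.

```python
import math

def poisson_tail(lam, k_min):
    """Exact P[X >= k_min] for X ~ Poisson(lam), as 1 - sum of pmf(k) for k < k_min."""
    cdf = 0.0
    term = math.exp(-lam)          # pmf at k = 0
    for k in range(k_min):
        if k > 0:
            term *= lam / k        # pmf(k) = pmf(k - 1) * lam / k
        cdf += term
    return 1.0 - cdf

def std_normal_cdf(z):
    """Phi(z), the cdf of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_tail_with_continuity(lam, k_min):
    """CLT approximation: P[Y >= k_min - 1/2] with Y ~ N(lam, lam)."""
    z = (k_min - 0.5 - lam) / math.sqrt(lam)
    return 1.0 - std_normal_cdf(z)

exact = poisson_tail(100, 120)
approx = normal_tail_with_continuity(100, 120)
print(exact, approx)
```

The two values agree to within a few thousandths; the normal approximation comes out slightly smaller because the Poisson distribution is skewed to the right.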

Normal Approximation of Random Variables: The Binomial Case

■  Consider a binomial random variable X with parameters n and p.
■  The number of trials is n and the success prob. in each trial is p.
■  Each of the n trials is a Bernoulli trial with parameter p (probability of success), and Yi is the random variable denoting success in the ith trial.
■  X = Y1 + Y2 + … + Yn is the number of successes observed in the n trials.
■  Each Yi has mean and variance equal to: µ = p and σ² = p(1 – p)
■  The mean and variance of their sum (X) are np and np(1 – p)
■  From the Central Limit Theorem, we have: X ≈ N(np, np(1 – p))
■  Considered a good approximation for np ≥ 5 and np(1 – p) ≥ 5.

[Figure: pmf of B(25, 0.4) shown as bars with the N(10, 6) pdf as a blue curve; horizontal axis: number of successes in 25 trials]

■  Bars correspond to the binomial r.v. B(25, 0.4)
■  Hence,
    ❐  µ = 0.4(25) = 10
    ❐  σ² = 25(0.4)(1 – 0.4) = 6
■  Blue curve shows N(10, 6)
■  As n increases we have a better match
■  Techniques exist to cope with differences
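The quality of the match between B(25, 0.4) and N(10, 6) can be checked numerically. The sketch below compares the exact binomial cdf with the normal cdf; using the continuity correction of ½ here is an assumption, chosen as one of the techniques for coping with the differences.

```python
import math

def binomial_cdf(n, p, k):
    """Exact P[X <= k] for X ~ B(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def std_normal_cdf(z):
    """Phi(z), the cdf of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_approx_cdf(n, p, k):
    """P[X <= k] using N(np, np(1 - p)) with a continuity correction of 1/2."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return std_normal_cdf((k + 0.5 - mu) / sigma)

n, p = 25, 0.4
for k in (5, 8, 10, 12, 15):
    exact = binomial_cdf(n, p, k)
    approx = normal_approx_cdf(n, p, k)
    print(k, round(exact, 4), round(approx, 4))
```

Rerunning the loop with a larger n (and the same p) is a quick way to see the approximation tighten as n increases.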

Summary of a few results

Sampling Distribution of Means
■  Consider a population from which a sample of size n is drawn.
■  Let f(x) be the probability distribution of the population.
■  The probability distribution of the sample mean is called the sampling distribution for the sample mean, also called the sampling distribution of means.
■  Theorem: If µ is the mean of the population, then the mean of the sampling distribution for the sample mean is given by:

$$E(\bar{X}) = \mu_{\bar{X}} = \mu$$

Hence, the expected value of the sample mean is the population mean.

Sampling Distribution of Means
■  Theorem: If a population is infinite or sampling is with replacement, and σ² is the variance of the population, then the variance of the sampling distribution of means is given by:

$$E\left[(\bar{X} - \mu)^2\right] = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$

■  Theorem: If the population is of size N < ∞, sampling is without replacement, the sample size is n ≤ N, and σ² is the variance of the population, then the variance of the sampling distribution of means is given by:

$$E\left[(\bar{X} - \mu)^2\right] = \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}\left(\frac{N - n}{N - 1}\right)$$

Sampling Distribution of Means
■  Theorem: If a population is infinite or sampling is with replacement, and the mean and variance of the population are µ and σ², then Z is asymptotically normal as the sample size n increases, where

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}$$

■  In the above, asymptotically normal means that, as the sample size n goes to infinity, the distribution of Z approaches that of the standard normal random variable N(0, 1)
■  This is a direct result of the central limit theorem.

Additional Material


Illustrative Example 1
■  A population consists of the following five numbers: 1, 2, 3, 4, 5
■  All are equally likely
■  Consider the possible samples of size two with replacement that can be drawn from this population.
■  Find the mean and standard deviation of the population.
■  Find
    ❐  The mean of the sampling distribution of means
    ❐  The standard deviation of the sampling distribution of means

Illustrative Example 1
■  The population is: 1, 2, 3, 4, 5
■  The mean of the population:

$$\mu = \frac{1}{5}(1 + 2 + 3 + 4 + 5) = \frac{15}{5} = 3$$

■  The variance of the population:

$$\sigma^2 = \frac{1}{5}\left[(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\right] = \frac{1}{5}(4 + 1 + 0 + 1 + 4) = 2$$

Therefore, σ = √2 ≈ 1.41

Illustrative Example 1
There are 5 × 5 = 25 samples of size two that can be drawn with replacement. They are:

(1,1) (1,2) (1,3) (1,4) (1,5)
(2,1) (2,2) (2,3) (2,4) (2,5)
(3,1) (3,2) (3,3) (3,4) (3,5)
(4,1) (4,2) (4,3) (4,4) (4,5)
(5,1) (5,2) (5,3) (5,4) (5,5)

The corresponding sample means are:

1.0 1.5 2.0 2.5 3.0
1.5 2.0 2.5 3.0 3.5
2.0 2.5 3.0 3.5 4.0
2.5 3.0 3.5 4.0 4.5
3.0 3.5 4.0 4.5 5.0

As should be expected,

$$\mu_{\bar{X}} = \frac{1}{25}(1.0 + 1.5 + \dots + 4.5 + 5.0) = 75/25 = 3 = \mu$$

$$\sigma_{\bar{X}}^2 = \frac{1}{25}\sum_{i=1}^{25} (\bar{x}_i - 3)^2 = \frac{25}{25} = 1 = \frac{\sigma^2}{n} = \frac{2}{2}$$

where $\bar{x}_i$ is the mean of the ith sample. Therefore, $\sigma_{\bar{X}} = 1$.
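The enumeration in this example is small enough to reproduce exactly. The following sketch lists all 25 ordered samples of size two and computes the mean and variance of the resulting sample means with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]
n = 2

# Population mean and variance (all five values equally likely).
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# All 5 x 5 = 25 equally likely ordered samples of size two, with replacement.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

print(mu, sigma2, mean_of_means, var_of_means)
```

The mean of the sample means equals µ = 3 and their variance equals σ²/n = 2/2 = 1, in agreement with the two theorems on the sampling distribution of means.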

Illustrative Example 2
■  A population of 3000 male UCSC students has a normal distribution with mean 68.0 inches in height and standard deviation of 3.0 inches.
■  Assume 80 samples of 25 students each are drawn from this population.
■  What is the expected mean and standard deviation of the resulting sampling distribution of means if sampling is done with replacement?

Solution:
■  The possible number of samples of size 25 from 3000 is C(3000, 25), which is far larger than 80.
■  We do not have a true sampling distribution of means, only an experimental sample distribution of means.
■  The number of samples is large enough and we can approximate. We have:

$$\mu_{\bar{X}} = \mu = 68.0 \text{ inches}, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{25}} = 0.6 \text{ inches}$$
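For comparison, here is a worked sketch of what the finite-population theorem stated earlier would give if the same samples were instead drawn without replacement. This is an assumption that goes beyond the question asked above. With N = 3000 and n = 25:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}} = \frac{3}{\sqrt{25}}\sqrt{\frac{2975}{2999}} \approx 0.6 \times 0.996 \approx 0.598 \text{ inches}$$

The mean is unchanged at 68.0 inches, and the correction factor is so close to 1 that the two sampling schemes are practically indistinguishable for a population this large.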
