
Confidence Intervals

1

Overview

Determining difference

P-values

Confidence intervals

Population and Samples

Calculating CI

In this lesson we are going to talk about determining significance from a statistical perspective. Please keep in mind that the statistical significance of a study and the practical (real-world) significance of a study are not the same thing, and the presence or absence of one does not ensure or preclude the other.

2

In analytical epi studies, we always need to answer two questions…

1. What is the point estimate?

2. Is this point estimate significant?

3

In any analytical epi study we want to know two things: the point estimate, and whether or not this point estimate represents a significant difference between our groups on the outcome of interest. So far this semester, we’ve done a bunch of calculations to determine the point estimates. Now it’s time to determine differences. Fun times ahead!

Groups: how people were selected, either by exposure status (f-up) or disease status (ca-co).

Outcomes: what we’re looking to see is different, either disease status (f-up) or exposure status (ca-co)

Determining Difference

1. Use the p value

2. Use confidence intervals

4

When determining if our groups are different, we have two basic options: p-value and confidence intervals. Let’s take a look at each.

p-value

Based on hypothesis testing

Null hypothesis = no difference

Calculated probability of seeing a difference at least this large if the null is true

Size matters!

But before I yammer on about p-values, let’s first talk about null hypotheses. Remember those from stats class? In an analytical epi study, a null hypothesis basically states that there is no difference between the groups you’re studying on the outcome of interest. So, for example, in a f-up study, the null would say that incidence of a disease would not differ by exposure status.

When testing the null, we calculate the probability of observing a difference at least as large as the one in our data if the null hypothesis were true. This probability is known as the p-value. When our p-value is high, our data are quite compatible with the null, meaning we have little evidence that our two groups differ on the outcome of interest. When our p-value is very low, a difference this large would rarely arise by chance alone if the null were true, meaning we have evidence that our two groups do differ on the outcome of interest.

A decision to reject a null hypothesis is based on how confident we want to be that what we’ve observed is “the truth.” The “level of confidence” is most often set at 95% (which corresponds to an alpha, or significance level, of 0.05), but there’s no magic to this number. The confidence level can be set higher or lower depending on what it is you’re investigating.

With a 95% level of confidence, we would reject our null if our p-value was less than 0.05. Essentially, our confidence level tells people how willing we are to live with being wrong about rejecting our null. With a 95% level of confidence, we are willing to be wrong 5% of the time, or 1 in 20 times.

When we reject the null, we are technically stating that our two groups are not the same on the outcome of interest to the level of confidence we have used. What we typically say, though, is that the groups are different, or that we have a statistically significant finding of difference.
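The decision rule described above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `reject_null` and the example p-values are made up for the purpose.

```python
def reject_null(p_value: float, confidence_level: float = 0.95) -> bool:
    """Return True when the p-value falls below alpha (1 - confidence level)."""
    alpha = 1 - confidence_level  # e.g. 95% confidence -> alpha of 0.05
    return p_value < alpha


# Hypothetical p-values, chosen only to show both outcomes of the rule.
print(reject_null(0.03))        # below 0.05, so we reject the null
print(reject_null(0.20))        # above 0.05, so we do not reject the null
print(reject_null(0.03, 0.99))  # a stricter 99% level changes the decision
```

Note how the same p-value of 0.03 leads to a different decision once the level of confidence is raised to 99%.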

5

The problem with p-values

Point Estimate    Study Size    p-value        Statistical Significance?
Far from null     Large         Low            Yes
Close to null     Large         High or low    Yes or no
Far from null     Small         High or low    Yes or no
Close to null     Small         High           No

6

The problem with p-values is that we are not able to interpret whether the presence or absence of statistical significance, as reflected in the p-value, is mostly a function of the effect size (distance from the null) or the study size, as illustrated by the chart on this slide.

So, for RR or OR that is well above or well below the null value of 1, you may not have a p-value that indicates statistical significance if there are few people in your study (row 3). Therefore, we may overlook a truly significant finding just because we were limited by sample size.

Conversely, you could have a RR that is close to 1, meaning the groups are not really different, but because there are a lot of people in the sample (row 2), the p value could be low enough to indicate statistical significance, even though the groups don’t really differ.

Rows 1 and 4 don’t really concern us, because they likely represent the “truth.”
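To see the study-size problem in numbers, here is a rough sketch that keeps the odds ratio fixed and only makes the study 100 times larger. The two tables are invented for illustration. The chi statistic follows the test-based formula taught later in this lesson, and treating chi as a standard normal score to get a two-sided p-value is an assumption of this sketch.

```python
import math


def chi_from_table(a, b, c, d):
    """Chi for a 2 x 2 table with cells a, b (E+ row) and c, d (E- row)."""
    e_pos, e_neg = a + b, c + d          # row totals
    d_pos, d_neg = a + c, b + d          # column totals
    n = a + b + c + d
    numerator = a - (e_pos * d_pos) / n
    denominator = math.sqrt((e_pos * e_neg * d_pos * d_neg) / (n**2 * (n - 1)))
    return numerator / denominator


def two_sided_p(chi):
    """Two-sided p-value, assuming chi behaves like a standard normal score."""
    return math.erfc(abs(chi) / math.sqrt(2))


small_table = (12, 10, 10, 12)                     # hypothetical small study
large_table = tuple(100 * x for x in small_table)  # same proportions, 100x the people

or_small = (small_table[0] * small_table[3]) / (small_table[1] * small_table[2])
or_large = (large_table[0] * large_table[3]) / (large_table[1] * large_table[2])
p_small = two_sided_p(chi_from_table(*small_table))
p_large = two_sided_p(chi_from_table(*large_table))

print(f"small study: OR = {or_small:.2f}, p = {p_small:.3f}")
print(f"large study: OR = {or_large:.2f}, p = {p_large:.3g}")
```

The point estimate is identical in both runs, yet only the large study crosses the 0.05 threshold, which is exactly why the p-value alone cannot tell us whether effect size or study size is driving the result.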

Confidence Interval

Range of values around point estimate

In analytical epi studies, we often use a confidence interval (CI) to determine differences. The CI is a calculated range of values around the point estimate that, at our chosen level of confidence, is expected to include the true effect value for the population. So, what exactly does this mean? Well, to explain, I need to first talk a bit about populations and samples.

7

Population & Sample

A population is all members of a defined group that we are studying. For example, in a f-up study, we are interested in everyone who was exposed to a certain agent, or in a ca-co study, everyone who had a specific disease. Because it is usually impossible (and extremely cost-prohibitive) to include every member of a population in an analytical study, we select a subset of the population to study. This subset is known as a sample. We then use this sample to make inferences about the population.

8

What CI tells us

Statistical significance

Magnitude

9

You can easily tell whether or not statistical significance has been reached using CI, just as you can in a hypothesis test. How? If the confidence interval does not enclose the null value, this represents a difference that is statistically significant. But, if the confidence interval includes the null value, the point estimate is statistically non-significant.

The advantage CI has over p-values, though, is that a confidence interval tells us how precise our point estimate is likely to be. Because samples do not perfectly reflect the “truth” about the entire population, they will be off by a little bit, and possibly by a great deal. The width between the upper and lower bounds of the CI gives us information on the magnitude (how big or small) the true effect might plausibly be, given our selected level of confidence (typically 95%, as with p-values).
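The "does the interval enclose the null value?" check is simple enough to write down as code. This is only a sketch; the function name and the two example intervals are hypothetical. For ratio measures such as RR and OR the null value is 1.

```python
def excludes_null(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True when the null value lies outside the interval (statistically significant)."""
    return not (lower <= null_value <= upper)


# Hypothetical intervals for a ratio measure (null value = 1).
print(excludes_null(1.3, 2.9))  # 1 is outside the interval: significant
print(excludes_null(0.8, 1.5))  # 1 is inside the interval: not significant
```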

Width of CI

Each line on the slide represents sample data from the same population. The blue dot represents the point estimate, and the black line represents where the population truth may be. The distance between the left end of the line (lower limit) and right end of the line (upper limit) represents the width of the CI. When the distance between these two points is short, we say it is “narrow.” A narrow CI means a more precise estimate, one that is likely to sit closer to the population truth than an estimate with a longer (called wide) CI.

So, you may be wondering how two different samples from the same population could yield such different CIs. Two factors affect the CI: variability and sample size. Variability is how different or similar people in the population are to one another. When members of your population are similar, there is low variability which means the samples that you select will more closely resemble one another than when population members are very different from one another. Sample size is how many people are selected. When you have low variability, sample size is less of an issue than it is when there is great variability.
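A small simulation can make the sample-size point concrete. The sketch below is an illustration under assumed numbers: a made-up population in which 30% of people develop the disease, repeated samples of 25 versus 400 people, and a fixed random seed so the run is repeatable.

```python
import random
import statistics


def spread_of_sample_estimates(true_risk, sample_size, n_samples=500, seed=0):
    """Draw repeated samples and return the standard deviation of the sample risks."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        cases = sum(rng.random() < true_risk for _ in range(sample_size))
        estimates.append(cases / sample_size)
    return statistics.stdev(estimates)


spread_small = spread_of_sample_estimates(true_risk=0.30, sample_size=25)
spread_large = spread_of_sample_estimates(true_risk=0.30, sample_size=400)

print(f"spread with 25 people per sample:  {spread_small:.3f}")
print(f"spread with 400 people per sample: {spread_large:.3f}")
```

Estimates from the larger samples cluster much more tightly around the true value, which is what a narrower CI reflects.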

10

Formula for determining Confidence Intervals

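$\mathrm{CI} = RR^{1 \pm (z/\chi)}$

Where

RR = point estimate

z = z score for the chosen level of confidence

χ (chi) = $\dfrac{a - \dfrac{E^{+} \times D^{+}}{N}}{\sqrt{\dfrac{E^{+} \times E^{-} \times D^{+} \times D^{-}}{N^{2} \times (N - 1)}}}$, where a is the value in Cell A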

11

The derivation of the formula for confidence limits of the point estimate (upper and lower bounds) is beyond the scope of this class. Through the power of trust, however, I give to you the formula itself so that you, too, can calculate confidence intervals with the pros! This formula, for those of you who are budding epi purists, is known as the "test-based" formula.

 

RR 1+ (z/)

 

Where RR is the point estimate

z is the z score that corresponds to the level of confidence (researcher's choice, often 95%)

χ (chi) is a very big formula that I am going to try to describe in words to supplement and explain the formula presented on the slide. It is calculated using numbers from the 2 x 2 table.

Chi Numerator: the numerator is the observed cell a value minus the following quantity: the product (E+ row total multiplied by D+ column total) divided by N.

Chi Denominator: the denominator is the square root of all of the following: the product (E+ row total multiplied by E- row total multiplied by D+ column total multiplied by D- column total) divided by [N squared multiplied by (N minus 1)]

Note: Chi is based on the data layout in the 2 x 2 table  

Note: There is a different formula for calculating confidence limits around the risk difference. I'm not teaching that here because it is rarely used. If you have a burning desire to learn it though, I will be happy to provide you with the information. Just email me.

Yet another note: Recall from statistics that the z score is a measure of the distance in standard deviations of a sample from the mean. Each level of confidence has its own z score. For the purposes of this class, always assume a 95% level of confidence, where z = 1.96.

Z Scores for Commonly Used Confidence Intervals

Desired Confidence Level    z Score
90%                         1.645
95%                         1.96
99%                         2.576
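If you are curious where these z scores come from, they can be recovered from the standard normal distribution. The sketch below uses Python's standard library; the function name `z_for_confidence` is made up for this illustration.

```python
from statistics import NormalDist


def z_for_confidence(confidence_level: float) -> float:
    """Two-sided z score for a confidence level such as 0.95."""
    alpha = 1 - confidence_level
    # Leave alpha / 2 in each tail of the standard normal curve.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.3f}")
```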

Note: I’m using RR as an example. It’s the same with OR.

Determining Width of CI

Distance between:

lower confidence limit: $RR^{1 - (z/\chi)}$

upper confidence limit: $RR^{1 + (z/\chi)}$

12

The “width” of the confidence interval is the distance between the lower confidence limit and the upper confidence limit.

Example

We want to determine how much more common occupational benzene exposure is among people with leukemia. We conducted a case-control study in which 85 of the 125 workers with leukemia were exposed to benzene on the job. Conversely, among the 125 controls, only 40 had been exposed.

Let’s use a 95% level of confidence…

13

Calculating CI

Step 1: Set up data table

         Ca          Co          Total
E+       85 (a)      40 (b)      125 (E+)
E-       40 (c)      85 (d)      125 (E-)
Total    125 (D+)    125 (D-)    250 (N)

14

Step 2: Calculate Point Estimate

OR = ad/bc

= (85x85)/(40x40)

= 7225/1600

= 4.5

The odds of exposure are 4.5 times as high among workers with leukemia as among controls.

15

In this case, the point estimate is the Odds Ratio

OR = ad/bc

= (85 x 85) ÷ (40 x 40)

= 7225 ÷ 1600

= 4.5
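The same point estimate can be checked with a couple of lines of code. This is a minimal sketch; the function name `odds_ratio` is just a label chosen here, and the cell values come from the data table in Step 1.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a 2 x 2 table: (a x d) / (b x c)."""
    return (a * d) / (b * c)


# Cells from the benzene example: a = 85, b = 40, c = 40, d = 85.
or_estimate = odds_ratio(85, 40, 40, 85)
print(round(or_estimate, 1))
```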

Step 3: Calculate Chi

χ (chi) = $\dfrac{a - \dfrac{E^{+} \times D^{+}}{N}}{\sqrt{\dfrac{E^{+} \times E^{-} \times D^{+} \times D^{-}}{N^{2} \times (N - 1)}}}$, where a is the value in Cell A

16

Let’s start by calculating the numerator for Chi: Cell A minus [(E+ total multiplied by D+ total) divided by N]

= 85 – [(125 x 125) ÷ 250]

= 85 – (15,625 ÷ 250)

= 85 – 62.5

= 22.5

Now, let’s calculate the denominator: the square root of all of the following: (E+ total multiplied by E- total multiplied by D+ total multiplied by D- total) divided by [N squared multiplied by (N minus 1)]

= √{(125 x 125 x 125 x 125) ÷ [250² x (250 - 1)]}

= √[244140625 ÷ (62500 x 249)]

= √(244140625 ÷ 15,562,500)

= √15.69

=3.96

Now, let’s calculate Chi from our numerator and denominator = 22.5 ÷ 3.96 = 5.68

Note: There are very simple to use square root and exponent calculators online. One that I use is calculator.net.

Step 4: Calculate Width

Lower Limit = $OR^{1 - (z/\chi)}$

= $4.5^{1 - (1.96/5.68)}$

= $4.5^{0.655}$

= 2.68

Upper Limit = $OR^{1 + (z/\chi)}$

= $4.5^{1 + (1.96/5.68)}$

= $4.5^{1.345}$

= 7.56

Next we will calculate the width of our CI by first calculating our lower and upper limits.

The width of the confidence interval is the difference between the upper limit (UL) and lower limit (LL), which we express with this equation:

CI width = UL - LL

= 7.56 – 2.68

= 4.88
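To double-check the hand calculations, the whole benzene example can be run end to end. This sketch assumes the same intermediate rounding used in the worked steps (OR rounded to 4.5, chi rounded to 5.68); without that rounding the limits, especially the upper one, come out slightly different. The variable names are chosen here for readability.

```python
import math

Z_95 = 1.96  # z score for a 95% level of confidence

# 2 x 2 table from Step 1
a, b, c, d = 85, 40, 40, 85
e_pos, e_neg = a + b, c + d   # row totals (E+, E-)
d_pos, d_neg = a + c, b + d   # column totals (D+, D-)
n = a + b + c + d             # grand total N

# Step 2: point estimate, rounded as in the worked example
point_estimate = round((a * d) / (b * c), 1)

# Step 3: chi, rounded as in the worked example
numerator = a - (e_pos * d_pos) / n
denominator = math.sqrt((e_pos * e_neg * d_pos * d_neg) / (n**2 * (n - 1)))
chi = round(numerator / denominator, 2)

# Step 4: limits and width
lower = point_estimate ** (1 - Z_95 / chi)
upper = point_estimate ** (1 + Z_95 / chi)
width = upper - lower

print(f"OR = {point_estimate}, chi = {chi}")
print(f"lower = {lower:.2f}, upper = {upper:.2f}, width = {width:.2f}")
```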

17

Step 5: State and Interpret Findings

OR = 4.5 (95% CI: 2.68, 7.56)

CI width = 4.88

18

Because 1 is not included between our lower and upper bounds, we can say that there is a statistically significant difference in the odds of exposure between workers with leukemia and controls, and that we are 95% confident that the population measure lies between 2.68 and 7.56.

Because there is great variability between epidemiologists on what constitutes a narrow or wide interval, we shall leave it at that for now.

NOTE: While you’ll have to do calculations for your exercises and your quiz, you won’t have to do the math or memorize this formula on your exam. You’re welcome. =)