Biostatistics Final Project


Data Analysis

Steps in Data Analysis

Defining the question

Collecting the data

Cleaning the data

Analyzing the data

Sharing your results

Embracing failure

Summary

Cleaning Data

Removing major errors, duplicates, and outliers—all of which are inevitable problems when aggregating data from numerous sources.

Removing unwanted data points—extracting irrelevant observations that have no bearing on your intended analysis.

Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or layout issues, which will help you map and manipulate your data more easily.

Filling in major gaps—as you’re tidying up, you might notice that important data are missing. Once you’ve identified gaps, you can go about filling them.
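The cleaning steps above can be sketched in a short Stata do-file. This is only an illustration: it uses Stata's bundled auto dataset rather than the course data, and the rule for which observations count as "unwanted" is a made-up assumption.

```stata
* Sketch only: bundled example data, not the course data
sysuse auto, clear

* Report exact duplicate rows, then drop any that exist
duplicates report
duplicates drop

* Show which variables have missing values, and how many
misstable summarize

* Remove observations that have no bearing on the analysis
* (hypothetical rule: keep only cars with a recorded repair rating)
drop if missing(rep78)
```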

Exploratory Data analysis and Descriptive Statistics

Used to Summarize data

Tabular methods

Ex. Table summary with frequency and/or percent frequency

Graphical methods

Ex. Histogram

Numerical methods

Ex. Average or mean

Hudson’s average cost of parts, based on the 50 tune-ups studied, is $79
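As a rough sketch, each of the three kinds of summary has a one-line Stata command. The example assumes Stata's bundled auto dataset, since the Hudson tune-up data are not part of these slides.

```stata
* Sketch only: bundled example data, not the Hudson tune-up data
sysuse auto, clear

* Tabular: frequency and percent frequency of a categorical variable
tabulate rep78

* Graphical: histogram with counts on the vertical axis
histogram price, frequency

* Numerical: mean (plus n, standard deviation, min, and max)
summarize price
```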

Tables and Graphs

Qualitative Data

Frequency Distribution (Relative Frequency or Percent Frequency)

Bar Graph

Pie Chart

Quantitative

Frequency Distribution (Relative Frequency or Percent Frequency)

Histogram

Box Plot

Stem-and-leaf plot

Dot plot

Two variables

Cross-tabulations

Scatter plots (or Scatter diagrams)
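One possible mapping from the tables and graphs listed above to Stata commands is sketched below. The variables come from Stata's bundled auto dataset and are stand-ins, not course data. Each graph command replaces the previous graph window, so run them one at a time.

```stata
* Sketch only: bundled example data
sysuse auto, clear

* Qualitative data
tabulate rep78                          // frequency distribution
graph bar (count) price, over(rep78)    // bar graph of counts per category
graph pie, over(rep78)                  // pie chart

* Quantitative data
histogram mpg
graph box mpg
stem mpg                                // stem-and-leaf plot
dotplot mpg

* Two variables
tabulate foreign rep78                  // cross-tabulation
scatter price mpg                       // scatter plot
```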

What can graphs show us?

Describing data

Center, variation, distribution, outliers, time

Exploring data

We look for features of the graph that reveal useful or interesting characteristics of the data set.

Comparing data

Construct similar graphs that make it easy to compare data sets.

Important Characteristics of Data

Center

A representative or average value that indicates where the middle of the data set is located (mean, median, mode)

Variation

A measure of the amount that the data values vary among themselves (SD, variance, IQR, range)

Distribution

The nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).

Outliers

Sample values that lie very far away from the vast majority of the other sample values.

Time

Changing characteristics of the data over time (is there a trend?)
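The measures of center and variation named above can be requested together in Stata. A minimal sketch, again assuming the bundled auto dataset as a stand-in for real data:

```stata
* Sketch only: bundled example data
sysuse auto, clear

* Center (mean, median) and variation (SD, variance, IQR, range) in one table
tabstat price, statistics(mean median sd variance iqr range)

* Percentiles, skewness, and the smallest and largest values,
* which help with judging distribution shape and spotting outliers
summarize price, detail
```

tabstat does not report the mode; a frequency table from tabulate can be used to find it.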

Analyzing Data: Hypothesis Testing in general

Faculty note: Remind students that the significance level is chosen before looking at the data and before computing the test statistic. The same applies to choosing a one-sided or two-sided test.

Steps for Hypothesis testing

Point for lecturer: on step 4 bullet 2, tell the students alpha is usually 0.05, and more will be discussed about it as we use it in later slides.

Hypothesis Testing with CI

Emphasize mu0 is the value for the mean from the null hypothesis.
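A one-sample t-test in Stata prints the confidence interval for the mean alongside the test, which makes the link between the two visible. In this sketch the dataset (bundled auto data) and the null value mu0 = 20 are assumptions chosen only for illustration.

```stata
* Sketch only: bundled example data; mu0 = 20 is a hypothetical null value
sysuse auto, clear

* Tests H0: mean(mpg) = 20 and reports a 95% confidence interval for the mean
ttest mpg == 20, level(95)
```

If mu0 falls outside the 95% interval, the two-sided test rejects H0 at α = 0.05; if it falls inside, the test fails to reject.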

Margin of Error and the Interval Estimate

Margin of error increases if:                  Margin of error decreases if:
Sample size decreases                          Sample size increases
Standard deviation (or variance) increases     Standard deviation (or variance) decreases
Confidence level (1 − α) increases             Confidence level (1 − α) decreases

If margin of error increases, interval width increases. If margin of error decreases, interval width decreases.
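For a mean, the margin of error is the critical t value times s / sqrt(n). The sketch below uses Stata as a calculator with made-up numbers (n = 50, s = 15) to show how sample size and confidence level move it.

```stata
* Hypothetical values throughout: n = 50, s = 15

* Baseline: 95% confidence, n = 50
display invttail(49, 0.025) * 15 / sqrt(50)

* Larger sample (n = 200): margin of error shrinks
display invttail(199, 0.025) * 15 / sqrt(200)

* Higher confidence (99%), n = 50: margin of error grows
display invttail(49, 0.005) * 15 / sqrt(50)
```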

Notes about forming the Hypothesis:

In this class, you can write the null hypothesis as a strict equality (=), even for one-sided tests.

One-sided vs two-sided:

two-sided if only interested in whether the mean or proportion is different (no direction)

one-sided if we are interested in a certain direction (less than or greater than)

If Testing for a Research Hypothesis

Alternative hypothesis is the research hypothesis

Rejecting the null means that the data support our research hypothesis

If Testing the Validity of a Claim (i.e. a manufacturer’s claim)

Null hypothesis is the claim (giving their claim the benefit of the doubt)

Rejecting the null means that the data provide evidence against their claim

If Testing in Decision making situations

One course of action is the null and the other is the alternative

If we fail to reject the null, then we choose course one

If we reject the null, then we choose course two

Decision errors in hypothesis testing

Faculty note: With larger sample sizes, we have more flexibility in setting these type 1 and type 2 error rates

In a Picture

[Figure: decision (Do Not Reject / Reject) versus truth (Ho / Ha), with the critical value separating the Do Not Reject and Reject regions]

Faculty note: This flow diagram is for only 2 variables. Analyses with more than 2 variables are not covered in PH1690 except for a brief introduction to multiple linear regression.

Y → response, outcome, dependent

X → explanatory, predictor, independent

Note: many of these terms suggest a directional or causal relationship, but you don’t need a causal link to use these tests. The vocabulary is still helpful when looking at studies with exposures or at clinical trials.

Paired or correlated data → Simply not independent. Typically, we’re looking at the difference in a measure that was taken more than one time from the same person.

PH1700: Intermediate Biostatistics

PH1830: Categorical Data analysis

PH1918: Statistical Methods in Correlated Outcome Data

Tests for Categorical Data

Two Sample Proportion Test (Normal Theory method)

Two Sample Proportion Test

Note: Variable must be coded “0” and “1”

Grouping variable

Two-sided p-value

Test statistic
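The command behind the labels above is not shown in these notes; a two-sample proportion test in Stata is usually run with prtest. The sketch below simulates made-up data, and the variable names outcome and group are hypothetical.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 2024
set obs 200

* Grouping variable with two values
generate group = mod(_n, 2)

* Outcome coded 0/1, with a somewhat higher proportion in group 1
generate outcome = runiform() < (0.30 + 0.15*group)

* Normal-theory two-sample test of proportions
prtest outcome, by(group)
```

The output reports the z test statistic and the two-sided p-value along with both one-sided p-values.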

Chi Square Test

Chi-square/Fisher’s Exact

Note: if Expected values are <5, then use Fisher’s Exact p-value
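A minimal sketch of requesting the chi-square test, Fisher's exact test, and the expected counts in one table. The data are simulated and the variable names disease and exposure are hypothetical.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 2024
set obs 200
generate exposure = mod(_n, 2)
generate disease = runiform() < (0.20 + 0.10*exposure)

* chi2     = Pearson chi-square test
* exact    = Fisher's exact p-value
* expected = expected count in each cell, to check the "<5" rule
tabulate disease exposure, chi2 exact expected
```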

Quantitative (Numerical) Data Two group comparison

Assessing Normality

There are various ways to assess the normality assumption. No single method can tell the whole story so we usually rely on several methods:

1) Graphical (histograms, box plots, normal probability plots, quantile-quantile plots)

2) Statistical tests (Shapiro-Wilk tests and others)

Shapiro Wilk test

H0: The sample comes from a normal distribution

H1: The sample does not come from a normal distribution

Note: if the sample size is small, then even if the null hypothesis is not rejected, there is a possibility that we have a type II error (not rejecting the null hypothesis when we should have)

Alternatively, if the sample size is very large then the null hypothesis may be easily rejected for small deviations from normality, when in practice the assumption of normality is close enough to proceed with the analysis.

Box plot by group

Example: graph box sbp, by(salt)

General: graph box variable, by(group)

Histogram by group

Example: histogram sbp, by(salt)

General: histogram variable, by(group)

QQ plot by group

qnorm variable if group==1

qnorm variable if group==2

Shapiro Wilk by group

General: by group, sort : swilk var

P-values: the larger the better. A p-value above α means we fail to reject the null hypothesis of normality.
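The normality checks above can be run together as one short do-file. Because the course dataset is not included here, this sketch simulates sbp and salt with made-up values; only the command forms are taken from the slides.

```stata
* Sketch only: simulated stand-in for the sbp / salt data
clear
set seed 1690
set obs 60
generate salt = mod(_n, 2)
generate sbp = rnormal(100 + 8*salt, 12)

* Graphical checks, one panel or plot per group
graph box sbp, by(salt)
histogram sbp, by(salt)
qnorm sbp if salt == 0
qnorm sbp if salt == 1

* Shapiro-Wilk test within each group
by salt, sort: swilk sbp
```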

Two-Sample Independent T-test: Not Assuming Equal Standard Deviations (σ1≠ σ2)

Two-Sample Independent T-test: unequal variance assumption

Grouping variable

Continuous variable

Note: Must include this option if not assuming equal variance

Test statistic

Degrees of freedom

Two-sided p-value
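The labels above describe a ttest call with the unequal option. A sketch on simulated data follows; the values are made up, and sbp and salt are used only to match the earlier examples.

```stata
* Sketch only: simulated stand-in for the sbp / salt data
clear
set seed 1690
set obs 60
generate salt = mod(_n, 2)
generate sbp = rnormal(100 + 8*salt, 12 + 6*salt)

* sbp = continuous variable, salt = grouping variable
* unequal = do not assume equal variances (Satterthwaite degrees of freedom)
ttest sbp, by(salt) unequal
```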

Stata commands for all T-tests

Independent samples

paired samples

Independent samples

paired samples
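The commands themselves did not survive in these notes. The sketch below lists the standard ttest forms for independent and paired samples; the data are simulated, the variable names are hypothetical, and this may not match the exact layout of the original slide.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 1690
set obs 40
generate salt = mod(_n, 2)
generate sbp = rnormal(100 + 8*salt, 12)
generate sbp_before = rnormal(120, 10)
generate sbp_after = sbp_before - 3 + rnormal(0, 5)

* Independent samples, one outcome variable plus a grouping variable
ttest sbp, by(salt)              // assumes equal variances
ttest sbp, by(salt) unequal      // does not assume equal variances

* Paired samples, two measurements stored in two variables
ttest sbp_before == sbp_after

* Independent samples stored in two separate variables
ttest sbp_before == sbp_after, unpaired
```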

Wilcoxon Rank Sum (Non-parametric)

For independent samples when the underlying distribution is unknown (or can't be assumed).

Nonparametric equivalent of the two-sample independent t-test.

Assumptions:

Samples are random and independent; normality is not required.

Hypotheses:

H0: There is no difference in the distributions of sample 1 and sample 2.

Ha: There is a difference in the distributions of sample 1 and sample 2.

Wilcoxon Rank Sum

P-value

Test statistic

Continuous variable

Grouping variable
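In Stata the Wilcoxon rank sum test is run with ranksum. A sketch on simulated, deliberately skewed data (made-up values, variable names borrowed from the earlier examples):

```stata
* Sketch only: simulated right-skewed data
clear
set seed 1690
set obs 60
generate salt = mod(_n, 2)
generate sbp = 90 + 8*salt + 10*exp(rnormal())

* sbp = continuous variable, salt = grouping variable
ranksum sbp, by(salt)
```

The output gives the rank sums for each group, the z test statistic, and the two-sided p-value.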

Quantitative (Numerical) Data Multiple group comparison

ANOVA

Assess conditions/assumptions

Independence

Normality of each group

Graphically

Shapiro Wilk test

Equal variance

graphically

Stata tests equal variance as part of the ANOVA

2) Assessing Normality: histogram

histogram age, bin(10) frequency by(group)

Can also use the drop down menu for the histogram. The very last tab says “by”. This is where you tell Stata you want individual histograms for each of the 3 treatment groups.

Check normality assumption with QQ plots:

Assessing Normality: QQ plots

qnorm age if group==1

qnorm age if group==2

qnorm age if group==3

qnorm age, ytitle() yscale(titlegap(4)) xtitle() xscale(titlegap(4)) title(All 68 people) scheme(s2color)

Note that the three groups were not the same size. Did membership in Control group 2 make it difficult to come up with 25 people who met criteria and were willing to participate?

[Figure panels: Surgical, Control 1, Control 2]

The normality of the 3 groups is not great (but the groups are pretty small). The graph of all patients is pretty good.

Assessing Normality: Shapiro Wilk

[Stata output: Shapiro-Wilk results for Surgical, Control 1, and Control 2]

For each of the 3 groups we fail to reject the null, so the groups are normal enough given these p-values and the figures above.

Assess Equal variances graphically

graph box age, over(group) title(Comparison of age at time of study by treatment group) caption(Treatment Group, position(6))

Can also create this using the dropdown menu (Graphics/Box Plot). Note, if we use “by” we get 3 separate plots as we did for the histograms. If we use “over groups” we get the single plot above.

Since the boxplots show a similar spread, the variances look similar.

Assessing Equal Variances

Stata ANOVA automatically includes Bartlett’s test for equal variances

Bartlett H0: variances are equal

Bartlett Ha: variances are not equal

Here, there is not enough evidence to say the variances are not equal, so it is okay to perform and interpret the ANOVA.

Faculty note: explain what details go into the command (oneway; group = variable containing grouping information; age = variable of interest).

Bartlett’s test is a test of equal variances across the groups.

Fill in the blank: “There is not enough evidence to say variances are not equal, so it is okay to perform and interpret ANOVA.”

First we will discuss the details behind the ANOVA table.

ANOVA table (in Stata) for age

Since all the assumptions hold, we can then interpret the p-value. Since the p-value is less than 0.05, we can conclude that the data provide convincing evidence that the average age is different for at least one group
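A sketch of the command that produces a table like this one. The course data are not included, so age and group are simulated here with made-up values and roughly equal group sizes.

```stata
* Sketch only: simulated stand-in for the age / group data
clear
set seed 1690
set obs 68
generate group = 1 + mod(_n, 3)
generate age = rnormal(40 + 6*(group == 2), 8)

* oneway = one-way ANOVA; age = variable of interest; group = grouping variable
* The output ends with Bartlett's test for equal variances
oneway age group
```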

Faculty note: explain what details go into the command (oneway tells Stata to do a one-way ANOVA; group = variable containing grouping information; age = variable of interest).

Bartlett’s test is a test of equal variances across the groups.

Fill in the blank: “There is not enough evidence to say variances are not equal, so it is okay to interpret ANOVA.”

NOTE: Poll coming on the next slides to interpret what the significant p-value means.

Interpreting ANOVA in general

If the p-value is small (less than α), reject H0. The data provide convincing evidence that at least one mean is different from the others (but we can't tell which one).

If the p-value is large, fail to reject H0. The data do not provide convincing evidence that at least one pair of means differ from each other; the observed differences in sample means are attributable to sampling variability (or chance).

Notice ANOVA alternative hypothesis is in terms of “at least one mean is different”, but doesn’t say which one(s).

Once an ANOVA is significant, how do we tell which means differ?

Pairwise comparisons with Bonferroni correction in Stata

Options:

“tab” provides table of summary statistics

“bon” provides pairwise comparisons with Bonferroni correction

This is the output from oneway after including Bonferroni-adjusted pairwise comparisons.

First check: are the variances equal? We want to fail to reject the null hypothesis in Bartlett’s test (also look at the std dev column). Then check the F-test. It is statistically significant, so we reject the null hypothesis that the mean ages of the three groups are the same. In the Bonferroni comparisons, 7.72 is the difference between the two means and 0.000 is the p-value (p<0.001). So there are differences between Surgical and Control 1 and between Control 1 and Control 2.
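A sketch of the full command with both options, on simulated data (made-up values; the variable names age and group follow the slides).

```stata
* Sketch only: simulated stand-in for the age / group data
clear
set seed 1690
set obs 68
generate group = 1 + mod(_n, 3)
generate age = rnormal(40 + 6*(group == 2), 8)

* tabulate   = table of summary statistics (mean, std dev, n) by group
* bonferroni = pairwise comparisons with Bonferroni-adjusted p-values
oneway age group, tabulate bonferroni

* Number of pairwise comparisons among k = 3 groups: k(k-1)/2
display 3*(3-1)/2
```

Each cell of the comparison table shows the difference in means (column mean minus row mean) with its adjusted p-value underneath.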

[Figure: box plots of sbp, by salt (no salt, salt)]

[Figure: histograms of sbp (density), by salt (no salt, salt)]

[Figure: normal quantile plot of sbp against Inverse Normal]

[Figure: normal quantile plot of sbp against Inverse Normal]