Biostatistics Final Project
Data Analysis
Steps in Data Analysis
Defining the question
Collecting the data
Cleaning the data
Analyzing the data
Sharing your results
Embracing failure
Summary
Cleaning Data
Removing major errors, duplicates, and outliers—all of which are inevitable problems when aggregating data from numerous sources.
Removing unwanted data points—extracting irrelevant observations that have no bearing on your intended analysis.
Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or layout issues, which will help you map and manipulate your data more easily.
Filling in major gaps—as you’re tidying up, you might notice that important data are missing. Once you’ve identified gaps, you can go about filling them.
Exploratory Data Analysis and Descriptive Statistics
Used to summarize data
Tabular methods
Ex. Table summary with frequency and/or percent frequency
Graphical methods
Ex. Histogram
Numerical methods
Ex. Average or mean
Hudson’s average cost of parts, based on the 50 tune-ups studied, is $79
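A minimal Stata sketch of the three kinds of summary listed above. It assumes the auto example dataset that ships with Stata as stand-in data, since the tune-up data are not included here; foreign and mpg are variables in that dataset, not in the course data.

```stata
* Stand-in data: the auto example dataset shipped with Stata (an assumption)
sysuse auto, clear

* Tabular: frequency and percent frequency of a categorical variable
tabulate foreign

* Graphical: histogram of a quantitative variable, y-axis in counts
histogram mpg, frequency

* Numerical: mean, standard deviation, and percentiles
summarize mpg, detail
```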
Tables and Graphs
Qualitative Data
Frequency Distribution (Relative Frequency or Percent Frequency)
Bar Graph
Pie Chart
Quantitative
Frequency Distribution (Relative Frequency or Percent Frequency)
Histogram
Box Plot
Stem-and-leaf plot
Dot plot
Two variables
Cross-tabulations
Scatter plots (or Scatter diagrams)
What can graphs show us?
Describing data
Center, variation, distribution, outliers, time
Exploring data
We look for features of the graph that reveal useful and/or interesting characteristics of the data set.
Comparing data
Construct similar graphs that make it easy to compare data sets.
Important Characteristics of Data
Center
A representative or average value that indicates where the middle of the data set is located (mean, median, mode)
Variation
A measure of the amount that the data values vary among themselves (SD, variance, IQR, range)
Distribution
The nature or shape of the distribution of the data (such as bell-shaped, uniform, or skewed).
Outliers
Sample values that lie very far away from the vast majority of the other sample values.
Time
Changing characteristics of the data over time (is there a trend?)
Analyzing Data: Hypothesis Testing in general
Faculty note: Remind students that the significance level is chosen before looking at the data and before computing the test statistic. The same applies to choosing a one-sided or two-sided test.
Steps for Hypothesis testing
Point for lecturer: on step 4 bullet 2, tell the students alpha is usually 0.05, and more will be discussed about it as we use it in later slides.
Hypothesis Testing with CI
Emphasize mu0 is the value for the mean from the null hypothesis.
Margin of Error and the Interval Estimate
| Margin of Error Increases if | Margin of Error Decreases if |
| --- | --- |
| Sample size decreases | Sample size increases |
| Standard deviation (or variance) increases | Standard deviation (or variance) decreases |
| Confidence level (1 − α) increases | Confidence level (1 − α) decreases |
| If margin of error increases, interval width increases | If margin of error decreases, interval width decreases |
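The rows of the table follow from the margin-of-error formula for a mean, t × s / √n. A rough Stata sketch, using made-up values for n, s, and the confidence level (none of these numbers come from the course data):

```stata
* Hypothetical inputs, for illustration only
scalar n = 50
scalar s = 20
scalar conf = 0.95

* Margin of error for a mean: t critical value times the standard error
scalar moe = invttail(n - 1, (1 - conf)/2) * s / sqrt(n)
display "Margin of error: " moe

* A larger sample shrinks the margin; a higher confidence level widens it
display "With n = 200: " invttail(199, (1 - conf)/2) * s / sqrt(200)
display "At 99% level: " invttail(n - 1, 0.005) * s / sqrt(n)
```

Here invttail(df, p) returns the t value with upper-tail probability p, so (1 - conf)/2 gives the two-sided critical value.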
Notes about forming the Hypothesis:
In this class, you may write the null hypothesis with a strict equality (=) even for one-sided tests
One-sided vs two-sided:
two-sided if only interested in whether the mean or proportion is different (no direction)
one-sided if we are interested in a certain direction (less than or greater than)
If Testing for a Research Hypothesis
Alternative hypothesis is the research hypothesis
Rejecting the null means that the data support our research hypothesis
If Testing the Validity of a Claim (i.e. a manufacturer’s claim)
Null hypothesis is the claim (giving their claim the benefit of the doubt)
Rejecting the null means that the data provide evidence against their claim
If Testing in Decision making situations
One course of action is the null and the other is the alternative
If we fail to reject the null, then we choose course one
If we reject the null, then we choose course two
Decision errors in hypothesis testing (3)
Faculty note: With larger sample sizes, we have more flexibility in setting these type 1 and type 2 error rates
In a Picture
[Figure: decision (Do Not Reject or Reject) against truth (H0 or Ha), with the critical value separating the Do Not Reject and Reject regions]
Faculty note: This flow diagram is for only 2 variables. More than 2 variables are not covered in PH1690 except for a brief introduction to multiple linear regression.
Y → response, outcome, dependent
X → explanatory, predictor, independent
Note: many of these terms suggest a directional or causal relationship, but you don’t need a causal link to use these tests. The vocabulary is still helpful when looking at studies with exposures or clinical trials.
Paired or correlated data → Simply not independent. Typically, we’re looking at the difference in a measure that was taken more than one time from the same person.
PH1700: Intermediate Biostatistics
PH1830: Categorical Data analysis
PH1918: Statistical Methods in Correlated Outcome Data
Tests for Categorical Data
Two Sample Proportion Test (Normal Theory method)
Two Sample Proportion Test
Note: Variable must be coded “0” and “1”
Grouping variable
Two-sided p-value
Test statistic
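A sketch of the command behind output like this, assuming the auto example dataset as stand-in data. highmpg is a hypothetical 0/1 outcome created only for illustration, and foreign serves as the two-level grouping variable.

```stata
* Stand-in data (an assumption, not the course dataset)
sysuse auto, clear

* The outcome must be coded 0/1; this cutoff is arbitrary
generate byte highmpg = mpg > 25 if !missing(mpg)

* Two-sample test of proportions: outcome first, grouping variable in by()
prtest highmpg, by(foreign)
```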
Chi Square Test
Chi-square/Fisher’s Exact
Note: if any expected cell count is less than 5, use the Fisher’s Exact p-value
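One way to get both p-values and the expected counts in a single table, again using the auto dataset and the hypothetical highmpg indicator as stand-ins for the course variables:

```stata
* Stand-in data and a hypothetical 0/1 variable, for illustration only
sysuse auto, clear
generate byte highmpg = mpg > 25 if !missing(mpg)

* chi2 = Pearson chi-square test, exact = Fisher's exact test,
* expected = print the expected count in each cell
tabulate highmpg foreign, chi2 exact expected
```

Checking the expected counts in the output shows which of the two p-values the note above says to report.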
Quantitative (Numerical) Data Two group comparison
Assessing Normality
There are various ways to assess the normality assumption. No single method can tell the whole story so we usually rely on several methods:
1) Graphical (histograms, box plots, normal probability plots, quantile-quantile plots)
2) Statistical tests (Shapiro-Wilk tests and others)
Shapiro Wilk test
H0: The sample comes from a normal distribution
H1: The sample does not come from a normal distribution
Note: if the sample size is small, then even if the null hypothesis is not rejected, there is a possibility that we have a type II error (not rejecting the null hypothesis when we should have)
Alternatively, if the sample size is very large then the null hypothesis may be easily rejected for small deviations from normality, when in practice the assumption of normality is close enough to proceed with the analysis.
Box plot by group
Example: graph box sbp, by(salt)
General: graph box variable, by(group)
Histogram by group
Example: histogram sbp, by(salt)
General: histogram variable, by(group)
QQ plot by group
qnorm variable if group==1
qnorm variable if group==2
Shapiro Wilk by group
General: by group, sort : swilk var
P-values: the larger the p-value, the less evidence against normality
Two-Sample Independent T-test: Not Assuming Equal Standard Deviations (σ1≠ σ2)
Two-Sample Independent T-test: unequal variance assumption
Grouping variable
Continuous variable
Note: Must include this option if not assuming equal variance
Test statistic
Degrees of freedom
Two-sided p-value
Stata commands for all T-tests
Independent samples
paired samples
Independent samples
paired samples
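A hedged sketch of the usual ttest forms for these cases, using stand-in data: the auto dataset for the independent-samples forms and the bpwide dataset (fictional blood pressure data shipped with Stata) for the paired form. The variable names belong to those datasets, not to the course data.

```stata
* Independent samples, assuming equal variances
sysuse auto, clear
ttest mpg, by(foreign)

* Independent samples, not assuming equal variances
ttest mpg, by(foreign) unequal

* Paired samples: two measurements on the same patient
sysuse bpwide, clear
ttest bp_before == bp_after
```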
Wilcoxon Rank Sum (Non-parametric)
For independent samples when the underlying distribution is unknown (or can't be assumed).
Nonparametric equivalent of the two-sample independent t-test.
Assumptions:
Samples are random and independent; normality is not assumed.
Hypotheses:
H0: There is no difference in the distributions of sample 1 and sample 2.
Ha: There is a difference in the distributions of sample 1 and sample 2.
Wilcoxon Rank Sum
P-value
Test statistic
Continuous variable
Grouping variable
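A minimal sketch of the corresponding command, assuming the auto example dataset as stand-in data (mpg as the continuous variable, foreign as the two-level grouping variable):

```stata
* Stand-in data (an assumption, not the course dataset)
sysuse auto, clear

* Continuous variable first, two-level grouping variable in by()
ranksum mpg, by(foreign)
```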
Quantitative (Numerical) Data Multiple group comparison
ANOVA
Assess conditions/assumptions
Independence
Normality of each group
Graphically
Shapiro Wilk test
Equal variance
graphically
Stata tests equal variance as part of the ANOVA
2) Assessing Normality: histogram
histogram age, bin(10) frequency by(group)
Can also use the drop down menu for the histogram. The very last tab says “by”. This is where you tell Stata you want individual histograms for each of the 3 treatment groups.
Check normality assumption with QQ plots:
Assessing Normality: QQ plots
qnorm age if group==1
qnorm age if group==2
qnorm age if group==3
qnorm age, ytitle() yscale(titlegap(4)) xtitle() xscale(titlegap(4)) title(All 68 people) scheme(s2colork)
Note that the three groups were not the same size. Did membership in Control group 2 make it difficult to come up with 25 people who met criteria and were willing to participate?
[Figure: QQ plots of age for Surgical, Control 1, and Control 2]
The normality of the 3 groups is not great (but the groups are pretty small). The graph of all patients is pretty good.
Assessing Normality: Shapiro Wilk
[Stata output: Shapiro-Wilk test for Surgical, Control 1, and Control 2]
For each of the 3 groups we fail to reject the null hypothesis of normality; given these p-values and the figures above, the groups are normal enough to proceed.
Assess Equal variances graphically
graph box age, over(group) title(Comparison of age at time of study by treatment group) caption(Treatment Group, position(6))
Can also create this using the dropdown menu (Graphics/Box Plot). Note, if we use “by” we get 3 separate plots as we did for the histograms. If we use “over groups” we get the single plot above.
Since the box plots show similar spread, the variances look similar.
Assessing Equal Variances
Stata ANOVA automatically includes Bartlett’s test for equal variances
Bartlett H0: variances are equal
Bartlett Ha: variances are not equal
Here, there is not enough evidence to say the variances are not equal, so it is okay to perform and interpret the ANOVA.
Faculty note: explain what details go into the command (oneway; group = variable containing grouping information; age = variable of interest).
Bartlett’s test is a test of equal variances across the groups.
Fill in the blank: “There is not enough evidence to say variances are not equal, so it is okay to perform and interpret ANOVA.”
First we will discuss the details behind the ANOVA table.
ANOVA table (in Stata) for age
Since all the assumptions hold, we can interpret the p-value. Since the p-value is less than 0.05, we conclude that the data provide convincing evidence that the average age is different for at least one group.
Faculty note: explain what details go into the command (oneway tells Stata to do a one-way ANOVA; group = variable containing grouping information; age = variable of interest).
Bartlett’s test is a test of equal variances across the groups.
Fill in the blank: “There is not enough evidence to say variances are not equal, so it is okay to interpret ANOVA.”
NOTE: Poll coming on the next slides to interpret what the significant p-value means.
Interpreting ANOVA in general
If p-value is small (less than α), reject H0. The data provide convincing evidence that at least one mean is different from the others (but we can't tell which one).
If p-value is large, fail to reject H0. The data do not provide convincing evidence that at least one pair of means differ from each other; the observed differences in sample means can be attributed to sampling variability (or chance).
Notice ANOVA alternative hypothesis is in terms of “at least one mean is different”, but doesn’t say which one(s).
Once an ANOVA is significant, how do we tell which means differ?
Pairwise comparisons with Bonferroni correction in Stata
Options:
“tab” provides table of summary statistics
“bon” provides pairwise comparisons with Bonferroni correction
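A sketch of the full command with both options, assuming the auto example dataset as stand-in data (mpg as the response, rep78 as the grouping variable). With the course variables named in the notes, the equivalent would presumably be oneway age group, tab bon.

```stata
* Stand-in data (an assumption, not the course dataset)
sysuse auto, clear

* Response first, then the grouping variable;
* tabulate adds group summary statistics,
* bonferroni adds Bonferroni-adjusted pairwise comparisons
oneway mpg rep78, tabulate bonferroni
```

The same output also reports Bartlett's test for equal variances below the ANOVA table.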
This is the output from oneway after including Bonferroni-adjusted pairwise comparisons.
First check: are the variances equal? We want to fail to reject the null hypothesis in Bartlett’s test (also look at the std dev). Then check the F-test. It is statistically significant, so we reject the null hypothesis that the mean ages of the three groups are the same. In the Bonferroni table, 7.72 is the difference between two group means and 0.000 is the p-value (p<0.001). So there is a difference between Surgical and Control 1 and between Control 1 and Control 2.
[Figure: box plots of sbp, graphs by salt (panels: no salt, salt)]
[Figure: histograms of sbp (y-axis: Density), graphs by salt (panels: no salt, salt)]
[Figure: two normal quantile plots of sbp against Inverse Normal]