Framingham Heart Study: Statistical Analysis
Industry Applied Activity
Purpose
This activity focuses on fitting logistic regression models to determine whether the relationships between incidence of cardiovascular disease and predictors of interest are statistically significant.
SAS Software
This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.
Industry Alignment
This activity aligns with the healthcare and life sciences industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.
Table of Contents
Framingham Heart Study: Statistical Analysis
  Purpose
  SAS Software
  Industry Alignment
Activity Notes and Requirements
  Learning Objectives
  Estimated Completion Time
  Experience Level
  Prerequisite Knowledge
    Software
    Content Knowledge
  Additional Notes
Data Source
  Introduction
  Description of variables
Framingham Heart Study: Statistical Analysis Activity
  Part 1: Data Setup
  Part 2: Statistical Analysis of Relationships Between CHD Outcome and Predictors
Appendix
  Appendix A: Access Software
  Appendix B: Data Information
  Appendix C: Helpful Documentation
  Appendix D: Recommended Learning
Activity Notes and Requirements
Learning Objectives
This activity is designed to practice statistical analyses appropriate for a binary outcome. The objectives are to:
· Fit a logistic regression model.
· Compute and interpret odds ratios and associated confidence intervals.
· Fit several logistic regression models using stepwise and backward model selection techniques and identify the best-fitting model.
Estimated Completion Time
This activity will take students approximately 3 hours to complete.
Experience Level
To complete this activity students should have the following levels of experience:
· Intermediate skill in SAS programming
· Intermediate skill in statistics
Prerequisite Knowledge
Software
Students should have experience with the following:
· SAS procedures such as PROC LOGISTIC that are appropriate for modeling data with a binary outcome.
Content Knowledge
Students should be familiar with the following concepts:
· Logistic regression models incorporating one or more predictors.
· Odds ratios and their interpretation.
· Model selection techniques such as forwards and backwards selection.
· Comparing models using the Akaike information criterion (AIC).
Additional Notes
This activity pairs well with the following activities in the Academic Hub:
· Framingham Heart Study: Data Preparation, Industry Applied Activity
· Framingham Heart Study: Descriptive Analysis, Industry Applied Activity
Data Source
Introduction
This activity uses the MYHEART dataset, available in the Academic Hub where this activity was downloaded. This dataset was created in the Framingham Heart Study: Data Preparation, Industry Applied Activity, also found on the Academic Hub. The data has been modified from the landmark Framingham Heart Study (https://framinghamheartstudy.org/). The purpose of the Framingham Heart Study was to identify characteristics contributing to cardiovascular disease. For more information on this study, refer to Appendix B: Data Information.
Description of variables
The variables contained in this dataset are:
| Variable | Description |
|---|---|
| Status | Alive or dead |
| DeathCause | Cause of death |
| AgeCHDdiag | Age at which CHD was diagnosed |
| Sex | Male or female |
| AgeAtStart | Age at the entry into the Framingham Heart Study |
| Height | Height in inches |
| Weight | Weight in pounds |
| Diastolic | Diastolic blood pressure |
| Systolic | Systolic blood pressure |
| MRW | Metropolitan Relative Weight |
| Smoking | Number of packs of cigarettes smoked per week |
| AgeatDeath | Age at death |
| Cholesterol | Total cholesterol |
| Chol_Status | Total cholesterol categorized into groups |
| BP_Status | Diastolic and systolic blood pressure categorized into groups |
| Weight_Status | Height and weight categorized into groups |
| Smoking_Status | Number of packs of cigarettes smoked per week categorized into groups |
| CHD | Diagnosis of coronary heart disease (1=Yes, 0=No) |
| Chol_StatusNew | Variable Chol_Status categorized into re-ordered groups |
| Sex_New | Variable Sex categorized into re-ordered groups |
| Weight_StatusNew | Variable Weight_Status categorized into re-ordered groups |
| Smoking_StatusNew | Variable Smoking_Status categorized into re-ordered groups |
Framingham Heart Study: Statistical Analysis Activity
Part 1: Data Setup
Before beginning this activity, create a SAS library called HEARTLIB and store the MYHEART dataset in this library. If you have not previously completed the Framingham Heart Study: Data Preparation, Industry Applied Activity, make sure to familiarize yourself with the MYHEART dataset prior to attempting Part 2.
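The setup described above might look like the following minimal sketch. The folder path is a hypothetical placeholder (it is not given in this activity); replace it with the location where you saved the MYHEART dataset.

```sas
/* Hypothetical folder: replace with the location of your MYHEART dataset */
libname heartlib "/home/your-user-id/heartlib";

/* Confirm that the dataset is visible and review its variables */
proc contents data=heartlib.myheart;
run;

/* Check how the CHD outcome is coded and how often each value occurs */
proc freq data=heartlib.myheart;
   tables CHD;
run;
```

Reviewing the PROC CONTENTS and PROC FREQ output before modeling is a quick way to confirm variable names, types, and the coding of CHD.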
Part 2: Statistical Analysis of Relationships Between CHD Outcome and Predictors
The Framingham Heart Study aimed to identify characteristics contributing to cardiovascular disease. Numerous potential predictors were collected as well as whether the patient was diagnosed with coronary heart disease. This activity focuses on fitting and comparing logistic regression models to determine whether the relationships between incidence of cardiovascular disease and predictors of interest are statistically significant. These statistical analyses are conducted in two stages. You will first fit univariate logistic regression models relating incidence of CHD to each predictor. Following these analyses, you will fit several multivariate logistic regression models to determine the best set of predictors which together model incidence of CHD.
1. Construct a univariate logistic regression model with CHD=1 as the outcome and one of several categorical variables as the predictor.
a. Use HEARTLIB.MYHEART as the input dataset.
b. Construct a univariate model for each of these predictors:
i. BP_Status
ii. Smoking_StatusNew
iii. Weight_StatusNew
iv. Chol_StatusNew
v. Sex_New
c. Note the AIC model fit statistic for each model.
d. Compute odds ratios and associated 95% confidence intervals for the odds of CHD=1 for all pairwise comparisons of the levels of each predictor. For example, for BP_Status, compare High vs. Normal, High vs. Optimal, and Normal vs. Optimal. Designate CHD=0 as the reference level of the outcome so that the model describes the odds of CHD=1 relative to CHD=0, and each odds ratio compares those odds between two levels of the predictor.
e. After completing parts a-d, answer these questions:
i. Which model had the lowest AIC value, indicating the best fitting model?
ii. Which pairwise combinations of variable levels have odds ratios that show the levels are statistically different from each other in incidence of CHD?
iii. Which variable had the largest pairwise odds ratio comparing any two of its levels?
iv. What do you conclude about the univariate relationships between CHD and the categorical predictors of interest?
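One possible way to organize the models in question 1 is sketched below. It is an illustration rather than the activity's official solution: the macro name fit_cat and the WORK dataset names are hypothetical, and the sketch assumes that the formatted value of the CHD reference level is '0'.

```sas
/* Hypothetical helper macro: fits one univariate model per categorical predictor */
%macro fit_cat(pred);
   /* Save the Model Fit Statistics table (includes AIC) to a WORK dataset */
   ods output FitStatistics=work.fit_&pred;
   proc logistic data=heartlib.myheart;
      class &pred / param=ref;
      /* CHD=0 is the reference level, so the model describes the odds of CHD=1 */
      model CHD(ref='0') = &pred;
      /* Odds ratios with Wald 95% limits for all pairwise level comparisons */
      oddsratio &pred / diff=all cl=wald;
   run;

   title "AIC for the model with &pred";
   proc print data=work.fit_&pred noobs;
      where Criterion = 'AIC';
   run;
   title;
%mend fit_cat;

%fit_cat(BP_Status)
%fit_cat(Smoking_StatusNew)
%fit_cat(Weight_StatusNew)
%fit_cat(Chol_StatusNew)
%fit_cat(Sex_New)
```

A confidence interval that excludes 1 indicates that the two levels being compared differ significantly in the odds of CHD at the 0.05 level.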
2. Construct a univariate logistic regression model with CHD=1 as the outcome and one of several continuous variables as the predictor.
a. Use HEARTLIB.MYHEART as the input dataset.
b. Construct a univariate model for each of these predictors:
i. Diastolic
ii. Systolic
iii. Smoking
iv. MRW
v. Cholesterol
c. Previous descriptive analyses show a quadratic relationship between the logit of CHD and each of Diastolic, Systolic, MRW, and Cholesterol. Given this, include linear and quadratic terms in the models including the predictors Diastolic, Systolic, MRW, and Cholesterol. Include only a linear term in the model including the predictor Smoking.
d. State the AIC model fit statistic for each model.
e. Compute odds ratios and associated 95% confidence intervals for the following starting values and unit increases of each predictor. Designate CHD=0 as the reference level of the outcome so that the model describes the odds of CHD=1 relative to CHD=0.
i. A 10-unit increase in Diastolic, starting from Diastolic = 80, 90, and 100.
ii. A 10-unit increase in Systolic, starting from Systolic = 120, 130, and 140.
iii. A 5-unit increase in Smoking, starting from Smoking = 0.
iv. A 25-unit increase in MRW, starting from MRW = 100 and 150.
v. A 50-unit increase in Cholesterol, starting from Cholesterol = 150 and 200.
f. After completing parts a-e, answer these questions:
i. Which model had the lowest AIC value, indicating the best fitting model?
ii. Which computed odds ratios showed statistically significant relationships?
iii. What do you conclude about the univariate relationships between CHD and the continuous predictors of interest?
g. After completing question 1, parts a-d, and question 2, parts a-e, answer these questions:
i. Considering models fit in both questions 1 and 2, compare the AIC for the model for each continuous predictor generated in question 2 to the model for its associated categorical predictor(s) generated in question 1. For example, compare the AICs from the models for BP_Status, Diastolic, and Systolic. Which model has the lowest AIC? In general, do models with continuous or categorical predictors fit better?
ii. Over all models fit in questions 1 and 2, which model and associated predictor had the lowest AIC?
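The sketch below illustrates one way to approach question 2 for Diastolic and Smoking; the other continuous predictors would follow the Diastolic pattern. It is a hedged example, not the official solution. In particular, it assumes that the AT option of the ODDSRATIO statement accepts values of the modeled variable itself when a quadratic term is present; if your SAS release does not behave this way, the ESTIMATE statement shown computes the same quantity explicitly for one comparison.

```sas
/* Quadratic model: linear and squared terms for Diastolic */
proc logistic data=heartlib.myheart;
   model CHD(ref='0') = Diastolic Diastolic*Diastolic;
   /* Odds ratios are reported for a 10-unit increase */
   units Diastolic=10;
   /* Assumption: AT fixes the starting value of Diastolic for each odds ratio */
   oddsratio Diastolic / at(Diastolic=80 90 100) cl=wald;
   /* Explicit cross-check for the increase from 80 to 90:
      log(OR) = 10*b1 + (90**2 - 80**2)*b2 = 10*b1 + 1700*b2 */
   estimate 'Diastolic 90 vs 80' Diastolic 10 Diastolic*Diastolic 1700 / exp cl;
run;

/* Linear-only model for Smoking: the odds ratio for a 5-unit increase
   is the same at every starting value, including Smoking=0 */
proc logistic data=heartlib.myheart;
   model CHD(ref='0') = Smoking;
   units Smoking=5;
   oddsratio Smoking / cl=wald;
run;
```

With a quadratic term in the model, the odds ratio for a fixed increase depends on the starting value, which is why the activity asks for several starting values per predictor. The AIC for each model appears in the Model Fit Statistics table of the output.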
3. Perform stepwise model selection to fit a multivariate logistic regression model with the best set of predictors for modeling the incidence of CHD.
a. Use HEARTLIB.MYHEART as the input dataset.
b. Include in the set of predictor variables the variable Sex_New as well as all continuous predictors examined in question 2.
c. Make sure to include the quadratic predictors included in models for question 2.
d. Use significance level as the criterion to determine which variables enter or leave the model. Use 0.05 as the significance level to enter the model and 0.05 as the significance level to stay in the model.
e. Make sure to note the AIC of the final model.
f. Which predictors are included in the final model?
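A minimal sketch of the stepwise selection in question 3 follows, assuming the same predictor set and quadratic terms used in question 2. It is one reasonable specification rather than the definitive one.

```sas
proc logistic data=heartlib.myheart;
   class Sex_New / param=ref;
   model CHD(ref='0') = Sex_New
                        Diastolic   Diastolic*Diastolic
                        Systolic    Systolic*Systolic
                        MRW         MRW*MRW
                        Cholesterol Cholesterol*Cholesterol
                        Smoking
         / selection=stepwise
           slentry=0.05    /* significance level to enter the model */
           slstay=0.05;    /* significance level to stay in the model */
run;
```

The AIC of the final model should be the one in the last Model Fit Statistics table reported, and the selection summary table lists the order in which effects entered or left. Note that PROC LOGISTIC applies a model hierarchy rule during selection (controlled by the HIERARCHY= option), so a quadratic term is typically considered only when its linear term is in the model.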
4. Repeat question 3 using backwards model selection.
a. Which predictors are included in the final model when using backwards model selection? Are these different from the predictors selected via stepwise model selection?
b. Compare the AIC values of the final models generated in questions 3 and 4. Which one has the lower AIC? How do these AIC values compare to the AIC values of the models generated in questions 1 and 2?
c. Overall, what do you conclude from the best models obtained in questions 3 and 4 about the most important predictors of CHD?
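For question 4, a comparable sketch changes only the selection method. Backward selection starts from the full model and removes effects, so only the significance level to stay applies. As before, this is an illustrative specification under the same assumptions as the stepwise sketch.

```sas
proc logistic data=heartlib.myheart;
   class Sex_New / param=ref;
   model CHD(ref='0') = Sex_New
                        Diastolic   Diastolic*Diastolic
                        Systolic    Systolic*Systolic
                        MRW         MRW*MRW
                        Cholesterol Cholesterol*Cholesterol
                        Smoking
         / selection=backward
           slstay=0.05;    /* effects with p-values above 0.05 are removed */
run;
```

Comparing the final Model Fit Statistics table from this run with the one from the stepwise run gives the AIC comparison requested in part b.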
Congratulations! You have completed a statistical analysis of the predictors of coronary heart disease. Additional analyses of interest in the actual Framingham Heart Study include a survival analysis of predictors of time to coronary heart disease as well as longitudinal analyses of data collected over the 32 biennial exams in the study.
Appendix
Appendix A: Access Software
SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle, from data to discovery to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.
Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up.
Check out Frequently Asked Questions for more support.
Appendix B: Data Information
The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014, almost 7 decades! Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.
The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3,000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.
Appendix C: Helpful Documentation
Below are helpful links to documentation regarding the procedures used in the activity.
Appendix D: Recommended Learning
The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning course is recommended to help with this activity:
· Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Alternatively, the SAS Learning Subscription grants you access to an extensive library of SAS eLearning courses. Sign up for a free 30-day trial.