Framingham Heart Study: Descriptive Analysis
Purpose
This activity focuses on performing descriptive analyses to investigate relationships between incidence of coronary heart disease and several predictors of interest.
SAS Software
This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.
Industry Alignment
This activity aligns with the healthcare and life sciences industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.
Table of Contents
Framingham Heart Study: Descriptive Analysis
    Purpose
    SAS Software
    Industry Alignment
Activity Notes and Requirements
    Learning Objectives
    Estimated Completion Time
    Experience Level
    Prerequisite Knowledge
        Software
        Content Knowledge
    Additional Notes
Data Source
    Introduction
    Description of variables
Framingham Heart Study: Descriptive Analysis Activity
    Part 1: Data Setup
    Part 2: Descriptive Analysis of Relationships Between CHD Outcome and Predictors
        Section 1
        Section 2
        Section 3
Appendix
    Appendix A: Access Software
    Appendix B: Data Information
    Appendix C: Helpful Documentation
    Appendix D: Recommended Learning
Industry Applied Activity
Activity Notes and Requirements
Learning Objectives
This activity is designed to give students practice with descriptive analyses appropriate for relating a binary outcome to categorical and continuous predictors. The objectives are to:
· Produce logit plots relating a binary outcome to predictors.
· Interpret several logit plots to determine the strongest predictors of a binary outcome.
Estimated Completion Time
This activity will take students approximately 3 hours to complete.
Experience Level
To complete this activity students should have the following levels of experience:
· Intermediate skill in SAS programming
· Beginner to intermediate skill in statistics
Prerequisite Knowledge
Software
Students should have experience with the following:
· Foundations of programming with the SAS DATA step, including the use of functions.
· SAS procedures such as PROC SGPLOT, PROC MEANS, and PROC RANK.
Content Knowledge
Students should be familiar with the following concepts:
· Descriptive statistics such as mean, median, counts, and percentages
· Logit calculations
Additional Notes
This activity pairs well with the following activities in the Academic Hub:
· Framingham Heart Study: Data Preparation, Industry Applied Activity
· Framingham Heart Study: Statistical Analysis, Industry Applied Activity
Data Source
Introduction
This activity uses the MYHEART dataset, available in the Academic Hub where this activity was downloaded. This dataset was created in the Framingham Heart Study: Data Preparation, Industry Applied Activity, also found on the Academic Hub. The data have been modified from the landmark Framingham Heart Study (https://framinghamheartstudy.org/). The purpose of the Framingham Heart Study was to identify characteristics contributing to cardiovascular disease. For more information on this study, please refer to Appendix B: Data Information.
Description of variables
The variables contained in this dataset are:
| Variable | Description |
|---|---|
| Status | Alive or dead |
| DeathCause | Cause of death |
| AgeCHDdiag | Age at which CHD was diagnosed |
| Sex | Male or female |
| AgeAtStart | Age at the entry into the Framingham Heart Study |
| Height | Height in inches |
| Weight | Weight in pounds |
| Diastolic | Diastolic blood pressure |
| Systolic | Systolic blood pressure |
| MRW | Metropolitan Relative Weight |
| Smoking | Number of packs of cigarettes smoked per week |
| AgeatDeath | Age at death |
| Cholesterol | Total cholesterol |
| Chol_Status | Total cholesterol categorized into groups |
| BP_Status | Diastolic and systolic blood pressure categorized into groups |
| Weight_Status | Height and weight categorized into groups |
| Smoking_Status | Number of packs of cigarettes smoked per week categorized into groups |
| CHD | Diagnosis of coronary heart disease (1=Yes, 0=No) |
| Chol_StatusNew | Variable Chol_Status categorized into re-ordered groups |
| Sex_New | Variable Sex categorized into re-ordered groups |
| Weight_StatusNew | Variable Weight_Status categorized into re-ordered groups |
| Smoking_StatusNew | Variable Smoking_Status categorized into re-ordered groups |
Framingham Heart Study: Descriptive Analysis Activity
Part 1: Data Setup
Before beginning this activity, create a SAS library called HEARTLIB and store the MYHEART dataset in this library. If you have not previously completed the Framingham Heart Study: Data Preparation, Industry Applied Activity, make sure to familiarize yourself with the MYHEART dataset prior to attempting Part 2.
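As a minimal sketch of this setup step, the library can be assigned with a LIBNAME statement and the dataset inspected with PROC CONTENTS. The folder path below is a hypothetical placeholder, not a location given in this activity; replace it with the folder where you saved the MYHEART dataset.

```sas
/* Assumption: the path is a placeholder. Point it at the folder */
/* that contains the MYHEART dataset (myheart.sas7bdat).         */
libname heartlib "/home/your-user-id/framingham";

/* List the variables and their types to get familiar with the data. */
proc contents data=heartlib.myheart;
run;
```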
Part 2: Descriptive Analysis of Relationships Between CHD Outcome and Predictors
The Framingham Heart Study aimed to identify characteristics contributing to cardiovascular disease. Numerous potential predictors were collected as well as whether the patient was diagnosed with coronary heart disease. As a researcher it would be natural to start with descriptive analyses describing the relationships between the outcome and predictors of interest prior to performing formal modeling.
A binary outcome is usually related to its predictors in a nonlinear way, because the probability of the outcome is bounded between 0 and 1. For this reason, the binary outcome is often not summarized directly; instead, the log odds, or logit, of the outcome is used. This transformation often makes the relationship between the outcome and the predictors closer to linear.
The logit is computed as $\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$, where p is the probability of the outcome of interest. Begin your descriptive analysis by creating logit plots which relate the binary CHD outcome to the various predictors of interest.
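As a quick illustration of the logit, that is, the natural log of p divided by (1 - p), the values below use made-up probabilities rather than results from the MYHEART data. Probabilities below 0.5 give negative logits, a probability of 0.5 gives a logit of 0, and probabilities above 0.5 give positive logits.

```latex
\begin{align*}
p = 0.20 &: \quad \ln\left(\frac{0.20}{1 - 0.20}\right) = \ln(0.25) \approx -1.386 \\
p = 0.50 &: \quad \ln\left(\frac{0.50}{1 - 0.50}\right) = \ln(1) = 0 \\
p = 0.80 &: \quad \ln\left(\frac{0.80}{1 - 0.80}\right) = \ln(4) \approx 1.386
\end{align*}
```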
Section 1
Start by looking at the relationship between CHD and blood pressure through logit plots for CHD=1 by levels of BP_Status, Diastolic, and Systolic.
1. First, calculate p, the probability of CHD=1 in each level of BP_Status. Take advantage of the fact that the mean of a 0/1 variable is the proportion of observations equal to 1.
a. Use HEARTLIB.MYHEART as the input dataset.
b. Compute the mean of CHD for each level of BP_Status and output these to WORK.OUT1.
c. How many rows are in your dataset? Are there any extra rows in your output dataset that you did not anticipate?
d. What is the name of the variable that contains the probability of CHD=1 for each level of BP_Status?
2. Next, create a variable for the logit of CHD.
a. Use WORK.OUT1 as the input dataset.
b. Create a new output dataset named WORK.OUT2. This dataset should have one row for each level of BP_Status.
c. Create a variable called Logit which equals log((Mean CHD) / (1 – Mean CHD)) for each level of BP_Status.
d. To check your work, print WORK.OUT2.
3. Use the WORK.OUT2 dataset to create a logit plot.
a. Plot the Logit variable on the y-axis and BP_Status on the x-axis using a series plot.
b. Label the y-axis “Logit of Developing CHD” and the x-axis “Blood Pressure Status”.
c. Specify that the y-axis should take values from -2 to 0.25 with tick marks in increments of 0.25.
d. Using the graph you produced, what do you conclude about the relationship between CHD and each of the blood pressure groups?
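One possible way to approach questions 1 through 3 is sketched below. It is an illustrative outline under stated assumptions, not the official solution: the variable name CHD_Mean is a name chosen here, and the filter on the automatic _TYPE_ variable assumes PROC MEANS was run with a single CLASS variable and without the NWAY option.

```sas
/* Question 1: the mean of the 0/1 variable CHD is the proportion with CHD=1. */
/* Without NWAY, the output also holds an overall row (_TYPE_ = 0).           */
proc means data=heartlib.myheart noprint;
    class BP_Status;
    var CHD;
    output out=work.out1 mean=CHD_Mean;   /* CHD_Mean is an assumed name */
run;

/* Question 2: keep one row per BP_Status level and compute the logit. */
/* LOG is the natural logarithm. A level with CHD_Mean equal to 0 or 1 */
/* would produce a missing Logit.                                      */
data work.out2;
    set work.out1;
    where _TYPE_ = 1;
    Logit = log(CHD_Mean / (1 - CHD_Mean));
run;

proc print data=work.out2;
run;

/* Question 3: series plot of the logit against blood pressure status. */
proc sgplot data=work.out2;
    series x=BP_Status y=Logit / markers;
    xaxis label="Blood Pressure Status";
    yaxis label="Logit of Developing CHD" values=(-2 to 0.25 by 0.25);
run;
```

Omitting the name after MEAN= (that is, writing mean=;) would instead store the mean under the analysis variable's own name, CHD.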
Section 2
Produce a similar logit plot for CHD and each of the Diastolic and Systolic variables.
4. Since Diastolic and Systolic variables are continuous, the variables must first be binned to create logit plots.
a. Use HEARTLIB.MYHEART as the input dataset.
b. Create an output dataset named WORK.OUT3.
c. Rank each of the variables Diastolic and Systolic into 20 groups. The output dataset should contain two new variables Dia_Group and Sys_Group that contain the 20 groups for each of Diastolic and Systolic.
i. Do lower or higher values of Dia_Group and Sys_Group correspond to higher values of diastolic and systolic blood pressure?
d. To check your work:
i. Generate descriptive statistics for Diastolic for each level of Dia_Group.
ii. Generate descriptive statistics for Systolic for each level of Sys_Group.
iii. Is the number of observations in each level of Dia_Group and Sys_Group approximately equal? Do you see any surprises in the descriptive statistics generated?
5. Next, compute probabilities and logits for CHD=1 and create a logit plot for CHD and each of the Diastolic and Systolic variables using methods similar to those in questions 1 through 3.
a. Use WORK.OUT3 as the input dataset.
b. Compute the probability of CHD=1 in each of the ranked groups in Dia_Group using the steps followed in question 1.
i. Create a new output dataset named WORK.OUT4a.
ii. Create a variable named Dia_Mean which holds the mean of variable Diastolic in each Dia_Group. This is needed for the logit plot later.
c. Repeat the previous step for the variable Sys_Group, creating a new output dataset named WORK.OUT4b and a variable named Sys_Mean which holds the mean of variable Systolic in each Sys_Group.
d. Compute new variables Logit_Dia and Logit_Sys using the steps followed in question 2. Create new output datasets WORK.OUT5a and WORK.OUT5b.
e. Create logit plots for CHD versus diastolic blood pressure and CHD versus systolic blood pressure using the steps in question 3. The x-axis should be Dia_Mean or Sys_Mean, not Dia_Group or Sys_Group. Additionally, overlay a non-parametric loess curve onto these plots.
f. Using the graphs you produced, what do you conclude about the relationship between probability of CHD and levels of diastolic blood pressure and systolic blood pressure?
g. Are the relationships between CHD and diastolic blood pressure and between CHD and systolic blood pressure well described by a line? Should higher-order polynomial terms be used to describe either relationship?
6. Compare the graphs produced in questions 3 and 5. Do your conclusions about the relationship between CHD and blood pressure differ across the graphs? If so, how? Which of the categorical or continuous predictors shows the stronger relationship?
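The binning and plotting steps in questions 4 and 5 could be sketched as follows for the Diastolic variable; the Systolic version would follow the same pattern with Sys_Group, Sys_Mean, Logit_Sys, WORK.OUT4b, and WORK.OUT5b. This is a hedged outline rather than the official solution. The name CHD_Mean and the x-axis label are choices made here, and the NWAY option is used as a variation on question 1 so that only the per-group rows are written to the output dataset.

```sas
/* Question 4: bin both continuous variables into 20 rank groups. */
/* With GROUPS=20, PROC RANK assigns group values 0 through 19.   */
proc rank data=heartlib.myheart groups=20 out=work.out3;
    var Diastolic Systolic;
    ranks Dia_Group Sys_Group;
run;

/* Check the bins: size and range of Diastolic within each group. */
proc means data=work.out3 n min max mean;
    class Dia_Group;
    var Diastolic;
run;

/* Question 5b: proportion with CHD=1 and mean Diastolic per group. */
/* The names after MEAN= follow the order of the VAR statement.     */
proc means data=work.out3 noprint nway;
    class Dia_Group;
    var CHD Diastolic;
    output out=work.out4a mean=CHD_Mean Dia_Mean;
run;

/* Question 5d: logit of the proportion in each group. */
data work.out5a;
    set work.out4a;
    Logit_Dia = log(CHD_Mean / (1 - CHD_Mean));
run;

/* Question 5e: logit plot against the group means, with a loess overlay. */
proc sgplot data=work.out5a;
    series x=Dia_Mean y=Logit_Dia / markers;
    loess  x=Dia_Mean y=Logit_Dia / nomarkers;
    xaxis label="Mean Diastolic Blood Pressure";
    yaxis label="Logit of Developing CHD";
run;
```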
Section 3
Now explore descriptive relationships between CHD and all other predictors in the dataset in the same way as performed for CHD and blood pressure.
7. Repeat questions 1 through 6 for the following groups of variables:
a. Smoking_StatusNew and Smoking
b. Weight_StatusNew, MRW, Height, and Weight
c. Chol_StatusNew and Cholesterol
d. Sex (repeat only through question 3)
8. Using the graphs produced in questions 1 through 7, describe (in words) the relationship between CHD and each variable of interest.
a. Which variables look to be most strongly related to incidence of CHD?
b. Are the most strongly related variables categorical or continuous?
c. Are any variables not related to CHD?
d. Are the relationships between CHD and each continuous predictor well described by a line? Should higher-order polynomial terms be used to describe any of the relationships with continuous predictors?
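Because question 7 repeats the same steps for several variables, one optional approach is to wrap the categorical-predictor steps in a small macro. The sketch below is an assumption-laden convenience, not part of the activity's stated requirements: the macro name logit_cat, its parameters, the temporary dataset names WORK._P and WORK._L, and the axis label are all hypothetical names introduced here.

```sas
/* Hypothetical helper: logit plot of CHD for one categorical predictor. */
%macro logit_cat(var=, label=);
    /* Proportion with CHD=1 in each level; NWAY keeps only per-level rows. */
    proc means data=heartlib.myheart noprint nway;
        class &var;
        var CHD;
        output out=work._p mean=CHD_Mean;
    run;

    /* Logit of the proportion in each level. */
    data work._l;
        set work._p;
        Logit = log(CHD_Mean / (1 - CHD_Mean));
    run;

    /* Series plot of the logit across the levels of the predictor. */
    proc sgplot data=work._l;
        series x=&var y=Logit / markers;
        xaxis label="&label";
        yaxis label="Logit of Developing CHD";
    run;
%mend logit_cat;

/* Example call for one of the variables listed in question 7. */
%logit_cat(var=Smoking_StatusNew, label=Smoking Status);
```

The continuous predictors would need an additional PROC RANK step, as in question 4, before the same pattern applies.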
Appendix
Appendix A: Access Software
SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle, from data to discovery to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.
Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up.
Check out Frequently Asked Questions for more support.
Appendix B: Data Information
The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014 (almost 7 decades!). Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.
The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3,000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.
Appendix C: Helpful Documentation
Below are helpful links to documentation regarding the procedures used in the activity.
· ODS Graphics Procedures Guide
Appendix D: Recommended Learning
The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning courses and paths available are recommended to help with this activity:
· SAS Programming 1: Essentials
· SAS Programming 2: Data Manipulation Techniques
· Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression
Alternatively, the SAS Learning Subscription grants you access to an extensive library of SAS eLearning courses. Sign up for a free 30-day trial.