Framingham Heart Study: Descriptive Analysis

Purpose

This activity focuses on performing descriptive analyses to investigate relationships between incidence of coronary heart disease and several predictors of interest.

SAS Software

This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.

Industry Alignment

This activity aligns with the healthcare and life sciences industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.

Table of Contents

Framingham Heart Study: Descriptive Analysis
    Purpose
    SAS Software
    Industry Alignment
Activity Notes and Requirements
    Learning Objectives
    Estimated Completion Time
    Experience Level
    Prerequisite Knowledge
        Software
        Content Knowledge
    Additional Notes
Data Source
    Introduction
    Description of variables
Framingham Heart Study: Descriptive Analysis Activity
    Part 1: Data Setup
    Part 2: Descriptive Analysis of Relationships Between CHD Outcome and Predictors
        Section 1
        Section 2
        Section 3
Appendix
    Appendix A: Access Software
    Appendix B: Data Information
    Appendix C: Helpful Documentation
    Appendix D: Recommended Learning

Industry Applied Activity

Activity Notes and Requirements

Learning Objectives

This activity provides practice with descriptive analyses appropriate for relating a binary outcome to categorical and continuous predictors. The objectives are to:

· Produce logit plots relating a binary outcome to predictors.

· Interpret several logit plots to determine the strongest predictors of a binary outcome.

Estimated Completion Time

This activity will take students approximately 3 hours to complete.

Experience Level

To complete this activity students should have the following levels of experience:

· Intermediate skill in SAS programming

· Beginner to intermediate skill in statistics

Prerequisite Knowledge

Software

Students should have experience with the following:

· Foundations of programming with the SAS DATA step, including the use of functions.

· SAS procedures such as PROC SGPLOT, PROC MEANS, and PROC RANK.

Content Knowledge

Students should be familiar with the following concepts:

· Descriptive statistics such as mean, median, counts, and percentages

· Logit calculations

Additional Notes

This activity pairs well with the following activities in the Academic Hub:

· Framingham Heart Study: Data Preparation, Industry Applied Activity

· Framingham Heart Study: Statistical Analysis, Industry Applied Activity

Data Source

Introduction

This activity uses the MYHEART dataset, available in the Academic Hub where this activity was downloaded. The dataset was created in the Framingham Heart Study: Data Preparation, Industry Applied Activity, also found on the Academic Hub. The data have been modified from the landmark Framingham Heart Study (https://framinghamheartstudy.org/), whose purpose was to identify characteristics contributing to cardiovascular disease. For more information on this study, refer to Appendix B: Data Information.

Description of variables

The variables contained in this dataset are:

Variable             Description
Status               Alive or dead
DeathCause           Cause of death
AgeCHDdiag           Age at which CHD was diagnosed
Sex                  Male or female
AgeAtStart           Age at the entry into the Framingham Heart Study
Height               Height in inches
Weight               Weight in pounds
Diastolic            Diastolic blood pressure
Systolic             Systolic blood pressure
MRW                  Metropolitan Relative Weight
Smoking              Number of packs of cigarettes smoked per week
AgeatDeath           Age at death
Cholesterol          Total cholesterol
Chol_Status          Total cholesterol categorized into groups
BP_Status            Diastolic and systolic blood pressure categorized into groups
Weight_Status        Height and weight categorized into groups
Smoking_Status       Number of packs of cigarettes smoked per week categorized into groups
CHD                  Diagnosis of coronary heart disease (1=Yes, 0=No)
Chol_StatusNew       Variable Chol_Status categorized into re-ordered groups
Sex_New              Variable Sex categorized into re-ordered groups
Weight_StatusNew     Variable Weight_Status categorized into re-ordered groups
Smoking_StatusNew    Variable Smoking_Status categorized into re-ordered groups

Framingham Heart Study: Descriptive Analysis Activity

Part 1: Data Setup

Before beginning this activity, create a SAS library called HEARTLIB and store the MYHEART dataset in this library. If you have not previously completed the Framingham Heart Study: Data Preparation, Industry Applied Activity, make sure to familiarize yourself with the MYHEART dataset prior to attempting Part 2.

Part 2: Descriptive Analysis of Relationships Between CHD Outcome and Predictors

The Framingham Heart Study aimed to identify characteristics contributing to cardiovascular disease. Numerous potential predictors were collected, along with whether each patient was diagnosed with coronary heart disease. Before performing formal modeling, a researcher would naturally start with descriptive analyses of the relationships between the outcome and the predictors of interest.

Relating a binary outcome directly to predictors usually results in a nonlinear relationship, because the probability of the outcome is bounded between 0 and 1. For this reason, the binary outcome is often not summarized directly; instead, the log odds, or logit, of the outcome is used. This transformation often makes the relationship between the outcome and the predictors closer to linear.

The logit is computed as logit(p) = log(p / (1 - p)), where p is the probability of the outcome of interest and log is the natural logarithm. Begin your descriptive analysis by creating logit plots that relate the binary CHD outcome to the various predictors of interest.
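As a quick numerical illustration of the logit, log(p / (1 - p)), here is a minimal sketch in Python. The activity itself is meant to be completed in SAS; Python is used here only to show the arithmetic. The function name logit and the example probabilities are assumptions made for illustration and do not come from the MYHEART data.

```python
import math

def logit(p: float) -> float:
    """Return the log odds log(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is undefined for p <= 0 or p >= 1")
    return math.log(p / (1.0 - p))

# Print the odds and logit for a few illustrative probabilities.
for p in (0.10, 0.25, 0.50, 0.75):
    print(f"p = {p:.2f}  odds = {p / (1 - p):.3f}  logit = {logit(p):+.3f}")
```

A probability of 0.5 corresponds to a logit of 0, and probabilities below 0.5 give negative logits. The logit is undefined when p is exactly 0 or 1, which can occur when a group contains only non-events or only events.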

Section 1

Start by looking at the relationship between CHD and blood pressure through logit plots for CHD=1 by levels of BP_Status, Diastolic, and Systolic.

1. First, calculate p, the probability of CHD=1 in each level of BP_Status. Take advantage of the fact that the mean of a 0/1 variable is the probability or percentage of observations equaling 1.

a. Use HEARTLIB.MYHEART as the input dataset.

b. Compute the mean of CHD for each level of BP_Status and output these to WORK.OUT1.

c. How many rows are in your dataset? Are there any extra rows in your output dataset that you did not anticipate?

d. What is the name of the variable that contains the probability of CHD=1 for each level of BP_Status?

2. Next, create a variable for the logit of CHD.

a. Use WORK.OUT1 as the input dataset.

b. Create a new output dataset named WORK.OUT2. This dataset should have one row for each level of BP_Status.

c. Create a variable called Logit which equals log(Mean CHD / (1 - Mean CHD)) for each level of BP_Status.

d. To check your work, print WORK.OUT2.

3. Use the WORK.OUT2 dataset to create a logit plot.

a. Plot the Logit variable on the y-axis and BP_Status on the x-axis using a series plot.

b. Label the y-axis “Logit of Developing CHD” and the x-axis “Blood Pressure Status”.

c. Specify that the y-axis should take values from -2 to 0.25 with tick marks in increments of 0.25.

d. Using the graph you produced, what do you conclude about the relationship between CHD and each of the blood pressure groups?
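The idea behind questions 1 and 2 (the mean of a 0/1 variable within a group is the proportion of 1s, and the logit is then computed from that proportion) can be sketched on a toy example. This is an illustrative Python sketch, not the SAS code the activity asks for; the records, the level names, and the helper mean_by_group are all made up for illustration.

```python
import math
from collections import defaultdict

# Made-up (BP_Status, CHD) records; level names and values are illustrative only.
records = [
    ("High", 1), ("High", 0), ("High", 1), ("High", 0),
    ("Normal", 0), ("Normal", 1), ("Normal", 0), ("Normal", 0),
    ("Optimal", 0), ("Optimal", 0), ("Optimal", 0), ("Optimal", 1), ("Optimal", 0),
]

def mean_by_group(rows):
    """Return {group: mean of the 0/1 outcome}, i.e. the proportion of 1s in each group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of outcomes, row count]
    for group, outcome in rows:
        totals[group][0] += outcome
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

chd_mean = mean_by_group(records)

# Logit of each group proportion; every proportion here lies strictly between 0 and 1.
chd_logit = {level: math.log(p / (1 - p)) for level, p in chd_mean.items()}

for level in chd_mean:
    print(f"{level:8s} P(CHD=1) = {chd_mean[level]:.2f}  logit = {chd_logit[level]:+.3f}")
```

One row per group, holding the group label and its logit, is the shape of data a logit plot needs.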

Section 2

Produce a similar logit plot for CHD and each of the Diastolic and Systolic variables.

4. Since Diastolic and Systolic variables are continuous, the variables must first be binned to create logit plots.

a. Use HEARTLIB.MYHEART as the input dataset.

b. Create an output dataset named WORK.OUT3.

c. Rank each of the variables Diastolic and Systolic into 20 groups. The output dataset should contain two new variables Dia_Group and Sys_Group that contain the 20 groups for each of Diastolic and Systolic.

i. Do lower or higher values of Dia_Group and Sys_Group correspond to higher values of diastolic and systolic blood pressure?

d. To check your work:

i. Generate descriptive statistics for Diastolic for each level of Dia_Group.

ii. Generate descriptive statistics for Systolic for each level of Sys_Group.

iii. Is the number of observations in each level of Dia_Group and Sys_Group approximately equal? Do you see any surprises in the descriptive statistics generated?

5. Next, compute probabilities and logits for CHD=1 and create a logit plot for CHD and each of the Diastolic and Systolic variables using methods similar to those in questions 1 through 3.

a. Use WORK.OUT3 as the input dataset.

b. Compute the probability of CHD=1 in each of the ranked groups in Dia_Group using the steps followed in question 1.

i. Create a new output dataset named WORK.OUT4a.

ii. Create a variable named Dia_Mean which holds the mean of variable Diastolic in each Dia_Group. This is needed for the logit plot later.

c. Repeat the previous step for the variable Sys_Group: create a new output dataset named WORK.OUT4b with a variable named Sys_Mean that holds the mean of variable Systolic in each Sys_Group.

d. Compute new variables Logit_Dia and Logit_Sys using the steps followed in question 2. Create new output datasets WORK.OUT5a and WORK.OUT5b.

e. Create logit plots for CHD versus diastolic blood pressure and CHD versus systolic blood pressure using the steps in question 3. The x-axis should be Dia_Mean or Sys_Mean, not Dia_Group or Sys_Group. Additionally, overlay a nonparametric loess curve on each of these plots.

f. Using the graphs you produced, what do you conclude about the relationship between probability of CHD and levels of diastolic blood pressure and systolic blood pressure?

g. Are the relationships between CHD and diastolic blood pressure and between CHD and systolic blood pressure well described by a line? Should higher-order polynomial terms be used to describe either relationship?

6. Compare the graphs produced in questions 3 and 5. Do your conclusions about the relationship between CHD and blood pressure differ across the graphs? If so, how? Which shows the stronger relationship: the categorical predictor or the continuous predictors?
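The binning approach in questions 4 and 5 can be sketched end to end on toy data: group a continuous predictor by rank, then compute the mean of the predictor and the logit of the outcome proportion within each group. This is an illustrative Python sketch, not the SAS code the activity asks for. The grouping rule used here, floor(rank * k / (n + 1)) with tied values sharing their mean rank, is an assumption based on the rule described in the RANK procedure documentation and should be checked there. The data values and the helper names rank_groups and logit_points are made up.

```python
import math
from collections import defaultdict

def rank_groups(values, k):
    """Assign each value to one of k rank-based groups numbered from 0.

    Assumed rule: floor(rank * k / (n + 1)), where tied values share the
    mean of the 1-based ranks they occupy.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end + 2) / 2  # mean of 1-based ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return [math.floor(r * k / (n + 1)) for r in ranks]

def logit_points(groups, xs, ys):
    """Return (mean of x, logit of mean of y) for each group, in group order."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # group -> [sum of x, sum of y, count]
    for g, x, y in zip(groups, xs, ys):
        acc[g][0] += x
        acc[g][1] += y
        acc[g][2] += 1
    points = []
    for g in sorted(acc):
        sum_x, sum_y, count = acc[g]
        p = sum_y / count
        if 0 < p < 1:  # skip groups that are all 0s or all 1s: the logit is undefined
            points.append((sum_x / count, math.log(p / (1 - p))))
    return points

# Made-up diastolic readings (with ties at 80) and CHD indicators.
diastolic = [70, 72, 75, 80, 80, 80, 80, 80, 85, 90, 95, 100]
chd = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]

groups = rank_groups(diastolic, 4)
points = logit_points(groups, diastolic, chd)
for x_mean, lg in points:
    print(f"mean diastolic = {x_mean:6.2f}  logit = {lg:+.3f}")
```

Each point pairs a group's mean predictor value with its logit, which is why the x-axis of the plot uses the group mean rather than the group number. In this toy data, tied values give groups of unequal size, and a group whose outcomes are all 1 has no defined logit and is dropped.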

Section 3

Now explore descriptive relationships between CHD and all other predictors in the dataset in the same way as performed for CHD and blood pressure.

7. Repeat questions 1 through 6 for the following groups of variables:

a. Smoking_StatusNew and Smoking

b. Weight_StatusNew, MRW, Height, and Weight

c. Chol_StatusNew and Cholesterol

d. Sex (repeat only through question 3)

8. Using the graphs produced in questions 1 through 7, describe (in words) the relationship between CHD and each variable of interest.

a. Which variables look to be most strongly related to incidence of CHD?

b. Are the most strongly related variables categorical or continuous?

c. Are any variables not related to CHD?

d. Are the relationships between CHD and each continuous predictor well described by a line? Should higher-order polynomial terms be used to describe any of the relationships with continuous predictors?

Appendix

Appendix A: Access Software

SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle, from data to discovery to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.

Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up.

Check out Frequently Asked Questions for more support.

Appendix B: Data Information

The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014, a span of almost seven decades! Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.

The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.

Appendix C: Helpful Documentation

Below are helpful links to documentation regarding the procedures used in the activity.

· The MEANS procedure

· The RANK procedure

· Base SAS Procedures Guide

· The SGPLOT procedure

· ODS Graphics Procedures Guide

Appendix D: Recommended Learning

The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning courses and paths are recommended to help with this activity:

· SAS Programming 1: Essentials

· SAS Programming 2: Data Manipulation Techniques

· Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression

Alternatively, the SAS Learning Subscription grants you access to an extensive library of SAS eLearning courses. Sign up for a free 30-day trial.
