Framingham Heart Study: Statistical Analysis

Industry Applied Activity

Purpose

This activity focuses on fitting logistic regression models to determine whether the relationships between incidence of cardiovascular disease and predictors of interest are statistically significant.

SAS Software

This activity can be performed using any SAS programming environment, including SAS Studio in SAS OnDemand for Academics.

Industry Alignment

This activity aligns with the healthcare and life sciences industry. It uses data from a clinical study conducted to identify characteristics contributing to cardiovascular disease.

Table of Contents

Framingham Heart Study: Statistical Analysis
    Purpose
    SAS Software
    Industry Alignment
Activity Notes and Requirements
    Learning Objectives
    Estimated Completion Time
    Experience Level
    Prerequisite Knowledge
        Software
        Content Knowledge
    Additional Notes
Data Source
    Introduction
    Description of variables
Framingham Heart Study: Descriptive Analysis Activity
    Part 1: Data Setup
    Part 2: Statistical Analysis of Relationships Between CHD Outcome and Predictors
Appendix
    Appendix A: Access Software
    Appendix B: Data Information
    Appendix C: Helpful Documentation
    Appendix D: Recommended Learning

Activity Notes and Requirements

Learning Objectives

This activity is designed to practice statistical analyses appropriate for a binary outcome. The objectives are to:

· Fit a logistic regression model.

· Compute and interpret odds ratios and associated confidence intervals.

· Fit several logistic regression models using stepwise and backward model selection techniques and identify the best-fitting model.

Estimated Completion Time

This activity will take students approximately 3 hours to complete.

Experience Level

To complete this activity students should have the following levels of experience:

· Intermediate skill in SAS programming

· Intermediate skill in statistics

Prerequisite Knowledge

Software

Students should have experience with the following:

· SAS procedures such as PROC LOGISTIC that are appropriate for modeling data with a binary outcome.

Content Knowledge

Students should be familiar with the following concepts:

· Logistic regression models incorporating one or more predictors.

· Odds ratios and their interpretation.

· Model selection techniques such as forwards and backwards selection.

· Comparing models using the Akaike information criterion (AIC).

Additional Notes

This activity pairs well with the following activities in the Academic Hub:

· Framingham Heart Study: Data Preparation, Industry Applied Activity

· Framingham Heart Study: Descriptive Analysis, Industry Applied Activity

Data Source

Introduction

This activity uses the MYHEART dataset, which is available in the Academic Hub where this activity was downloaded. This dataset was created in the Framingham Heart Study: Data Preparation, Industry Applied Activity, also found on the Academic Hub. The data has been modified from the landmark Framingham Heart Study (https://framinghamheartstudy.org/). The purpose of the Framingham Heart Study was to identify characteristics contributing to cardiovascular disease. For more information on this study, refer to Appendix B: Data Information.

Description of variables

The variables contained in this dataset are:

Variable             Description
Status               Alive or dead
DeathCause           Cause of death
AgeCHDdiag           Age at which CHD was diagnosed
Sex                  Male or female
AgeAtStart           Age at the entry into the Framingham Heart Study
Height               Height in inches
Weight               Weight in pounds
Diastolic            Diastolic blood pressure
Systolic             Systolic blood pressure
MRW                  Metropolitan Relative Weight
Smoking              Number of packs of cigarettes smoked per week
AgeatDeath           Age at death
Cholesterol          Total cholesterol
Chol_Status          Total cholesterol categorized into groups
BP_Status            Diastolic and systolic blood pressure categorized into groups
Weight_Status        Height and weight categorized into groups
Smoking_Status       Number of packs of cigarettes smoked per week categorized into groups
CHD                  Diagnosis of coronary heart disease (1=Yes, 0=No)
Chol_StatusNew       Variable Chol_Status categorized into re-ordered groups
Sex_New              Variable Sex categorized into re-ordered groups
Weight_StatusNew     Variable Weight_Status categorized into re-ordered groups
Smoking_StatusNew    Variable Smoking_Status categorized into re-ordered groups

Framingham Heart Study: Descriptive Analysis Activity

Part 1: Data Setup

Before beginning this activity, create a SAS library called HEARTLIB and store the MYHEART dataset in this library. If you have not previously completed the Framingham Heart Study: Data Preparation, Industry Applied Activity, make sure to familiarize yourself with the MYHEART dataset prior to attempting Part 2.
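
The setup step could look like the sketch below. The folder path is a placeholder (an assumption, not something specified by this activity); point it at the location where you saved the MYHEART dataset. The PROC CONTENTS and PROC FREQ steps are optional checks that the library is assigned and that the outcome and categorical predictors are coded as described above.

```sas
/* Assumption: this path is a placeholder. Replace it with the folder */
/* that contains the MYHEART dataset (myheart.sas7bdat).              */
libname heartlib "/home/your-user-id/framingham";

/* List the variables and their attributes to confirm the library works. */
proc contents data=heartlib.myheart;
run;

/* Check the coding of the outcome and the categorical predictors. */
proc freq data=heartlib.myheart;
   tables CHD BP_Status Smoking_StatusNew Weight_StatusNew
          Chol_StatusNew Sex_New / missing;
run;
```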

Part 2: Statistical Analysis of Relationships Between CHD Outcome and Predictors

The Framingham Heart Study aimed to identify characteristics contributing to cardiovascular disease. Numerous potential predictors were collected as well as whether the patient was diagnosed with coronary heart disease. This activity focuses on fitting and comparing logistic regression models to determine whether the relationships between incidence of cardiovascular disease and predictors of interest are statistically significant. These statistical analyses are conducted in two stages. You will first fit univariate logistic regression models relating incidence of CHD to each predictor. Following these analyses, you will fit several multivariate logistic regression models to determine the best set of predictors which together model incidence of CHD.

1. Construct a univariate logistic regression model with CHD=1 as the outcome and one of several categorical variables as the predictor.

a. Use HEARTLIB.MYHEART as the input dataset.

b. Construct a univariate model for each of these predictors:

i. BP_Status

ii. Smoking_StatusNew

iii. Weight_StatusNew

iv. Chol_StatusNew

v. Sex_New

c. Note the AIC model fit statistic for each model.

d. Compute odds ratios and the associated 95% confidence intervals for the odds of CHD=1 for all pairwise comparisons of the levels of each predictor. For example, for BP_Status, compare High vs. Normal, High vs. Optimal, and Normal vs. Optimal. Designate CHD=0 as the reference level of the outcome so that the model describes the odds of CHD=1 relative to CHD=0, and each odds ratio compares those odds between two levels of the predictor.

e. After completing parts a-d, answer these questions:

i. Which model had the lowest AIC value, indicating the best fitting model?

ii. Which pairwise combinations of variable levels have odds ratios that show the levels are statistically different from each other in incidence of CHD?

iii. Which variable had the largest pairwise odds ratio comparing any two of its levels?

iv. What do you conclude about the univariate relationships between CHD and the categorical predictors of interest?
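
One possible way to organize the models in question 1 is sketched below; it is not the only valid approach. The macro name uni_class is made up for illustration. EVENT='1' makes CHD=1 the modeled event (so CHD=0 is the reference level), and the ODDSRATIO statement with DIFF=ALL requests every pairwise comparison of the predictor's levels with Wald confidence limits. The AIC appears in the Model Fit Statistics table of each run.

```sas
/* Hypothetical helper macro: fits one univariate model per call. */
%macro uni_class(pred);
   title "Univariate logistic model: CHD on &pred";
   proc logistic data=heartlib.myheart;
      class &pred / param=ref;               /* reference-cell coding     */
      model CHD(event='1') = &pred;          /* model the odds of CHD=1   */
      oddsratio &pred / diff=all cl=wald;    /* all pairwise odds ratios  */
   run;
   title;
%mend uni_class;

%uni_class(BP_Status)
%uni_class(Smoking_StatusNew)
%uni_class(Weight_StatusNew)
%uni_class(Chol_StatusNew)
%uni_class(Sex_New)
```

A pairwise comparison is statistically significant at the 0.05 level when its 95% confidence interval excludes 1.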

2. Construct a univariate logistic regression model with CHD=1 as the outcome and one of several continuous variables as the predictor.

a. Use HEARTLIB.MYHEART as the input dataset.

b. Construct a univariate model for each of these predictors:

i. Diastolic

ii. Systolic

iii. Smoking

iv. MRW

v. Cholesterol

c. Previous descriptive analyses show a quadratic relationship between the logit of CHD and each of Diastolic, Systolic, MRW, and Cholesterol. Given this, include linear and quadratic terms in the models including the predictors Diastolic, Systolic, MRW, and Cholesterol. Include only a linear term in the model including the predictor Smoking.

d. State the AIC model fit statistic for each model.

e. Compute odds ratios and the associated 95% confidence intervals for the unit increases and starting values listed below. Because the odds ratio in a model with a quadratic term depends on the starting value of the predictor, each odds ratio is evaluated at the stated value. Designate CHD=0 as the reference level of the outcome so that the model describes the odds of CHD=1 relative to CHD=0.

i. A 10-unit increase in Diastolic, starting at Diastolic = 80, 90, and 100.

ii. A 10-unit increase in Systolic, starting at Systolic = 120, 130, and 140.

iii. A 5-unit increase in Smoking, starting at Smoking = 0.

iv. A 25-unit increase in MRW, starting at MRW = 100 and 150.

v. A 50-unit increase in Cholesterol, starting at Cholesterol = 150 and 200.

f. After completing parts a-e, answer these questions:

i. Which model had the lowest AIC value, indicating the best fitting model?

ii. Which computed odds ratios showed statistically significant relationships?

iii. What do you conclude about the univariate relationships between CHD and the continuous predictors of interest?

g. After completing question 1, parts a-d, and question 2, parts a-e, answer these questions:

i. Considering models fit in both questions 1 and 2, compare the AIC for the model for each continuous predictor generated in question 2 to the model for its associated categorical predictor(s) generated in question 1. For example, compare the AICs from the models for BP_Status, Diastolic, and Systolic. Which model has the lowest AIC? In general, do models with continuous or categorical predictors fit better?

ii. Over all models fit in questions 1 and 2, which model and associated predictor had the lowest AIC?
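
For a model with linear and quadratic terms, logit = b0 + b1*x + b2*x*x, the odds ratio for a c-unit increase starting at x is exp(b1*c + b2*((x+c)**2 - x**2)), which is why the starting value matters. The sketch below shows one way the Diastolic and Smoking models in question 2 might be set up; the other predictors would follow the same pattern. It assumes that the ODDSRATIO statement, combined with AT= and the UNITS statement, evaluates the odds ratio at the listed starting values when the predictor appears with its own squared term. Check the output labels or verify by hand with the formula above and the parameter estimates.

```sas
/* Quadratic model for Diastolic: 10-unit increase at 80, 90, and 100. */
proc logistic data=heartlib.myheart;
   model CHD(event='1') = Diastolic Diastolic*Diastolic;
   units Diastolic=10;                        /* size of the increase   */
   oddsratio Diastolic / at(Diastolic=80 90 100) cl=wald;
run;

/* Linear model for Smoking: 5-unit increase. With only a linear term, */
/* the odds ratio does not depend on the starting value.                */
proc logistic data=heartlib.myheart;
   model CHD(event='1') = Smoking;
   units Smoking=5;
   oddsratio Smoking / cl=wald;
run;
```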

3. Perform stepwise model selection to fit a multivariate logistic regression model with the best set of predictors for modeling the incidence of CHD.

a. Use HEARTLIB.MYHEART as the input dataset.

b. Include in the set of predictor variables the variable Sex_New as well as all continuous predictors examined in question 2.

c. Make sure to include the quadratic predictors included in models for question 2.

d. Use significance level as the criterion to determine which variables enter or leave the model. Use 0.05 as the significance level to enter the model and 0.05 as the significance level to stay in the model.

e. Make sure to note the AIC of the final model.

f. Which predictors are included in the final model?
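
A minimal sketch of the stepwise selection in question 3, assuming the same outcome coding as in the earlier questions. SELECTION=STEPWISE with SLENTRY= and SLSTAY= sets the entry and stay significance levels; the candidate effects are Sex_New plus the linear and quadratic terms from question 2.

```sas
proc logistic data=heartlib.myheart;
   class Sex_New / param=ref;
   model CHD(event='1') = Sex_New
                          Diastolic   Diastolic*Diastolic
                          Systolic    Systolic*Systolic
                          Smoking
                          MRW         MRW*MRW
                          Cholesterol Cholesterol*Cholesterol
         / selection=stepwise slentry=0.05 slstay=0.05;
run;
```

The Summary of Stepwise Selection table lists the effects in the order they entered or left, and the Model Fit Statistics table for the last step gives the AIC of the final model. PROC LOGISTIC also applies a model hierarchy rule during selection (see the HIERARCHY= option), so a squared term is normally considered only when its linear term is in the model.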

4. Repeat question 3 using backwards model selection.

a. Which predictors are included in the final model when using backwards model selection? Are these different from the predictors selected via stepwise model selection?

b. Compare the AIC values of the final models generated in questions 3 and 4. Which one has the lower AIC? How do these AIC values compare to the AIC values of the models generated in questions 1 and 2?

c. Overall, what do you conclude from the best models obtained in questions 3 and 4 about the most important predictors of CHD?
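
Question 4 can be sketched the same way, under the same assumptions, by switching the selection method. Backward selection starts from the full model and removes effects, so only the stay level (SLSTAY=) applies.

```sas
proc logistic data=heartlib.myheart;
   class Sex_New / param=ref;
   model CHD(event='1') = Sex_New
                          Diastolic   Diastolic*Diastolic
                          Systolic    Systolic*Systolic
                          Smoking
                          MRW         MRW*MRW
                          Cholesterol Cholesterol*Cholesterol
         / selection=backward slstay=0.05;
run;
```

When comparing AIC values across questions 1 through 4, check the Number of Observations Used in each run: AIC values are directly comparable only between models fit to the same observations, and missing predictor values can change which observations are used.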

Congratulations, you have completed a statistical analysis of the predictors of coronary heart disease! Additional analyses of interest in the actual Framingham Heart Study include a survival analysis of predictors of time to coronary heart disease as well as longitudinal analyses of data collected over the 32 biennial exams in the study.

Appendix

Appendix A: Access Software

SAS OnDemand for Academics (ODA) is a free, full suite of cloud-based software that supports the analytics life cycle, from data to discovery to deployment. Students can use SAS OnDemand for Academics to get access to SAS Studio for free. Click here to access ODA.

Note: You need to have an established SAS profile linked to an academic affiliation. If you don't have a SAS Profile, click here to set one up.

Check out Frequently Asked Questions for more support.

Appendix B: Data Information

The original cohort of the Framingham Heart Study consisted of 5,209 men and women between the ages of 28 and 62 living in Framingham, Massachusetts. The first visit of data collection for participants in this cohort occurred between 1948 and 1953, and participants were assessed every two years thereafter through April 2014 (almost 7 decades!). Important links between cardiovascular disease and high blood pressure, high cholesterol levels, cigarette smoking, and many other health factors were first established using its data.

The complete Framingham Heart Study data consists of hundreds of datasets taken over time at 32 biennial exams and has led to over 3000 (wow!) published journal articles. To simplify analyses for illustrative purposes, the SASHELP.HEART dataset includes a snapshot of selected primary study variables taken at one of the biennial exams.

Appendix C: Helpful Documentation

Below are helpful links to documentation regarding the procedures used in the activity.

· The LOGISTIC Procedure

· SAS/STAT User’s Guide

Appendix D: Recommended Learning

The SAS Global Academic Program offers free e-learning courses for students to learn SAS through the Student Skill Builder. The following e-learning course is recommended to help with this activity:

· Statistics 1: Introduction to ANOVA, Regression, and Logistic Regression

Alternatively, the SAS Learning Subscription grants you access to an extensive library of SAS eLearning courses. Sign up for a free 30-day trial.
