Biostatistical Method I: Linear Regression and ANOVA Assignment 3

Note: Please submit your assignment via Blackboard by the due date. You do not need to submit your SAS log or SAS output. Instructions for submitting the assignment are given in the course syllabus.

Assignment Format (2 points)

• Your assignment consists of two files: a SAS file and a Word document. Name each file with your full name and the assignment number, for example: John Doe assign1.sas and John Doe assign1.doc(x).

• Please write your full name at the top of each file.

• Write your answers precisely. Lengthy answers that do not get to the point will lose points.

• Your write-up should consist of complete sentences in paragraphs, not in bullet format.

• Please do not paste any part of your SAS output into your write-up.

If you use SAS Studio or SAS Enterprise Guide, you can access the data with the following LIBNAME statement: libname mydata "/courses/d93a3ee5ba27fe300/Biostatistics I/";

Problem 1 (10 points)

You will use SICKLE.SAS7BDAT for this problem.

• LEVEL: The outcome variable, which is haemoglobin level

• TYPE: The type of sickle cell disease, which takes one of three values: HB SS, HB ST, and HB SC

1. Use the Kruskal-Wallis test to test for differences in median haemoglobin levels among patients with different types of sickle cell disease. Make sure to report the median haemoglobin level for each type of sickle cell disease.

2. Repeat the analysis using the method in the section "A General Approach to Nonparametric Analysis". That is, use the RANK and GLM procedures to complete this question.
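As a rough illustration of the two parts above, the following sketch shows one possible arrangement of the procedures. It assumes the data set is reachable as mydata.sickle through the LIBNAME statement given earlier, and that the variables are named LEVEL and TYPE as described. The names sickle_ranked and level_rank are hypothetical, chosen only for this example.

```sas
/* Assumption: the library path is the one given in the handout. */
libname mydata "/courses/d93a3ee5ba27fe300/Biostatistics I/";

/* Part 1a: median haemoglobin level for each disease type. */
proc means data=mydata.sickle n median;
  class type;
  var level;
run;

/* Part 1b: Kruskal-Wallis test (requested with the WILCOXON option). */
proc npar1way data=mydata.sickle wilcoxon;
  class type;
  var level;
run;

/* Part 2a: replace the outcome with its ranks.
   sickle_ranked and level_rank are hypothetical names. */
proc rank data=mydata.sickle out=sickle_ranked;
  var level;
  ranks level_rank;
run;

/* Part 2b: one-way ANOVA on the ranks. */
proc glm data=sickle_ranked;
  class type;
  model level_rank = type;
run;
quit;
```

The F test from PROC GLM on the ranks and the Kruskal-Wallis chi-square usually lead to similar conclusions, although the two statistics are not identical.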

Problem 2 (10 points)

You will use PSYCHO.SAS7BDAT for this problem. A psychiatrist wants to know whether the level of pathology (Y) in psychotic patients 6 months after treatment can be predicted with reasonable accuracy from pre-treatment symptom ratings of thinking disturbance (X1) and hostile suspiciousness (X2). Based on this data set, identify the observations that are potential outliers and/or influential points (use log-transformed Y as the outcome).

To identify the observations that influence the model, you can follow the example in Program 5.4, shown below:

proc reg data=chol2;
  model logtg = bmi female hdl age / influence r;
  id id;
  * Save the diagnostic statistics and the observation counts as data sets;
  ods output OutputStatistics=OutputStatistics nobs=nobs;
run;

data _null_;
  set nobs;
  call symputx('N', nobsused); *N is the number of observations;
run;

%let k=4; *k is number of variables in the model;

data diagnostics;
  set OutputStatistics;
  * Each flag is 1 when the rule of thumb is violated and 0 otherwise;
  jackknife_flag = abs(RStudent) > 2;
  leverage_flag = HatDiagonal > 2*&k/&N;
  CooksD_flag = CooksD > 4/&N;
  DFFITS_flag = abs(DFFITS) > 2*sqrt(&k/&N);
  DFB_Intercept_flag = abs(DFB_Intercept) > 2/sqrt(&N);
  DFB_bmi_flag = abs(DFB_bmi) > 2/sqrt(&N);
  DFB_female_flag = abs(DFB_female) > 2/sqrt(&N);
  DFB_hdl_flag = abs(DFB_hdl) > 2/sqrt(&N);
  DFB_age_flag = abs(DFB_age) > 2/sqrt(&N);
  * Count how many rules of thumb each observation violates;
  flag_sum = sum(jackknife_flag, leverage_flag, CooksD_flag, DFFITS_flag,
                 DFB_Intercept_flag, DFB_bmi_flag, DFB_female_flag,
                 DFB_hdl_flag, DFB_age_flag);
run;

proc print data=diagnostics;
  where flag_sum >= 5;
run;

You need to change several parts of this program for this problem, starting with the data set and variable names. Program 5.4 has four predictors, but this problem has only two. In the ID statement, use the variable PATIENT. The macro variable K specifies the number of variables in the model; change its value to 2.

The FLAG_SUM variable in Program 5.4 counts how many rules of thumb for identifying outliers and/or influential points each observation violates. PROC PRINT then prints the observations that violate 5 or more rules of thumb.
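For reference, the rules of thumb coded in Program 5.4 can be written out as follows. The cutoffs are transcribed from the program itself; the symbols h_ii (hat diagonal), D_i (Cook's distance), and the subscripts i (observation) and j (coefficient) are notation introduced here only for readability. As in the program, k is the number of variables in the model and N is the number of observations used.

```latex
\[
\begin{aligned}
\text{Jackknife residual:}\quad & |\mathrm{RStudent}_i| > 2 \\
\text{Leverage:}\quad & h_{ii} > \frac{2k}{N} \\
\text{Cook's distance:}\quad & D_i > \frac{4}{N} \\
\text{DFFITS:}\quad & |\mathrm{DFFITS}_i| > 2\sqrt{\frac{k}{N}} \\
\text{DFBETAS, each coefficient } j\text{:}\quad & |\mathrm{DFBETAS}_{ij}| > \frac{2}{\sqrt{N}}
\end{aligned}
\]
```

With four predictors plus the intercept, Program 5.4 checks nine flags per observation. With two predictors plus the intercept, the adapted program would check seven.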

For this problem, identify which patients violate 5 or more rules of thumb (in a real-life situation, you might want to use a different cutoff). Next, remove these patients from the data set and re-run the regression. Report how significantly each predictor is related to the outcome after removing these observations. Should you be worried about any of the observations identified by these diagnostic statistics? Please make sure to use log(Y) in the model.

Note: to remove patients, you can use the WHERE statement in PROC REG. For example, to remove patients 1, 5, and 7, include the following statement in the procedure:

where patient not in (1, 5, 7);
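A minimal sketch of where the WHERE statement sits in the re-run is shown below. It assumes the data set is reachable as mydata.psycho and that its variables are named Y, X1, X2, and PATIENT as described in the problem. The names psycho_log and logy are hypothetical, and patients 1, 5, and 7 are only the placeholder values from the note above, not the patients actually flagged by the diagnostics.

```sas
/* Assumption: the library path is the one given in the handout. */
libname mydata "/courses/d93a3ee5ba27fe300/Biostatistics I/";

/* Create the log-transformed outcome.
   psycho_log and logy are hypothetical names. */
data psycho_log;
  set mydata.psycho;
  logy = log(y);
run;

/* Re-run the regression without the listed patients.
   Replace (1, 5, 7) with the patients you identified. */
proc reg data=psycho_log;
  where patient not in (1, 5, 7);
  model logy = x1 x2;
  id patient;
run;
quit;
```

The WHERE statement filters observations before the model is fitted, so the original data set is left unchanged.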

Problem 3 (8 points)

Use the same data as in Problem 2 and re-run the regression using the method in the section "A General Approach to Nonparametric Analysis". That is, use the RANK and REG procedures to complete this question. How significantly is each predictor related to the outcome under this approach? Note: do not remove any observations, and use the original Y value instead of log(Y) for the analysis.
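One possible arrangement of the two procedures is sketched below. It assumes the data set is reachable as mydata.psycho with variables Y, X1, X2, and PATIENT. It also assumes that only the outcome is replaced by its ranks; if the referenced section ranks the predictors as well, list them in the VAR and RANKS statements too. The names psycho_ranked and y_rank are hypothetical.

```sas
/* Assumption: the library path is the one given in the handout. */
libname mydata "/courses/d93a3ee5ba27fe300/Biostatistics I/";

/* Replace the original outcome Y with its ranks.
   psycho_ranked and y_rank are hypothetical names.
   Assumption: only the outcome is ranked. */
proc rank data=mydata.psycho out=psycho_ranked;
  var y;
  ranks y_rank;
run;

/* Regress the ranked outcome on the two predictors,
   keeping all observations. */
proc reg data=psycho_ranked;
  model y_rank = x1 x2;
  id patient;
run;
quit;
```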
