Statistics (using R-studio)

project.pdf

Home >Mathematics homework help >Statistics homework help >Statistics (using R-studio)

Statistics 108

Project

Due : 12:00 PM Tuesday, June 05, in MSB 4117

Diabetes data. This data consist of 19 variables on 403 subjects from 1046 subjects who were interviewed in a study to understand the prevalence of obesity, diabetes, and other car- diovascular risk factors in central Virginia for African Americans. We will consider building regression models with glyhb as the response variable as Glycosolated Hemoglobin greater than 70 is often taken as a positive diagnostics of diabetes. Use the data set diabetes.txt on the Canvas directory project. The directory also contains a description of the data and variables (DiabetesExplanation.pdf).

Please read the file discussion.pdf on directory project first before answering questions. Separate all the code from the main text and attach it in an appendix.

Processing and exploration of the data.

1. Read in this data in R. Replace the missing values in the variable frame, indicated by an empty string ‘’, by ‘NA’. Also, use function droplevels() to take away the old class ‘’.

2. Which of the variables glyhb, ratio, bp.1s, age, gender and frame are quantitative variables? Which are qualitative variables? Draw histogram for each quantitative vari- able and comment on its distribution. Draw pie chart for each qualitative variable and comment on how its classes are distributed.

3. Regress glyhb on ratio, bp.1s, age, gender and frame (Model 1). Obtain the residuals vs. fitted values plot and the residuals Q-Q plot. Do model assumptions hold?

4. Transformations of the response variable are often used to attempt to satisfy normality assumption. Power transformation y∗ = yλ is often used in practice. The optimal power λ of this transformation can be estimated using function boxcox() in library MASS:

library(’MASS’)

bc = boxcox(fit1) ## fit1 is the output of lm() function in part 3.

lambda = bc$x[which.max(bc$y)]

5. Create a new data set trans from the original data set by replacing glyhb with glyhb* = glyhbλ and regress glyhb* on ratio, bp.1s, age, gender and frame using trans (Model 2). Obtain the residuals vs. fitted values plot and the residuals Q-Q plot. Do model assumptions hold now?

6. Draw side-by-side box plots to show how glyhb is distributed in male and female, and how it is distributed in the three frame classes. Do the same thing for glyhb*. What do you observe?

Building regression models. From now on, we use trans.

7. From Model 2 fitting summary report, you find out that several X variables are far from being significant. Identify the least significant (i.e., with largest pvalue) X variable and fit a regression model by dropping it from Model 2 (call this Model 3).

8. From Model 3 fitting summary report, you find out that there are still some X variables that are far from being significant. Identify the least significant X variable and fit a regression model by dropping it from Model 3 (Model 4).

9. It turns out that Model 4 contains ratio, age and frame as X variables. Write down the model equation for Model 4. Derive the fitted regression function for each class of frame and interpret the regression coefficients related to frame. (Hint: how should you define indicator variables for frame?)

10. Suppose you want to find out whether age interacts with frame, so you fit a regression model by adding this interaction into Model 4 (Model 5). Write down model equation for Model 5. Derive the fitted regression function for each class of frame and interpret the regression coefficients related to frame.

Model selection criteria.

11. Calculate R2p, R 2 a,p, AICp, BICp for Models 2, 3, 4, 5, respectively. Which model would

you select according to each of these criteria?

Processing of the data (cont’d). Here we consider all predictors in the data set.

12. Drop id, bp.2s, bp.2d from the data. The column id are patient IDs and thus is not an explanatory variable. The variables bp.2s, bp.2d have many missing values. You may use the code:

drops=c("id","bp.2s", "bp.2d")

data=trans[,!(names(trans)%in%drops)]

13. Drop all the cases having missing value. You may use the code:

index.na=apply(is.na(data), 1, any) ## identify cases with missing value.

data.s=data[index.na==FALSE,] ## drop cases with missing value.

data.s = data.s[,!names(data.s)=="glyhb*"] ## drop glyhbs

any(is.na(data.s)) ## this should return FALSE: no NA in data.s

dim(data.s) ##this should return 366 16: 366 cases, 16 variables

table(data.s$frame) ## this should show three classes.

14. Draw scatterplot matrix and obtain the pairwise correlation matrix for all quantitative variables in the data. Comment on their relationships. Hint: one way to compute the correlation matrix is first drop categorical variables and then use cor() function:

## drop the factor variables

data.q=data.s[,!(sapply(data.s,class)%in%’factor’)]

## compute the correlation matrix

cor(data.q)

Selection of first-order and second-order effects. We now consider subsets selection from the pool of all first-order effects of the 15 predictors.

15. Fit a model with all first-order effects (Model 6). How many regression coefficients are there in this model? What is the MSE from this model? You may use the code:

lm(glyhb ~., data=data.s)

where data.s denotes the data set obtained by part 13.

16. Consider best subsets selection using the R function regsubsets() from the leaps library with Model 6 as the full model. Return the top 1 best subset of all subset sizes (i.e., number of X variables) up to 16. Get R2p, R

2 a,p, AICp and BICp for each of these models.

Identify the best model according to each criterion.

17. We now explore a stepwise procedure. Apply the forward selection procedure us- ing R function stepAIC(), starting from the none-model and using the AICp crite- rion. What is the first-order model being selected (hint: for parameter scope, set upper=formula(fita), where fita = lm(glyhb~., data=data.s) is the full first-order model)? Is it the “best” model according to AICp criterion identified in part 19? What is the second-order model (model with interactions) being selected (hint: for parame- ter scope, set upper=formula(fita2), where fita2 = lm(glyhb~.^2, data=data.s) is the full second-order model)?