Data Analysis
Grading:
· Points are listed next to each question and should total 20 points overall.
· To encourage better understanding, you may have your data analysis pre-graded with feedback. After viewing the feedback, you may make changes and resubmit your analysis for final grading.
· View feedback by clicking Grades and your assignment. Feedback will be written on your document.
· Late assignments will not be graded. Lowest two data analysis grades will be dropped.
Instructions:
· Download or view questions below.
· Clearly label and type answers, without question prompts, in Word, Google Docs, PDF, or other word processing software.
· Insert diagrams or plots as a picture in an appropriate location.
· Math formulas need to be typed with MathType, LaTeX, or clearly using keyboard symbols such as +, -, *, /, sqrt(), and ^.
· Submit the assignment. Verify the correct document has been uploaded. If not, resubmit. You can submit up to three times.
Allowances:
· Any resources listed or posted in our class.
· You are encouraged to discuss the problems with other students, the instructor, and TAs; however, all work must be in your own words. Duplicate wording will be considered plagiarism.
· Outside resources need to be cited. Websites such as Chegg, Course Hero, and Koofers are discouraged, but if used they need to be cited and used within the boundaries of academic honesty.
The Data Collection: On May 22nd, 2018 the data in CorvallisRentalsSp2018.csv was created by going to the Corvallis Craigslist site and clicking on Apts/Housing for rent. The postings were selected through systematic random sampling. Since Craigslist posts the listings in a list, a starting point was randomly generated, then every 4th (randomly generated) posting was collected. If postings were outside the city of Corvallis, the next Corvallis posting was selected. Duplicate postings were skipped until the next new Corvallis posting was available. This sample is random, and assumed to be representative of current rental properties posted on Craigslist for Corvallis.
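The selection scheme described above can be sketched in R. This is only an illustration: the number of postings (n_listings), the seed, and the choice to draw the starting point from the first k postings are assumptions, not details of the actual data collection.

```r
# Illustrative systematic random sample of posting positions.
# n_listings and the seed are made-up values for this sketch.
set.seed(1)
n_listings <- 200                 # assumed number of postings in the list
k <- 4                            # sampling interval: every 4th posting

# Random starting point (assumed here to fall among the first k postings)
start <- sample(seq_len(k), 1)

# Positions of the postings that would be collected
sampled_positions <- seq(from = start, to = n_listings, by = k)
head(sampled_positions)
```

In practice, postings outside Corvallis and duplicates would then be replaced by the next eligible posting, as described above.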
The following variables were collected:
rent – monthly rent advertised
rooms – number of bedrooms, studio = 0
baths – number of bathrooms
sqrfoot – square footage of dwelling
house – indicator whether property is a house (instead of an apartment/dual living environment) 0 = No, 1 = Yes
campusclose – indicator whether property is close to campus (approx. within 1 mile) 0 = No, 1 = Yes
pets – indicator whether property allows pets 0 = No, 1 = Yes
new – indicator whether property is new or newly remodeled, 0 = No, 1 = Yes
Part 1. (4 points) Multivariate Visualization: It is reasonable to consider that many factors are used to predict monthly rents. Investigate the individual relationships between advertised rent and the above explanatory variables. Use the R script Multivariate_Rent_Analysis.R to help you get started with the code.
a. (1 point) Construct a scatterplot matrix including rent and each of the quantitative explanatory variables. Paste the plot.
b. (1 point) Which variables have an individual relationship with advertised monthly rent?
c. (1 point) Describe the relationships from 1b above.
d. (1 point) Is there visual evidence any of the explanatory variables are related to each other?
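As a starting point for Part 1, a scatterplot matrix and the matching correlation matrix might be produced as sketched below. The data frame here is simulated placeholder data that only reuses the column names listed above, so the plot and numbers it produces say nothing about the real listings. With the real file you would read CorvallisRentalsSp2018.csv instead.

```r
# Placeholder data with the same column names as the assignment file.
# Values are simulated for illustration only. With the real data, use:
#   rentals <- read.csv("CorvallisRentalsSp2018.csv")
set.seed(42)
n <- 40
rentals <- data.frame(
  rooms = sample(0:4, n, replace = TRUE),
  baths = sample(1:3, n, replace = TRUE)
)
rentals$sqrfoot <- 400 + 300 * rentals$rooms + round(rnorm(n, sd = 100))
rentals$rent <- 600 + 250 * rentals$rooms + 100 * rentals$baths +
  round(rnorm(n, sd = 150))

# Rent plus the quantitative explanatory variables (the indicators are left out)
quant_vars <- c("rent", "rooms", "baths", "sqrfoot")

# Scatterplot matrix: one panel per pair of variables
pairs(rentals[, quant_vars],
      main = "Rent and quantitative explanatory variables")

# Pairwise correlations to back up what the plot suggests
cor_matrix <- cor(rentals[, quant_vars])
round(cor_matrix, 2)
```

The first row (or column) of the plot shows rent against each explanatory variable; the remaining panels show how the explanatory variables relate to each other.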
Part 2. (8 points) Evaluate the full model, the model that uses all seven explanatory variables to predict monthly rents. You will notice that many of the predictors are not significant when the others are in the model. Follow the basic model selection process to settle on the best model to predict monthly rents in Corvallis.
Basic Model Selection Steps: (See lesson 49 for example)
1. Remove the variable with highest p-value and re-fit the model. Only remove one variable at a time.
2. Continue removing variables one-by-one until all variables in the model have a p-value less than 0.05.
3. Consider whether any of the variables in your model are related to each other. Check this with the scatterplot matrix and/or by finding the correlation between the two explanatory variables. If the correlation is not too strong, keep both variables in the model. This is your final model. However, if the correlation is too strong, one of the variables should be removed from the model. Re-fit two models, each without one of the correlated variables. Select the model with the higher adjusted R-squared value.
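The selection steps above can be sketched as a loop in R. This is a minimal sketch on simulated placeholder data that reuses the variable names from this assignment; the values, the seed, and the helper names (predictors, alpha, full_fit, adj_r2) are assumptions for illustration, and the model it settles on is not the answer for the real data.

```r
# Simulated placeholder data (illustration only, not the Craigslist sample)
set.seed(7)
n <- 60
rentals <- data.frame(
  rooms       = sample(0:4, n, replace = TRUE),
  baths       = sample(1:3, n, replace = TRUE),
  house       = rbinom(n, 1, 0.3),
  campusclose = rbinom(n, 1, 0.5),
  pets        = rbinom(n, 1, 0.5),
  new         = rbinom(n, 1, 0.2)
)
rentals$sqrfoot <- 400 + 300 * rentals$rooms + round(rnorm(n, sd = 120))
rentals$rent <- 500 + 300 * rentals$rooms + 150 * rentals$campusclose +
  round(rnorm(n, sd = 150))

predictors <- c("rooms", "baths", "sqrfoot", "house",
                "campusclose", "pets", "new")
alpha <- 0.05

# Full model with all seven explanatory variables
full_fit <- lm(reformulate(predictors, response = "rent"), data = rentals)

# Steps 1-2: drop the predictor with the highest p-value, one at a time,
# until every remaining p-value is below alpha
repeat {
  fit <- lm(reformulate(predictors, response = "rent"), data = rentals)
  coef_table <- summary(fit)$coefficients
  p_values <- setNames(coef_table[predictors, "Pr(>|t|)"], predictors)
  if (max(p_values) < alpha || length(predictors) == 1) break
  predictors <- setdiff(predictors, names(which.max(p_values)))
}

# Step 3: check correlations among the predictors that remain
round(cor(rentals[, predictors, drop = FALSE]), 2)

# Adjusted R-squared of the full model and of the reduced model
adj_r2 <- c(full  = summary(full_fit)$adj.r.squared,
            final = summary(fit)$adj.r.squared)
adj_r2
```

If two remaining predictors turn out to be strongly correlated, the same lm() call can be refit twice, leaving out one of them each time, and the adjusted R-squared values compared in the same way.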
a. (2 points) Provide a narrative for how you settled upon the final model. Example: “I first fit the full model and noticed the p-value for ____ was very high. I dropped it from the model and refit the data, then I checked the correlation between ___ and ___ to see if the relationship was too strong between the explanatory variables.”
b. (2 points) Provide the R output of your final model.
c. (2 points) State the least squares regression equation of your model.
d. (2 points) Compare the adjusted R-squared values from the full model to your final model. Is there much of a difference? What does this comparison tell us about the fit of the two models?
Part 3. (4 points) Model interpretation.
a. (2 points) Number of rooms should be a part of your final model. Interpret the coefficient of number of rooms while keeping the other variables constant.
b. (2 points) Calculate the 95% confidence interval for . Show work. Interpret the interval.
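A 95% confidence interval for a single regression coefficient has the form estimate ± t* × (standard error), where t* comes from the t distribution with the model's residual degrees of freedom. The sketch below shows that calculation by hand and checks it against confint(). The data are simulated, and both the two-predictor model and the choice of the rooms coefficient are assumptions made for illustration only.

```r
# Simulated placeholder data and a hypothetical two-predictor model
set.seed(11)
n <- 50
rentals <- data.frame(
  rooms       = sample(0:4, n, replace = TRUE),
  campusclose = rbinom(n, 1, 0.5)
)
rentals$rent <- 500 + 300 * rentals$rooms + 150 * rentals$campusclose +
  round(rnorm(n, sd = 150))
fit <- lm(rent ~ rooms + campusclose, data = rentals)

# Pieces of the interval, taken from the regression output
coef_table <- summary(fit)$coefficients
estimate   <- coef_table["rooms", "Estimate"]
std_error  <- coef_table["rooms", "Std. Error"]
t_star     <- qt(0.975, df = df.residual(fit))  # 97.5th percentile for a 95% CI

# Interval by hand: estimate +/- t* x SE
manual_ci <- c(lower = estimate - t_star * std_error,
               upper = estimate + t_star * std_error)

# Same interval from R's built-in function, as a check
builtin_ci <- confint(fit, "rooms", level = 0.95)
manual_ci
builtin_ci
```

Showing work means writing out the estimate, the standard error, the degrees of freedom, and t* as in the hand calculation; confint() is useful only to confirm the result.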
Part 4. (4 points) Prediction.
a. (2 points) Use the least squares regression equation to predict the monthly rent for an apartment close to campus with 2 bedrooms and 1 bathroom. The apartment has 922 sq. ft of floor space, is not new, and does not allow pets. Note: Your final model will not have all these variables.
b. (2 points) The listing above corresponds to the 19th listing in the dataset where the actual rent is $1200. How far off is your model at predicting the rent for this listing? In other words, calculate the residual.
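The prediction and residual can be sketched as follows. The listing's characteristics and the actual rent of $1200 come from the questions above; the fitted model and its data are simulated placeholders (assumptions), so the number this code produces is not the answer for the real dataset.

```r
# Simulated placeholder data and a hypothetical final model
set.seed(19)
n <- 50
rentals <- data.frame(
  rooms       = sample(0:4, n, replace = TRUE),
  campusclose = rbinom(n, 1, 0.5)
)
rentals$rent <- 500 + 300 * rentals$rooms + 150 * rentals$campusclose +
  round(rnorm(n, sd = 150))
fit <- lm(rent ~ rooms + campusclose, data = rentals)

# The listing from Part 4a; columns not in the model are simply ignored
new_listing <- data.frame(rooms = 2, baths = 1, sqrfoot = 922, house = 0,
                          campusclose = 1, pets = 0, new = 0)

# Predicted rent from the least squares regression equation
predicted_rent <- predict(fit, newdata = new_listing)

# Residual = actual value - predicted value
actual_rent <- 1200
residual <- actual_rent - predicted_rent
residual
```

A positive residual means the model under-predicts the rent for this listing; a negative residual means it over-predicts.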