PYTHON PROGAMMING
Ames Housing OLS Regression Project (300 Points)
DELIVERABLES
Your write up in PDF Format (no zip files). Your write up should have five
sections. Each section should have enough detail so that I can follow your
logic and someone else can replicate your work.
A file that contains all the python code you used in your analysis. I should be
able to run this file and get all the output that you got.
A csv file, which has the scored records values from ames_test_sfam.
There will be only two columns in this file: index and p_saleprice. You will be
graded on how your model performs versus my model and those of other
students in the class. You can see how well your model performance
improves by continuing to submit your csv file to kaggle.
Section 1. Modeling & More (100 points)
Submit at least 4 models for this assignment. It is a continuation of the model
building process for the Ames Housing Data. Your models should predict SalePrice
with increasingly complex models. However, keep in mind the principle of
parsimony. Find the best model with the least complexity and a model that you can
explain.
Neighborhood Accuracy
Use one of your models from HW01. Make a boxplot of the residuals by
neighborhood. Which neighborhoods are better fit by the model? Do you have
neighborhoods that are consistently over-predicted? Do you have neighborhoods
that are consistently under-predicted?
Compute actual and estimated mean price per square foot for each neighborhood.
Group the neighborhoods by actual price per square foot. Create between 3 and 6
groups. Code a family of indicator variables for the neighborhoods to include in
your multiple regression model. What is your base category? Refit your multiple
regression model with your indicator variables.
Two new variables are defined in the python shell code.
df['qualityindex'] = (df.overallqual*df.overallcond)
df['totalsqftcalc'] = (df.bsmtfinsf1+df.bsmtfinsf2+df.grlivarea)
Include these in your models. Can you think of other variables that it would make
sense to define? Try something creative. It might work.
Section 2. Model Comparison of Y versus log(Y) (20 points)
In this section, fit two models using the same set of predictor variables, but the
response variables will be SalePrice and log(SalePrice). You may use any set of
predictor variables that you wish, but the models must include at least four
continuous predictor variables and any discrete variables that you wish.
Respond to all of these bullet questions:
How do we interpret these two models?
How is the interpretation of the log(SalePrice) model different from the
price model?
Which model fits better?
Did the transformation of the response to log(SalePrice) improve the model
fit?
In general when can a log transformation of the response variable improve
the model fit?
Should we consider any transformations to the predictors? If so, then fit
one more model using any transformations that you find appropriate.
Compute the VIF values for the models. If the models have highly correlated pairs
of predictors that you do not like, then go back, add them to your drop list, and re-
perform the variable selection before you go on with the assignment. The VIF
values do not need to be ideal, but if you have a very large VIF value (like 20, 30,
50 etc.), then you should consider removing a variable so that your variable
selection models are not junk too.
Produce the relevant diagnostic plots to assess the goodness-of-fit of each model.
On what criteria are you assessing the model fit? Always report the fitted model
when we fit a linear model. This means that your report should contain a table with
the coefficient estimates, t-values, p-values, etc.
Optional
If you would like to try an automated selection algorithm, try feature selection and
f_regression for numeric variables:
from sklearn import feature_selection
from sklearn.linear_model import LinearRegression
Sklearn DOES have a forward selection algorithm, although it isn't called that in
scikit-learn. The feature selection method called F_regression in scikit-learn will
sequentially include features that improve the model the most, until there are K
features in the model (K is an input).
It starts by regressing the labels on each feature individually, then observing
which feature improved the model the most using the F-statistic. Then it
incorporates the winning feature into the model. Then it iterates through the
remaining features to find the next feature which improves the model the most,
again using the F-statistic or F test. It does this until there are K features in the
model.
Some code for this:
y = train['saleprice']
X = train[['intercept',
'qualityindex','totalsqftcalc','yearbuilt','yearremodel','wooddecksf','openporchsf'
]].copy()
X.head()
model = feature_selection.SelectKBest(score_func=f_regression, k=4)
results = model.fit(X, y)
print results.scores_
Try it. See if f_regression improves your model.
Use Kaggle to compare your models and other metrics.
It is OK to submit to Kaggle many times.
Section 3. SELECT MODELS (20 Points)
Decide on the criteria for selecting the “Best Model”. Will you use a metric such as
Adjusted R-Square or AIC? Will you select a model with slightly worse
performance if it makes more sense or is more parsimonious? Discuss why you
selected your model. Put the metrics in a table to display the results.
If you are just using Kaggle, that is OK. Just tell me what you are doing.
Section 4. Model Formula (10 Points)
Write a formula that will predict the sale price of a home. Make sure you include
this as a section in your report. Do not expect that I will search your report
to find it. This step should allow someone else to deploy your model.
The variable with the predicted saleprice should be named:
p_saleprice
Section 5. SCORED DATA FILE (150 POINTS)
Pick your best model or send me more than one model. I will score several models
with my code for you. Use your best model. Score the data file ames_test_sfam.
Create a file that has only TWO variables for each record:
index
p_saleprice
The first variable, index, will allow me to match my grading key to your predicted
value.
If I cannot score your model, you won’t get a grade. So please include the index
number. The second value, p_saleprice is the predicted price for a property per
your model.
Your values will be compared against …
A Perfect Model
Shell Code Example Model
Performance of Other Students
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE value, you will lose
points.
BONUS Optional (10 pt Bonus): Assess the predictive accuracy of your model using cross-
validation. Use python code to split the full dataset for the Ames Housing Data
for your own train and test datasets. See HW02 shell code.
A defining feature of predictive modeling is assessing model performance out-of-
sample. You will use uniform random numbers to split the training data into a 70/30
train/test split. With a train/test split you have two data sets: one for in-sample
model development and one for out-of-sample model assessment.
If you want Bonus Points, write a brief section at the top of your Write Up
document and tell me exactly what you did and how many points you are attempting.
If I cannot see your Bonus work, I cannot give you credit. Bonus is difficult to
grade and I don’t have time to go back looking for it. If you don’t tell me it’s there,
I cannot give you points.
The policy with Bonus is: All Sales are Final !
(10 Points) Once you select a model try something else. Are the results the
same? Are there any differences?
(?? Points) Roll the dice … think of something creative and run with it. I might
give you points.
PENALTY BOX (Lose 10 Points) If you don’t have PDF format
(Lose 10 Points) If you don’t have a GOOD Introduction
(Lose 10 Points) If you don’t have a GOOD Conclusion
(Lose 10 Points) If you don’t put your NAME in the file names of any files you
hand in
(Lose 10 Points) If you don’t put your NAME inside the files you hand in