PYTHON PROGRAMMING
Assignment Template
New and Revised for
September 2017
In the real world, you will be building predictive models and doing analytic work.
But that is not your only function. After you do the work, you need to explain it to
other people (most of whom will not understand analytics). Therefore, it is critical
that you are able to explain your results in such a way that non-analytic people can
understand them. If you dump 20 or 30 pages of output on the person and say “it’s all
in here”, then they won’t read it. In fact, that person will likely just ignore your
results and go about their day-to-day business without giving your work a second
thought. This is not a desirable outcome. You must write your report so that it can
be understood by others, and it must contain enough detail that it can be replicated.
In my work I am often handed the work of others and asked to provide a critique.
If I am unable to replicate their work because it is lacking in detail then the
critique will be very negative.
It is not enough that you build a great model. You also have to sell it.
DOs AND DON’Ts:
DO put your document in PDF Format
DO put your name inside the document in the header
DO put your name in the file name (e.g. “Homework_03_Fred_Smith.pdf”)
DO limit your Python output to what supports the narrative
DO focus your time on analyzing and explaining the output
DO keep the output, charts, graphs, and tables close to your discussion
DO put page numbers on the document
DON’T put your document in MS WORD or some other format
DON’T omit your name.
DON’T name your file “Homework_03.pdf”
DON’T dump 30 or 40 pages of output and expect me to scroll through it
DON’T include any diagram or graph or table without discussing it
DON’T put your output at the end and say “refer to the diagram at the end of
the document …” (UNLESS IT IS ABSOLUTELY NECESSARY to do that)
Example Report
(with a lot of commentary)
Assignment #3
Fred Smith PREDICT 410 Section 58
INTRODUCTION
The introduction should describe the purpose of the assignment and what you are
going to do in order to complete the assignment. It should be clear that you
understand why you are performing certain steps in an analysis.
BAD INTRODUCTION
The purpose of this report is to analyze baseball data.
GOOD INTRODUCTION
The purpose of the assignment is to analyze data from somewhere in order to
predict the number of something. This will be accomplished by generating simple
and multivariate regression models using different variable selection techniques
including, but not limited to, Forward, Stepwise, and Backward regression. From
these techniques, the best model will be selected. This best model will then be
further analyzed to determine if it is an adequate model to predict or if further
analysis is necessary.
Make sure you follow the assignment instructions. To get points for each of these
sections, you have to show them in your report. Each assignment will require a
different type of report. This template is fairly generic, so adjust it to the assignment
instructions.
If I don’t see the section in your report you will get 0 points for it.
1. Data Exploration
Important step. This is where you make or break model building. Spend time on
this.
Mean / Standard Deviation / Median
Do many charts: bar charts, box plots, scatter plots of the data
Is the data correlated to the target variable (or to other variables)?
Are any of the variables missing or out of range, and do they need to be
imputed (“fixed”)?
Don’t delete problem records, because the same rule would delete test
records. Fix them instead.
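As a rough illustration of the checks listed above, here is a minimal sketch using pandas. The data frame is a made-up toy example, and the column names (SalePrice, LotArea, LotFrontage, OverallCond) are assumptions used only for illustration; substitute the columns from your own assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the assignment this would be your training file.
df = pd.DataFrame({
    "SalePrice":   [208500, 181500, 223500, 140000, 250000, 143000],
    "LotArea":     [8450, 9600, 11250, 9550, 14260, 14115],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 85.0],
    "OverallCond": [5, 8, 5, 5, 5, 5],
})

# Mean, standard deviation, and median for every column (one row per variable).
summary = df.agg(["mean", "std", "median"]).T

# Number of missing values per column; anything above 0 needs a fix later.
missing = df.isna().sum()

# Correlation of each predictor with the target variable.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice")

print(summary)
print(missing)
print(corr_with_target)
```

In a report, show only the pieces of this output that you actually discuss.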
2. Data Preparation
Also, a critical section. Experiment with this step. Be creative. I like creative
ideas even if they don’t work.
Fix missing values (maybe with a Mean or Median value)
Fix outliers
Create flags to suggest if a variable was missing
Transform data by putting it into buckets
Try mathematical transforms such as log or square root
Combine variables (such as ratios or adding or multiplying) to create
new variables
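The preparation ideas above could be sketched roughly as follows. This is only one possible approach on made-up data: the column names, the cap of 50000, and the bucket boundaries are all assumptions chosen for illustration, not recommended values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one extreme lot size.
df = pd.DataFrame({
    "LotArea":     [8450, 9600, 11250, 9550, 215245],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "GrLivArea":   [1710, 1262, 1786, 1717, 2198],
})

# Create the missing-value flag BEFORE filling, so the information is kept.
df["M_LotFrontage"] = df["LotFrontage"].isna().astype(int)

# Fix missing values with the median of the observed values.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Fix outliers by capping at an assumed upper limit.
df["LotArea_capped"] = df["LotArea"].clip(upper=50000)

# Mathematical transform: natural log of a skewed variable.
df["log_LotArea"] = np.log(df["LotArea"])

# Put a variable into buckets (boundaries are assumptions).
df["LotArea_bucket"] = pd.cut(
    df["LotArea_capped"],
    bins=[0, 9000, 12000, 50000],
    labels=["small", "medium", "large"],
)

# Combine variables into a ratio.
df["LivArea_per_LotArea"] = df["GrLivArea"] / df["LotArea_capped"]

print(df)
```

Whatever fixes you apply to the training data must also be applied, with the same values, when you score the test data.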
3. Build Models
These are instructions from Assignment 1 but will be similar in the other
assignments.
Build at least two different LINEAR REGRESSION models using different
variables. Show all of your models and the statistical significance of the input
variables.
Discuss the coefficients in the model. Do they make sense? Are you keeping the
model even though it is counterintuitive? Why?
Display the Python results for your assignment and comment on the results. Your
discussion of the results should be intertwined with (or linked to) the Python
output, i.e. the discussion should be on or near the page containing the output.
You should not be showing a lot of unnecessary Python output.
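For illustration, here is a minimal sketch of fitting two linear regression models and printing a compact coefficient table instead of pages of raw output. It uses NumPy's least-squares solver on synthetic data; the helper name fit_ols, the variable names, and the data-generating numbers are all assumptions. It reports t-statistics only; a statistics package such as statsmodels would also report p-values.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns coefficients, standard errors, and t-statistics.
    """
    X = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = len(y) - X.shape[1]                      # residual degrees of freedom
    sigma2 = residuals @ residuals / dof           # estimated error variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, std_err, beta / std_err

# Synthetic data (made up for this sketch, not the assignment data).
rng = np.random.default_rng(0)
n = 200
lot_area = rng.uniform(5000, 15000, n)
overall_cond = rng.integers(1, 10, n).astype(float)
sale_price = 50000 + 6.0 * lot_area + 3000 * overall_cond + rng.normal(0, 10000, n)

# Two candidate models using different variables.
models = {
    "Model A": (["LotArea"], lot_area),
    "Model B": (["LotArea", "OverallCond"], np.column_stack([lot_area, overall_cond])),
}

results = {}
for name, (labels, X) in models.items():
    beta, std_err, t_stat = fit_ols(X, sale_price)
    results[name] = (beta, std_err, t_stat)
    print(name)
    for label, b, s, t in zip(["Intercept"] + labels, beta, std_err, t_stat):
        print(f"  {label:<12} coef={b:12.2f}  se={s:10.2f}  t={t:8.2f}")
```

A table like this, placed next to a paragraph explaining whether each sign and magnitude makes sense, is usually all the output a reader needs.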
Discuss the results thoroughly. Include such discussion points as:
What is observed in the graph / table / output
Put the results in understandable (real-world) terms
Are the results in keeping with theory?
Do the results make sense?
Should something different be done?
GOOD DESCRIPTION OF A DIAGRAM
The analysis continues by examining the plot of the residual values versus the
predicted values given in Figure 1. In this type of analysis, a visual inspection of
the chart is conducted to determine whether or not any patterns exist in the
residuals. Some patterns might include errors that increase or decrease with
larger predicted values or some other type of pattern such as a curve. In an
ideal situation, the data will appear to be random. An inspection of Figure 1
suggests that the data points are randomly distributed and no obvious patterns
exist in the data. Therefore, there are no immediate concerns with the
distribution of the errors.
Figure 1 Housing Data Predicted vs Residual Graph
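The values behind a chart like the one described above could be computed roughly as follows. This sketch uses synthetic data and a one-variable fit; all names and numbers are assumptions. The grouped summary at the end is only a crude numeric companion to the visual inspection, not a replacement for the scatter plot of predicted against residuals.

```python
import numpy as np

# Synthetic data with constant error variance (made up for this sketch).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, 300)

# Fit a straight line; np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted          # these two arrays are what the chart plots

# Split the residuals into thirds by predicted value and compare them.
# Similar means and spreads across groups suggest no obvious pattern.
order = np.argsort(predicted)
groups = np.array_split(residuals[order], 3)
for name, part in zip(["low", "mid", "high"], groups):
    print(f"{name:>4} predicted: mean residual {part.mean():7.3f}, spread {part.std():6.3f}")
```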
BAD DESCRIPTION OF A DIAGRAM
I examined the output at the end of the document. There are no patterns in the
data.
GOOD DESCRIPTION OF AN EQUATION
The model chosen from the different candidates was the XXX model because it
had the highest Adjusted R-Squared value and the lowest AIC and SBC values.
Using these metrics, it was far superior to the other models. The formula given for
the predicted sale price is:
p_saleprice = 50000
+ 5000 * X1 LotFrontage
+ 6000 * X2 LotArea
+ 3000 * X3 OverallCond
The formula makes intuitive sense for the most part because the positive
coefficients reflect that size and condition add to the value of a property.
However, the data should be analyzed for multicollinearity, which can result in sign
changes. If a variable has a counterintuitive sign and no explanation can be
found, it might be wise to remove that variable from the model.
BAD DESCRIPTION OF AN EQUATION
This is the formula I chose.
p_saleprice = 10.4901
+ 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5 -
3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9
Additionally, it is important to note that this model was developed on data from
XXXX years, so it is unknown whether this model will translate into years in
the future. Further analysis will need to be done to determine whether this model
will be robust and translate outside the XXXX year time window.
NOTE: This is a made up formula, so don’t go investing in housing in New York
based on this model. Come to think of it, it’s probably not a good idea to invest in
New York unless you are very familiar with New York.
4. Select Models
Decide on the criteria for selecting the “Best Model”. Will you use a metric such as
Adjusted R-Squared or AIC? Will you select a model with slightly worse
performance if it makes more sense or is more parsimonious? Discuss why you
selected your model. Put the results in a table to display and discuss.
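To make the trade-off concrete, here is a small sketch that builds such a comparison table. The error sums and predictor counts are invented numbers, and the AIC shown is one common form for least-squares models (n * ln(SSE / n) + 2 * number of parameters); other software may add a constant, which does not change the ranking.

```python
import math

def adjusted_r2(sse, sst, n, p):
    """Adjusted R-Squared for a model with p predictors fit on n records."""
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def aic(sse, n, p):
    """AIC up to an additive constant; p + 1 counts the intercept."""
    return n * math.log(sse / n) + 2 * (p + 1)

# Hypothetical candidates: (sum of squared errors, number of predictors).
n, sst = 100, 1000.0
candidates = {
    "Model A": (400.0, 2),
    "Model B": (380.0, 5),
    "Model C": (379.0, 9),
}

table = {
    name: (adjusted_r2(sse, sst, n, p), aic(sse, n, p), p)
    for name, (sse, p) in candidates.items()
}

print(f"{'Model':<8} {'Predictors':>10} {'Adj R-Sq':>9} {'AIC':>8}")
for name, (adj, a, p) in table.items():
    print(f"{name:<8} {p:>10} {adj:>9.4f} {a:>8.2f}")

best_adj = max(table, key=lambda m: table[m][0])   # higher is better
best_aic = min(table, key=lambda m: table[m][1])   # lower is better
print("Best by Adjusted R-Squared:", best_adj)
print("Best by AIC:", best_aic)
```

With these invented numbers the two metrics disagree: the five-variable model has the higher Adjusted R-Squared, while the two-variable model has the lower AIC. That is exactly the situation where your report needs to explain why you chose one over the other.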
5. Model Formula
If you expect points for this step, show it in your report and explain it. You
will get 0 points if it is somewhere in your code and left out of the report.
Don’t expect that I will search your code for it.
Write Python code that will score new data and predict the sale price. The variable
with the predicted sale price should be named:
p_saleprice
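A scoring function might look something like the sketch below. It reuses the made-up coefficients from the example equation earlier in this template, so the numbers are for illustration only. The function name score_record and the fill-in values in DEFAULTS are assumptions; in your own work the fill-in values would come from your data preparation step.

```python
# Hypothetical fill-in values for missing inputs (e.g. training medians).
DEFAULTS = {"LotFrontage": 68.0, "LotArea": 9500.0, "OverallCond": 5.0}

def score_record(record):
    """Return p_saleprice for one record given as a dict of input values."""
    # Fix missing inputs instead of dropping the record.
    values = {}
    for name, default in DEFAULTS.items():
        value = record.get(name)
        values[name] = default if value is None else value

    # Made-up coefficients copied from the example equation in this template.
    p_saleprice = (
        50000
        + 5000 * values["LotFrontage"]
        + 6000 * values["LotArea"]
        + 3000 * values["OverallCond"]
    )
    return p_saleprice

complete = {"LotFrontage": 70, "LotArea": 9000, "OverallCond": 5}
incomplete = {"LotArea": 9000, "OverallCond": 5}   # LotFrontage is missing
print(score_record(complete))
print(score_record(incomplete))
```

Because the function never drops a record, every row in the test file receives a prediction.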
6. Scored Data File
Make sure you submit it as a CSV file.
Use the standalone program that you wrote in the previous section. Score the data
file ames_test. Create a file that has only TWO variables for each record:
index
p_saleprice
The first variable, index, will allow me to match my grading key to your predicted
value. If I cannot do this, you won’t get a grade. The second variable, p_saleprice, is
the predicted sale price of a home based on the data given to you.
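Writing the two-column file could be done along these lines with Python's built-in csv module. The index values, predictions, and output file name here are invented for the sketch; in practice the predictions would come from scoring ames_test with your own program.

```python
import csv
import os
import tempfile

def write_scored_file(predictions, path):
    """Write (index, p_saleprice) pairs to a two-column CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "p_saleprice"])        # exactly two variables
        for index, p_saleprice in predictions:
            writer.writerow([index, round(p_saleprice, 2)])

# Hypothetical scored records: (index, predicted sale price).
predictions = [(1461, 169277.05), (1462, 187758.39), (1463, 183583.68)]

# Hypothetical output location; use whatever file name the assignment requires.
out_path = os.path.join(tempfile.gettempdir(), "scored_example.csv")
write_scored_file(predictions, out_path)

# Read the file back to confirm the layout before submitting.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows[0])
print(len(rows) - 1, "records written")
```

Reading the file back is a cheap way to confirm that the header names are exact and that no records were lost.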
Your values will be compared against …
A Perfect Model
Instructor’s Model
Performance of Other Students
Predict the Average value for everybody (MEAN)
If your model is not better than simply using an AVERAGE value, you will lose
points.
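You can run a similar check yourself before submitting. The sketch below compares a model against the predict-the-average baseline on invented numbers. The error metric used for grading is not stated here, so the use of root mean squared error (RMSE) is an assumption.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical hold-out records with known sale prices.
actual = [200000, 150000, 300000, 250000]
model_pred = [210000, 140000, 280000, 255000]

# Baseline: predict the same average value for everybody.
# (Here the average of the hold-out prices is used for simplicity; in
# practice you would use the mean sale price of your training data.)
mean_value = sum(actual) / len(actual)
baseline_pred = [mean_value] * len(actual)

model_rmse = rmse(actual, model_pred)
baseline_rmse = rmse(actual, baseline_pred)
print(f"Model RMSE:    {model_rmse:,.1f}")
print(f"Baseline RMSE: {baseline_rmse:,.1f}")
print("Model beats the average:", model_rmse < baseline_rmse)
```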
CONCLUSION:
A short wrap up of the assignment including a discussion of results and what was
learned.
GOOD CONCLUSION:
Several models were developed to predict the sale price of a home using Ames
Housing data. The best model was derived using XXXX. Although there were no
problems with the model from a statistical standpoint, the winning model did have a
sign issue with one of the variables where seemingly bad construction would result
in a higher sale price. This issue needs further investigation but is beyond the
scope of this document.
BAD CONCLUSION:
I built some models that were good and I learned a lot.
CODE:
Attach as a separate file or paste your code in at the end.
BONUS
Place all bonus work at the end of the document. Clearly identify what you are
doing and how many points you are trying to earn.