PYTHON PROGRAMMING

Assignment Template

New and Revised for

September 2017

In the real world, you will be building predictive models and doing analytic work.
But that is not your only function. After you do the work, you need to explain it to
other people (most of whom will not understand analytics). Therefore, it is critical
that you are able to explain your results in such a way that non-analytic people can
understand them. If you dump 20 or 30 pages of output on a person and say “it’s all
in here”, they won’t read it. In fact, that person will likely just ignore your
results and go about their day-to-day business without giving your work a second
thought. This is not a desirable outcome. You must write your report so that it can
be understood by others, and it must contain enough detail that it can be replicated.

In my work I am often handed the work of others and asked to provide a critique.
If I am unable to replicate their work because it is lacking in detail, then the
critique will be very negative.

It is not enough that you build a great model. You also have to sell it.

DOs AND DON’Ts:

• DO put your document in PDF format
• DO put your name inside the document in the header
• DO put your name in the file name (e.g. “Homework_03_Fred_Smith.pdf”)
• DO limit your output to what supports the narrative
• DO focus your time on analyzing and explaining the output
• DO keep the output, charts, graphs, and tables close to your discussion
• DO put page numbers on the document
• DON’T put your document in MS WORD or some other format
• DON’T omit your name
• DON’T name your file “Homework_03.pdf”
• DON’T dump 30 or 40 pages of output and expect me to scroll through it
• DON’T include any diagram or graph or table without discussing it
• DON’T put your output at the end and say “refer to the diagram at the end of
  the document …” (UNLESS IT IS ABSOLUTELY NECESSARY to do that)

Example Report

(with a lot of commentary)

Assignment #3

Fred Smith PREDICT 410 Section 58

INTRODUCTION

The introduction should describe the purpose of the assignment and what you are

going to do in order to complete the assignment. It should be clear that you

understand why you are performing certain steps in an analysis.

BAD INTRODUCTION

The purpose of this report is to analyze baseball data.

GOOD INTRODUCTION

The purpose of the assignment is to analyze data from somewhere in order to

predict the number of something. This will be accomplished by generating simple

and multivariate regression models using different variable selection techniques

including, but not limited to, Forward, Stepwise, and Backward regression. From

these techniques, the best model will be selected. This best model will then be

further analyzed to determine if it is an adequate model to predict or if further

analysis is necessary.

Make sure you follow the assignment instructions. To get points for each of these
sections, you have to show them in your report. Each assignment will require a
different type of report. This template is fairly generic, so adjust it to the
assignment instructions.

If I don’t see the section in your report you will get 0 points for it.

1. Data Exploration

Important step. This is where you make or break model building. Spend time on

this.

• Mean / Standard Deviation / Median
• Do many charts: Bar Charts, Box Plots, Scatter Plots of the data
• Is the data correlated to the target variable (or to other variables)?
• Are any of the variables missing or out of range, so that they need to be
  imputed (“fixed”)?
• Don’t delete records if doing so would also cause test records to be deleted;
  fix them instead.
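The checks in this list can be sketched in a few lines of Python. This is a minimal sketch using only the standard library; the column names, the tiny inline data set, and the use of `None` to mark a missing value are all assumptions made for illustration. In a real assignment you would load the actual data file and most likely use a data analysis library.

```python
import math
import statistics

# Hypothetical sample records; None marks a missing value (an assumption).
rows = [
    {"LotArea": 8450, "SalePrice": 208500},
    {"LotArea": 9600, "SalePrice": 181500},
    {"LotArea": None, "SalePrice": 223500},
    {"LotArea": 11250, "SalePrice": 250000},
    {"LotArea": 14260, "SalePrice": 140000},
]


def describe(rows, column):
    """Return the missing-value count plus mean, standard deviation, and median."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "median": statistics.median(values),
    }


def pearson(rows, x_col, y_col):
    """Pearson correlation over the records where both values are present."""
    pairs = [(r[x_col], r[y_col]) for r in rows
             if r[x_col] is not None and r[y_col] is not None]
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


lot_area_summary = describe(rows, "LotArea")
lot_area_vs_price = pearson(rows, "LotArea", "SalePrice")
print(lot_area_summary)
print(round(lot_area_vs_price, 3))
```

The summary tells you at a glance whether a variable has missing values to fix, and the correlation gives a first hint of whether it is worth considering as a predictor.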

2. Data Preparation

Also, a critical section. Experiment with this step. Be creative. I like creative

ideas even if they don’t work.

• Fix missing values (maybe with a Mean or Median value)
• Fix outliers
• Create flags to indicate whether a variable was missing
• Transform data by putting it into buckets
• Try mathematical transforms such as log or square root
• Combine variables (such as ratios or adding or multiplying) to create
  new variables
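The preparation steps listed above might look like the following. This is a sketch under stated assumptions, not a required recipe: the records, the outlier cap, the bucket boundaries, and the derived variable names (`M_LotFrontage`, `LOG_LotArea`, `AreaBucket`, `FrontageRatio`) are all hypothetical and should be chosen from your own data exploration.

```python
import math
import statistics

# Hypothetical records; None marks a missing value (an assumption).
rows = [
    {"LotFrontage": 65.0, "LotArea": 8450},
    {"LotFrontage": None, "LotArea": 9600},
    {"LotFrontage": 80.0, "LotArea": 11250},
    {"LotFrontage": 60.0, "LotArea": 215245},  # unusually large lot
]

# Median of the observed values, used to fill the gaps.
frontage_median = statistics.median(
    r["LotFrontage"] for r in rows if r["LotFrontage"] is not None
)

AREA_CAP = 50000  # hypothetical ceiling for outliers


def bucket(area):
    """Put a lot area into a coarse size bucket (boundaries are made up)."""
    if area < 9000:
        return "small"
    if area < 12000:
        return "medium"
    return "large"


prepared = []
for r in rows:
    missing = r["LotFrontage"] is None
    frontage = frontage_median if missing else r["LotFrontage"]
    area = min(r["LotArea"], AREA_CAP)   # fix outliers by capping
    prepared.append({
        "LotFrontage": frontage,           # missing value replaced by the median
        "M_LotFrontage": int(missing),     # flag: 1 if the value was imputed
        "LotArea": area,
        "LOG_LotArea": math.log(area),     # mathematical transform
        "AreaBucket": bucket(area),        # bucketed version of the variable
        "FrontageRatio": frontage / area,  # combined variable
    })

for p in prepared:
    print(p)
```

Because every record is repaired rather than dropped, the same code can be applied to the test data without losing any rows that need a prediction.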

3. Build Models

These are instructions from Assignment 1 but will be similar in the other

assignments.

Build at least two different LINEAR REGRESSION models using different

variables. Show all of your models and the statistical significance of the input

variables.

Discuss the coefficients in the model. Do they make sense? Are you keeping the
model even though it is counterintuitive? Why?
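As an illustration of building two models on different variables and checking significance, the sketch below fits two one-variable least-squares regressions by hand and reports the t-statistic of each slope. The data are made up, and `fit_simple_ols` is a hypothetical helper written for this example. A statistics package would normally also report p-values and handle several predictors at once; this sketch stops at the t-statistic.

```python
import math


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x, with the slope's t-statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope, using n - 2 degrees of freedom.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return {"intercept": intercept, "slope": slope, "t_slope": slope / se_slope}


# Hypothetical data: two candidate predictors for the same target.
sale_price = [150000, 180000, 210000, 260000, 300000]
living_area = [1000, 1300, 1500, 1900, 2300]
overall_cond = [5, 7, 4, 6, 5]

model_a = fit_simple_ols(living_area, sale_price)
model_b = fit_simple_ols(overall_cond, sale_price)
for name, m in [("living_area", model_a), ("overall_cond", model_b)]:
    print(f"{name}: intercept={m['intercept']:.1f} "
          f"slope={m['slope']:.2f} t={m['t_slope']:.2f}")
```

In this made-up data the living-area slope has a large t-statistic while the condition slope is negative with a t-statistic near zero, which is exactly the kind of result the report should discuss rather than just display.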

Display the Python results for your assignment and comment on the results. Your

discussion of the results should be intertwined with (or linked to) the Python

output, i.e. the discussion should be on or near the page containing the output.

You should not be showing a lot of unnecessary Python output.

Discuss the results thoroughly. Include such discussion points as:

• What is observed in the graph / table / output
• Put the results in understandable (real-world) terms
• Are the results in keeping with theory?
• Do the results make sense?
• Should something different be done?

GOOD DESCRIPTION OF A DIAGRAM

The analysis continues by examining the plot of the residual values versus the
predicted values given in Figure 1. In this type of analysis, a visual inspection of
the chart is conducted to determine whether or not any patterns exist in the
residuals. Some patterns might include errors that increase or decrease with
larger predicted values, or some other type of pattern such as a curve. In an
ideal situation, the data will appear to be random. An inspection of Figure 1
suggests that the data points are randomly distributed and no obvious patterns
exist in the data. Therefore, there are no immediate concerns with the
distribution of the errors.

Figure 1 Housing Data Predicted vs Residual Graph

BAD DESCRIPTION OF A DIAGRAM

I examined the output at the end of the document. There are no patterns in the

data.

GOOD DESCRIPTION OF AN EQUATION

The model chosen from the different candidates was the XXX model because it

had the highest Adjusted R-Squared value and the lowest AIC and SBC values.

Using these metrics, it was far superior to the other models. The formula given for

the predicted sale price is:

p_saleprice = 50000
            + 5000 * X1 (LotFrontage)
            + 6000 * X2 (LotArea)
            + 3000 * X3 (OverallCond)

The formula makes intuitive sense for the most part because the coefficients
reflect that size and condition add to the value of a property. However, the data
should be analyzed for multicollinearity, which can result in sign changes. Also,
it might be wise to remove a variable from the model if no explanation can be
found for its coefficient.

BAD DESCRIPTION OF AN EQUATION

This is the formula I chose.

p_saleprice = 10.4901

+ 3.11867 * X1 + 5.24082 * X2 + 1.76700 * X4 + 2.65534 * X5 -

3.21636 * X6 - 1.94656 * X8 + 2.35175 * X9

Additionally, it is important to note that this model was developed on data from
XXXX years, so it is unknown whether it will translate into years in the future.
Further analysis will need to be done to determine whether this model will be
robust and translate outside the XXXX year time window.

NOTE: This is a made up formula, so don’t go investing in housing in New York

based on this model. Come to think of it, it’s probably not a good idea to invest in

New York unless you are very familiar with New York.

4. Select Models

Decide on the criteria for selecting the “Best Model”. Will you use a metric such as

Adjusted R-Squared or AIC? Will you select a model with slightly worse

performance if it makes more sense or is more parsimonious? Discuss why you

selected your model. Put the results in a table to display and discuss.
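One way to build such a comparison table is sketched below. The actual values, the predictions, and the model names are made up. The AIC shown is the common least-squares form n * ln(SSE / n) + 2k, which is correct only up to an additive constant, and counting the intercept as one of the k parameters is an assumption (conventions differ between packages).

```python
import math


def adjusted_r2(actual, predicted, n_predictors):
    """Adjusted R-squared: penalises R-squared for each extra predictor."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_y) ** 2 for a in actual)
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


def aic(actual, predicted, n_predictors):
    """AIC for a least-squares fit, up to an additive constant.

    The intercept is counted as one estimated parameter (an assumption).
    """
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return n * math.log(sse / n) + 2 * (n_predictors + 1)


# Hypothetical actual values and predictions from two candidate models.
actual = [200, 220, 250, 270, 300, 330]
candidates = {
    "Model A (2 predictors)": ([205, 215, 255, 265, 305, 325], 2),
    "Model B (4 predictors)": ([204, 216, 254, 266, 304, 326], 4),
}

table = []
for name, (predicted, k) in candidates.items():
    table.append((name, adjusted_r2(actual, predicted, k), aic(actual, predicted, k)))

print(f"{'Model':<24}{'Adj R2':>10}{'AIC':>10}")
for name, adj, a in table:
    print(f"{name:<24}{adj:>10.4f}{a:>10.2f}")
```

In this made-up data Model B has the smaller raw error, yet Model A wins on both metrics because it uses fewer predictors. That is the parsimony trade-off the discussion in the report should address.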

5. Model Formula

If you expect points for this step, show it in your report and explain it. You

will get 0 points if it is somewhere in your code and left out of the report.

Don’t expect that I will search your code for it.

Write Python code that will score new data and predict the sale price. The variable
with the predicted sale price should be named:

p_saleprice
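A scoring routine can be as small as the sketch below. The coefficients are copied from the made-up example formula earlier in this template, so they are placeholders for the coefficients of your own fitted model; the `score` function and the sample record are hypothetical.

```python
# Coefficients from the made-up example formula in this template;
# replace them with the coefficients of your own fitted model.
INTERCEPT = 50000
COEFFICIENTS = {
    "LotFrontage": 5000,
    "LotArea": 6000,
    "OverallCond": 3000,
}


def score(record):
    """Return the predicted sale price for one record (a dict of inputs)."""
    p_saleprice = INTERCEPT
    for name, coefficient in COEFFICIENTS.items():
        p_saleprice += coefficient * record[name]
    return p_saleprice


# Hypothetical new record, with inputs already cleaned the same way
# as the data the model was built on.
new_home = {"LotFrontage": 2, "LotArea": 3, "OverallCond": 5}
p_saleprice = score(new_home)
print(p_saleprice)  # 93000
```

Keeping the coefficients in one place next to the scoring function makes it easy to show the formula in the report and to confirm that the code matches it.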

6. Scored Data File

Make sure you submit your scored data as a CSV file.

Use the standalone program that you wrote in the previous section to score the data
file ames_test. Create a file that has only TWO variables for each record:

index

p_saleprice

The first variable, index, will allow me to match my grading key to your predicted
value. If I cannot do this, you won’t get a grade. The second variable, p_saleprice,
is the predicted sale price of a home based on the data given to you.
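Writing the two-column file might look like this. It is a minimal sketch using the standard `csv` module; the index values and predictions are made up, `write_scored_file` is a hypothetical helper, and an in-memory buffer stands in for the real output file so the example runs anywhere.

```python
import csv
import io


def write_scored_file(out, scored_rows):
    """Write exactly two columns, index and p_saleprice, as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["index", "p_saleprice"])
    for index, p_saleprice in scored_rows:
        writer.writerow([index, p_saleprice])


# Hypothetical (index, predicted price) pairs produced by the scoring code.
scored_rows = [(1461, 169277.05), (1462, 187758.39)]

# Stand-in for a real file, e.g. open("scored.csv", "w", newline="")
# (the file name is an assumption).
buffer = io.StringIO()
write_scored_file(buffer, scored_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Check the finished file before submitting: it should have one header row, one row per test record, and no extra columns.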

Your values will be compared against …

• A Perfect Model
• Instructor’s Model
• Performance of Other Students
• Predicting the Average value for everybody (MEAN)

If your model is not better than simply using an AVERAGE value, you will lose

points.
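You can run this sanity check yourself before submitting. The sketch below compares a model against the predict-the-average baseline on made-up data. RMSE is used as the error metric here, which is an assumption; the grading metric is not stated in this template.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical training prices, held-out prices, and model predictions.
train_prices = [140000, 200000, 230000, 310000]
actual = [150000, 180000, 210000, 260000, 300000]
model_pred = [155000, 175000, 220000, 250000, 290000]

# Baseline: predict the training average for every record.
mean_price = sum(train_prices) / len(train_prices)
baseline_pred = [mean_price] * len(actual)

model_rmse = rmse(actual, model_pred)
baseline_rmse = rmse(actual, baseline_pred)
beats_baseline = model_rmse < baseline_rmse
print(f"model RMSE={model_rmse:.0f}, baseline RMSE={baseline_rmse:.0f}, "
      f"better={beats_baseline}")
```

If `beats_baseline` comes out false on held-out data, revisit the data preparation and variable selection before scoring the test file.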

CONCLUSION:

A short wrap up of the assignment including a discussion of results and what was

learned.

GOOD CONCLUSION:

Several models were developed to predict the sale price of a home using Ames

Housing data. The best model was derived using XXXX. Although there were no

problems with the model from a statistical standpoint, the winning model did have a

sign issue with one of the variables where seemingly bad construction would result

in a higher sale price. This issue needs further investigation but is beyond the

scope of this document.

BAD CONCLUSION:

I built some models that were good and I learned a lot.

CODE:

Attach as a separate file or paste your code in at the end.

BONUS

Place all bonus work at the end of the document. Clearly identify what you are

doing and how many points you are trying to earn.