PYTHON PROGRAMMING
WRITE UP (200 POINTS)
1. First Steps (40 points)
Describe the ames_train data set so that I am convinced you understand it.
(a) A Data Survey
- Take a broad overview of the Ames housing data set. Read over the data
documentation. What data do you have, and what is it supposed to represent?
- In the linear regression component of this course you build linear regression
models to predict the value of a property (or home). Do you have the right
data to properly address the problem? Are there observations in the data
that should be excluded?
- What kinds of problems can you properly address given the data that you
have? In particular if you were to build a regression model with the variable
SalePrice as the response variable, what types of properties would you be
valuing? Be careful about what you are doing here.
(b) Define the Sample Population
- When building statistical models you have to define the population of
interest, and then sample from THAT population. Frequently you will not
actively perform the sampling function. Instead, the data will be made
available and you will have to sample from it retrospectively, i.e. you will need
to carve out the population of interest. In this assignment the objective is
to be able to provide estimates of home values for 'typical' homes in Ames,
Iowa. You may not be able to define what 'typical' is, but can use the data to
find out what is atypical. Any values which are not atypical are then
considered to be typical.
- Define the appropriate sample population for your statistical problem. Hint:
You are building regression models for the response variable SalePrice. Are all
properties the same? Would you want to include an apartment building in the
same sample as a single family residence? Would you want to include a
warehouse or a shopping center in the same sample as a single family
residence? Would you want to include condominiums in the same sample as a
single family residence?
- Define your sample using 'drop conditions'. Create a list of the drop conditions
and include it in your report so that it is clear to any reader what you are
excluding from the data set when defining your sample population.
The definition of your sample data should be clearly noted in your assignment
report.
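One way to make the drop conditions explicit is sketched below. It is only an illustration: the records are made up, and the column names BldgType, Zoning and SaleCondition, along with their values, are assumptions about ames_train rather than anything stated in this assignment. Each record receives the first drop condition it meets, so the counts add up to the full data set and a reader can see exactly what was excluded.

```python
from collections import Counter

# Hypothetical records. The column names and values are assumptions about
# ames_train, not taken from the assignment text.
homes = [
    {"BldgType": "1Fam", "Zoning": "RL", "SaleCondition": "Normal", "SalePrice": 215000},
    {"BldgType": "TwnhsE", "Zoning": "RL", "SaleCondition": "Normal", "SalePrice": 140000},
    {"BldgType": "1Fam", "Zoning": "C", "SaleCondition": "Normal", "SalePrice": 90000},
    {"BldgType": "1Fam", "Zoning": "RL", "SaleCondition": "Abnorml", "SalePrice": 120000},
    {"BldgType": "1Fam", "Zoning": "RM", "SaleCondition": "Normal", "SalePrice": 165000},
]

# Ordered drop conditions (illustrative choices, not the required ones).
# A record is labeled with the first condition that applies to it.
DROP_CONDITIONS = [
    ("01: Not single family", lambda r: r["BldgType"] != "1Fam"),
    ("02: Non-residential zoning", lambda r: r["Zoning"] not in {"RH", "RL", "RM", "FV"}),
    ("03: Not a normal sale", lambda r: r["SaleCondition"] != "Normal"),
]
KEEP = "99: Eligible sample"


def drop_condition(record):
    """Return the label of the first drop condition met, or KEEP if none apply."""
    for label, is_dropped in DROP_CONDITIONS:
        if is_dropped(record):
            return label
    return KEEP


labels = [drop_condition(r) for r in homes]
counts = Counter(labels)
sample = [r for r, label in zip(homes, labels) if label == KEEP]

# Print the count per condition, in condition order, for the report.
for label in sorted(counts):
    print(f"{label:<28}{counts[label]:>5}")
```

Ordering the conditions matters: a townhouse sold under abnormal conditions is counted once, under the first rule, which keeps the table additive.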
(c) A Data Quality Check
- In practice your data will not be 'clean'. You will need to examine your data
for errors and outliers. Errors will not always show as outliers, and outliers
are not necessarily errors.
- If you have a data dictionary that states the set of proper values for each
field, then you will want to check your data against the data dictionary.
- If you do not have a data dictionary, then you will need to reason and
explore your way to a proper data set.
Example 1: In this project you will be modeling the sales price of housing
transactions. It should be obvious that none of these sales prices should be
zero or negative. Observations with a zero or negative sales price should
logically be considered to be errors.
Example 2: Suppose we had a 'small' number of housing transactions with a
sale price over one million dollars. Should we consider these sales prices to be
valid? In this case these values could be valid data points, which would make
them outliers, or they could be errors, such as 140,000.00 entered as
1,400,000. In either case they are not relevant data points if the objective
is to model the 'typical' home price for the area.
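The two examples and the data dictionary check can be sketched as follows. This is a minimal illustration under stated assumptions: the records are invented, the allowed values for BldgType are placeholders rather than a copy of the Ames documentation, and the one million dollar ceiling simply mirrors Example 2.

```python
# Hypothetical data dictionary: placeholder values, not copied from the
# Ames documentation.
ALLOWED_VALUES = {"BldgType": {"1Fam", "2fmCon", "Duplex", "Twnhs", "TwnhsE"}}

# Assumed cut-off for an atypical sale price, following Example 2.
PRICE_CEILING = 1_000_000

# Invented records for illustration.
records = [
    {"BldgType": "1Fam", "SalePrice": 185000},
    {"BldgType": "1Fam", "SalePrice": 0},
    {"BldgType": "Condo", "SalePrice": 150000},
    {"BldgType": "1Fam", "SalePrice": 1400000},
]


def quality_issues(record):
    """Return a list of data quality messages for one record."""
    issues = []
    if record["SalePrice"] <= 0:
        # Example 1: a zero or negative price is logically an error.
        issues.append("error: non-positive SalePrice")
    elif record["SalePrice"] > PRICE_CEILING:
        # Example 2: could be a valid outlier or a data entry error.
        issues.append("review: SalePrice above ceiling")
    for field, allowed in ALLOWED_VALUES.items():
        if record[field] not in allowed:
            issues.append(f"error: {field}={record[field]!r} not in data dictionary")
    return issues


report = {i: quality_issues(r) for i, r in enumerate(records)}
flagged = {i: msgs for i, msgs in report.items() if msgs}

for i, msgs in flagged.items():
    print(i, msgs)
```

Errors and outliers are kept as separate message types because, as noted above, an outlier is not necessarily an error.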
You skipped this section in the assignment. This is where you discover
that the train data should be reduced to just single family homes. Score Points 0
2. EDA (30 Points)
Pick ten variables from the data quality check to explore in your initial
exploratory data analysis. Perform an initial exploratory data analysis. How do you
perform an exploratory data analysis for continuous versus discrete (or
categorical) data? Consider the use of scatterplots, scatterplot smoothers such as
LOESS, and boxplots to produce relevant graphics when appropriate.
Note that you are particularly interested in the relationships between the
response variable and the predictor variables.
Suggest you split your EDA into two sections in your report – one section for
continuous variables and one section for discrete variables.
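As a numeric companion to the graphics, the sketch below treats a continuous predictor and a discrete predictor differently: a correlation with SalePrice for the former (the number behind a scatterplot) and a median SalePrice per level for the latter (the center line of each boxplot). The rows are invented and the names GrLivArea and OverallQual are assumed; it does not replace the scatterplots, LOESS smoothers, and boxplots themselves, which need a plotting library.

```python
import statistics
from collections import defaultdict

# Invented rows. GrLivArea (continuous) and OverallQual (discrete) are
# assumed column names.
rows = [
    {"GrLivArea": 1000, "OverallQual": 5, "SalePrice": 120000},
    {"GrLivArea": 1500, "OverallQual": 6, "SalePrice": 170000},
    {"GrLivArea": 1200, "OverallQual": 5, "SalePrice": 135000},
    {"GrLivArea": 2000, "OverallQual": 7, "SalePrice": 230000},
    {"GrLivArea": 1800, "OverallQual": 6, "SalePrice": 190000},
    {"GrLivArea": 2400, "OverallQual": 7, "SalePrice": 260000},
]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def median_by_level(data, factor, response="SalePrice"):
    """Median of the response within each level of a discrete predictor."""
    groups = defaultdict(list)
    for row in data:
        groups[row[factor]].append(row[response])
    return {level: statistics.median(vals) for level, vals in sorted(groups.items())}


# Continuous predictor: strength of the linear relationship with SalePrice.
r = pearson_r([row["GrLivArea"] for row in rows],
              [row["SalePrice"] for row in rows])

# Discrete predictor: how the typical SalePrice shifts across levels.
medians = median_by_level(rows, "OverallQual")

print(f"corr(GrLivArea, SalePrice) = {r:.3f}")
print(medians)
```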
You should follow the assignment. Your work is very hard to follow.
Much more to do here: scatterplots for all of the continuous variables vs.
SalePrice. Score Points 20
3. BUILD MODELS (100 Points)
Modeling looks OK. Score Points 100
4. SELECT MODELS (20 Points)
Good display of the metrics. Score Points 20
5. WRITE MODEL FORMULA (10 Points)
OK Score Points 10
SCORED DATA FILE (100 POINTS)
Obs  NAME                    ERROR
1    Perfect Model            0.00
2    koo_410_hw01         29699.08
3    Shell Code           31444.47
4    Average Value Model  60779.35
Your model scores OK but you need to follow the assignment.
Score Points 100
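For context on the reference rows in the scoring table, the sketch below shows how a perfect model and an average value model bracket a candidate model. It assumes the ERROR column is a mean absolute error, which the table does not state, and all prices and predictions are invented.

```python
import statistics


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))


# Invented data. The real scoring used a held-out file not shown here.
train_prices = [120000, 150000, 180000, 210000]
actual = [130000, 160000, 200000]
model_predictions = [125000, 170000, 190000]

# Average value model: predict the training mean for every property.
baseline = statistics.fmean(train_prices)
baseline_predictions = [baseline] * len(actual)

errors = {
    "Perfect Model": mean_absolute_error(actual, actual),
    "Candidate Model": mean_absolute_error(actual, model_predictions),
    "Average Value Model": mean_absolute_error(actual, baseline_predictions),
}

for name, err in errors.items():
    print(f"{name:<22}{err:>12.2f}")
```

A useful model should land between the two reference rows, and closer to the perfect model than to the average value model.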
Total Points 250
[Figure: Predicted vs Actual. Vertical axis: Count (0 to 400); horizontal axis: 0 to 1,000,000; series: Predicted, Actual.]