PYTHON PROGRAMMING

HW01_Feedback.pdf

WRITE UP (200 POINTS)

1. First Steps (40 points)

Describe the ames_train data set so that I am convinced you understand it.

(a) A Data Survey

- Take a broad overview of the Ames housing data set. Read over the data

documentation. What data do you have, and what is it supposed to represent?

- In the linear regression component of this course you build linear regression

models to predict the value of a property (or home). Do you have the right

data to properly address the problem? Are there observations in the data

that should be excluded?

- What kinds of problems can you properly address given the data that you

have? In particular if you were to build a regression model with the variable

SalePrice as the response variable, what types of properties would you be

valuing? Be careful about what you are doing here.
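
One way to start the survey is to tabulate what kinds of properties the file contains before any modeling. The following is a minimal sketch only: the rows are made-up stand-ins for ames_train, and the column names BldgType and Zoning are assumptions that may not match the course file.

```python
from collections import Counter

# Toy rows standing in for ames_train; all values are made up for illustration.
# The column names (BldgType, Zoning, SalePrice) are assumed, not confirmed.
rows = [
    {"BldgType": "1Fam", "Zoning": "RL", "SalePrice": 180000},
    {"BldgType": "1Fam", "Zoning": "RL", "SalePrice": 215000},
    {"BldgType": "TwnhsE", "Zoning": "RM", "SalePrice": 140000},
    {"BldgType": "Duplex", "Zoning": "RL", "SalePrice": 125000},
    {"BldgType": "1Fam", "Zoning": "C (all)", "SalePrice": 90000},
]


def survey(rows, column):
    """Count how many observations fall in each level of a categorical column."""
    return Counter(row[column] for row in rows)


building_types = survey(rows, "BldgType")
zoning_classes = survey(rows, "Zoning")

# Print the levels from most to least frequent.
for level, count in building_types.most_common():
    print(f"BldgType={level}: {count}")
for level, count in zoning_classes.most_common():
    print(f"Zoning={level}: {count}")
```

A table like this makes it visible when SalePrice covers more than one type of property, which is the point of the "be careful" warning above.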

(b) Define the Sample Population

- When building statistical models you have to define the population of

interest, and then sample from THAT population. Frequently you will not

actively perform the sampling function. Instead, the data will be made

available and you will have to sample from it retrospectively, i.e. you will need

to carve out the population of interest. In this assignment the objective is

to be able to provide estimates of home values for 'typical' homes in Ames,

Iowa. You may not be able to define what 'typical' is, but can use the data to

find out what is atypical. Any values which are not atypical are then

considered to be typical.

- Define the appropriate sample population for your statistical problem. Hint:

You are building regression models for the response variable SalePrice. Are all

properties the same? Would you want to include an apartment building in the

same sample as a single family residence? Would you want to include a

warehouse or a shopping center in the same sample as a single family

residence? Would you want to include condominiums in the same sample as a

single family residence?

- Define your sample using 'drop conditions'. Create a summary of the drop conditions

and include it in your report so that it is clear to any reader what you are

excluding from the data set when defining your sample population.

The definition of your sample data should be clearly noted in your assignment

report.
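
Drop conditions can be expressed as an ordered list of rules, where each observation is assigned to the first rule it meets and the counts are reported alongside the retained sample. The sketch below is illustrative only: the rows are made up, the column names are assumed, and the three rules are hypothetical examples rather than the expected answer.

```python
# Toy rows standing in for ames_train; all values are made up for illustration.
rows = [
    {"BldgType": "1Fam", "Zoning": "RL", "SaleCondition": "Normal"},
    {"BldgType": "1Fam", "Zoning": "RL", "SaleCondition": "Abnorml"},
    {"BldgType": "Duplex", "Zoning": "RL", "SaleCondition": "Normal"},
    {"BldgType": "1Fam", "Zoning": "C (all)", "SaleCondition": "Normal"},
    {"BldgType": "TwnhsE", "Zoning": "RM", "SaleCondition": "Normal"},
    {"BldgType": "1Fam", "Zoning": "RL", "SaleCondition": "Normal"},
]

# Ordered drop conditions: (label, predicate that is True when the row is dropped).
# These rules and the column names they use are hypothetical.
DROP_CONDITIONS = [
    ("01: Not single family", lambda r: r["BldgType"] != "1Fam"),
    ("02: Non-residential zoning", lambda r: not r["Zoning"].startswith("R")),
    ("03: Non-normal sale", lambda r: r["SaleCondition"] != "Normal"),
]


def apply_drop_conditions(rows, conditions):
    """Assign each row to the first drop condition it meets; keep the rest."""
    kept = []
    counts = {label: 0 for label, _ in conditions}
    for row in rows:
        for label, should_drop in conditions:
            if should_drop(row):
                counts[label] += 1
                break
        else:
            # No condition matched, so the row stays in the sample population.
            kept.append(row)
    return kept, counts


sample, drop_counts = apply_drop_conditions(rows, DROP_CONDITIONS)

for label, count in drop_counts.items():
    print(f"{label}: {count} dropped")
print(f"Retained sample: {len(sample)} of {len(rows)}")
```

Because each row is counted under only the first rule it meets, the drop counts plus the retained sample always add up to the original number of observations, which makes the exclusions easy for a reader to audit.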

(c) A Data Quality Check

- In practice your data will not be 'clean'. You will need to examine your data

for errors and outliers. Errors will not always show as outliers, and outliers

are not necessarily errors.

- If you have a data dictionary that states the set of proper values for each

field, then you will want to check your data against the data dictionary.

- If you do not have a data dictionary, then you will need to reason and

explore your way to a proper data set.

Example 1: In this project you will be modeling the sales price of housing

transactions. It should be obvious that none of these sales prices should be

zero or negative. Observations with a zero or negative sales price should

logically be considered to be errors.

Example 2: Suppose we had a 'small' number of housing transactions with a

sale price over one million dollars, should we consider these sales prices to be

valid? In this case these values could be valid data points, which would make

them outliers, or they could be errors, such as 140,000.00 entered as

1,400,000. In either case they are not relevant data points if the objective

is to model the 'typical' home price for the area.
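
The two examples can be turned into a simple check that separates logical errors from values that merely need review. This is a sketch under stated assumptions: the prices are made up, and the one million dollar cutoff is taken from Example 2 rather than from the data.

```python
# Cutoff taken from Example 2; whether it suits the real data is a judgment call.
HIGH_PRICE_THRESHOLD = 1_000_000


def check_sale_price(price):
    """Classify a sale price as 'error', 'review', or 'ok'."""
    if price <= 0:
        # A zero or negative sale price cannot be a real transaction value.
        return "error"
    if price > HIGH_PRICE_THRESHOLD:
        # Could be a valid outlier or a data entry mistake; not 'typical' either way.
        return "review"
    return "ok"


# Made-up prices for illustration only.
prices = [140000.0, 0.0, -5000.0, 1400000.0, 212500.0]
flags = [check_sale_price(p) for p in prices]

for price, flag in zip(prices, flags):
    print(f"{price:>12,.2f}  {flag}")
```

Keeping "error" and "review" as separate labels reflects the point above that errors do not always show as outliers and outliers are not necessarily errors.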

You skipped this section in the assignment. This is where you discover

that the train data should be reduced to just single family homes. Score Points 0

2. EDA (30 Points)

Pick ten variables from the data quality check to explore in your initial

exploratory data analysis. Perform an initial exploratory data analysis. How do you

perform an exploratory data analysis for continuous versus discrete (or

categorical) data? Consider the use of scatterplots, scatterplot smoothers such as

LOESS, and boxplots to produce relevant graphics when appropriate.

Note that you are particularly interested in the relationships between the

response variable and the predictor variables.

Suggest you split your EDA into two sections in your report – one section for

continuous variables and one section for discrete variables.
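
The split between continuous and discrete variables can be mirrored in code: a correlation with the response for a continuous predictor (the numeric counterpart of a scatterplot) and a per-level median of the response for a discrete predictor (the center line of a boxplot). The sketch below uses only the standard library and made-up rows; GrLivArea and BldgType are assumed column names, and the actual plots would still need a plotting library.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median

# Toy rows standing in for the sample population; values are made up.
rows = [
    {"GrLivArea": 1200, "BldgType": "1Fam", "SalePrice": 150000},
    {"GrLivArea": 1500, "BldgType": "1Fam", "SalePrice": 185000},
    {"GrLivArea": 1800, "BldgType": "1Fam", "SalePrice": 220000},
    {"GrLivArea": 1100, "BldgType": "TwnhsE", "SalePrice": 135000},
    {"GrLivArea": 1300, "BldgType": "TwnhsE", "SalePrice": 155000},
]


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def median_by_level(rows, column, response="SalePrice"):
    """Median of the response within each level of a discrete predictor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[response])
    return {level: median(values) for level, values in groups.items()}


# Continuous predictor: strength of the linear relationship with SalePrice.
r = pearson(
    [row["GrLivArea"] for row in rows],
    [row["SalePrice"] for row in rows],
)

# Discrete predictor: compare the response across levels.
medians = median_by_level(rows, "BldgType")

print(f"corr(GrLivArea, SalePrice) = {r:.3f}")
for level, value in medians.items():
    print(f"median SalePrice for {level}: {value:,.0f}")
```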

You should follow the assignment. Very hard to follow your work.

Much more to do here: scatterplots for all of the continuous variables vs.

SalePrice. Score Points 20

3. BUILD MODELS (100 Points)

Modeling looks OK. Score Points 100

4. SELECT MODELS (20 Points)

Good display of the metrics. Score Points 20

5. WRITE MODEL FORMULA (10 Points)

OK Score Points 10

SCORED DATA FILE (100 POINTS)

Obs  NAME                    ERROR
1    Perfect Model            0.00
2    koo_410_hw01         29699.08
3    Shell Code           31444.47
4    Average Value Model  60779.35
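
The two reference rows bracket any submitted model: a perfect model reproduces every actual value, and an average value model predicts the mean for every observation. The feedback does not say how ERROR is computed, so the sketch below assumes mean absolute error and uses made-up prices purely to show the comparison.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values.

    Assumed metric: the ERROR column's definition is not given in the feedback.
    """
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up sale prices for illustration only.
actual = [150000, 185000, 220000, 135000]

# Perfect model: predictions equal the actual values.
perfect_predictions = list(actual)

# Average value model: the same mean prediction for every observation.
average_predictions = [mean(actual)] * len(actual)

perfect_error = mean_absolute_error(actual, perfect_predictions)
baseline_error = mean_absolute_error(actual, average_predictions)

print(f"Perfect Model        {perfect_error:.2f}")
print(f"Average Value Model  {baseline_error:.2f}")
```

A useful model should land between these two bounds, and the closer to the perfect model the better.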

Your model scores OK but you need to follow the assignment.

Score Points 100

Total Points 250

[Figure: Predicted vs Actual. Y-axis: Count (0 to 400); x-axis: 0 to 1,000,000; series: Predicted, Actual.]