Median Housing Price Prediction Model for D. M. Pan National Real Estate Company 1
Report: Housing Price Prediction Model for D. M. Pan National Real Estate Company
Kalyn Shoop
Southern New Hampshire University
Median Housing Price Model for D. M. Pan National Real Estate Company 2
Introduction
This report predicts the median listing price based on the median square footage using linear
regression, graphs, tables, and a sample to produce an analysis. Using linear regression, is most
appropriate when predicting the strength and trend of a set of data. For example, when
determining how much the square footage affects the listing price. The predictor variable is used
to predict and the response variable is what we are trying to predict. The predictor variable is the
median square footage of a home. The response variable is the median listing price of the home.
To determine the listing price, we need to use the median square footage.
Data Collection
I obtained the sample data by using the =RAND ( ) function in excel. I started by inserting a column to
the left of the column marked ‘Region’. Next I entered the =RAND ( ) function into the first cell of the
inserted column. I proceeded to copy and paste the function all the way until the last cell in the data set.
Then I used the sort function to randomize the data. Finally I selected the first 50 regions. Response
variable, Y= median listing price. Predictor Variable, X=median square footage.
Data Analysis
To use a linear regression for the data set, multiple requirements must be met. To do so a scatter plot must
be made. The first requirement is that the variables must have a true linear relationship. This can be
determined by the slope. If a data set is truly linear, the slope never changes. Next, we check to see if our
errors are normally distributed. We can note this through looking at a histogram of the variables. If they
are normal or have an obvious pattern, then they can be normally distributed. We also have to look for
equality of variance. If the dots on the plots do not curve into a megaphone pattern, then it has equal
variance. We can determine independence of residuals depend on the type of data. If it’s cross sectional
data it is assumed. This data set sample meets all previously listed requirements.
When looking at a linear regression involving the variable involving two variables, the response variable
is the variable being modeled or predicted, while the predictor variable is the variable that is used to
predict the response. The response variable will respond to a change in the predictor variable. In a
scatterplot, the predictor variable is on the x axis and the response variable is on the y axis. In the case of
this data set, the response variable is the median listing price and the predictor variable is the median
square footage.
For the histogram, median square feet have an unimodal symmetric shape. It only peaks around the square
footage of 1,839 to 2,169. The data has a center of 2,043 which is between 1,839 and 2,169 square feet.
The data has a smaller spread (standard deviation). From this data, we can interpret that most of the data
is UConn’s from a square footage around 1,839 to 2,169. The standard deviation of this data set is
relatively small. This produces little variation.
The second histogram median listing price unimodal skewed right distribution. It only peaks around the
listing price of $234,000 and $374,000. The data has a center of $283,575 which is between $234,000 and
$374,000. The data has a smaller spread (standard deviation). This data set has outliers at hat differ
greatly from the data set resting between $654,000 and $794,000. Keeping in mind that these outliers can
Median Housing Price Model for D. M. Pan National Real Estate Company 3
greatly affect the mean. From this data we can interpret that most of the data is listed at $90,000 to
$514,000. The standard deviation of the data set is $145,381 which is relatively small and the data is
concentrated close to the mean producing little variation.
Overall, we can see that majority of the listings are between $94,000 and $514,000 with a square footage
of 1,509 and 2,499 based on the shape, center spread and outliers. Also this data set contains little
variance.
The overall characteristics between the national sample and the national population are close. The
histogram of the national population median listing price and the national sample are both unimodal and
symmetrical. The only difference between the two histograms are that they have greater spreaders leading
to greater variation due to having more data than the sample.
Looking at the summary statistics of both samples concludes that the national sample could me used to
represent the national population. The difference between the measure of spread between the median
listing price of both national population and national sample are Mean: Sample is 5.5% more than
national population. Median: Sample is 10.4% more than national population. Standard Deviation:
Sample is 11.3% less than national population. The difference between the measure of spread between the
median square footage of national population and national sample are Mean: Sample is 3.2% more than
national population. Median: Sample is 7.5% more than national population. Standard Deviation: Sample
is 4.9% less than national population. From this we can see that there is not much difference between the
national sample and national population.
Develop Regression Model
A regression model can be developed for the data set. Regression models are used for making
observations and predictions between the relationship of independent and dependent variables. The
analysis of this data set is being used to determine the use of square footage as a benchmark for listing
prices on homes. To develop a regression model for this data set, you must create a regression equation
which consists of the dependent variable (Y), the independent variable (X) the coefficient and intercept.
All these factors can be determined using the data analysis function in excel.
Based on the scatter plot, it depicts a moderate positive linear correlation. The scatter plot also shows 3
outlines that have high leverage that are possibly throwing off the trendline and heavily influencing the
mean of the data.
There’s not much change on my graph after removing the 3 outliers. Looking at the summary statistics
once the outliers are removed, it makes minor changes in the measures of center/spread.
The correlation coefficient / multiple R in excel) R - value is 0.584. I got this value by using regression
under data analysis in excel. The value determines the strength of the correlation. It is neither close to 1
(strong correlation) or close to -1 (weak correlation), its in the middle. Therefore it is moderate.
Determine the Line of Best Fit
The regression equals 242.88 (square foot) and the intercept is 182,801 (listing price). Therefore, for
every increase in 242.88 sq. ft, the listing price increases by $182,801. R squared is 0.340. I came to this
Median Housing Price Model for D. M. Pan National Real Estate Company 4
value by using regression under data analysis in excel. This means that about 34% of the variability in the
median listing price can be explained by the regression model. Based on my analysis, I will be making
two predictions. If I wanted to find the median listing price of a home with 2222 sq ft (Y= 242.88 x 1967
- 182,801) I would get a listing price of $294,943.
Conclusions
This analysis conveys that this sample can be used to represent the national population. Based on the
summary statistics, histograms, and scatter plot, the national population has slight difference compared to
the sample. The regression equation can be used to make predictions for further listing prices of homes
based on their square footage. This sample aligns perfectly with my expectations based on the scatter plot.
In this sample, outliers were not a problem at all. Perhaps, in a different sample with larger outliers it
would affect the trendline and correlation of the scatter plot.