WOODSTOCK’18, June, 2018, El Paso, Texas USA |
F. Surname et al. |
House Prices: Advanced Regression Techniques ∗
FirstName Surname† Department Name Institution/University Name City State Country email@email.com
1 Executive summary
The Kaggle competition "House Prices: Advanced Regression Techniques" provides a dataset of residential homes in Ames, Iowa, with features such as size, number of bedrooms and bathrooms, location, and other factors that might influence the sale price. The goal of the competition is to use machine learning techniques to predict the final sale price of each home in the test dataset, using a model fitted on the training dataset (source: House Prices - Advanced Regression Techniques, Kaggle).
The dataset includes 79 explanatory variables (features) describing various aspects of the residential homes, such as the overall quality of materials and finish, the number of fireplaces, the presence of a pool, and many more. Additionally, there are two columns in the dataset containing the ID of each house and the sale price that we aim to predict.
Because space is limited, summary statistics are listed here for a subset of the variables only.
| | SalePrice | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt |
|---|---|---|---|---|---|---|---|
| mean | 180921.2 | 56.8972 | 70.04996 | 10516.8 | 6.099315 | 5.575342 | 1971.2 |
| std | 79442.5 | 42.3005 | 24.28475 | 9981.26 | 1.382997 | 1.112799 | 30.202 |
| min | 34900 | 20 | 21 | 1300 | 1 | 1 | 1872 |
| 25% | 129975 | 20 | 59 | 7553.5 | 5 | 5 | 1954 |
| 50% | 163000 | 50 | 69 | 9478.5 | 6 | 5 | 1973 |
| 75% | 214000 | 70 | 80 | 11601.5 | 7 | 6 | 2000 |
| max | 755000 | 190 | 313 | 215245 | 10 | 9 | 2010 |
| | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF |
|---|---|---|---|---|---|---|---|
| mean | 1984.866 | 103.6853 | 443.6397 | 46.54932 | 567.2404 | 1057.429 | 1162.62 |
| std | 20.64541 | 181.0662 | 456.0981 | 161.3193 | 441.867 | 438.7053 | 386.587 |
| min | 1950 | 0 | 0 | 0 | 0 | 0 | 334 |
| 25% | 1967 | 0 | 0 | 0 | 223 | 795.75 | 882 |
| 50% | 1994 | 0 | 383.5 | 0 | 477.5 | 991.5 | 1087 |
| 75% | 2004 | 166 | 712.25 | 0 | 808 | 1298.25 | 1391.25 |
| max | 2010 | 1600 | 5644 | 1474 | 2336 | 6110 | 4692 |
The dataset requires pre-processing and cleaning, as there are missing values, categorical features, and outliers. Participants can use a variety of regression techniques to make their predictions, such as linear regression, decision trees, random forests, gradient boosting, and neural networks.
The competition provides a public leaderboard that scores participants' models on a portion of the test data. The final evaluation is based on the root mean squared error (RMSE) between the logarithm of the predicted sale price and the logarithm of the actual sale price on the remainder of the test data; taking logarithms means that errors on expensive and inexpensive houses affect the score equally. The winner of the competition is the participant with the lowest RMSE on the final evaluation set.
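As a hedged illustration of this metric, the sketch below computes the RMSE between log prices in plain Python, following the competition's evaluation rule that the error is taken between the logarithms of predicted and actual prices. The function name `rmse_log` and the sample prices are made up for illustration and are not part of the competition code.

```python
import math

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of predicted and actual prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices: both predictions are 10% too high.
actual = [200000.0, 100000.0]
predicted = [220000.0, 110000.0]
score = rmse_log(actual, predicted)
print(round(score, 4))
```

Because the error is measured on the log scale, the 10% overestimate costs the same for the cheaper and the more expensive house.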
2 Benchmarking of other solutions
In this part, we will be analyzing three Kaggle solutions for the House Prices competition. The solutions are chosen based on their high performance on the Kaggle prediction task. We will summarize the features, modeling approach, and performance of each solution in a table and comment on their approach and what makes them successful.
We have chosen the following three solutions for analysis:
| Solution | Features | Modeling Approach | Performance |
|---|---|---|---|
| [1] Stacked Regressions : Top 4% on LeaderBoard | Engineering features, Stacking models | Linear Regression, Ridge, Lasso, ElasticNet, Gradient Boosting, XGBoost, LightGBM | RMSE: 0.10630 |
| [2] A study on Regression applied to the Ames dataset | Feature selection, Feature engineering, Outlier removal, Cross-validation | Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, XGBoost | RMSE: 0.10664 |
| [3] Blend&Stack LR&GB = [0.10649] {House Prices} v57 | Engineering features, Stacking models | Linear Regression, Gradient Boosting, XGBoost, LightGBM | RMSE: 0.10649 |
[1]: Stacked Regressions : Top 4% on LeaderBoard | Kaggle
[2]: A study on Regression applied to the Ames dataset | Kaggle
[3]: Blend&Stack LR&GB = [0.10649] {House Prices} v57 | Kaggle
Solution 1
Solution 1 by Serigne is the best-performing of the three solutions, with an RMSE of 0.10630. It combines feature engineering with model stacking. The stacked models are Linear Regression, Ridge, Lasso, ElasticNet, Gradient Boosting, XGBoost, and LightGBM. The solution also performs outlier removal and normalization.
Solution 2
Solution 2 by JulienCS has an RMSE of 0.10664, the highest (weakest) of the three. It performs feature selection, feature engineering, outlier removal, and cross-validation. The models used are Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, and XGBoost.
Solution 3
Solution 3 by Itslek has an RMSE of 0.10649, which places it between the other two. It also combines feature engineering with model stacking. The models used are Linear Regression, Gradient Boosting, XGBoost, and LightGBM.
Comments on Successful Approaches
All three solutions rely on feature engineering, and Solutions 1 and 3 additionally stack models, a common and effective ensembling technique in machine learning. Solutions 1 and 3 share Linear Regression, Gradient Boosting, XGBoost, and LightGBM; Solution 1 also uses Ridge, Lasso, and ElasticNet, while Solution 2 uses a different mix that includes Random Forest. Solution 1 performs outlier removal and normalization, while Solution 2 performs feature selection, outlier removal, and cross-validation. These additional techniques can improve model performance.
In conclusion, successful approaches in the Kaggle House Prices competition involve a combination of stacking models, feature engineering, and additional techniques like outlier removal, normalization, feature selection, and cross-validation.
3 Data description and initial processing
In this section, we will perform an initial analysis of the data and prepare it for building a model.
3.1 Importing the Data
We will begin by importing the necessary libraries and loading the dataset into a Pandas DataFrame:
The dataset contains 1460 observations and 81 columns, including the target variable SalePrice. The test set contains 1459 observations and the same columns as the training set except for the SalePrice column.
3.2 Understanding the Target Variable
Let's start by examining the distribution of the target variable SalePrice. We will plot a histogram and calculate some basic statistics:
| count | 1460 |
| mean | 180921.2 |
| std | 79442.5 |
| min | 34900 |
| 25% | 129975 |
| 50% | 163000 |
| 75% | 214000 |
| max | 755000 |
Name: SalePrice, dtype: float64
The mean sale price is approximately 180,921 dollars, with a standard deviation of 79,442 dollars. The minimum sale price is 34,900 dollars, and the maximum sale price is 755,000 dollars. The distribution is right-skewed, with a long tail on the higher end of the price range.
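To make the notion of right skew concrete, the following sketch computes the skewness of a small price sample before and after a log transform. The prices are hypothetical values chosen to mimic a long right tail, and the helper `skewness` is written here for illustration; it is not taken from the appendix.

```python
import math

def skewness(values):
    """Population (biased) skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices with one expensive house, mimicking a long right tail.
prices = [100000, 120000, 130000, 150000, 160000, 180000, 210000, 750000]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(raw_skew, log_skew)
```

A positive value indicates a right tail. In this sample the log transform reduces the skewness, which is one reason the models in the appendix are fitted to log prices.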
3.3 Handling Missing Values
Next, we will check for missing values in the dataset and decide how to handle them. We will use the isnull() method to create a DataFrame of Boolean values indicating the presence or absence of missing values, then sum the number of missing values in each column:
| Variable | Missing | % |
|---|---|---|
| PoolQC | 1453 | 99.52055 |
| MiscFeature | 1406 | 96.30137 |
| Alley | 1369 | 93.76712 |
| Fence | 1179 | 80.75343 |
| FireplaceQu | 690 | 47.26027 |
| LotFrontage | 259 | 17.73973 |
| GarageType | 81 | 5.547945 |
| GarageYrBlt | 81 | 5.547945 |
| GarageFinish | 81 | 5.547945 |
| GarageQual | 81 | 5.547945 |
| GarageCond | 81 | 5.547945 |
| BsmtExposure | 38 | 2.60274 |
| BsmtFinType2 | 38 | 2.60274 |
| BsmtFinType1 | 37 | 2.534247 |
| BsmtCond | 37 | 2.534247 |
| BsmtQual | 37 | 2.534247 |
| MasVnrArea | 8 | 0.547945 |
| MasVnrType | 8 | 0.547945 |
| Electrical | 1 | 0.068493 |
This is a table of columns with missing values and their corresponding percentage of missing values. We can then handle missing values for each column.
From the output, we can see that the columns 'PoolQC', 'MiscFeature', 'Alley', and 'Fence' have a very high percentage of missing values (over 80%), and we will drop them from the dataset. For the other columns with missing values, we decide how to handle them based on their correlation with the target variable and the nature of the missing values. FireplaceQu is missing in about 47% of the rows: imputing it would distort its original distribution, while dropping it loses information; the appendix code drops it together with the four columns above. LotFrontage is missing in about 17.7% of the rows, and we fill it with the median. The remaining variables have few missing values (under 6% each).
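The drop-then-impute logic described above can be sketched without pandas as follows. The four rows are hypothetical, `None` stands in for a missing value (NaN in pandas), and the helpers `missing_percent` and `fill_with_median` are illustrative names rather than functions from the appendix.

```python
import statistics

# Hypothetical rows; None marks a missing value, as NaN does in pandas.
rows = [
    {"LotFrontage": 65.0, "PoolQC": None},
    {"LotFrontage": None, "PoolQC": None},
    {"LotFrontage": 80.0, "PoolQC": "Gd"},
    {"LotFrontage": 70.0, "PoolQC": None},
]

def missing_percent(rows, column):
    """Share of rows (in percent) where the column is missing."""
    missing = sum(1 for r in rows if r[column] is None)
    return 100.0 * missing / len(rows)

def fill_with_median(rows, column):
    """Return copies of the rows with missing values replaced by the median."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]

pool_missing = missing_percent(rows, "PoolQC")
frontage_missing = missing_percent(rows, "LotFrontage")

# Drop the mostly-missing column, then impute the numeric one.
kept = [{k: v for k, v in r.items() if k != "PoolQC"} for r in rows]
filled = fill_with_median(kept, "LotFrontage")
print(pool_missing, frontage_missing, filled[1])
```

The median is computed from the observed values only, so the imputed rows do not shift it.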
4 Modeling
In this section, we examine how the explanatory variables relate to the sale price of residential properties in Ames, Iowa, and compare the performance of several algorithms.
4.1 Data Preprocessing
First, I load the Ames Housing training file into a dataframe. In the appendix code, it is then split 80/20 into a training set (X_train, y_train) and a held-out set (X_test, y_test); the Id column is excluded from the features.
Next, the five columns with the largest share of missing values (PoolQC, MiscFeature, Alley, Fence, and FireplaceQu) are dropped from both sets. Remaining missing values in the numeric columns are filled with the median of the corresponding training column. This prepares the data for modeling.
Then, the categorical features in the dataframes are one-hot encoded using the OneHotEncoder from scikit-learn. This process converts categorical variables into a series of binary variables which makes it possible to include categorical data in statistical models more easily. The numerical features are scaled using the StandardScaler from scikit-learn. Scaling the numerical features makes sure that all features are on the same scale so that no single feature dominates the model.
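The two transformations can be illustrated by hand. The sketch below is a simplified stand-in for OneHotEncoder and StandardScaler, written in plain Python under the assumption of one categorical and one numeric column; the sample values and the function names `one_hot` and `standardize` are hypothetical.

```python
import statistics

def one_hot(values):
    """Map each category to a 0/1 indicator vector, categories sorted by name."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical values for one categorical and one numeric column.
categories, encoded = one_hot(["RL", "RM", "RL", "FV"])
scaled = standardize([1000.0, 2000.0, 3000.0])
print(categories, encoded, scaled)
```

After standardization the numeric column has mean zero and unit standard deviation, which is what keeps large-valued features such as lot area from dominating penalized models.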
4.2 Visualizing the Correlations
We now compute the correlation matrix and sort its entries to show the strongest and weakest pairwise relationships among the features and the target variable.
Top Correlations:
| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| GarageArea | GarageCars | 0.882475 |
| GarageCars | GarageArea | 0.882475 |
| GrLivArea | TotRmsAbvGrd | 0.825489 |
| TotRmsAbvGrd | GrLivArea | 0.825489 |
| TotalBsmtSF | 1stFlrSF | 0.81953 |
| 1stFlrSF | TotalBsmtSF | 0.81953 |
| OverallQual | SalePrice | 0.790982 |
| SalePrice | OverallQual | 0.790982 |
| YearBuilt | GarageYrBlt | 0.777182 |
| GarageYrBlt | YearBuilt | 0.777182 |
| GrLivArea | SalePrice | 0.708624 |
| SalePrice | GrLivArea | 0.708624 |
| GrLivArea | 2ndFlrSF | 0.687501 |
| 2ndFlrSF | GrLivArea | 0.687501 |
| TotRmsAbvGrd | BedroomAbvGr | 0.67662 |
| BedroomAbvGr | TotRmsAbvGrd | 0.67662 |
| BsmtFinSF1 | BsmtFullBath | 0.649212 |
| BsmtFullBath | BsmtFinSF1 | 0.649212 |
| GarageCars | SalePrice | 0.640409 |
| SalePrice | GarageCars | 0.640409 |
Bottom Correlations:
| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| BsmtFullBath | 3SsnPorch | 0.000106 |
| 3SsnPorch | BsmtFullBath | 0.000106 |
| Id | GarageYrBlt | 0.000122 |
| GarageYrBlt | Id | 0.000122 |
| LotFrontage | MiscVal | 0.000255 |
| MiscVal | LotFrontage | 0.000255 |
| BsmtHalfBath | TotalBsmtSF | 0.000315 |
| TotalBsmtSF | BsmtHalfBath | 0.000315 |
| MiscVal | 3SsnPorch | 0.000354 |
| 3SsnPorch | MiscVal | 0.000354 |
These tables list pairwise correlations between variables in the dataset as absolute values, which is how the appendix code computes them; each pair appears twice, once in each order. The most strongly correlated pairs are:
GarageArea and GarageCars: With a correlation of 0.882475, these variables are highly correlated: the larger the garage area, the more cars it can hold.
GrLivArea and TotRmsAbvGrd: With a correlation of 0.825489, these variables are highly correlated: the larger the ground living area, the more rooms above ground level.
TotalBsmtSF and 1stFlrSF: With a correlation of 0.819530, these variables are highly correlated: the larger the total basement square footage, the larger the first floor square footage.
OverallQual and SalePrice: With a correlation of 0.790982, these variables are highly correlated: the higher the overall quality of the house, the higher the sale price.
YearBuilt and GarageYrBlt: With a correlation of 0.777182, these variables are highly correlated: the year the house was built is closely related to the year the garage was built.
GrLivArea and SalePrice: With a correlation of 0.708624, these variables are moderately correlated: the larger the ground living area, the higher the sale price.
GarageCars and SalePrice: With a correlation of 0.640409, these variables are moderately correlated: the larger the garage capacity, the higher the sale price.
The pairs with the lowest correlations show essentially no linear relationship:
BsmtFullBath and 3SsnPorch: correlation of 0.000106.
Id and GarageYrBlt: correlation of 0.000122.
LotFrontage and MiscVal: correlation of 0.000255.
BsmtHalfBath and TotalBsmtSF: correlation of 0.000315.
MiscVal and 3SsnPorch: correlation of 0.000354.
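For reference, the correlation coefficient reported in these tables can be computed by hand as in the sketch below. The garage values are hypothetical and only mimic the GarageCars and GarageArea relationship; the second pair is a constructed example, and `pearson` is an illustrative helper rather than the pandas routine used in the appendix.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical garage data: capacity in cars and area in square feet.
garage_cars = [1, 2, 2, 3, 0]
garage_area = [280, 480, 520, 800, 0]
# A symmetric U-shaped relation: fully dependent, yet zero linear correlation.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]

r_garage = pearson(garage_cars, garage_area)
r_u_shape = pearson(x, y)
print(round(r_garage, 3), r_u_shape)
```

The second pair shows why a correlation near zero should be read as "no linear relationship" rather than "no relationship at all".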
4.3 Model Selection
After preprocessing the data, the code trains three models (Linear Regression, Ridge Regression, and XGBoost) on the training data. The performance of each model is evaluated using the root mean squared logarithmic error (RMSLE) metric. Each model is then assigned a weight based on its RMSLE, and the weighted average of the models' predictions is used as the final prediction.
Linear Regression: Linear regression is a simple linear approach to modeling the relationship between a dependent variable and one or more independent variables. In this code, linear regression is used to predict the sale price of houses based on various features such as the number of rooms, square footage, etc. The model learns a set of weights for each feature and combines them linearly to make the prediction. Linear regression is a common baseline model for regression problems.
Ridge Regression: Ridge regression is a regularized version of linear regression that introduces a penalty term to the loss function. This penalty term shrinks the learned weights towards zero, which helps prevent overfitting. In this code, ridge regression is used as an alternative to linear regression to further improve the model's performance.
XGBoost: XGBoost is a gradient boosting framework that uses decision trees as base learners. XGBoost builds an ensemble of decision trees iteratively, with each new tree aiming to correct the errors made by the previous trees. XGBoost has gained popularity in recent years due to its strong performance on many machine learning tasks, including regression. In this code, XGBoost is used as the non-linear model alongside the two linear models.
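The shrinkage effect of the ridge penalty can be seen in a one-feature example. The sketch below is a simplification, assuming a single feature and no intercept, in which case minimizing the squared error plus alpha times the squared weight gives the slope sum(x*y) / (sum(x*x) + alpha). The data and the function name `fit_slope` are illustrative and not taken from the appendix.

```python
def fit_slope(x, y, alpha=0.0):
    """Least-squares slope through the origin with an L2 penalty alpha * w**2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

# Hypothetical data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

w_ols = fit_slope(x, y)               # alpha = 0 is ordinary least squares
w_ridge = fit_slope(x, y, alpha=1.0)  # the penalty pulls the slope toward zero
print(w_ols, w_ridge)
```

With alpha set to zero the fit recovers the true slope; a positive alpha gives a smaller slope, which lowers variance at the cost of a slightly worse fit to the training data.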
4.4 Model Evaluation
This section evaluates three regression models: Linear Regression, Ridge Regression, and XGBoost Regression. The models are trained on the preprocessed training data, and their performance is measured using the root mean squared logarithmic error (RMSLE) metric. In the appendix code, the RMSLE is computed on the same data the models were trained on, so the values below measure fit rather than out-of-sample accuracy.
The first model trained is the Linear Regression model. The RMSLE for this model is 0.0930859803886687, the lowest of the three.
The second model trained is the Ridge Regression model. The RMSLE for this model is 0.09990443288510562. This value is slightly higher than that of the Linear Regression model, which is expected on training data because the ridge penalty trades some training fit for smaller weights; it is not by itself a sign of overfitting.
The third model trained is the XGBoost Regression model. The RMSLE for this model is 0.1303998403809394. This value is higher than that of the previous two models, indicating that, with the settings used, XGBoost fits this dataset less closely than the linear models.
Finally, a weighted average of the predictions of the three models is taken, with weights proportional to the inverse of each model's RMSLE, so that models with lower error contribute more. This is a common technique in ensemble modeling to combine the strengths of multiple models. The weighted average can perform better than the individual models, although this is not guaranteed.
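The inverse-error weighting can be sketched as follows, using the three RMSLE values reported above. The per-house predictions are hypothetical, and the helper names `inverse_error_weights` and `blend` are illustrative.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]

def blend(predictions, weights):
    """Weighted average of several models' predictions for one house."""
    return sum(p * w for p, w in zip(predictions, weights))

# RMSLE values reported above for Linear Regression, Ridge, and XGBoost.
rmsle = [0.0930859803886687, 0.09990443288510562, 0.1303998403809394]
weights = inverse_error_weights(rmsle)

# Hypothetical predictions of the three models for a single house.
price = blend([200000.0, 205000.0, 190000.0], weights)
print(weights, price)
```

Normalizing the raw RMSLE values instead of their inverses would give the largest weight to the model with the highest error, so the inversion step matters.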
After training the models and evaluating their performance, I analyzed the residual plots for each model. A residual is the difference between the predicted and the actual value; in the appendix code, the residuals on the held-out data are plotted against the observation index. A good residual plot should show no obvious pattern, which suggests that the model captures the underlying patterns in the data without systematic errors.
The residual plots show no obvious trend, and the four sets of predictions (the three models and the ensemble) look almost the same. This is a good sign, as it suggests that the models capture the patterns in the data without systematic errors. I also created box plots to compare the performance of the different models. The box plot shows the distribution of the residuals for each model. A good model will have a narrow distribution of residuals around zero.
Based on the box plots, I found that Linear Regression and the ensemble had the best performance, followed by the Ridge Regression model, and XGBoost had the worst performance. This suggests that the linear models (i.e., Linear Regression and Ridge Regression) are better suited for this dataset than the non-linear model (i.e., XGBoost).
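The quantities a box plot displays can also be computed directly. The sketch below summarizes residuals by their median and interquartile range for two hypothetical sets of predictions; the prices and the helper `residual_summary` are made up for illustration and do not reproduce the results above.

```python
import statistics

def residual_summary(y_true, y_pred):
    """Median and interquartile range of the residuals (predicted - actual)."""
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    return statistics.median(residuals), q3 - q1

# Hypothetical actual prices and two sets of predictions.
actual = [100000, 150000, 200000, 250000, 300000]
model_a = [102000, 149000, 201000, 248000, 300000]   # small errors
model_b = [120000, 130000, 215000, 230000, 310000]   # larger errors

median_a, iqr_a = residual_summary(actual, model_a)
median_b, iqr_b = residual_summary(actual, model_b)
print(median_a, iqr_a, median_b, iqr_b)
```

A median close to zero indicates little systematic bias, and a smaller interquartile range corresponds to the narrower box described above.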
Overall, the ensemble (i.e., the weighted average of the three models) and Linear Regression performed best among the models evaluated. However, model performance depends on the specific dataset and the problem being solved. Candidate models should therefore be compared on the data at hand, and the final choice should combine performance metrics with domain knowledge.
5 Appendix
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('train.csv')
data.head()
data.describe().to_excel('describe.xlsx')
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
print(train_data['SalePrice'].describe())
sns.histplot(train_data['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.show()
# Checking for missing values
missing = train_data.isnull().sum()
missing_perc = missing / len(train_data) * 100
missing_data = pd.concat([missing, missing_perc], axis=1, keys=['Missing', '%'])
missing_data = missing_data[missing_data['Missing'] > 0].sort_values(by='Missing', ascending=False)
print(missing_data)
# Drop columns with a high percentage of missing values
train_data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1, inplace=True)
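# Fill numeric columns with their medians (numeric_only avoids errors on text columns)
train_data.fillna(train_data.median(numeric_only=True), inplace=True)
# Correlation matrix (numeric columns only)
corr_matrix = train_data.corr(numeric_only=True)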
plt.figure(figsize=(12, 9))
sns.heatmap(corr_matrix, square=True)
plt.show()
# Sort correlation matrix in descending order and extract top values
corr_top = corr_matrix.abs().unstack().sort_values(ascending=False)
corr_top = corr_top[corr_top != 1] # Remove self-correlations
corr_top = pd.DataFrame(corr_top).reset_index()
corr_top.columns = ['Variable 1', 'Variable 2', 'Correlation']
corr_top_high = corr_top.head(20) # Extract top 20 rows (10 pairs, each listed in both orders)
print("Top Correlations:\n", corr_top_high)
# Sort correlation matrix in ascending order and extract bottom values
corr_bottom = corr_matrix.abs().unstack().sort_values()
corr_bottom = pd.DataFrame(corr_bottom).reset_index()
corr_bottom.columns = ['Variable 1', 'Variable 2', 'Correlation']
corr_bottom_low = corr_bottom.head(10) # Extract bottom 10 values
print("\nBottom Correlations:\n", corr_bottom_low)
corr_top_high.to_excel('tem1.xlsx')
corr_bottom_low.to_excel('tem2.xlsx')
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Load the data
train_df = pd.read_csv('train.csv')
X_train, X_test, y_train, y_test = train_test_split(train_df.drop(['Id', 'SalePrice'], axis=1),
train_df['SalePrice'],
test_size=0.2,
random_state=42)
# Drop columns with high percentage of missing values
X_train.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1, inplace=True)
X_test.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1, inplace=True)
# Fill missing values with the median value of each column
X_train.fillna(X_train.median(numeric_only=True), inplace=True)
X_test.fillna(X_train.median(numeric_only=True), inplace=True)
# Preprocess the data
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
enc = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler()
X_train_cat = enc.fit_transform(X_train[cat_cols]).toarray()
X_train_num = scaler.fit_transform(X_train[num_cols])
X_train = np.concatenate((X_train_cat, X_train_num), axis=1)
X_test_cat = enc.transform(X_test[cat_cols]).toarray()
X_test_num = scaler.transform(X_test[num_cols])
X_test = np.concatenate((X_test_cat, X_test_num), axis=1)
# Define the RMSLE function
def rmsle(y_true, y_pred):
    return np.sqrt(np.mean(np.square(np.log(y_pred + 1) - np.log(y_true + 1))))
# Train linear regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, np.log(y_train))
y_val_pred_linear = np.exp(linear_reg.predict(X_train))
y_linear = np.exp(linear_reg.predict(X_test))
rmsle_linear = rmsle(y_train, y_val_pred_linear)
print('Linear regression RMSLE:', rmsle_linear)
# Train ridge regression model
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, np.log(y_train))
y_val_pred_ridge = np.exp(ridge_reg.predict(X_train))
y_ridge = np.exp(ridge_reg.predict(X_test))
rmsle_ridge = rmsle(y_train, y_val_pred_ridge)
print('Ridge regression RMSLE:', rmsle_ridge)
# Train XGBoost model
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.05, max_depth=3)
xgb_reg.fit(X_train, np.log(y_train))
y_val_pred_xgb = np.exp(xgb_reg.predict(X_train))
y_xgb = np.exp(xgb_reg.predict(X_test))
rmsle_xgb = rmsle(y_train, y_val_pred_xgb)
print('XGBoost RMSLE:', rmsle_xgb)
# Weighted average of predictions, with weights proportional to 1 / RMSLE
# so that models with lower error contribute more
weights = 1.0 / np.array([rmsle_linear, rmsle_ridge, rmsle_xgb])
weights = weights / np.sum(weights)
y_pred = y_linear * weights[0] + y_ridge * weights[1] + y_xgb * weights[2]
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
# Plot y_pred vs. y_test
axs[0, 0].plot(y_pred - y_test, 'o', alpha=0.5)
axs[0, 0].set_xlabel('')
axs[0, 0].set_ylabel('y_pred - y_test')
axs[0, 0].set_title('Predicted vs. Actual (Ensemble)')
axs[0, 0].set_ylim([-100000, 100000])
# Plot y_linear vs. y_test
axs[0, 1].plot(y_linear - y_test, 'o', alpha=0.5)
axs[0, 1].set_xlabel('')
axs[0, 1].set_ylabel('y_linear - y_test')
axs[0, 1].set_title('Predicted vs. Actual (Linear Regression)')
axs[0, 1].set_ylim([-100000, 100000])
# Plot y_ridge vs. y_test
axs[1, 0].plot(y_ridge - y_test, 'o', alpha=0.5)
axs[1, 0].set_xlabel('')
axs[1, 0].set_ylabel('y_ridge - y_test')
axs[1, 0].set_title('Predicted vs. Actual (Ridge Regression)')
axs[1, 0].set_ylim([-100000, 100000])
# Plot y_xgb vs. y_test
axs[1, 1].plot(y_xgb - y_test, 'o', alpha=0.5)
axs[1, 1].set_xlabel('')
axs[1, 1].set_ylabel('y_xgb - y_test')
axs[1, 1].set_title('Predicted vs. Actual (XGBoost)')
axs[1, 1].set_ylim([-100000, 100000])
plt.tight_layout()
plt.show()
plt.boxplot([y_pred - y_test, y_linear - y_test, y_ridge - y_test, y_xgb - y_test], labels=['Ensemble', 'Linear Regression', 'Ridge Regression', 'XGBoost'])
plt.ylabel('Residual (predicted - actual)')
plt.title('Model Comparison')
plt.show()