
WOODSTOCK’18, June, 2018, El Paso, Texas USA


House Prices: Advanced Regression Techniques

FirstName Surname† Department Name Institution/University Name City State Country email@email.com

1 Executive summary

The Kaggle competition "House Prices: Advanced Regression Techniques" provides a dataset of residential homes in Ames, Iowa. Each home is described by features such as its size, number of bedrooms and bathrooms, location, and other factors that might influence the sale price. The goal of the competition is to predict the final sale price of each home in the test set with a model fitted on the training set (source: House Prices - Advanced Regression Techniques | Kaggle).

The dataset includes 79 explanatory variables (features) describing various aspects of the residential homes, such as the overall quality of materials and finish, the number of fireplaces, the presence of a pool, and many more. Additionally, there are two columns in the dataset containing the ID of each house and the sale price that we aim to predict.

Because space is limited, summary statistics are listed here for only a subset of the variables.

| Statistic | SalePrice | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt |
|---|---|---|---|---|---|---|---|
| mean | 180921.2 | 56.8972 | 70.04996 | 10516.8 | 6.099315 | 5.575342 | 1971.2 |
| std | 79442.5 | 42.3005 | 24.28475 | 9981.26 | 1.382997 | 1.112799 | 30.202 |
| min | 34900 | 20 | 21 | 1300 | 1 | 1 | 1872 |
| 25% | 129975 | 20 | 59 | 7553.5 | 5 | 5 | 1954 |
| 50% | 163000 | 50 | 69 | 9478.5 | 6 | 5 | 1973 |
| 75% | 214000 | 70 | 80 | 11601.5 | 7 | 6 | 2000 |
| max | 755000 | 190 | 313 | 215245 | 10 | 9 | 2010 |

| Statistic | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF |
|---|---|---|---|---|---|---|---|
| mean | 1984.866 | 103.6853 | 443.6397 | 46.54932 | 567.2404 | 1057.429 | 1162.62 |
| std | 20.64541 | 181.0662 | 456.0981 | 161.3193 | 441.867 | 438.7053 | 386.587 |
| min | 1950 | 0 | 0 | 0 | 0 | 0 | 334 |
| 25% | 1967 | 0 | 0 | 0 | 223 | 795.75 | 882 |
| 50% | 1994 | 0 | 383.5 | 0 | 477.5 | 991.5 | 1087 |
| 75% | 2004 | 166 | 712.25 | 0 | 808 | 1298.25 | 1391.25 |
| max | 2010 | 1600 | 5644 | 1474 | 2336 | 6110 | 4692 |

The dataset requires pre-processing and cleaning, as there are missing values, categorical features, and outliers. Participants can use a variety of regression techniques to make their predictions, such as linear regression, decision trees, random forests, gradient boosting, and neural networks.

The competition provides a public leaderboard that evaluates participants' models on a portion of the test data. Submissions are scored by the root mean squared error (RMSE) between the logarithm of the predicted sale price and the logarithm of the actual sale price, so that errors on cheap and expensive houses count equally. The final evaluation uses the remainder of the test data, and the winner is the participant with the lowest RMSE on that set.
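As a minimal sketch of this metric, the function below computes the RMSE between log prices. The function name `rmse_log` and the example prices are illustrative assumptions and are not taken from the competition code.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log prices (illustrative helper; the name is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices: every prediction is 10% too high.
actual = np.array([100_000.0, 200_000.0, 400_000.0])
predicted = actual * 1.10
score = rmse_log(actual, predicted)
```

Because the error is taken on the log scale, a 10% miss contributes the same amount for a cheap house as for an expensive one.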

2 Benchmarking of other solutions

In this section, we analyze three Kaggle solutions for the House Prices competition, chosen for their high performance on the prediction task. We summarize the features, modeling approach, and performance of each solution in a table and then comment on what makes each approach successful.

We have chosen the following three solutions for analysis:

| Solution | Features | Modeling Approach | Performance |
|---|---|---|---|
| Stacked Regressions : Top 4% on LeaderBoard | Engineering features, Stacking models | Linear Regression, Ridge, Lasso, ElasticNet, Gradient Boosting, XGBoost, LightGBM | RMSE: 0.10630 |
| A study on Regression applied to the Ames dataset | Feature selection, Feature engineering, Outlier removal, Cross-validation | Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, XGBoost | RMSE: 0.10664 |
| Blend&Stack LR&GB = [0.10649] {House Prices} v57 | Engineering features, Stacking models | Linear Regression, Gradient Boosting, XGBoost, LightGBM | RMSE: 0.10649 |

[1]: Stacked Regressions : Top 4% on LeaderBoard | Kaggle

[2]: A study on Regression applied to the Ames dataset | Kaggle

[3]: Blend&Stack LR&GB = [0.10649] {House Prices} v57 | Kaggle

Solution 1

Solution 1, by Serigne, is the best-performing of the three solutions, with an RMSE of 0.10630 (top 4% of the leaderboard, according to its title). It combines feature engineering with model stacking. The stacked models are Linear Regression, Ridge, Lasso, ElasticNet, Gradient Boosting, XGBoost, and LightGBM. The solution also performs outlier removal and normalization.

Solution 2

Solution 2, by JulienCS, has the highest RMSE of the three (0.10664) and therefore ranks third. It performs feature selection, feature engineering, outlier removal, and cross-validation. The models used are Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, and XGBoost.

Solution 3

Solution 3, by Itslek, ranks second with an RMSE of 0.10649. Like Solution 1, it combines feature engineering with model stacking. The models used are Linear Regression, Gradient Boosting, XGBoost, and LightGBM.

Comments on Successful Approaches

All three solutions rely on feature engineering, and Solutions 1 and 3 additionally stack several models, a common and effective ensemble technique. The models in Solution 3 are a subset of those in Solution 1, while Solution 2 is the only one that includes a Random Forest. Solution 1 performs outlier removal and normalization, while Solution 2 performs feature selection, outlier removal, and cross-validation. These additional techniques can improve model performance.

In conclusion, successful approaches in the Kaggle House Prices competition involve a combination of stacking models, feature engineering, and additional techniques like outlier removal, normalization, feature selection, and cross-validation.
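To make the idea of stacking concrete, the following is a minimal sketch using scikit-learn's StackingRegressor. It is not the code of any of the three solutions: the synthetic data, the choice of base models, and all hyperparameters are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for engineered housing features (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The base learners produce out-of-fold predictions (cv=5); the final
# estimator, a Ridge model here, learns how to combine them.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("lasso", Lasso(alpha=0.1)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)
stacked_pred = stack.predict(X_te)
r2 = stack.score(X_te, y_te)  # coefficient of determination on the held-out part
```

Training the final estimator on out-of-fold predictions keeps it from simply rewarding the base model that fits the training rows most closely.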

3 Data description and initial processing

In this section, we will perform an initial analysis of the data and prepare it for building a model.

3.1 Importing the Data

We begin by importing the necessary libraries and loading the dataset into a Pandas DataFrame (the code is listed in the Appendix).

The training set contains 1460 observations and 81 columns, including the target variable SalePrice. The test set contains 1459 observations and the same columns as the training set except for the SalePrice column.

3.2 Understanding the Target Variable

Let's start by examining the distribution of the target variable SalePrice. We will plot a histogram and calculate some basic statistics:

count      1460
mean     180921.2
std       79442.5
min       34900
25%      129975
50%      163000
75%      214000
max      755000
Name: SalePrice, dtype: float64

The mean sale price is approximately 180,921 dollars, with a standard deviation of 79,442 dollars. The minimum sale price is 34,900 dollars, and the maximum sale price is 755,000 dollars. The distribution is right-skewed, with a long tail on the higher end of the price range.
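A right-skewed target is commonly modeled on the log scale, which is also what the Appendix code does when it fits the models to the logarithm of SalePrice. The sketch below illustrates the effect on synthetic log-normal "prices". The data are generated, not taken from the Ames dataset, and the distribution parameters are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal); not the Ames data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name="SalePrice")

skew_raw = prices.skew()         # clearly positive for a right-skewed distribution
log_prices = np.log1p(prices)    # log(1 + x), a common transform for skewed targets
skew_log = log_prices.skew()     # much closer to zero after the transform

# np.expm1 inverts np.log1p, so predictions can be mapped back to dollars.
recovered = np.expm1(log_prices)
```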

3.3 Handling Missing Values

Next, we will check for missing values in the dataset and decide how to handle them. We will use the isnull() method to create a DataFrame of Boolean values indicating the presence or absence of missing values, then sum the number of missing values in each column:

| Variable | Missing | % |
|---|---|---|
| PoolQC | 1453 | 99.52055 |
| MiscFeature | 1406 | 96.30137 |
| Alley | 1369 | 93.76712 |
| Fence | 1179 | 80.75343 |
| FireplaceQu | 690 | 47.26027 |
| LotFrontage | 259 | 17.73973 |
| GarageType | 81 | 5.547945 |
| GarageYrBlt | 81 | 5.547945 |
| GarageFinish | 81 | 5.547945 |
| GarageQual | 81 | 5.547945 |
| GarageCond | 81 | 5.547945 |
| BsmtExposure | 38 | 2.60274 |
| BsmtFinType2 | 38 | 2.60274 |
| BsmtFinType1 | 37 | 2.534247 |
| BsmtCond | 37 | 2.534247 |
| BsmtQual | 37 | 2.534247 |
| MasVnrArea | 8 | 0.547945 |
| MasVnrType | 8 | 0.547945 |
| Electrical | 1 | 0.068493 |

The table lists every column that has missing values, together with the number and percentage of missing entries. We then decide how to handle the missing values column by column.

From the output, we can see that PoolQC, MiscFeature, Alley, and Fence are missing in more than 80% of the rows, so we drop them from the dataset. For the other columns, the treatment depends on their correlation with the target variable and on the nature of the missing values. About 47% of the FireplaceQu values are missing: imputing them would distort the distribution of the feature, while dropping the column loses information. In the Appendix code, FireplaceQu is dropped together with the four columns above. LotFrontage is missing in about 17.7% of the rows, and we fill these gaps with the column median. The remaining variables have few missing values (under 6% each).
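The two treatments described above, dropping sparse columns and median imputation, can be sketched on a toy frame. The values below are made up, and the 50% threshold is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the Ames data (values are made up).
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Share of missing values per column, in percent.
missing_pct = df.isnull().mean() * 100

# Drop columns above the (assumed) threshold, then fill LotFrontage with its median.
threshold = 50.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()
cleaned = df.drop(columns=to_drop)
cleaned["LotFrontage"] = cleaned["LotFrontage"].fillna(cleaned["LotFrontage"].median())
```

The median is used rather than the mean because it is not pulled upward by a few very large lots.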

4 Modeling

In this section, we examine how the explanatory variables relate to the sale price of residential properties in Ames, Iowa, and compare the performance of several algorithms.

4.1 Data Preprocessing

First, the Ames Housing training file (train.csv) is loaded into a dataframe. It is then split 80/20 into a training part (features X_train, target y_train) and a held-out part (features X_test, target y_test), so that predictions can be compared with known sale prices.

Next, the columns with the most missing values are dropped from both parts: PoolQC, MiscFeature, Alley, and Fence (each more than 80% missing) and FireplaceQu (about 47% missing). Missing values that remain in the numerical columns are filled with the median of the respective column, computed on the training part. This process ensures that the data is ready for modeling.

Then, the categorical features are one-hot encoded with scikit-learn's OneHotEncoder, which converts each categorical variable into a set of binary indicator variables so that it can be used in a regression model. The numerical features are standardized with scikit-learn's StandardScaler, so that all features are on the same scale and no single feature dominates the model.
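The same preprocessing steps can also be expressed as a single scikit-learn transformer. This is a sketch of an alternative arrangement, not the Appendix code: the toy values are made up, and the use of ColumnTransformer and SimpleImputer is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; the column names follow the Ames dataset, the values are made up.
X = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "OverallQual": [7, 6, 7, 8],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})
num_cols = ["LotArea", "OverallQual"]
cat_cols = ["Street"]

preprocess = ColumnTransformer(
    [
        # Numeric: median imputation, then zero mean and unit variance.
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), num_cols),
        # Categorical: one binary column per level; unseen levels become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(X)
```

Fitting the transformer on the training part only and calling `transform` on the held-out part keeps the medians, scaling constants, and category lists free of held-out information.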

4.2 Visualizing the Correlations

We now compute the correlation matrix and sort its entries to examine the relationships among the features and between the features and the target variable.

Top Correlations:

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| GarageArea | GarageCars | 0.882475 |
| GarageCars | GarageArea | 0.882475 |
| GrLivArea | TotRmsAbvGrd | 0.825489 |
| TotRmsAbvGrd | GrLivArea | 0.825489 |
| TotalBsmtSF | 1stFlrSF | 0.81953 |
| 1stFlrSF | TotalBsmtSF | 0.81953 |
| OverallQual | SalePrice | 0.790982 |
| SalePrice | OverallQual | 0.790982 |
| YearBuilt | GarageYrBlt | 0.777182 |
| GarageYrBlt | YearBuilt | 0.777182 |
| GrLivArea | SalePrice | 0.708624 |
| SalePrice | GrLivArea | 0.708624 |
| GrLivArea | 2ndFlrSF | 0.687501 |
| 2ndFlrSF | GrLivArea | 0.687501 |
| TotRmsAbvGrd | BedroomAbvGr | 0.67662 |
| BedroomAbvGr | TotRmsAbvGrd | 0.67662 |
| BsmtFinSF1 | BsmtFullBath | 0.649212 |
| BsmtFullBath | BsmtFinSF1 | 0.649212 |
| GarageCars | SalePrice | 0.640409 |
| SalePrice | GarageCars | 0.640409 |

Bottom Correlations:

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| BsmtFullBath | 3SsnPorch | 0.000106 |
| 3SsnPorch | BsmtFullBath | 0.000106 |
| Id | GarageYrBlt | 0.000122 |
| GarageYrBlt | Id | 0.000122 |
| LotFrontage | MiscVal | 0.000255 |
| MiscVal | LotFrontage | 0.000255 |
| BsmtHalfBath | TotalBsmtSF | 0.000315 |
| TotalBsmtSF | BsmtHalfBath | 0.000315 |
| MiscVal | 3SsnPorch | 0.000354 |
| 3SsnPorch | MiscVal | 0.000354 |

The tables list absolute pairwise correlations between the numerical variables. Each pair appears twice because the correlation matrix is symmetric. The most strongly correlated pairs are:

GarageArea and GarageCars (0.882475): the larger the garage area, the more cars it can hold.

GrLivArea and TotRmsAbvGrd (0.825489): the larger the above-ground living area, the more rooms there are above ground level.

TotalBsmtSF and 1stFlrSF (0.819530): the larger the total basement square footage, the larger the first-floor square footage.

OverallQual and SalePrice (0.790982): the higher the overall quality of the house, the higher the sale price. This is the strongest correlation with the target variable.

YearBuilt and GarageYrBlt (0.777182): the year the garage was built is closely related to the year the house was built.

GrLivArea and SalePrice (0.708624): a moderately strong correlation, indicating that the larger the above-ground living area, the higher the sale price.

GarageCars and SalePrice (0.640409): a moderate correlation, indicating that the larger the garage capacity, the higher the sale price.

The most weakly correlated pairs all have absolute correlations below 0.001 and can be treated as uncorrelated:

BsmtFullBath and 3SsnPorch (0.000106).

Id and GarageYrBlt (0.000122). Id is only a row identifier, so no relationship is expected.

LotFrontage and MiscVal (0.000255).

BsmtHalfBath and TotalBsmtSF (0.000315).

MiscVal and 3SsnPorch (0.000354).
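Each pair is listed twice in the tables above because the correlation matrix is symmetric. One way to list every pair only once is to keep the strict upper triangle of the matrix, as in the following sketch. The toy columns A, B, and C are assumptions and stand in for the numerical Ames columns.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (assumption): B is nearly a copy of A, C is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "A": a,
    "B": a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})

corr = df.corr().abs()

# Keep only the strict upper triangle, so that (A, B) is not repeated as (B, A)
# and the diagonal of ones is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper)   # masked entries become NaN
        .stack()
        .dropna()       # make sure the masked entries are removed
        .sort_values(ascending=False)
        .reset_index()
)
pairs.columns = ["Variable 1", "Variable 2", "Correlation"]
```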

4.3 Model Selection

After preprocessing, three models are trained on the training data: Linear Regression, Ridge Regression, and XGBoost. Each model is evaluated with the root mean squared logarithmic error (RMSLE). Each model then receives a weight derived from its RMSLE, and the weighted average of the three models' predictions is used as the final prediction.

Linear Regression: Linear regression is a simple linear approach to modeling the relationship between a dependent variable and one or more independent variables. In this code, linear regression is used to predict the sale price of houses based on various features such as the number of rooms, square footage, etc. The model learns a set of weights for each feature and combines them linearly to make the prediction. Linear regression is a common baseline model for regression problems.

Ridge Regression: Ridge regression is a regularized version of linear regression that introduces a penalty term to the loss function. This penalty term shrinks the learned weights towards zero, which helps prevent overfitting. In this code, ridge regression is used as an alternative to linear regression to further improve the model's performance.

XGBoost: XGBoost is a gradient boosting framework that uses decision trees as base learners. XGBoost builds an ensemble of decision trees iteratively, with each new tree aiming to correct the errors made by the previous trees. XGBoost has gained popularity in recent years due to its strong performance on many machine learning tasks, including regression. In this code, XGBoost serves as the non-linear model alongside the two linear models.
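For reference, the two linear models described above differ only in the penalty term. With feature matrix X, target vector y (the log sale prices in the Appendix code), weight vector w, and regularization strength alpha (alpha=1.0 in the Appendix code), the standard objectives can be written as follows. The intercept is omitted for brevity, and the notation is an assumption introduced here.

```latex
\hat{w}_{\mathrm{OLS}} = \arg\min_{w} \lVert y - Xw \rVert_2^2 ,
\qquad
\hat{w}_{\mathrm{ridge}} = \arg\min_{w} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 .
```

Setting alpha to zero recovers ordinary least squares, and larger values of alpha shrink the weights more strongly toward zero.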

4.4 Model Evaluation

This section evaluates three regression models: Linear Regression, Ridge Regression, and XGBoost. The models are fitted to the logarithm of the sale price on the preprocessed training data, and their performance is measured with the root mean squared logarithmic error (RMSLE). In the Appendix code, the RMSLE is computed on the same training split that is used for fitting, so the values below measure goodness of fit rather than generalization.

The first model is Linear Regression, with an RMSLE of 0.0930859803886687. This is the lowest error of the three models.

The second model is Ridge Regression, with an RMSLE of 0.09990443288510562. This is slightly higher than for Linear Regression, which is expected on the training data: the L2 penalty shrinks the weights and trades some training fit for stability, so the difference does not by itself indicate overfitting.

The third model is XGBoost, with an RMSLE of 0.1303998403809394, the highest of the three. This may reflect the conservative settings used (100 trees of depth 3 with a learning rate of 0.05) rather than a limitation of the method.

Finally, a weighted average of the three models' predictions is taken, with weights proportional to the inverse of each model's RMSLE and normalized to sum to one, so that models with a lower error contribute more. This is a common technique in ensemble modeling to combine the strengths of multiple models. Such an average often performs better than the individual models, although this is not guaranteed.
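The weighting scheme can be sketched with the three RMSLE values reported in this section. The predictions in the example are made-up numbers used only to show the arithmetic.

```python
import numpy as np

# RMSLE values reported in this section (linear regression, ridge, XGBoost).
rmsle_scores = np.array([0.0930859803886687, 0.09990443288510562, 0.1303998403809394])

# Inverse-error weights: a lower RMSLE gives a larger weight; normalize to sum to 1.
inverse = 1.0 / rmsle_scores
weights = inverse / inverse.sum()

# Hypothetical price predictions of the three models for two houses.
preds = np.array([
    [200_000.0, 150_000.0],   # linear regression
    [204_000.0, 148_000.0],   # ridge regression
    [190_000.0, 160_000.0],   # XGBoost
])
ensemble = weights @ preds    # weighted average per house
```

Because the three errors are of similar size, the resulting weights are fairly close to one third each, with the linear regression weighted highest.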

After training the models and evaluating their performance, I analyzed the residual plots for each model. A residual plot is a scatter plot of the residuals (i.e., the differences between the predicted and actual values) against the predicted values. A good residual plot should show no obvious pattern, which indicates that the model is capturing the underlying patterns in the data and there are no systematic errors.

Based on the residual plots, I found no obvious trend, and the four sets of residuals (the three models and the ensemble) looked almost the same. This is a good sign, as it indicates that the models are capturing the patterns in the data and there are no systematic errors. I also created box plots to compare the performance of the different models. The box plot shows the distribution of the residuals for each model. A good model will have a narrow distribution of residuals around zero.

Based on the box plots, Linear Regression and the ensemble had the best performance, followed by Ridge Regression, while XGBoost had the worst performance. This suggests that, at least with the settings used here, the linear models (i.e., Linear Regression and Ridge Regression) are better suited to this dataset than the non-linear model (i.e., XGBoost).

Overall, the ensemble (i.e., the weighted average of the three models) and Linear Regression performed best among the models evaluated. However, model performance depends on the dataset and the problem being solved. Candidate models should therefore be evaluated on the specific dataset and chosen on the basis of performance metrics and domain knowledge.
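In the Appendix code, the RMSLE of each model is computed on the rows the model was fitted on. A held-out estimate could be obtained with cross-validation, as in the sketch below. The synthetic data, the scaling of the log prices, and the choice of five folds are assumptions, and the resulting numbers are not comparable to those reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the preprocessed features and the sale prices (assumption).
X, signal = make_regression(n_samples=300, n_features=20, noise=5.0, random_state=0)
log_price = 12.0 + 0.4 * (signal - signal.mean()) / signal.std()
price = np.exp(log_price)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, model in {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}.items():
    # Every prediction comes from a model that did not see that row during fitting.
    oof_log_pred = cross_val_predict(model, X, np.log(price), cv=cv)
    cv_scores[name] = rmsle(price, np.exp(oof_log_pred))
```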

5 Appendix

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training and test sets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Preview the data and export the summary statistics used in the report tables
train_data.head()
train_data.describe().to_excel('describe.xlsx')

# Distribution of the target variable
print(train_data['SalePrice'].describe())
sns.histplot(train_data['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.show()

# Checking for missing values
missing = train_data.isnull().sum()
missing_perc = missing / len(train_data) * 100
missing_data = pd.concat([missing, missing_perc], axis=1, keys=['Missing', '%'])
missing_data = missing_data[missing_data['Missing'] > 0].sort_values(by='Missing', ascending=False)
print(missing_data)

# Drop columns with a high percentage of missing values
train_data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1, inplace=True)

# Fill the remaining numeric gaps with the column median
# (numeric_only=True skips the text columns, which have no median)
train_data.fillna(train_data.median(numeric_only=True), inplace=True)

# Correlation matrix of the numeric columns
corr_matrix = train_data.corr(numeric_only=True)
plt.figure(figsize=(12, 9))
sns.heatmap(corr_matrix, square=True)
plt.show()

# Sort absolute correlations in descending order and extract the top values
corr_top = corr_matrix.abs().unstack().sort_values(ascending=False)
corr_top = corr_top[corr_top != 1]  # Remove self-correlations
corr_top = pd.DataFrame(corr_top).reset_index()
corr_top.columns = ['Variable 1', 'Variable 2', 'Correlation']
corr_top_high = corr_top.head(20)  # Top 10 pairs; each pair appears twice, as (A, B) and (B, A)
print("Top Correlations:\n", corr_top_high)

# Sort absolute correlations in ascending order and extract the bottom values
corr_bottom = corr_matrix.abs().unstack().sort_values()
corr_bottom = pd.DataFrame(corr_bottom).reset_index()
corr_bottom.columns = ['Variable 1', 'Variable 2', 'Correlation']
corr_bottom_low = corr_bottom.head(10)  # Bottom 5 pairs; each pair appears twice
print("\nBottom Correlations:\n", corr_bottom_low)

# Export both tables for the report
corr_top_high.to_excel('tem1.xlsx')
corr_bottom_low.to_excel('tem2.xlsx')

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% of the rows for evaluation
train_df = pd.read_csv('train.csv')
X_train, X_test, y_train, y_test = train_test_split(
    train_df.drop(['Id', 'SalePrice'], axis=1),
    train_df['SalePrice'],
    test_size=0.2,
    random_state=42,
)

# Drop columns with high percentage of missing values
# (reassigning avoids modifying a slice of train_df in place)
drop_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']
X_train = X_train.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)

# Fill missing numeric values with the training-set median of each column
train_medians = X_train.median(numeric_only=True)
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)

# Preprocess the data: one-hot encode categorical columns, standardize numeric ones
cat_cols = X_train.select_dtypes(include=['object']).columns.tolist()
num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
enc = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler()

# Fit the encoder and scaler on the training split only, then apply them to both splits
X_train_cat = enc.fit_transform(X_train[cat_cols]).toarray()
X_train_num = scaler.fit_transform(X_train[num_cols])
X_train = np.concatenate((X_train_cat, X_train_num), axis=1)

X_test_cat = enc.transform(X_test[cat_cols]).toarray()
X_test_num = scaler.transform(X_test[num_cols])
X_test = np.concatenate((X_test_cat, X_test_num), axis=1)

# Define the RMSLE function
def rmsle(y_true, y_pred):
    return np.sqrt(np.mean(np.square(np.log(y_pred + 1) - np.log(y_true + 1))))

# Train linear regression model on log prices
# Note: the RMSLE values below are computed on the training split, not on held-out data
linear_reg = LinearRegression()
linear_reg.fit(X_train, np.log(y_train))
y_val_pred_linear = np.exp(linear_reg.predict(X_train))
y_linear = np.exp(linear_reg.predict(X_test))
rmsle_linear = rmsle(y_train, y_val_pred_linear)
print('Linear regression RMSLE:', rmsle_linear)

# Train ridge regression model
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train, np.log(y_train))
y_val_pred_ridge = np.exp(ridge_reg.predict(X_train))
y_ridge = np.exp(ridge_reg.predict(X_test))
rmsle_ridge = rmsle(y_train, y_val_pred_ridge)
print('Ridge regression RMSLE:', rmsle_ridge)

# Train XGBoost model
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.05, max_depth=3)
xgb_reg.fit(X_train, np.log(y_train))
y_val_pred_xgb = np.exp(xgb_reg.predict(X_train))
y_xgb = np.exp(xgb_reg.predict(X_test))
rmsle_xgb = rmsle(y_train, y_val_pred_xgb)
print('XGBoost RMSLE:', rmsle_xgb)

# Weighted average of predictions: the weights are the normalized inverse RMSLEs,
# so the model with the lowest error gets the largest weight
weights = 1.0 / np.array([rmsle_linear, rmsle_ridge, rmsle_xgb])
weights = weights / np.sum(weights)
y_pred = y_linear * weights[0] + y_ridge * weights[1] + y_xgb * weights[2]

import matplotlib.pyplot as plt

# Residuals (predicted minus actual sale price) on the held-out split
residuals = {
    'Ensemble': y_pred - y_test,
    'Linear Regression': y_linear - y_test,
    'Ridge Regression': y_ridge - y_test,
    'XGBoost': y_xgb - y_test,
}

# One residual panel per model, plotted against the row index
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
for ax, (name, res) in zip(axs.ravel(), residuals.items()):
    ax.plot(res, 'o', alpha=0.5)
    ax.set_xlabel('Row index')
    ax.set_ylabel('Predicted - actual SalePrice')
    ax.set_title(f'Residuals ({name})')
    ax.set_ylim([-100000, 100000])
plt.tight_layout()
plt.show()

# Box plots of the signed residuals for each model
plt.boxplot(list(residuals.values()))
plt.xticks(range(1, len(residuals) + 1), list(residuals.keys()))
plt.ylabel('Residual (predicted - actual)')
plt.title('Model Comparison')
plt.show()
