This repository contains a machine learning project aimed at predicting house prices based on various features. The project is inspired by the Kaggle House Prices - Advanced Regression Techniques competition.
- **Exploratory Data Analysis (EDA):**
  - Checked dataset structure and summary statistics using `df.info()` and `df.describe()`.
  - Analyzed missing values, identifying features with significant null entries such as `LotFrontage` and `Alley`.
  - Examined unique values of categorical features (e.g., `MSZoning`, `Street`) to plan for encoding.
  - Visualized key insights:
    - Heatmap to explore correlations between numerical features and `SalePrice`.
    - Distribution analysis of `SalePrice`, revealing positive skewness.
    - Scatterplots and boxplots to analyze relationships between features and `SalePrice`.
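The EDA steps above can be sketched with pandas. The tiny frame below is an illustrative stand-in for the Kaggle `train.csv` (the real project loads all 79 features); the column names follow the competition data.

```python
import numpy as np
import pandas as pd

# Stand-in for the Kaggle training data; in the project this would be
# pd.read_csv("train.csv") with the full feature set.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, "Pave"],
    "MSZoning": ["RL", "RL", "RM", "RL", "FV"],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Structure and summary statistics
df.info()
print(df.describe())

# Missing-value counts, highest first (surfaces LotFrontage, Alley, ...)
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# Unique values of a categorical feature, to plan encoding
print(df["MSZoning"].value_counts())

# Correlations of numerical features with SalePrice (basis of the heatmap)
corr = df.select_dtypes(include="number").corr()["SalePrice"].sort_values(ascending=False)
print(corr)

# Skewness of the target, confirming the positive skew
print(df["SalePrice"].skew())
```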
- **Handling Missing and Categorical Data:**
  - Missing values in features like `PoolQC`, `Alley`, and `GarageType` were replaced with meaningful defaults (e.g., "NoFeature") based on feature context.
  - Numerical features with missing values, such as `LotFrontage`, were imputed using medians.
  - Categorical features were encoded:
    - Nominal features using one-hot encoding to create binary columns.
    - Ordinal features using ordinal encoding based on a predefined category order.
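A minimal sketch of these imputation and encoding steps, using a small illustrative frame (the quality ordering `Po < Fa < TA < Gd < Ex` follows the Kaggle data description; the exact set of columns handled this way in the project may differ):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative frame; the real project applies this to the full Kaggle data.
df = pd.DataFrame({
    "PoolQC": [None, "Gd", None],          # missing means "no pool"
    "LotFrontage": [65.0, None, 80.0],
    "MSZoning": ["RL", "RM", "RL"],        # nominal
    "ExterQual": ["TA", "Gd", "Ex"],       # ordinal
})

# Context-aware defaults for features like PoolQC, Alley, GarageType
df["PoolQC"] = df["PoolQC"].fillna("NoFeature")

# Median imputation for numerical gaps such as LotFrontage
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

# Nominal features -> one-hot binary columns
df = pd.get_dummies(df, columns=["MSZoning"])

# Ordinal features -> integer codes in the predefined quality order
quality_order = [["Po", "Fa", "TA", "Gd", "Ex"]]
df["ExterQual"] = OrdinalEncoder(categories=quality_order).fit_transform(
    df[["ExterQual"]]).ravel()

print(df)
```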
- **Feature Engineering:**
  - Addressed skewness in numerical features using log transformations to normalize distributions.
  - Created new features to enhance predictive potential:
    - `TotalBathrooms`: combines all bathroom-related features into one.
    - `TotalSF`: total usable square footage of the property.
    - `GrLivAreaToLotArea`: ratio of above-ground living area to lot area.
    - `GarageAreaToLotArea`: garage size relative to lot area.
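One way these engineered features can be derived (the component columns and the 0.5 weighting of half baths are assumptions about the project's exact formulas; the column names come from the Kaggle data):

```python
import numpy as np
import pandas as pd

# Two illustrative rows with the raw Kaggle columns involved
df = pd.DataFrame({
    "FullBath": [2, 1], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
    "TotalBsmtSF": [856, 1262], "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0],
    "GrLivArea": [1710, 1262], "GarageArea": [548, 460], "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
})

# Combined bathroom count (half baths weighted 0.5 -- an assumed convention)
df["TotalBathrooms"] = (df["FullBath"] + 0.5 * df["HalfBath"]
                        + df["BsmtFullBath"] + 0.5 * df["BsmtHalfBath"])

# Total usable square footage
df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]

# Ratio features relating living/garage area to lot size
df["GrLivAreaToLotArea"] = df["GrLivArea"] / df["LotArea"]
df["GarageAreaToLotArea"] = df["GarageArea"] / df["LotArea"]

# log1p to tame the positive skew of the target (and skewed numeric features)
df["SalePrice_log"] = np.log1p(df["SalePrice"])
```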
- **Scaling Numerical Features:**
  - Standardized all numerical features using `StandardScaler` so that each has a mean of 0 and a standard deviation of 1.
  - This step improves model compatibility, especially for algorithms sensitive to feature magnitudes (e.g., SVM, k-NN).
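The scaling step in miniature (the matrix here is a stand-in for the project's numeric feature block; note the scaler should be fit on the training split only and then applied to the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative numeric feature matrix (e.g., GrLivArea, LotArea columns)
X = np.array([[1710.0, 8450.0],
              [1262.0, 9600.0],
              [1786.0, 11250.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on train only, then transform test

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # 1 per column
```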
- **Model Training and Evaluation (TensorFlow & scikit-learn):**
  - Several regression models were trained and evaluated using metrics including R², RMSE, and RMSLE. The scikit-learn models scored as follows:

    | Model | Default | RandomizedSearchCV | GridSearchCV |
    | --- | --- | --- | --- |
    | Ridge Regression | 0.806927 | N/A | 0.806529 |
    | LightGBM | 0.882784 | 0.889637 | 0.888717 |
    | XGBoost | 0.865421 | 0.888977 | 0.886378 |
    | RandomForestRegressor | 0.868972 | 0.860351 | 0.851057 |

  - LightGBM emerged as the best-performing model, showing consistent performance after hyperparameter tuning with both RandomizedSearchCV and GridSearchCV.
  - XGBoost and Ridge Regression delivered competitive results, with slight trade-offs between complexity and performance.
  - RandomForestRegressor, while robust, underperformed the other models after tuning.
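The tuning pattern used for the table above can be sketched as follows. To stay self-contained this sketch uses a scikit-learn `RandomForestRegressor` on synthetic data; the LightGBM and XGBoost searches follow the same pattern, and the search space shown is a hypothetical example, not the project's actual grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed training matrix
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hypothetical search space; GridSearchCV exhausts it, RandomizedSearchCV samples it
param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="r2", random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```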
- A TensorFlow sequential neural network was implemented and outperformed the traditional models:
  - Architecture:
    - Input layer matching the number of features.
    - Three hidden layers with ReLU activation.
    - Output layer with one neuron for regression.
  - Regularization:
    - Early stopping based on validation loss.
    - Learning rate scheduling to improve convergence.
  - Evaluation:
    - Achieved a test loss of 0.293 and a mean absolute error (MAE) of 0.384.
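A minimal Keras sketch of this architecture. The layer widths (128/64/32) and the use of `ReduceLROnPlateau` as the learning-rate schedule are assumptions for illustration; the random data stands in for the preprocessed features.

```python
import numpy as np
import tensorflow as tf

n_features = 10  # matches the preprocessed feature count in the real project
X = np.random.rand(200, n_features).astype("float32")
y = np.random.rand(200).astype("float32")

# Sequential net: input -> three ReLU hidden layers -> single linear output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(128, activation="relu"),  # assumed widths
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                       # one neuron for regression
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

callbacks = [
    # Stop when validation loss stalls, restoring the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    # Halve the learning rate when validation loss plateaus
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=5),
]

history = model.fit(X, y, validation_split=0.2, epochs=5,
                    callbacks=callbacks, verbose=0)
```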
- **Model Stacking:** By combining the predictions of the base models (Ridge Regression, LightGBM, XGBoost, and RandomForestRegressor) with a meta-model (Ridge Regression), the stacking regressor achieved a score of 0.9215. This approach captured complementary patterns from each model, outperforming all individual models.
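The stacking setup can be expressed with scikit-learn's `StackingRegressor`. To keep this sketch self-contained it stacks only two scikit-learn-native base models on synthetic data; the project stacks LightGBM and XGBoost the same way, with Ridge as the meta-model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Synthetic stand-in for the preprocessed training matrix
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-model combining base predictions
    cv=3,                     # out-of-fold predictions train the meta-model
)
stack.fit(X, y)
print(round(stack.score(X, y), 4))  # in-sample R^2
```

Using out-of-fold predictions (`cv=3`) to train the meta-model prevents the base models from leaking their training fit into the stacker.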
- The final stacking model and the TensorFlow model were used to generate predictions for the test dataset.
- The submission file was created with predicted house prices and uploaded to Kaggle.
- Achieved a leaderboard score of 0.62562 with TensorFlow, improving over prior attempts with scikit-learn models.
The dataset is sourced from Kaggle and includes:
- Training Data: 79 features describing various attributes of houses, plus the target variable, `SalePrice`.
- Test Data: contains the same features but excludes the target variable, for evaluation.
- Feature Descriptions: Detailed explanations of the features can be found in the Kaggle data description.
- Features like `LotFrontage` and `Alley` have significant missing values, requiring imputation or removal.
- Categorical features such as `MSZoning` and `Neighborhood` show meaningful groupings for `SalePrice`.
- Numerical features like `GrLivArea` and `OverallQual` exhibit strong positive correlations with `SalePrice`.
- The target variable `SalePrice` is positively skewed, suggesting a log transformation during preprocessing.
The evaluation metric used is Root Mean Squared Logarithmic Error (RMSLE), designed to equally penalize errors for expensive and inexpensive houses:

$$\text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}$$

where $\hat{y}_i$ is the predicted price and $y_i$ the actual price.
This metric ensures proportional fairness across price ranges, and minimizing it is the primary objective of this project. For further details, refer to the Kaggle evaluation section.
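The metric is straightforward to compute with NumPy; the prices below are arbitrary illustrative values:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (log1p form)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([200000.0, 100000.0])
y_pred = np.array([210000.0, 95000.0])
print(rmsle(y_true, y_pred))
```

Because errors are compared on the log scale, being off by 10% costs roughly the same whether the house sells for $100k or $1M.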
- Correlation Heatmap: highlights relationships between numerical features and `SalePrice`.
- Distribution Analysis: reveals the skewness of `SalePrice` and potential outliers.
- Feature Relationships: scatterplots and boxplots demonstrate how numerical and categorical features relate to `SalePrice`.
- Missing Value Analysis: a bar plot prioritizes features with significant missing data.
- **Feature Importance Analysis:**
  - Use tree-based models like LightGBM or XGBoost to identify the key features driving predictions.
  - Visualize feature importances to better interpret model results.
- **Additional Feature Engineering:**
  - Experiment with creating new features, such as interactions between existing ones, to capture complex relationships.
- **Advanced Model Ensembling:**
  - Explore more sophisticated ensembling techniques, such as blending or boosting, to further improve predictive performance.
- **Advanced Neural Network Architectures:**
  - Experiment with deeper or more complex TensorFlow architectures, incorporating techniques like dropout and batch normalization.
- **Deployment:**
  - Deploy the trained TensorFlow model as a web application or API for real-world usability.
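The feature-importance idea above could start from something like this sketch, which uses a scikit-learn random forest on synthetic data with placeholder column names (LightGBM and XGBoost expose `feature_importances_` with the same interface):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
names = [f"feat_{i}" for i in range(5)]  # placeholders for real column names

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Normalized importances, sorted for a bar plot
importances = pd.Series(model.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```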
- The dataset used in this project is sourced from the Kaggle House Prices - Advanced Regression Techniques competition.
This project is licensed under the terms of the MIT License.