The goal of this project is to predict house prices using a combination of advanced regression techniques and machine learning models. Accurate prediction is difficult because many factors influence price, including location, structural details, and neighborhood characteristics. To address this, we use models such as Support Vector Regression (SVR), Artificial Neural Networks (ANN), and XGBoost.
The primary dataset used in this project is the Ames Housing dataset, which contains 79 features (explanatory variables) describing various aspects of residential homes in Ames, Iowa. We also use the train and test datasets from the House Prices - Advanced Regression Techniques competition on Kaggle.
Original Dataset: Courtesy of Prof. Dean De Cock - Thank you!
https://www.kaggle.com/datasets/marcopale/housing/data
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
- AmesHousing: 2931 rows in AmesHousing.csv
- Sample Submission: 1460 rows in sample_submission.csv
- Test: 1460 rows in test.csv
- Train: 1461 rows in train.csv
The competition dataset is divided differently than the original dataset, allowing us to test our models on both and compare their performance. The dataset is manageable, with the following characteristics:
- Volume: Moderate size with over 80 features.
- Variety: Diverse attributes including numerical, categorical, and ordinal data.
- Velocity: Static dataset, no real-time data.
- Veracity: High quality with well-documented attributes.
- Value: Significant value in accurately predicting prices for buyers, sellers, and real estate professionals.
data/
└── competition/
    ├── AmesHousing.csv
    ├── data_description.txt
    ├── sample_submission.csv
    ├── test.csv
    └── train.csv
├── .devcontainer/
│   └── devcontainer.json
├── data/
│   └── competition/
│       ├── AmesHousing.csv
│       ├── data_description.txt
│       ├── sample_submission.csv
│       ├── test.csv
│       └── train.csv
├── notebooks/
│   └── initial_analysis.ipynb
├── requirements.txt
├── setup.sh
└── README.md
- Open Codespaces:
  - Navigate to your repository on GitHub.
  - Click on the Code button and select Open with Codespaces.
- Configuration:
  - The .devcontainer/devcontainer.json file is set up to automatically install necessary packages from requirements.txt when the Codespace starts.
- Clone the Repository:
    git clone https://github.com/OzPol/DataScience.git
    cd DataScience
- Install Dependencies:
    pip install -r requirements.txt
- Start Jupyter Notebook:
    jupyter notebook

The initial data analysis and processing are performed in the notebooks/initial_analysis.ipynb notebook.
One-hot encoding is used for categorical features.
Missing values are handled using appropriate imputation techniques.
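The encoding and imputation steps above could be combined in one scikit-learn preprocessor. The following is a minimal sketch, not the notebook's actual code: the choice of median imputation for numeric columns and most-frequent imputation for categorical columns is an assumption (the text does not name the techniques), and `build_preprocessor` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(df: pd.DataFrame) -> ColumnTransformer:
    """Impute and encode columns based on their dtype (hypothetical helper)."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

    # Assumption: median for numeric gaps, most frequent value for categorical gaps.
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Ignore categories that appear only at prediction time.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ])


# Toy frame with one missing value per column, for illustration only.
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "Neighborhood": pd.Series(["CollgCr", "Veenker", np.nan, "CollgCr"], dtype=object),
})
encoded = build_preprocessor(toy).fit_transform(toy)
print(encoded.shape)
```

Fitting the preprocessor on the training split only, and reusing it on the test split, keeps the imputation statistics from leaking test information.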
Outliers are identified and treated to improve model performance.
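One common way to identify and treat outliers is the interquartile range (IQR) rule followed by clipping. This sketch assumes that approach, which may differ from what the notebook does; the helper names, the multiplier `k=1.5`, and the sample values are illustrative.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def clip_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with one column clipped to its IQR fences."""
    low, high = iqr_bounds(df[column], k)
    out = df.copy()
    out[column] = out[column].clip(lower=low, upper=high)
    return out


# Made-up living areas; the last value is far from the rest.
areas = pd.DataFrame({"GrLivArea": [1200.0, 1500.0, 1710.0, 1786.0, 2198.0, 5642.0]})
low, high = iqr_bounds(areas["GrLivArea"])
flagged = areas[(areas["GrLivArea"] < low) | (areas["GrLivArea"] > high)]
clipped = clip_outliers(areas, "GrLivArea")
print(flagged)
```

Clipping keeps every row, which matters for a dataset of this size; dropping the flagged rows instead is an equally valid treatment.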
New features are created from existing data to enhance model accuracy.
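As an illustration of deriving features from existing columns, the sketch below adds a total square footage and a house age. Both derived features are assumptions chosen for the example, not a record of what the notebook creates. The input column names follow the competition's train.csv; AmesHousing.csv may spell them differently.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative derived columns."""
    out = df.copy()
    # Total floor area across basement, first floor, and second floor.
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Age of the house at the time of sale, in years.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    return out


# Two made-up rows for illustration.
sample = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
})
features = add_engineered_features(sample)
print(features[["TotalSF", "HouseAge"]])
```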
- Linear Regression
- Decision Trees
- Support Vector Regression (SVR)
- Artificial Neural Networks (ANN)
- XGBoost
- Random Forest
- Gradient Boosting
Hyperparameters are tuned using GridSearchCV.
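A minimal sketch of a grid search with GridSearchCV follows. The estimator, the parameter grid, the 3-fold split, and the synthetic data are all assumptions made to keep the example small and fast; the project's actual grids are not given in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed housing features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Hypothetical grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, best_rmse)
```

The same pattern applies to the other models in the list by swapping the estimator and its grid. Grid size grows multiplicatively with each added parameter, so runtime is worth tracking alongside the score.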
- Accuracy: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) on the test set.
- Runtime: Computational efficiency and model training time.
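The two criteria above can be measured together. This sketch assumes a simple hold-out split and wall-clock timing of the fit step; `evaluate` is a hypothetical helper and the data is synthetic.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test) -> dict:
    """Fit a model, then report test-set RMSE, MAE, and training time."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    mae = float(mean_absolute_error(y_test, pred))
    return {"rmse": rmse, "mae": mae, "fit_seconds": fit_seconds}


# Synthetic data standing in for the housing features and prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = evaluate(LinearRegression(), X_train, X_test, y_train, y_test)
print(scores)
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two suggests a few badly mispredicted houses rather than uniformly mediocre predictions.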
Each feature or bug fix should be worked on in its own branch. Use meaningful branch names to easily identify the purpose of the branch (e.g., feature-data-preprocessing, bugfix-missing-values).
Ensure your branch is up to date with the main branch before creating a pull request. Provide a clear description of the changes and the problem they solve.
This project is licensed under the MIT License. See the LICENSE file for details.
- Understand the dataset and clean it up if required.
- Build regression models that predict the sale price from single and multiple features.
- Evaluate the models and compare their scores on metrics such as R² and RMSE.
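To illustrate the single-feature versus multiple-feature comparison named in these objectives, the sketch below fits a linear regression both ways and compares hold-out R². The data is synthetic and the variable names (`living_area`, `quality`) are stand-ins, so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic prices driven by two features plus noise (illustrative only).
rng = np.random.default_rng(42)
n = 300
living_area = rng.uniform(800, 3000, n)
quality = rng.integers(1, 11, n).astype(float)
price = 50_000 + 80 * living_area + 12_000 * quality + rng.normal(0, 10_000, n)

X_multi = np.column_stack([living_area, quality])
X_single = X_multi[:, [0]]  # living area only, kept 2-D for scikit-learn


def holdout_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Fit on 75% of the rows and return R² on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return float(r2_score(y_te, model.predict(X_te)))


r2_single = holdout_r2(X_single, price)
r2_multi = holdout_r2(X_multi, price)
print(f"single feature R2: {r2_single:.3f}, two features R2: {r2_multi:.3f}")
```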
- Data Exploration
- Exploratory Data Analysis (EDA)
- Data Pre-processing
- Data Manipulation
- Feature Selection/Extraction
- Predictive Modeling
- Python
- Jupyter Notebooks
- TensorFlow
- Scikit-learn
- XGBoost
- Pandas
- NumPy
- Matplotlib
- Seaborn
- SciPy
- https://jse.amstat.org/v19n3/decock.pdf
- https://www.kaggle.com/datasets/marcopale/housing/data
- https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
- https://www.mdpi.com/2220-9964/12/5/200
- https://seaborn.pydata.org/tutorial.html
- https://matplotlib.org/stable/gallery/index
- https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html
- https://www.tensorflow.org/tutorials/keras/text_classification
- https://www.tensorflow.org/tutorials/keras/regression
- https://www.geeksforgeeks.org/house-price-prediction-using-machine-learning-in-python/
- A Survey of Methods and Input Data Types for House Price Prediction: https://www.mdpi.com/2220-9964/12/5/200
- House Price Prediction using a Machine Learning Model: A Survey of Literature: https://www.irjet.net/archives/V9/i5/IRJET-V9I5455.pdf