This repository contains the coursework and projects developed for the Machine Learning 2 course. The primary objective was to apply and evaluate various machine learning techniques on real-world datasets, specifically focusing on classification and regression problems.
- Aleksandra Dobosz
- Wojciech Hrycenko
Directory: Classification/
Objective
The goal of this project was to develop a binary classification model to predict whether it will rain the following day in Australia (target variable: RainTomorrow), based on daily meteorological observations.
Dataset
The project utilizes the weatherAUS.csv dataset, which includes features such as temperature, rainfall, sunshine, wind gusts, humidity, and pressure.
Methodology
- Data Analysis & Preprocessing: Addressed missing values and analyzed variable distributions (noting significant skewness in rainfall data).
- Modeling:
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Techniques: Implemented class weighting (class_weight='balanced') to mitigate the imbalance between rainy and non-rainy days.
- Evaluation: Performance assessment focused on Recall and F1-score for the positive class to minimize the risk of failing to predict rainfall. The Gradient Boosting model, following decision threshold optimization, yielded the most robust results.
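The class weighting and threshold tuning described above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names (balanced_class_weights, f1_at_threshold, best_threshold), the toy labels and probabilities, and the choice of F1 as the selection criterion are all assumptions. The weighting formula is the one scikit-learn documents for class_weight='balanced'.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    the formula scikit-learn documents for class_weight='balanced'."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def f1_at_threshold(y_true, y_prob, threshold):
    """F1-score of the positive class when predicting 1 for prob >= threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates=None):
    """Scan candidate thresholds and keep the one with the highest F1.
    Assumption: F1 is the criterion; the source does not state which was used."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 96, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Hypothetical toy data: 1 = rain tomorrow, 0 = no rain (imbalanced on purpose).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.70]

weights = balanced_class_weights(y_true)
threshold = best_threshold(y_true, y_prob)
best_f1 = f1_at_threshold(y_true, y_prob, threshold)
print(weights, threshold, best_f1)
```

On this toy data the rare positive class gets weight 2.0 and the scan selects a threshold of 0.4, below the default 0.5, which trades one false positive for catching both rainy days. Note that in scikit-learn class_weight is a constructor parameter of DecisionTreeClassifier and RandomForestClassifier, while GradientBoostingClassifier has no such parameter; per-sample weights can instead be passed to its fit method via sample_weight.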
Directory: Regression - Used Cars/
Objective
This project aimed to build a regression model to estimate the market price of used vehicles based on data scraped from Craigslist.
Dataset
The analysis used the vehicles.csv dataset, comprising millions of vehicle listings from the United States.
Note: Due to file size limitations, the dataset is not hosted in this repository. It can be downloaded from Kaggle: Used Cars Dataset (Craigslist)
Methodology
- Data Quality Assessment: Conducted a thorough analysis of missing data and unique value counts.
- Preprocessing:
- Dimensionality Reduction: Removed columns with excessive missingness (e.g., county, size) and irrelevant identifiers (VIN, url).
- Imputation: Filled missing categorical data with an 'unknown' placeholder.
- Outlier Detection: Filtered data to realistic ranges for price ($500 - $150k), manufacturing year (1990-2025), and odometer readings.
- Modeling:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor
- Neural Networks (MLP Regressor)
- Voting Regressor (Ensemble method)
- Evaluation: Models were compared using Mean Absolute Error (MAE), Mean Squared Error (MSE), and the R-squared (R2) coefficient.
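The preprocessing steps listed above can be sketched with pandas roughly as follows. This is an illustration under stated assumptions rather than the notebook's code: the function name clean_listings is hypothetical, the column names price, year, and odometer are assumed to match the dataset, every non-numeric column is treated as categorical, and the odometer upper bound (max_odometer) is a placeholder because the text does not give the range used.

```python
import pandas as pd

# Columns named in the methodology; errors="ignore" skips any that are absent.
DROP_COLUMNS = ["county", "size", "VIN", "url"]


def clean_listings(df, max_odometer=500_000):
    """Drop sparse/identifier columns, fill categorical gaps, filter outliers.

    max_odometer is an assumed bound; the source does not state the real one.
    """
    df = df.drop(columns=DROP_COLUMNS, errors="ignore")

    # Assumption: every non-numeric column is categorical.
    cat_cols = df.select_dtypes(exclude="number").columns
    df = df.fillna({col: "unknown" for col in cat_cols})

    # between() is inclusive; rows with a missing value in any of these
    # columns compare as False and are therefore dropped as well.
    in_range = (
        df["price"].between(500, 150_000)
        & df["year"].between(1990, 2025)
        & df["odometer"].between(0, max_odometer)
    )
    return df[in_range].reset_index(drop=True)


# Hypothetical toy listings, each violating at most one rule.
raw = pd.DataFrame({
    "price": [300, 12_000, 45_000, 900_000, 8_500],
    "year": [2005, 2015, 1985, 2018, 2012],
    "odometer": [120_000, 60_000, 90_000, 30_000, None],
    "fuel": ["gas", None, "diesel", "gas", "gas"],
    "county": [None] * 5,
    "url": ["u1", "u2", "u3", "u4", "u5"],
})
clean = clean_listings(raw)
print(clean)
```

Only the second toy listing survives: the others fail the price, year, or odometer filter, and its missing fuel value becomes 'unknown'.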
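For reference, the three evaluation metrics named above can be computed by hand as in the sketch below. The function name regression_metrics and the toy prices are illustrative assumptions; in practice scikit-learn's mean_absolute_error, mean_squared_error, and r2_score in sklearn.metrics provide the same quantities.

```python
def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error

    # R2 = 1 - SS_res / SS_tot: the share of variance around the mean
    # of y_true that the predictions account for.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "r2": r2}


# Hypothetical prices in dollars.
y_true = [10_000, 20_000, 30_000]
y_pred = [12_000, 19_000, 27_000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE stays in dollars and is easy to read as a typical pricing error, MSE penalizes large misses more heavily, and R2 is unitless, which is why the three are often reported together.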
The project was developed in Python, utilizing the following key libraries:
- Pandas & NumPy: For efficient data manipulation and numerical analysis.
- Scikit-learn: For model training, preprocessing pipelines, and evaluation metrics.
- XGBoost: For high-performance gradient boosting algorithms.
- Matplotlib & Seaborn: For exploratory data analysis and result visualization.
- Jupyter Notebook: Used as the interactive development environment.
- Clone this repository to your local machine.
- Ensure all required dependencies are installed (refer to library list above).
- For Regression Project: Download the vehicles.csv file from the Kaggle link provided above and place it in the Regression - Used Cars/ directory.
- Navigate to the respective directories (Classification/ or Regression - Used Cars/) and execute the Jupyter Notebooks (.ipynb) to view the analysis and reproduce the models.