Our Machine Learning project aims to classify tweets based on their sentiment, identifying whether they convey positive :) or negative :( emotions. Multiple approaches were explored, from traditional methods using TF-IDF and GloVe embeddings to more advanced methods like FastText and transformer-based architectures, including DistilBERT and RoBERTa.
The project includes hyperparameter tuning using Optuna, efficient model training on large datasets, and a systematic approach to evaluating performance.
Throughout this project, we prioritized creating a clean, understandable, and concise repository. Our goal was to ensure that anyone exploring the code, dataset, and analysis could do so efficiently and with minimal confusion. We aimed to:
- Maintain a clear directory structure for easy navigation.
- Write modular, well-documented code to simplify reuse and understanding.
- Ensure that all scripts and notebooks are free from unnecessary clutter.
```
├── data/                    # Directory for dataset files
│   ├── train_pos.txt        # Positive sentiment tweets (small dataset)
│   ├── train_neg.txt        # Negative sentiment tweets (small dataset)
│   ├── train_pos_full.txt   # Positive sentiment tweets (full dataset)
│   ├── train_neg_full.txt   # Negative sentiment tweets (full dataset)
│   ├── test_data.txt        # Test dataset
├── notebooks/               # Jupyter notebooks for analysis
│   ├── EDA.ipynb            # Exploratory Data Analysis notebook
│   ├── Ethical_Risk.ipynb   # Ethical risk analysis notebook
├── src/                     # Source code for model training and evaluation
│   ├── preprocess.py        # Preprocessing scripts
│   ├── train.py             # Training scripts
│   ├── evaluate.py          # Evaluation scripts
├── submission.csv           # Test set submission file
├── run.py                   # Main script to run the project
├── requirements.txt         # Python dependencies
```
- Features: Pre-trained GloVe embeddings (100-dimensional) were used to represent tweets as averaged word embeddings.
- Classifier: Trained a Logistic Regression model with hyperparameter tuning.
- Validation Accuracy: Achieved 76.0%.
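
A minimal sketch of this pipeline is shown below; the helper names and toy tweets are illustrative assumptions, not the verbatim contents of `src/`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_glove(path):
    """Load GloVe vectors into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def tweet_to_vector(tweet, embeddings, dim=100):
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vecs = [embeddings[w] for w in tweet.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

# Toy examples standing in for the real tweets loaded from data/.
tweets, labels = ["i love this", "this is awful"], [1, 0]
glove = load_glove("glove.twitter.27B.100d.txt")
X = np.vstack([tweet_to_vector(t, glove) for t in tweets])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```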
- Features: Used TF-IDF vectors with n-grams (up to bigrams or trigrams) as features.
- Classifier: Trained a Logistic Regression model with GridSearchCV for hyperparameter tuning.
- Validation Accuracy: Achieved 82.1%.
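
The corresponding scikit-learn pipeline might look as follows; the toy tweets and the exact search grid are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real tweet lists loaded from data/.
tweets = ["i love this", "this is awful", "great day", "worst day ever"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],  # up to bigrams or trigrams
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(tweets, labels)
print(search.best_params_, search.best_score_)
```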
- Model: FastText, a subword-based text classifier, was tuned using Optuna to optimize hyperparameters like learning rate, number of epochs, and word n-grams.
- Hyperparameter Tuning:
  - Learning Rate: 0.00347
  - Epochs: 69
  - Word N-grams: 4
  - Embedding Dimensions: 50
  - Loss Function: softmax
- Validation Accuracy: Achieved 83.9%.
- Best F1 Score: 84.0%.
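
With the `fasttext` package, the tuned configuration translates into a single supervised training call. The training file name is an assumption: fastText expects one `__label__<y> <text>` line per example, built from the raw tweet files.

```python
import fasttext

# Assumed file with one "__label__1 <tweet>" / "__label__0 <tweet>" line
# per training example, built from train_pos.txt and train_neg.txt.
model = fasttext.train_supervised(
    input="fasttext_train.txt",
    lr=0.00347,       # tuned learning rate
    epoch=69,         # tuned number of epochs
    wordNgrams=4,     # tuned word n-grams
    dim=50,           # embedding dimensions
    loss="softmax",   # loss function
)
labels, probs = model.predict("i love this")
```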
- Model: Pre-trained DistilBERT (`distilbert-base-uncased`), fine-tuned for sentiment classification.
- Hyperparameter Tuning:
  - Learning Rate: $2 \times 10^{-5}$
  - Batch Size: 16
  - Number of Epochs: 5
  - Weight Decay: 0.01
- Validation Accuracy: Achieved 88.7%.
- Best F1 Score: 88.9%.
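
A condensed sketch of this fine-tuning setup with the Hugging Face `Trainer`, using the hyperparameters above; the dataset wrapper and toy tweets are illustrative assumptions:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class TweetDataset(torch.utils.data.Dataset):
    """Wraps raw tweets and labels as tokenized tensors for the Trainer."""
    def __init__(self, tweets, labels):
        self.enc = tokenizer(tweets, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy stand-ins for the real positive/negative tweet files.
train_ds = TweetDataset(["i love this", "this is awful"], [1, 0])

args = TrainingArguments(
    output_dir="distilbert-sentiment",
    learning_rate=2e-5,               # tuned learning rate
    per_device_train_batch_size=16,   # tuned batch size
    num_train_epochs=5,               # tuned number of epochs
    weight_decay=0.01,                # tuned weight decay
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```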
- Model: Pre-trained RoBERTa (`roberta-base`).
- Validation Accuracy: Achieved 88.4%.
- Best F1 Score: 88.7%.
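
Swapping in RoBERTa only changes the checkpoint name; the rest of the fine-tuning sketch above can be reused (whether the project used identical hyperparameters is not stated here).

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
```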
- Clone the repository:
  ```bash
  git clone https://github.com/CS-433/ml-project-2-mocro_learning.git
  cd ml-project-2-mocro_learning
  ```
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download and place the dataset files in a `data/` directory inside `ml-project-2-mocro_learning`.
- Download the GloVe Twitter embeddings from the official GloVe website, extract the archive, and place the files `glove.twitter.27B.50d.txt`, `glove.twitter.27B.100d.txt`, and `glove.twitter.27B.200d.txt` directly in the project folder (`ml-project-2-mocro_learning/`):
```
ml-project-2-mocro_learning/
├── glove.twitter.27B.50d.txt
├── glove.twitter.27B.100d.txt
├── glove.twitter.27B.200d.txt
```
The `run.py` script provides a unified interface to train and evaluate models.
- Choose the desired method via the `method` variable (`glove`, `tfidf`, `fasttext`, `distilbert`, or `roberta`).
- Execute the script:
  ```bash
  python run.py
  ```
The `src/preprocess.py` script includes utility functions for:
- Removing irrelevant tokens like URLs, mentions, and placeholders.
- Converting tweets to GloVe embeddings or TF-IDF features.
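
A sketch of what such a cleaning step can look like; the function name and exact regular expressions are assumptions, not the verbatim contents of `src/preprocess.py`:

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URLs, mentions, and dataset placeholders from a tweet."""
    tweet = re.sub(r"<url>|<user>", " ", tweet)   # dataset placeholders
    tweet = re.sub(r"https?://\S+", " ", tweet)   # raw URLs
    tweet = re.sub(r"@\w+", " ", tweet)           # mentions
    return re.sub(r"\s+", " ", tweet).strip()

print(clean_tweet("@bob check this out https://t.co/xyz <url>"))
# -> "check this out"
```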
The `src/evaluate.py` script evaluates models on validation data and predicts sentiment labels for the test dataset.
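
The core of such an evaluation reduces to a few scikit-learn calls; this is a sketch of the idea, not the script itself:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, X_val, y_val):
    """Score a fitted classifier on held-out validation data."""
    preds = model.predict(X_val)
    return accuracy_score(y_val, preds), f1_score(y_val, preds)
```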
The project leverages Optuna for efficient hyperparameter optimization. Adjust the number of trials and the search spaces in `src/train.py`.
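
A self-contained sketch of the Optuna loop on synthetic data; the real objective in `src/train.py` trains the actual models instead of this stand-in classifier.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features standing in for vectorized tweets.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def objective(trial):
    # Sample a hyperparameter, train, and report validation accuracy.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = LogisticRegression(C=c, max_iter=1000)
    return cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```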
| Model | Test Accuracy | Test F1 Score | Submission ID |
|---|---|---|---|
| Logistic Regression (GloVe) | 70% | 74.3% | 276722 |
| Logistic Regression (TF-IDF) | 78.2% | 78.6% | 276347 |
| FastText | 83.9% | 84.0% | 276771 |
| RoBERTa | 88.4% | 88.7% | 277552 |
| DistilBERT | 88.7% | 88.9% | 277867 |
- Bakiri Ayman
- Ben Mohamed Nizar
- Chahed Ouazzani Adam