Our Machine Learning project aims to classify tweets based on their sentiment, identifying whether they convey positive :) or negative :( emotions. Multiple approaches were explored, from traditional methods using TF-IDF and GloVe embeddings to more advanced methods like FastText and transformer-based architectures, including DistilBERT and RoBERTa.
The project includes hyperparameter tuning using Optuna, efficient model training on large datasets, and a systematic approach to evaluating performance.
Throughout this project, we prioritized creating a clean, understandable, and concise repository. Our goal was to ensure that anyone exploring the code, dataset, and analysis could do so efficiently and with minimal confusion. We aimed to:
- Maintain a clear directory structure for easy navigation.
- Write modular, well-documented code to simplify reuse and understanding.
- Ensure that all scripts and notebooks are free from unnecessary clutter.
```
├── data/                    # Directory for dataset files
│   ├── train_pos.txt        # Positive sentiment tweets (small dataset)
│   ├── train_neg.txt        # Negative sentiment tweets (small dataset)
│   ├── train_pos_full.txt   # Positive sentiment tweets (full dataset)
│   ├── train_neg_full.txt   # Negative sentiment tweets (full dataset)
│   ├── test_data.txt        # Test dataset
├── notebooks/               # Jupyter notebooks for analysis
│   ├── EDA.ipynb            # Exploratory Data Analysis notebook
│   ├── Ethical_Risk.ipynb   # Ethical risk analysis notebook
├── src/                     # Source code for model training and evaluation
│   ├── preprocess.py        # Preprocessing scripts
│   ├── train.py             # Training scripts
│   ├── evaluate.py          # Evaluation scripts
├── submission.csv           # Test set submission file
├── run.py                   # Main script to run the project
├── requirements.txt         # Python dependencies
```
- Features: Pre-trained GloVe embeddings (100-dimensional) were used to represent tweets as averaged word embeddings.
- Classifier: Trained a Logistic Regression model with hyperparameter tuning.
- Validation Accuracy: Achieved 76.0%.
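
A minimal sketch of this pipeline is shown below; the helper names and toy tweets are illustrative assumptions, not the verbatim contents of `src/`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_glove(path):
    """Load GloVe vectors into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def tweet_to_vector(tweet, embeddings, dim=100):
    """Average the embeddings of in-vocabulary words; zeros if none match."""
    vecs = [embeddings[w] for w in tweet.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype=np.float32)

# Toy examples standing in for the real tweets loaded from data/.
tweets, labels = ["i love this", "this is awful"], [1, 0]
glove = load_glove("glove.twitter.27B.100d.txt")
X = np.vstack([tweet_to_vector(t, glove) for t in tweets])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```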
- Features: Used TF-IDF vectors with n-grams (up to bigrams or trigrams) as features.
- Classifier: Trained a Logistic Regression model with GridSearchCV for hyperparameter tuning.
- Validation Accuracy: Achieved 82.1%.
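
The corresponding scikit-learn pipeline might look as follows; the toy tweets and the exact search grid are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-ins for the real tweet lists loaded from data/.
tweets = ["i love this", "this is awful", "great day", "worst day ever"]
labels = [1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__ngram_range": [(1, 2), (1, 3)],  # up to bigrams or trigrams
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(tweets, labels)
print(search.best_params_, search.best_score_)
```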
- Model: FastText, a subword-based text classifier, was tuned using Optuna to optimize hyperparameters like learning rate, number of epochs, and word n-grams.
- Hyperparameter Tuning:
  - Learning Rate: 0.00347
  - Epochs: 69
  - Word N-grams: 4
  - Embedding Dimensions: 50
  - Loss Function: softmax
- Validation Accuracy: Achieved 83.9%.
- Best F1 Score: 84.0%.
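
With the `fasttext` package, the tuned configuration translates into a single supervised training call. The training file name is an assumption: fastText expects one `__label__<y> <text>` line per example, built from the raw tweet files.

```python
import fasttext

# Assumed file with one "__label__1 <tweet>" / "__label__0 <tweet>" line
# per training example, built from train_pos.txt and train_neg.txt.
model = fasttext.train_supervised(
    input="fasttext_train.txt",
    lr=0.00347,       # tuned learning rate
    epoch=69,         # tuned number of epochs
    wordNgrams=4,     # tuned word n-grams
    dim=50,           # embedding dimensions
    loss="softmax",   # loss function
)
labels, probs = model.predict("i love this")
```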
- Model: Pre-trained DistilBERT (`distilbert-base-uncased`), fine-tuned for sentiment classification.
- Hyperparameter Tuning:
  - Learning Rate: $2 \times 10^{-5}$
  - Batch Size: 16
  - Number of Epochs: 5
  - Weight Decay: 0.01
- Validation Accuracy: Achieved 88.7%.
- Best F1 Score: 88.9%.
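
A condensed sketch of this fine-tuning setup with the Hugging Face `Trainer`, using the hyperparameters above; the dataset wrapper and toy tweets are illustrative assumptions:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class TweetDataset(torch.utils.data.Dataset):
    """Wraps raw tweets and labels as tokenized tensors for the Trainer."""
    def __init__(self, tweets, labels):
        self.enc = tokenizer(tweets, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy stand-ins for the real positive/negative tweet files.
train_ds = TweetDataset(["i love this", "this is awful"], [1, 0])

args = TrainingArguments(
    output_dir="distilbert-sentiment",
    learning_rate=2e-5,               # tuned learning rate
    per_device_train_batch_size=16,   # tuned batch size
    num_train_epochs=5,               # tuned number of epochs
    weight_decay=0.01,                # tuned weight decay
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```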
- Model: Pre-trained RoBERTa (`roberta-base`).
- Validation Accuracy: Achieved 88.4%.
- Best F1 Score: 88.7%.
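
Swapping in RoBERTa only changes the checkpoint name; the rest of the fine-tuning sketch above can be reused (whether the project used identical hyperparameters is not stated here).

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
```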
- Clone the repository:
  ```bash
  git clone https://github.com/CS-433/ml-project-2-mocro_learning.git
  cd ml-project-2-mocro_learning
  ```
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download and place the dataset files in a `data/` directory inside `ml-project-2-mocro_learning`.
- Download the GloVe Twitter embeddings from the official GloVe website, extract the archive, and place the files `glove.twitter.27B.50d.txt`, `glove.twitter.27B.100d.txt`, and `glove.twitter.27B.200d.txt` directly in the project folder (`ml-project-2-mocro_learning/`):
```
ml-project-2-mocro_learning/
├── glove.twitter.27B.50d.txt
├── glove.twitter.27B.100d.txt
├── glove.twitter.27B.200d.txt
```
The `run.py` script provides a unified interface to train and evaluate models.
- Choose the desired method via the `method` variable (`glove`, `tfidf`, `fasttext`, `distilbert`, or `roberta`).
- Execute the script:
  ```bash
  python run.py
  ```
The `src/preprocess.py` script includes utility functions for:
- Removing irrelevant tokens like URLs, mentions, and placeholders.
- Converting tweets to GloVe embeddings or TF-IDF features.
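
A sketch of what such a cleaning step can look like; the function name and exact regular expressions are assumptions, not the verbatim contents of `src/preprocess.py`:

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URLs, mentions, and dataset placeholders from a tweet."""
    tweet = re.sub(r"<url>|<user>", " ", tweet)   # dataset placeholders
    tweet = re.sub(r"https?://\S+", " ", tweet)   # raw URLs
    tweet = re.sub(r"@\w+", " ", tweet)           # mentions
    return re.sub(r"\s+", " ", tweet).strip()

print(clean_tweet("@bob check this out https://t.co/xyz <url>"))
# -> "check this out"
```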
The `src/evaluate.py` script evaluates models on validation data and predicts sentiment labels for the test dataset.
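
The core of such an evaluation reduces to a few scikit-learn calls; this is a sketch of the idea, not the script itself:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, X_val, y_val):
    """Score a fitted classifier on held-out validation data."""
    preds = model.predict(X_val)
    return accuracy_score(y_val, preds), f1_score(y_val, preds)
```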
The project leverages Optuna for efficient hyperparameter optimization. Adjust the number of trials and the search spaces in `src/train.py`.
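
A self-contained sketch of the Optuna loop on synthetic data; the real objective in `src/train.py` trains the actual models instead of this stand-in classifier.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features standing in for vectorized tweets.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def objective(trial):
    # Sample a hyperparameter, train, and report validation accuracy.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = LogisticRegression(C=c, max_iter=1000)
    return cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```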
| Model | Test Accuracy | Test F1 Score | Submission ID |
|---|---|---|---|
| Logistic Regression (GloVe) | 70% | 74.3% | 276722 |
| Logistic Regression (TF-IDF) | 78.2% | 78.6% | 276347 |
| FastText | 83.9% | 84.0% | 276771 |
| RoBERTa | 88.4% | 88.7% | 277552 |
| DistilBERT | 88.7% | 88.9% | 277867 |
- Bakiri Ayman
- Ben Mohamed Nizar
- Chahed Ouazzani Adam