This repository contains the coursework and analysis for a Sentiment Analysis project. The primary objective was to perform a comparative study of sentiment analysis techniques applied to Donald Trump's historical Twitter data. The project contrasts a lexicon-based approach (VADER) with various Supervised Machine Learning algorithms to classify the emotional tone of political communication.
Authors
- Weronika Mądro
- Wojciech Hrycenko
File: Sentiment_Analysis_Madro_Hrycenko.ipynb
Objective
The goal of this project was to analyze the sentiment of tweets to:
- Apply lexicon-based analysis using VADER to generate sentiment scores (Compound, Positive, Neutral, Negative).
- Transform text data into numerical features using TF-IDF.
- Train and evaluate supervised models (Classification) using labels derived from VADER scores.
- Optimize model hyperparameters to maximize predictive performance.
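As a rough illustration of the first and third goals, the sketch below scores text with VADER and turns the Compound score into a class label. The function names and the ±0.05 cut-offs are assumptions for illustration (the cut-offs follow the commonly used VADER convention); the notebook may label tweets differently.

```python
def label_from_compound(compound, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a VADER compound score in [-1, 1] to a class label.

    The +/-0.05 cut-offs are an assumption (a common VADER convention),
    not necessarily the thresholds used in the notebook.
    """
    if compound >= pos_threshold:
        return "positive"
    if compound <= neg_threshold:
        return "negative"
    return "neutral"


def score_tweets(texts):
    """Return one dict per text with VADER scores (neg, neu, pos, compound) and a label."""
    # Imported lazily so label_from_compound works even without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()
    rows = []
    for text in texts:
        scores = analyzer.polarity_scores(text)  # keys: neg, neu, pos, compound
        scores["label"] = label_from_compound(scores["compound"])
        rows.append(scores)
    return rows
```

Calling `score_tweets(["What a great day!"])` requires NLTK and the VADER lexicon; `label_from_compound` on its own has no dependencies.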
Dataset
The project utilizes the realdonaldtrump.csv dataset, which contains over 43,000 tweets from Donald Trump (up to June 2020). Key analysis was performed on the content column after extensive cleaning.
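The cleaning applied to the content column might look roughly like the sketch below. The regular expressions, the function name `clean_tweet`, and the tiny stopword set are illustrative assumptions; the notebook uses NLTK's English stopword list and may differ in detail (for example, in whether whole mentions or only the "@" symbol are removed).

```python
import re
import string

# Tiny illustrative stopword set; a stand-in for NLTK's English list (assumption).
DEMO_STOPWORDS = frozenset({"the", "is", "a", "an", "and", "to", "of", "in", "at"})

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, stopwords=DEMO_STOPWORDS):
    """Lowercase a tweet and strip URLs, mentions, digits, punctuation, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # drop links first, before punctuation is removed
    text = MENTION_RE.sub(" ", text)   # drop whole @user mentions (assumption)
    text = DIGIT_RE.sub(" ", text)     # drop numbers
    text = text.translate(PUNCT_TABLE) # drop ASCII punctuation, including the "#" sign
    tokens = text.split()              # simple whitespace tokenization
    return " ".join(tok for tok in tokens if tok not in stopwords)
```

With the CSV loaded into a pandas DataFrame `df`, this could be applied as `df["content"].apply(clean_tweet)`.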
Methodology
- Data Analysis & Preprocessing:
  - Text Cleaning: Removal of URLs, user mentions (@user), hashtags (#), punctuation, and numbers; conversion to lowercase.
  - Normalization: Tokenization and removal of English stopwords.
  - EDA: Frequency analysis of top terms (e.g., "realdonaldtrump", "great", "fake news") and visualization of sentiment score distributions.
- Lexicon-Based Approach (VADER):
  - Utilized SentimentIntensityAnalyzer to compute polarity scores.
  - The Compound score was used to label tweets for the supervised learning stage.
- Supervised Modeling:
  - Feature Extraction: Implemented TfidfVectorizer to convert cleaned text into weighted feature vectors.
  - Algorithms: Trained and compared multiple classifiers:
    - Logistic Regression
    - Linear SVM (LinearSVC)
    - Decision Trees
    - K-Nearest Neighbors (KNN)
    - Random Forest
  - Optimization: Applied GridSearchCV and RandomizedSearchCV for hyperparameter tuning.
- Evaluation:
  - Performance was measured using accuracy, ROC AUC, and confusion matrices.
  - Linear SVM and Logistic Regression outperformed the non-linear models (Decision Tree, KNN, Random Forest).
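The supervised stage can be sketched as a scikit-learn pipeline that chains TF-IDF features with a linear classifier and tunes it by grid search. The toy texts, labels, and the `C` grid below are made up for illustration and are not taken from the notebook; for brevity the sketch evaluates on its own training data, whereas a real comparison would use a held-out test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for cleaned tweets and their VADER-derived labels (made-up data).
texts = [
    "great rally tonight", "wonderful crowd thank you", "love this great country",
    "amazing jobs numbers", "great win big success", "thank you wonderful people",
    "fake news is terrible", "bad deal total disaster", "terrible biased media",
    "sad failing newspaper", "disaster for our country", "bad weak leadership",
]
labels = ["positive"] * 6 + ["negative"] * 6

# TF-IDF features feed directly into the classifier, so the vectorizer is
# refit inside every cross-validation fold and no vocabulary leaks across folds.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Hypothetical search grid over the SVM regularization strength.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

predictions = search.predict(texts)
matrix = confusion_matrix(labels, predictions, labels=["negative", "positive"])
print("best C:", search.best_params_["clf__C"])
print("training accuracy:", accuracy_score(labels, predictions))
print(matrix)
```

Swapping `LinearSVC()` for another estimator such as `LogisticRegression()` (with a matching parameter grid) is enough to compare classifiers under the same features.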
The project was developed in Python, utilizing the following key libraries:
- NLTK: For natural language processing tasks (Stopwords, VADER Sentiment Intensity Analyzer).
- Scikit-learn: For machine learning models (LogisticRegression, LinearSVC, RandomForestClassifier), feature extraction (TfidfVectorizer), and evaluation metrics.
- Pandas & NumPy: For efficient data manipulation and numerical analysis.
- Matplotlib & Seaborn: For plotting data distributions and model results.
- WordCloud: For visualizing the most frequent terms in the corpus.
- Jupyter Notebook: Used as the interactive development environment.
To run the project:
- Clone this repository to your local machine.
- Ensure all required dependencies are installed (refer to the library list above).
- Download the realdonaldtrump.csv dataset and place it in the same directory as the notebook.
- Navigate to the directory, then open and run Sentiment_Analysis_Madro_Hrycenko.ipynb to view the data cleaning, VADER scoring, and model training processes.
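One quick way to check the dependencies before opening the notebook is a small script like the one below. It is only a sketch: the helper name `missing_packages` is hypothetical, and the list uses the import names of the libraries mentioned above (scikit-learn imports as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the libraries listed above (scikit-learn imports as "sklearn").
REQUIRED = ["nltk", "sklearn", "pandas", "numpy", "matplotlib", "seaborn", "wordcloud"]


def missing_packages(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```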