This repository contains the project for a Sentiment Analysis course. The main goal was to perform a comparative study of sentiment analysis techniques applied to Donald Trump's historical Twitter data. The project contrasts a lexicon-based approach (VADER) with several supervised ML algorithms for classifying the emotional tone of political communication.

WojciechHrycenko/SentimentAnalysis

Sentiment Analysis - Donald Trump Tweets

Project Overview

This repository contains the coursework and analysis for a Sentiment Analysis project. The primary objective was to perform a comparative study of sentiment analysis techniques applied to Donald Trump's historical Twitter data. The project contrasts a lexicon-based approach (VADER) with various Supervised Machine Learning algorithms to classify the emotional tone of political communication.

Authors

  • Weronika Mądro
  • Wojciech Hrycenko

Repository Contents

1. Sentiment Analysis: Donald Trump Tweets

File: Sentiment_Analysis_Madro_Hrycenko.ipynb

Objective: The goal of this project was to analyze the sentiment of the tweets in four steps:

  1. Apply lexicon-based analysis using VADER to generate sentiment scores (Compound, Positive, Neutral, Negative).
  2. Transform text data into numerical features using TF-IDF.
  3. Train and evaluate supervised classification models using labels derived from the VADER scores.
  4. Optimize model hyperparameters to maximize predictive performance.
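Steps 1 and 3 depend on turning VADER's compound score into a class label. The README does not state which cut-offs the notebook uses, so the sketch below assumes the ±0.05 thresholds commonly suggested for VADER; the notebook may well use a different or binary split. `label_from_compound` and `score_tweets` are hypothetical helper names, not functions from the notebook.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is an assumption, not a value taken from the notebook.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_tweets(texts):
    """Return one compound score per tweet using NLTK's VADER analyzer.

    Needs `pip install nltk` and a one-time nltk.download("vader_lexicon").
    The import is local so the labeling helper above works without NLTK.
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns a dict with keys "neg", "neu", "pos", "compound".
    return [analyzer.polarity_scores(text)["compound"] for text in texts]
```

With the lexicon installed, `[label_from_compound(c) for c in score_tweets(tweets)]` would yield one label per tweet, which can then serve as the target for the supervised models.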

Dataset: The project uses the realdonaldtrump.csv dataset, which contains over 43,000 tweets from Donald Trump (up to June 2020). The analysis focuses on the content column, which is cleaned before any scoring or modeling.
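As an illustration of what cleaning the content column might involve, here is a minimal sketch using only the standard library. It follows the steps named in this README (URLs, mentions, hashtag symbols, punctuation, numbers, lowercasing, stopwords), but the regular expressions, the choice to keep the hashtag word while dropping the `#`, and the name `clean_tweet` are assumptions rather than the notebook's actual code.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str, stopwords=frozenset()) -> str:
    """Lowercase a tweet and strip URLs, mentions, symbols, and digits.

    `stopwords` is an optional set of words to drop; pass for example
    set(nltk.corpus.stopwords.words("english")) to mirror the README.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links first, before punctuation
    text = MENTION_RE.sub(" ", text)     # remove @user mentions entirely
    text = NON_LETTER_RE.sub(" ", text)  # drops '#', punctuation, and digits
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)
```

Assuming the CSV is loaded into a pandas DataFrame named `df` (a hypothetical name), the function could be applied with `df["content"].astype(str).map(clean_tweet)`.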

Methodology

  • Data Analysis & Preprocessing:
    • Text Cleaning: Removal of URLs, user mentions (@user), hashtags (#), punctuation, and numbers; conversion to lowercase.
    • Normalization: Tokenization and removal of English stopwords.
    • EDA: Frequency analysis of the most common terms (e.g., "realdonaldtrump", "great", "fake news") and visualization of sentiment score distributions.
  • Lexicon-Based Approach (VADER):
    • Utilized SentimentIntensityAnalyzer to compute polarity scores.
    • The Compound score was used to label tweets for the supervised learning stage.
  • Supervised Modeling:
    • Feature Extraction: Implemented TfidfVectorizer to convert cleaned text into weighted feature vectors.
    • Algorithms: Trained and compared multiple classifiers:
      • Logistic Regression
      • Linear SVM (LinearSVC)
      • Decision Trees
      • K-Nearest Neighbors (KNN)
      • Random Forest
    • Optimization: Applied GridSearchCV and RandomizedSearchCV for hyperparameter tuning.
  • Evaluation:
    • Performance measured using Accuracy, ROC AUC, and Confusion Matrices.
    • The linear models (Linear SVM and Logistic Regression) outperformed the non-linear ones (Decision Trees, KNN, and Random Forest).
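The feature extraction, modeling, and optimization steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's code: the toy texts and labels, the parameter grid, and the two-fold cross-validation are assumptions chosen so the example runs in a few seconds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; in the project these would be the cleaned tweets
# and the labels derived from VADER compound scores.
texts = [
    "great rally great people", "thank you wonderful crowd",
    "amazing jobs numbers today", "big win for our country",
    "fake news media is terrible", "total disaster and failure",
    "corrupt and dishonest reporting", "worst deal ever made",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Chaining the vectorizer and classifier keeps TF-IDF fitting inside
# each cross-validation fold, which avoids leaking test-fold vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Hypothetical grid: step-name prefixes route parameters to each stage.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

predictions = search.predict(["great win", "fake news disaster"])
print(search.best_params_, list(predictions))
```

Swapping `LinearSVC()` for `LogisticRegression()`, `RandomForestClassifier()`, or another estimator (with a matching grid) gives the same comparison setup for the other classifiers; `RandomizedSearchCV` accepts the same pipeline when the grid is too large to search exhaustively.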

Technologies and Libraries

The project was developed in Python, utilizing the following key libraries:

  • NLTK: For natural language processing tasks (Stopwords, VADER Sentiment Intensity Analyzer).
  • Scikit-learn: For machine learning models (e.g., LogisticRegression, LinearSVC, RandomForestClassifier), feature extraction (TfidfVectorizer), hyperparameter search, and evaluation metrics.
  • Pandas & NumPy: For efficient data manipulation and numerical analysis.
  • Matplotlib & Seaborn: For plotting data distributions and model results.
  • WordCloud: For visualizing the most frequent terms in the corpus.
  • Jupyter Notebook: Used as the interactive development environment.

Usage Instructions

  1. Clone this repository to your local machine.
  2. Ensure all required dependencies are installed (refer to the library list above).
  3. Download the realdonaldtrump.csv dataset and place it in the same directory as the notebook.
  4. Open Sentiment_Analysis_Madro_Hrycenko.ipynb in Jupyter from that directory and run the cells to reproduce the data cleaning, VADER scoring, and model training.
