
Reddit’s Invisible Brain: The Web Hidden Beneath the Threads

→ View Interactive Data Story


Overview

Reddit is not just a collection of online forums; it is a living ecosystem of interconnected communities, each with its own emotional and cognitive identity. Through hyperlinks, subreddits comment on, critique, and reference one another, forming a complex web of inter-community relationships.

This project uncovers the hidden psychological structure underlying that web by integrating three distinct perspectives:

  • Structural: Who talks about whom? (Network topology, centrality, community structure)
  • Psychological: How do they talk? (Emotional tone, cognitive style, LIWC linguistic analysis)
  • Semantic: What are they about? (Topic embeddings and clustering)

By combining these layers, we visualized Reddit as a "social MRI", revealing how communities align, clash, or coexist not only by what they discuss but also by how they think and feel. This project tells the story of Reddit's invisible brain: the emotional and cognitive currents that shape the flow of ideas across its digital landscape.

Project Scope:

  • 858,488 hyperlinks across 67,180 subreddits (2014-2017)
  • 3 integrated data layers (Structural, Psychological, Semantic)
  • 40 topic clusters with manual semantic validation
  • 20 statistical research questions across 5 thematic areas
  • 11 interactive web visualizations + 17 static analytical charts

Quickstart

# clone project
git clone https://github.com/epfl-ada/ada-2025-project-barrada.git barrADA
cd barrADA
# create conda environment (optional)
conda create -n barrada python=3.13
conda activate barrada
# install requirements
pip install -r pip_requirements.txt
# create data folders and download initial datasets
mkdir -p data/hyperlink_network && mkdir -p data/subreddit_embeddings
wget -O data/hyperlink_network/soc-redditHyperlinks-body.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv"
wget -O data/hyperlink_network/soc-redditHyperlinks-title.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-title.tsv"
wget -O data/subreddit_embeddings/web-redditEmbeddings-subreddits.csv "https://snap.stanford.edu/data/web-redditEmbeddings-subreddits.csv"
# launch Jupyter to explore results.ipynb or the Python modules inside src/
jupyter lab

Project Structure

├── data/
│   ├── hyperlink_network/              # Raw SNAP datasets (not in repo)
│   │   ├── soc-redditHyperlinks-body.tsv
│   │   └── soc-redditHyperlinks-title.tsv
│   ├── subreddit_embeddings/           # Raw embeddings (not in repo)
│   │   └── web-redditEmbeddings-subreddits.csv
│   └── processed/                        # Generated by pipeline
│       ├── combined_hyperlinks.csv        # Step 1: 858k links + 86 features
│       ├── subreddit_features_source.csv  # Step 2: Outgoing LIWC profiles
│       ├── subreddit_features_target.csv  # Step 2: Incoming LIWC profiles
│       ├── subreddit_roles.csv            # Step 2: Role classifications
│       ├── network_node_metrics.csv       # Step 3: PageRank, betweenness, etc.
│       ├── network_communities.csv        # Step 3: Louvain communities
│       ├── embeddings_processed.csv       # Step 4: PCA-reduced 50D vectors
│       ├── embeddings_kmeans_40.csv       # Step 5: Topic cluster assignments
│       ├── cluster_labels_40.csv          # Step 5: Manual topic labels
│       ├── final_dataset.csv              # Step 6: Master dataset (67k × 161)
│       ├── cluster_master_dataset.csv     # Step 7: Cluster-level aggregation (40 × 110)
│       └── rq_analysis/                   # 20 research question outputs
│           ├── rq1_role_patterns.csv
│           └── ...
│
├── src/                                
│   ├── data_processing.py              # Step 1: Hyperlink loading + LIWC parsing
│   ├── liwc_analysis.py                # Step 2: Psychological profiling
│   ├── network_analysis.py             # Step 3: Graph metrics + communities
│   ├── embedding_processing.py         # Step 4: PCA dimensionality reduction
│   ├── topic_clustering.py             # Step 5: K-Means + manual labeling
│   ├── integration.py                  # Step 6: Merge all layers
│   ├── cluster_aggregation.py          # Step 7: Cluster-level aggregation
│   ├── research_questions.py           # 20 statistical tests
│   └── visualize_pipeline.py           # 17 static visualizations
│
├── scripts/
│   └── prepare_data.py                 # Generate 11 JSON files for web visualizations
│
├── results/
│   └── figures/                        # 17 PNG charts (step1-step6)
│
├── docs/                               # Jekyll website (GitHub Pages)
│   ├── assets/
│   │   ├── data/                      
│   │   │   ├── nodes.json              
│   │   │   ├── edges.json              
│   │   │   ├── rivalry.json           
│   │   │   ├── toxicity.json          
│   │   │   ├── insurgency.json         
│   │   │   ├── echo.json               
│   │   │   ├── roles.json              
│   │   │   ├── roles_scatter.json   
│   │   │   ├── power.json             
│   │   │   ├── bridges.json            
│   │   │   └── civility.json           
│   │   ├── visualizations/             # 10 interactive HTML embeds
│   │   │   ├── topic_continents_network.html       # Force-directed graph
│   │   │   ├── roles_quadrant_scatter.html         # Role classification scatter
│   │   │   ├── rivalry_chord.html                  # Rivalry flows
│   │   │   ├── toxicity_flow_sankey.html           # Toxicity flow
│   │   │   ├── insurgency_power_waffle.html        # Attack direction patterns
│   │   │   ├── isolation_civility_slope.html       # Isolation vs civility
│   │   │   ├── bridges_linguistic_bar.html         # Bridge predictors
│   │   │   ├── roles_synapses_bundling.html        # Role interaction flows
│   │   │   ├── network_synthesis_explorer.html     # Interactive 3-layer explorer
│   │   │   └── alliances_chord.html                # Positive sentiment flows
│   │   └── img/                        # Static PNG assets for website
│   └── reddit-story.html               # Main data story page
│
├── results.ipynb                       # Complete technical notebook
├── README.md                           # This file
├── .gitignore                          # List of files ignored by git
├── LICENSE                             # MIT License
└── pip_requirements.txt                # Python dependencies

Methods

Data Sources

We used three datasets from the Stanford Network Analysis Project (SNAP):

  1. Reddit Hyperlink Network (Body + Title): two files of hyperlinks extracted from post bodies and titles (2014-2017)
  2. Subreddit Embeddings: 300-dimensional topic vectors for 51,278 subreddits

Each hyperlink contains:

  • Source and target subreddit
  • Timestamp
  • Sentiment label (+1 positive/neutral, -1 negative)
  • PROPERTIES string with 21 text features + 65 LIWC psychological scores
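A minimal pandas sketch of loading and combining the two hyperlink files is shown below; the column names follow the SNAP documentation, while the actual cleaning logic lives in src/data_processing.py and may differ in detail.

import pandas as pd

# Load both hyperlink datasets (tab-separated; one row per cross-subreddit link)
body = pd.read_csv("data/hyperlink_network/soc-redditHyperlinks-body.tsv", sep="\t")
title = pd.read_csv("data/hyperlink_network/soc-redditHyperlinks-title.tsv", sep="\t")

links = pd.concat([body, title], ignore_index=True)
links["TIMESTAMP"] = pd.to_datetime(links["TIMESTAMP"])

# Columns: SOURCE_SUBREDDIT, TARGET_SUBREDDIT, POST_ID, TIMESTAMP,
#          LINK_SENTIMENT (+1 / -1), PROPERTIES (comma-separated feature string)
print(links.shape, links["LINK_SENTIMENT"].value_counts().to_dict())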

Analysis Pipeline (7 Steps)

Step 1: Data Processing

  • Combined body and title hyperlinks (858,490 total links)
  • Parsed PROPERTIES string into 86 distinct features:
    • 21 Text Properties: word count, readability, VADER sentiment, compound score, etc.
    • 65 LIWC Features: psychological dimensions (anger, anxiety, certainty, cognitive complexity, etc.)
  • Standardized subreddit names, removed self-loops, validated data integrity
  • Output: combined_hyperlinks.csv
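As a rough illustration of the PROPERTIES parsing, continuing the loading sketch above (the placeholder feature names stand in for the documented text/LIWC names used by src/data_processing.py):

import pandas as pd

# PROPERTIES is a comma-separated string of 86 numeric values:
# 21 text properties followed by 65 LIWC scores.
N_TEXT, N_LIWC = 21, 65
feature_names = ([f"text_{i}" for i in range(N_TEXT)]      # placeholder names
                 + [f"liwc_{i}" for i in range(N_LIWC)])    # placeholder names

props = links["PROPERTIES"].str.split(",", expand=True).astype(float)
props.columns = feature_names

combined = pd.concat([links.drop(columns="PROPERTIES"), props], axis=1)

# Standardize subreddit names and drop self-loops
for col in ("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT"):
    combined[col] = combined[col].str.lower().str.strip()
combined = combined[combined["SOURCE_SUBREDDIT"] != combined["TARGET_SUBREDDIT"]]
combined.to_csv("data/processed/combined_hyperlinks.csv", index=False)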

Step 2: LIWC Analysis (Psychological Layer)

  • Aggregated 65 LIWC scores per subreddit as both source (outgoing) and target (incoming)
  • Computed composite psychological metrics:
    • Emotion_Total: Sum of positive and negative emotion
    • Negemo_Specific: Mean of anger, anxiety, sadness
    • Cognitive_Total: Mean of certainty, tentativeness, insight, discrepancy, causation
  • Calculated asymmetry features: outgoing - incoming (e.g., "I speak with anger" vs "others speak to me with anger")
  • Classified subreddits into 4 social roles:
    • Critical: High outgoing negativity
    • Controversial: High incoming negativity
    • Supportive: High outgoing positive/neutral links
    • Influential: High incoming positive/neutral links
  • Outputs: subreddit_features_source.csv, subreddit_features_target.csv, subreddit_roles.csv
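A condensed sketch of the aggregation and asymmetry features, continuing from the Step 1 sketch (the LIWC column names and exact composite definitions in src/liwc_analysis.py are assumptions here):

import pandas as pd

# Per-subreddit psychological profiles, from two points of view
liwc_cols = [c for c in combined.columns if c.startswith("liwc_")]       # placeholder names
source_profile = combined.groupby("SOURCE_SUBREDDIT")[liwc_cols].mean()  # outgoing tone
target_profile = combined.groupby("TARGET_SUBREDDIT")[liwc_cols].mean()  # incoming tone

# Composite metrics (as defined above, using the project's LIWC column names):
#   Emotion_Total   = posemo + negemo
#   Negemo_Specific = mean(anger, anxiety, sadness)
#   Cognitive_Total = mean(certainty, tentativeness, insight, discrepancy, causation)

# Asymmetry: how a community speaks (outgoing) minus how it is spoken to (incoming)
asymmetry = source_profile.subtract(target_profile, fill_value=0)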

Step 3: Network Analysis (Structural Layer)

  • Built directed graph G(V, E) with 67,180 nodes and 858,490 weighted edges
  • Computed centrality metrics:
    • PageRank: Prestige (who gets linked to)
    • Betweenness Centrality: Bridges (who connects disconnected communities)
      • Used k=50 sampling for computational efficiency
      • Sufficient for ranking; absolute values are approximations
    • HITS Algorithm: Hub scores (good linkers) and Authority scores (good content)
  • Detected communities with the Louvain method on an undirected projection of the graph (modularity optimization)
  • Built separate graphs for positive-only and negative-only sentiment
  • Outputs: network_node_metrics.csv, network_communities.csv, network_edges.csv
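The graph metrics can be reproduced with networkx along the following lines (a sketch; src/network_analysis.py may use different parameters or the python-louvain package for community detection):

import networkx as nx

# Directed, weighted graph: edge weight = number of parallel hyperlinks
edges = (combined.groupby(["SOURCE_SUBREDDIT", "TARGET_SUBREDDIT"])
         .size().reset_index(name="weight"))
G = nx.from_pandas_edgelist(edges, "SOURCE_SUBREDDIT", "TARGET_SUBREDDIT",
                            edge_attr="weight", create_using=nx.DiGraph)

pagerank = nx.pagerank(G, weight="weight")                  # prestige
betweenness = nx.betweenness_centrality(G, k=50, seed=42)   # sampled approximation
hubs, authorities = nx.hits(G, max_iter=1000)               # HITS scores

# Louvain on the undirected projection (modularity optimization)
communities = nx.community.louvain_communities(G.to_undirected(), seed=42)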

Step 4: Embedding Processing (Semantic Layer)

  • Loaded 300-dimensional subreddit embeddings (51,278 subreddits)
  • Applied PCA to reduce to 50 dimensions
    • Captures ~92% of variance
    • Reduces noise and improves clustering
  • Outputs: embeddings_processed.csv, pca_variance.csv
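A minimal scikit-learn sketch of the reduction (the embeddings CSV is assumed to have no header, with the subreddit name in the first column followed by the 300 components):

import pandas as pd
from sklearn.decomposition import PCA

emb = pd.read_csv("data/subreddit_embeddings/web-redditEmbeddings-subreddits.csv",
                  header=None)
names, X = emb.iloc[:, 0], emb.iloc[:, 1:].to_numpy()

pca = PCA(n_components=50, random_state=42)
X50 = pca.fit_transform(X)   # one 50-D vector per subreddit
print(f"variance captured: {pca.explained_variance_ratio_.sum():.1%}")   # ~92% reported above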

Step 5: Topic Clustering

  • Ran K-Means clustering with K=40 on PCA-reduced embeddings
  • Why K=40?
    • Tested K=20, K=40, K=50
    • K=50 scored better on silhouette but produced semantically incoherent clusters
    • K=40 offered the best balance between statistical fit and interpretability
  • Manual Labeling Process:
    1. For each cluster, examined top-20 subreddits by link volume
    2. Assigned semantic labels (e.g., "Gaming," "Politics," "Meta-Commentary")
    3. Applied 66 manual overrides to fix edge cases (e.g., r/steam initially misclassified into a cooking cluster)
  • Outputs: embeddings_kmeans_40.csv, cluster_labels_40.csv
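A sketch of the clustering step, run on the PCA vectors from Step 4 (hyperparameters here are illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=40, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X50)

# Silhouette on a sample keeps the K=20/40/50 comparison tractable
print(silhouette_score(X50, cluster_ids, sample_size=10_000, random_state=42))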

Step 6: Integration

  • Merged all processed datasets on subreddit key:
    • Network metrics (PageRank, betweenness, communities)
    • LIWC psychological profiles (source + target)
    • Topic cluster assignments and labels
    • Social role classifications
  • Output: final_dataset.csv (67,180 rows × 161 features)
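Conceptually, the integration is a sequence of left joins on the subreddit name (a sketch; the join key and column layout of the processed files are assumptions):

import pandas as pd
from functools import reduce

layers = [pd.read_csv(f"data/processed/{name}") for name in (
    "network_node_metrics.csv", "subreddit_features_source.csv",
    "subreddit_features_target.csv", "embeddings_kmeans_40.csv",
    "subreddit_roles.csv")]

# Left joins keep every one of the 67,180 network nodes, even those without embeddings
final = reduce(lambda left, right: left.merge(right, on="subreddit", how="left"), layers)
final.to_csv("data/processed/final_dataset.csv", index=False)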

Step 7: Cluster Aggregation

  • Aggregated subreddit-level data to cluster level
  • Calculated per-cluster:
    • Mean network metrics (PageRank, betweenness, degree)
    • Mean psychological scores (all 65 LIWC dimensions)
    • Insularity (percentage of internal links)
    • Sentiment flows (incoming/outgoing positive/negative)
    • Role distributions (percentage Critical, Supportive, etc.)
    • Top-5 exemplar subreddits (by PageRank, betweenness, in-degree)
    • Derived scores (toxicity, analytical, emotional)
  • Output: cluster_master_dataset.csv (40 clusters × 110 features)
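At its core, Step 7 is a pandas groupby over the master dataset (column names such as cluster_id and pagerank are assumptions about final_dataset.csv):

import pandas as pd

final = pd.read_csv("data/processed/final_dataset.csv")

cluster_stats = final.groupby("cluster_id").agg(
    n_subreddits=("subreddit", "size"),
    mean_pagerank=("pagerank", "mean"),
    mean_betweenness=("betweenness", "mean"),
)

# Top-5 exemplar subreddits per cluster, e.g. by PageRank
cluster_stats["top5_by_pagerank"] = (
    final.sort_values("pagerank", ascending=False)
         .groupby("cluster_id")["subreddit"]
         .apply(lambda s: ", ".join(s.head(5))))

cluster_stats.to_csv("data/processed/cluster_master_dataset.csv")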

Research Question Framework

We conducted statistical tests organized into 5 thematic areas:


1. Roles & Psychology

Methods: ANOVA, t-tests, correlation analysis

  • RQ1.1: What linguistic patterns distinguish social roles? Method: compare LIWC means across roles (Critical/Supportive/etc.)
  • RQ1.2: Which emotions dominate conflict vs support? Method: t-test of LIWC scores on positive vs negative links
  • RQ1.3: Do communities experience emotional asymmetry? Method: compare outgoing vs incoming LIWC (paired differences)
  • RQ1.4: Do ideological neighbors attack each other more? Method: compare within-cluster vs cross-cluster sentiment

2. Echo Chambers

Methods: Correlation, purity index, regression

  • RQ2.1: Does certainty correlate with insularity? Method: Pearson r between certainty score and internal link percentage
  • RQ2.2: Where do semantic and structural clustering align? Method: purity matrix of topic cluster × network community
  • RQ2.3: What predicts isolation? Method: regression, insularity ~ LIWC + network features

3. Network Structure

Methods: Correlation, flow matrices, bridge analysis

  • RQ3.1: What makes a bridge? Method: correlation of betweenness with LIWC/role features
  • RQ3.2: Which topics act as bridges? Method: cross-cluster link density by topic
  • RQ3.3: What are the major rivalries and alliances? Method: net sentiment flow matrix between clusters

4. Conflict Patterns

Methods: Cluster comparisons, sentiment aggregation

  • RQ4.1: How are roles distributed across topics? Method: role percentage per cluster
  • RQ4.2: Which clusters attack outward most? Method: external negativity (outgoing negative - incoming positive)
  • RQ4.3: Which clusters attack themselves? Method: internal civility (negative links within cluster)
  • RQ4.4: What language predicts internal peace? Method: correlation of civility with LIWC features
  • RQ4.5: What language predicts external war? Method: correlation of external negativity with LIWC

5. Power Dynamics

Methods: Flow analysis, regression, directional tests

  • RQ5.1: Who are the punching bags? Method: net toxicity flow (incoming negative - outgoing negative)
  • RQ5.2: What predicts PageRank? Method: correlation of PageRank with LIWC/betweenness/degree
  • RQ5.3: Do attacks flow up or down the hierarchy? Method: compare low-PageRank → high-PageRank vs high-PageRank → low-PageRank links
  • RQ5.4: Does receiving hate breed sending hate? Method: correlation of incoming negativity with outgoing negativity
  • RQ5.5: Are critics analytical or emotional? Method: compare cognitive vs emotional LIWC in negative links

Statistical Methods:

  • Correlation Analysis: Pearson r for continuous variables
  • T-tests: Compare group means (e.g., positive vs negative link language)
  • ANOVA: Compare multiple groups (e.g., LIWC across 4 roles)
  • Flow Matrices: Net sentiment between clusters (incoming - outgoing)
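For reference, these tests map onto standard SciPy and pandas calls roughly as follows (variable and column names are placeholders for the processed tables):

from scipy import stats

# Pearson correlation, e.g. certainty vs insularity (RQ2.1)
r, p = stats.pearsonr(final["certainty"], final["insularity"])

# Welch t-test, e.g. an LIWC score on negative vs positive links (RQ1.2)
t, p = stats.ttest_ind(neg_links["liwc_anger"], pos_links["liwc_anger"], equal_var=False)

# One-way ANOVA, e.g. an LIWC score across the four roles (RQ1.1)
F, p = stats.f_oneway(*[g["liwc_anger"].to_numpy() for _, g in final.groupby("role")])

# Net sentiment flow between clusters (RQ3.3): outgoing minus incoming
flows = links_with_clusters.pivot_table(index="src_cluster", columns="tgt_cluster",
                                        values="LINK_SENTIMENT", aggfunc="sum", fill_value=0)
net_flow = flows.subtract(flows.T, fill_value=0)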

All outputs saved to data/processed/rq_analysis/ for reproducibility.

Visualization Pipeline

Static Visualizations (17 PNG charts)

Generated via src/visualize_pipeline.py, saved to results/figures/:

  • Step 1: Sentiment distribution, temporal patterns, top sources/targets, LIWC profiles, attack patterns (7 plots)
  • Step 2: Psychological asymmetry, role quadrants, top influential/supported (4 plots)
  • Step 3: Centrality rankings (PageRank, betweenness, hubs, authorities) (1 plot)
  • Step 4: PCA variance explained (1 plot)
  • Step 5: Topic cluster size distribution (1 plot)
  • Step 6: Topic-network-role integration, LIWC role lift, echo chamber heatmap (3 plots)

Interactive Web Visualizations (11 HTML/JSON)

Generated via scripts/prepare_data.py, embedded in data story:

  • topic_continents_network.html: Force-directed graph of 40 topic clusters
  • roles_quadrant_scatter.html: 4-quadrant role classification (x=outgoing, y=incoming sentiment)
  • rivalry_chord.html: Chord diagram of negative sentiment flows between clusters
  • toxicity_flow_sankey.html: Sankey diagram of toxicity propagation
  • insurgency_power_waffle.html: Waffle chart showing attack direction (up vs down)
  • isolation_civility_slope.html: Slope graph comparing isolation and internal civility
  • bridges_linguistic_bar.html: Bar chart of betweenness predictors
  • roles_synapses_bundling.html: Hierarchical edge bundling of role interactions
  • network_synthesis_explorer.html: Interactive 3-layer data explorer
  • alliances_chord.html: Chord diagram of positive sentiment flows
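A minimal sketch of the kind of export scripts/prepare_data.py performs for one of these embeds (the JSON schema expected by the web pages and the column names used below are assumptions):

import json
import pandas as pd

clusters = pd.read_csv("data/processed/cluster_master_dataset.csv")

# One node per topic cluster for the force-directed graph
nodes = [{"id": int(r.cluster_id), "label": str(r.cluster_label),
          "size": int(r.n_subreddits), "toxicity": float(r.toxicity_score)}
         for r in clusters.itertuples()]

with open("docs/assets/data/nodes.json", "w") as f:
    json.dump(nodes, f, indent=2)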

Reproducibility

Running the Full Pipeline

Execute results.ipynb to reproduce all results:

  1. Steps 1-7 generate all processed datasets
  2. Research questions section generates 20 analysis outputs
  3. Visualization cells create 17 PNG charts
  4. JSON preparation creates web visualization data

Expected Outputs

After running the notebook:

  • data/processed/: 17 CSV files + 1 directory with 20 RQ results
  • results/figures/: 17 PNG visualization files
  • docs/assets/data/: 11 JSON files

Limitations

  1. Temporal Scope: Data from 2014-2017 predates several major world events (e.g., the COVID-19 pandemic, the Russia-Ukraine war, most of the Trump presidency). Current dynamics may differ significantly.

  2. Language Bias: Analysis is English-dominant. Non-English clusters (Japanese, German, Brazilian) are underrepresented and may have biased LIWC scores.

  3. Sarcasm Blind Spot: LIWC cannot detect sarcasm or irony. A post saying "Great job destroying the subreddit" scores as positive emotion.

  4. Sampling Approximation: Betweenness centrality is estimated with k=50 sampled nodes rather than all 67,180. Rankings are reliable, but absolute values are approximations.

  5. Manual Labeling Subjectivity: 40 topic labels assigned by manual inspection. While validated with top-20 exemplars per cluster, some edge cases required 66 manual overrides.

  6. Correlation ≠ Causation: We identify patterns (e.g., toxicity contagion), not mechanisms. Experimental or longitudinal data needed for causal claims.


Team Contributions

Amer Lakrami: Handled embeddings and clustering, and engineered key research questions.

Hamza Barrada: Handled data integration and designed the interactive visualizations.

Omar El Khyari: Focused on LIWC analysis and role classification, and created the project webpage.

Omar Zakariya: Conducted network analysis and assisted with the webpage, visualizations, and Jupyter notebook.

Cesar Illanes: Responsible for the README and project documentation.

Collaborative Work:

  • Website design and user experience testing (all members)
  • Methodological discussions and RQ refinement (all members)

References

Datasets:

  • Kumar, S., Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2018). Community interaction and conflict on the web. Proceedings of the 2018 World Wide Web Conference, 933-943.
  • Hamilton, W. L., Zhang, J., Danescu-Niculescu-Mizil, C., Jurafsky, D., & Leskovec, J. (2017). Loyalty in online communities. Proceedings of the International AAAI Conference on Web and Social Media, 11(1).

Project Repository: https://github.com/epfl-ada/ada-2025-project-barrada
Data Story: https://epfl-ada.github.io/ada-2025-project-barrada/

License: MIT
