
Reddit’s Invisible Brain: The Web Hidden Beneath the Threads

→ View Interactive Data Story


Overview

Reddit is not just a collection of online forums; it is a living ecosystem of interconnected communities, each with its own emotional and cognitive identity. Through hyperlinks, subreddits comment on, critique, and reference one another, forming a complex web of inter-community relationships.

This project uncovers the hidden psychological structure underlying that web by integrating three distinct perspectives:

  • Structural: Who talks about whom? (Network topology, centrality, community structure)
  • Psychological: How do they talk? (Emotional tone, cognitive style, LIWC linguistic analysis)
  • Semantic: What are they about? (Topic embeddings and clustering)

By combining these layers, we visualized Reddit as a "social MRI", revealing how communities align, clash, or coexist not only by what they discuss but also by how they think and feel. This project tells the story of Reddit's invisible brain: the emotional and cognitive currents that shape the flow of ideas across its digital landscape.

Project Scope:

  • 858,488 hyperlinks across 67,180 subreddits (2014-2017)
  • 3 integrated data layers (Structural, Psychological, Semantic)
  • 40 topic clusters with manual semantic validation
  • 20 statistical research questions across 5 thematic areas
  • 11 interactive web visualizations + 17 static analytical charts

Quickstart

# clone project
git clone https://github.com/epfl-ada/ada-2025-project-barrada.git barrADA
cd barrADA
# create conda environment (optional)
conda create -n barrada python=3.13
conda activate barrada
# install requirements
pip install -r pip_requirements.txt
# create data folders and download initial datasets
mkdir -p data/hyperlink_network && mkdir -p data/subreddit_embeddings
wget -O data/hyperlink_network/soc-redditHyperlinks-body.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv"
wget -O data/hyperlink_network/soc-redditHyperlinks-title.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-title.tsv"
wget -O data/subreddit_embeddings/web-redditEmbeddings-subreddits.csv "https://snap.stanford.edu/data/web-redditEmbeddings-subreddits.csv"
# launch Jupyter to explore results.ipynb or the Python modules inside src/
jupyter lab

Project Structure

├── data/
│   ├── hyperlink_network/              # Raw SNAP datasets (not in repo)
│   │   ├── soc-redditHyperlinks-body.tsv
│   │   └── soc-redditHyperlinks-title.tsv
│   ├── subreddit_embeddings/           # Raw embeddings (not in repo)
│   │   └── web-redditEmbeddings-subreddits.csv
│   └── processed/                        # Generated by pipeline
│       ├── combined_hyperlinks.csv        # Step 1: 858k links + 86 features
│       ├── subreddit_features_source.csv  # Step 2: Outgoing LIWC profiles
│       ├── subreddit_features_target.csv  # Step 2: Incoming LIWC profiles
│       ├── subreddit_roles.csv            # Step 2: Role classifications
│       ├── network_node_metrics.csv       # Step 3: PageRank, betweenness, etc.
│       ├── network_communities.csv        # Step 3: Louvain communities
│       ├── embeddings_processed.csv       # Step 4: PCA-reduced 50D vectors
│       ├── embeddings_kmeans_40.csv       # Step 5: Topic cluster assignments
│       ├── cluster_labels_40.csv          # Step 5: Manual topic labels
│       ├── final_dataset.csv              # Step 6: Master dataset (67k × 161)
│       ├── cluster_master_dataset.csv     # Step 7: Cluster-level aggregation (40 × 110)
│       └── rq_analysis/                   # 20 research question outputs
│           ├── rq1_role_patterns.csv
│           └── ...
│
├── src/                                
│   ├── data_processing.py              # Step 1: Hyperlink loading + LIWC parsing
│   ├── liwc_analysis.py                # Step 2: Psychological profiling
│   ├── network_analysis.py             # Step 3: Graph metrics + communities
│   ├── embedding_processing.py         # Step 4: PCA dimensionality reduction
│   ├── topic_clustering.py             # Step 5: K-Means + manual labeling
│   ├── integration.py                  # Step 6: Merge all layers
│   ├── cluster_aggregation.py          # Step 7: Cluster-level aggregation
│   ├── research_questions.py           # 20 statistical tests
│   └── visualize_pipeline.py           # 17 static visualizations
│
├── scripts/
│   └── prepare_data.py                 # Generate 11 JSON files for web visualizations
│
├── results/
│   └── figures/                        # 17 PNG charts (step1-step6)
│
├── docs/                               # Jekyll website (GitHub Pages)
│   ├── assets/
│   │   ├── data/                      
│   │   │   ├── nodes.json              
│   │   │   ├── edges.json              
│   │   │   ├── rivalry.json           
│   │   │   ├── toxicity.json          
│   │   │   ├── insurgency.json         
│   │   │   ├── echo.json               
│   │   │   ├── roles.json              
│   │   │   ├── roles_scatter.json   
│   │   │   ├── power.json             
│   │   │   ├── bridges.json            
│   │   │   └── civility.json           
│   │   ├── visualizations/             # 10 interactive HTML embeds
│   │   │   ├── topic_continents_network.html       # Force-directed graph
│   │   │   ├── roles_quadrant_scatter.html         # Role classification scatter
│   │   │   ├── rivalry_chord.html                  # Rivalry flows
│   │   │   ├── toxicity_flow_sankey.html           # Toxicity flow
│   │   │   ├── insurgency_power_waffle.html        # Attack direction patterns
│   │   │   ├── isolation_civility_slope.html       # Isolation vs civility
│   │   │   ├── bridges_linguistic_bar.html         # Bridge predictors
│   │   │   ├── roles_synapses_bundling.html        # Role interaction flows
│   │   │   ├── network_synthesis_explorer.html     # Interactive 3-layer explorer
│   │   │   └── alliances_chord.html                # Positive sentiment flows
│   │   └── img/                        # Static PNG assets for website
│   └── reddit-story.html               # Main data story page
│
├── results.ipynb                       # Complete technical notebook
├── README.md                           # This file
├── .gitignore                          # List of files ignored by git
├── LICENSE                             # MIT License
└── pip_requirements.txt                # Python dependencies

Methods

Data Sources

We used three datasets from the Stanford Network Analysis Project (SNAP):

  1. Reddit Hyperlink Network (Body + Title): two files of hyperlinks extracted from post bodies and titles (2014-2017)
  2. Subreddit Embeddings: 300-dimensional topic vectors for 51,278 subreddits

Each hyperlink contains:

  • Source and target subreddit
  • Timestamp
  • Sentiment label (+1 positive/neutral, -1 negative)
  • PROPERTIES string with 21 text features + 65 LIWC psychological scores
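A minimal pandas sketch of loading and combining the two hyperlink files is shown below; the column names follow the SNAP documentation, while the actual cleaning logic lives in src/data_processing.py and may differ in detail.

import pandas as pd

# Load both hyperlink datasets (tab-separated; one row per cross-subreddit link)
body = pd.read_csv("data/hyperlink_network/soc-redditHyperlinks-body.tsv", sep="\t")
title = pd.read_csv("data/hyperlink_network/soc-redditHyperlinks-title.tsv", sep="\t")

links = pd.concat([body, title], ignore_index=True)
links["TIMESTAMP"] = pd.to_datetime(links["TIMESTAMP"])

# Columns: SOURCE_SUBREDDIT, TARGET_SUBREDDIT, POST_ID, TIMESTAMP,
#          LINK_SENTIMENT (+1 / -1), PROPERTIES (comma-separated feature string)
print(links.shape, links["LINK_SENTIMENT"].value_counts().to_dict())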

Analysis Pipeline (7 Steps)

Step 1: Data Processing

  • Combined body and title hyperlinks (858,490 total links)
  • Parsed PROPERTIES string into 86 distinct features:
    • 21 Text Properties: word count, readability, VADER sentiment, compound score, etc.
    • 65 LIWC Features: psychological dimensions (anger, anxiety, certainty, cognitive complexity, etc.)
  • Standardized subreddit names, removed self-loops, validated data integrity
  • Output: combined_hyperlinks.csv
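As a rough illustration of the PROPERTIES parsing, continuing the loading sketch above (the placeholder feature names stand in for the documented text/LIWC names used by src/data_processing.py):

import pandas as pd

# PROPERTIES is a comma-separated string of 86 numeric values:
# 21 text properties followed by 65 LIWC scores.
N_TEXT, N_LIWC = 21, 65
feature_names = ([f"text_{i}" for i in range(N_TEXT)]      # placeholder names
                 + [f"liwc_{i}" for i in range(N_LIWC)])    # placeholder names

props = links["PROPERTIES"].str.split(",", expand=True).astype(float)
props.columns = feature_names

combined = pd.concat([links.drop(columns="PROPERTIES"), props], axis=1)

# Standardize subreddit names and drop self-loops
for col in ("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT"):
    combined[col] = combined[col].str.lower().str.strip()
combined = combined[combined["SOURCE_SUBREDDIT"] != combined["TARGET_SUBREDDIT"]]
combined.to_csv("data/processed/combined_hyperlinks.csv", index=False)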

Step 2: LIWC Analysis (Psychological Layer)

  • Aggregated 65 LIWC scores per subreddit as both source (outgoing) and target (incoming)
  • Computed composite psychological metrics:
    • Emotion_Total: Sum of positive and negative emotion
    • Negemo_Specific: Mean of anger, anxiety, sadness
    • Cognitive_Total: Mean of certainty, tentativeness, insight, discrepancy, causation
  • Calculated asymmetry features: outgoing - incoming (e.g., "I speak with anger" vs "others speak to me with anger")
  • Classified subreddits into 4 social roles:
    • Critical: High outgoing negativity
    • Controversial: High incoming negativity
    • Supportive: High outgoing positive/neutral links
    • Influential: High incoming positive/neutral links
  • Outputs: subreddit_features_source.csv, subreddit_features_target.csv, subreddit_roles.csv
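A condensed sketch of the aggregation and asymmetry features, continuing from the Step 1 sketch (the LIWC column names and exact composite definitions in src/liwc_analysis.py are assumptions here):

import pandas as pd

# Per-subreddit psychological profiles, from two points of view
liwc_cols = [c for c in combined.columns if c.startswith("liwc_")]       # placeholder names
source_profile = combined.groupby("SOURCE_SUBREDDIT")[liwc_cols].mean()  # outgoing tone
target_profile = combined.groupby("TARGET_SUBREDDIT")[liwc_cols].mean()  # incoming tone

# Composite metrics (as defined above, using the project's LIWC column names):
#   Emotion_Total   = posemo + negemo
#   Negemo_Specific = mean(anger, anxiety, sadness)
#   Cognitive_Total = mean(certainty, tentativeness, insight, discrepancy, causation)

# Asymmetry: how a community speaks (outgoing) minus how it is spoken to (incoming)
asymmetry = source_profile.subtract(target_profile, fill_value=0)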

Step 3: Network Analysis (Structural Layer)

  • Built directed graph G(V, E) with 67,180 nodes and 858,490 weighted edges
  • Computed centrality metrics:
    • PageRank: Prestige (who gets linked to)
    • Betweenness Centrality: Bridges (who connects disconnected communities)
      • Used k=50 sampling for computational efficiency
      • Sufficient for ranking; absolute values are approximations
    • HITS Algorithm: Hub scores (good linkers) and Authority scores (good content)
  • Detected communities with the Louvain method on an undirected projection of the graph (modularity optimization)
  • Built separate graphs for positive-only and negative-only sentiment
  • Outputs: network_node_metrics.csv, network_communities.csv, network_edges.csv
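The graph metrics can be reproduced with networkx along the following lines (a sketch; src/network_analysis.py may use different parameters or the python-louvain package for community detection):

import networkx as nx

# Directed, weighted graph: edge weight = number of parallel hyperlinks
edges = (combined.groupby(["SOURCE_SUBREDDIT", "TARGET_SUBREDDIT"])
         .size().reset_index(name="weight"))
G = nx.from_pandas_edgelist(edges, "SOURCE_SUBREDDIT", "TARGET_SUBREDDIT",
                            edge_attr="weight", create_using=nx.DiGraph)

pagerank = nx.pagerank(G, weight="weight")                  # prestige
betweenness = nx.betweenness_centrality(G, k=50, seed=42)   # sampled approximation
hubs, authorities = nx.hits(G, max_iter=1000)               # HITS scores

# Louvain on the undirected projection (modularity optimization)
communities = nx.community.louvain_communities(G.to_undirected(), seed=42)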

Step 4: Embedding Processing (Semantic Layer)

  • Loaded 300-dimensional subreddit embeddings (51,278 subreddits)
  • Applied PCA to reduce to 50 dimensions
    • Captures ~92% of variance
    • Reduces noise and improves clustering
  • Outputs: embeddings_processed.csv, pca_variance.csv
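A minimal scikit-learn sketch of the reduction (the embeddings CSV is assumed to have no header, with the subreddit name in the first column followed by the 300 components):

import pandas as pd
from sklearn.decomposition import PCA

emb = pd.read_csv("data/subreddit_embeddings/web-redditEmbeddings-subreddits.csv",
                  header=None)
names, X = emb.iloc[:, 0], emb.iloc[:, 1:].to_numpy()

pca = PCA(n_components=50, random_state=42)
X50 = pca.fit_transform(X)   # one 50-D vector per subreddit
print(f"variance captured: {pca.explained_variance_ratio_.sum():.1%}")   # ~92% reported above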

Step 5: Topic Clustering

  • Ran K-Means clustering with K=40 on PCA-reduced embeddings
  • Why K=40?
    • Tested K=20, K=40, K=50
    • K=50 scored better on silhouette but produced semantically incoherent clusters
    • K=40 offered the best balance between statistical fit and interpretability
  • Manual Labeling Process:
    1. For each cluster, examined top-20 subreddits by link volume
    2. Assigned semantic labels (e.g., "Gaming," "Politics," "Meta-Commentary")
    3. Applied 66 manual overrides to fix edge cases (e.g., r/steam initially misclassified into a cooking cluster)
  • Outputs: embeddings_kmeans_40.csv, cluster_labels_40.csv
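A sketch of the clustering step, run on the PCA vectors from Step 4 (hyperparameters here are illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=40, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X50)

# Silhouette on a sample keeps the K=20/40/50 comparison tractable
print(silhouette_score(X50, cluster_ids, sample_size=10_000, random_state=42))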

Step 6: Integration

  • Merged all processed datasets on subreddit key:
    • Network metrics (PageRank, betweenness, communities)
    • LIWC psychological profiles (source + target)
    • Topic cluster assignments and labels
    • Social role classifications
  • Output: final_dataset.csv (67,180 rows × 161 features)
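Conceptually, the integration is a sequence of left joins on the subreddit name (a sketch; the join key and column layout of the processed files are assumptions):

import pandas as pd
from functools import reduce

layers = [pd.read_csv(f"data/processed/{name}") for name in (
    "network_node_metrics.csv", "subreddit_features_source.csv",
    "subreddit_features_target.csv", "embeddings_kmeans_40.csv",
    "subreddit_roles.csv")]

# Left joins keep every one of the 67,180 network nodes, even those without embeddings
final = reduce(lambda left, right: left.merge(right, on="subreddit", how="left"), layers)
final.to_csv("data/processed/final_dataset.csv", index=False)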

Step 7: Cluster Aggregation

  • Aggregated subreddit-level data to cluster level
  • Calculated per-cluster:
    • Mean network metrics (PageRank, betweenness, degree)
    • Mean psychological scores (all 65 LIWC dimensions)
    • Insularity (percentage of internal links)
    • Sentiment flows (incoming/outgoing positive/negative)
    • Role distributions (percentage Critical, Supportive, etc.)
    • Top-5 exemplar subreddits (by PageRank, betweenness, in-degree)
    • Derived scores (toxicity, analytical, emotional)
  • Output: cluster_master_dataset.csv (40 clusters × 110 features)
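At its core, Step 7 is a pandas groupby over the master dataset (column names such as cluster_id and pagerank are assumptions about final_dataset.csv):

import pandas as pd

final = pd.read_csv("data/processed/final_dataset.csv")

cluster_stats = final.groupby("cluster_id").agg(
    n_subreddits=("subreddit", "size"),
    mean_pagerank=("pagerank", "mean"),
    mean_betweenness=("betweenness", "mean"),
)

# Top-5 exemplar subreddits per cluster, e.g. by PageRank
cluster_stats["top5_by_pagerank"] = (
    final.sort_values("pagerank", ascending=False)
         .groupby("cluster_id")["subreddit"]
         .apply(lambda s: ", ".join(s.head(5))))

cluster_stats.to_csv("data/processed/cluster_master_dataset.csv")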

Research Question Framework

We conducted statistical tests organized into 5 thematic areas:


1. Roles & Psychology

Methods: ANOVA, t-tests, correlation analysis

  • RQ1.1: What linguistic patterns distinguish social roles? Method: compare LIWC means across roles (Critical/Supportive/etc.)
  • RQ1.2: Which emotions dominate conflict vs support? Method: t-test of LIWC scores on positive vs negative links
  • RQ1.3: Do communities experience emotional asymmetry? Method: compare outgoing vs incoming LIWC (paired differences)
  • RQ1.4: Do ideological neighbors attack each other more? Method: compare within-cluster vs cross-cluster sentiment

2. Echo Chambers

Methods: Correlation, purity index, regression

  • RQ2.1: Does certainty correlate with insularity? Method: Pearson r between certainty score and internal link percentage
  • RQ2.2: Where do semantic and structural clustering align? Method: purity matrix of topic cluster × network community
  • RQ2.3: What predicts isolation? Method: regression, insularity ~ LIWC + network features

3. Network Structure

Methods: Correlation, flow matrices, bridge analysis

  • RQ3.1: What makes a bridge? Method: correlation of betweenness with LIWC/role features
  • RQ3.2: Which topics act as bridges? Method: cross-cluster link density by topic
  • RQ3.3: What are the major rivalries and alliances? Method: net sentiment flow matrix between clusters

4. Conflict Patterns

Methods: Cluster comparisons, sentiment aggregation

  • RQ4.1: How are roles distributed across topics? Method: role percentage per cluster
  • RQ4.2: Which clusters attack outward most? Method: external negativity (outgoing negative - incoming positive)
  • RQ4.3: Which clusters attack themselves? Method: internal civility (negative links within cluster)
  • RQ4.4: What language predicts internal peace? Method: correlation of civility with LIWC features
  • RQ4.5: What language predicts external war? Method: correlation of external negativity with LIWC

5. Power Dynamics

Methods: Flow analysis, regression, directional tests

  • RQ5.1: Who are the punching bags? Method: net toxicity flow (incoming negative - outgoing negative)
  • RQ5.2: What predicts PageRank? Method: correlation of PageRank with LIWC/betweenness/degree
  • RQ5.3: Do attacks flow up or down the hierarchy? Method: compare low-PageRank → high-PageRank vs high-PageRank → low-PageRank links
  • RQ5.4: Does receiving hate breed sending hate? Method: correlation of incoming negativity with outgoing negativity
  • RQ5.5: Are critics analytical or emotional? Method: compare cognitive vs emotional LIWC in negative links

Statistical Methods:

  • Correlation Analysis: Pearson r for continuous variables
  • T-tests: Compare group means (e.g., positive vs negative link language)
  • ANOVA: Compare multiple groups (e.g., LIWC across 4 roles)
  • Flow Matrices: Net sentiment between clusters (incoming - outgoing)
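For reference, these tests map onto standard SciPy and pandas calls roughly as follows (variable and column names are placeholders for the processed tables):

from scipy import stats

# Pearson correlation, e.g. certainty vs insularity (RQ2.1)
r, p = stats.pearsonr(final["certainty"], final["insularity"])

# Welch t-test, e.g. an LIWC score on negative vs positive links (RQ1.2)
t, p = stats.ttest_ind(neg_links["liwc_anger"], pos_links["liwc_anger"], equal_var=False)

# One-way ANOVA, e.g. an LIWC score across the four roles (RQ1.1)
F, p = stats.f_oneway(*[g["liwc_anger"].to_numpy() for _, g in final.groupby("role")])

# Net sentiment flow between clusters (RQ3.3): outgoing minus incoming
flows = links_with_clusters.pivot_table(index="src_cluster", columns="tgt_cluster",
                                        values="LINK_SENTIMENT", aggfunc="sum", fill_value=0)
net_flow = flows.subtract(flows.T, fill_value=0)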

All outputs saved to data/processed/rq_analysis/ for reproducibility.

Visualization Pipeline

Static Visualizations (17 PNG charts)

Generated via src/visualize_pipeline.py, saved to results/figures/:

  • Step 1: Sentiment distribution, temporal patterns, top sources/targets, LIWC profiles, attack patterns (7 plots)
  • Step 2: Psychological asymmetry, role quadrants, top influential/supported (4 plots)
  • Step 3: Centrality rankings (PageRank, betweenness, hubs, authorities) (1 plot)
  • Step 4: PCA variance explained (1 plot)
  • Step 5: Topic cluster size distribution (1 plot)
  • Step 6: Topic-network-role integration, LIWC role lift, echo chamber heatmap (3 plots)

Interactive Web Visualizations (11 HTML/JSON)

Generated via scripts/prepare_data.py, embedded in data story:

  • topic_continents_network.html: Force-directed graph of 40 topic clusters
  • roles_quadrant_scatter.html: 4-quadrant role classification (x=outgoing, y=incoming sentiment)
  • rivalry_chord.html: Chord diagram of negative sentiment flows between clusters
  • toxicity_flow_sankey.html: Sankey diagram of toxicity propagation
  • insurgency_power_waffle.html: Waffle chart showing attack direction (up vs down)
  • isolation_civility_slope.html: Slope graph comparing isolation and internal civility
  • bridges_linguistic_bar.html: Bar chart of betweenness predictors
  • roles_synapses_bundling.html: Hierarchical edge bundling of role interactions
  • network_synthesis_explorer.html: Interactive 3-layer data explorer
  • alliances_chord.html: Chord diagram of positive sentiment flows
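A minimal sketch of the kind of export scripts/prepare_data.py performs for one of these embeds (the JSON schema expected by the web pages and the column names used below are assumptions):

import json
import pandas as pd

clusters = pd.read_csv("data/processed/cluster_master_dataset.csv")

# One node per topic cluster for the force-directed graph
nodes = [{"id": int(r.cluster_id), "label": str(r.cluster_label),
          "size": int(r.n_subreddits), "toxicity": float(r.toxicity_score)}
         for r in clusters.itertuples()]

with open("docs/assets/data/nodes.json", "w") as f:
    json.dump(nodes, f, indent=2)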

Reproducibility

Running the Full Pipeline

Execute results.ipynb to reproduce all results:

  1. Steps 1-7 generate all processed datasets
  2. Research questions section generates 20 analysis outputs
  3. Visualization cells create 17 PNG charts
  4. JSON preparation creates web visualization data

Expected Outputs

After running the notebook:

  • data/processed/: 17 CSV files + 1 directory with 20 RQ results
  • results/figures/: 17 PNG visualization files
  • docs/assets/data/: 11 JSON files

Limitations

  1. Temporal Scope: Data from 2014-2017 predates several major world events (e.g., the COVID-19 pandemic, the Russia-Ukraine war, most of the Trump presidency). Current dynamics may differ significantly.

  2. Language Bias: Analysis is English-dominant. Non-English clusters (Japanese, German, Brazilian) are underrepresented and may have biased LIWC scores.

  3. Sarcasm Blind Spot: LIWC cannot detect sarcasm or irony. A post saying "Great job destroying the subreddit" scores as positive emotion.

  4. Sampling Approximation: Betweenness centrality is estimated with k=50 sampled nodes rather than all 67,180. Rankings are reliable, but absolute values are approximations.

  5. Manual Labeling Subjectivity: 40 topic labels assigned by manual inspection. While validated with top-20 exemplars per cluster, some edge cases required 66 manual overrides.

  6. Correlation ≠ Causation: We identify patterns (e.g., toxicity contagion), not mechanisms. Experimental or longitudinal data needed for causal claims.


Team Contributions

Amer Lakrami: Handled embeddings and clustering, and engineered key research questions.

Hamza Barrada: Handled data integration and designed the interactive visualizations.

Omar El Khyari: Focused on LIWC analysis and role classification, and created the project webpage.

Omar Zakariya: Conducted network analysis and assisted with the webpage, visualizations, and Jupyter notebook.

Cesar Illanes: Responsible for the README and project documentation.

Collaborative Work:

  • Website design and user experience testing (all members)
  • Methodological discussions and RQ refinement (all members)

References

Datasets:

  • Kumar, S., Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2018). Community interaction and conflict on the web. Proceedings of the 2018 World Wide Web Conference, 933-943.
  • Hamilton, W. L., Zhang, J., Danescu-Niculescu-Mizil, C., Jurafsky, D., & Leskovec, J. (2017). Loyalty in online communities. Proceedings of the International AAAI Conference on Web and Social Media, 11(1).

Project Repository: https://github.com/epfl-ada/ada-2025-project-barrada
Data Story: https://epfl-ada.github.io/ada-2025-project-barrada/

License: MIT
