Reddit’s Invisible Brain: The Web Hidden Beneath the Threads
Reddit is not just a collection of online forums; it is a living ecosystem of interconnected communities, each with its own emotional and cognitive identity. Through hyperlinks, subreddits comment on, critique, and reference one another, forming a complex web of inter-community relationships.
This project uncovered the hidden psychological structure underlying that web by integrating three distinct perspectives:
- Structural: Who talks about whom? (Network topology, centrality, community structure)
- Psychological: How do they talk? (Emotional tone, cognitive style, LIWC linguistic analysis)
- Semantic: What are they about? (Topic embeddings and clustering)
By combining these layers, we visualized Reddit as a "social MRI", revealing how communities align, clash, or coexist not only by what they discuss but also by how they think and feel. This project tells the story of Reddit's invisible brain: the emotional and cognitive currents that shape the flow of ideas across its digital landscape.
Project Scope:
- 858,488 hyperlinks across 67,180 subreddits (2014-2017)
- 3 integrated data layers (Structural, Psychological, Semantic)
- 40 topic clusters with manual semantic validation
- 20 statistical research questions across 5 thematic areas
- 11 interactive web visualizations + 17 static analytical charts
# clone project
git clone https://github.com/epfl-ada/ada-2025-project-barrada.git barrADA
cd barrADA
# create conda environment (optional)
conda create -n barrada python=3.13
conda activate barrada
# install requirements
pip install -r pip_requirements.txt
# create data folders and download initial datasets
mkdir -p data/hyperlink_network && mkdir -p data/subreddit_embeddings
wget -O data/hyperlink_network/soc-redditHyperlinks-body.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv"
wget -O data/hyperlink_network/soc-redditHyperlinks-title.tsv "https://snap.stanford.edu/data/soc-redditHyperlinks-title.tsv"
wget -O data/subreddit_embeddings/web-redditEmbeddings-subreddits.csv "https://snap.stanford.edu/data/web-redditEmbeddings-subreddits.csv"
# run server to look into results.ipynb file or python files inside src/
jupyter lab

├── data/
│ ├── hyperlink_network/ # Raw SNAP datasets (not in repo)
│ │ ├── soc-redditHyperlinks-body.tsv
│ │ └── soc-redditHyperlinks-title.tsv
│ ├── subreddit_embeddings/ # Raw embeddings (not in repo)
│ │ └── web-redditEmbeddings-subreddits.csv
│ └── processed/ # Generated by pipeline
│ ├── combined_hyperlinks.csv # Step 1: 858k links + 86 features
│ ├── subreddit_features_source.csv # Step 2: Outgoing LIWC profiles
│ ├── subreddit_features_target.csv # Step 2: Incoming LIWC profiles
│ ├── subreddit_roles.csv # Step 2: Role classifications
│ ├── network_node_metrics.csv # Step 3: PageRank, betweenness, etc.
│ ├── network_communities.csv # Step 3: Louvain communities
│ ├── embeddings_processed.csv # Step 4: PCA-reduced 50D vectors
│ ├── embeddings_kmeans_40.csv # Step 5: Topic cluster assignments
│ ├── cluster_labels_40.csv # Step 5: Manual topic labels
│ ├── final_dataset.csv # Step 6: Master dataset (67k × 161)
│ ├── cluster_master_dataset.csv # Step 7: Cluster-level aggregation (40 × 110)
│ └── rq_analysis/ # 20 research question outputs
│ ├── rq1_role_patterns.csv
│ └── ...
│
├── src/
│ ├── data_processing.py # Step 1: Hyperlink loading + LIWC parsing
│ ├── liwc_analysis.py # Step 2: Psychological profiling
│ ├── network_analysis.py # Step 3: Graph metrics + communities
│ ├── embedding_processing.py # Step 4: PCA dimensionality reduction
│ ├── topic_clustering.py # Step 5: K-Means + manual labeling
│ ├── integration.py # Step 6: Merge all layers
│ ├── cluster_aggregation.py # Step 7: Cluster-level aggregation
│ ├── research_questions.py # 20 statistical tests
│ └── visualize_pipeline.py # 17 static visualizations
│
├── scripts/
│ └── prepare_data.py # Generate 11 JSON files for web visualizations
│
├── results/
│ └── figures/ # 17 PNG charts (step1-step6)
│
├── docs/ # Jekyll website (GitHub Pages)
│ ├── assets/
│ │ ├── data/
│ │ │ ├── nodes.json
│ │ │ ├── edges.json
│ │ │ ├── rivalry.json
│ │ │ ├── toxicity.json
│ │ │ ├── insurgency.json
│ │ │ ├── echo.json
│ │ │ ├── roles.json
│ │ │ ├── roles_scatter.json
│ │ │ ├── power.json
│ │ │ ├── bridges.json
│ │ │ └── civility.json
│ │ ├── visualizations/ # 10 interactive HTML embeds
│ │ │ ├── topic_continents_network.html # Force-directed graph
│ │ │ ├── roles_quadrant_scatter.html # Role classification scatter
│ │ │ ├── rivalry_chord.html # Rivalry flows
│ │ │ ├── toxicity_flow_sankey.html # Toxicity flow
│ │ │ ├── insurgency_power_waffle.html # Attack direction patterns
│ │ │ ├── isolation_civility_slope.html # Isolation vs civility
│ │ │ ├── bridges_linguistic_bar.html # Bridge predictors
│ │ │ ├── roles_synapses_bundling.html # Role interaction flows
│ │ │ ├── network_synthesis_explorer.html # Interactive 3-layer explorer
│ │ │ └── alliances_chord.html # Positive sentiment flows
│ │ └── img/ # Static PNG assets for website
│ └── reddit-story.html # Main data story page
│
├── results.ipynb # Complete technical notebook
├── README.md # This file
├── .gitignore # List of files ignored by git
├── LICENSE # MIT License
└── pip_requirements.txt # Python dependencies
We used three datasets from the Stanford Network Analysis Project (SNAP):
- Reddit Hyperlink Network (Body + Title): hyperlinks from post bodies and titles (2014-2017)
- Subreddit Embeddings: 300-dimensional topic vectors for 51,278 subreddits
Each hyperlink contains:
- Source and target subreddit
- Timestamp
- Sentiment label (+1 positive/neutral, -1 negative)
- PROPERTIES string with 21 text features + 65 LIWC psychological scores
Step 1: Hyperlink loading and LIWC parsing (src/data_processing.py)
- Combined body and title hyperlinks (858,490 total links)
- Parsed PROPERTIES string into 86 distinct features:
  - 21 Text Properties: word count, readability, VADER sentiment, compound score, etc.
  - 65 LIWC Features: psychological dimensions (anger, anxiety, certainty, cognitive complexity, etc.)
- Standardized subreddit names, removed self-loops, validated data integrity
- Output: combined_hyperlinks.csv
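The parsing step above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: it assumes the TSV header uses the column names shown in the sample, the feature names are hypothetical, and the two sample rows are made up (real rows carry 86 values in PROPERTIES).

```python
import csv
import io

# Two made-up rows in the assumed SNAP TSV layout. Real rows carry 86
# comma-separated values in PROPERTIES; 4 are used here for readability.
SAMPLE_TSV = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tPOST_ID\tTIMESTAMP\tLINK_SENTIMENT\tPROPERTIES\n"
    "Gaming\tsteam\tp1\t2014-01-01 10:00:00\t1\t120.0,0.5,0.1,0.0\n"
    "steam\tsteam\tp2\t2014-01-02 11:00:00\t-1\t80.0,0.2,0.0,0.3\n"
)

# Hypothetical feature names standing in for the 21 text + 65 LIWC names.
FEATURE_NAMES = ["n_chars", "vader_pos", "liwc_anger", "liwc_anx"]


def parse_links(handle, feature_names):
    """Expand PROPERTIES into named float columns and drop self-loops."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        # Standardize names so "Gaming" and "gaming" map to one node.
        source = rec["SOURCE_SUBREDDIT"].strip().lower()
        target = rec["TARGET_SUBREDDIT"].strip().lower()
        if source == target:
            continue  # self-loop: a subreddit linking to itself
        values = [float(v) for v in rec["PROPERTIES"].split(",")]
        if len(values) != len(feature_names):
            raise ValueError(f"expected {len(feature_names)} features, got {len(values)}")
        row = {"source": source, "target": target,
               "sentiment": int(rec["LINK_SENTIMENT"])}
        row.update(zip(feature_names, values))
        rows.append(row)
    return rows


links = parse_links(io.StringIO(SAMPLE_TSV), FEATURE_NAMES)
```

The length check doubles as a simple integrity validation: a row whose PROPERTIES string does not split into the expected number of values raises an error instead of silently misaligning columns.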
Step 2: Psychological profiling (src/liwc_analysis.py)
- Aggregated 65 LIWC scores per subreddit as both source (outgoing) and target (incoming)
- Computed composite psychological metrics:
  - Emotion_Total: Sum of positive and negative emotion
  - Negemo_Specific: Mean of anger, anxiety, sadness
  - Cognitive_Total: Mean of certainty, tentativeness, insight, discrepancy, causation
- Calculated asymmetry features: outgoing - incoming (e.g., "I speak with anger" vs "others speak to me with anger")
- Classified subreddits into 4 social roles:
  - Critical: High outgoing negativity
  - Controversial: High incoming negativity
  - Supportive: High outgoing positive/neutral links
  - Influential: High incoming positive/neutral links
- Outputs: subreddit_features_source.csv, subreddit_features_target.csv, subreddit_roles.csv
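The composite metrics and asymmetry features listed above reduce to a few arithmetic operations. The sketch below assumes hypothetical dictionary keys for the LIWC categories and uses made-up scores; the actual column names in the pipeline may differ.

```python
from statistics import mean


def composite_metrics(liwc):
    """Build the three composite scores from one LIWC profile.

    The keys (posemo, negemo, anger, ...) are hypothetical names
    for the LIWC categories mentioned in the text.
    """
    return {
        "Emotion_Total": liwc["posemo"] + liwc["negemo"],
        "Negemo_Specific": mean([liwc["anger"], liwc["anx"], liwc["sad"]]),
        "Cognitive_Total": mean([liwc["certain"], liwc["tentat"], liwc["insight"],
                                 liwc["discrep"], liwc["cause"]]),
    }


def asymmetry(outgoing, incoming):
    """Outgoing minus incoming, feature by feature."""
    return {name: outgoing[name] - incoming[name] for name in outgoing}


# Made-up mean LIWC scores for one subreddit as source and as target.
outgoing = {"posemo": 0.04, "negemo": 0.06, "anger": 0.03, "anx": 0.01, "sad": 0.02,
            "certain": 0.02, "tentat": 0.03, "insight": 0.02, "discrep": 0.01, "cause": 0.02}
incoming = {"posemo": 0.05, "negemo": 0.02, "anger": 0.01, "anx": 0.005, "sad": 0.005,
            "certain": 0.02, "tentat": 0.02, "insight": 0.02, "discrep": 0.01, "cause": 0.01}

out_metrics = composite_metrics(outgoing)
in_metrics = composite_metrics(incoming)
gap = asymmetry(out_metrics, in_metrics)
```

A positive value in `gap` means the subreddit expresses more of that dimension than it receives; a negative value means the reverse.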
Step 3: Graph metrics and communities (src/network_analysis.py)
- Built directed graph G(V, E) with 67,180 nodes and 858,490 weighted edges
- Computed centrality metrics:
  - PageRank: Prestige (who gets linked to)
  - Betweenness Centrality: Bridges (who connects otherwise disconnected communities)
    - Used k=50 sampling for computational efficiency
    - Sufficient for ranking; absolute values are approximations
  - HITS Algorithm: Hub scores (good linkers) and Authority scores (good content)
- Detected communities using the Louvain method on the undirected version of the graph (optimizes modularity)
- Built separate graphs for positive-only and negative-only sentiment
- Outputs: network_node_metrics.csv, network_communities.csv, network_edges.csv
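To make the PageRank "prestige" idea concrete, here is a from-scratch power-iteration sketch on a toy weighted graph. It is not the project's implementation (which presumably relies on a graph library); the damping factor of 0.85 is the conventional default and an assumption here, and the edge data is made up.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration.

    edges maps (source, target) to a positive link weight.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its rank along its links, proportionally to weight.
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: three subreddits link to "hub", which links back to "a" only.
toy_edges = {("a", "hub"): 3.0, ("b", "hub"): 1.0, ("c", "hub"): 2.0, ("hub", "a"): 1.0}
scores = pagerank(toy_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph "hub" ranks first because every other node links to it, and "a" ranks second because it is the only node that "hub" endorses.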
Step 4: PCA dimensionality reduction (src/embedding_processing.py)
- Loaded 300-dimensional subreddit embeddings (51,278 subreddits)
- Applied PCA to reduce to 50 dimensions
  - Captures ~92% of variance
  - Reduces noise and improves clustering
- Outputs: embeddings_processed.csv, pca_variance.csv
Step 5: K-Means and manual labeling (src/topic_clustering.py)
- Ran K-Means clustering with K=40 on PCA-reduced embeddings
- Why K=40?
  - Tested K=20, K=40, K=50
  - K=50 had better silhouette scores but semantically incoherent clusters
  - K=40 provides the best balance between statistical fit and interpretability
- Manual Labeling Process:
  - For each cluster, examined top-20 subreddits by link volume
  - Assigned semantic labels (e.g., "Gaming," "Politics," "Meta-Commentary")
  - Applied 66 manual overrides to fix edge cases (e.g., r/steam misclassified with cooking)
- Outputs: embeddings_kmeans_40.csv, cluster_labels_40.csv
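The manual labeling process can be pictured as two small lookups: list a cluster's highest-volume members for inspection, then let per-subreddit overrides win over the cluster label. The sketch below uses hypothetical cluster IDs, labels, and link counts; only the r/steam example comes from the text.

```python
# Hypothetical cluster IDs and labels (the project uses 40 clusters).
cluster_labels = {7: "Gaming", 12: "Food & Cooking"}

# Hypothetical K-Means assignments: subreddit -> cluster ID.
assignments = {"steam": 12, "cooking": 12, "recipes": 12, "pcgaming": 7}

# Hypothetical link volumes used to pick exemplars for inspection.
link_volume = {"steam": 900, "cooking": 300, "recipes": 100, "pcgaming": 450}

# Manual overrides: subreddit -> corrected label.
overrides = {"steam": "Gaming"}


def top_by_volume(cluster_id, assignments, link_volume, n=20):
    """Return the n highest-volume subreddits assigned to one cluster."""
    members = [sub for sub, cid in assignments.items() if cid == cluster_id]
    return sorted(members, key=lambda sub: link_volume.get(sub, 0), reverse=True)[:n]


def label_subreddits(assignments, cluster_labels, overrides):
    """Use the cluster label unless a manual override exists."""
    return {sub: overrides.get(sub, cluster_labels[cid])
            for sub, cid in assignments.items()}


exemplars = top_by_volume(12, assignments, link_volume, n=2)
labels = label_subreddits(assignments, cluster_labels, overrides)
```

Keeping overrides in a separate mapping leaves the raw K-Means assignments untouched, so the corrections stay auditable.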
Step 6: Merge all layers (src/integration.py)
- Merged all processed datasets on the subreddit key:
  - Network metrics (PageRank, betweenness, communities)
  - LIWC psychological profiles (source + target)
  - Topic cluster assignments and labels
  - Social role classifications
- Output: final_dataset.csv (67,180 rows × 161 features)
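Because the network layer covers 67,180 subreddits while embeddings exist for 51,278, the merge presumably behaves like a left join anchored on the network nodes, leaving topic fields empty where no embedding exists. That reading is an assumption; the sketch below illustrates it with made-up values and hypothetical field names.

```python
def merge_layers(network, liwc, topics):
    """Left-join the LIWC and topic layers onto the network layer by subreddit.

    Subreddits without a topic entry keep None in the topic fields; a
    subreddit without a LIWC entry simply gets no LIWC fields in this sketch.
    """
    missing_topic = {"topic_cluster": None, "topic_label": None}
    merged = {}
    for subreddit, metrics in network.items():
        row = {"subreddit": subreddit}
        row.update(metrics)
        row.update(liwc.get(subreddit, {}))
        row.update(topics.get(subreddit, missing_topic))
        merged[subreddit] = row
    return merged


# Made-up layer contents with hypothetical field names.
network = {"askreddit": {"pagerank": 0.01}, "tinyniche": {"pagerank": 0.0001}}
liwc = {"askreddit": {"out_anger": 0.010}, "tinyniche": {"out_anger": 0.002}}
topics = {"askreddit": {"topic_cluster": 3, "topic_label": "General Discussion"}}

final_rows = merge_layers(network, liwc, topics)
```

Here "tinyniche" survives the merge with its network and LIWC values but no topic, mirroring a subreddit that appears in the hyperlink data without an embedding.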
Step 7: Cluster-level aggregation (src/cluster_aggregation.py)
- Aggregated subreddit-level data to cluster level
- Calculated per-cluster:
  - Mean network metrics (PageRank, betweenness, degree)
  - Mean psychological scores (all 65 LIWC dimensions)
  - Insularity (percentage of internal links)
  - Sentiment flows (incoming/outgoing positive/negative)
  - Role distributions (percentage Critical, Supportive, etc.)
  - Top-5 exemplar subreddits (by PageRank, betweenness, in-degree)
  - Derived scores (toxicity, analytical, emotional)
- Output: cluster_master_dataset.csv (40 clusters × 110 features)
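Insularity, the percentage of internal links, can be computed in a single pass over the edge list. The sketch below assumes the percentage is taken over each cluster's outgoing links; the text does not say whether incoming links are also counted, and the data is made up.

```python
from collections import defaultdict


def cluster_insularity(links, cluster_of):
    """Percentage of each cluster's outgoing links that stay inside the cluster.

    Assumption: only links whose source has a cluster are counted, and a
    target without a cluster counts as external.
    """
    internal = defaultdict(int)
    total = defaultdict(int)
    for source, target in links:
        source_cluster = cluster_of.get(source)
        if source_cluster is None:
            continue
        total[source_cluster] += 1
        if cluster_of.get(target) == source_cluster:
            internal[source_cluster] += 1
    return {cluster: 100.0 * internal[cluster] / total[cluster] for cluster in total}


# Made-up clusters and links.
cluster_of = {"a": "Gaming", "b": "Gaming", "c": "Politics", "d": "Politics"}
toy_links = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "d")]

insularity = cluster_insularity(toy_links, cluster_of)
```

In the toy data, two of Gaming's three outgoing links stay within Gaming, while the single Politics link is internal.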
We conducted 20 statistical tests organized into 5 thematic areas:
Methods: ANOVA, t-tests, correlation analysis
| ID | Research Question | Method |
|---|---|---|
| RQ1.1 | What linguistic patterns distinguish social roles? | Compare LIWC means across roles (Critical/Supportive/etc.) |
| RQ1.2 | Which emotions dominate conflict vs support? | T-test: positive vs negative link LIWC scores |
| RQ1.3 | Do communities experience emotional asymmetry? | Compare outgoing vs incoming LIWC (paired differences) |
| RQ1.4 | Do ideological neighbors attack each other more? | Compare within-cluster vs cross-cluster sentiment |
Methods: Correlation, purity index, regression
| ID | Research Question | Method |
|---|---|---|
| RQ2.1 | Does certainty correlate with insularity? | Pearson r: certainty score vs internal link percentage |
| RQ2.2 | Where do semantic and structural clustering align? | Purity matrix: topic cluster × network community |
| RQ2.3 | What predicts isolation? | Regression: insularity ~ LIWC + network features |
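For RQ2.2, one common way to read a topic cluster × network community matrix is per-topic purity: the share of a topic's subreddits that fall into its single most frequent community. The sketch below uses that definition as an assumption (the project's purity index may be defined differently) and made-up assignments.

```python
from collections import Counter, defaultdict


def purity_by_topic(topic_of, community_of):
    """Return (count matrix, purity per topic).

    matrix[topic][community] counts subreddits; purity is the largest
    cell in a topic's row divided by the row total.
    """
    matrix = defaultdict(Counter)
    for subreddit, topic in topic_of.items():
        if subreddit in community_of:
            matrix[topic][community_of[subreddit]] += 1
    purity = {topic: max(row.values()) / sum(row.values())
              for topic, row in matrix.items()}
    return matrix, purity


# Made-up topic clusters (semantic) and Louvain communities (structural).
topic_of = {"a": "Gaming", "b": "Gaming", "c": "Gaming", "d": "Politics", "e": "Politics"}
community_of = {"a": 0, "b": 0, "c": 1, "d": 2, "e": 2}

matrix, purity = purity_by_topic(topic_of, community_of)
```

A purity of 1.0 means the topic sits entirely inside one network community; lower values mean the topic is spread across several.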
Methods: Correlation, flow matrices, bridge analysis
| ID | Research Question | Method |
|---|---|---|
| RQ3.1 | What makes a bridge? | Correlation: betweenness vs LIWC/role features |
| RQ3.2 | Which topics act as bridges? | Cross-cluster link density by topic |
| RQ3.3 | What are the major rivalries and alliances? | Net sentiment flow matrix between clusters |
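A net sentiment flow matrix for RQ3.3 can be sketched by summing the +1/-1 link labels for every ordered pair of clusters. Treating the plain sum as "net flow" is an assumption, as is ignoring within-cluster links; the links below are made up.

```python
from collections import defaultdict


def net_sentiment_flow(links, cluster_of):
    """Sum sentiment labels (+1 or -1) per ordered (source cluster, target cluster)."""
    flow = defaultdict(int)
    for source, target, sentiment in links:
        src_cluster = cluster_of.get(source)
        dst_cluster = cluster_of.get(target)
        if src_cluster is None or dst_cluster is None or src_cluster == dst_cluster:
            continue  # skip unclustered endpoints and within-cluster links
        flow[(src_cluster, dst_cluster)] += sentiment
    return dict(flow)


# Made-up clusters and labeled links.
cluster_of = {"a": "Gaming", "b": "Politics", "c": "Science"}
toy_links = [
    ("a", "b", -1), ("a", "b", -1), ("a", "b", 1),   # Gaming -> Politics
    ("c", "a", 1), ("c", "a", 1),                    # Science -> Gaming
    ("a", "a", -1),                                  # internal, ignored
]

flow = net_sentiment_flow(toy_links, cluster_of)
most_hostile = min(flow, key=flow.get)
most_friendly = max(flow, key=flow.get)
```

Pairs with the most negative totals are candidate rivalries and pairs with the most positive totals are candidate alliances.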
Methods: Cluster comparisons, sentiment aggregation
| ID | Research Question | Method |
|---|---|---|
| RQ4.1 | How are roles distributed across topics? | Role percentage per cluster |
| RQ4.2 | Which clusters attack outward most? | External negativity: outgoing negative - incoming positive |
| RQ4.3 | Which clusters attack themselves? | Internal civility: negative links within cluster |
| RQ4.4 | What language predicts internal peace? | Correlation: civility vs LIWC features |
| RQ4.5 | What language predicts external war? | Correlation: external negativity vs LIWC |
Methods: Flow analysis, regression, directional tests
| ID | Research Question | Method |
|---|---|---|
| RQ5.1 | Who are the punching bags? | Net toxicity flow: incoming negative - outgoing negative |
| RQ5.2 | What predicts PageRank? | Correlation: PageRank vs LIWC/betweenness/degree |
| RQ5.3 | Do attacks flow up or down the hierarchy? | Compare low-PageRank → high-PageRank vs high-PageRank → low-PageRank links |
| RQ5.4 | Does receiving hate breed sending hate? | Correlation: incoming negativity vs outgoing negativity |
| RQ5.5 | Are critics analytical or emotional? | Compare cognitive vs emotional LIWC in negative links |
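The directional comparison in RQ5.3 amounts to counting negative links by whether the source has a lower or higher PageRank than the target. The sketch below is one plausible reading, with made-up PageRank values and links.

```python
def attack_direction_counts(links, pagerank):
    """Count negative links going up, down, or level in the PageRank hierarchy."""
    counts = {"up": 0, "down": 0, "level": 0}
    for source, target, sentiment in links:
        if sentiment != -1:
            continue  # only negative links count as attacks
        if source not in pagerank or target not in pagerank:
            continue
        if pagerank[source] < pagerank[target]:
            counts["up"] += 1      # low-PageRank source attacks high-PageRank target
        elif pagerank[source] > pagerank[target]:
            counts["down"] += 1    # high-PageRank source attacks low-PageRank target
        else:
            counts["level"] += 1
    return counts


# Made-up PageRank scores and labeled links.
toy_pagerank = {"small": 0.001, "big": 0.05}
toy_links = [
    ("small", "big", -1),
    ("small", "big", -1),
    ("big", "small", -1),
    ("small", "big", 1),   # positive link, ignored
]

direction = attack_direction_counts(toy_links, toy_pagerank)
```

Comparing the "up" and "down" counts (or their shares of all negative links) indicates whether attacks tend to target more prominent communities.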
Statistical Methods:
- Correlation Analysis: Pearson r for continuous variables
- T-tests: Compare group means (e.g., positive vs negative link language)
- ANOVA: Compare multiple groups (e.g., LIWC across 4 roles)
- Flow Matrices: Net sentiment between clusters (incoming - outgoing)
All outputs saved to data/processed/rq_analysis/ for reproducibility.
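For reference, the Pearson correlation used throughout the research questions can be written out directly. The function below is a plain-Python sketch of the standard formula rather than the project's code, and the certainty and insularity values are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)


# Made-up per-cluster values in the spirit of RQ2.1.
certainty = [0.012, 0.015, 0.011, 0.020, 0.018]
insularity_pct = [22.0, 30.0, 18.0, 41.0, 35.0]

r = pearson_r(certainty, insularity_pct)
```

The result always lies between -1 and 1; the sign gives the direction of the linear relationship and the magnitude its strength.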
Generated via src/visualize_pipeline.py, saved to results/figures/:
- Step 1: Sentiment distribution, temporal patterns, top sources/targets, LIWC profiles, attack patterns (7 plots)
- Step 2: Psychological asymmetry, role quadrants, top influential/supported (4 plots)
- Step 3: Centrality rankings (PageRank, betweenness, hubs, authorities) (1 plot)
- Step 4: PCA variance explained (1 plot)
- Step 5: Topic cluster size distribution (1 plot)
- Step 6: Topic-network-role integration, LIWC role lift, echo chamber heatmap (3 plots)
Generated via scripts/prepare_data.py, embedded in data story:
- topic_continents_network.html: Force-directed graph of 40 topic clusters
- roles_quadrant_scatter.html: 4-quadrant role classification (x=outgoing, y=incoming sentiment)
- rivalry_chord.html: Chord diagram of negative sentiment flows between clusters
- toxicity_flow_sankey.html: Sankey diagram of toxicity propagation
- insurgency_power_waffle.html: Waffle chart showing attack direction (up vs down)
- isolation_civility_slope.html: Slope graph comparing isolation and internal civility
- bridges_linguistic_bar.html: Bar chart of betweenness predictors
- roles_synapses_bundling.html: Hierarchical edge bundling of role interactions
- network_synthesis_explorer.html: Interactive 3-layer data explorer
- alliances_chord.html: Chord diagram of positive sentiment flows
Execute results.ipynb to reproduce all results:
- Steps 1-7 generate all processed datasets
- Research questions section generates 20 analysis outputs
- Visualization cells create 17 PNG charts
- JSON preparation creates web visualization data
After running the notebook:
- data/processed/: 17 CSV files + 1 directory with 20 RQ results
- results/figures/: 17 PNG visualization files
- docs/assets/data/: 11 JSON files
- Temporal Scope: Data from 2014-2017 predates several major world events (the COVID-19 quarantine, the Russia-Ukraine war, most of the Trump presidency). Current dynamics may differ significantly.
- Language Bias: Analysis is English-dominant. Non-English clusters (Japanese, German, Brazilian) are underrepresented and may have biased LIWC scores.
- Sarcasm Blind Spot: LIWC cannot detect sarcasm or irony. A post saying "Great job destroying the subreddit" scores as positive emotion.
- Sampling Approximation: Betweenness centrality is estimated with k=50 sampling rather than from all n=67,180 nodes. Rankings are reliable, but absolute values are approximations.
- Manual Labeling Subjectivity: The 40 topic labels were assigned by manual inspection. Although each label was checked against the cluster's top-20 exemplars, edge cases required 66 manual overrides.
- Correlation ≠ Causation: We identify patterns (e.g., toxicity contagion), not mechanisms. Experimental or longitudinal data would be needed for causal claims.
Amer Lakrami: Handled embeddings and clustering, and engineered key research questions.
Hamza Barrada: Handled data integration and designed the interactive visualizations.
Omar El Khyari: Focused on LIWC analysis, role classification and created the project webpage.
Omar Zakariya: Conducted network analysis and assisted with the webpage, visualizations, and Jupyter notebook.
Cesar Illanes: Responsible for the README and project documentation.
Collaborative Work:
- Website design and user experience testing (all members)
- Methodological discussions and RQ refinement (all members)
Datasets:
- Kumar, S., Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2018). Community interaction and conflict on the web. Proceedings of the 2018 World Wide Web Conference, 933-943.
- Hamilton, W. L., Zhang, J., Danescu-Niculescu-Mizil, C., Jurafsky, D., & Leskovec, J. (2017). Loyalty in online communities. Proceedings of the International AAAI Conference on Web and Social Media, 11(1).
Project Repository: https://github.com/epfl-ada/ada-2025-project-barrada
Data Story: https://epfl-ada.github.io/ada-2025-project-barrada/
License: MIT