
helper script for plotting #463

Open
josephdviviano wants to merge 1 commit into master from multinode_analysis

Conversation

@josephdviviano
Collaborator

- I've read the .github/CONTRIBUTING.md file
- My code follows the typing guidelines
- I've added appropriate tests
- I've run pre-commit hooks locally

Description

Plotting tool for the multinode experiments.

Automatically scrapes wandb for results and plots them. To be expanded.

@josephdviviano
Collaborator Author

For our notes - here's a summary of the work done / next steps:

# TorchGFN Multinode Scaling Analysis - Project State

## Overview
A comprehensive analysis framework for evaluating multinode scaling experiments in TorchGFN using Weights & Biases data. The tool visualizes how different community sizes and strategies affect mode discovery performance across hypergrid environments.

## Current Capabilities

### 📊 Data Pipeline
- **Wandb Integration**: Fetches runs from the `torchgfn/torchgfn` project with robust timeout handling (see the sketch below)
- **Hierarchical Organization**: Environment → Community → Runs structure
- **Strategy Extraction**: Automatically identifies unique strategy configurations across all runs
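
The fetch step described above might look roughly like the sketch below. It uses the public `wandb` client API (`wandb.Api` accepts a `timeout` argument, and `api.runs()` takes an `entity/project` path); the retry loop and function signature are illustrative, not the script's exact implementation:

```python
import wandb

WANDB_PROJECT = "torchgfn/torchgfn"


def fetch_wandb_runs(timeout_s: int = 60, max_retries: int = 3) -> list:
    """Fetch all runs for the project, retrying on transient API errors."""
    api = wandb.Api(timeout=timeout_s)
    for attempt in range(1, max_retries + 1):
        try:
            # api.runs() is lazy; materialize the list so network errors surface here
            return list(api.runs(WANDB_PROJECT))
        except wandb.errors.CommError as e:
            print(f"wandb fetch failed (attempt {attempt}/{max_retries}): {e}")
    return []
```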

### 🎨 Visualization System

#### Multi-Dimensional Encoding
| Dimension | Visual Encoding | Purpose |
|-----------|----------------|---------|
| Community Size | **Color** (tab20 colormap) | Compare scaling (Size 1, 2, 4, 8, 16, 32...) |
| Strategy | **Linestyle + Marker** | Distinguish experimental conditions |
| Run State | **Opacity** | Solid=finished, Faded=crashed/failed |

#### Plot Layout
- **70/30 split**: Main plot takes 70% of vertical space, legend takes 30%
- **Three-section legend**:
  1. Community Sizes (horizontal row, color-coded)
  2. Strategies (single column, linestyle + marker shown)
  3. Run States (opacity explanation)

#### Visual Tuning
- Thin lines (`linewidth=1.2`) to reduce overlap
- Large markers (`markersize=9`) for shape distinction
- ~12 markers per line for visual clarity
- Legend shows full linestyle pattern (`handlelength=4`)
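
The review comments below suggest naming these tuning values; collected as module-level constants (the names are hypothetical, the values come from the bullets above), that might look like:

```python
# Visual tuning constants (names are hypothetical; values match the bullets above)
LINE_WIDTH = 1.2               # thin lines to reduce overlap
MARKER_SIZE = 9                # large markers for shape distinction
TARGET_MARKERS_PER_LINE = 12   # markevery = max(1, len(steps) // TARGET_MARKERS_PER_LINE)
LEGEND_HANDLE_LENGTH = 4       # long legend handles show the full linestyle pattern
```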

### 🏷️ Strategy Mapping System

```python
STRATEGY_MAPPING = {
    "average_every=100_...": "Baseline",
    "average_every=100_..._use_selective_averaging=True": "Selective Averaging",
    "average_every=100_..._use_random_strategies=True": "Random Strategies",
    # Add more mappings as needed
}
```

- **Consistent encoding**: Same linestyle/marker for each strategy across all environment plots
- **Quick discovery mode**: `--print-strategies-only` flag lists all strategies without generating plots

### 🖥️ Command Line Interface

```bash
# Full analysis with plots
python multinode_scaling_analysis.py

# Quick mode: just print strategies for mapping
python multinode_scaling_analysis.py --print-strategies-only
```
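
The `--print-strategies-only` flag is the only documented option; a minimal argparse wiring for it (a sketch; the script's actual CLI may define more) could be:

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Plot multinode scaling results scraped from wandb."
    )
    parser.add_argument(
        "--print-strategies-only",
        action="store_true",
        help="List unique strategy IDs (for STRATEGY_MAPPING) without plotting.",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.print_strategies_only:
        print_strategies()  # hypothetical helper
    else:
        run_full_analysis()  # hypothetical helper
```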

## Current Strategy Mappings

| Long-form ID | Shorthand |
|--------------|-----------|
| `average_every=100_..._use_selective_averaging=False` | Baseline |
| `average_every=100_..._use_selective_averaging=True` | Selective Averaging |
| `average_every=100_..._use_random_strategies=True` | Random Strategies |
| `average_every=100_..._use_random_strategies=True_use_selective_averaging=True` | Selective Averaging & Random Strategies |
| `average_every=16384000_...` | Baseline (16M steps Averaging?) |
| `average_every=4294967296_...` | Baseline (4B steps Averaging?) |

## File Structure

```
tutorials/notebooks/
├── multinode_scaling_analysis.py    # Main analysis script (~1010 lines)
└── multinode_scaling_analysis.ipynb # Jupyter notebook companion
```

## Possible Next Steps for Analysis

### 📈 Quantitative Analysis

1. **Convergence Speed Metrics**
   - Time/iterations to reach X% of max modes
   - Compare convergence rates across strategies and sizes
   - Statistical significance testing (t-tests, ANOVA)
2. **Scaling Efficiency** (see the sketch after this list)
   - Plot modes-per-agent vs community size
   - Identify the diminishing-returns threshold
   - Calculate parallelization efficiency
3. **Strategy Effectiveness Ranking**
   - Aggregate performance across all environments
   - Rank strategies by average mode discovery rate
   - Identify the best strategy per environment type
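
One way to make item 2's "parallelization efficiency" concrete: treat size-1 communities as the baseline and ask what fraction of linear scaling each size achieves. This is a sketch under the assumption that performance is summarized as average max modes per community size (the input name is hypothetical, and mode counts saturate, so this is a crude proxy):

```python
def parallelization_efficiency(modes_by_size: dict[int, float]) -> dict[int, float]:
    """Efficiency of size-N communities relative to linear scaling of size 1.

    modes_by_size maps community size -> average max modes found (hypothetical input).
    1.0 = perfect linear scaling; values below 1.0 indicate diminishing returns.
    """
    baseline = modes_by_size[1]
    return {size: modes / (size * baseline) for size, modes in modes_by_size.items()}


# Example: {1: 10.0, 2: 18.0, 4: 30.0} -> {1: 1.0, 2: 0.9, 4: 0.75}
```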

### 🔬 Deeper Investigations

1. **Environment Difficulty Analysis**
   - Which environments are hardest (lowest max modes)?
   - Does strategy effectiveness vary by environment difficulty?
   - Correlation between environment parameters and performance
2. **Failure Analysis**
   - Why do some runs crash/fail?
   - Is there a pattern (certain sizes, strategies, environments)?
   - Time-to-failure analysis
3. **Variance Analysis** (see the bootstrap sketch after this list)
   - How consistent are results within the same size/strategy?
   - Which configurations have the highest variance?
   - Bootstrap confidence intervals
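
For item 3's bootstrap confidence intervals, a standard percentile bootstrap over per-run max modes would suffice; this sketch uses only numpy (already in the stack) and is not taken from the script:

```python
import numpy as np


def bootstrap_ci(values, n_boot: int = 10_000, ci: float = 95.0, seed: int = 0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array(
        [rng.choice(values, size=len(values), replace=True).mean() for _ in range(n_boot)]
    )
    lo, hi = np.percentile(means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return lo, hi


# Example: CI over max modes for one (size, strategy) group
# lo, hi = bootstrap_ci([42, 45, 39, 44])
```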

### 📊 Visualization Enhancements

1. **Summary Dashboard** (see the heatmap sketch after this list)
   - Heatmap: Strategy × Size → Performance
   - Bar charts comparing strategies aggregated across environments
   - Box plots showing the distribution of max modes
2. **Normalized Comparisons**
   - Normalize by each environment's theoretical max modes
   - Compare "fraction of modes found" instead of raw counts
   - Time-normalized curves (modes per hour)
3. **Interactive Plots**
   - Plotly/Bokeh version for zooming/hovering
   - Filter by environment, strategy, or size interactively
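
The Strategy × Size heatmap in item 1 falls out of a pandas pivot plus seaborn (both already dependencies); the dataframe columns here are hypothetical, one row per finished run:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tidy frame: one row per finished run
df = pd.DataFrame(
    {
        "strategy": ["Baseline", "Baseline", "Selective Averaging"],
        "size": [2, 4, 2],
        "max_modes": [18, 30, 22],
    }
)

# Aggregate to a Strategy x Size grid of mean max modes, then plot
pivot = df.pivot_table(index="strategy", columns="size", values="max_modes", aggfunc="mean")
sns.heatmap(pivot, annot=True, fmt=".1f", cmap="viridis")
plt.title("Strategy × Size → avg max modes")
plt.show()
```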

### 🔧 Tool Improvements

1. **Caching Layer** (see the sketch after this list)
   - Cache wandb data locally to speed up reruns
   - Incremental updates for new runs only
2. **Export Capabilities**
   - Save plots as publication-quality PDFs
   - Export summary statistics to CSV
   - Generate LaTeX tables automatically
3. **Configuration File**
   - Move STRATEGY_MAPPING to an external YAML/JSON file
   - Allow environment filtering via config
   - Customizable color/marker schemes
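
A simple version of the caching layer in item 1: serialize the already-extracted, picklable run data (not live wandb Run objects, which generally don't pickle) and reuse it on the next invocation. The path and function names are hypothetical:

```python
import pickle
from pathlib import Path

CACHE_PATH = Path("wandb_cache.pkl")  # hypothetical location


def load_or_fetch(fetch_fn):
    """Return cached run data if present; otherwise fetch, cache, and return it.

    fetch_fn must return plain picklable data (e.g., dicts of history arrays).
    """
    if CACHE_PATH.exists():
        with CACHE_PATH.open("rb") as f:
            return pickle.load(f)
    data = fetch_fn()
    with CACHE_PATH.open("wb") as f:
        pickle.dump(data, f)
    return data
```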

### 📝 Documentation & Reporting

1. **Automated Report Generation**
   - Markdown/HTML report with key findings
   - Include best/worst performing configurations
   - Trend analysis summary
2. **Experiment Recommendations**
   - Based on current results, suggest next experiments
   - Identify under-explored regions of the parameter space

## Technical Stack

- **Language**: Python 3.12
- **Data**: pandas, numpy
- **Visualization**: matplotlib, seaborn
- **Experiment Tracking**: Weights & Biases API
- **CLI**: argparse

## Quick Reference

```python
# Key constants at top of script
STRATEGY_PARAMS = ['average_every', 'replacement_ratio', ...]
STRATEGY_MAPPING = {...}  # Long-form → shorthand

# Key functions
fetch_wandb_runs()                        # Get data from wandb
analyze_communities_within_environments() # Main analysis loop
plot_communities_in_environment()         # Generate plots
format_strategy_for_legend()              # Uses STRATEGY_MAPPING
```

@codecov

codecov bot commented Jan 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.38%. Comparing base (a47bf73) to head (5b13588).
⚠️ Report is 35 commits behind head on master.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           master     #463       +/-   ##
===========================================
+ Coverage    0.55%   74.38%   +73.83%     
===========================================
  Files          48       47        -1     
  Lines        6845     6891       +46     
  Branches      802      825       +23     
===========================================
+ Hits           38     5126     +5088     
+ Misses       6806     1454     -5352     
- Partials        1      311      +310     
```
| Flag | Coverage Δ |
|------|------------|
| unittests | ? |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


```python
# Show some examples
print("\nExamples of community vs environment groupings:")
example_runs = list(run_to_community.keys())[:8]  # Show first 8
```

Copilot AI Jan 14, 2026

The magic number 8 is used to limit example runs displayed. Consider defining this as a named constant for better maintainability.
Comment on lines +796 to +797

```python
fig = plt.figure(figsize=(16, 14))
gs = fig.add_gridspec(2, 1, height_ratios=[70, 30], hspace=0.08)
```

Copilot AI Jan 14, 2026

The magic numbers for figure sizing (16, 14), height ratios (70, 30), and spacing (0.08) should be defined as named constants for better maintainability and to make it easier to adjust the layout consistently.
```python
label = f"Size {community_size}, Strategy: {strategy_short} ({run_state})"

# Determine marker frequency based on data length (show ~10-15 markers per line)
markevery = max(1, len(steps) // 12)
```

Copilot AI Jan 14, 2026

The magic number 12 is used to calculate marker frequency. Consider defining this as a named constant (e.g., TARGET_MARKERS_PER_LINE = 12) for better maintainability.
Comment on lines +227 to +228

```python
except Exception as e:
    print(f" Error checking history: {e}")
```

Copilot AI Jan 14, 2026

The broad exception catch except Exception as e on line 227 silently swallows all exceptions and only prints them. Consider catching specific exceptions (e.g., wandb.errors.CommError, requests.exceptions.RequestException) or re-raising after logging to ensure critical errors are not hidden.
Comment on lines +774 to +1090

```python
def plot_communities_in_environment(  # noqa: C901
    env_config_id,
    community_data,
    community_metadata,
    strategy_linestyle_map,
    strategy_marker_map,
):
    """Plot n_modes_found progression for communities within a single environment.

    Uses color for community size, linestyle and markers for strategy.
    Linestyle and marker mappings are passed in to ensure consistency across plots.
    """
    if not community_data:
        return

    communities_with_data = {cid: data for cid, data in community_data.items() if data}

    if len(communities_with_data) <= 1:
        print(f"Only {len(communities_with_data)} communities with data - skipping plot")
        return

    # Create figure with explicit layout: 70% plot, 30% legend space
    fig = plt.figure(figsize=(16, 14))
    gs = fig.add_gridspec(2, 1, height_ratios=[70, 30], hspace=0.08)
    ax = fig.add_subplot(gs[0])

    # Extract unique community sizes
    community_sizes = sorted(
        set(community_metadata[cid]["size"] for cid in communities_with_data.keys())
    )

    # Color mapping for community sizes
    size_colors = cm.tab20(np.linspace(0, 1, len(community_sizes)))
    size_color_map = dict(zip(community_sizes, size_colors))

    # Style mapping for run states (secondary modifier)
    state_alphas = {"finished": 1.0, "crashed": 0.7, "failed": 0.5, "running": 0.8}

    max_y = 0
    legend_handles = []
    legend_labels = []

    for community_id in sorted(communities_with_data.keys()):
        runs_data = communities_with_data[community_id]
        if not runs_data:
            continue

        community_size = community_metadata[community_id]["size"]
        community_strategy = community_metadata[community_id]["strategy"]

        # Get color for community size
        base_color = size_color_map[community_size]

        # Get linestyle and marker for strategy (from global mappings)
        base_linestyle = strategy_linestyle_map.get(community_strategy, "-")
        base_marker = strategy_marker_map.get(community_strategy, "o")

        for run_data in runs_data:
            steps = run_data["steps"]
            n_modes = run_data["n_modes_found"]
            run_state = run_data["run_state"]

            # Apply state modifier (alpha)
            alpha = state_alphas.get(run_state, 0.6)

            # Create label for legend
            strategy_short = (
                community_strategy.replace("_", ", ")
                if community_strategy
                else "unknown"
            )
            label = f"Size {community_size}, Strategy: {strategy_short} ({run_state})"

            # Determine marker frequency based on data length (show ~10-15 markers per line)
            markevery = max(1, len(steps) // 12)

            (line,) = ax.plot(
                steps,
                n_modes,
                linestyle=base_linestyle,
                color=base_color,
                alpha=alpha,
                linewidth=1.2,
                label=label,
                marker=base_marker,
                markersize=9,
                markevery=markevery,
            )

            # Only add to legend if we haven't seen this combination before
            legend_key = (
                f"size_{community_size}_strategy_{community_strategy}_{run_state}"
            )
            if legend_key not in legend_labels:
                legend_handles.append(line)
                legend_labels.append(legend_key)

            max_y = max(max_y, max(n_modes) if n_modes else 0)

    # Clean up the environment config ID for title
    env_title = env_config_id.replace("_", ", ")
    ax.set_title(
        f"Mode Discovery: {env_title}\nCommunity Size (📏) × Strategy (📊) Analysis",
        fontsize=14,
        fontweight="bold",
    )
    ax.set_xlabel("Iteration/Step", fontsize=12)
    ax.set_ylabel("Number of Modes Found", fontsize=12)
    ax.grid(True, alpha=0.3)

    # Create separate legends for colors (sizes) and linestyles (strategies)
    if legend_handles:
        # Create color legend for community sizes
        size_handles = []
        size_labels = []
        size_colors_used = set()

        # Create linestyle legend for strategies
        strategy_handles = []
        strategy_labels = []
        strategy_linestyles_used = set()

        # First pass: collect all unique sizes and strategies (prefer finished, but include all)
        all_sizes_seen = {}  # size -> best_state (finished > running > crashed > failed)
        all_strategies_seen = {}  # strategy -> best_state

        state_priority = {"finished": 0, "running": 1, "crashed": 2, "failed": 3}

        for handle, label_key in zip(legend_handles, legend_labels):
            parts = label_key.split("_strategy_")
            if len(parts) != 2:
                continue
            size_part = parts[0].replace("size_", "")
            strategy_state_part = parts[1]
            strategy_state_parts = strategy_state_part.rsplit("_", 1)
            if len(strategy_state_parts) != 2:
                continue
            strategy_part, state_part = strategy_state_parts

            # Track best state for each size
            if size_part not in all_sizes_seen:
                all_sizes_seen[size_part] = state_part
            elif state_priority.get(state_part, 99) < state_priority.get(
                all_sizes_seen[size_part], 99
            ):
                all_sizes_seen[size_part] = state_part

            # Track best state for each strategy
            if strategy_part not in all_strategies_seen:
                all_strategies_seen[strategy_part] = state_part
            elif state_priority.get(state_part, 99) < state_priority.get(
                all_strategies_seen[strategy_part], 99
            ):
                all_strategies_seen[strategy_part] = state_part

        # Build size legend entries for ALL sizes
        for size_part in all_sizes_seen.keys():
            if size_part not in size_colors_used:
                size_key = f"Size {size_part}"
                size_handles.append(
                    Line2D(
                        [0],
                        [0],
                        color=size_color_map[int(size_part)],
                        linewidth=3,
                        label=size_key,
                    )
                )
                size_labels.append(size_key)
                size_colors_used.add(size_part)

        # Build strategy legend entries for ALL strategies (with linestyle AND marker)
        # Use longer line segment [0, 0.5, 1] so linestyle pattern is visible
        for strategy_part in all_strategies_seen.keys():
            if (
                strategy_part not in strategy_linestyles_used
                and strategy_part in strategy_linestyle_map
            ):
                strategy_key = format_strategy_for_legend(strategy_part)
                strategy_handles.append(
                    Line2D(
                        [0, 0.5, 1],
                        [0, 0, 0],
                        color="black",
                        linestyle=strategy_linestyle_map[strategy_part],
                        marker=strategy_marker_map.get(strategy_part, "o"),
                        markersize=10,
                        linewidth=1.5,
                        label=strategy_key,
                    )
                )
                strategy_labels.append(strategy_key)
                strategy_linestyles_used.add(strategy_part)

        # Sort size labels in ascending order
        size_order = sorted(size_labels, key=lambda x: int(x.split()[1]))
        size_handles_sorted = []
        size_labels_sorted = []
        for label in size_order:
            idx = size_labels.index(label)
            size_handles_sorted.append(size_handles[idx])
            size_labels_sorted.append(size_labels[idx])
        size_handles = size_handles_sorted
        size_labels = size_labels_sorted

        # Create legend axes in the bottom gridspec slot
        legend_ax = fig.add_subplot(gs[1])
        legend_ax.axis("off")

        # Create THREE separate legends stacked vertically:
        # 1. Sizes (horizontal, compact) at top
        # 2. Strategies (one per line, full width) in middle
        # 3. Run states (horizontal) at bottom

        # Legend 1: Community Sizes (horizontal row)
        if size_handles:
            size_legend = legend_ax.legend(
                size_handles,
                [f"📏 {label}" for label in size_labels],
                loc="upper center",
                bbox_to_anchor=(0.5, 1.0),
                fontsize=10,
                title="Community Sizes",
                title_fontsize=11,
                ncol=len(size_handles),
                frameon=True,
                fancybox=True,
                columnspacing=2.0,
                handletextpad=0.5,
            )
            legend_ax.add_artist(size_legend)

        # Legend 2: Strategies (single column, full width for long labels)
        # Use handlelength=4 to show linestyle pattern clearly alongside markers
        if strategy_handles:
            strategy_legend = legend_ax.legend(
                strategy_handles,
                [f"📊 {label}" for label in strategy_labels],
                loc="upper center",
                bbox_to_anchor=(0.5, 0.65),
                fontsize=10,
                title="Strategies",
                title_fontsize=11,
                ncol=1,
                frameon=True,
                fancybox=True,
                handlelength=4,
                handletextpad=0.8,
            )
            legend_ax.add_artist(strategy_legend)

        # Legend 3: Run states (horizontal row at bottom)
        state_handles = [
            Line2D([0], [0], color="gray", linestyle="-", alpha=1.0, linewidth=3),
            Line2D([0], [0], color="gray", linestyle="-", alpha=0.5, linewidth=3),
        ]
        state_labels = ["Solid opacity: finished", "Faded opacity: crashed/failed"]
        legend_ax.legend(
            state_handles,
            state_labels,
            loc="upper center",
            bbox_to_anchor=(0.5, 0.15),
            fontsize=9,
            title="Run States",
            title_fontsize=10,
            ncol=2,
            frameon=True,
            fancybox=True,
            columnspacing=3.0,
            handletextpad=0.5,
        )

    # Set y-axis limit with some padding
    if max_y > 0:
        ax.set_ylim(0, max_y * 1.1)

    plt.tight_layout()
    plt.show()

    # Print community comparison statistics with size/strategy breakdown
    print(f"\nCommunity Performance Summary for {env_config_id}:")

    # Group by size and strategy
    size_strategy_stats = {}
    for community_id in sorted(communities_with_data.keys()):
        runs_data = communities_with_data[community_id]
        metadata = community_metadata[community_id]

        size = metadata["size"]
        strategy = metadata["strategy"] or "unknown"

        key = f"Size {size}, Strategy: {format_strategy_for_legend(strategy)}"

        if key not in size_strategy_stats:
            size_strategy_stats[key] = {"finished": [], "total": 0}

        size_strategy_stats[key]["total"] += len(runs_data)
        finished_runs = [r for r in runs_data if r["run_state"] == "finished"]
        size_strategy_stats[key]["finished"].extend(
            [r["max_modes"] for r in finished_runs]
        )

    for group_key, stats in sorted(size_strategy_stats.items()):
        finished_count = len(stats["finished"])
        total_count = stats["total"]

        if finished_count > 0:
            avg_max_modes = sum(stats["finished"]) / finished_count
            min_max = min(stats["finished"])
            max_max = max(stats["finished"])
            print(
                f" {group_key}: {finished_count}/{total_count} finished, "
                f"avg max modes = {avg_max_modes:.1f} (range: {min_max:.0f}-{max_max:.0f})"
            )
        else:
            print(f" {group_key}: {total_count} runs, 0 finished")
```

Copilot AI Jan 14, 2026

The function plot_communities_in_environment has excessive complexity (roughly 300 lines in the quoted range). Consider breaking it down into smaller, more manageable functions for better maintainability. For example, separate the legend creation logic, plot styling setup, and data processing into distinct helper functions.
Comment on lines +101 to +198

```python
    if len(run_ids) <= 3:
        print(f" Run IDs: {run_ids}")
    else:
        print(f" Run IDs: {run_ids[:3]}... ({len(run_ids)} total)")

    # Check environment configuration status distribution
    print("\nEnvironment configuration status analysis:")
    for env_config_id, run_ids in sorted(env_config_runs.items()):
        env_config_run_objects = [run for run in runs_list if run.id in run_ids]
        states = [run.state for run in env_config_run_objects]
        state_counts = pd.Series(states).value_counts()
        print(f"- {env_config_id}: {dict(state_counts)}")

    return run_to_env_config, env_config_runs, env_config_details


def create_hierarchical_structure(runs_list, run_to_env_config, run_to_community):
    """
    Create hierarchical structure: Environment Groups → Community Groups → Runs

    Returns:
    - env_to_communities: dict mapping environment_config_id to list of community_ids
    - community_to_runs: dict mapping community_id to list of run_ids
    - env_community_runs: nested dict[env_config_id][community_id] = list of runs
    """
    env_to_communities = {}
    community_to_runs = {}
    env_community_runs = {}

    for run in runs_list:
        env_config_id = run_to_env_config.get(run.id)
        community_id = run_to_community.get(run.id)

        if env_config_id and community_id:
            # Build environment → communities mapping
            if env_config_id not in env_to_communities:
                env_to_communities[env_config_id] = set()
            env_to_communities[env_config_id].add(community_id)

            # Build community → runs mapping
            if community_id not in community_to_runs:
                community_to_runs[community_id] = []
            community_to_runs[community_id].append(run.id)

            # Build nested env → community → runs mapping
            if env_config_id not in env_community_runs:
                env_community_runs[env_config_id] = {}
            if community_id not in env_community_runs[env_config_id]:
                env_community_runs[env_config_id][community_id] = []
            env_community_runs[env_config_id][community_id].append(run)

    # Convert sets to sorted lists for consistency
    for env_id in env_to_communities:
        env_to_communities[env_id] = sorted(env_to_communities[env_id])

    print("\n=== HIERARCHICAL STRUCTURE ANALYSIS ===")
    print(f"Environments: {len(env_to_communities)}")
    print(f"Total Communities: {len(community_to_runs)}")

    for env_id, communities in sorted(env_to_communities.items()):
        print(f"\nEnvironment {env_id}:")
        print(f" Communities: {len(communities)}")
        for community_id in communities:
            runs_in_community = len(community_to_runs[community_id])
            finished_runs = sum(
                1
                for run in env_community_runs[env_id][community_id]
                if run.state == "finished"
            )
            print(
                f" {community_id}: {runs_in_community} runs ({finished_runs} finished)"
            )

    return env_to_communities, community_to_runs, env_community_runs


def analyze_groups(runs_list):
    """Analyze group structure from run data."""
    print("\n=== WANDB GROUP ANALYSIS ===")
    run_groups = {}
    group_runs = {}

    for run in runs_list:
        group_id = getattr(run, "group", None)
        if group_id:
            run_groups[run.id] = group_id
            if group_id not in group_runs:
                group_runs[group_id] = []
            group_runs[group_id].append(run.id)

    print(f"Found {len(group_runs)} unique wandb groups:")
    for group_id, run_ids in sorted(group_runs.items()):
        print(f"- Group {group_id}: {len(run_ids)} runs")
        # Show first few run IDs for this group
        if len(run_ids) <= 3:
            print(f" Run IDs: {run_ids}")
        else:
            print(f" Run IDs: {run_ids[:3]}... ({len(run_ids)} total)")
```

Copilot AI Jan 14, 2026

The pattern of checking run IDs and displaying them appears multiple times (lines 101-104, 196-198). Consider extracting this into a helper function like format_run_ids_display(run_ids: list, max_display: int = 3) -> str to reduce code duplication.
Comment on lines +20 to +21

```python
warnings.filterwarnings("ignore")
```

Copilot AI Jan 14, 2026

Using a blanket warnings.filterwarnings("ignore") suppresses all warnings globally, which can hide important issues. Consider being more specific about which warnings to ignore (e.g., using category or module filters), or remove this if the warnings are not problematic.

Suggested change: delete the `warnings.filterwarnings("ignore")` line.
Collaborator

@younik left a comment

Non-blocking as it is a script, but I share some of Copilot's comments; I resolved the irrelevant ones.
