Extract, analyze, and cluster atomic gameplay scenes from Super Mario Bros replay data.
This package processes Super Mario Bros gameplay recordings (.bk2 files) from the Courtois NeuroMod project to:
- Extract individual scene clips from full-level replays
- Annotate scenes with 27 gameplay features (enemies, gaps, platforms, etc.)
- Analyze scenes using dimensionality reduction (PCA, UMAP, t-SNE)
- Cluster scenes by gameplay similarity
Scenes are atomic gameplay segments with consistent mechanics (e.g., "gap with enemies", "staircase descent"). This decomposition enables fine-grained behavioral and neural analysis.
This package is a companion to cneuromod.mario and integrates with:
- mario.replays - Extract replay-level metadata (recommended for enriched clip metadata)
- mario.annotations
- mario_learning
- mario_curiosity.scene_agents
```bash
# Clone and install
git clone git@github.com:courtois-neuromod/mario.scenes
cd mario.scenes
pip install -e .

# Download scene metadata
invoke get-scenes-data

# (Optional) Download full Mario dataset
invoke setup-mario-dataset
```

HPC Setup (computing clusters):

```bash
pip install invoke
invoke setup-env-on-hpc
```

Extract video clips for each scene traversal from replay files:
```bash
# Extract clips from all replays
invoke create-clips --datapath sourcedata/mario --output outputdata/ \
    --save-videos --video-format mp4

# Save additional outputs (savestates, variables, ramdumps)
invoke create-clips --save-states --save-variables --save-ramdumps
```

Process specific subjects/sessions:

```bash
# Process only sub-01 and sub-02
invoke create-clips --subjects "sub-01 sub-02"

# Process only session 001
invoke create-clips --sessions "ses-001"

# Combine filters with parallel jobs
invoke create-clips \
    --subjects "sub-01 sub-02" \
    --sessions "ses-001 ses-002" \
    --n-jobs 8
```

Enhanced metadata with mario.replays: For richer metadata including replay-level statistics (score gained, enemies killed, coins collected, etc.), first process replays with mario.replays:
```bash
# Generate replay-level metadata (in mario.replays directory)
cd ../mario.replays
invoke create-replays --save-variables --output outputdata/replays

# Then use this data when extracting clips (back in mario.scenes)
cd ../mario.scenes
invoke create-clips --replays-path ../mario.replays/outputdata/replays
```

Output: BIDS-structured directories with videos, savestates, and JSON metadata:

```
outputdata/
└── sub-01/ses-001/beh/
    ├── videos/sub-01_ses-001_run-01_level-w1l1_scene-1_clip-*.mp4
    ├── savestates/sub-01_ses-001_run-01_level-w1l1_scene-1_clip-*.state
    ├── variables/sub-01_ses-001_run-01_level-w1l1_scene-1_clip-*.json
    └── infos/sub-01_ses-001_run-01_level-w1l1_scene-1_clip-*.json
```
Reduce the 27-dimensional annotations to 2D for visualization:

```bash
invoke dimensionality-reduction
```

Output: outputs/dimensionality_reduction/{pca,umap,tsne}.csv
Group scenes by gameplay similarity:

```bash
# Generate clusters with 5-30 groups
invoke cluster-scenes

# Custom cluster counts
invoke cluster-scenes --n-clusters "10 15 20"
```

Output: outputs/cluster_scenes/hierarchical_clusters.pkl
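The pickle can be read back with the standard library. A minimal sketch, assuming the list-of-dicts structure shown in the `generate_clusters()` example in the Python API section (the exact keys should be checked against the real file):

```python
import pickle

# Toy stand-in for the on-disk pickle; with the real output you would use:
#   with open('outputs/cluster_scenes/hierarchical_clusters.pkl', 'rb') as f:
#       clusters = pickle.load(f)
toy = [{'n_clusters': 10, 'summary': [{'n_scenes': 23}]}]
blob = pickle.dumps(toy)  # stands in for the file contents

clusters = pickle.loads(blob)
print(clusters[0]['n_clusters'])  # 10
```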
Create canonical level/scene backgrounds by averaging replay frames:

```bash
# Generate all backgrounds
invoke make-scene-images

# Specific level
invoke make-scene-images --level w1l1 --subjects sub-03
```

Output: sourcedata/{level,scene}_backgrounds/*.png
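The averaging idea can be sketched in a few lines of NumPy. This is a toy illustration, not the package's implementation; the frame alignment step and the 224×256 frame size are assumptions:

```python
import numpy as np

# Toy sketch: stack aligned frames of the same scene and take the
# per-pixel mean; moving sprites (Mario, enemies) wash out, leaving
# a canonical background. Real frames come from replayed .bk2 files.
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(50, 224, 256, 3), dtype=np.uint8)

background = frames.mean(axis=0).astype(np.uint8)
print(background.shape)  # (224, 256, 3)
```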
Run all processing steps:

```bash
invoke full-pipeline
```

Load scene metadata and annotations from Python:

```python
from mario_scenes.load_data import (
    load_scenes_info,
    load_annotation_data,
    load_background_images,
    load_reduced_data,
)

# Load scene boundaries
scenes = load_scenes_info(format='dict')  # {scene_id: {start, end, layout}}
print(scenes['w1l1s1'])  # {'start': 0, 'end': 256, 'level_layout': 0}

# Load 27 feature annotations
features = load_annotation_data()  # DataFrame (n_scenes × 27)
print(features.loc['w1l1s1'])

# Load 2D embeddings
umap_coords = load_reduced_data(method='umap')  # DataFrame (n_scenes × 2)

# Load background images
backgrounds = load_background_images(level='scene')  # {scene_id: PIL.Image}
```

Extract clips programmatically:

```python
from mario_scenes.create_clips.create_clips import cut_scene_clips, get_rep_order
from cneuromod_vg_utils.replay import get_variables_from_replay

# Replay a BK2 file
variables, info, frames, states, audio, audio_rate = get_variables_from_replay('path/to/file.bk2')

# Find scene traversals
scene_bounds = {'start': 0, 'end': 256, 'level_layout': 0}
rep_order = get_rep_order(ses=1, run=1, bk2_idx=0)
clips = cut_scene_clips(variables, rep_order, scene_bounds)
# clips = {'0010100000000': (start_frame, end_frame), ...}
```

Command-line filtering:
```bash
# Python script
python code/mario_scenes/create_clips/create_clips.py \
    --datapath sourcedata/mario \
    --subjects sub-01 sub-02 \
    --sessions ses-001 ses-002 \
    --save-videos

# Invoke task
invoke create-clips --subjects "sub-01" --sessions "ses-001"
```

Cluster scenes from Python:

```python
from mario_scenes.scenes_analysis.cluster_scenes import generate_clusters

# Generate hierarchical clustering
clusters = generate_clusters([10, 20, 30])

# Examine 10-cluster solution
print(clusters[0]['n_clusters'])  # 10
summary = clusters[0]['summary']
print(summary[0])  # {'n_scenes': 23, 'labels': ..., 'homogeneity': ...}
```

27 binary features capture gameplay elements:
| Category | Features |
|---|---|
| Enemies | Enemy, 2-Horde, 3-Horde, 4-Horde, Gap enemy |
| Terrain | Roof, Gap, Multiple gaps, Variable gaps, Pillar gap |
| Valleys | Valley, Pipe valley, Empty valley, Enemy valley, Roof valley |
| Paths | 2-Path, 3-Path |
| Stairs | Stair up, Stair down, Empty stair valley, Enemy stair valley, Gap stair valley |
| Platforms | Moving platform |
| Rewards | Risk/Reward, Reward, Bonus zone |
| Landmarks | Flagpole, Beginning |
See sourcedata/scenes_info/mario_scenes_manual_annotation.pdf for details.
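For example, the annotation DataFrame can be filtered to scenes containing a given feature. A sketch with a toy three-scene frame standing in for `load_annotation_data()`; the column spellings used by the real DataFrame should be checked against the table above:

```python
import pandas as pd

# Toy stand-in for load_annotation_data(): rows are scene IDs,
# columns are binary gameplay features (subset shown).
features = pd.DataFrame(
    {'Gap': [1, 0, 1], 'Enemy': [1, 1, 0], 'Flagpole': [0, 0, 1]},
    index=['w1l1s1', 'w1l1s2', 'w1l1s3'],
)

# Scenes containing a gap
gap_scenes = features.index[features['Gap'] == 1].tolist()
print(gap_scenes)  # ['w1l1s1', 'w1l1s3']

# Scenes with both a gap and an enemy
both = features[(features['Gap'] == 1) & (features['Enemy'] == 1)].index.tolist()
print(both)  # ['w1l1s1']
```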
Replays are recorded with gym-retro at 60 Hz; each .bk2 file stores the per-frame button presses, which allows deterministic replay of the gameplay.
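Because the frame rate is fixed, frame indices convert directly to gameplay time. A small helper (hypothetical, not part of the package API):

```python
FPS = 60  # gym-retro replay frame rate

def clip_duration_seconds(start_frame, end_frame, fps=FPS):
    """Duration in seconds of a clip spanning [start_frame, end_frame)."""
    return (end_frame - start_frame) / fps

# A clip spanning frames 300..720 lasts 7 seconds of gameplay.
print(clip_duration_seconds(300, 720))  # 7.0
```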
Expected structure:

```
sourcedata/mario/
└── sub-{subject}/ses-{session}/beh/
    ├── sub-{subject}_ses-{session}_run-{run}_level-{level}.bk2
    └── sub-{subject}_ses-{session}_run-{run}_events.tsv
```
BIDS-compliant format with unique clip identifiers:

```
{output}/sub-{subject}/ses-{session}/beh/
├── videos/      # .mp4/.gif/.webp clips
├── savestates/  # .state files (gzipped RAM)
├── ramdumps/    # .npz files (per-frame RAM)
├── infos/       # .json metadata sidecars
└── variables/   # .json game state variables
```

Filename format: `sub-{subject}_ses-{session}_run-{run}_level-{level}_scene-{scene}_clip-{code}.{ext}`
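These filenames can be parsed back into their entities with a small regex helper (hypothetical, mirroring the format above):

```python
import re

# Matches the clip filename pattern described above; each BIDS-style
# entity is captured as a named group.
CLIP_RE = re.compile(
    r'sub-(?P<subject>[^_]+)_ses-(?P<session>[^_]+)_run-(?P<run>[^_]+)'
    r'_level-(?P<level>[^_]+)_scene-(?P<scene>[^_]+)_clip-(?P<code>[^.]+)'
    r'\.(?P<ext>\w+)$'
)

def parse_clip_filename(name):
    """Return a dict of filename entities, or None if the name does not match."""
    match = CLIP_RE.fullmatch(name)
    return match.groupdict() if match else None

entities = parse_clip_filename(
    'sub-01_ses-001_run-01_level-w1l1_scene-1_clip-0010100000000.mp4'
)
print(entities['level'], entities['code'])  # w1l1 0010100000000
```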
| Task | Description |
|---|---|
| `setup-env` | Create virtual environment and install dependencies |
| `setup-env-on-hpc` | HPC-specific environment setup |
| `setup-mario-dataset` | Download Mario dataset via datalad |
| `get-scenes-data` | Download scene metadata from Zenodo |
| `dimensionality-reduction` | Apply PCA, UMAP, t-SNE to annotations |
| `cluster-scenes` | Hierarchical clustering on scene features |
| `create-clips` | Extract scene clips from replays |
| `make-scene-images` | Generate background images |
| `full-pipeline` | Run complete analysis workflow |

Run `invoke --list` for full options.
- Dataset: Courtois NeuroMod
- Scene Definitions: Zenodo Record 15586709
- Related Packages:
- mario.replays - Replay-level metadata extraction (use with --replays-path for enriched scene metadata)
- videogames.utils - Replay processing utilities
- airoh - Reproducible workflow framework
If you use this package, please cite:

```bibtex
@misc{mario_scenes,
  title={Mario Scenes: Atomic Scene Decomposition for Super Mario Bros},
  author={Courtois NeuroMod Team},
  year={2025},
  url={https://github.com/courtois-neuromod/mario.scenes}
}
```

MIT License - See LICENSE file for details.