Toolkit to process and enrich your personal Spotify listening history, with memory-safe merges and resumable Spotify API enrichment.
This project processes your extended Spotify streaming history data to:
- Combine multiple JSON history files into a single dataset
- Clean and enrich the data with temporal features
- Enrich metadata using multiple sources (Spotify API + external datasets)
- Analyze listening patterns, genre evolution, and music trends
- Generate visualizations of your musical journey
- **Data Processing**: Combine and clean multiple Spotify history JSON files
- **Multi-Source Metadata Enrichment**:
  - Spotify Web API (complete but rate-limited)
  - External Kaggle datasets (fast, high coverage)
  - Automatic matching and merging
- **Comprehensive Analysis**: Audio features, genres, popularity, and trends
- **Organized Structure**: Clean, scalable codebase with clear separation of concerns
- **Resume Capability**: All processes can be safely interrupted and resumed
```
spotify-data/
├── README.md
├── PROJECT_STATUS.md
├── requirements.txt
├── Makefile
├── docs/                          # Original Spotify data from export
├── data/
│   ├── processed/
│   │   ├── cleaned_streaming_history.csv
│   │   └── combined_streaming_history.csv
│   └── enriched/                  # (symlinked under scripts/spotify_api/data)
│       ├── ultimate_spotify_enriched_streaming_history.csv
│       ├── spotify_api_metadata.csv                      # append-only API results
│       ├── progress.sqlite                               # crash-safe progress
│       ├── spotify_api_enriched_streaming_history.csv    # merged final (base ← meta)
│       ├── spotify_api_enriched_streaming_history_songs.csv
│       └── spotify_api_enriched_streaming_history_podcasts.csv
├── scripts/
│   ├── data_processing/
│   │   ├── clean-history.py
│   │   └── combine-history.py
│   ├── external_matching/
│   │   └── ultimate_spotify_matcher.py
│   ├── enrichment/
│   │   └── merge_enrichments.py   # legacy (superseded by DuckDB)
│   ├── analysis/
│   ├── app/
│   ├── orchestrate.py
│   └── spotify_api/
│       ├── __init__.py
│       ├── duckdb_merge.py                # memory-safe on-disk LEFT JOIN (DuckDB)
│       ├── smart_metadata_enrichment.py   # API enrichment + merge-only mode
│       ├── split_media_types.py           # split final CSV into songs/podcasts
│       └── data -> ../../data             # symlink so BASE_DIR-relative paths work
└── external_datasets/
```
- Final merged CSV generated by DuckDB without large in-memory joins.
- Songs vs. podcasts split available for downstream analysis.
- Spotify API enrichment runs in batches, appends to CSV, and records progress in SQLite for safe resume.
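The resume behavior hinges on committing each API result to SQLite before the next request, so an interruption loses at most the in-flight item. A minimal sketch of that pattern (the `progress` table and `meta_json` column match the monitoring query shown later in this README; the `track_key` column name is an assumption, and the real schema lives in `smart_metadata_enrichment.py`):

```python
import json
import sqlite3


def open_progress(path="progress.sqlite"):
    """Create (or reopen) the progress table used for safe resume."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        " track_key TEXT PRIMARY KEY,"
        " meta_json TEXT)"
    )
    return conn


def is_done(conn, key):
    """True if this track already has stored API metadata."""
    row = conn.execute(
        "SELECT 1 FROM progress WHERE track_key = ? AND meta_json IS NOT NULL",
        (key,),
    ).fetchone()
    return row is not None


def mark_done(conn, key, meta):
    """Record a fetched result; commit per item so a crash loses at most one."""
    conn.execute(
        "INSERT OR REPLACE INTO progress (track_key, meta_json) VALUES (?, ?)",
        (key, json.dumps(meta)),
    )
    conn.commit()
```

On restart, the enrichment loop simply skips every key for which `is_done` returns true.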
Coverage snapshot (Aug 15, 2025):
- Base unique tracks: ~27,400
- Kaggle-covered unique: ~6,069
- API-covered unique (deduped): ~20,777 (prev ~10,866; +~9,911 today)
- Final merged rows: 138,762 (includes header)
- Split outputs: songs 127,099 rows; podcasts 11,664 rows (each includes header)
- Estimated remaining unique for API next pass: ~3,000 (start-of-run to_process was 12,896; ~9.9k covered this run)
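The songs vs. podcasts split relies on fields that only podcast rows populate in Spotify's extended export. A sketch of the idea (assuming the export's `episode_name` column; the real logic lives in `split_media_types.py`):

```python
import pandas as pd


def split_media_types(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split an enriched history frame into (songs, podcasts).

    Assumes the extended-history column `episode_name`, which is
    non-null only for podcast plays.
    """
    is_podcast = df["episode_name"].notna()
    return df[~is_podcast].copy(), df[is_podcast].copy()
```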
- Audio Features: Danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, loudness, speechiness
- Track Info: Release dates, popularity scores, explicit content flags
- Genre Data: Detailed genre classifications with play counts
- Artist Info: Complete artist metadata and classifications
- Request your extended streaming history from Spotify Privacy Settings
- Wait for the email with your data (can take up to 30 days)
- Extract the JSON files to a `docs/` folder in this project
```bash
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install required packages
pip install -r requirements.txt
```

While external datasets provide good coverage, you may want API access for additional data:
- Go to the Spotify Developer Dashboard
- Create a new app to get your Client ID and Client Secret
- Copy `.env.example` to `.env`: `cp .env.example .env`
- Fill in your credentials:

```
SPOTIFY_CLIENT_ID=your_client_id_here
SPOTIFY_CLIENT_SECRET=your_client_secret_here
```
For the fastest results with good metadata coverage:
- Process Your Raw Data:

  ```bash
  python scripts/data_processing/combine-history.py
  python scripts/data_processing/clean-history.py
  ```

- Enrich with External Datasets (fast, no API limits):

  ```bash
  python scripts/external_matching/ultimate_spotify_matcher.py
  ```
This provides ~30% coverage with complete audio features and genres.
For Maximum Coverage: Combine external datasets with Spotify API enrichment
- Set up your `.env` with Spotify API credentials
- Run external matching first (above)
- Run API enrichment (background, resumable):

```bash
. .venv/bin/activate
[ -f scripts/__init__.py ] || touch scripts/__init__.py
[ -f scripts/spotify_api/__init__.py ] || touch scripts/spotify_api/__init__.py
[ -e scripts/spotify_api/data ] || ln -s ../../data scripts/spotify_api/data
LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
TQDM_DISABLE=1 nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
```
Monitor and resume safely:
```bash
tail -n 80 "$LOG"
pgrep -af scripts.spotify_api.smart_metadata_enrichment
sqlite3 scripts/spotify_api/data/enriched/progress.sqlite "SELECT COUNT(*) FROM progress WHERE meta_json IS NOT NULL;"
```

Merge-only (low-memory, on-disk):

```bash
LOG=/tmp/merge_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment --merge-only > "$LOG" 2>&1 &
echo "PID=$! LOG=$LOG"
```

`python scripts/data_processing/combine-history.py`

- Combines all JSON files from the `docs/` folder
- Creates `data/processed/combined_streaming_history.csv`
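Combining boils down to reading every JSON export and concatenating the records. A minimal sketch of the idea (the exact columns depend on your export, and the real script may handle more edge cases):

```python
import json
from pathlib import Path

import pandas as pd


def combine_history(docs_dir: str = "docs") -> pd.DataFrame:
    """Concatenate every streaming-history JSON export into one frame."""
    frames = []
    for path in sorted(Path(docs_dir).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            # Each export file is a JSON array of play records.
            frames.append(pd.DataFrame(json.load(fh)))
    return pd.concat(frames, ignore_index=True)
```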
`python scripts/data_processing/clean-history.py`

- Creates `data/processed/cleaned_streaming_history.csv` with:
  - Converted timestamps and time-based features
  - Filtered short plays (<30 seconds)
  - Play duration in minutes
  - Weekday, hour, and temporal analysis features
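The cleaning steps above can be sketched in pandas; the column names `ts` and `ms_played` follow Spotify's extended export, and the actual script may derive additional features:

```python
import pandas as pd


def clean_history(df: pd.DataFrame) -> pd.DataFrame:
    """Timestamp conversion, short-play filtering, and temporal features."""
    out = df.copy()
    out["ts"] = pd.to_datetime(out["ts"], utc=True)
    out = out[out["ms_played"] >= 30_000].copy()   # drop plays under 30 s
    out["minutes_played"] = out["ms_played"] / 60_000
    out["weekday"] = out["ts"].dt.day_name()
    out["hour"] = out["ts"].dt.hour
    return out
```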
Option A: External Datasets (Recommended First)
`python scripts/external_matching/ultimate_spotify_matcher.py`

- Downloads the Ultimate Spotify DB from Kaggle
- Matches tracks using fuzzy string matching
- Provides audio features, genres, and popularity data
- No rate limits, fast processing
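Fuzzy matching can be illustrated with the standard library alone; the real matcher likely uses a dedicated library and heavier normalization, so treat this as a sketch of the idea:

```python
from difflib import SequenceMatcher


def best_match(query: str, candidates: list[str], threshold: float = 0.85):
    """Return the candidate most similar to `query`, or None if no
    candidate clears the similarity threshold."""
    def norm(s: str) -> str:
        # Lowercase and collapse whitespace before comparing.
        return " ".join(s.lower().split())

    q = norm(query)
    best, score = None, 0.0
    for cand in candidates:
        r = SequenceMatcher(None, q, norm(cand)).ratio()
        if r > score:
            best, score = cand, r
    return best if score >= threshold else None
```

A misspelled title like `"bohemian rapsody"` still resolves to the correct catalog entry, while unrelated strings fall below the threshold and return `None`.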
Option B: Spotify Web API (memory-safe, resumable)
```bash
nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
```

- Skips tracks covered by external datasets and already-processed API keys
- Streams results to `spotify_api_metadata.csv` and records progress in SQLite
- Performs the final DuckDB LEFT JOIN into `spotify_api_enriched_streaming_history.csv`
Both options write enriched `*_streaming_history.csv` files with additional metadata:
- Spotify track IDs
After processing, you'll have these datasets:
- `data/processed/cleaned_streaming_history.csv` - cleaned listening history with temporal features
- `data/enriched/ultimate_spotify_enriched_*.csv` - external dataset enrichment results
- `data/enriched/spotify_api_enriched_*.csv` - API enrichment results (when available)
Temporal Patterns:
- Listening habits by time of day, day of week, season
- Evolution of music taste over 15 years
- Monthly and yearly listening volume trends
Audio Feature Analysis:
- Preference for danceability, energy, valence over time
- Tempo preferences and changes
- Acousticness vs. electronic music trends
Genre Evolution:
- Top genres by play count and listening time
- Genre diversity and discovery patterns
- Seasonal genre preferences
Discovery Patterns:
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior
```
scripts/spotify_api/data/enriched/
├── ultimate_spotify_enriched_streaming_history.csv
├── spotify_api_metadata.csv
├── progress.sqlite
├── spotify_api_enriched_streaming_history.csv
├── spotify_api_enriched_streaming_history_songs.csv
└── spotify_api_enriched_streaming_history_podcasts.csv
```
With enriched data, you can analyze:
- Total hours and play counts by year, month, season
- Peak listening hours and days of week
- Seasonal and temporal trends across 15 years
- Musical taste evolution over time
- Trends in danceability, energy, valence, tempo
- Acoustic vs. electronic preferences
- Mood patterns (valence/energy correlations)
- Top genres by play count and listening time
- Genre discovery and evolution patterns
- Seasonal genre preferences
- Musical diversity metrics
- Track popularity vs. personal preference
- Artist loyalty and discovery rates
- Repeat listening behavior analysis
- Mainstream vs. niche music preferences
The enrichment is designed to run daily until quota is hit (≈10k calls/day), then resume the next day:
- Default pacing: 50 items/batch, 0.3s/request, 5s between batches
- Auto-resume and skip logic ensure no redundant calls
- Safe to stop and restart anytime
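That pacing can be expressed as a small generator; `fetch` here stands in for the real per-track API call, and the default delays mirror the values listed above:

```python
import time
from typing import Callable, Iterator


def enrich_in_batches(items: list, fetch: Callable,
                      batch_size: int = 50,
                      per_request_delay: float = 0.3,
                      between_batches: float = 5.0) -> Iterator[tuple]:
    """Yield (item, result) pairs while pacing requests:
    `batch_size` items per batch, a short sleep after each request,
    and a longer pause between batches."""
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            yield item, fetch(item)
            time.sleep(per_request_delay)
        if start + batch_size < len(items):
            time.sleep(between_batches)
```

Because results are yielded as they arrive, the caller can append each one to the CSV and mark it done in SQLite before the next request fires, which is what makes stopping and restarting safe.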
- No Rate Limits: Process entire dataset quickly
- High Coverage: 30%+ match rates with quality data
- Complete Features: All audio features and genres included
- Reliable: No authentication or quota concerns
- Fuzzy string matching for robust track identification
- Duplicate detection and handling
- Comprehensive metadata validation
- Multiple source integration for maximum coverage
- Your `.env` file with API credentials is excluded from git
- All CSV files with personal listening data are gitignored
- External datasets contain only public metadata
- Never commit or share personal listening data
This project welcomes contributions! Areas for improvement:
- Additional external dataset integrations
- Advanced analysis and visualization scripts
- Performance optimizations
- New matching algorithms
- API Rate Limits: Use external datasets first, then supplement with API
- Low Match Rates: Check for special characters in track/artist names
- Missing Data: Some tracks may not exist in any dataset
- Memory Issues: Process in smaller batches if needed
- Check the progress files for API enrichment status
- Review console output for match statistics
- Verify your raw data format matches expected structure
This project is for personal use. Respect Spotify's API terms of service and your data privacy.