Spotify Data Analysis

Toolkit to process and enrich your personal Spotify listening history, with memory-safe merges and resumable Spotify API enrichment.

Overview

This project processes your extended Spotify streaming history data to:

  • Combine multiple JSON history files into a single dataset
  • Clean and enrich the data with temporal features
  • Enrich metadata using multiple sources (Spotify API + external datasets)
  • Analyze listening patterns, genre evolution, and music trends
  • Generate visualizations of your musical journey

Features

  • Data Processing: Combine and clean multiple Spotify history JSON files
  • Multi-Source Metadata Enrichment:
    • Spotify Web API (complete but rate-limited)
    • External Kaggle datasets (fast, high coverage)
    • Automatic matching and merging
  • Comprehensive Analysis: Audio features, genres, popularity, and trends
  • Organized Structure: Clean, scalable codebase with a clear separation of concerns
  • Resume Capability: All processes can be safely interrupted and resumed

Project Structure (current)

spotify-data/
├── README.md
├── PROJECT_STATUS.md
├── requirements.txt
├── Makefile
├── docs/                                # Original Spotify data from export
├── data/
│   ├── processed/
│   │   ├── cleaned_streaming_history.csv
│   │   └── combined_streaming_history.csv
│   └── enriched/                        # (symlinked under scripts/spotify_api/data)
│       ├── ultimate_spotify_enriched_streaming_history.csv
│       ├── spotify_api_metadata.csv                      # append-only API results
│       ├── progress.sqlite                               # crash-safe progress
│       ├── spotify_api_enriched_streaming_history.csv    # merged final (base ← meta)
│       ├── spotify_api_enriched_streaming_history_songs.csv
│       └── spotify_api_enriched_streaming_history_podcasts.csv
├── scripts/
│   ├── data_processing/
│   │   ├── clean-history.py
│   │   └── combine-history.py
│   ├── external_matching/
│   │   └── ultimate_spotify_matcher.py
│   ├── enrichment/
│   │   └── merge_enrichments.py        # legacy (superseded by DuckDB)
│   ├── analysis/
│   ├── app/
│   ├── orchestrate.py
│   └── spotify_api/
│       ├── __init__.py
│       ├── duckdb_merge.py             # memory-safe on-disk LEFT JOIN (DuckDB)
│   │   ├── smart_metadata_enrichment.py # API enrichment + merge-only mode
│       ├── split_media_types.py        # split final CSV into songs/podcasts
│       └── data -> ../../data          # symlink so BASE_DIR-relative paths work
└── external_datasets/

Current Status (Aug 2025)

  • Final merged CSV generated by DuckDB without large in-memory joins.
  • Songs vs. podcasts split available for downstream analysis.
  • Spotify API enrichment runs in batches, appends to CSV, and records progress in SQLite for safe resume.

Coverage snapshot (Aug 15, 2025):

  • Base unique tracks: ~27,400
  • Kaggle-covered unique: ~6,069
  • API-covered unique (deduped): ~20,777 (prev ~10,866; +~9,911 today)
  • Final merged rows: 138,762 (includes header)
  • Split outputs: songs 127,099 rows; podcasts 11,664 rows (each includes header)
  • Estimated remaining unique for API next pass: ~3,000 (start-of-run to_process was 12,896; ~9.9k covered this run)

Available Metadata

  • Audio Features: Danceability, energy, valence, tempo, acousticness, instrumentalness, liveness, loudness, speechiness
  • Track Info: Release dates, popularity scores, explicit content flags
  • Genre Data: Detailed genre classifications with play counts
  • Artist Info: Complete artist metadata and classifications
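
As a quick sanity check, the final CSV can be loaded with pandas and the audio-feature columns summarized. This is a minimal sketch; the exact column names are assumptions and may differ from what your enrichment run produces.

# Peek at the enriched dataset (column names are illustrative; adjust to your file).
import pandas as pd

df = pd.read_csv("data/enriched/spotify_api_enriched_streaming_history.csv")

# Audio-feature columns described above, if present
audio_features = ["danceability", "energy", "valence", "tempo", "acousticness",
                  "instrumentalness", "liveness", "loudness", "speechiness"]
present = [c for c in audio_features if c in df.columns]

print(df.shape)
print(df[present].describe())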

Setup

1. Get Your Spotify Data

  1. Request your extended streaming history from Spotify Privacy Settings
  2. Wait for the email with your data (can take up to 30 days)
  3. Extract the JSON files to a docs/ folder in this project

2. Set Up Python Environment

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install required packages
pip install -r requirements.txt

3. Set Up Spotify API Credentials

While external datasets provide good coverage, you may want API access for additional data:

  1. Go to Spotify Developer Dashboard
  2. Create a new app to get your Client ID and Client Secret
  3. Copy .env.example to .env:
    cp .env.example .env
  4. Fill in your credentials:
    SPOTIFY_CLIENT_ID=your_client_id_here
    SPOTIFY_CLIENT_SECRET=your_client_secret_here
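
Before launching a long enrichment run, a short standalone check can confirm the credentials load correctly. This sketch assumes spotipy and python-dotenv are installed; the enrichment script manages its own API client, so this is only a sanity test.

# Minimal credential check; not part of the pipeline itself.
import os
from dotenv import load_dotenv
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

load_dotenv()  # reads SPOTIFY_CLIENT_ID / SPOTIFY_CLIENT_SECRET from .env

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=os.environ["SPOTIFY_CLIENT_ID"],
    client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],
))

# One cheap request confirms the credentials work
result = sp.search(q="track:Karma Police artist:Radiohead", type="track", limit=1)
print(result["tracks"]["items"][0]["name"])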

Usage

Quick Start

For the fastest results with good metadata coverage:

  1. Process Your Raw Data

    python scripts/data_processing/combine-history.py
    python scripts/data_processing/clean-history.py
  2. Enrich with External Datasets (Fast, no API limits)

    python scripts/external_matching/ultimate_spotify_matcher.py

    This provides ~30% coverage with complete audio features and genres.

Advanced Usage

For Maximum Coverage: Combine external datasets with Spotify API enrichment

  1. Set up your .env with Spotify API credentials
  2. Run external matching first (above)
  3. Run API enrichment (background, resumable):
    . .venv/bin/activate
    [ -f scripts/__init__.py ] || touch scripts/__init__.py
    [ -f scripts/spotify_api/__init__.py ] || touch scripts/spotify_api/__init__.py
    [ -e scripts/spotify_api/data ] || ln -s ../../data scripts/spotify_api/data
    LOG=/tmp/enrich_$(date +%Y%m%d-%H%M%S).log
    TQDM_DISABLE=1 nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 & echo "PID=$! LOG=$LOG"

Monitor and resume safely:

tail -n 80 "$LOG"
pgrep -af scripts.spotify_api.smart_metadata_enrichment
sqlite3 scripts/spotify_api/data/enriched/progress.sqlite "SELECT COUNT(*) FROM progress WHERE meta_json IS NOT NULL;"

Merge-only (low-memory, on-disk):

LOG=/tmp/merge_$(date +%Y%m%d-%H%M%S).log
nohup python -m scripts.spotify_api.smart_metadata_enrichment --merge-only > "$LOG" 2>&1 & echo "PID=$! LOG=$LOG"

Data Processing Steps

Step 1: Combine History Files

python scripts/data_processing/combine-history.py
  • Combines all JSON files from the docs/ folder
  • Creates data/processed/combined_streaming_history.csv
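
Conceptually, the combine step looks like the sketch below; combine-history.py is the actual implementation, this is only an illustration.

# Read every JSON export in docs/ and concatenate the records into one CSV.
import glob
import json
import pandas as pd

records = []
files = sorted(glob.glob("docs/*.json"))
for path in files:
    with open(path, encoding="utf-8") as f:
        records.extend(json.load(f))   # each export file is a list of play records

combined = pd.DataFrame(records)
combined.to_csv("data/processed/combined_streaming_history.csv", index=False)
print(f"{len(combined)} rows from {len(files)} files")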

Step 2: Clean and Enhance Data

python scripts/data_processing/clean-history.py
  • Creates data/processed/cleaned_streaming_history.csv with:
    • Converted timestamps and time-based features
    • Filtered short plays (<30 seconds)
    • Play duration in minutes
    • Weekday, hour, and temporal analysis features
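
A minimal sketch of this cleaning step, assuming Spotify's extended-history field names (ts, ms_played); clean-history.py is the actual implementation.

# Parse timestamps, derive temporal features, and drop very short plays.
import pandas as pd

df = pd.read_csv("data/processed/combined_streaming_history.csv")

df["ts"] = pd.to_datetime(df["ts"], utc=True)
df["minutes_played"] = df["ms_played"] / 60_000
df["hour"] = df["ts"].dt.hour
df["weekday"] = df["ts"].dt.day_name()

# Drop plays shorter than 30 seconds, as described above
df = df[df["ms_played"] >= 30_000]

df.to_csv("data/processed/cleaned_streaming_history.csv", index=False)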

Step 3: Metadata Enrichment

Option A: External Datasets (Recommended First)

python scripts/external_matching/ultimate_spotify_matcher.py
  • Downloads Ultimate Spotify DB from Kaggle
  • Matches tracks using fuzzy string matching
  • Provides audio features, genres, popularity data
  • No rate limits, fast processing
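
The matching idea, reduced to a toy example with Python's standard-library difflib (ultimate_spotify_matcher.py is the real implementation and runs against the full Kaggle dataset):

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Case-insensitive similarity ratio between two strings
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(track, artist, candidates, threshold=0.85):
    # Score each candidate (track, artist) pair and keep the best one above the threshold
    scored = [
        (0.5 * similarity(track, c_track) + 0.5 * similarity(artist, c_artist), (c_track, c_artist))
        for c_track, c_artist in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None

print(best_match("Karma Police", "Radiohead",
                 [("Karma Police", "Radiohead"), ("Creep", "Radiohead")]))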

Option B: Spotify Web API (memory-safe, resumable)

nohup python -m scripts.spotify_api.smart_metadata_enrichment > "$LOG" 2>&1 &
  • Skips tracks already covered by external datasets, plus track keys already processed in earlier API runs
  • Streams results to spotify_api_metadata.csv and records progress in SQLite
  • Performs final DuckDB LEFT JOIN into spotify_api_enriched_streaming_history.csv

Both options produce enriched CSVs (ultimate_spotify_enriched_*.csv and spotify_api_enriched_*.csv) with additional metadata such as Spotify track IDs, audio features, and genre information.
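
The final merge mentioned under Option B is a DuckDB LEFT JOIN that runs on disk rather than in pandas memory. A minimal sketch, with assumed paths, join keys, and column names (duckdb_merge.py is the actual implementation):

# Join the cleaned history against the API metadata without loading either CSV into memory.
import duckdb

con = duckdb.connect()  # in-memory catalog; the CSV scanning itself happens on disk
con.execute("""
    COPY (
        SELECT b.*, m.*
        FROM read_csv_auto('data/processed/cleaned_streaming_history.csv') AS b
        LEFT JOIN read_csv_auto('data/enriched/spotify_api_metadata.csv') AS m
          ON b.master_metadata_track_name = m.track_name
         AND b.master_metadata_album_artist_name = m.artist_name
    ) TO 'data/enriched/spotify_api_enriched_streaming_history.csv' (HEADER, DELIMITER ',')
""")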

Data Analysis

Available Data Files

After processing, you'll have these datasets:

  • data/processed/cleaned_streaming_history.csv - Cleaned listening history with temporal features
  • data/enriched/ultimate_spotify_enriched_*.csv - External dataset enrichment results
  • data/enriched/spotify_api_enriched_*.csv - API enrichment results (when available)

What You Can Analyze

Temporal Patterns:

  • Listening habits by time of day, day of week, season
  • Evolution of music taste over 15 years
  • Monthly and yearly listening volume trends
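
For example, a simple temporal cut with pandas might look like this; the column names follow the cleaning sketch above and are assumptions about your cleaned CSV.

# Listening minutes by hour of day and by year.
import pandas as pd

df = pd.read_csv("data/processed/cleaned_streaming_history.csv", parse_dates=["ts"])

by_hour = df.groupby(df["ts"].dt.hour)["minutes_played"].sum()
by_year = df.groupby(df["ts"].dt.year)["minutes_played"].sum()

print(by_hour.idxmax(), "is the peak listening hour")
print(by_year.round(0))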

Audio Feature Analysis:

  • Preference for danceability, energy, valence over time
  • Tempo preferences and changes
  • Acousticness vs. electronic music trends

Genre Evolution:

  • Top genres by play count and listening time
  • Genre diversity and discovery patterns
  • Seasonal genre preferences

Discovery Patterns:

  • Track popularity vs. personal preference
  • Artist loyalty and discovery rates
  • Repeat listening behavior
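
A sketch of one possible discovery metric, counting new artists per year and the share of plays that are repeats (column names are assumptions about the cleaned CSV):

import pandas as pd

df = pd.read_csv("data/processed/cleaned_streaming_history.csv", parse_dates=["ts"])
df = df.sort_values("ts")

# First time each artist appears -> "new artists discovered" per year
first_artist_play = df.groupby("master_metadata_album_artist_name")["ts"].min()
new_artists_per_year = first_artist_play.dt.year.value_counts().sort_index()

# Every play after a track's first counts as a repeat
plays_per_track = df.groupby("master_metadata_track_name").size()
repeat_plays = plays_per_track.sum() - len(plays_per_track)
repeat_share = repeat_plays / plays_per_track.sum()

print(new_artists_per_year.tail())
print(f"Share of plays that are repeats: {repeat_share:.1%}")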

Generated Files Overview

scripts/spotify_api/data/enriched/
├── ultimate_spotify_enriched_streaming_history.csv
├── spotify_api_metadata.csv
├── progress.sqlite
├── spotify_api_enriched_streaming_history.csv
├── spotify_api_enriched_streaming_history_songs.csv
└── spotify_api_enriched_streaming_history_podcasts.csv

Available Analysis Features

With enriched data, you can analyze:

Listening Patterns

  • Total hours and play counts by year, month, season
  • Peak listening hours and days of week
  • Seasonal and temporal trends across 15 years

Audio Features Evolution

  • Musical taste evolution over time
  • Trends in danceability, energy, valence, tempo
  • Acoustic vs. electronic preferences
  • Mood patterns (valence/energy correlations)

Genre Analysis

  • Top genres by play count and listening time
  • Genre discovery and evolution patterns
  • Seasonal genre preferences
  • Musical diversity metrics
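
One way to put a number on genre diversity is the Shannon entropy of the per-year genre play-count distribution. A sketch, assuming the enriched CSV has a ts timestamp and a comma-separated genres column:

import numpy as np
import pandas as pd

df = pd.read_csv("data/enriched/spotify_api_enriched_streaming_history.csv", parse_dates=["ts"])

def shannon_entropy(counts: pd.Series) -> float:
    # Entropy in bits of a play-count distribution
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Split the genre strings so each play/genre pair is one row
genres = (
    df.dropna(subset=["genres"])
      .assign(genre=lambda d: d["genres"].str.split(","))
      .explode("genre")
)
genres["genre"] = genres["genre"].str.strip()

per_year = genres.groupby([genres["ts"].dt.year, "genre"]).size()
diversity_by_year = per_year.groupby(level=0).apply(shannon_entropy)
print(diversity_by_year)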

Discovery & Popularity

  • Track popularity vs. personal preference
  • Artist loyalty and discovery rates
  • Repeat listening behavior analysis
  • Mainstream vs. niche music preferences

Technical Notes

API Rate Limiting

The enrichment is designed to run daily until the API quota is reached (≈10k calls/day), then resume the next day; a sketch of the pacing and resume pattern follows this list:

  • Default pacing: 50 items/batch, 0.3s/request, 5s between batches
  • Auto-resume and skip logic ensure no redundant calls
  • Safe to stop and restart anytime
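
A sketch of that pacing and resume pattern (the real logic lives in smart_metadata_enrichment.py; the fetch_metadata callable and the track_key column are illustrative):

import sqlite3
import time

BATCH_SIZE = 50          # items per batch
PER_REQUEST_SLEEP = 0.3  # seconds between API calls
BETWEEN_BATCHES = 5.0    # seconds between batches

def enrich(track_keys, fetch_metadata, db_path="progress.sqlite"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS progress (track_key TEXT PRIMARY KEY, meta_json TEXT)")

    done = {row[0] for row in con.execute("SELECT track_key FROM progress WHERE meta_json IS NOT NULL")}
    todo = [k for k in track_keys if k not in done]   # skip already-processed keys on resume

    for start in range(0, len(todo), BATCH_SIZE):
        for key in todo[start:start + BATCH_SIZE]:
            meta = fetch_metadata(key)               # one Spotify API call (caller-supplied)
            con.execute("INSERT OR REPLACE INTO progress VALUES (?, ?)", (key, meta))
            con.commit()                             # commit per item so a crash loses nothing
            time.sleep(PER_REQUEST_SLEEP)
        time.sleep(BETWEEN_BATCHES)

    con.close()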

External Dataset Benefits

  • No Rate Limits: Process entire dataset quickly
  • High Coverage: 30%+ match rates with quality data
  • Complete Features: All audio features and genres included
  • Reliable: No authentication or quota concerns

Data Quality

  • Fuzzy string matching for robust track identification
  • Duplicate detection and handling
  • Comprehensive metadata validation
  • Multiple source integration for maximum coverage

Data Privacy & Security

  • Your .env file with API credentials is excluded from git
  • All CSV files with personal listening data are gitignored
  • External datasets contain only public metadata
  • Never commit or share personal listening data

Contributing

This project welcomes contributions! Areas for improvement:

  • Additional external dataset integrations
  • Advanced analysis and visualization scripts
  • Performance optimizations
  • New matching algorithms

Troubleshooting

Common Issues

  1. API Rate Limits: Use external datasets first, then supplement with API
  2. Low Match Rates: Check for special characters in track/artist names
  3. Missing Data: Some tracks may not exist in any dataset
  4. Memory Issues: Process in smaller batches if needed
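
For memory issues in particular, pandas can stream the large CSVs in chunks instead of loading them whole. A small sketch, assuming a column name from Spotify's export:

# Count plays per artist without holding the full CSV in memory.
import pandas as pd

totals = {}
for chunk in pd.read_csv("data/enriched/spotify_api_enriched_streaming_history.csv", chunksize=100_000):
    counts = chunk["master_metadata_album_artist_name"].value_counts()
    for artist, n in counts.items():
        totals[artist] = totals.get(artist, 0) + n

top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)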

Getting Help

  • Check the progress files for API enrichment status
  • Review console output for match statistics
  • Verify your raw data format matches expected structure

License

This project is for personal use. Respect Spotify's API terms of service and your data privacy.
