Label-aware dedup ranking from WXYC flowsheet data #10

Merged
jakebromberg merged 2 commits into main from feat/dedup-label-match
Feb 20, 2026
@jakebromberg
Summary

  • Add --library-labels option to dedup that uses WXYC flowsheet label data to prefer releases whose Discogs label matches WXYC's known pressing
  • Extract connect_mysql to shared lib/wxyc.py module
  • Add scripts/extract_library_labels.py for extracting label triples from WXYC MySQL
  • Wire label extraction and --library-labels through run_pipeline.py

Closes #9

Test plan

  • Unit tests: 148 pass (14 new: 4 wxyc + 10 extract_library_labels)
  • Integration tests: 90 pass (6 new: TestDedupWithLabels)
  • E2E tests: TestPipelineWithLabels (4 new tests pass)
  • Verify label-aware dedup with production WXYC MySQL data

Jake Bromberg added 2 commits February 19, 2026 20:00
When --library-labels is provided, dedup prefers releases whose Discogs
label matches WXYC's known pressing over releases with more tracks but
a different label. This ensures the cached edition matches what WXYC
actually owns.
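
The ranking described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in dedup_releases.py: it uses an in-memory SQLite database as a stand-in for the real one, and the `release` table, every column name, and the sample rows are assumptions made for the example. Only the names wxyc_label_pref, release_label_match, label_match, and track_count come from the commit message.

```python
import sqlite3

# In-memory SQLite stands in for the pipeline's real database (assumption).
# Window functions require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE release (            -- hypothetical schema
    id INTEGER PRIMARY KEY,
    artist TEXT, title TEXT, label TEXT, track_count INTEGER
);
CREATE TABLE wxyc_label_pref (    -- (artist, title, label) triples; columns assumed
    artist TEXT, title TEXT, label TEXT
);
INSERT INTO release VALUES
    (1, 'Artist A', 'Album X', 'Major Reissue', 14),
    (2, 'Artist A', 'Album X', 'Indie Original', 10),
    (3, 'Artist B', 'Album Y', 'Label P', 4),
    (4, 'Artist B', 'Album Y', 'Label Q', 5);
INSERT INTO wxyc_label_pref VALUES ('Artist A', 'Album X', 'Indie Original');
""")

RANK_SQL = """
WITH release_label_match AS (
    -- label_match is 1 when the release's label equals the library's label
    SELECT r.id, r.artist, r.title, r.track_count,
           CASE WHEN p.label IS NOT NULL THEN 1 ELSE 0 END AS label_match
    FROM release r
    LEFT JOIN wxyc_label_pref p
      ON lower(p.artist) = lower(r.artist)
     AND lower(p.title)  = lower(r.title)
     AND lower(p.label)  = lower(r.label)
),
ranked AS (
    -- label_match sorts before track_count, so a matching label wins
    -- even when another edition has more tracks
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY lower(artist), lower(title)
               ORDER BY label_match DESC, track_count DESC, id
           ) AS rn
    FROM release_label_match
)
SELECT id FROM ranked WHERE rn = 1 ORDER BY id
"""

kept = [row[0] for row in conn.execute(RANK_SQL)]
```

In this sample, release 2 is kept for 'Album X' because its label matches the library's, despite release 1 having more tracks. 'Album Y' has no label preference, so it falls back to track count and release 4 is kept.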

New files:
- lib/wxyc.py: shared MySQL connection utility (extracted from enrich)
- scripts/extract_library_labels.py: extracts (artist, title, label)
  triples from WXYC FLOWSHEET_ENTRY_PROD

Modified:
- dedup_releases.py: --library-labels loads wxyc_label_pref table,
  creates release_label_match via JOIN, ranks by label_match DESC
  before track_count DESC
- run_pipeline.py: --library-labels passed to dedup; auto-extracts
  from --wxyc-db-url if no CSV provided
- enrich_library_artists.py: uses lib.wxyc.connect_mysql
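
The fallback described for run_pipeline.py (use the CSV if given, otherwise extract from the database URL) might look something like the sketch below. The two flag names come from this PR; `resolve_library_labels`, the `extract` callable, and its signature are hypothetical and stand in for whatever run_pipeline.py actually calls.

```python
import argparse
from pathlib import Path
from typing import Callable, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--library-labels", type=Path, default=None,
                        help="CSV of (artist, title, label) triples")
    parser.add_argument("--wxyc-db-url", default=None,
                        help="WXYC MySQL connection URL")
    return parser


def resolve_library_labels(
    args: argparse.Namespace,
    extract: Callable[[str, Path], None],  # hypothetical extraction hook
    out_path: Path,
) -> Optional[Path]:
    """Return the label CSV to hand to dedup, or None to skip label ranking."""
    if args.library_labels is not None:
        # An explicit CSV always wins; the database is not touched.
        return args.library_labels
    if args.wxyc_db_url is not None:
        # No CSV provided: extract one from the WXYC database.
        extract(args.wxyc_db_url, out_path)
        return out_path
    # Neither option given: dedup runs without label awareness.
    return None
```

Passing the extraction step in as a callable is a choice made for this sketch so the fallback can be exercised without a MySQL connection.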

@jakebromberg force-pushed the feat/dedup-label-match branch from 45f3a24 to 5218002 on February 20, 2026 04:11
@jakebromberg merged commit 8f5820a into main on Feb 20, 2026
3 checks passed
@jakebromberg deleted the feat/dedup-label-match branch on February 20, 2026 04:13