[MAINTENANCE] Update data dump indexing to support v2-only format #347

@adambuttrick

Service/repository

Describe the current state/issue

With the deprecation of schema v1, ROR data dumps now contain only a single JSON file in v2 format. However, the indexing code and associated GitHub Actions workflows still expect the old format with two files (v1 and v2).

Old data dump format:

v1.73-2025-10-28-ror-data/
├── v1.73-2025-10-28-ror-data.json
├── v1.73-2025-10-28-ror-data.csv
├── v1.73-2025-10-28-ror-data_schema_v2.json
└── v1.73-2025-10-28-ror-data_schema_v2.csv

New data dump format (v2.0 onwards):

v2.0-2025-12-16-ror-data/
├── v2.0-2025-12-16-ror-data.json 
└── v2.0-2025-12-16-ror-data.csv

The current indexrordump.py uses filename pattern matching to detect schema version:

  • Files with schema_v2 in filename are used for the v2 index
  • Files without schema_v2 in filename are used for the v1 index

This logic breaks with v2.0-2025-12-16-ror-data.json, which does not contain schema_v2 in the filename, causing it to be incorrectly treated as the v1 data dump.
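
The misclassification can be reproduced with a minimal sketch of the filename check described above. The function name `detect_schema_version` is hypothetical and is not taken from indexrordump.py; only the substring rule and the filenames come from this issue.

```python
def detect_schema_version(filename: str) -> str:
    """Hypothetical reconstruction of the filename-based check.

    Assumption: the real code in indexrordump.py may be structured
    differently; only the 'schema_v2' substring rule is from the issue.
    """
    return "v2" if "schema_v2" in filename else "v1"


# JSON files from the old and new dump layouts shown above.
old_dump = [
    "v1.73-2025-10-28-ror-data.json",
    "v1.73-2025-10-28-ror-data_schema_v2.json",
]
new_dump = ["v2.0-2025-12-16-ror-data.json"]

for name in old_dump + new_dump:
    # The new v2.0 file lacks 'schema_v2', so it falls through to "v1".
    print(f"{name} -> {detect_schema_version(name)}")
```

Under this rule the only JSON file in the new dump is assigned to v1, and no file is assigned to v2.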

Describe the desired state/solution

Complete removal of v1 indexing support from both repositories.
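
As a rough illustration of what v2-only file selection might look like, the sketch below picks the single JSON file from a dump's file listing and treats it as v2 without inspecting the name for a schema marker. The helper `select_v2_json` and its strict "exactly one JSON file" rule are assumptions for illustration, not the actual change made in ror-api.

```python
from typing import Iterable


def select_v2_json(filenames: Iterable[str]) -> str:
    """Return the single JSON file of a v2-only dump (hypothetical helper).

    Assumption: a v2-only dump contains exactly one .json file, as in the
    v2.0 layout shown in this issue. Anything else is treated as an error
    rather than guessed at.
    """
    json_files = [f for f in filenames if f.endswith(".json")]
    if len(json_files) != 1:
        raise ValueError(
            f"expected exactly one JSON file, found {len(json_files)}: {json_files}"
        )
    return json_files[0]


# Usage with the new dump layout from this issue.
new_dump = [
    "v2.0-2025-12-16-ror-data.json",
    "v2.0-2025-12-16-ror-data.csv",
]
print(select_v2_json(new_dump))
```

With this approach an old-format dump, which contains two JSON files, raises an error instead of being silently indexed under the wrong schema.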

ror-api tasks:

ror-records workflow tasks:

Testing tasks:

  • Test all changes in dev-ror-records environment first
  • Verify indexing works with new v2.0 data dump format
  • Verify staging workflows function correctly
  • Verify production workflows function correctly

Metadata

Labels

maintenance: Work needed to maintain long-term health/performance of code and infrastructure

Projects

Status

Complete
