Summary
We should consider migrating the data dump storage mechanism from committing zip files to the ror-data repository to attaching them as release assets on ror-records. This would consolidate all release data into a single repository, eliminate cross-repo commits during the release process, and simplify the dependency graph between repositories.
Motivation
The current data dump publication process works across multiple repositories and involves cross-repo Git commits, which creates unnecessary coupling and complexity:
- The `generate_dump.yml` workflow in `ror-records` checks out `ror-data`, copies the previous release zip from it, generates a new dump, and then commits the new zip back to `ror-data`.
- The indexing workflows (`prod_index_dump.yml` / `staging_index_dump.yml`) pass a `data-env` parameter to the ror-api, which resolves it to either the `ror-data` or `ror-data-test` GitHub repository and downloads the dump via the GitHub Contents/Blob API.
- The Zenodo publication workflow (`publish_dump_zenodo.yml`) checks out `ror-data` to locate the dump zip, then runs a script downloaded from `curation_ops` to upload it.
Since ror-records already uses GitHub releases (the `main_release.yml` workflow triggers on `release: [published]` events), attaching data dump zips as release assets is a natural fit. It removes the need for ror-data as a storage intermediary and keeps all release artifacts co-located with the source records they were generated from.
Current State
1. Dump generation (generate_dump.yml in ror-records)
The workflow accepts `new-release` and `prev-release` as inputs. It:

- Checks out `ror-records` and `ror-community/ror-data`.
- Copies the previous dump zip from the `ror-data` checkout: `cp -R ./ror-data/${{prev-release}}.zip ./ror-records`.
- Checks out `ror-community/curation_ops` and runs `generate_dump.py` with input/output paths pointing at the `ror-records` checkout.
- Copies the generated zip back to `ror-data`: `cp -rf ./ror-records/${{new-release}}*.zip ./ror-data`.
- Commits and pushes to `ror-data` with the message "add new data dump file".
2. Dump indexing (prod_index_dump.yml / staging_index_dump.yml in ror-records)
Both workflows accept `release-dump`, `schema-version`, and `data-env` inputs. They run `index_dump.py`, which constructs a URL by joining the API base URL, the filename, and the data environment: `full_url = os.path.join(url, filename, dataenv)`. This calls the ror-api `IndexDataDump` endpoint at `/v2/indexdatadump/(unknown)/{dataenv}`.
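The join above can be sketched as follows; the base URL and filename values here are placeholders for illustration, not values taken from the actual workflow:

```python
# Sketch of the URL construction in index_dump.py (line 30).
# The base URL and filename below are illustrative placeholders.
import os

def build_index_dump_url(api_base_url, filename, dataenv):
    # os.path.join concatenates the segments with the path separator,
    # yielding {api_base_url}/{filename}/{dataenv} on POSIX runners
    return os.path.join(api_base_url, filename, dataenv)

full_url = build_index_dump_url(
    "https://example.org/v2/indexdatadump", "v1.0-ror-data", "prod")
```

Note that `os.path.join` is a slightly unusual choice for building URLs (on non-POSIX systems it would use the local path separator); `"/".join(...)` would be the more conventional form.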
On the API side (`getrordump.py`), the `data-env` parameter is resolved to a repository URL via `settings.py`:

```python
ROR_DUMP['PROD_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data'
ROR_DUMP['TEST_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data-test'
```

The API then downloads the dump zip from the resolved repository using the GitHub Contents API and Blob API.
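A hedged sketch of what that Contents API + Blob API path looks like; the function names and the assumption that dumps sit at the repo root are illustrative, not the actual `getrordump.py` code:

```python
# Illustrative sketch of the current ror-api download mechanism: resolve
# data-env to a repo URL, locate the dump zip via the Contents API, then
# fetch its blob. Function names here are assumptions, not ror-api code.
import base64
import json
from urllib.request import urlopen

ROR_DUMP = {
    'PROD_REPO_URL': 'https://api.github.com/repos/ror-community/ror-data',
    'TEST_REPO_URL': 'https://api.github.com/repos/ror-community/ror-data-test',
}

def resolve_repo_url(dataenv):
    # data-env selects which repository the dump is fetched from
    return ROR_DUMP['PROD_REPO_URL'] if dataenv == 'prod' else ROR_DUMP['TEST_REPO_URL']

def download_dump(filename, dataenv):
    repo_url = resolve_repo_url(dataenv)
    # Contents API: list the repo root to find the dump's blob SHA
    with urlopen(f"{repo_url}/contents/") as resp:
        contents = json.load(resp)
    sha = next(f['sha'] for f in contents if f['name'] == f"{filename}.zip")
    # Blob API: fetch the base64-encoded file, sidestepping the Contents
    # API's response size limit on large files
    with urlopen(f"{repo_url}/git/blobs/{sha}") as resp:
        blob = json.load(resp)
    return base64.b64decode(blob['content'])
```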
3. Zenodo publication (publish_dump_zenodo.yml in ror-records)
The workflow:

- Checks out `ror-community/ror-data` (using `actions/checkout@v2`).
- Changes directory into the `ror-data` checkout.
- Downloads `upload_dump_zenodo.py` and its `requirements.txt` from the `curation_ops` repository (`schema-v2-1` branch) via raw GitHub URLs.
- Runs the script, which expects the dump zip to be in the current working directory.
Proposed Change
Attach data dump zips as GitHub release assets on the ror-records repository instead of committing them to ror-data. All downstream consumers (indexing workflows, Zenodo publication, ror-api) should then be updated to download dump files from the GitHub Releases API on ror-records.
The GitHub Releases API provides stable, versioned download URLs in the form:

`https://github.com/ror-community/ror-records/releases/download/{tag}/(unknown).zip`

or via the API:

`https://api.github.com/repos/ror-community/ror-records/releases/tags/{tag}`
These changes would eliminate the GitHub Contents API + Blob API workaround currently used in ror-api, while also lifting the 100 MB size limitation on committed files.
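As a sketch, a consumer could resolve the release by tag and download the attached asset directly; the asset naming and error handling below are assumptions, since the exact parameter scheme is still to be decided:

```python
# Sketch of the proposed consumer side: look up a ror-records release by tag
# and download its dump asset. Asset naming and error handling are assumptions.
import json
from urllib.request import urlopen

RELEASES_URL = "https://api.github.com/repos/ror-community/ror-records/releases/tags/{tag}"

def download_dump_asset(tag, dest_path):
    with urlopen(RELEASES_URL.format(tag=tag)) as resp:
        release = json.load(resp)
    # Each asset exposes a stable browser_download_url of the form
    # .../releases/download/{tag}/{asset-name}
    asset = next(a for a in release['assets'] if a['name'].endswith('.zip'))
    with urlopen(asset['browser_download_url']) as resp, open(dest_path, 'wb') as out:
        out.write(resp.read())
    return asset['name']
```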
Affected Components
| Repository | Component | File(s) | Current ror-data dependency | Required change |
|---|---|---|---|---|
| ror-records | Dump generation | `.github/workflows/generate_dump.yml` | Checks out ror-data to read previous dump (lines 27-31) and commits new dump (lines 57-70) | Download previous dump from release asset; upload new dump as release asset instead of committing |
| ror-records | Zenodo publication | `.github/workflows/publish_dump_zenodo.yml` | Checks out ror-data to find dump zips (lines 35-40) | Download dump zip from release asset |
| ror-records | Prod dump indexing | `.github/workflows/prod_index_dump.yml` | Passes `data-env` to API, which resolves to ror-data repo (line 63) | Remove or repurpose `data-env` parameter; pass release tag or asset URL instead |
| ror-records | Staging dump indexing | `.github/workflows/staging_index_dump.yml` | Same as prod (line 63) | Same as prod |
| ror-records | Index dump script | `.github/workflows/index_dump.py` | Constructs URL with `dataenv` path segment (line 30) | Update URL construction to use new API parameter scheme |
| ror-api | Dump download | `getrordump.py` | Downloads dump from ror-data via GitHub Contents + Blob API | Download from ror-records release asset URL |
| ror-api | Settings | `settings.py` (lines 274-277) | Hardcoded ror-data / ror-data-test repo URLs in `ROR_DUMP` config | Point to ror-records releases or accept asset URL directly |
| ror-api | URL routing | `urls.py` | URL regex includes ror-data in filename pattern | Update pattern to match new filename convention (if changed) |
| curation_ops | Zenodo upload | `upload_dump_zenodo.py` | Assumes dump zip is in current directory (from ror-data checkout); error messages reference "ror-data" | Accept dump file path as argument or download from release asset URL; update error messages |
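For the curation_ops change, one possible shape is to make the dump path an explicit argument; this argparse sketch is an assumed interface, not the existing script:

```python
# Hypothetical CLI for upload_dump_zenodo.py that takes the dump path as an
# argument instead of assuming the zip sits in the working directory.
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Upload a ROR data dump to Zenodo")
    parser.add_argument("dump_path",
                        help="path to the dump zip, e.g. a downloaded release asset")
    return parser.parse_args(argv)

def validate_dump_path(path):
    # Fail early with a clear message rather than deep inside the upload step
    if not (path.endswith(".zip") and os.path.isfile(path)):
        raise FileNotFoundError(f"no dump zip found at {path}")
    return path
```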
Sub-issues
Sub-issues are tagged on this epic. They must follow this sequencing because of their dependencies:
- Sub-issues 1, 2, and 5 can be developed in parallel, but must be deployed together. The generation workflow needs to produce release assets, the API needs to consume them, and the curation_ops scripts need to be compatible with both.
- Sub-issue 3 depends on sub-issue 1 being complete and deployed, since the dump must exist as a release asset before the Zenodo workflow can download it.
- Sub-issue 4 depends on sub-issue 2 being complete and deployed, since the API must accept the new parameter scheme before the indexing workflows change how they call it.
- Sub-issue 6 comes last, implemented only after all workflows have been migrated and verified in production.
Testing Strategy
- Deploy all changes to the dev environment and run a full dump generation, indexing, and Zenodo sandbox upload cycle before moving into staging, then production.
- Temporarily keep the `ror-data` commit step in `generate_dump.yml` alongside the new release asset upload, so both locations have the dump during the transition. Remove the `ror-data` step once all consumers are migrated.
- If the ROR API endpoint signature changes, the old `data-env` parameter must be handled gracefully (e.g., return a clear error or redirect) during the transition period across environments.
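One way to handle the legacy parameter gracefully, sketched as a plain function (the names and response shape are assumptions, not the actual ror-api view code):

```python
# Transitional handling sketch: recognize legacy data-env values and answer
# with a clear error pointing at the new scheme, instead of failing obscurely.
LEGACY_DATA_ENVS = {"prod", "test"}

def resolve_dump_source(param):
    if param in LEGACY_DATA_ENVS:
        return {"status": 400,
                "error": ("the data-env parameter is no longer supported; "
                          "pass a ror-records release tag instead")}
    # Anything else is treated as a release tag under the new scheme
    return {"status": 200, "release_tag": param}
```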
Acceptance Criteria
- Data dump zips are attached as release assets on `ror-records` GitHub releases.
- `generate_dump.yml` no longer checks out or commits to `ror-data`.
- `publish_dump_zenodo.yml` no longer checks out `ror-data`.
- `prod_index_dump.yml` and `staging_index_dump.yml` no longer pass a `data-env` that resolves to `ror-data`.
- ror-api downloads dump files from `ror-records` release assets.
- `settings.py` no longer references `ror-data` / `ror-data-test` repository URLs.
- `upload_dump_zenodo.py` does not assume a `ror-data` checkout as working directory.
- Full end-to-end dump generation, indexing, and Zenodo publication succeeds using only `ror-records` release assets.
- The `ror-data` repository is archived with a redirect notice.