
[EPIC] Publish data dump as GitHub release artifact instead of committing to ror-data repository #353

@adambuttrick

Description


Summary

We should consider changing how data dumps are stored: instead of committing zip files to the ror-data repository, attach them as release assets on ror-records. This would consolidate all release data into a single repository, eliminate cross-repo commits during the release process, and simplify the dependency graph between repositories.

Motivation

The current data dump publication process spans multiple repositories and involves cross-repo Git commits, which creates unnecessary coupling and complexity:

  • The generate_dump.yml workflow in ror-records checks out ror-data, copies the previous release zip from it, generates a new dump, and then commits the new zip back to ror-data.
  • The indexing workflows (prod_index_dump.yml / staging_index_dump.yml) pass a data-env parameter to the ror-api, which resolves it to either the ror-data or ror-data-test GitHub repository and downloads the dump via the GitHub Contents/Blob API.
  • The Zenodo publication workflow (publish_dump_zenodo.yml) checks out ror-data to locate the dump zip, then runs a script downloaded from curation_ops to upload it.

Since ror-records already uses GitHub releases (the main_release.yml workflow triggers on release: [published] events), attaching data dump zips as release assets is a natural fit. It removes the need for ror-data as a storage intermediary and keeps all release artifacts co-located with the source records they were generated from.

Current State

1. Dump generation (generate_dump.yml in ror-records)

The workflow accepts new-release and prev-release as inputs. It:

  1. Checks out ror-records and ror-community/ror-data.
  2. Copies the previous dump zip from the ror-data checkout: cp -R ./ror-data/${{prev-release}}.zip ./ror-records.
  3. Checks out ror-community/curation_ops and runs generate_dump.py with input/output paths pointing at the ror-records checkout.
  4. Copies the generated zip back to ror-data: cp -rf ./ror-records/${{new-release}}*.zip ./ror-data.
  5. Commits and pushes to ror-data with message "add new data dump file".
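For contrast with steps 4-5, here is a minimal sketch of the release-asset upload that the proposed change would substitute, assuming a token with write access to ror-records and an existing release for the tag (the token, tag, and filename are hypothetical placeholders, not values from the actual workflow):

import requests

# Hedged sketch: attach a generated dump zip to a ror-records release
# instead of committing it to ror-data. TOKEN, TAG, and FILENAME are placeholders.
TOKEN = "ghp_example"                       # token with write access to ror-records
TAG = "v1.63"                               # hypothetical release tag
FILENAME = "v1.63-2025-04-03-ror-data.zip"  # hypothetical dump filename

headers = {"Authorization": f"Bearer {TOKEN}"}

# Resolve the numeric release id for the tag.
release = requests.get(
    f"https://api.github.com/repos/ror-community/ror-records/releases/tags/{TAG}",
    headers=headers,
).json()

# Binary uploads go to the dedicated uploads.github.com host.
with open(FILENAME, "rb") as f:
    requests.post(
        f"https://uploads.github.com/repos/ror-community/ror-records/releases/{release['id']}/assets",
        params={"name": FILENAME},
        headers={**headers, "Content-Type": "application/zip"},
        data=f,
    ).raise_for_status()

In the workflow itself this step could likely be a single gh release upload call rather than raw API requests.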

2. Dump indexing (prod_index_dump.yml / staging_index_dump.yml in ror-records)

Both workflows accept release-dump, schema-version, and data-env inputs. They run index_dump.py, which constructs a URL by joining the API base URL, the filename, and the data environment: full_url = os.path.join(url, filename, dataenv). This calls the ror-api IndexDataDump endpoint at /v2/indexdatadump/{filename}/{dataenv}.
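A minimal sketch of that call (the base URL and filename are illustrative; only the os.path.join line is quoted from index_dump.py):

import os
import requests

# Hedged sketch of the current index_dump.py request (names are illustrative).
url = "https://api.ror.org/v2/indexdatadump"  # assumed ror-api base URL for the endpoint
filename = "v1.63-2025-04-03-ror-data"        # hypothetical release-dump input
dataenv = "prod"                              # selects ror-data vs ror-data-test on the API side

full_url = os.path.join(url, filename, dataenv)  # .../indexdatadump/<filename>/<dataenv>
requests.get(full_url).raise_for_status()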

On the API side (getrordump.py), the data-env parameter is resolved to a repository URL via settings.py:

ROR_DUMP['PROD_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data'
ROR_DUMP['TEST_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data-test'

The API then downloads the dump zip from the resolved repository using the GitHub Contents API and Blob API.
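For illustration, a minimal sketch of that workaround, assuming a public repository and a hypothetical filename (only the repository URLs above are from the actual config):

import base64
import requests

# Hedged sketch of the Contents + Blob API workaround (the repo URL is from the
# ROR_DUMP config above; the filename is hypothetical).
repo_url = "https://api.github.com/repos/ror-community/ror-data"
filename = "v1.63-2025-04-03-ror-data.zip"

# 1. List the repository root via the Contents API to find the file's blob SHA.
#    (The Contents API only inlines file content up to 1 MB, so dumps need step 2.)
contents = requests.get(f"{repo_url}/contents").json()
sha = next(item["sha"] for item in contents if item["name"] == filename)

# 2. Fetch the blob, which is returned base64-encoded and capped at 100 MB.
blob = requests.get(f"{repo_url}/git/blobs/{sha}").json()
data = base64.b64decode(blob["content"])

with open(filename, "wb") as f:
    f.write(data)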

3. Zenodo publication (publish_dump_zenodo.yml in ror-records)

The workflow:

  1. Checks out ror-community/ror-data (using actions/checkout@v2).
  2. Changes directory into the ror-data checkout.
  3. Downloads upload_dump_zenodo.py and its requirements.txt from the curation_ops repository (schema-v2-1 branch) via raw GitHub URLs.
  4. Runs the script, which expects the dump zip to be in the current working directory.

Proposed Change

Attach data dump zips as GitHub release assets on the ror-records repository instead of committing them to ror-data. All downstream consumers (indexing workflows, Zenodo publication, ror-api) should then be updated to download dump files from the GitHub Releases API on ror-records.

The GitHub Releases API provides stable, versioned download URLs in the form:

https://github.com/ror-community/ror-records/releases/download/{tag}/{filename}.zip

or via the API:

https://api.github.com/repos/ror-community/ror-records/releases/tags/{tag}

These changes would eliminate the GitHub Contents API + Blob API workaround in ror-api and lift the 100 MB size limit that applies to committed files.
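For illustration, a minimal sketch of how a consumer (the download path in ror-api, or the Zenodo workflow) could resolve a tag to its dump asset and download it; the tag and filename are hypothetical:

import requests

# Hedged sketch: download a dump zip attached as a release asset on ror-records.
TAG = "v1.63"                               # hypothetical release tag
FILENAME = "v1.63-2025-04-03-ror-data.zip"  # hypothetical asset name

release = requests.get(
    f"https://api.github.com/repos/ror-community/ror-records/releases/tags/{TAG}"
).json()

# Each asset carries a stable browser_download_url of the form shown above.
asset = next(a for a in release["assets"] if a["name"] == FILENAME)
data = requests.get(asset["browser_download_url"]).content

with open(FILENAME, "wb") as f:
    f.write(data)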

Affected Components

For each affected component, the current ror-data dependency and the required change:

  • ror-records, dump generation (.github/workflows/generate_dump.yml): checks out ror-data to read the previous dump (lines 27-31) and commits the new dump (lines 57-70). Change: download the previous dump from a release asset; upload the new dump as a release asset instead of committing.
  • ror-records, Zenodo publication (.github/workflows/publish_dump_zenodo.yml): checks out ror-data to find dump zips (lines 35-40). Change: download the dump zip from a release asset.
  • ror-records, prod dump indexing (.github/workflows/prod_index_dump.yml): passes data-env to the API, which resolves it to the ror-data repo (line 63). Change: remove or repurpose the data-env parameter; pass a release tag or asset URL instead.
  • ror-records, staging dump indexing (.github/workflows/staging_index_dump.yml): same as prod (line 63). Change: same as prod.
  • ror-records, index dump script (.github/workflows/index_dump.py): constructs the URL with a dataenv path segment (line 30). Change: update URL construction to use the new API parameter scheme.
  • ror-api, dump download (getrordump.py): downloads the dump from ror-data via the GitHub Contents + Blob API. Change: download from the ror-records release asset URL.
  • ror-api, settings (settings.py, lines 274-277): hardcoded ror-data / ror-data-test repo URLs in the ROR_DUMP config. Change: point to ror-records releases or accept an asset URL directly.
  • ror-api, URL routing (urls.py): the URL regex includes ror-data in the filename pattern. Change: update the pattern to match the new filename convention (if changed).
  • curation_ops, Zenodo upload (upload_dump_zenodo.py): assumes the dump zip is in the current directory (from a ror-data checkout), and error messages reference "ror-data". Change: accept the dump file path as an argument or download from a release asset URL; update error messages.
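For the curation_ops entry, a minimal sketch of the path-as-argument variant (the argument name and messages are assumptions, not the script's actual interface):

import argparse
import os
import sys

# Hedged sketch of a CLI change to upload_dump_zenodo.py: take the dump path
# as an argument instead of scanning the working directory of a ror-data checkout.
def main():
    parser = argparse.ArgumentParser(description="Upload a ROR data dump to Zenodo")
    parser.add_argument("dump_path", help="Path to the data dump zip file")
    args = parser.parse_args()
    if not os.path.isfile(args.dump_path):
        # Error message no longer references a ror-data checkout.
        sys.exit(f"Dump file not found: {args.dump_path}")
    # The existing Zenodo upload logic would then run against args.dump_path.

if __name__ == "__main__":
    main()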

Sub-issues

Sub-issues are tagged on this epic. They must follow this sequencing because of their dependencies:

  • Sub-issues 1, 2, and 5 can be developed in parallel, but must be deployed together. The generation workflow needs to produce release assets, the API needs to consume them, and the curation_ops scripts need to be compatible with both.
  • Sub-issue 3 depends on sub-issue 1 being complete and deployed, since the dump must exist as a release asset before the Zenodo workflow can download it.
  • Sub-issue 4 depends on sub-issue 2 being complete and deployed, since the API must accept the new parameter scheme before the indexing workflows change how they call it.
  • Sub-issue 6 comes last; it should be implemented only after all workflows have been migrated and verified in production.

Testing Strategy

  • Deploy all changes to the dev environment and run a full dump generation, indexing, and Zenodo sandbox upload cycle before moving into staging, then production.
  • Temporarily keep the ror-data commit step in generate_dump.yml alongside the new release asset upload, so both locations have the dump during the transition. Remove the ror-data step once all consumers are migrated.
  • If the ROR API endpoint signature changes, make sure the old data-env parameter is handled gracefully (e.g., returns a clear error or a redirect) during the transition period across environments; a minimal sketch follows this list.
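A minimal sketch of that transitional handling (the function name, signature, and filename-to-tag convention are illustrative, not the actual getrordump.py code):

# Hedged sketch of transitional handling for the legacy data-env parameter.
# Names and the filename-to-tag convention are assumptions for illustration.
def resolve_dump_source(filename, dataenv=None):
    if dataenv is not None:
        # Legacy callers still pass data-env; fail clearly rather than
        # silently resolving to the retired ror-data repositories.
        return 410, {"error": "data-env is deprecated; pass a ror-records release tag instead"}
    tag = filename.split("-")[0]  # e.g. "v1.63-2025-04-03-ror-data" -> "v1.63" (assumed convention)
    url = f"https://github.com/ror-community/ror-records/releases/download/{tag}/{filename}.zip"
    return 200, {"download_url": url}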

Acceptance Criteria

  • Data dump zips are attached as release assets on ror-records GitHub releases.
  • generate_dump.yml no longer checks out or commits to ror-data.
  • publish_dump_zenodo.yml no longer checks out ror-data.
  • prod_index_dump.yml and staging_index_dump.yml no longer pass data-env that resolves to ror-data.
  • ror-api downloads dump files from ror-records release assets.
  • settings.py no longer references ror-data / ror-data-test repository URLs.
  • upload_dump_zenodo.py does not assume a ror-data checkout as working directory.
  • Full end-to-end dump generation, indexing, and Zenodo publication succeeds using only ror-records release assets.
  • ror-data repository is archived with a redirect notice.
