
[EPIC] Publish data dump as GitHub release artifact instead of committing to ror-data repository #353

@adambuttrick

Description


Summary

We should consider changing how data dumps are stored: instead of committing zip files to the ror-data repository, attach them as release assets on ror-records. This would consolidate all release data into a single repository, eliminate cross-repo commits during the release process, and simplify the dependency graph between repositories.

Motivation

The current data dump publication process spans multiple repositories and involves cross-repo Git commits, which creates unnecessary coupling and complexity:

  • The generate_dump.yml workflow in ror-records checks out ror-data, copies the previous release zip from it, generates a new dump, and then commits the new zip back to ror-data.
  • The indexing workflows (prod_index_dump.yml / staging_index_dump.yml) pass a data-env parameter to the ror-api, which resolves it to either the ror-data or ror-data-test GitHub repository and downloads the dump via the GitHub Contents/Blob API.
  • The Zenodo publication workflow (publish_dump_zenodo.yml) checks out ror-data to locate the dump zip, then runs a script downloaded from curation_ops to upload it.

Since ror-records already uses GitHub releases (the main_release.yml workflow triggers on release: [published] events), attaching data dump zips as release assets is a natural fit. It removes the need for ror-data as a storage intermediary and keeps all release artifacts co-located with the source records they were generated from.

Current State

1. Dump generation (generate_dump.yml in ror-records)

The workflow accepts new-release and prev-release as inputs. It:

  1. Checks out ror-records and ror-community/ror-data.
  2. Copies the previous dump zip from the ror-data checkout: cp -R ./ror-data/${{prev-release}}.zip ./ror-records.
  3. Checks out ror-community/curation_ops and runs generate_dump.py with input/output paths pointing at the ror-records checkout.
  4. Copies the generated zip back to ror-data: cp -rf ./ror-records/${{new-release}}*.zip ./ror-data.
  5. Commits and pushes to ror-data with message "add new data dump file".
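For contrast with steps 4-5, here is a minimal sketch of the release-asset upload that the proposed change would substitute, assuming a token with write access to ror-records and an existing release for the tag (the token, tag, and filename are hypothetical placeholders, not values from the actual workflow):

import requests

# Hedged sketch: attach a generated dump zip to a ror-records release
# instead of committing it to ror-data. TOKEN, TAG, and FILENAME are placeholders.
TOKEN = "ghp_example"                       # token with write access to ror-records
TAG = "v1.63"                               # hypothetical release tag
FILENAME = "v1.63-2025-04-03-ror-data.zip"  # hypothetical dump filename

headers = {"Authorization": f"Bearer {TOKEN}"}

# Resolve the numeric release id for the tag.
release = requests.get(
    f"https://api.github.com/repos/ror-community/ror-records/releases/tags/{TAG}",
    headers=headers,
).json()

# Binary uploads go to the dedicated uploads.github.com host.
with open(FILENAME, "rb") as f:
    requests.post(
        f"https://uploads.github.com/repos/ror-community/ror-records/releases/{release['id']}/assets",
        params={"name": FILENAME},
        headers={**headers, "Content-Type": "application/zip"},
        data=f,
    ).raise_for_status()

In the workflow itself this step could likely be a single gh release upload call rather than raw API requests.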

2. Dump indexing (prod_index_dump.yml / staging_index_dump.yml in ror-records)

Both workflows accept release-dump, schema-version, and data-env inputs. They run index_dump.py, which constructs a URL by joining the API base URL, the filename, and the data environment: full_url = os.path.join(url, filename, dataenv). This calls the ror-api IndexDataDump endpoint at /v2/indexdatadump/{filename}/{dataenv}.
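A minimal sketch of that call (the base URL and filename are illustrative; only the os.path.join line is quoted from index_dump.py):

import os
import requests

# Hedged sketch of the current index_dump.py request (names are illustrative).
url = "https://api.ror.org/v2/indexdatadump"  # assumed ror-api base URL for the endpoint
filename = "v1.63-2025-04-03-ror-data"        # hypothetical release-dump input
dataenv = "prod"                              # selects ror-data vs ror-data-test on the API side

full_url = os.path.join(url, filename, dataenv)  # .../indexdatadump/<filename>/<dataenv>
requests.get(full_url).raise_for_status()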

On the API side (getrordump.py), the data-env parameter is resolved to a repository URL via settings.py:

ROR_DUMP['PROD_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data'
ROR_DUMP['TEST_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data-test'

The API then downloads the dump zip from the resolved repository using the GitHub Contents API and Blob API.
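For illustration, a minimal sketch of that workaround, assuming a public repository and a hypothetical filename (only the repository URLs above are from the actual config):

import base64
import requests

# Hedged sketch of the Contents + Blob API workaround (the repo URL is from the
# ROR_DUMP config above; the filename is hypothetical).
repo_url = "https://api.github.com/repos/ror-community/ror-data"
filename = "v1.63-2025-04-03-ror-data.zip"

# 1. List the repository root via the Contents API to find the file's blob SHA.
#    (The Contents API only inlines file content up to 1 MB, so dumps need step 2.)
contents = requests.get(f"{repo_url}/contents").json()
sha = next(item["sha"] for item in contents if item["name"] == filename)

# 2. Fetch the blob, which is returned base64-encoded and capped at 100 MB.
blob = requests.get(f"{repo_url}/git/blobs/{sha}").json()
data = base64.b64decode(blob["content"])

with open(filename, "wb") as f:
    f.write(data)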

3. Zenodo publication (publish_dump_zenodo.yml in ror-records)

The workflow:

  1. Checks out ror-community/ror-data (using actions/checkout@v2).
  2. Changes directory into the ror-data checkout.
  3. Downloads upload_dump_zenodo.py and its requirements.txt from the curation_ops repository (schema-v2-1 branch) via raw GitHub URLs.
  4. Runs the script, which expects the dump zip to be in the current working directory.

Proposed Change

Attach data dump zips as GitHub release assets on the ror-records repository instead of committing them to ror-data. All downstream consumers (indexing workflows, Zenodo publication, ror-api) should then be updated to download dump files from the GitHub Releases API on ror-records.

The GitHub Releases API provides stable, versioned download URLs in the form:

https://github.com/ror-community/ror-records/releases/download/{tag}/{filename}.zip

or via the API:

https://api.github.com/repos/ror-community/ror-records/releases/tags/{tag}

These changes would eliminate the GitHub Contents API + Blob API workaround in ror-api and lift the 100 MB size limit that applies to committed files.
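For illustration, a minimal sketch of how a consumer (the download path in ror-api, or the Zenodo workflow) could resolve a tag to its dump asset and download it; the tag and filename are hypothetical:

import requests

# Hedged sketch: download a dump zip attached as a release asset on ror-records.
TAG = "v1.63"                               # hypothetical release tag
FILENAME = "v1.63-2025-04-03-ror-data.zip"  # hypothetical asset name

release = requests.get(
    f"https://api.github.com/repos/ror-community/ror-records/releases/tags/{TAG}"
).json()

# Each asset carries a stable browser_download_url of the form shown above.
asset = next(a for a in release["assets"] if a["name"] == FILENAME)
data = requests.get(asset["browser_download_url"]).content

with open(FILENAME, "wb") as f:
    f.write(data)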

Affected Components

For each affected component, the current ror-data dependency and the required change:

  • ror-records, dump generation (.github/workflows/generate_dump.yml): checks out ror-data to read the previous dump (lines 27-31) and commits the new dump (lines 57-70). Change: download the previous dump from a release asset; upload the new dump as a release asset instead of committing.
  • ror-records, Zenodo publication (.github/workflows/publish_dump_zenodo.yml): checks out ror-data to find dump zips (lines 35-40). Change: download the dump zip from a release asset.
  • ror-records, prod dump indexing (.github/workflows/prod_index_dump.yml): passes data-env to the API, which resolves it to the ror-data repo (line 63). Change: remove or repurpose the data-env parameter; pass a release tag or asset URL instead.
  • ror-records, staging dump indexing (.github/workflows/staging_index_dump.yml): same as prod (line 63). Change: same as prod.
  • ror-records, index dump script (.github/workflows/index_dump.py): constructs the URL with a dataenv path segment (line 30). Change: update URL construction to use the new API parameter scheme.
  • ror-api, dump download (getrordump.py): downloads the dump from ror-data via the GitHub Contents + Blob API. Change: download from the ror-records release asset URL.
  • ror-api, settings (settings.py, lines 274-277): hardcoded ror-data / ror-data-test repo URLs in the ROR_DUMP config. Change: point to ror-records releases or accept an asset URL directly.
  • ror-api, URL routing (urls.py): the URL regex includes ror-data in the filename pattern. Change: update the pattern to match the new filename convention (if changed).
  • curation_ops, Zenodo upload (upload_dump_zenodo.py): assumes the dump zip is in the current directory (from a ror-data checkout), and error messages reference "ror-data". Change: accept the dump file path as an argument or download from a release asset URL; update error messages.
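For the curation_ops entry, a minimal sketch of the path-as-argument variant (the argument name and messages are assumptions, not the script's actual interface):

import argparse
import os
import sys

# Hedged sketch of a CLI change to upload_dump_zenodo.py: take the dump path
# as an argument instead of scanning the working directory of a ror-data checkout.
def main():
    parser = argparse.ArgumentParser(description="Upload a ROR data dump to Zenodo")
    parser.add_argument("dump_path", help="Path to the data dump zip file")
    args = parser.parse_args()
    if not os.path.isfile(args.dump_path):
        # Error message no longer references a ror-data checkout.
        sys.exit(f"Dump file not found: {args.dump_path}")
    # The existing Zenodo upload logic would then run against args.dump_path.

if __name__ == "__main__":
    main()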

Sub-issues

Sub-issues are tagged on this epic. They must follow this sequencing because of their dependencies:

  • Sub-issues 1, 2, and 5 can be developed in parallel, but must be deployed together. The generation workflow needs to produce release assets, the API needs to consume them, and the curation_ops scripts need to be compatible with both.
  • Sub-issue 3 depends on sub-issue 1 being complete and deployed, since the dump must exist as a release asset before the Zenodo workflow can download it.
  • Sub-issue 4 depends on sub-issue 2 being complete and deployed, since the API must accept the new parameter scheme before the indexing workflows change how they call it.
  • Sub-issue 6 comes last; it should be implemented only after all workflows have been migrated and verified in production.

Testing Strategy

  • Deploy all changes to the dev environment and run a full dump generation, indexing, and Zenodo sandbox upload cycle before moving into staging, then production.
  • Temporarily keep the ror-data commit step in generate_dump.yml alongside the new release asset upload, so both locations have the dump during the transition. Remove the ror-data step once all consumers are migrated.
  • If the ROR API endpoint signature changes, make sure the old data-env parameter is handled gracefully (e.g., returns a clear error or a redirect) during the transition period across environments; a minimal sketch follows this list.
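A minimal sketch of that transitional handling (the function name, signature, and filename-to-tag convention are illustrative, not the actual getrordump.py code):

# Hedged sketch of transitional handling for the legacy data-env parameter.
# Names and the filename-to-tag convention are assumptions for illustration.
def resolve_dump_source(filename, dataenv=None):
    if dataenv is not None:
        # Legacy callers still pass data-env; fail clearly rather than
        # silently resolving to the retired ror-data repositories.
        return 410, {"error": "data-env is deprecated; pass a ror-records release tag instead"}
    tag = filename.split("-")[0]  # e.g. "v1.63-2025-04-03-ror-data" -> "v1.63" (assumed convention)
    url = f"https://github.com/ror-community/ror-records/releases/download/{tag}/{filename}.zip"
    return 200, {"download_url": url}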

Acceptance Criteria

  • Data dump zips are attached as release assets on ror-records GitHub releases.
  • generate_dump.yml no longer checks out or commits to ror-data.
  • publish_dump_zenodo.yml no longer checks out ror-data.
  • prod_index_dump.yml and staging_index_dump.yml no longer pass data-env that resolves to ror-data.
  • ror-api downloads dump files from ror-records release assets.
  • settings.py no longer references ror-data / ror-data-test repository URLs.
  • upload_dump_zenodo.py does not assume a ror-data checkout as working directory.
  • Full end-to-end dump generation, indexing, and Zenodo publication succeeds using only ror-records release assets.
  • ror-data repository is archived with a redirect notice.
