[FEATURE] Modify generate_dump.yml to use release artifacts instead of ror-data #354

@adambuttrick

Summary

Refactor the generate_dump.yml workflow in https://github.com/ror-community/ror-records to stop depending on https://github.com/ror-community/ror-data for reading/writing data dump zip files. Instead, the workflow should download the previous dump from a GitHub release on https://github.com/ror-community/ror-records and upload the newly generated dump as a release artifact there.

Current behavior

The workflow (.github/workflows/generate_dump.yml) currently works as follows:

  1. Check out ror-community/ror-data (lines 26-31) using a repo secret
  2. Copy the previous dump zip from ./ror-data/{prev-release}.zip to ./ror-records (line 35)
  3. Check out ror-community/curation_ops (lines 36-42) to access generate_dump.py
  4. Run generate_dump.py with args -r {new-release} -e {prev-release} -i '../../ror-records' -o '../../ror-records' (lines 47-51)
  5. Copy the generated zip from ./ror-records/{new-release}*.zip back to ./ror-data (line 60)
  6. Commit and push the zip to ror-data (lines 61-70)
  7. Send a Slack notification referencing the commit step outcome (lines 71-79)

Workflow inputs

| Input | Description | Example value |
| --- | --- | --- |
| new-release | Name of the directory containing the new release | v2.2 |
| prev-release | Name of the existing release zip file (without .zip) | v2.1-2024-01-15-ror-data |

How generate_dump.py works

  • Reads individual JSON records from {input_path}/{release_dir}/
  • Reads the previous dump from {output_path}/{prev-release}.zip
  • Generates a new dump at {input_path}/{release}-{YYYY-MM-DD}-ror-data.json and .csv
  • Creates a zip at {output_path}/{release}-{YYYY-MM-DD}-ror-data.zip
  • The date portion of the filename is auto-generated from execution time
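
The naming scheme above can be sketched in shell. This is only an illustration of the pattern described in this issue, not code taken from generate_dump.py: the helper name dump_stem is hypothetical, and using the runner's local date is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: builds the dump file stem described in this issue.
# $1 = release directory name (e.g. v2.2), $2 = date as YYYY-MM-DD.
dump_stem() {
  printf '%s-%s-ror-data\n' "$1" "$2"
}

release="v2.2"
# Assumption: the date is the runner's local date at execution time.
today="$(date +%Y-%m-%d)"
stem="$(dump_stem "$release" "$today")"

# The three artifacts the issue says generate_dump.py produces.
echo "${stem}.json"
echo "${stem}.csv"
echo "${stem}.zip"
```

Because the date is generated at run time, any later step that needs the zip has to locate it by pattern rather than by a fixed name.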

Proposed changes

1. Remove ror-data checkout step

Delete the step that checks out ror-community/ror-data (lines 26-31). This repository should no longer be involved in the dump workflow.

2. Fetch previous dump from a GitHub release

Replace the "copy previous data dump file" step (lines 32-35) with a step that downloads the dump zip from the previous ror-records release using the GitHub CLI:

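- name: Download previous dump from release
  run: |
    # --repo is set explicitly because this step runs outside the ror-records checkout
    gh release download "${{ github.event.inputs.prev-release-tag }}" \
      --repo "${{ github.repository }}" \
      --pattern '*ror-data*.zip' \
      --dir ./prev-dump
    cp ./prev-dump/*.zip ./ror-records/
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}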

This requires knowing the release tag for the previous release (see input changes below).
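
Since generate_dump.py looks for {output_path}/{prev-release}.zip, the downloaded asset name has to match the prev-release input exactly. A minimal guard, assuming a hypothetical helper named require_prev_dump that would run after the download and before generate_dump.py, might look like this:

```shell
#!/bin/sh
# Hypothetical check: confirm the downloaded asset has the name that
# generate_dump.py will look for ({prev-release}.zip).
# $1 = directory holding the dump, $2 = prev-release stem (without .zip).
require_prev_dump() {
  if [ -f "$1/$2.zip" ]; then
    echo "found $1/$2.zip"
  else
    echo "missing $1/$2.zip" >&2
    return 1
  fi
}

# Demo against a temporary directory standing in for ./ror-records.
demo_dir="$(mktemp -d)"
: > "$demo_dir/v2.1-2024-01-15-ror-data.zip"
require_prev_dump "$demo_dir" "v2.1-2024-01-15-ror-data"
rm -rf "$demo_dir"
```

Failing early here gives a clearer error than letting generate_dump.py fail on a missing file.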

3. Upload new dump as a release artifact

Replace the copy-to-ror-data step (lines 57-60) and the git commit/push step (lines 61-70) with a step that uploads the generated zip to a ror-records release:

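- name: Upload dump to release
  run: |
    # --repo is set explicitly because this step runs outside the ror-records checkout
    gh release upload "${{ github.event.inputs.new-release }}" \
      ./ror-records/${{ github.event.inputs.new-release }}*-ror-data.zip \
      --repo "${{ github.repository }}" \
      --clobber
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}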
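
The upload step relies on a glob, and gh release upload accepts several files, so a stale zip left in ./ror-records would be uploaded alongside the new one. One way to guard against that, sketched here with a hypothetical helper named find_new_dump, is to require exactly one match before uploading:

```shell
#!/bin/sh
# Hypothetical helper: resolve the single dump zip for a release.
# $1 = directory to search, $2 = new-release value (e.g. v2.2).
find_new_dump() {
  _dir="$1"
  _rel="$2"
  # Reuse the positional parameters to hold the glob matches.
  set -- "$_dir"/"$_rel"*-ror-data.zip
  # With no match the pattern stays unexpanded, so also test for a real file.
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one dump zip for $_rel in $_dir" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Demo against a temporary directory standing in for ./ror-records.
demo_dir="$(mktemp -d)"
: > "$demo_dir/v2.2-2024-02-01-ror-data.zip"
find_new_dump "$demo_dir" "v2.2"
rm -rf "$demo_dir"
```

The resolved path could then be passed to gh release upload in place of the raw glob.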

4. Update workflow inputs

The inputs need to change to support release-based operations:

| Input | Description | Example value |
| --- | --- | --- |
| new-release | Release tag to upload the new dump to (must already exist) | v2.2 |
| prev-release | Zip filename stem (existing behavior, used by generate_dump.py) | v2.1-2024-01-15-ror-data |
| prev-release-tag | Release tag to download the previous dump from | v2.1 |

Alternatively, prev-release-tag could be derived from prev-release by extracting the version prefix (everything before the first date segment). This should be evaluated during implementation.
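
As a rough sketch of that derivation, the version prefix can be extracted with plain shell parameter expansion. The helper name derive_prev_tag is hypothetical, and the sketch assumes the date segment always has the form -YYYY-MM-DD and that release tags match the version prefix (as in the v2.1 example above).

```shell
#!/bin/sh
# Hypothetical helper: derive the release tag from the prev-release stem.
# $1 = prev-release stem, e.g. v2.1-2024-01-15-ror-data
derive_prev_tag() {
  # %% removes the longest suffix matching "-<four digits>-<anything>",
  # i.e. everything from the first date segment onward.
  printf '%s\n' "${1%%-[0-9][0-9][0-9][0-9]-*}"
}

derive_prev_tag "v2.1-2024-01-15-ror-data"
```

If the input contains no date segment, the value is returned unchanged, so a separate validation step would still be needed to reject malformed input.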

5. Update Slack notification

The notification currently references steps.commitdumpfile.outcome (line 79). This must be updated to reference the upload step outcome instead.

6. Evaluate token requirements

The current workflow uses a repo secret for all checkout steps and the push to ror-data. With the move to release artifacts on the same repository:

  • The ror-records checkout can likely use the default GITHUB_TOKEN, which GitHub generates for each workflow job and scopes to the repository running the workflow (it is not tied to the user who triggers the action)
  • The curation_ops checkout may still need a repo secret, since it's a separate repository
  • The gh release download and gh release upload commands should work with GITHUB_TOKEN for the same repository

Open questions

  1. Should the target release (new-release tag) exist as a draft before this workflow runs, or should this workflow create it? If the release is created by main_release.yml, confirm ordering.
  2. Verify that the default GITHUB_TOKEN can run gh release download and gh release upload against the same repository, or whether additional permissions are needed (uploading release assets requires write access to repository contents, i.e. contents: write)

Acceptance criteria

  • The ror-data checkout step is removed from generate_dump.yml
  • The previous dump zip is downloaded from a ror-records GitHub release using gh release download
  • The newly generated dump zip is uploaded to a ror-records GitHub release using gh release upload
  • The git commit/push to ror-data steps are removed
  • generate_dump.py continues to function correctly with the same arguments and produces the same output artifacts
  • Workflow inputs are updated to support release tag references
  • The Slack notification references the correct step outcome
  • Token usage is minimized -- GITHUB_TOKEN is used wherever PERSONAL_ACCESS_TOKEN is not required
  • The workflow is tested end-to-end with a real release cycle
