Summary
Refactor the generate_dump.yml workflow in https://github.com/ror-community/ror-records to stop depending on https://github.com/ror-community/ror-data for reading/writing data dump zip files. Instead, the workflow should download the previous dump from a GitHub release on https://github.com/ror-community/ror-records and upload the newly generated dump as a release artifact there.
Current behavior
The workflow (.github/workflows/generate_dump.yml) currently works as follows:
- Checks out `ror-community/ror-data` (lines 26-31) using a repo secret
- Copies the previous dump zip from `./ror-data/{prev-release}.zip` to `./ror-records` (line 35)
- Checks out `ror-community/curation_ops` (lines 36-42) to access `generate_dump.py`
- Runs `generate_dump.py` with args `-r {new-release} -e {prev-release} -i '../../ror-records' -o '../../ror-records'` (lines 47-51)
- Copies the generated zip from `./ror-records/{new-release}*.zip` back to `./ror-data` (line 60)
- Commits and pushes the zip to `ror-data` (lines 61-70)
- Sends a Slack notification referencing the commit step outcome (lines 71-79)
Workflow inputs
| Input | Description | Example value |
|---|---|---|
| `new-release` | Name of the directory containing the new release | `v2.2` |
| `prev-release` | Name of the existing release zip file (without `.zip`) | `v2.1-2024-01-15-ror-data` |
How generate_dump.py works
- Reads individual JSON records from `{input_path}/{release_dir}/`
- Reads the previous dump from `{output_path}/{prev-release}.zip`
- Generates a new dump at `{input_path}/{release}-{YYYY-MM-DD}-ror-data.json` and `.csv`
- Creates a zip at `{output_path}/{release}-{YYYY-MM-DD}-ror-data.zip`
- The date portion of the filename is auto-generated from execution time
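Because the date is generated at run time, later steps cannot know the exact zip filename in advance. The following is a minimal shell sketch of one way to locate it, assuming the `{release}-{YYYY-MM-DD}-ror-data.zip` naming described above. `find_dump_zip` is a hypothetical helper, not part of the existing workflow. Matching the date digit by digit keeps a lookup for `v2.2` from also picking up a `v2.20` file, which a loose `{release}*` glob would do.

```shell
# Hypothetical helper (not in the current workflow): print the path of the
# single dump zip for a release, or fail if zero or several files match.
find_dump_zip() {
  local dir="$1"
  local release="$2"
  local count=0
  local match=""
  local f
  # The date part is matched digit by digit: {release}-YYYY-MM-DD-ror-data.zip
  for f in "$dir/$release"-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-ror-data.zip; do
    if [ -e "$f" ]; then
      count=$((count + 1))
      match="$f"
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one dump zip for $release in $dir, found $count" >&2
    return 1
  fi
  printf '%s\n' "$match"
}
```

A workflow step could call it as `zip_path="$(find_dump_zip ./ror-records v2.2)"` and stop if the call fails.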
Proposed changes
1. Remove ror-data checkout step
Delete the step that checks out ror-community/ror-data (lines 26-31). This repository should no longer be involved in the dump workflow.
2. Fetch previous dump from a GitHub release
Replace the copy previous data dump file step (lines 32-35) with a step that downloads the dump zip from the previous ror-records release using the GitHub CLI:
```yaml
- name: Download previous dump from release
  run: |
    gh release download ${{ github.event.inputs.prev-release-tag }} \
      --pattern '*ror-data*.zip' \
      --dir ./prev-dump
    cp ./prev-dump/*.zip ./ror-records/
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

This requires knowing the release tag for the previous release (see input changes below).
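The `*ror-data*.zip` pattern copies every matching asset, while `generate_dump.py` reads exactly `{prev-release}.zip`. Below is a hedged sketch of a stricter download, assuming the release asset is named `{prev-release}.zip` and that `gh` is already authenticated (for example through `GH_TOKEN`). `download_prev_dump` is a hypothetical helper name.

```shell
# Hypothetical helper: download only {prev-release}.zip from a release and
# confirm it landed where generate_dump.py will look for it.
# Assumption: the release asset name equals "<prev-release>.zip".
download_prev_dump() {
  local tag="$1"
  local prev_release="$2"
  local dest="$3"
  # --pattern and --dir are existing `gh release download` flags.
  gh release download "$tag" --pattern "$prev_release.zip" --dir "$dest" || return 1
  if [ ! -f "$dest/$prev_release.zip" ]; then
    echo "previous dump $prev_release.zip not found in release $tag" >&2
    return 1
  fi
}
```

For example, `download_prev_dump v2.1 v2.1-2024-01-15-ror-data ./ror-records` would place the zip directly in the directory passed to `generate_dump.py` as its output path.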
3. Upload new dump as a release artifact
Replace the copy-to-ror-data step (lines 57-60) and the git commit/push step (lines 61-70) with a step that uploads the generated zip to a ror-records release:
```yaml
- name: Upload dump to release
  run: |
    gh release upload ${{ github.event.inputs.new-release }} \
      ./ror-records/${{ github.event.inputs.new-release }}*-ror-data.zip \
      --clobber
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

4. Update workflow inputs
The inputs need to change to support release-based operations:
| Input | Description | Example value |
|---|---|---|
| `new-release` | Release tag to upload the new dump to (must already exist) | `v2.2` |
| `prev-release` | Zip filename stem (existing behavior, used by `generate_dump.py`) | `v2.1-2024-01-15-ror-data` |
| `prev-release-tag` | Release tag to download the previous dump from | `v2.1` |
Alternatively, prev-release-tag could be derived from prev-release by extracting the version prefix (everything before the first date segment). This should be evaluated during implementation.
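As a rough illustration of that derivation, the sketch below strips the trailing `-YYYY-MM-DD-ror-data` suffix from the stem. It assumes release tags are exactly the version prefix (as in the `v2.1` example) and that stems always follow the naming shown above. `derive_release_tag` is a hypothetical helper name.

```shell
# Hypothetical helper: derive the release tag from a dump filename stem by
# stripping the trailing "-YYYY-MM-DD-ror-data" suffix.
# Assumption: release tags equal the version prefix (for example "v2.1").
derive_release_tag() {
  local stem="$1"
  local tag
  # sed prints the captured prefix only when the whole stem matches.
  tag="$(printf '%s\n' "$stem" |
    sed -nE 's/^(.+)-[0-9]{4}-[0-9]{2}-[0-9]{2}-ror-data$/\1/p')"
  if [ -z "$tag" ]; then
    echo "cannot derive a release tag from '$stem'" >&2
    return 1
  fi
  printf '%s\n' "$tag"
}
```

With this approach the workflow would keep its two existing inputs, at the cost of failing whenever a stem does not follow the expected pattern.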
5. Update Slack notification
The notification currently references steps.commitdumpfile.outcome (line 79). This must be updated to reference the upload step outcome instead.
6. Evaluate token requirements
The current workflow uses a repo secret for all checkout steps and for the push to ror-data. With the move to release artifacts on the same repository:
- The `ror-records` checkout can likely use the default `GITHUB_TOKEN` that GitHub Actions provides for the workflow run
- The `curation_ops` checkout may still need a repo secret, since it's a separate repository
- The `gh release download` and `gh release upload` commands should work with `GITHUB_TOKEN` for the same repository
Open questions
- Should the target release (`new-release` tag) exist as a draft before this workflow runs, or should this workflow create it? If the release is created by `main_release.yml`, confirm ordering.
- Verify that the default `GITHUB_TOKEN` can handle `gh release download` and `gh release upload` for the same repository, or whether additional permissions are needed
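To make the first question concrete, here is a small sketch covering both options: fail early when the release is missing, or create it as a draft. It is not a decision, only an illustration. `ensure_release` and its `create` mode are hypothetical names, and `gh` is assumed to be authenticated with enough permission to see and create releases.

```shell
# Hypothetical guard: check that a release exists before uploading to it.
# Mode "require" (default) fails when it is missing; mode "create" makes a draft.
ensure_release() {
  local tag="$1"
  local mode="${2:-require}"
  # `gh release view` exits non-zero when the release cannot be found.
  if gh release view "$tag" >/dev/null 2>&1; then
    return 0
  fi
  if [ "$mode" = "create" ]; then
    # A draft keeps the release unpublished until the dump is attached.
    gh release create "$tag" --draft --title "$tag" --notes ""
  else
    echo "release $tag not found; create it before running the dump workflow" >&2
    return 1
  fi
}
```

Calling `ensure_release "$NEW_RELEASE"` ahead of the upload step would surface an ordering problem with `main_release.yml` before any dump is generated.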
Acceptance criteria
- The `ror-data` checkout step is removed from `generate_dump.yml`
- The previous dump zip is downloaded from a `ror-records` GitHub release using `gh release download`
- The newly generated dump zip is uploaded to a `ror-records` GitHub release using `gh release upload`
- The git commit/push to `ror-data` steps are removed
- `generate_dump.py` continues to function correctly with the same arguments and produces the same output artifacts
- Workflow inputs are updated to support release tag references
- The Slack notification references the correct step outcome
- Token usage is minimized: `GITHUB_TOKEN` is used wherever `PERSONAL_ACCESS_TOKEN` is not required
- The workflow is tested end-to-end with a real release cycle