[FEATURE] Update curation_ops scripts for release artifact workflow #358

@adambuttrick

Context

Several scripts in the https://github.com/ror-community/curation_ops repository are used by https://github.com/ror-community/ror-records workflows to generate and publish data dumps. While most of these scripts are path-agnostic and require minimal changes, some contain hard-coded references and conventions that should be updated to complete #353.

Scripts

generate_dump.py

Current behavior:

  • Takes -r (release dir), -e (previous dump name), -i (input path), -o (output path)
  • Reads previous dump from {output_path}/{prev-release}.zip
  • Generates new dump at {input_path}/{release}-{YYYY-MM-DD}-ror-data.json/.csv
  • Creates zip at {output_path}/{release}-{YYYY-MM-DD}-ror-data.zip
  • Has no hardcoded ror-data repo references; the workflow handles all repo interactions
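
For reference, the naming convention described above can be sketched roughly as follows. This is an illustration only, not the code in generate_dump.py: the `dump_paths` helper, the `DUMP_SUFFIX` name, and the `today` parameter are all hypothetical and exist only to show how the four paths relate to the script's arguments.

```python
from datetime import date
from pathlib import Path

# Assumed name for the "-ror-data" suffix constant; the real constant may differ.
DUMP_SUFFIX = "-ror-data"


def dump_paths(release, input_path, output_path, prev_release, today=None):
    """Build the paths implied by the naming convention (hypothetical helper)."""
    today = today or date.today()
    # date.isoformat() yields YYYY-MM-DD
    stem = f"{release}-{today.isoformat()}{DUMP_SUFFIX}"
    return {
        # previous dump is read from the output path
        "previous_zip": Path(output_path) / f"{prev_release}.zip",
        # new JSON/CSV dumps are written to the input path
        "json": Path(input_path) / f"{stem}.json",
        "csv": Path(input_path) / f"{stem}.csv",
        # the new zip is written to the output path
        "zip": Path(output_path) / f"{stem}.zip",
    }
```

Because every path derives from the arguments, the workflow only needs to place the downloaded release artifact where the output path points.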

Changes needed:

  • Minimal or none. The script is I/O path agnostic and reads from paths passed as arguments. The workflow changes (sub-issue 01) handle downloading from release artifacts and uploading back.
  • Optional: update the -ror-data suffix constant (line 18) if the naming convention changes. This is likely best kept as-is since "ror-data" is a known brand/convention for the dump files.

upload_dump_zenodo.py

Current behavior:

  • DUMP_FILE_DIR = "./" (line 15) scans the current directory for the dump zip
  • get_dump_file() (lines 19-24) matches the filename prefix against the release name
  • check_release_data() (line 221) checks for "ror-data.zip" in the filename
  • Error message (line 228): "Dump file not found in ror-data"
  • GITHUB_API_URL (line 12) points to ror-community/ror-updates for release notes (this is a different repo from ror-data, used for release notes metadata)
  • format_description() (lines 50-108) contains hardcoded HTML with links to ror-schema and ror-updates repos

Changes needed:

  • Update error message on line 228 from "ror-data" to something generic (e.g., "Dump file not found in working directory")
  • Consider parameterizing DUMP_FILE_DIR as a CLI argument instead of hardcoding "./", so the workflow can explicitly pass the download directory
  • No other changes are strictly required. [FEATURE] Broken link checker for ROR records #337 handles downloading the artifact to the correct directory before running the script
  • requirements.txt currently pins requests to requests==2.27.1. We should update to a more recent version
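
A minimal sketch of what the DUMP_FILE_DIR parameterization and the generic error message could look like is below. It makes several assumptions: the `-r`/`--release` and `-d`/`--dump-dir` flags, `parse_args`, and `find_dump_file` are hypothetical names chosen for illustration, and the matching logic is inferred from the behavior described above rather than copied from upload_dump_zenodo.py.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse CLI arguments; both flags are hypothetical, not the script's current CLI."""
    parser = argparse.ArgumentParser(description="Upload a ROR data dump to Zenodo")
    parser.add_argument("-r", "--release", required=True,
                        help="release name used as the dump filename prefix")
    parser.add_argument("-d", "--dump-dir", default="./",
                        help="directory containing the dump zip (defaults to ./)")
    return parser.parse_args(argv)


def find_dump_file(release, dump_dir):
    """Return the first zip in dump_dir whose name starts with the release name."""
    for path in sorted(Path(dump_dir).iterdir()):
        if path.name.startswith(release) and path.name.endswith("ror-data.zip"):
            return path
    # Generic message that no longer refers to "ror-data" as a directory
    raise FileNotFoundError(f"Dump file not found in working directory: {dump_dir}")
```

Keeping "./" as the default preserves the current behavior, while the workflow can pass the artifact download directory explicitly.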

Files to review/modify

  • generate_dump.py
  • upload_dump_zenodo.py
  • requirements.txt

Acceptance criteria

  • generate_dump.py continues to work correctly with the modified workflow from sub-issue 01
  • upload_dump_zenodo.py error messages are accurate and no longer reference "ror-data" as a directory
  • Optional: DUMP_FILE_DIR is parameterizable via CLI argument with "./" as the default
  • requirements.txt dependencies are reviewed and updated where appropriate
