[FEATURE] Update ROR API data dump indexing to download from ror-records release artifacts #355

@adambuttrick

Description

Summary

The IndexDataDump endpoint in the ROR API currently downloads ROR data dump zip files from the https://github.com/ror-community/ror-data repository using the GitHub Contents and Blob APIs. It needs to be updated to download them from GitHub release assets on https://github.com/ror-community/ror-records instead.

Current behavior

When the IndexDataDump endpoint receives a request (GET /v2/indexdatadump/{filename}/{dataenv}), the following sequence of actions occurs:

  1. views.py:302-316 (IndexDataDump): Converts dataenv to a testdata boolean (prod -> False, test -> True), then calls the setup management command.

  2. setup.py:42-68: Calls get_ror_dump_sha() to verify the file exists, then orchestrates the pipeline: GetRorDumpCommand -> DeleteIndexCommand -> CreateIndexCommand -> IndexRorDumpCommand.

  3. setup.py:15-20 (get_ror_dump_sha): Uses ROR_DUMP['PROD_REPO_URL'] or ROR_DUMP['TEST_REPO_URL'] to call the GitHub Contents API, finds the file by name, and returns its SHA.

  4. getrordump.py:14-31 (get_ror_dump_sha): Calls GET https://api.github.com/repos/ror-community/ror-data/contents and finds the matching filename in the response.

  5. getrordump.py:33-56 (get_ror_dump_zip): Uses the SHA to call the GitHub Blob API (/git/blobs/{sha}), decodes the base64 response, and saves it as a local zip file.

  6. indexrordump.py:90-132: Extracts the zip, reads JSON files, and indexes them to Elasticsearch in bulk.
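The lookup and decode logic in steps 1 and 3 to 5 can be sketched as pure functions over already-parsed API responses. This is a minimal illustration, not the code in the repository: the helper names are hypothetical, and matching on `filename + ".zip"` is an assumption about how the dump is named in the repo listing.

```python
import base64


def dataenv_to_testdata(dataenv):
    """Step 1: map the dataenv URL segment to the testdata boolean."""
    return dataenv == "test"


def find_dump_sha(contents, filename):
    """Steps 3-4: scan a Contents API listing for the dump and return its SHA.

    `contents` is the parsed JSON list returned by GET {repo}/contents.
    Assumption: the dump appears in the listing as "<filename>.zip".
    """
    for entry in contents:
        if entry.get("name") == filename + ".zip":
            return entry.get("sha")
    return None


def decode_blob(blob):
    """Step 5: decode the base64 payload of a /git/blobs/{sha} response."""
    if blob.get("encoding") != "base64":
        raise ValueError("unexpected blob encoding: %r" % blob.get("encoding"))
    # b64decode ignores the newlines GitHub inserts into the content field.
    return base64.b64decode(blob["content"])
```

The bytes returned by `decode_blob` are what the current pipeline writes to a local zip file before indexing.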

Configuration (settings.py:274-277)

ROR_DUMP = {}
ROR_DUMP['PROD_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data'
ROR_DUMP['TEST_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data-test'
ROR_DUMP['GITHUB_TOKEN'] = os.environ.get('GITHUB_TOKEN')

URL pattern (urls.py:20)

url(r"^(?P<version>(v1|v2))\/indexdatadump\/(?P<filename>v(\d+\.)?(\d+\.)?(\*|\d+)-\d{4}-\d{2}-\d{2}-ror-data)\/(?P<dataenv>(test|prod))$", IndexDataDump.as_view())

Note: The regex hardcodes ror-data as part of the expected filename pattern.
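To illustrate what the pattern accepts, here is a small standalone check using Python's `re` module with the same regex. The sample paths are made up for illustration; Django matches `url()` patterns against the request path without its leading slash.

```python
import re

# Same pattern as urls.py:20, compiled outside Django for illustration.
INDEX_DUMP_PATTERN = re.compile(
    r"^(?P<version>(v1|v2))\/indexdatadump\/"
    r"(?P<filename>v(\d+\.)?(\d+\.)?(\*|\d+)-\d{4}-\d{2}-\d{2}-ror-data)\/"
    r"(?P<dataenv>(test|prod))$"
)


def parse_index_dump_path(path):
    """Return (version, filename, dataenv) or None if the path does not match."""
    match = INDEX_DUMP_PATTERN.match(path)
    if match is None:
        return None
    return match.group("version"), match.group("filename"), match.group("dataenv")


# A dump named after the current convention matches.
print(parse_index_dump_path("v2/indexdatadump/v2.2-2024-02-16-ror-data/prod"))
# A filename ending in anything other than "ror-data" is rejected.
print(parse_index_dump_path("v2/indexdatadump/v2.2-2024-02-16-ror-records/prod"))
```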

Proposed changes

Option A: Download from GitHub release assets (recommended)

Modify getrordump.py to download the dump zip from a GitHub release asset on ror-records instead of from the ror-data repo contents.

Changes needed:

  1. settings.py: Update ROR_DUMP config:

    ROR_DUMP['RELEASES_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-records'

    Keep TEST_REPO_URL if ror-data-test is still needed for testing, or create a test release on ror-records.

  2. getrordump.py: Replace the contents/blob download with release asset download:

    • Call GET /repos/ror-community/ror-records/releases (or /releases/tags/{tag})
    • Find the asset matching the filename
    • Download from asset['browser_download_url'] or use the asset API with Accept: application/octet-stream

  3. setup.py: Update get_ror_dump_sha() -- SHA verification no longer applies for release assets. Replace with a function that verifies the asset exists on the release.

  4. urls.py: The ror-data literal in the filename regex is just a naming convention for the dump files (e.g., v2.2-2024-02-16-ror-data), not a reference to the ror-data repository. The pattern can stay as-is because the dump files will keep the same naming convention.

  5. views.py: The dataenv parameter may need rethinking. Currently it determines which repo to download from. Options:

    • Keep dataenv but change its semantics (e.g., prod = latest release, test = pre-release/draft)
    • Replace with a release-tag parameter
    • Add an optional release-tag parameter alongside dataenv
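A rough sketch of the release asset lookup and download for Option A, using only the standard library. The function names, the `tag` parameter, and the `".zip"` suffix handling are assumptions for illustration; the existing code may use a different HTTP client and structure. The download uses `browser_download_url` without an Authorization header, which should be sufficient for a public repository; a private repository would need the asset API with `Accept: application/octet-stream`.

```python
import json
import shutil
import urllib.request

# Proposed setting from this issue (ROR_DUMP['RELEASES_REPO_URL']).
RELEASES_REPO_URL = "https://api.github.com/repos/ror-community/ror-records"


def github_headers(token=None):
    """Build request headers; the token is optional for public repos."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = "Bearer " + token
    return headers


def find_release_asset(release, filename):
    """Return the asset dict named "<filename>.zip" in a release, or None."""
    wanted = filename if filename.endswith(".zip") else filename + ".zip"
    for asset in release.get("assets", []):
        if asset.get("name") == wanted:
            return asset
    return None


def fetch_release_by_tag(tag, token=None):
    """GET /releases/tags/{tag}; `tag` is a hypothetical caller-supplied value."""
    url = "%s/releases/tags/%s" % (RELEASES_REPO_URL, tag)
    request = urllib.request.Request(url, headers=github_headers(token))
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def download_asset(asset, dest_path):
    """Stream the asset to dest_path via its public download URL."""
    with urllib.request.urlopen(asset["browser_download_url"], timeout=300) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest_path
```

`find_release_asset` returning `None` doubles as the existence check that would replace the SHA verification in setup.py.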

Option B: Download from S3

If the dump zip is also uploaded to S3 as part of the release workflow (#354), the API could download from S3 instead, using the existing boto3 infrastructure in indexror.py.
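A minimal sketch of what the S3 path might look like. The bucket name, the key layout (`<filename>.zip` at the bucket root), and the function name are all assumptions, since the release workflow in #354 defines where the file actually lands. The client is passed in so the function works with any object exposing boto3's `download_file(Bucket, Key, Filename)` method, for example one created with `boto3.client("s3")`.

```python
import os


def download_dump_from_s3(s3_client, bucket, filename, dest_dir="."):
    """Download "<filename>.zip" from the bucket and return the local path.

    Assumption: the dump is stored under the key "<filename>.zip".
    """
    key = filename + ".zip"
    dest_path = os.path.join(dest_dir, key)
    # boto3's S3 client signature is download_file(Bucket, Key, Filename).
    s3_client.download_file(bucket, key, dest_path)
    return dest_path
```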

Files to modify

| File | Change |
| --- | --- |
| rorapi/settings.py (lines 274-277) | Update ROR_DUMP URLs to point to ror-records |
| rorapi/management/commands/getrordump.py (entire file) | New download logic for release assets |
| rorapi/management/commands/setup.py (lines 15-20, 42-68) | Update verification and orchestration |
| rorapi/common/views.py (lines 302-316) | Possibly update parameter handling |
| rorapi/common/urls.py (line 20) | Possibly update URL pattern if dataenv changes |

Open questions

  1. Should we keep the dataenv (test/prod) parameter or replace it with release-tag?
  2. How should test/staging indexing work? Currently uses ror-data-test repo. Options: use draft releases, use a separate tag convention, or keep ror-data-test for now.
  3. Should the API also support downloading from S3 as a fallback?
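To make the first two questions concrete, here is one possible interpretation of "keep dataenv but change its semantics", sketched over the parsed response of GET /releases. It is only an illustration of one option, not a recommendation: the function name is hypothetical, it treats test as "pre-release" and prod as "regular release", and it skips drafts because listing them requires push access to the repository.

```python
def select_release(releases, filename, dataenv):
    """Pick the first release matching dataenv that carries the dump asset.

    `releases` is the parsed list from the Releases API, which returns
    newer releases first. Uses the real `draft`, `prerelease`, and
    `assets` fields of each release object.
    """
    want_prerelease = dataenv == "test"
    wanted_name = filename + ".zip"  # assumption: asset is "<filename>.zip"
    for release in releases:
        if release.get("draft"):
            continue
        if bool(release.get("prerelease")) != want_prerelease:
            continue
        if any(a.get("name") == wanted_name for a in release.get("assets", [])):
            return release
    return None
```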

Acceptance criteria

  • IndexDataDump endpoint downloads dump zip files from ror-records GitHub release assets instead of ror-data repo contents
  • The SHA-based verification in setup.py is replaced with release asset existence verification
  • getrordump.py uses the GitHub Releases API (or S3) to fetch the zip file rather than the Contents/Blob API
  • settings.py ROR_DUMP configuration points to ror-records repository
  • Existing indexing pipeline (DeleteIndex -> CreateIndex -> IndexRorDump) continues to work unchanged after the download step
  • dev and staging indexing paths are defined and functional
  • Error handling covers: release not found, asset not found, download failure, authentication failure
  • Existing tests are updated and passing
  • Manual verification: successfully index a dump from a ror-records release in a staging environment
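The error handling criterion could be covered by a small exception hierarchy, sketched below. All class and function names are hypothetical, and the status code mapping is an assumption about how GitHub responses would be interpreted; note that GitHub can also return 403 for rate limiting, so a real implementation may want to inspect the response body or headers before reporting an authentication failure.

```python
class DumpDownloadError(Exception):
    """Base class for failures in the download step."""


class AuthenticationFailed(DumpDownloadError):
    pass


class ReleaseNotFound(DumpDownloadError):
    pass


class AssetNotFound(DumpDownloadError):
    pass


class DownloadFailed(DumpDownloadError):
    pass


def check_release_status(status, tag):
    """Translate the HTTP status of a release lookup into a specific error."""
    if status in (401, 403):
        # 403 may also mean rate limiting; treated as auth failure here.
        raise AuthenticationFailed("GitHub rejected the request for %s" % tag)
    if status == 404:
        raise ReleaseNotFound("no release found for %s" % tag)
    if status >= 400:
        raise DownloadFailed("GitHub returned HTTP %d for %s" % (status, tag))


def require_asset(release, filename):
    """Return the asset named "<filename>.zip" or raise AssetNotFound."""
    for asset in release.get("assets", []):
        if asset.get("name") == filename + ".zip":
            return asset
    raise AssetNotFound("%s.zip is not attached to this release" % filename)
```

Raising distinct exception types lets the view return a different message for each failure mode while the rest of the pipeline stays unchanged.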

Metadata

Assignees: No one assigned
Labels: feature (Totally new functionality that does not exist in ROR currently)
Project status: Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests