Description
Summary
The IndexDataDump endpoint in the ROR API currently downloads ROR data dump zip files from the https://github.com/ror-community/ror-data repository using the GitHub Contents and Blob APIs. This needs to be updated to download from GitHub release assets on https://github.com/ror-community/ror-records instead.
Current behavior
When the IndexDataDump endpoint receives a request (GET /v2/indexdatadump/{filename}/{dataenv}), the following sequence of actions occurs:
- `views.py:302-316` (`IndexDataDump`): Converts `dataenv` to a `testdata` boolean (`prod` -> `False`, `test` -> `True`), then calls the `setup` management command.
- `setup.py:42-68`: Calls `get_ror_dump_sha()` to verify the file exists, then orchestrates the pipeline: `GetRorDumpCommand` -> `DeleteIndexCommand` -> `CreateIndexCommand` -> `IndexRorDumpCommand`.
- `setup.py:15-20` (`get_ror_dump_sha`): Uses `ROR_DUMP['PROD_REPO_URL']` or `ROR_DUMP['TEST_REPO_URL']` to call the GitHub Contents API, finds the file by name, and returns its SHA.
- `getrordump.py:14-31` (`get_ror_dump_sha`): Calls `GET https://api.github.com/repos/ror-community/ror-data/contents` and finds the matching filename in the response.
- `getrordump.py:33-56` (`get_ror_dump_zip`): Uses the SHA to call the GitHub Blob API (`/git/blobs/{sha}`), decodes the base64 response, and saves it as a local zip file.
- `indexrordump.py:90-132`: Extracts the zip, reads JSON files, and indexes them to Elasticsearch in bulk.
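For orientation, the two lookup steps described above (find the SHA in a Contents API listing, then decode a Blob API response) can be sketched as pure functions. This is a rough illustration and not the code in `getrordump.py`: the function names `find_blob_sha` and `decode_blob` are hypothetical, and the `.zip` suffix on the stored file name is an assumption.

```python
import base64


def find_blob_sha(contents_listing, filename):
    """Return the SHA of `<filename>.zip` in a Contents API listing, or None.

    `contents_listing` is the decoded JSON list returned by
    GET /repos/{owner}/{repo}/contents. The ".zip" suffix is an assumption.
    """
    wanted = filename + ".zip"
    for entry in contents_listing:
        if entry.get("name") == wanted:
            return entry.get("sha")
    return None


def decode_blob(blob):
    """Decode the base64 payload of a Blob API response into raw bytes."""
    if blob.get("encoding") != "base64":
        raise ValueError("unexpected blob encoding: %r" % blob.get("encoding"))
    # GitHub wraps the base64 text in newlines; b64decode ignores them.
    return base64.b64decode(blob["content"])
```

Release assets are fetched as ordinary files, so neither the SHA lookup nor the base64 decoding step would be needed after the proposed change.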
Configuration (settings.py:274-277)
ROR_DUMP = {}
ROR_DUMP['PROD_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data'
ROR_DUMP['TEST_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-data-test'
ROR_DUMP['GITHUB_TOKEN'] = os.environ.get('GITHUB_TOKEN')
URL pattern (urls.py:20)
url(r"^(?P<version>(v1|v2))\/indexdatadump\/(?P<filename>v(\d+\.)?(\d+\.)?(\*|\d+)-\d{4}-\d{2}-\d{2}-ror-data)\/(?P<dataenv>(test|prod))$", IndexDataDump.as_view())
Note: The regex hardcodes ror-data as part of the expected filename pattern.
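To make the effect of the hardcoded `ror-data` suffix concrete, the route regex can be exercised on its own with Python's `re` module. This is a standalone sketch: the pattern is copied from the URL above, while the `ROUTE` name and the sample paths are only illustrative.

```python
import re

# Pattern copied from urls.py:20. Django matches it against the request path
# without the leading slash.
ROUTE = re.compile(
    r"^(?P<version>(v1|v2))\/indexdatadump\/"
    r"(?P<filename>v(\d+\.)?(\d+\.)?(\*|\d+)-\d{4}-\d{2}-\d{2}-ror-data)\/"
    r"(?P<dataenv>(test|prod))$"
)

match = ROUTE.match("v2/indexdatadump/v2.2-2024-02-16-ror-data/prod")
print(match.group("filename"), match.group("dataenv"))

# A filename ending in anything other than "ror-data" is rejected by the
# router before the view runs.
print(ROUTE.match("v2/indexdatadump/v2.2-2024-02-16-ror-records/prod"))
```

As long as the dump files keep the `...-ror-data` naming convention, changing the download source does not require touching this pattern.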
Proposed changes
Option A: Download from GitHub release assets (recommended)
Modify getrordump.py to download the dump zip from a GitHub release asset on ror-records instead of from the ror-data repo contents.
Changes needed:
- `settings.py`: Update `ROR_DUMP` config: `ROR_DUMP['RELEASES_REPO_URL'] = 'https://api.github.com/repos/ror-community/ror-records'`. Keep `TEST_REPO_URL` if `ror-data-test` is still needed for testing, or create a test release on `ror-records`.
- `getrordump.py`: Replace the contents/blob download with release asset download:
  - Call `GET /repos/ror-community/ror-records/releases` (or `/releases/tags/{tag}`)
  - Find the asset matching the filename
  - Download from `asset['browser_download_url']` or use the asset API with `Accept: application/octet-stream`
- `setup.py`: Update `get_ror_dump_sha()`: SHA verification no longer applies for release assets. Replace it with a function that verifies the asset exists on the release.
- `urls.py`: The `ror-data` in the filename regex of the URL pattern is just a naming convention for the dump files (e.g., `v2.2-2024-02-16-ror-data`). This can stay as-is since the dump files will keep the same naming convention.
- `views.py`: The `dataenv` parameter may need rethinking. Currently it maps to which repo to download from. Options:
  - Keep `dataenv` but change its semantics (e.g., `prod` = latest release, `test` = pre-release/draft)
  - Replace with a `release-tag` parameter
  - Add an optional `release-tag` parameter alongside `dataenv`
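The `getrordump.py` steps listed above could look roughly like the following. This is a minimal sketch under several assumptions, not a proposed final implementation: it uses the standard library's `urllib` rather than whatever HTTP client the project uses, the function names are hypothetical, the asset is assumed to be named `<filename>.zip`, the mapping from dump filename to release `tag` is left to the caller, and the assets are assumed to be publicly downloadable via `browser_download_url`.

```python
import json
import shutil
import urllib.request

# Value proposed for ROR_DUMP['RELEASES_REPO_URL']; not yet in settings.py.
RELEASES_REPO_URL = "https://api.github.com/repos/ror-community/ror-records"


def find_release_asset(release, filename):
    """Return the asset dict named `<filename>.zip` on a release, or None."""
    wanted = filename + ".zip"  # assumed naming of the uploaded asset
    for asset in release.get("assets", []):
        if asset.get("name") == wanted:
            return asset
    return None


def fetch_release(tag, token=None):
    """Fetch release metadata via GET /releases/tags/{tag}."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = "Bearer " + token
    request = urllib.request.Request(
        RELEASES_REPO_URL + "/releases/tags/" + tag, headers=headers
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def download_asset(asset, dest_path):
    """Stream a public release asset to `dest_path` and return the path."""
    url = asset["browser_download_url"]
    with urllib.request.urlopen(url, timeout=300) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest_path
```

Because `find_release_asset` returns `None` when nothing matches, the same function can serve as the existence check that would replace `get_ror_dump_sha()` in `setup.py`.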
Option B: Download from S3
If the dump zip is also uploaded to S3 as part of the release workflow (#354), the API could download from S3 instead, using the existing boto3 infrastructure in indexror.py.
Files to modify
| File | Change |
|---|---|
| `rorapi/settings.py` (lines 274-277) | Update `ROR_DUMP` URLs to point to `ror-records` |
| `rorapi/management/commands/getrordump.py` (entire file) | New download logic for release assets |
| `rorapi/management/commands/setup.py` (lines 15-20, 42-68) | Update verification and orchestration |
| `rorapi/common/views.py` (lines 302-316) | Possibly update parameter handling |
| `rorapi/common/urls.py` (line 20) | Possibly update URL pattern if `dataenv` changes |
Open questions
- Should we keep the `dataenv` (test/prod) parameter or replace it with `release-tag`?
- How should test/staging indexing work? It currently uses the `ror-data-test` repo. Options: use draft releases, use a separate tag convention, or keep `ror-data-test` for now.
- Should the API also support downloading from S3 as a fallback?
Acceptance criteria
- `IndexDataDump` endpoint downloads dump zip files from `ror-records` GitHub release assets instead of `ror-data` repo contents
- The SHA-based verification in `setup.py` is replaced with release asset existence verification
- `getrordump.py` uses the GitHub Releases API (or S3) to fetch the zip file rather than the Contents/Blob API
- `settings.py` `ROR_DUMP` configuration points to the `ror-records` repository
- Existing indexing pipeline (`DeleteIndex` -> `CreateIndex` -> `IndexRorDump`) continues to work unchanged after the download step
- Dev and staging indexing paths are defined and functional
- Error handling covers: release not found, asset not found, download failure, authentication failure
- Existing tests are updated and passing
- Manual verification: successfully index a dump from a `ror-records` release in a staging environment
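One possible way to structure the four error cases named in the criteria is a small exception hierarchy plus a status-code mapping. Everything here is hypothetical: the class names, the `error_for_status` and `require_asset` helpers, and the choice to treat 403 as an authentication failure (GitHub also returns 403 when a rate limit is hit, so real code may need to inspect the response further).

```python
class DumpDownloadError(Exception):
    """Base class for failures while fetching a dump zip."""


class ReleaseNotFound(DumpDownloadError):
    """The requested release does not exist on the repository."""


class AssetNotFound(DumpDownloadError):
    """The release exists but has no asset with the expected name."""


class AuthenticationFailed(DumpDownloadError):
    """The token was rejected or lacks access."""


class DownloadFailed(DumpDownloadError):
    """Any other failure, such as a server error or interrupted transfer."""


def error_for_status(status_code):
    """Map a Releases API HTTP status to an exception class (None on 200)."""
    if status_code == 200:
        return None
    if status_code in (401, 403):
        return AuthenticationFailed
    if status_code == 404:
        return ReleaseNotFound
    return DownloadFailed


def require_asset(release, asset_name):
    """Return the named asset from release metadata or raise AssetNotFound."""
    for asset in release.get("assets", []):
        if asset.get("name") == asset_name:
            return asset
    raise AssetNotFound(asset_name)
```

Sharing a base class lets the `setup` command catch `DumpDownloadError` once and stop before `DeleteIndexCommand` runs, so a failed download does not leave an empty index.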