-
Notifications
You must be signed in to change notification settings - Fork 1
Enhance datagouv file retrieval with API fallback for missing URLs (local mode without S3) #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ocal mode without S3)
WalkthroughThe datagouv-get-files download flow has been reworked to fetch URLs from local catalog entries or query the API as a fallback, with added fallback logic for .gz variants, conditional recoding for "deces" files, and explicit success/failure logging replacing previous inline SHA-1 validation. Changes
Sequence Diagram(s)sequenceDiagram
participant Script
participant LocalCatalog as Local Catalog
participant API
participant URL as Remote Server
rect rgb(200, 220, 255)
Note over Script,LocalCatalog: New Flow: URL Resolution Phase
Script->>LocalCatalog: Fetch URL from catalog entry
LocalCatalog-->>Script: URL found or not found
alt URL not found locally
Script->>API: Query API for URL discovery
API-->>Script: Return URL
end
end
rect rgb(200, 220, 255)
Note over Script,URL: Download Phase with Fallback
Script->>URL: Attempt download .txt
alt Success
URL-->>Script: .txt downloaded
Script->>Script: Compute SHA-1 (txt)
alt Filename starts with "deces" && after deces-2010
Script->>Script: Apply recoding
end
else Failure
Script->>API: Query for .gz variant URL
Script->>URL: Attempt download .txt.gz
alt Success
URL-->>Script: .txt.gz downloaded
Script->>Script: Compute SHA-1 (gz)
end
end
end
rect rgb(200, 220, 255)
Note over Script,Script: Logging
Script->>Script: Log explicit success/failure messages
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
Makefile (3)
966-977: Clear URL resolution strategy with good fallback messaging.The two-tier lookup (catalog then API) is well-structured and provides good observability. The diagnostic messages (lines 973, 976) improve traceability.
However, consider validating that the file variable doesn't require shell-escaping for the jq query. If catalog entries could contain special characters, quote escaping in the jq filter might be needed for robustness (e.g., using
@jsonor proper escaping).I can generate a shell script to verify that catalog entries are safe for direct jq interpolation, or we could add explicit escaping.
981-985: Use POSIX-compatible shell syntax for portability.Line 981 uses bash-specific
[[for pattern matching. While the Makefile specifies bash (line 8), consider switching to POSIX[for better portability:- if [[ "$$file" == "deces"* ]]; then + if [ "$${file#deces}" != "$$file" ]; thenAdditionally, if the
recodecommand fails on line 983, execution continues to line 986 (gzip) without explicit error handling. Consider adding error checks if recode failures should block further processing:recode utf8..latin1 ${DATA_DIR}/$$file; + if [ $$? -ne 0 ]; then echo " ✗ Recode failed for $$file"; continue; fi
987-1002: Fallback logic is sound, but implicit error handling could be more explicit.The .gz fallback correctly implements two paths: API lookup if URL was initially missing (line 990), or URL construction if it existed (line 994). The flow handles all scenarios, though error cases are implicit.
When both .txt and .gz downloads fail, the error is silently suppressed by
2>/dev/nulland only reported at line 999. This is acceptable but could be clearer. Consider explicitly verifying URL_GZ is non-empty before attempting download:if [ ! -z "$$URL_GZ" ] && curl -s --fail $$URL_GZ > ${DATA_DIR}/$$file.gz 2>/dev/null; then echo " ✓ Downloaded $$file.gz"; + elif [ -z "$$URL_GZ" ]; then + echo " ✗ No .gz URL found (neither catalog nor API)"; else echo " ✗ Failed to download $$file (both .txt and .txt.gz)";This distinguishes between "URL not found" vs "URL found but download failed."
🔧 Improve robustness of data.gouv.fr download
Context
The
datagouv-get-filestarget downloads files from the data.gouv.fr API. This PR fixes several robustness issues identified during the download process.Issues Fixed
Handling of Missing URLs
Improved
.txtto.txt.gzFallbackBetter Traceability
Technical Changes
Lines 967-1002: Rework of the download logic
.txtfirst, then fallback to.txt.gzon failure..txtand.txt.gzare unavailable).Impact
Tests Performed
.txtfiles..txt.gz.Summary by CodeRabbit
Bug Fixes
Improvements