
Conversation

@sebfournier95 commented Oct 27, 2025

🔧 Improve robustness of data.gouv.fr download

Context

The datagouv-get-files target downloads files from the data.gouv.fr API. This PR fixes several robustness issues identified during the download process.

Issues Fixed

  1. Handling of Missing URLs

    • Added an API lookup as a fallback when the URL is not in the catalog (see the sketch after this list).
    • Explicit informational messages for easier debugging.
  2. Improved .txt to .txt.gz Fallback

    • Clearer fallback logic between text and compressed formats.
    • Distinct diagnostic messages (✓/✗).
    • Proper handling of both failure cases.
  3. Better Traceability

    • Explicit display of the URLs being used.
    • Standardized success/failure messages with visual symbols.
    • Counter for downloaded/refreshed files.
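
A minimal sketch of the two-tier lookup from item 1 above (the catalog path, jq filter, and DATASET_ID variable are illustrative assumptions, not the PR's actual code; `$$` is make's escaping for a shell `$`):

```sh
# Resolve the URL from the local catalog first, then fall back to the data.gouv.fr API.
URL=$$(jq -r --arg f "$$file" '.[] | select(.title == $$f) | .url' ${DATA_DIR}/catalog.json); \
if [ -z "$$URL" ] || [ "$$URL" = "null" ]; then \
  echo "  URL not in catalog, querying the API for $$file"; \
  URL=$$(curl -s "https://www.data.gouv.fr/api/1/datasets/$$DATASET_ID/" \
    | jq -r --arg f "$$file" '.resources[] | select(.title == $$f) | .url'); \
fi
```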

Technical Changes

Lines 967-1002: Rework of the download logic (sketched below)

  • Catalog URL check with a fallback to an API lookup.
  • Attempt to download .txt first, then fall back to .txt.gz on failure.
  • Diagnostic messages at each step.
  • Handling of total failure (both .txt and .txt.gz are unavailable).
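
The .txt-first download with the .txt.gz fallback could then be shaped roughly as follows (a sketch only; the exact variable names and messages are assumptions):

```sh
# Try the .txt URL first; on failure, retry the .gz variant before reporting a hard failure.
if curl -s --fail "$$URL" > ${DATA_DIR}/$$file 2>/dev/null; then \
  echo "  ✓ Downloaded $$file"; \
else \
  URL_GZ="$$URL.gz"; \
  if curl -s --fail "$$URL_GZ" > ${DATA_DIR}/$$file.gz 2>/dev/null; then \
    echo "  ✓ Downloaded $$file.gz"; \
  else \
    echo "  ✗ Failed to download $$file (both .txt and .txt.gz)"; \
  fi; \
fi
```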

Impact

  • ✅ Reduced silent failures.
  • ✅ Easier debugging with explicit logs.
  • ✅ Increased robustness against changes in the data.gouv.fr API.
  • ✅ Better user experience.

Tests Performed

  • Download of available .txt files.
  • Automatic fallback to .txt.gz.
  • API lookup when URL is missing from the catalog.
  • Handling of download errors (see the invocation example below).
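
These scenarios can be reproduced by invoking the target directly; for example (the DATA_DIR value is an illustrative override, not the project's default):

```sh
# Run the download target locally against a scratch directory.
make datagouv-get-files DATA_DIR=./data
```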

Summary by CodeRabbit

  • Bug Fixes

    • Improved file download reliability with enhanced fallback mechanisms for alternative file formats.
    • Enhanced error handling and status logging for download operations.
  • Improvements

    • Optimized download flow with better handling of unavailable primary file sources.


coderabbitai bot commented Oct 27, 2025

Walkthrough

The datagouv-get-files download flow has been reworked to fetch URLs from local catalog entries, falling back to an API query when a URL is missing. It also adds fallback logic for .gz variants, conditional recoding for "deces" files, and explicit success/failure logging that replaces the previous inline SHA-1 validation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Download flow refactoring (Makefile) | Replaced direct catalog-based URL and SHA-1 extraction with API-assisted URL resolution; added a fallback mechanism that attempts the .gz variant when the .txt download fails; introduced conditional recoding for "deces" files; changed from inline SHA-1 diff validation to post-download SHA-1 computation with explicit success/failure status logging. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script
    participant LocalCatalog as Local Catalog
    participant API
    participant URL as Remote Server

    rect rgb(200, 220, 255)
    Note over Script,LocalCatalog: New Flow: URL Resolution Phase
    Script->>LocalCatalog: Fetch URL from catalog entry
    LocalCatalog-->>Script: URL found or not found
    alt URL not found locally
        Script->>API: Query API for URL discovery
        API-->>Script: Return URL
    end
    end

    rect rgb(200, 220, 255)
    Note over Script,URL: Download Phase with Fallback
    Script->>URL: Attempt download .txt
    alt Success
        URL-->>Script: .txt downloaded
        Script->>Script: Compute SHA-1 (txt)
        alt Filename starts with "deces" && after deces-2010
            Script->>Script: Apply recoding
        end
    else Failure
        Script->>API: Query for .gz variant URL
        Script->>URL: Attempt download .txt.gz
        alt Success
            URL-->>Script: .txt.gz downloaded
            Script->>Script: Compute SHA-1 (gz)
        end
    end
    end

    rect rgb(200, 220, 255)
    Note over Script,Script: Logging
    Script->>Script: Log explicit success/failure messages
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • The logic refactoring introduces multiple new control paths (API fallback, .gz variant handling, conditional recoding) that require verification
  • Changes to validation and logging mechanisms need careful review for correctness
  • The interaction between URL resolution phases and download fallback logic should be traced to ensure error handling is comprehensive

Poem

🐰 A rabbit hops through files with glee,
URLs now fetch from API!
When .txt stumbles, .gz comes to aid,
With SHA-1 checksums perfectly made,
And deces files get their special care—
Downloads now flow with elegant flair! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Enhance datagouv file retrieval with API fallback for missing URLs (local mode without S3)" directly aligns with the main changes in the pull request. According to the summary, the reworked datagouv-get-files download flow now includes API-assisted URL resolution when a URL is not found in the local catalog, which is precisely what the title conveys. The title is specific about the component being modified (datagouv file retrieval), the type of enhancement (API fallback), and the scenario being addressed (missing URLs). While the parenthetical qualifier "(local mode without S3)" adds some length, the overall title remains clear and would help a developer scanning history understand the primary change without ambiguity. |


@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
Makefile (3)

966-977: Clear URL resolution strategy with good fallback messaging.

The two-tier lookup (catalog then API) is well-structured and provides good observability. The diagnostic messages (lines 973, 976) improve traceability.

However, consider validating that the file variable doesn't require shell-escaping for the jq query. If catalog entries could contain special characters, quote escaping in the jq filter might be needed for robustness (e.g., using @json or proper escaping; see the sketch below).

I can generate a shell script to verify that catalog entries are safe for direct jq interpolation, or we could add explicit escaping.
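
For instance, assuming the catalog query follows the pattern above, jq's --arg flag binds the shell value as a jq variable instead of splicing it into the program text:

```sh
# Fragile: $$file is interpolated into the jq program itself, so quotes or
# backslashes in the file name could alter the filter.
URL=$$(jq -r ".[] | select(.title == \"$$file\") | .url" ${DATA_DIR}/catalog.json)
# Safer: --arg passes the value as a jq variable; special characters in $$file
# cannot change the filter's structure.
URL=$$(jq -r --arg f "$$file" '.[] | select(.title == $$f) | .url' ${DATA_DIR}/catalog.json)
```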


981-985: Use POSIX-compatible shell syntax for portability.

Line 981 uses bash-specific [[ for pattern matching. While the Makefile specifies bash (line 8), consider switching to POSIX [ for better portability:

- if [[ "$$file" == "deces"* ]]; then
+ if [ "$${file#deces}" != "$$file" ]; then

Additionally, if the recode command fails on line 983, execution continues to line 986 (gzip) without explicit error handling. Consider adding error checks if recode failures should block further processing:

```diff
  recode utf8..latin1 ${DATA_DIR}/$$file;
+ if [ $$? -ne 0 ]; then echo "  ✗ Recode failed for $$file"; continue; fi
```

987-1002: Fallback logic is sound, but implicit error handling could be more explicit.

The .gz fallback correctly implements two paths: API lookup if URL was initially missing (line 990), or URL construction if it existed (line 994). The flow handles all scenarios, though error cases are implicit.

When both .txt and .gz downloads fail, the error is silently suppressed by 2>/dev/null and only reported at line 999. This is acceptable but could be clearer. Consider explicitly verifying URL_GZ is non-empty before attempting download:

  if [ ! -z "$$URL_GZ" ] && curl -s --fail $$URL_GZ > ${DATA_DIR}/$$file.gz 2>/dev/null; then
    echo "  ✓ Downloaded $$file.gz";
+ elif [ -z "$$URL_GZ" ]; then
+   echo "  ✗ No .gz URL found (neither catalog nor API)";
  else
    echo "  ✗ Failed to download $$file (both .txt and .txt.gz)";

This distinguishes between "URL not found" vs "URL found but download failed."

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between commits 19bdb29 and e531d01.

📒 Files selected for processing (1)
  • Makefile (1 hunks)

