@fils commented May 21, 2025

This commit introduces several improvements to the codebase:

  • Bug Fix:

    • defs/etl_fetch.py: Corrected the fetch_resources function to use the source parameter instead of a hardcoded URL.
  • Testing:

    • Established a unit testing framework using unittest.
    • Added tests for defs/etl_fetch.py (mocking AsyncWebCrawler).
    • Added tests for defs/etl_convert.py (including HTML and PDF conversion, with programmatic PDF fixture creation using reportlab).
    • Identified bamlTest.py as a utility script and documented its purpose.
  • Error Handling & Logging:

    • Python scripts in defs/ (etl_convert, etl_fetch, etl_query) now use Python's logging module for errors, warnings, and info messages, replacing print().
    • defs/etl_query.py: Enhanced CSV parsing to specifically catch polars.exceptions.ShapeError when truncate_ragged_lines=False, logging an informative error and re-raising.
    • Shell scripts (scripts/loadDirToTriplestore.sh, scripts/loadSitemapToTriplestore.sh):
      • Added set -e for safer execution.
      • Added checks for jsonld and curl command existence.
      • Implemented robust error checking for jsonld and curl command failures during processing, with errors reported to stderr.
      • loadSitemapToTriplestore.sh now includes a summary of processed URLs and failures.
  • Documentation (README.md):

    • Added a new "Running Tests" section with instructions.
    • Documented the dependency on the jsonld command-line tool, including installation instructions (e.g., via npm).
    • Clarified the purpose of bamlTest.py.
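
As a rough illustration of the fetch fix and its mocked test, here is a minimal sketch; it is not the code in defs/etl_fetch.py. It assumes the crawler is used as an async context manager with an arun(url=...) coroutine, which the description above does not confirm. The crawler_factory parameter, the make_mock_crawler helper, the logger name, and the example.org URLs are hypothetical and exist only so the sketch runs without the real crawler installed.

```python
import asyncio
import logging
import unittest
from unittest.mock import AsyncMock, MagicMock

# Hypothetical logger name; the real module may configure logging differently.
logger = logging.getLogger("etl_fetch_sketch")


async def fetch_resources(source, crawler_factory):
    """Fetch `source` with a crawler built by `crawler_factory`.

    Assumption: the crawler is an async context manager exposing an
    `arun(url=...)` coroutine. The point of the fix is that the URL comes
    from the `source` argument rather than a hardcoded string.
    """
    async with crawler_factory() as crawler:
        logger.info("Fetching %s", source)
        try:
            return await crawler.arun(url=source)
        except Exception:
            # Log with traceback, then re-raise so callers still see the failure.
            logger.exception("Fetch failed for %s", source)
            raise


def make_mock_crawler(result):
    """Build a mock factory whose async context manager yields a mock crawler."""
    crawler = MagicMock()
    crawler.arun = AsyncMock(return_value=result)
    factory = MagicMock()
    factory.return_value.__aenter__.return_value = crawler
    # Returning False from __aexit__ lets exceptions propagate.
    factory.return_value.__aexit__.return_value = False
    return factory, crawler


class FetchResourcesTest(unittest.TestCase):
    def test_uses_source_parameter(self):
        factory, crawler = make_mock_crawler("<html>ok</html>")
        result = asyncio.run(
            fetch_resources("https://example.org/sitemap.xml", factory)
        )
        self.assertEqual(result, "<html>ok</html>")
        crawler.arun.assert_awaited_once_with(url="https://example.org/sitemap.xml")

    def test_logs_and_reraises_on_failure(self):
        factory, crawler = make_mock_crawler(None)
        crawler.arun.side_effect = RuntimeError("boom")
        with self.assertLogs(logger, level="ERROR"):
            with self.assertRaises(RuntimeError):
                asyncio.run(fetch_resources("https://example.org/a", factory))
```

Saved to a file, the tests could be run with python -m unittest. Passing the factory in as an argument is a choice made for this sketch to keep it self-contained; patching the crawler class where the module imports it would exercise the same behavior.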

@fils self-assigned this May 21, 2025