Conversation

Member

@amotl amotl commented Jul 7, 2025

About

Support for CrateDB in ingestr has been unlocked just recently. This patch makes a start at providing runnable examples and additional quality assurance by using the examples as integration tests.

What's inside

Runnable examples that target CrateDB, starting with Kafka.

Synopsis

ingestr ingest --yes \
  --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test_group" \
  --source-table "demo" \
  --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
  --dest-table "doc.kafka_demo"

Tutorial

See Tutorial: Loading data from Apache Kafka into CrateDB.

/cc @kneth


coderabbitai bot commented Jul 7, 2025

Walkthrough

A new integration test and demonstration suite for the ingestr tool was added under application/ingestr/. This includes environment and configuration files, a Docker Compose stack for Kafka and CrateDB, a Xonsh-based end-to-end demo script, a test runner, and supporting documentation. GitHub Actions and Dependabot automation were also introduced for this directory.

Changes

| File(s) | Change Summary |
| --- | --- |
| .github/dependabot.yml | Added Dependabot configs for Python and Docker Compose in application/ingestr/ (daily checks). |
| .github/workflows/application-ingestr.yml | New GitHub Actions workflow to test application/ingestr on PRs, pushes, schedule, and manually. |
| application/ingestr/.env | New environment file: defines Kafka, CrateDB, and KCAT versions, ports, and connection settings. |
| application/ingestr/.gitignore | New .gitignore: ignores data, artifacts, and virtual envs, but keeps .env. |
| application/ingestr/README.md | New README: describes the ingestr demo, integration tests, and usage instructions. |
| application/ingestr/kafka-compose.yml | New Docker Compose: orchestrates Kafka, CrateDB, and related task services for the demo. |
| application/ingestr/requirements.txt | New requirements file: adds ingestr from a GitHub branch. |
| application/ingestr/test.sh | New shell script: sets up environment and runs Kafka demo integration test. |
| application/ingestr/.dlt/config.toml | New config: disables telemetry for DLT runtime. |
| application/ingestr/kafka-demo.xsh | New Xonsh script: full E2E demo that downloads data, loads it into Kafka, ingests into CrateDB, and verifies the result. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Test Script (test.sh)
    participant Kafka Demo (kafka-demo.xsh)
    participant Docker Compose
    participant Kafka
    participant CrateDB
    participant Ingestr

    User->>Test Script (test.sh): Run integration test
    Test Script (test.sh)->>Kafka Demo (kafka-demo.xsh): Execute demo
    Kafka Demo (kafka-demo.xsh)->>Docker Compose: Start Kafka & CrateDB
    Kafka Demo (kafka-demo.xsh)->>Kafka: Create topic, publish data
    Kafka Demo (kafka-demo.xsh)->>Ingestr: Ingest data from Kafka to CrateDB
    Ingestr->>CrateDB: Write records
    Kafka Demo (kafka-demo.xsh)->>CrateDB: Query for verification
    Kafka Demo (kafka-demo.xsh)->>Docker Compose: Tear down stack (optional)

Poem

🐇
Ingestr hops with Docker flair,
Kafka, CrateDB—what a pair!
From NDJSON to table rows,
The demo script just boldly goes.
Tests now run with nightly cheer,
Integration magic, crystal clear!
—A rabbit who loves pipelines


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4326485 and bb3f987.

📒 Files selected for processing (3)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/kafka-demo.xsh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env

[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key

🔇 Additional comments (6)
application/ingestr/.env (2)

7-12: Static analysis warnings are false positives - configuration is correct.

The dotenv-linter warnings about "SubstitutionKey" and "UnorderedKey" are false positives. The shell parameter expansion syntax ${VAR:-default} is correct and commonly used for providing default values in environment files. The logical grouping of related variables is more important than alphabetical ordering.


21-24: Environment variable construction is well-designed.

The derived URLs serve different purposes effectively:

  • CRATEDB_HTTP_URL_DOCKER for authenticated access from containers
  • CRATEDB_HTTP_URL_NOAUTH for local access without authentication

This separation provides flexibility for different usage scenarios.
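A hedged sketch of how such a derived pair is typically composed: only the two URL names above are known from the summary, while the `cratedb` host name, port 4200, and the crate:crate defaults are illustrative assumptions, not necessarily the file's actual contents.

```shell
#!/usr/bin/env bash
# ${VAR:-default} keeps an existing value and falls back otherwise;
# host name "cratedb" and port 4200 are assumptions for this sketch.
CRATEDB_HTTP_SCHEME=${CRATEDB_HTTP_SCHEME:-http}
CRATEDB_USERNAME=${CRATEDB_USERNAME:-crate}
CRATEDB_PASSWORD=${CRATEDB_PASSWORD:-crate}
CRATEDB_HTTP_PORT=${CRATEDB_HTTP_PORT:-4200}

# Authenticated URL for access from other containers on the Compose network.
CRATEDB_HTTP_URL_DOCKER="${CRATEDB_HTTP_SCHEME}://${CRATEDB_USERNAME}:${CRATEDB_PASSWORD}@cratedb:${CRATEDB_HTTP_PORT}"
# Unauthenticated URL for local access.
CRATEDB_HTTP_URL_NOAUTH="${CRATEDB_HTTP_SCHEME}://localhost:${CRATEDB_HTTP_PORT}"

echo "${CRATEDB_HTTP_URL_DOCKER}"
echo "${CRATEDB_HTTP_URL_NOAUTH}"
```

This is also why the dotenv-linter warnings above are false positives: the `${VAR:-default}` expansions are valid shell, evaluated when the file is sourced.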

application/ingestr/kafka-demo.xsh (4)

184-207: Excellent implementation of HTTP client using Python stdlib.

The query_cratedb function is well-implemented using Python's built-in libraries:

  • Proper HTTP request construction with JSON payload
  • Correct Basic authentication header encoding
  • Appropriate error handling through urllib's exception mechanism
  • Clean separation of concerns

This is a significant improvement over the previous container-based approach mentioned in past reviews.
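For comparison, the request that function sends can be reconstructed from the shell without a live server. CrateDB's HTTP endpoint accepts `POST /_sql` with a JSON body of the form `{"stmt": "..."}`; the table name and the crate:crate credentials below are taken from the demo's defaults.

```shell
#!/usr/bin/env bash
# Build the pieces of the SQL-over-HTTP request assembled by the xonsh script.
stmt='SELECT COUNT(*) FROM doc.kafka_demo'
payload=$(printf '{"stmt": "%s"}' "${stmt}")

# Basic authentication: base64 of "username:password".
auth=$(printf 'crate:crate' | base64)

echo "POST /_sql HTTP/1.1"
echo "Authorization: Basic ${auth}"
echo "Content-Type: application/json"
echo "${payload}"
```

With the stack running, the same query can be executed via `curl -sS -u crate:crate -H "Content-Type: application/json" -d "${payload}" http://localhost:4200/_sql` (4200 being CrateDB's default HTTP port).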


34-41: Comprehensive error handling successfully implemented.

The script now includes proper error handling for all critical operations:

  • Docker Compose service startup
  • Kafka topic creation
  • Archive extraction
  • Data publishing to Kafka
  • Ingestr ingestion command

Each operation checks return codes and exits with appropriate error messages, addressing all previous feedback on error handling.

Also applies to: 59-64, 95-98, 111-115, 128-139


154-171: Robust data verification with clear success/failure reporting.

The verification logic is well-implemented:

  • Compares actual vs expected record counts
  • Provides clear visual feedback with color-coded messages
  • Exits with appropriate error codes for CI integration
  • Includes helpful context about expected vs actual values

26-71: Well-structured Infrastructure class with proper service lifecycle management.

The Infrastructure class provides a clean abstraction for:

  • Service startup/shutdown with Docker Compose
  • Kafka topic lifecycle management
  • Proper method chaining for fluent interface
  • Comprehensive error handling throughout
‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

38-46: title() may use colors before they are initialised

title prints ${BYELLOW}, but init_colors might not have been called yet when a script merely sources this file and calls title directly. Either:

  1. Call init_colors lazily inside title when BYELLOW is empty, or
  2. Document that callers must invoke init_colors first.
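Option 1 might look like the following sketch; the escape values and the reduced function bodies are illustrative, not the actual util.sh contents.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of lazy color initialisation.
init_colors() {
  BYELLOW='\033[1;33m'
  NC='\033[0m'
}

title() {
  # Initialise colors on first use when the caller has not done so yet.
  [ -n "${BYELLOW:-}" ] || init_colors
  local text=$1
  echo -e "${BYELLOW}${text}${NC}"
}

title "Demo"  # works even though init_colors was never called explicitly
```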
application/ingestr/test.sh (1)

5-9: Suppress command lookup noise & hard failures

command -v uv prints the path on success and errors on failure. Redirect stdout/stderr and explicitly fail when installation fails.

-if ! command -v uv; then
-  pip install uv
+if ! command -v uv >/dev/null 2>&1; then
+  pip install uv || {
+    echo "Failed to install uv" >&2
+    exit 1
+  }
 fi
application/ingestr/kafka-cmd.sh (1)

124-129: Quote variables in string comparison (SC2053)

An unquoted right-hand side inside [[ ]] is treated as a glob pattern rather than a literal string, so stray wildcard characters can cause false matches.

-if [[ ${size_actual} = ${size_reference} ]]; then
+if [[ "${size_actual}" = "${size_reference}" ]]; then
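The failure mode is easy to reproduce; the values below are made up for illustration:

```shell
#!/usr/bin/env bash
size_actual="5000"
size_reference="*"   # imagine a bogus reference value sneaking in

# Unquoted: "*" acts as a glob and matches anything -> false positive.
unquoted="no match"
[[ ${size_actual} = ${size_reference} ]] && unquoted="match"

# Quoted: compared literally -> correct result.
quoted="no match"
[[ "${size_actual}" = "${size_reference}" ]] && quoted="match"

echo "unquoted=${unquoted} quoted=${quoted}"
```

This prints `unquoted=match quoted=no match`, demonstrating why the quoted form is the safe one.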
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceec7a9 and 732a4be.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • application/ingestr/requirements.txt
  • application/ingestr/.gitignore
  • .github/dependabot.yml
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
  • application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (36-42)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[warning] 124-124: Quote the right-hand side of = in [[ ]] to prevent glob matching. (SC2053)

application/ingestr/test.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive. (SC2148)

application/ingestr/util.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive. (SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally). (SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally). (SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally). (SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally). (SC2034)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
application/ingestr/util.sh (2)

1-4: Add she-bang & strict mode header
Still missing after the previous review – scripts that source this file get no safety nets and ShellCheck keeps complaining (SC2148).

+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'

5-14: getopt still runs at sourcing time – breaks callers
Executing getopt before read_options() means every script that merely sources util.sh has its own $@ parsed and may exit on bad options. Same concern was raised earlier but not addressed.

-options=$(getopt --options k --longoptions keepalive -- "$@")
-function read_options() {
+function read_options() {
+  local options
+  options=$(getopt --options k --longoptions keepalive -- "$@")
   # Call getopt to validate the provided input.
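A complete sketch of the suggested shape, assuming the util-linux getopt binary and the -k/--keepalive option the script defines; the loop body is a hypothetical reconstruction:

```shell
#!/usr/bin/env bash
# Parsing happens only when read_options is called with the caller's "$@",
# so merely sourcing this file has no side effects.
read_options() {
  local options
  options=$(getopt --options k --longoptions keepalive -- "$@") || return 1
  eval set -- "${options}"
  KEEPALIVE=false
  while true; do
    case "$1" in
      -k|--keepalive) KEEPALIVE=true; shift ;;
      --) shift; break ;;
      *) break ;;
    esac
  done
  SUBCOMMAND=${1:-}
}

read_options --keepalive start
echo "KEEPALIVE=${KEEPALIVE} SUBCOMMAND=${SUBCOMMAND}"
```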
application/ingestr/test.sh (1)

1-4: Script lacks she-bang & strict mode
Identical comment existed in the last round; please fix to avoid SC2148 and silent failures.

+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
application/ingestr/kafka-cmd.sh (2)

16-18: Sourcing files via relative paths is still brittle
Running the script from outside its directory will fail to find .env and util.sh. Previous refactor suggestion stands.

-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-source .env
-source util.sh
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/.env"
+source "${SCRIPT_DIR}/util.sh"
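The BASH_SOURCE idiom used above resolves the script's own directory regardless of the caller's $PWD, which is what makes the quoted source lines robust:

```shell
#!/usr/bin/env bash
# ${BASH_SOURCE[0]} names this file even when it is sourced, where $0 would
# name the invoking shell instead.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Sibling files can now be referenced from any working directory.
echo "would source: ${SCRIPT_DIR}/.env"
echo "would source: ${SCRIPT_DIR}/util.sh"
```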

191-192: CLI parsing disabled – SUBCOMMAND is never set
read_options is still commented out, yet start_subcommand relies on SUBCOMMAND. Either enable parsing or remove dead code.

-  #read_options
+  read_options "$@"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 732a4be and c112cd5.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • application/ingestr/.gitignore
  • .github/dependabot.yml
  • application/ingestr/README.md
  • application/ingestr/.env
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (38-46)
  • init_colors (48-70)
application/ingestr/test.sh (1)
  • setup (5-9)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (36-41)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[warning] 144-144: Quote the right-hand side of = in [[ ]] to prevent glob matching. (SC2053)

application/ingestr/test.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive. (SC2148)

application/ingestr/util.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive. (SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally). (SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally). (SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally). (SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally). (SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: OS ubuntu-latest


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
application/ingestr/test.sh (1)

1-2: Correct the she-bang

Same issue as in util.sh: add ! so the script can be run directly.

-#/usr/bin/env bash
+#!/usr/bin/env bash
application/ingestr/kafka-cmd.sh (1)

193-195: CLI parsing still disabled – previous remark stands

read_options remains commented out, so SUBCOMMAND and KEEPALIVE are never populated.
Uncomment and forward "$@", or delete the dead code.

-  #read_options
+  read_options "$@"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c112cd5 and 28f635a.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/.gitignore
  • application/ingestr/README.md
  • .github/dependabot.yml
  • application/ingestr/requirements.txt
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
  • .github/workflows/application-ingestr.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)

application/ingestr/test.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)

application/ingestr/util.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally). (SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally). (SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally). (SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally). (SC2034)

@amotl amotl force-pushed the ingestr branch 2 times, most recently from dd46448 to bdb738c Compare July 7, 2025 17:49
@amotl amotl requested review from hlcianfagna and surister July 7, 2025 17:51
@amotl amotl changed the title Ingestr: Add runnable examples Ingestr: Add runnable example loading from Kafka Jul 7, 2025
@amotl amotl marked this pull request as ready for review July 7, 2025 17:52

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (5)
application/ingestr/util.sh (1)

1-2: Fix the malformed she-bang (again)
The first line still misses the exclamation mark and therefore isn’t recognised as an executable script.

-#/usr/bin/env bash
+#!/usr/bin/env bash
application/ingestr/test.sh (1)

1-6: Add proper she-bang + strict mode

-#/usr/bin/env bash
+#!/usr/bin/env bash
+set -euo pipefail

pipefail is important here because later commands rely on pipes (|), and without it a failure early in a pipeline would go undetected.
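A minimal demonstration of the difference:

```shell
#!/usr/bin/env bash
# By default a pipeline's exit status is that of its *last* command,
# so the failing `false` on the left is masked by the succeeding `cat`.
false | cat
status_default=$?

set -o pipefail
false | cat
status_pipefail=$?

echo "default=${status_default} pipefail=${status_pipefail}"  # default=0 pipefail=1
```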

application/ingestr/kafka-cmd.sh (3)

1-2: She-bang still malformed

Same issue as in the other scripts—add the missing !.

-#/usr/bin/env bash
+#!/usr/bin/env bash

185-196: CLI parsing still disabled – dead code path

read_options remains commented out, meaning SUBCOMMAND/KEEPALIVE can never be set from the CLI. Either re-enable the call or delete the unused logic.

-  #read_options
+  read_options "$@"

22-23: Make COMPOSE_FILE path-safe

Running the script outside its directory fails because the compose file is looked up relative to $PWD.

-COMPOSE_FILE=kafka-compose.yml
+COMPOSE_FILE="${SCRIPT_DIR}/kafka-compose.yml"
🧹 Nitpick comments (2)
application/ingestr/util.sh (1)

44-51: Quote and localise variables inside title()
Unquoted expansions will break on spaces/new-lines and the function leaks globals.

-function title() {
-  text=$1
-  len=${#text}
+function title() {
+  local text=$1
+  local len=${#text}
-  echo ${guard}
-  echo -e "${BYELLOW}${text}${NC}"
-  echo ${guard}
+  echo "${guard}"
+  echo -e "${BYELLOW}${text}${NC}"
+  echo "${guard}"
application/ingestr/test.sh (1)

10-14: Ensure uvx is available after installing uv
pip install uv puts the binary under the Python installation’s bin directory, which might not be on $PATH in CI runners. Consider:

python -m pip install --user uv
export PATH="$HOME/.local/bin:$PATH"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd46448 and bdb738c.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • .github/dependabot.yml
  • application/ingestr/.gitignore
  • .github/workflows/application-ingestr.yml
  • application/ingestr/README.md
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)

application/ingestr/test.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)

application/ingestr/util.sh

[error] 1-1: Use #!, not just #, for the shebang. (SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally). (SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally). (SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally). (SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally). (SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally). (SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally). (SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally). (SC2034)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
application/ingestr/kafka-cmd.sh (1)

192-196: CLI parsing remains disabled – SUBCOMMAND & KEEPALIVE never set
read_options is still commented out, so the flags & subcommand logic are dead code and KEEPALIVE will always be empty. Either wire the parser back in or drop the unused branches.

-  #read_options
+  read_options "$@"
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

43-51: Quote expansions inside title()
echo ${guard} and echo -e "${BYELLOW}${text}${NC}" rely on word-splitting rules; titles containing spaces or glob characters will break/expand unexpectedly.

-  echo
-  echo ${guard}
-  echo -e "${BYELLOW}${text}${NC}"
-  echo ${guard}
+  echo
+  echo "${guard}"
+  echo -e "${BYELLOW}${text}${NC}"
+  echo "${guard}"
application/ingestr/kafka-cmd.sh (2)

28-36: Quote path variables passed to Docker Compose
${COMPOSE_FILE} may contain spaces (e.g. when the repo is checked out into “/tmp/ingestr demo/”). Quote it to avoid argument splitting.

-  docker compose --file ${COMPOSE_FILE} up --detach
+  docker compose --file "${COMPOSE_FILE}" up --detach

Apply the same quoting pattern at lines 35-36, 57-65, 105-157, etc.


88-100: Avoid useless cat in data subset creation
cat file | head … spawns an unnecessary process; head can read the file directly.

-    cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+    head -n 5000 nyc-yellow-taxi-2017.json > nyc-yellow-taxi-2017-subset.ndjson
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdb738c and bf6736b.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • application/ingestr/requirements.txt
  • application/ingestr/.dlt/config.toml
  • application/ingestr/test.sh
  • .github/dependabot.yml
  • application/ingestr/.gitignore
  • application/ingestr/README.md
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
  • application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh

[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: OS ubuntu-latest


# Run job each night after CrateDB nightly has been published.
schedule:
- cron: '0 3 * * *'
Member

Nice to avoid bit rot - thanks

@@ -0,0 +1 @@
ingestr @ git+https://github.com/crate-workbench/ingestr.git@next
Member Author

@amotl amotl Jul 8, 2025

Backlog: This needs to be replaced by a real package as we go. Because ingestr might currently not be the topmost priority for its authors, we may need to friendly-fork and run separate releases to gain speed and bundling convenience until individual patches make it into mainline.

Member Author

Many patches are queuing up already, some of them crucial for operating ingestr with CrateDB, some convenient, and some currently impossible to merge.

Bundling all of them into the next branch makes sense, and from there, packaging the bundle and publishing it to PyPI is just a minor additional step, because automation on GHA will do all the work for us. A changelog will inform users about what's inside, and actual software releases will let them deterministically roll back and forth as we go, carrying CrateDB support for ingestr forward as a special animal, until it isn't.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
application/ingestr/util.sh (1)

1-6: Add pipefail & IFS to complete strict-mode
The script still omits set -o pipefail (and a sane IFS) even after earlier feedback. Silent failures inside pipelines can slip through.

 #!/usr/bin/env bash
-set -eu
+set -euo pipefail
+IFS=$'\n\t'
application/ingestr/kafka-cmd.sh (2)

193-196: CLI parsing still disabled — dead code persists

read_options remains commented, so SUBCOMMAND is never set and related logic is inert. Either uncomment or remove the sub-command machinery.


16-17: Enable pipefail for reliable failure propagation

Without pipefail, any failure on the left side of a pipeline is ignored, which is dangerous for commands like cat | docker … and HTTP queries.

-set -eu
+set -euo pipefail
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

43-51: Scope internal vars in title() to avoid side-effects

text, len, and guard are written as globals; downstream scripts may accidentally override them.

-function title() {
-  text=$1
-  len=${#text}
-  guard=$(printf "%${len}s" | sed 's/ /=/g')
+function title() {
+  local text=$1
+  local len=${#text}
+  local guard
+  guard=$(printf '%*s' "${len}" '' | sed 's/ /=/g')
application/ingestr/kafka-cmd.sh (1)

28-36: Quote ${COMPOSE_FILE} to handle spaces & special chars

Unquoted path variables can break if the script lives in a directory with spaces.

-  docker compose --file ${COMPOSE_FILE} up --detach
+  docker compose --file "${COMPOSE_FILE}" up --detach

Apply the same quoting to every later docker compose --file invocation.

application/ingestr/README.md (1)

1-6: Minor Markdown spacing / heading style

A blank line after the main heading improves readability and prevents some renderers from merging the following heading.

-# Use CrateDB with ingestr
-## About
+# Use CrateDB with ingestr
+
+## About
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf6736b and 9a3e842.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • .github/dependabot.yml
  • application/ingestr/test.sh
  • application/ingestr/.gitignore
  • .github/workflows/application-ingestr.yml
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧠 Learnings (1)
application/ingestr/README.md (1)
Learnt from: amotl
PR: crate/cratedb-examples#937
File: topic/machine-learning/llm-langchain/requirements-dev.txt:2-2
Timestamp: 2025-05-12T20:10:38.614Z
Learning: The cratedb-toolkit package supports various extras including "io", "datasets", "influxdb", "mongodb", "testing", and many others.
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
🪛 LanguageTool
application/ingestr/README.md

[grammar] ~3-~3: Use proper spacing conventions.
Context: # Use CrateDB with ingestr ## About [ingestr] is a command-line application t...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~5-~5: There might be a mistake here.
Context: ...CrateDB with ingestr ## About [ingestr] is a command-line application that allo...

(QB_NEW_EN_OTHER)


[grammar] ~5-~5: There might be a mistake here.
Context: ...ine application that allows copying data from any source into any destination dat...

(QB_NEW_EN_OTHER)


[grammar] ~6-~6: Use proper spacing conventions.
Context: ...ny source into any destination database. This folder includes runnable examples t...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~8-~8: Use proper spacing conventions.
Context: ... examples that use ingestr with CrateDB. They are also used as integration tests ...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~9-~9: There might be a mistake here.
Context: ...tion tests to ensure software components fit together well. ## Usage To start c...

(QB_NEW_EN_OTHER)


[grammar] ~10-~10: Use proper spacing conventions.
Context: ...e software components fit together well. ## Usage To start cycling without tearing ...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~12-~12: Use proper spacing conventions.
Context: ... components fit together well. ## Usage To start cycling without tearing down th...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~14-~14: There might be a mistake here.
Context: ...earing down the backend stack each time, use the KEEPALIVE environment variable...

(QB_NEW_EN_OTHER)


[grammar] ~15-~15: Use proper spacing conventions.
Context: ...se the KEEPALIVE environment variable. shell export KEEPALIVE=true sh test.sh [ingestr]: https://bruin-data.github.io/ingestr/

(QB_NEW_EN_OTHER_ERROR_IDS_000007)

🪛 Shellcheck (0.10.0)
application/ingestr/util.sh

[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: OS ubuntu-latest

Comment on lines 1 to 22
# Use CrateDB with ingestr

## About

[ingestr] is a command-line application that allows copying data
from any source into any destination database.

This folder includes runnable examples that use ingestr with CrateDB.
They are also used as integration tests to ensure software components
fit together well.

## Usage

To start cycling without tearing down the backend stack each time,
use the `KEEPALIVE` environment variable.
```shell
export KEEPALIVE=true
sh test.sh
```


[ingestr]: https://bruin-data.github.io/ingestr/
Member Author

@amotl amotl Jul 8, 2025

Backlog:

  • Add "What's inside" section, to educate readers about the bunch of files presented here, and how to use them.
  • Add "Synopsis" section including a typical ingestr ingest command to quickly present what it is actually all about.

Member Author

Added with 9f50e4e.

Comment on lines 138 to 152
function verify-data() {
title "Verifying data in CrateDB"
size_reference=5000
size_actual=$(
docker compose --file ${COMPOSE_FILE} run --rm httpie \
http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
| jq .rows[0][0]
)
if [[ "${size_actual}" = "${size_reference}" ]]; then
echo -e "${BGREEN}SUCCESS: Database table contains expected number of ${size_reference} records.${NC}"
else
echo -e "${BRED}ERROR: Expected database table to contain ${size_reference} records, but it contains ${size_actual} records.${NC}"
exit 2
fi
}
Member Author

Backlog: Data validation is currently a bit thin, just counting the number of records and failing if it's not 5_000. Please expand the procedure, to also consider if data is in the right format within the target table. Also, why in hell is this written in Bash?

Member

The joy of shell scripting?

Member Author

@amotl amotl Jul 17, 2025

With e580c61, we've removed Bashisms here and added more flavours of Python instead, using Xonsh now, retaining all the joys of shell scripting [1] while adding more sanity after just a little learning curve.

Footnotes

  1. Which in turn retains the capability to copy/paste each individual fragment of the whole shebang into successive shell invocations.

Comment on lines 8 to 55
# ---------------
# Confluent Kafka
# ---------------
# https://docs.confluent.io/platform/current/installation/docker/config-reference.html
# https://gist.github.com/everpeace/7a317860cab6c7fb39d5b0c13ec2543e
# https://github.com/framiere/a-kafka-story/blob/master/step14/docker-compose.yml
kafka-zookeeper:
image: confluentinc/cp-zookeeper:${CONFLUENT_VERSION}
environment:
ZOOKEEPER_CLIENT_PORT: ${PORT_KAFKA_ZOOKEEPER}
KAFKA_OPTS: -Dzookeeper.4lw.commands.whitelist=ruok
networks:
- ingestr-demo

# Define health check for Zookeeper.
healthcheck:
# https://github.com/confluentinc/cp-docker-images/issues/827
test: ["CMD", "bash", "-c", "echo ruok | nc localhost ${PORT_KAFKA_ZOOKEEPER} | grep imok"]
start_period: 3s
interval: 2s
timeout: 30s
retries: 60

kafka-broker:
image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
ports:
- "${PORT_KAFKA_BROKER_INTERNAL}:${PORT_KAFKA_BROKER_INTERNAL}"
- "${PORT_KAFKA_BROKER_EXTERNAL}:${PORT_KAFKA_BROKER_EXTERNAL}"
environment:
KAFKA_ZOOKEEPER_CONNECT: kafka-zookeeper:${PORT_KAFKA_ZOOKEEPER}
KAFKA_LISTENERS: INTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_EXTERNAL}
KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-broker:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://localhost:${PORT_KAFKA_BROKER_EXTERNAL}
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
depends_on:
- kafka-zookeeper
networks:
- ingestr-demo

# Define health check for Kafka broker.
healthcheck:
#test: ps augwwx | egrep "kafka.Kafka"
test: ["CMD", "nc", "-vz", "localhost", "${PORT_KAFKA_BROKER_INTERNAL}"]
start_period: 3s
interval: 0.5s
timeout: 30s
retries: 60
Member Author

@amotl amotl Jul 8, 2025

Backlog: Please trim this down. A bare-bones setup of Kafka doesn't need any Zookeeper at all. See also Tutorial: Loading data from Apache Kafka into CrateDB for a very minimal example, as suggested by @hlcianfagna -- thanks!

Member Author

It looks like vanilla Kafka 4.0.0 per docker.io/apache/kafka uses a different configuration syntax than the variables defined above. When using them, it fails on startup:

Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration "process.roles" which has no default value.

When using earlier versions like Kafka 3.9.1, it needs a Zookeeper again.

Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration `zookeeper.connect` which has no default value.

Member Author

The recipe is now using Apache Kafka 4.0.0, which in turn uses KRaft, so it no longer needs Zookeeper.

Apache Kafka Raft (KRaft) is the consensus protocol that was introduced in KIP-500 to remove Apache Kafka’s dependency on ZooKeeper for metadata management.

This greatly simplifies Kafka’s architecture by consolidating responsibility for metadata into Kafka itself, rather than splitting it between two different systems: ZooKeeper and Kafka. KRaft mode makes use of a new quorum controller service in Kafka which replaces the previous controller and makes use of an event-based variant of the Raft consensus protocol.

-- KRaft: Apache Kafka Without ZooKeeper

Comment on lines 99 to 156
# -----
# Tasks
# -----

# Create database table in CrateDB.
create-table:
image: westonsteimel/httpie
networks: [ingestr-demo]
command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='CREATE TABLE "kafka_demo" ("payload" OBJECT(DYNAMIC))' --ignore-stdin
deploy:
replicas: 0

# Create Kafka topic.
create-topic:
image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
networks: [ingestr-demo]
command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --create --if-not-exists --replication-factor 1 --partitions 1 --topic demo
deploy:
replicas: 0

# Delete Kafka topic.
delete-topic:
image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
networks: [ingestr-demo]
command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --delete --if-exists --topic demo
deploy:
replicas: 0

# Drop database table in CrateDB.
drop-table:
image: westonsteimel/httpie
networks: [ingestr-demo]
command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='DROP TABLE IF EXISTS "kafka_demo"' --ignore-stdin
deploy:
replicas: 0

# Invoke HTTPie via Docker.
httpie:
image: westonsteimel/httpie
networks: [ingestr-demo]
deploy:
replicas: 0

# Publish data to Kafka topic.
publish-data:
image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
networks: [ingestr-demo]
command: kafka-console-producer --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --topic demo
deploy:
replicas: 0

# Subscribe to Kafka topic.
subscribe-topic:
image: edenhill/kcat:${KCAT_VERSION}
networks: [ingestr-demo]
command: kcat -b kafka-broker -C -t demo # -o end
deploy:
replicas: 0
Member Author

@amotl amotl Jul 8, 2025

Let's get rid of this and just perform all tasks from userspace, possibly entering the Kafka container to invoke utility programs like kafka-topics.sh or kafka-console-producer.sh, as @hlcianfagna did?

podman exec -it kafka "bash"

Member Author

We trimmed down the recipe in other ways, and also reduced arcaneness by using Xonsh instead of Bash. In this spirit, we will keep this section as it is, to keep the promise that userspace needs nothing but Docker and Python.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
application/ingestr/kafka-demo.xsh (2)

91-91: Improve robustness of data subset creation.

The current approach using head -n 5000 assumes the file has at least 5000 lines. Consider adding error handling or using a more robust approach.

-        cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+        # Create subset with error handling
+        if [ $(wc -l < nyc-yellow-taxi-2017.json) -ge 5000 ]; then
+            cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+        else
+            echo "Warning: Dataset has fewer than 5000 records, using all available data"
+            cp nyc-yellow-taxi-2017.json nyc-yellow-taxi-2017-subset.ndjson
+        fi
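For the Xonsh rewrite, the same subset step could also be done in pure Python instead of `wc`/`head`, which handles the "fewer than 5000 lines" case without extra shell plumbing. A minimal sketch under the demo's file-naming assumptions (`make_subset` is a hypothetical helper, not part of the actual script):

```python
from itertools import islice


def make_subset(source: str, target: str, limit: int = 5000) -> int:
    """Copy at most `limit` lines from source to target, returning the count."""
    with open(source, encoding="utf-8") as infile, \
         open(target, "w", encoding="utf-8") as outfile:
        count = 0
        for line in islice(infile, limit):
            outfile.write(line)
            count += 1
    if count < limit:
        print(f"Warning: dataset has only {count} records, using all of them")
    return count
```

Since `islice` simply stops at end of input, a short file degrades gracefully to "use all available data", matching the suggested shell fallback.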

140-156: Consider using proper exception handling instead of exit().

Using exit 2 directly in the script is not the cleanest approach. Consider raising an exception or returning an error status that can be handled by the caller.

-            # FIXME: Exiting from code is not the nicest thing. But well, this is a script anyway.
-            exit 2
+            raise RuntimeError(f"Expected database table to contain at least {$size_reference} records, but it contains {$size_actual} records.")

Then handle this in the main function:

 def main():
+    try:
         infra = Infrastructure()
         data = Datawrapper()
         
         # Standard path.
         infra.start().setup()
         data.acquire().publish().load().display().verify()
         
         if not __xonsh__.env.get("KEEPALIVE"):
             infra.stop()
+    except RuntimeError as e:
+        print_color("{BOLD_RED}")
+        echo f"ERROR: {e}"
+        print_color("{RESET}")
+        return 2
+    return 0
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a3e842 and 8b6047a.

📒 Files selected for processing (6)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/kafka-demo.xsh (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • application/ingestr/requirements.txt
  • application/ingestr/test.sh
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env

[warning] 5-5: [UnorderedKey] The CRATEDB_VERSION key should go before the KAFKA_VERSION key


[warning] 16-16: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly


[warning] 17-17: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly


[warning] 18-18: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly


[warning] 19-19: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly


[warning] 20-20: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly


[warning] 20-20: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key


[warning] 21-21: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly


[warning] 21-21: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key

🔇 Additional comments (7)
application/ingestr/.env (3)

16-21: LGTM: Parameter expansion syntax is correct.

The bash parameter expansion syntax ${VAR:-default} is correctly used for providing default values. The static analysis tool warnings about "SubstitutionKey" are false positives - this is the standard and correct way to set defaults in shell environment files.


30-30: LGTM: HTTP URL construction is correct.

The URL construction properly combines all the required components (scheme, credentials, host, port) into a valid HTTP URL format for CrateDB connection.


4-4: No action needed: Kafka version 4.0.0 is confirmed stable

Kafka 4.0.0 was released March 18, 2025 and is the latest stable release. The specified version in application/ingestr/.env (export KAFKA_VERSION=4.0.0) is correct and compatible—no changes required.

application/ingestr/kafka-demo.xsh (4)

11-12: LGTM: Good configuration for demo environment.

Setting XONSH_SHOW_TRACEBACK = True for debugging and INGESTR_DISABLE_TELEMETRY = True for privacy are appropriate choices for a demo script.


113-118: LGTM: Well-structured ingestr invocation.

The ingestr command is properly structured with clear source and destination URIs. Using uvx with specific Python version and requirements ensures reproducible execution.


161-171: LGTM: Clean utility function implementation.

The title function provides nice visual separation with proper color formatting and is well-implemented.


184-184: LGTM: Appropriate environment variable check.

Using __xonsh__.env.get("KEEPALIVE") is the correct way to check for environment variables in Xonsh with a default of None/False.

Comment on lines 141 to 145
$size_actual=$(
docker compose --file $COMPOSE_FILE run --rm httpie \
http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
| jq .rows[0][0]
)
Member Author

This command still needs jq to be installed on your workstation. Maybe it's better to use crash instead? Or, even better, use Python stdlib's built-in HTTP client and JSON module natively, to further reduce noise here.

Member Author

9e82a83 makes it much better, no longer relying on the container invocation, but doing it with pure Python instead.
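Such a pure-Python check could look roughly like this, using only stdlib `urllib` and `json` against CrateDB's HTTP `/_sql` endpoint (a sketch only; the exact shape of the code in 9e82a83 may differ):

```python
import json
import urllib.request


def query_cratedb(base_url: str, stmt: str) -> dict:
    """POST a SQL statement to CrateDB's HTTP endpoint and decode the response."""
    request = urllib.request.Request(
        f"{base_url}/_sql",
        data=json.dumps({"stmt": stmt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_count(payload: dict) -> int:
    """Pull the single COUNT(*) value out of a CrateDB result payload."""
    return payload["rows"][0][0]
```

With the stack running, `extract_count(query_cratedb("http://localhost:4200", 'SELECT COUNT(*) FROM "kafka_demo"'))` would replace the whole httpie/jq pipeline.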

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
application/ingestr/kafka-demo.xsh (1)

85-85: Add checksum verification for downloaded dataset.

The script downloads data from an external source without verifying its integrity. This could be a security risk and may cause issues if the file is corrupted.

Consider adding checksum verification:

-        wget --no-clobber --continue "https://gist.githubusercontent.com/kovrus/328ba1b041dfbd89e55967291ba6e074/raw/7818724cb64a5d283db7f815737c9e198a22bee4/nyc-yellow-taxi-2017.tar.gz"
+        wget --no-clobber --continue "https://gist.githubusercontent.com/kovrus/328ba1b041dfbd89e55967291ba6e074/raw/7818724cb64a5d283db7f815737c9e198a22bee4/nyc-yellow-taxi-2017.tar.gz"
+        # TODO: Add checksum verification
+        # echo "expected_checksum  nyc-yellow-taxi-2017.tar.gz" | sha256sum -c
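In the Xonsh/Python world, the verification step itself is a few lines of stdlib `hashlib`; the expected digest still needs to be pinned once and checked in (the value below is a placeholder, not the real checksum):

```python
import hashlib


def verify_checksum(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file's SHA-256 digest does not match."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise ValueError(
            f"Checksum mismatch for {path}: got {digest.hexdigest()}"
        )
```

Calling `verify_checksum("nyc-yellow-taxi-2017.tar.gz", EXPECTED_SHA256)` right after the download would fail fast on a corrupted or tampered file.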
🧹 Nitpick comments (2)
application/ingestr/.env (1)

30-31: Add missing newline at end of file.

 export CRATEDB_HTTP_URL="${CRATEDB_HTTP_SCHEME}://${CRATEDB_USERNAME}:${CRATEDB_PASSWORD}@${CRATEDB_HOST}:${CRATEDB_HTTP_PORT}"
+
application/ingestr/kafka-demo.xsh (1)

113-118: Consider using environment variables for ingestr configuration.

The ingestr command has hardcoded connection parameters. Consider using environment variables for better flexibility and consistency with the rest of the configuration.

         uvx --python=3.12 --prerelease=allow --with-requirements=requirements.txt \
             ingestr ingest --yes \
-            --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test_group&value_type=json&select=value" \
+            --source-uri "kafka://?bootstrap_servers=localhost:$PORT_KAFKA_BROKER_EXTERNAL&group_id=test_group&value_type=json&select=value" \
             --source-table "demo" \
-            --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
+            --dest-uri "cratedb://$CRATEDB_USERNAME:$CRATEDB_PASSWORD@$CRATEDB_HOST:$CRATEDB_POSTGRESQL_PORT/?sslmode=disable" \
             --dest-table "doc.kafka_demo"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b6047a and e580c61.

📒 Files selected for processing (6)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/kafka-demo.xsh (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • application/ingestr/requirements.txt
  • application/ingestr/README.md
  • application/ingestr/test.sh
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env

[warning] 5-5: [UnorderedKey] The CRATEDB_VERSION key should go before the KAFKA_VERSION key


[warning] 16-16: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly


[warning] 17-17: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly


[warning] 18-18: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly


[warning] 19-19: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly


[warning] 20-20: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly


[warning] 20-20: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key


[warning] 21-21: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly


[warning] 21-21: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key

🔇 Additional comments (5)
application/ingestr/.env (2)

16-21: Static analysis warnings are false positives.

The dotenv-linter warnings about SubstitutionKey are false positives. The shell parameter expansion syntax ${VAR:-default} is correct and widely used for providing default values in environment files.


4-4: Confirm Kafka 4.0.0 stability and KRaft compatibility

  • Location: application/ingestr/.env, line 4 (export KAFKA_VERSION=4.0.0)
  • Kafka 4.0.0 was officially released on March 18, 2025 and introduces KRaft mode (no ZooKeeper) (cite 3).
  • As of July 17, 2025, the latest stable Kafka release is 3.9.1 (May 21, 2025) (cite 1).
  • Please verify that your infrastructure and the ingestr tool fully support KRaft-only clusters before committing to 4.0.0. If you rely on ZooKeeper or require the absolute latest patch stability, consider sticking with 3.9.1 or testing 4.0.0 in a staging environment first.

Citations:

  1. https://kafka.apache.org/downloads.html
  2. https://medium.com/@usersk7/unveiling-apache-kafka-4-0-the-new-era-of-streaming-is-here-6d8ecb8a9742
application/ingestr/kafka-demo.xsh (3)

11-12: Good configuration for debugging and privacy.

The traceback and telemetry settings are appropriate for a demo/test environment.


184-185: Good use of environment variable for service lifecycle.

The KEEPALIVE environment variable provides flexible control over service cleanup, which is useful for debugging and development.


155-156: Use proper exit code handling instead of direct exit.

The direct exit call breaks the flow and makes testing difficult. Consider using exceptions or return codes.

-            # FIXME: Exiting from code is not the nicest thing. But well, this is a script anyway.
-            exit 2
+            raise SystemExit(2)

Likely an incorrect or invalid review comment.

Comment on lines 141 to 145
$size_actual=$(
docker compose --file $COMPOSE_FILE run --rm httpie \
http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
| jq .rows[0][0]
)
🛠️ Refactor suggestion

Improve error handling for JSON parsing.

The jq command could fail if the HTTP response format changes. Add error handling for the JSON parsing.

         $size_actual=$(
             docker compose --file $COMPOSE_FILE run --rm httpie \
                 http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
-                | jq .rows[0][0]
+                | jq -r '.rows[0][0] // empty'
         )
+        if [ -z "$size_actual" ]; then
+            echo "ERROR: Failed to retrieve record count from CrateDB"
+            exit 1
+        fi
🤖 Prompt for AI Agents
In application/ingestr/kafka-demo.xsh around lines 141 to 145, the jq command
parsing the JSON response lacks error handling, which can cause failures if the
response format changes. Modify the script to check if jq successfully parses
the JSON and handle errors gracefully, such as by verifying the output before
using it or adding conditional logic to catch parsing failures and respond
accordingly.
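The defensive extraction both suggestions aim at can also be expressed in plain Python, sidestepping jq entirely (a sketch; the payload shapes are illustrative examples of CrateDB's `/_sql` response format):

```python
import json

def extract_count(payload: str):
    """Return rows[0][0] from a CrateDB /_sql response, or None on any malformed input."""
    try:
        return json.loads(payload)["rows"][0][0]
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None

print(extract_count('{"rows": [[45000]]}'))           # 45000
print(extract_count('{"error": {"message": "boom"}}'))  # None
```

Any unexpected shape — missing `rows` key, empty result, invalid JSON — collapses into a single `None` that the caller can check before proceeding.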

Member Author
9e82a83 makes it much better: it no longer relies on the container invocation, but performs the query in pure Python instead.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
application/ingestr/kafka-demo.xsh (1)

92-92: Add checksum verification for downloaded dataset.

The script downloads data from an external source without verifying its integrity, which could be a security risk and may cause issues if the file is corrupted.

Consider adding checksum verification after the wget command to ensure file integrity.
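A minimal shape for such a check (the digest shown is that of an empty file, purely so the sketch is self-contained; pin the dataset's real SHA-256 instead):

```shell
# Expected digest of the downloaded file; replace with the dataset's real SHA-256.
expected="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
: > dataset.ndjson   # stand-in for the wget-downloaded file
actual=$(sha256sum dataset.ndjson | awk '{print $1}')
if [ "$actual" != "$expected" ]; then
    echo "ERROR: checksum mismatch for dataset.ndjson" >&2
    exit 1
fi
echo "Checksum OK"
```

Placed directly after the wget command, this aborts the pipeline before corrupted or tampered data reaches Kafka.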

🧹 Nitpick comments (2)
application/ingestr/kafka-demo.xsh (2)

43-47: Consider adding error handling to stop operations.

While the start operations have good error handling, the stop operations (Docker Compose down and Kafka topic deletion) lack error checking. Consider adding similar error handling for consistency.

 def stop(self):
     title "Stopping services"
-    docker compose --file $COMPOSE_FILE down --remove-orphans
+    result = ![ docker compose --file $COMPOSE_FILE down --remove-orphans ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to stop services cleanly"
     echo "Done."
     return self

 def kafka_delete_topic(self):
     title "Deleting Kafka topic"
-    docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic
+    result = ![ docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to delete Kafka topic"
     echo "Done."
     return self

Also applies to: 66-70


110-112: Add error handling for Kafka data publishing.

The data publishing operation lacks error handling. Consider adding error checking similar to other Docker Compose operations.

 title "Publishing NDJSON data to Kafka topic"
-cat nyc-yellow-taxi-2017-subset.ndjson | docker compose --file $COMPOSE_FILE run --rm --no-TTY publish-topic
+result = ![ cat nyc-yellow-taxi-2017-subset.ndjson | docker compose --file $COMPOSE_FILE run --rm --no-TTY publish-topic ]
+if result.returncode != 0:
+    echo "ERROR: Failed to publish data to Kafka topic"
+    exit 1
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a61ed2 and 3b8c222.

📒 Files selected for processing (4)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/kafka-demo.xsh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • application/ingestr/README.md
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 GitHub Actions: Ingestr
application/ingestr/kafka-demo.xsh

[error] 22-22: SyntaxError: ('code: Print',) at line 22. Likely an invalid or malformed docstring or code syntax.

🪛 dotenv-linter (3.3.0)
application/ingestr/.env

[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly


[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly


[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly


[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly


[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly


[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key


[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly


[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key

🔇 Additional comments (12)
application/ingestr/.env (4)

1-5: LGTM: Kafka configuration is well-structured.

The Kafka port configuration follows standard conventions with appropriate default values for external broker, internal broker, and Zookeeper.


6-13: Valid shell parameter expansion syntax - ignore static analysis warnings.

The shell parameter expansion syntax ${VAR:-default} is correct and idiomatic for providing default values. The static analysis tool is generating false positives by flagging these as improper assignments. The CrateDB configuration provides sensible defaults for all connection parameters.


14-19: Good documentation for cloud configuration.

The commented-out CrateDB Cloud configuration provides helpful examples for users who need to connect to cloud instances rather than local Docker containers.


20-25: Well-designed URL construction for different use cases.

The computed URLs appropriately serve different needs - one with authentication for HTTPie and one without for direct local access. This follows good configuration practices by deriving values from base parameters.
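A sketch of that derivation in shell (variable names follow the `.env` file; the exact URL shapes are assumptions about its contents):

```shell
CRATEDB_HTTP_SCHEME="${CRATEDB_HTTP_SCHEME:-http}"
CRATEDB_HOST="${CRATEDB_HOST:-localhost}"
CRATEDB_HTTP_PORT="${CRATEDB_HTTP_PORT:-4200}"
CRATEDB_USERNAME="${CRATEDB_USERNAME:-crate}"
CRATEDB_PASSWORD="${CRATEDB_PASSWORD:-crate}"

# Authenticated URL for clients like HTTPie ...
CRATEDB_HTTP_URL_AUTH="${CRATEDB_HTTP_SCHEME}://${CRATEDB_USERNAME}:${CRATEDB_PASSWORD}@${CRATEDB_HOST}:${CRATEDB_HTTP_PORT}"
# ... and a bare URL for direct local access.
CRATEDB_HTTP_URL="${CRATEDB_HTTP_SCHEME}://${CRATEDB_HOST}:${CRATEDB_HTTP_PORT}"
echo "$CRATEDB_HTTP_URL"
```

With all defaults in place, this yields `http://crate:crate@localhost:4200` and `http://localhost:4200` respectively.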

application/ingestr/kafka-demo.xsh (8)

34-42: Good error handling implementation.

The error handling for Docker Compose startup has been properly implemented, checking the return code and exiting with a clear error message if the operation fails.


57-65: Consistent error handling for Kafka topic creation.

The error handling pattern is consistently applied to the Kafka topic creation, which is good for robustness.


94-99: Good implementation of tar extraction error handling.

The error handling for archive extraction has been properly implemented, checking the return code and exiting with a clear error message if extraction fails.


177-200: Excellent implementation using Python stdlib.

The query_cratedb function effectively uses Python's standard library HTTP client and JSON module instead of external tools like jq. This reduces dependencies and provides better error handling capabilities.
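Such a stdlib-only helper might look like this (a sketch, not the script's actual code; the `/_sql` endpoint and `{"stmt": ...}` payload follow CrateDB's HTTP interface, while `base_url` assumes a local instance):

```python
import json
import urllib.request

def query_cratedb(sql: str, base_url: str = "http://localhost:4200"):
    """POST a SQL statement to CrateDB's /_sql endpoint and return the result rows."""
    payload = json.dumps({"stmt": sql}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_sql",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["rows"]
```

Calling `query_cratedb('SELECT COUNT(*) FROM "kafka_demo"')` against a running instance would then return `[[<count>]]`, with no jq invocation and no extra container.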


135-146: Clean data display implementation.

The display method properly refreshes the table before querying and shows both sample records and total count, providing good visibility into the ingested data.


147-165: Robust verification with improved error handling.

The verification method now uses the Python-based query function instead of jq, providing better error handling and eliminating the JSON parsing concerns from previous reviews. The colored output and proper exit codes make it easy to identify success or failure.


167-176: Good helper function for visual formatting.

The title function provides clear section separation with appropriate color formatting, enhancing the script's readability during execution.


202-219: Well-orchestrated main function.

The main function provides clear pipeline orchestration with useful development shortcuts and proper environment variable handling for service lifecycle management.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
application/ingestr/kafka-demo.xsh (2)

92-92: Add checksum verification for downloaded dataset.

The script downloads data from an external source without verifying its integrity. This could be a security risk and may cause issues if the file is corrupted.


125-131: Add error handling for ingestr command.

The ingestr ingestion command lacks error handling. Since this is a critical operation, consider adding error checking to ensure the ingestion completes successfully.

🧹 Nitpick comments (4)
application/ingestr/.env (2)

10-11: Consider using more secure default credentials.

The credentials crate:crate are CrateDB's well-known defaults and should be changed in production environments. Consider adding a comment or documentation emphasizing this.

-export CRATEDB_USERNAME=${CRATEDB_USERNAME:-crate}
-export CRATEDB_PASSWORD=${CRATEDB_PASSWORD:-crate}
+export CRATEDB_USERNAME=${CRATEDB_USERNAME:-crate}  # Change in production
+export CRATEDB_PASSWORD=${CRATEDB_PASSWORD:-crate}  # Change in production

14-18: Improve the CrateDB Cloud configuration section.

The commented-out CrateDB Cloud configuration could be more helpful with clearer instructions.

-# Data sink configuration (CrateDB Cloud).
-# export CRATEDB_HTTP_SCHEME=https
-# export CRATEDB_HOST=example.aks1.westeurope.azure.cratedb.net
-# export CRATEDB_USERNAME='admin'
-# export CRATEDB_PASSWORD='<PASSWORD>'
+# Data sink configuration (CrateDB Cloud).
+# Uncomment and modify the following lines when using CrateDB Cloud:
+# export CRATEDB_HTTP_SCHEME=https
+# export CRATEDB_HOST=your-cluster.region.provider.cratedb.net
+# export CRATEDB_USERNAME='your-username'
+# export CRATEDB_PASSWORD='your-password'
application/ingestr/kafka-demo.xsh (2)

68-70: Add error handling for Kafka topic deletion.

The kafka_delete_topic method lacks error handling similar to the create topic method. While this is less critical than creation, it should still be handled consistently.

 def kafka_delete_topic(self):
     title "Deleting Kafka topic"
-    docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic
+    result = ![ docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to delete Kafka topic (may not exist)"
     echo "Done."
     return self

161-162: Consider using return codes instead of exit() for better modularity.

The script uses exit() calls which make it difficult to integrate as a module. Consider returning error codes or raising exceptions instead.

-            # FIXME: Exiting from code is not the nicest thing. But well, this is currently a script anyway.
-            exit 2
+            # Return error code for better modularity
+            return False

Then modify the main function to handle the return value appropriately.
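A sketch of that pattern (names hypothetical; `verify` stands in for the method in question):

```python
def verify(size_actual: int, size_expected: int) -> bool:
    """Report a mismatch and return a status instead of terminating the interpreter."""
    if size_actual != size_expected:
        print(f"ERROR: expected {size_expected} records, found {size_actual}")
        return False
    return True

def main() -> int:
    # The exit status is decided in exactly one place, keeping the module importable.
    return 0 if verify(45_000, 45_000) else 2

if __name__ == "__main__":
    raise SystemExit(main())
```

Importers can call `verify()` and branch on the boolean, while command-line runs still exit with code 2 on failure.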

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b8c222 and 4326485.

📒 Files selected for processing (3)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/kafka-demo.xsh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env

[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly


[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly


[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly


[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly


[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly


[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key


[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly


[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key

🔇 Additional comments (5)
application/ingestr/.env (2)

7-12: Parameter expansion syntax is correct despite linter warnings.

The shell parameter expansion syntax ${VAR:-default} is valid and appropriate for providing default values. The static analysis warnings about "SubstitutionKey" appear to be false positives.


21-24: URL construction logic is well-designed.

The separation of authenticated (for HTTPie) and non-authenticated (for Python urllib) URLs is thoughtful and addresses different client authentication requirements effectively.

application/ingestr/kafka-demo.xsh (3)

11-12: Good configuration of debugging and telemetry settings.

The traceback and telemetry disable settings are appropriate for a demo script, providing better debugging experience while respecting privacy.


177-200: Excellent implementation of native HTTP client.

The query_cratedb function effectively replaces the need for external dependencies like jq by using Python's standard library. This addresses the concern raised in past comments about reducing external dependencies.


206-216: Well-structured main execution flow.

The main function demonstrates good separation of concerns with infrastructure and data handling, plus useful fast-path options for development and debugging.

@amotl amotl merged commit 30e6ec6 into main Jul 18, 2025
2 checks passed
@amotl amotl deleted the ingestr branch July 18, 2025 12:03