Ingestr: Add runnable example loading from Kafka #1013
Conversation
Walkthrough
A new integration test and demonstration suite for loading data from Kafka into CrateDB using ingestr.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant User
    participant Test Script (test.sh)
    participant Kafka Demo (kafka-demo.xsh)
    participant Docker Compose
    participant Kafka
    participant CrateDB
    participant Ingestr
    User->>Test Script (test.sh): Run integration test
    Test Script (test.sh)->>Kafka Demo (kafka-demo.xsh): Execute demo
    Kafka Demo (kafka-demo.xsh)->>Docker Compose: Start Kafka & CrateDB
    Kafka Demo (kafka-demo.xsh)->>Kafka: Create topic, publish data
    Kafka Demo (kafka-demo.xsh)->>Ingestr: Ingest data from Kafka to CrateDB
    Ingestr->>CrateDB: Write records
    Kafka Demo (kafka-demo.xsh)->>CrateDB: Query for verification
    Kafka Demo (kafka-demo.xsh)->>Docker Compose: Tear down stack (optional)
```
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env
[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key
🔇 Additional comments (6)
Force-pushed from ceec7a9 to 732a4be.
Actionable comments posted: 5
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
38-46: `title()` may use colors before they are initialised
`title` prints `${BYELLOW}` but `init_colors` might not have been called yet by a script that merely sources this file and directly calls `title`. Either:
- Call `init_colors` lazily inside `title` when `BYELLOW` is empty (see the sketch below), or
- Document that callers must invoke `init_colors` first.
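A minimal sketch of the first option, assuming `init_colors` defines `BYELLOW` and `NC` as in util.sh (the guard lines are omitted for brevity):
```shell
function title() {
    # Lazily initialise colors in case the caller has not done so yet.
    [ -n "${BYELLOW:-}" ] || init_colors
    echo -e "${BYELLOW}$1${NC}"
}
```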
application/ingestr/test.sh (1)
5-9: Suppress command lookup noise & hard failures
`command -v uv` prints the path on success and errors on failure. Redirect stdout/stderr and explicitly fail when installation fails.
```diff
-if ! command -v uv; then
-    pip install uv
+if ! command -v uv >/dev/null 2>&1; then
+    pip install uv || {
+        echo "Failed to install uv" >&2
+        exit 1
+    }
 fi
```
application/ingestr/kafka-cmd.sh (1)
124-129: Quote variables in arithmetic comparison (SC2053)
Unquoted variables may trigger glob expansion or word-splitting.
```diff
-if [[ ${size_actual} = ${size_reference} ]]; then
+if [[ "${size_actual}" = "${size_reference}" ]]; then
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/dependabot.yml (1 hunks)
- .github/workflows/application-ingestr.yml (1 hunks)
- application/ingestr/.dlt/config.toml (1 hunks)
- application/ingestr/.env (1 hunks)
- application/ingestr/.gitignore (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-cmd.sh (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
- application/ingestr/util.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- application/ingestr/.dlt/config.toml
- application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (6)
- application/ingestr/requirements.txt
- application/ingestr/.gitignore
- .github/dependabot.yml
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
- application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup(36-42)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[warning] 124-124: Quote the right-hand side of = in [[ ]] to prevent glob matching.
(SC2053)
application/ingestr/test.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
application/ingestr/util.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
Actionable comments posted: 1
♻️ Duplicate comments (5)
application/ingestr/util.sh (2)
1-4: Add she-bang & strict mode header
Still missing after the previous review – scripts that source this file get no safety nets and ShellCheck keeps complaining (SC2148).
```diff
+#/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
```
5-14: `getopt` still runs at sourcing time – breaks callers
Executing `getopt` before `read_options()` means every script that merely sources `util.sh` has its own `$@` parsed and may exit on bad options. Same concern was raised earlier but not addressed.
```diff
-options=$(getopt --options k --longoptions keepalive -- "$@")
-function read_options() {
+function read_options() {
+    local options
+    options=$(getopt --options k --longoptions keepalive -- "$@")
     # Call getopt to validate the provided input.
```
application/ingestr/test.sh (1)
1-4: Script lacks she-bang & strict mode
Identical comment existed in the last round; please fix to avoid SC2148 and silent failures.
```diff
+#/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
```
application/ingestr/kafka-cmd.sh (2)
16-18: Sourcing files via relative paths is still brittle
Running the script from outside its directory will fail to find `.env` and `util.sh`. Previous refactor suggestion stands.
```diff
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-source .env
-source util.sh
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/.env"
+source "${SCRIPT_DIR}/util.sh"
```
191-192: CLI parsing disabled – `SUBCOMMAND` is never set
`read_options` is still commented out, yet `start_subcommand` relies on `SUBCOMMAND`. Either enable parsing or remove dead code.
```diff
-    #read_options
+    read_options "$@"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml(1 hunks).github/workflows/application-ingestr.yml(1 hunks)application/ingestr/.dlt/config.toml(1 hunks)application/ingestr/.env(1 hunks)application/ingestr/.gitignore(1 hunks)application/ingestr/README.md(1 hunks)application/ingestr/kafka-cmd.sh(1 hunks)application/ingestr/kafka-compose.yml(1 hunks)application/ingestr/requirements.txt(1 hunks)application/ingestr/test.sh(1 hunks)application/ingestr/util.sh(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- application/ingestr/.gitignore
- .github/dependabot.yml
- application/ingestr/README.md
- application/ingestr/.env
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title (38-46)
init_colors (48-70)
application/ingestr/test.sh (1)
setup(5-9)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup(36-41)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[warning] 144-144: Quote the right-hand side of = in [[ ]] to prevent glob matching.
(SC2053)
application/ingestr/test.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
application/ingestr/util.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
Actionable comments posted: 3
♻️ Duplicate comments (2)
application/ingestr/test.sh (1)
1-2: Correct the she-bang
Same issue as in `util.sh`: add `!` so the script can be run directly.
```diff
-#/usr/bin/env bash
+#!/usr/bin/env bash
```
application/ingestr/kafka-cmd.sh (1)
193-195: CLI parsing still disabled – previous remark stands
`read_options` remains commented out, so `SUBCOMMAND` and `KEEPALIVE` are never populated.
Uncomment and forward `"$@"`, or delete the dead code.
```diff
-    #read_options
+    read_options "$@"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/dependabot.yml (1 hunks)
- .github/workflows/application-ingestr.yml (1 hunks)
- application/ingestr/.dlt/config.toml (1 hunks)
- application/ingestr/.env (1 hunks)
- application/ingestr/.gitignore (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-cmd.sh (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
- application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/.gitignore
- application/ingestr/README.md
- .github/dependabot.yml
- application/ingestr/requirements.txt
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
- .github/workflows/application-ingestr.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title (43-51)
init_colors (53-75)
application/ingestr/test.sh (1)
setup(10-14)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup(38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/test.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/util.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
Force-pushed from dd46448 to bdb738c.
Actionable comments posted: 0
♻️ Duplicate comments (5)
application/ingestr/util.sh (1)
1-2: Fix the malformed she-bang (again)
The first line still misses the exclamation mark and therefore isn’t recognised as an executable script.
```diff
-#/usr/bin/env bash
+#!/usr/bin/env bash
```
application/ingestr/test.sh (1)
1-6: Add proper she-bang + strict mode
```diff
-#/usr/bin/env bash
+#!/usr/bin/env bash
+set -euo pipefail
```
`pipefail` is important here because later commands rely on pipes (`|`) to detect failures.
application/ingestr/kafka-cmd.sh (3)
1-2: She-bang still malformed
Same issue as in the other scripts—add the missing `!`.
```diff
-#/usr/bin/env bash
+#!/usr/bin/env bash
```
185-196: CLI parsing still disabled – dead code path
`read_options` remains commented out, meaning `SUBCOMMAND`/`KEEPALIVE` can never be set from the CLI. Either re-enable the call or delete the unused logic.
```diff
-    #read_options
+    read_options "$@"
```
22-23: Make `COMPOSE_FILE` path-safe
Running the script outside its directory fails because the compose file is looked up relative to `$PWD`.
```diff
-COMPOSE_FILE=kafka-compose.yml
+COMPOSE_FILE="${SCRIPT_DIR}/kafka-compose.yml"
```
🧹 Nitpick comments (2)
application/ingestr/util.sh (1)
44-51: Quote and localise variables inside `title()`
Unquoted expansions will break on spaces/new-lines and the function leaks globals.
```diff
-function title() {
-    text=$1
-    len=${#text}
+function title() {
+    local text=$1
+    local len=${#text}

-    echo ${guard}
-    echo -e "${BYELLOW}${text}${NC}"
-    echo ${guard}
+    echo "${guard}"
+    echo -e "${BYELLOW}${text}${NC}"
+    echo "${guard}"
```
application/ingestr/test.sh (1)
10-14: Ensure `uvx` is available after installing uv
`pip install uv` puts the binary under the Python installation’s `bin` directory, which might not be on `$PATH` in CI runners. Consider:
```shell
python -m pip install --user uv
export PATH="$HOME/.local/bin:$PATH"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/dependabot.yml (1 hunks)
- .github/workflows/application-ingestr.yml (1 hunks)
- application/ingestr/.dlt/config.toml (1 hunks)
- application/ingestr/.env (1 hunks)
- application/ingestr/.gitignore (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-cmd.sh (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
- application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- .github/dependabot.yml
- application/ingestr/.gitignore
- .github/workflows/application-ingestr.yml
- application/ingestr/README.md
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup(38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/test.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/util.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
Actionable comments posted: 2
♻️ Duplicate comments (1)
application/ingestr/kafka-cmd.sh (1)
192-196: CLI parsing remains disabled – `SUBCOMMAND` & `KEEPALIVE` never set
`read_options` is still commented out, so the flags & subcommand logic are dead code and `KEEPALIVE` will always be empty. Either wire the parser back in or drop the unused branches.
```diff
-    #read_options
+    read_options "$@"
```
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
43-51: Quote expansions inside `title()`
`echo ${guard}` and `echo -e "${BYELLOW}${text}${NC}"` rely on word-splitting rules; titles containing spaces or glob characters will break/expand unexpectedly.
```diff
-    echo
-    echo ${guard}
-    echo -e "${BYELLOW}${text}${NC}"
-    echo ${guard}
+    echo
+    echo "${guard}"
+    echo -e "${BYELLOW}${text}${NC}"
+    echo "${guard}"
```
application/ingestr/kafka-cmd.sh (2)
28-36: Quote path variables passed to Docker Compose
`${COMPOSE_FILE}` may contain spaces (e.g. when the repo is checked out into “/tmp/ingestr demo/”). Quote it to avoid argument splitting.
```diff
-    docker compose --file ${COMPOSE_FILE} up --detach
+    docker compose --file "${COMPOSE_FILE}" up --detach
```
Apply the same quoting pattern at lines 35-36, 57-65, 105-157, etc.
88-100: Avoid useless `cat` in data subset creation
`cat file | head …` spawns an unnecessary process; head can read the file directly.
```diff
-    cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+    head -n 5000 nyc-yellow-taxi-2017.json > nyc-yellow-taxi-2017-subset.ndjson
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/dependabot.yml (1 hunks)
- .github/workflows/application-ingestr.yml (1 hunks)
- application/ingestr/.dlt/config.toml (1 hunks)
- application/ingestr/.env (1 hunks)
- application/ingestr/.gitignore (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-cmd.sh (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
- application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- application/ingestr/requirements.txt
- application/ingestr/.dlt/config.toml
- application/ingestr/test.sh
- .github/dependabot.yml
- application/ingestr/.gitignore
- application/ingestr/README.md
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
- application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title (43-51)
init_colors (53-75)
application/ingestr/test.sh (1)
setup(10-14)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
```yaml
# Run job each night after CrateDB nightly has been published.
schedule:
  - cron: '0 3 * * *'
```
Nice to avoid bit rot - thanks
application/ingestr/requirements.txt (Outdated)
```diff
@@ -0,0 +1 @@
+ingestr @ git+https://github.com/crate-workbench/ingestr.git@next
```
Backlog: This needs to be replaced by a real package as we go. Because ingestr might currently not be the topmost priority for its authors, we may need to friendly-fork and run separate releases to gain speed and bundling convenience until individual patches make it into mainline.
Many patches are queuing up already, some of them crucial for operating ingestr with CrateDB, some convenient, and some currently impossible to merge.
Bundling all of them into the `next` branch makes sense, and from there, packaging the bundle and publishing it to PyPI is just a minor additional step, because automation on GHA will do all the work for us. Adding a changelog will inform users about what's inside, and actual software releases will give users the ability to deterministically roll back and forth as we go, carrying CrateDB support for ingestr forward as a special animal, until it isn't.
- Tests: Permit running tests without `xdist` bruin-data/ingestr#283
- Tests: Adjust software tests for CrateDB bruin-data/ingestr#284
- CI: Add test matrix slot for Python 3.13 bruin-data/ingestr#285
- Tests: Improve isolation through resource teardown by closing/disposing connections etc. bruin-data/ingestr#287
- CLI: For the `columns` option, do not use the `COLUMNS` shell variable bruin-data/ingestr#288
- Naming things: Rename `_kafka_msg_id` to `_kafka__msg_id` bruin-data/ingestr#289
- Chore: Disable dlt telemetry for sandbox and CI bruin-data/ingestr#290
- Dependencies: Update to `dlt-cratedb==0.0.2` bruin-data/ingestr#294
- Chore: Format code using `make lint-ci` bruin-data/ingestr#297
Actionable comments posted: 1
♻️ Duplicate comments (3)
application/ingestr/util.sh (1)
1-6: Add `pipefail` & IFS to complete strict-mode
The script still omits `set -o pipefail` (and a sane `IFS`) even after earlier feedback. Silent failures inside pipelines can slip through.
```diff
 #!/usr/bin/env bash
-set -eu
+set -euo pipefail
+IFS=$'\n\t'
```
application/ingestr/kafka-cmd.sh (2)
193-196: CLI parsing still disabled — dead code persists
`read_options` remains commented, so `SUBCOMMAND` is never set and related logic is inert. Either uncomment or remove the sub-command machinery.
16-17: Enable `pipefail` for reliable failure propagation
Without `pipefail`, any failure on the left side of a pipeline is ignored, which is dangerous for commands like `cat | docker …` and HTTP queries.
```diff
-set -eu
+set -euo pipefail
```
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
43-51: Scope internal vars in `title()` to avoid side-effects
`text`, `len`, and `guard` are written as globals; downstream scripts may accidentally override them.
```diff
-function title() {
-    text=$1
-    len=${#text}
-    guard=$(printf "%${len}s" | sed 's/ /=/g')
+function title() {
+    local text=$1
+    local len=${#text}
+    local guard
+    guard=$(printf '%*s' "${len}" '' | sed 's/ /=/g')
```
application/ingestr/kafka-cmd.sh (1)
28-36: Quote `${COMPOSE_FILE}` to handle spaces & special chars
Unquoted path variables can break if the script lives in a directory with spaces.
```diff
-    docker compose --file ${COMPOSE_FILE} up --detach
+    docker compose --file "${COMPOSE_FILE}" up --detach
```
Apply the same quoting to every later `docker compose --file` invocation.
application/ingestr/README.md (1)
1-6: Minor Markdown spacing / heading style
A blank line after the main heading improves readability and prevents some renderers from merging the following heading.
```diff
-# Use CrateDB with ingestr
-## About
+# Use CrateDB with ingestr
+
+## About
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/dependabot.yml (1 hunks)
- .github/workflows/application-ingestr.yml (1 hunks)
- application/ingestr/.dlt/config.toml (1 hunks)
- application/ingestr/.env (1 hunks)
- application/ingestr/.gitignore (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-cmd.sh (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
- application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- .github/dependabot.yml
- application/ingestr/test.sh
- application/ingestr/.gitignore
- .github/workflows/application-ingestr.yml
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧠 Learnings (1)
application/ingestr/README.md (1)
Learnt from: amotl
PR: crate/cratedb-examples#937
File: topic/machine-learning/llm-langchain/requirements-dev.txt:2-2
Timestamp: 2025-05-12T20:10:38.614Z
Learning: The cratedb-toolkit package supports various extras including "io", "datasets", "influxdb", "mongodb", "testing", and many others.
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title (43-51)
init_colors (53-75)
application/ingestr/test.sh (1)
setup(10-14)
🪛 LanguageTool
application/ingestr/README.md
[grammar] ~3-~3: Use proper spacing conventions.
Context: # Use CrateDB with ingestr ## About [ingestr] is a command-line application t...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~5-~5: There might be a mistake here.
Context: ...CrateDB with ingestr ## About [ingestr] is a command-line application that allo...
(QB_NEW_EN_OTHER)
[grammar] ~5-~5: There might be a mistake here.
Context: ...ine application that allows copying data from any source into any destination dat...
(QB_NEW_EN_OTHER)
[grammar] ~6-~6: Use proper spacing conventions.
Context: ...ny source into any destination database. This folder includes runnable examples t...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~8-~8: Use proper spacing conventions.
Context: ... examples that use ingestr with CrateDB. They are also used as integration tests ...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~9-~9: There might be a mistake here.
Context: ...tion tests to ensure software components fit together well. ## Usage To start c...
(QB_NEW_EN_OTHER)
[grammar] ~10-~10: Use proper spacing conventions.
Context: ...e software components fit together well. ## Usage To start cycling without tearing ...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~12-~12: Use proper spacing conventions.
Context: ... components fit together well. ## Usage To start cycling without tearing down th...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~14-~14: There might be a mistake here.
Context: ...earing down the backend stack each time, use the KEEPALIVE environment variable...
(QB_NEW_EN_OTHER)
[grammar] ~15-~15: Use proper spacing conventions.
Context: ...se the KEEPALIVE environment variable. shell export KEEPALIVE=true sh test.sh [ingestr]: https://bruin-data.github.io/ingestr/
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
# Use CrateDB with ingestr

## About

[ingestr] is a command-line application that allows copying data
from any source into any destination database.

This folder includes runnable examples that use ingestr with CrateDB.
They are also used as integration tests to ensure software components
fit together well.

## Usage

To start cycling without tearing down the backend stack each time,
use the `KEEPALIVE` environment variable.
```shell
export KEEPALIVE=true
sh test.sh
```

[ingestr]: https://bruin-data.github.io/ingestr/
Backlog:
- Add "What's inside" section, to educate readers about the bunch of files presented here, and how to use them.
- Add "Synopsis" section including a typical `ingestr ingest` command to quickly present what it is actually all about; see the sketch below.
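For illustration, such a synopsis might show an invocation along these lines, reusing the connection parameters from the demo (ports and credentials are the demo defaults, not a recommendation):
```shell
ingestr ingest --yes \
    --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test_group&value_type=json&select=value" \
    --source-table "demo" \
    --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
    --dest-table "doc.kafka_demo"
```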
Added with 9f50e4e.
application/ingestr/kafka-cmd.sh
Outdated
```shell
function verify-data() {
    title "Verifying data in CrateDB"
    size_reference=5000
    size_actual=$(
        docker compose --file ${COMPOSE_FILE} run --rm httpie \
            http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
            | jq .rows[0][0]
    )
    if [[ "${size_actual}" = "${size_reference}" ]]; then
        echo -e "${BGREEN}SUCCESS: Database table contains expected number of ${size_reference} records.${NC}"
    else
        echo -e "${BRED}ERROR: Expected database table to contain ${size_reference} records, but it contains ${size_actual} records.${NC}"
        exit 2
    fi
}
```
Backlog: Data validation is currently a bit thin, just counting the number of records and failing if it's not 5_000. Please expand the procedure to also check whether the data arrived in the right format within the target table, for example as sketched below. Also, why in hell is this written in Bash?
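A rough sketch of such an additional check, using crash (the CrateDB shell) to eyeball a few ingested rows; the table name and endpoint are the demo defaults, and the exact column layout that ingestr produces is an assumption worth verifying:
```shell
# Install the CrateDB shell and inspect a few ingested records.
pip install crash
crash --hosts localhost:4200 \
    --command 'SELECT * FROM "doc"."kafka_demo" LIMIT 3;'
```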
The joy of shell scripting?
With e580c61, we've removed Bashisms here and added more flavours of Python instead, using Xonsh now, retaining all the joys of shell scripting¹ while adding more sanity after just a little learning curve.
Footnotes
1. Which in turn retains the capability to copy/paste each individual fragment of the whole shebang into successive shell invocations. ↩
```yaml
  # ---------------
  # Confluent Kafka
  # ---------------
  # https://docs.confluent.io/platform/current/installation/docker/config-reference.html
  # https://gist.github.com/everpeace/7a317860cab6c7fb39d5b0c13ec2543e
  # https://github.com/framiere/a-kafka-story/blob/master/step14/docker-compose.yml
  kafka-zookeeper:
    image: confluentinc/cp-zookeeper:${CONFLUENT_VERSION}
    environment:
      ZOOKEEPER_CLIENT_PORT: ${PORT_KAFKA_ZOOKEEPER}
      KAFKA_OPTS: -Dzookeeper.4lw.commands.whitelist=ruok
    networks:
      - ingestr-demo

    # Define health check for Zookeeper.
    healthcheck:
      # https://github.com/confluentinc/cp-docker-images/issues/827
      test: ["CMD", "bash", "-c", "echo ruok | nc localhost ${PORT_KAFKA_ZOOKEEPER} | grep imok"]
      start_period: 3s
      interval: 2s
      timeout: 30s
      retries: 60

  kafka-broker:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    ports:
      - "${PORT_KAFKA_BROKER_INTERNAL}:${PORT_KAFKA_BROKER_INTERNAL}"
      - "${PORT_KAFKA_BROKER_EXTERNAL}:${PORT_KAFKA_BROKER_EXTERNAL}"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: kafka-zookeeper:${PORT_KAFKA_ZOOKEEPER}
      KAFKA_LISTENERS: INTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_EXTERNAL}
      KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-broker:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://localhost:${PORT_KAFKA_BROKER_EXTERNAL}
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - kafka-zookeeper
    networks:
      - ingestr-demo

    # Define health check for Kafka broker.
    healthcheck:
      #test: ps augwwx | egrep "kafka.Kafka"
      test: ["CMD", "nc", "-vz", "localhost", "${PORT_KAFKA_BROKER_INTERNAL}"]
      start_period: 3s
      interval: 0.5s
      timeout: 30s
      retries: 60
```
Backlog: Please trim this down. A bare-bones setup of Kafka doesn't need any Zookeeper at all. See also Tutorial: Loading data from Apache Kafka into CrateDB for a very minimal example, as suggested by @hlcianfagna -- thanks!
It looks like vanilla Kafka 4.0.0 per docker.io/apache/kafka has a deviating configuration syntax for the variables defined above. When using them, it fails on startup:
Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration "process.roles" which has no default value.

When using earlier versions like Kafka 3.9.1, it needs a Zookeeper again.
Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration `zookeeper.connect` which has no default value.
The recipe is now using Apache Kafka 4.0.0, which in turn uses KRaft, so it no longer needs Zookeeper.
Apache Kafka Raft (KRaft) is the consensus protocol that was introduced in KIP-500 to remove Apache Kafka’s dependency on ZooKeeper for metadata management.
This greatly simplifies Kafka’s architecture by consolidating responsibility for metadata into Kafka itself, rather than splitting it between two different systems: ZooKeeper and Kafka. KRaft mode makes use of a new quorum controller service in Kafka which replaces the previous controller and makes use of an event-based variant of the Raft consensus protocol.
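For comparison, a minimal single-node KRaft setup can be sketched with the stock apache/kafka image (image tag, port, and tool path assumed; this is not the recipe used here):
```shell
# Start a single-node Kafka 4.0 broker in KRaft mode; no ZooKeeper container needed.
docker run -d --name kafka -p 9092:9092 apache/kafka:4.0.0

# Create the demo topic using the CLI tools bundled in the image.
docker exec kafka /opt/kafka/bin/kafka-topics.sh \
    --bootstrap-server localhost:9092 --create --if-not-exists --topic demo
```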
```yaml
  # -----
  # Tasks
  # -----

  # Create database table in CrateDB.
  create-table:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='CREATE TABLE "kafka_demo" ("payload" OBJECT(DYNAMIC))' --ignore-stdin
    deploy:
      replicas: 0

  # Create Kafka topic.
  create-topic:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --create --if-not-exists --replication-factor 1 --partitions 1 --topic demo
    deploy:
      replicas: 0

  # Delete Kafka topic.
  delete-topic:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --delete --if-exists --topic demo
    deploy:
      replicas: 0

  # Drop database table in CrateDB.
  drop-table:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='DROP TABLE IF EXISTS "kafka_demo"' --ignore-stdin
    deploy:
      replicas: 0

  # Invoke HTTPie via Docker.
  httpie:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    deploy:
      replicas: 0

  # Publish data to Kafka topic.
  publish-data:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-console-producer --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --topic demo
    deploy:
      replicas: 0

  # Subscribe to Kafka topic.
  subscribe-topic:
    image: edenhill/kcat:${KCAT_VERSION}
    networks: [ingestr-demo]
    command: kcat -b kafka-broker -C -t demo  # -o end
    deploy:
      replicas: 0
```
Let's get rid of this and just perform all tasks from userspace, possibly entering the Kafka container to invoke utility programs like kafka-topics.sh or kafka-console-producer.sh, as @hlcianfagna was doing?
podman exec -it kafka "bash"
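As a sketch of that userspace flow (container name and tool path assumed, record content purely illustrative):
```shell
# Enter the Kafka container ...
podman exec -it kafka bash

# ... and, inside it, publish a test record to the demo topic.
echo '{"hello": "world"}' | /opt/kafka/bin/kafka-console-producer.sh \
    --bootstrap-server localhost:9092 --topic demo
```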
We trimmed down the recipe in other ways a bit, and also reduced arcaneness by using Xonsh instead of Bash. In this spirit, we will keep this section as it is, to keep the promise that userspace needs nothing but Docker and Python.
Actionable comments posted: 1
🧹 Nitpick comments (2)
application/ingestr/kafka-demo.xsh (2)
91-91: Improve robustness of data subset creation.
The current approach using `head -n 5000` assumes the file has at least 5000 lines. Consider adding error handling or using a more robust approach.
```diff
-    cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+    # Create subset with error handling
+    if [ $(wc -l < nyc-yellow-taxi-2017.json) -ge 5000 ]; then
+        cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+    else
+        echo "Warning: Dataset has fewer than 5000 records, using all available data"
+        cp nyc-yellow-taxi-2017.json nyc-yellow-taxi-2017-subset.ndjson
+    fi
```
140-156: Consider using proper exception handling instead of exit().
Using `exit 2` directly in the script is not the cleanest approach. Consider raising an exception or returning an error status that can be handled by the caller.
```diff
-    # FIXME: Exiting from code is not the nicest thing. But well, this is a script anyway.
-    exit 2
+    raise RuntimeError(f"Expected database table to contain at least {$size_reference} records, but it contains {$size_actual} records.")
```
Then handle this in the main function:
```diff
 def main():
+    try:
         infra = Infrastructure()
         data = Datawrapper()
         # Standard path.
         infra.start().setup()
         data.acquire().publish().load().display().verify()
         if not __xonsh__.env.get("KEEPALIVE"):
             infra.stop()
+    except RuntimeError as e:
+        print_color("{BOLD_RED}")
+        echo f"ERROR: {e}"
+        print_color("{RESET}")
+        return 2
+    return 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- application/ingestr/.env (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/kafka-demo.xsh (1 hunks)
- application/ingestr/requirements.txt (1 hunks)
- application/ingestr/test.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (3)
- application/ingestr/requirements.txt
- application/ingestr/test.sh
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env
[warning] 5-5: [UnorderedKey] The CRATEDB_VERSION key should go before the KAFKA_VERSION key
[warning] 16-16: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 17-17: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 18-18: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 19-19: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 20-20: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 20-20: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 21-21: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 21-21: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key
🔇 Additional comments (7)
application/ingestr/.env (3)
16-21: LGTM: Parameter expansion syntax is correct.
The bash parameter expansion syntax `${VAR:-default}` is correctly used for providing default values. The static analysis tool warnings about "SubstitutionKey" are false positives - this is the standard and correct way to set defaults in shell environment files.
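For illustration only (variable name borrowed from the .env file, values hypothetical), the pattern behaves like this:
```shell
unset CRATEDB_HOST
echo "${CRATEDB_HOST:-localhost}"     # prints "localhost" (fallback applies)

CRATEDB_HOST=cratedb.example.org
echo "${CRATEDB_HOST:-localhost}"     # prints "cratedb.example.org" (existing value wins)
```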
30-30: LGTM: HTTP URL construction is correct.
The URL construction properly combines all the required components (scheme, credentials, host, port) into a valid HTTP URL format for the CrateDB connection.
4-4: No action needed: Kafka version 4.0.0 is confirmed stable
Kafka 4.0.0 was released March 18, 2025 and is the latest stable release. The specified version in application/ingestr/.env (`export KAFKA_VERSION=4.0.0`) is correct and compatible—no changes required.
application/ingestr/kafka-demo.xsh (4)
11-12: LGTM: Good configuration for demo environment.
Setting `XONSH_SHOW_TRACEBACK = True` for debugging and `INGESTR_DISABLE_TELEMETRY = True` for privacy are appropriate choices for a demo script.
113-118: LGTM: Well-structured ingestr invocation.
The ingestr command is properly structured with clear source and destination URIs. Using uvx with a specific Python version and requirements ensures reproducible execution.
161-171: LGTM: Clean utility function implementation.
The title function provides nice visual separation with proper color formatting and is well-implemented.
184-184: LGTM: Appropriate environment variable check.
Using `__xonsh__.env.get("KEEPALIVE")` is the correct way to check for environment variables in Xonsh with a default of None/False.
application/ingestr/kafka-demo.xsh
Outdated
```
$size_actual=$(
    docker compose --file $COMPOSE_FILE run --rm httpie \
        http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
        | jq .rows[0][0]
)
```
This command still needs jq to be installed on your workstation. Maybe it's better to use crash instead?
Probably, it's better to use Python stdlib's built-in HTTP client and the JSON module natively, to further reduce noise here.
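A minimal sketch of that approach, assuming the demo's default CrateDB HTTP endpoint at localhost:4200:
```shell
# Query the record count via CrateDB's HTTP endpoint using only the Python standard library.
python3 - <<'EOF'
import json
import urllib.request

request = urllib.request.Request(
    "http://localhost:4200/_sql",
    data=json.dumps({"stmt": 'SELECT COUNT(*) FROM "kafka_demo"'}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    count = json.load(response)["rows"][0][0]
print(count)
EOF
```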
9e82a83 makes it much better, no longer relying on the container invocation, but doing it with pure Python instead.
Actionable comments posted: 4
♻️ Duplicate comments (1)
application/ingestr/kafka-demo.xsh (1)
85-85: Add checksum verification for downloaded dataset.
The script downloads data from an external source without verifying its integrity. This could be a security risk and may cause issues if the file is corrupted.
Consider adding checksum verification:
```diff
-    wget --no-clobber --continue "https://gist.githubusercontent.com/kovrus/328ba1b041dfbd89e55967291ba6e074/raw/7818724cb64a5d283db7f815737c9e198a22bee4/nyc-yellow-taxi-2017.tar.gz"
+    wget --no-clobber --continue "https://gist.githubusercontent.com/kovrus/328ba1b041dfbd89e55967291ba6e074/raw/7818724cb64a5d283db7f815737c9e198a22bee4/nyc-yellow-taxi-2017.tar.gz"
+    # TODO: Add checksum verification
+    # echo "expected_checksum nyc-yellow-taxi-2017.tar.gz" | sha256sum -c
```
🧹 Nitpick comments (2)
application/ingestr/.env (1)
30-31: Add missing newline at end of file.
```diff
 export CRATEDB_HTTP_URL="${CRATEDB_HTTP_SCHEME}://${CRATEDB_USERNAME}:${CRATEDB_PASSWORD}@${CRATEDB_HOST}:${CRATEDB_HTTP_PORT}"
+
```
application/ingestr/kafka-demo.xsh (1)
113-118: Consider using environment variables for ingestr configuration.
The ingestr command has hardcoded connection parameters. Consider using environment variables for better flexibility and consistency with the rest of the configuration.
```diff
 uvx --python=3.12 --prerelease=allow --with-requirements=requirements.txt \
     ingestr ingest --yes \
-    --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test_group&value_type=json&select=value" \
+    --source-uri "kafka://?bootstrap_servers=localhost:$PORT_KAFKA_BROKER_EXTERNAL&group_id=test_group&value_type=json&select=value" \
     --source-table "demo" \
-    --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
+    --dest-uri "cratedb://$CRATEDB_USERNAME:$CRATEDB_PASSWORD@$CRATEDB_HOST:$CRATEDB_POSTGRESQL_PORT/?sslmode=disable" \
     --dest-table "doc.kafka_demo"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
application/ingestr/.env(1 hunks)application/ingestr/README.md(1 hunks)application/ingestr/kafka-compose.yml(1 hunks)application/ingestr/kafka-demo.xsh(1 hunks)application/ingestr/requirements.txt(1 hunks)application/ingestr/test.sh(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- application/ingestr/requirements.txt
- application/ingestr/README.md
- application/ingestr/test.sh
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env
[warning] 5-5: [UnorderedKey] The CRATEDB_VERSION key should go before the KAFKA_VERSION key
[warning] 16-16: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 17-17: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 18-18: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 19-19: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 20-20: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 20-20: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 21-21: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 21-21: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key
🔇 Additional comments (5)
application/ingestr/.env (2)
16-21: Static analysis warnings are false positives.
The dotenv-linter warnings about `SubstitutionKey` are false positives. The shell parameter expansion syntax `${VAR:-default}` is correct and widely used for providing default values in environment files.
4-4: Confirm Kafka 4.0.0 stability and KRaft compatibility
- Location: application/ingestr/.env, line 4 (`export KAFKA_VERSION=4.0.0`)
- Kafka 4.0.0 was officially released on March 18, 2025 and introduces KRaft mode (no ZooKeeper) (cite 3).
- As of July 17, 2025, the latest stable Kafka release is 3.9.1 (May 21, 2025) (cite 1).
- Please verify that your infrastructure and the ingestr tool fully support KRaft-only clusters before committing to 4.0.0. If you rely on ZooKeeper or require the absolute latest patch stability, consider sticking with 3.9.1 or testing 4.0.0 in a staging environment first.
Citations:
application/ingestr/kafka-demo.xsh (3)
11-12: Good configuration for debugging and privacy.
The traceback and telemetry settings are appropriate for a demo/test environment.
184-185: Good use of environment variable for service lifecycle.
The KEEPALIVE environment variable provides flexible control over service cleanup, which is useful for debugging and development.
155-156: Use proper exit code handling instead of direct exit.
The direct exit call breaks the flow and makes testing difficult. Consider using exceptions or return codes.
```diff
-    # FIXME: Exiting from code is not the nicest thing. But well, this is a script anyway.
-    exit 2
+    raise SystemExit(2)
```
Likely an incorrect or invalid review comment.
application/ingestr/kafka-demo.xsh
Outdated
```
$size_actual=$(
    docker compose --file $COMPOSE_FILE run --rm httpie \
        http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
        | jq .rows[0][0]
)
```
🛠️ Refactor suggestion
Improve error handling for JSON parsing.
The jq command could fail if the HTTP response format changes. Add error handling for the JSON parsing.
```diff
 $size_actual=$(
     docker compose --file $COMPOSE_FILE run --rm httpie \
         http "$CRATEDB_HTTP_URL/_sql?pretty" 'stmt=SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
-        | jq .rows[0][0]
+        | jq -r '.rows[0][0] // empty'
 )
+if [ -z "$size_actual" ]; then
+    echo "ERROR: Failed to retrieve record count from CrateDB"
+    exit 1
+fi
```
9e82a83 makes it much better, no longer relying on the container invocation, but doing it with pure Python instead.
Actionable comments posted: 2
♻️ Duplicate comments (1)
application/ingestr/kafka-demo.xsh (1)
92-92: Add checksum verification for downloaded dataset. The script downloads data from an external source without verifying its integrity, which could be a security risk and may cause issues if the file is corrupted.
Consider adding checksum verification after the wget command to ensure file integrity.
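As a sketch of what such a step could look like in plain Python, the snippet below compares a downloaded file against a pinned SHA-256 digest. The filename and the digest value are placeholders and would need to be replaced with the real dataset archive and its published checksum.

import hashlib
import sys
from pathlib import Path

# Placeholder: pin this to the published SHA-256 digest of the dataset archive.
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"


def sha256sum(path: Path) -> str:
    # Hash the file in chunks so large downloads do not need to fit into memory.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path) -> None:
    actual = sha256sum(path)
    if actual != EXPECTED_SHA256:
        sys.exit(f"ERROR: Checksum mismatch for {path}: {actual}")
    print(f"Checksum OK for {path}")


if __name__ == "__main__":
    verify_download(Path("nyc-yellow-taxi-2017-subset.tar.gz"))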
🧹 Nitpick comments (2)
application/ingestr/kafka-demo.xsh (2)
43-47: Consider adding error handling to stop operations. While the start operations have good error handling, the stop operations (Docker Compose down and Kafka topic deletion) lack error checking. Consider adding similar error handling for consistency.

 def stop(self):
     title "Stopping services"
-    docker compose --file $COMPOSE_FILE down --remove-orphans
+    result = ![ docker compose --file $COMPOSE_FILE down --remove-orphans ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to stop services cleanly"
     echo "Done."
     return self

 def kafka_delete_topic(self):
     title "Deleting Kafka topic"
-    docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic
+    result = ![ docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to delete Kafka topic"
     echo "Done."
     return self

Also applies to: 66-70
110-112: Add error handling for Kafka data publishing. The data publishing operation lacks error handling. Consider adding error checking similar to other Docker Compose operations.

 title "Publishing NDJSON data to Kafka topic"
-cat nyc-yellow-taxi-2017-subset.ndjson | docker compose --file $COMPOSE_FILE run --rm --no-TTY publish-topic
+result = ![ cat nyc-yellow-taxi-2017-subset.ndjson | docker compose --file $COMPOSE_FILE run --rm --no-TTY publish-topic ]
+if result.returncode != 0:
+    echo "ERROR: Failed to publish data to Kafka topic"
+    exit 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- application/ingestr/.env (1 hunks)
- application/ingestr/README.md (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/kafka-demo.xsh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- application/ingestr/README.md
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 GitHub Actions: Ingestr
application/ingestr/kafka-demo.xsh
[error] 22-22: SyntaxError: ('code: Print',) at line 22. Likely an invalid or malformed docstring or code syntax.
🪛 dotenv-linter (3.3.0)
application/ingestr/.env
[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key
🔇 Additional comments (12)
application/ingestr/.env (4)
1-5: LGTM: Kafka configuration is well-structured. The Kafka port configuration follows standard conventions with appropriate default values for external broker, internal broker, and Zookeeper.
6-13: Valid shell parameter expansion syntax - ignore static analysis warnings. The shell parameter expansion syntax ${VAR:-default} is correct and idiomatic for providing default values. The static analysis tool is generating false positives by flagging these as improper assignments. The CrateDB configuration provides sensible defaults for all connection parameters.
14-19: Good documentation for cloud configuration. The commented-out CrateDB Cloud configuration provides helpful examples for users who need to connect to cloud instances rather than local Docker containers.
20-25: Well-designed URL construction for different use cases. The computed URLs appropriately serve different needs - one with authentication for HTTPie and one without for direct local access. This follows good configuration practices by deriving values from base parameters.
application/ingestr/kafka-demo.xsh (8)
34-42: Good error handling implementation. The error handling for Docker Compose startup has been properly implemented, checking the return code and exiting with a clear error message if the operation fails.
57-65: Consistent error handling for Kafka topic creation. The error handling pattern is consistently applied to the Kafka topic creation, which is good for robustness.
94-99: Good implementation of tar extraction error handling. The error handling for archive extraction has been properly implemented, checking the return code and exiting with a clear error message if extraction fails.
177-200: Excellent implementation using Python stdlib. The query_cratedb function effectively uses Python's standard library HTTP client and JSON module instead of external tools like jq. This reduces dependencies and provides better error handling capabilities.
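For readers who want to see the shape of such a helper, here is a small stand-alone sketch using only urllib and json against CrateDB's HTTP endpoint. The function name mirrors the one in the script, but the body is an illustration rather than a copy of it; the URL and table name are assumptions based on the demo's local defaults.

import json
import urllib.request


def query_cratedb(url: str, stmt: str) -> list:
    # POST a SQL statement to CrateDB's HTTP endpoint (/_sql) and return
    # the result rows from the JSON response.
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        document = json.loads(response.read().decode("utf-8"))
    return document.get("rows", [])


if __name__ == "__main__":
    rows = query_cratedb(
        "http://localhost:4200/_sql",
        'SELECT COUNT(*) FROM "kafka_demo"',
    )
    print(f"Record count: {rows[0][0]}")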
135-146: Clean data display implementation. The display method properly refreshes the table before querying and shows both sample records and total count, providing good visibility into the ingested data.
147-165: Robust verification with improved error handling. The verification method now uses the Python-based query function instead of jq, providing better error handling and eliminating the JSON parsing concerns from previous reviews. The colored output and proper exit codes make it easy to identify success or failure.
167-176: Good helper function for visual formatting. The title function provides clear section separation with appropriate color formatting, enhancing the script's readability during execution.
202-219: Well-orchestrated main function. The main function provides clear pipeline orchestration with useful development shortcuts and proper environment variable handling for service lifecycle management.
Actionable comments posted: 1
♻️ Duplicate comments (2)
application/ingestr/kafka-demo.xsh (2)
92-92: Add checksum verification for downloaded dataset. The script downloads data from an external source without verifying its integrity. This could be a security risk and may cause issues if the file is corrupted.
125-131: Add error handling for ingestr command. The ingestr ingestion command lacks error handling. Since this is a critical operation, consider adding error checking to ensure the ingestion completes successfully.
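A hedged sketch of such a check, driving the ingestr CLI from plain Python via subprocess and failing loudly on a non-zero exit code. The source and destination URIs below are illustrative placeholders; the exact connection strings should be taken from the ingestr documentation and the demo's .env settings.

import subprocess
import sys


def run_ingestr(source_uri: str, source_table: str, dest_uri: str, dest_table: str) -> None:
    # Invoke the ingestr CLI and stop the pipeline when ingestion fails.
    command = [
        "ingestr", "ingest",
        "--source-uri", source_uri,
        "--source-table", source_table,
        "--dest-uri", dest_uri,
        "--dest-table", dest_table,
    ]
    result = subprocess.run(command)
    if result.returncode != 0:
        sys.exit(f"ERROR: ingestr failed with exit code {result.returncode}")


if __name__ == "__main__":
    run_ingestr(
        source_uri="kafka://?bootstrap_servers=localhost:9092&group_id=demo",  # placeholder
        source_table="demo",                                                   # placeholder topic
        dest_uri="cratedb://crate:crate@localhost:5432/?sslmode=disable",      # placeholder
        dest_table="doc.kafka_demo",
    )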
🧹 Nitpick comments (4)
application/ingestr/.env (2)
10-11: Consider using more secure default credentials. The default credentials crate:crate are well-known defaults that should be changed in production environments. Consider adding a comment or documentation emphasizing this.

-export CRATEDB_USERNAME=${CRATEDB_USERNAME:-crate}
-export CRATEDB_PASSWORD=${CRATEDB_PASSWORD:-crate}
+export CRATEDB_USERNAME=${CRATEDB_USERNAME:-crate} # Change in production
+export CRATEDB_PASSWORD=${CRATEDB_PASSWORD:-crate} # Change in production
14-18: Improve the CrateDB Cloud configuration section. The commented-out CrateDB Cloud configuration could be more helpful with clearer instructions.

-# Data sink configuration (CrateDB Cloud).
-# export CRATEDB_HTTP_SCHEME=https
-# export CRATEDB_HOST=example.aks1.westeurope.azure.cratedb.net
-# export CRATEDB_USERNAME='admin'
-# export CRATEDB_PASSWORD='<PASSWORD>'
+# Data sink configuration (CrateDB Cloud).
+# Uncomment and modify the following lines when using CrateDB Cloud:
+# export CRATEDB_HTTP_SCHEME=https
+# export CRATEDB_HOST=your-cluster.region.provider.cratedb.net
+# export CRATEDB_USERNAME='your-username'
+# export CRATEDB_PASSWORD='your-password'

application/ingestr/kafka-demo.xsh (2)
68-70: Add error handling for Kafka topic deletion. The kafka_delete_topic method lacks error handling similar to the create topic method. While this is less critical than creation, it should still be handled consistently.

 def kafka_delete_topic(self):
     title "Deleting Kafka topic"
-    docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic
+    result = ![ docker compose --file $COMPOSE_FILE run --rm --no-TTY delete-topic ]
+    if result.returncode != 0:
+        echo "WARNING: Failed to delete Kafka topic (may not exist)"
     echo "Done."
     return self
161-162: Consider using return codes instead of exit() for better modularity. The script uses exit() calls which make it difficult to integrate as a module. Consider returning error codes or raising exceptions instead.

-    # FIXME: Exiting from code is not the nicest thing. But well, this is currently a script anyway.
-    exit 2
+    # Return error code for better modularity
+    return False

Then modify the main function to handle the return value appropriately.
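To make the suggestion concrete, here is a small sketch with a verification step that reports success as a boolean and a main function that turns it into a process exit code. The function names and the record counts are illustrative only, not taken from the script.

import sys


def verify(expected: int, actual: int) -> bool:
    # Report the outcome instead of exiting from deep inside the call stack.
    if actual != expected:
        print(f"ERROR: expected {expected} records, found {actual}")
        return False
    print(f"OK: found {actual} records")
    return True


def main() -> int:
    # Placeholder values; in the demo these would come from the CrateDB query.
    expected, actual = 45000, 45000
    # Translate the boolean result into a conventional exit code at the
    # outermost layer, keeping the inner functions importable and testable.
    return 0 if verify(expected, actual) else 2


if __name__ == "__main__":
    sys.exit(main())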
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- application/ingestr/.env (1 hunks)
- application/ingestr/kafka-compose.yml (1 hunks)
- application/ingestr/kafka-demo.xsh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
application/ingestr/.env
[warning] 7-7: [SubstitutionKey] The CRATEDB_HOST key is not assigned properly
[warning] 8-8: [SubstitutionKey] The CRATEDB_HTTP_PORT key is not assigned properly
[warning] 9-9: [SubstitutionKey] The CRATEDB_POSTGRESQL_PORT key is not assigned properly
[warning] 10-10: [SubstitutionKey] The CRATEDB_USERNAME key is not assigned properly
[warning] 11-11: [SubstitutionKey] The CRATEDB_PASSWORD key is not assigned properly
[warning] 11-11: [UnorderedKey] The CRATEDB_PASSWORD key should go before the CRATEDB_POSTGRESQL_PORT key
[warning] 12-12: [SubstitutionKey] The CRATEDB_HTTP_SCHEME key is not assigned properly
[warning] 12-12: [UnorderedKey] The CRATEDB_HTTP_SCHEME key should go before the CRATEDB_PASSWORD key
🔇 Additional comments (5)
application/ingestr/.env (2)
7-12: Parameter expansion syntax is correct despite linter warnings. The shell parameter expansion syntax ${VAR:-default} is valid and appropriate for providing default values. The static analysis warnings about "SubstitutionKey" appear to be false positives.
21-24: URL construction logic is well-designed. The separation of authenticated (for HTTPie) and non-authenticated (for Python urllib) URLs is thoughtful and addresses different client authentication requirements effectively.
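As an illustration of that derivation, here is a Python sketch that builds both URL variants from the same base parameters. The variable names follow the .env file; the concrete default values are assumptions for local Docker usage.

import os

# Base parameters, mirroring the .env file. Note that os.environ.get only
# falls back when a variable is unset, while ${VAR:-default} in the shell
# also substitutes when it is set but empty.
scheme = os.environ.get("CRATEDB_HTTP_SCHEME", "http")
host = os.environ.get("CRATEDB_HOST", "localhost")
port = os.environ.get("CRATEDB_HTTP_PORT", "4200")
username = os.environ.get("CRATEDB_USERNAME", "crate")
password = os.environ.get("CRATEDB_PASSWORD", "crate")

# URL with credentials embedded, suitable for CLI clients such as HTTPie.
cratedb_http_url = f"{scheme}://{username}:{password}@{host}:{port}"

# URL without credentials, suitable for direct local access or for clients
# that pass authentication separately.
cratedb_http_url_noauth = f"{scheme}://{host}:{port}"

if __name__ == "__main__":
    print(cratedb_http_url)
    print(cratedb_http_url_noauth)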
application/ingestr/kafka-demo.xsh (3)
11-12: Good configuration of debugging and telemetry settings. The traceback and telemetry disable settings are appropriate for a demo script, providing better debugging experience while respecting privacy.
177-200: Excellent implementation of native HTTP client. The query_cratedb function effectively replaces the need for external dependencies like jq by using Python's standard library. This addresses the concern raised in past comments about reducing external dependencies.
206-216: Well-structured main execution flow. The main function demonstrates good separation of concerns with infrastructure and data handling, plus useful fast-path options for development and debugging.
About
Support for CrateDB in ingestr has been unlocked just recently. This patch makes a start at providing runnable examples and additional quality assurance, using the examples as integration tests.
What's inside
Runnable examples that target CrateDB, starting with Kafka.
Synopsis
Tutorial
See Tutorial: Loading data from Apache Kafka into CrateDB.
/cc @kneth