Skip to content

Guides: fix setup scripts to yield correct exit code#3612

Merged
snazy merged 3 commits intoapache:mainfrom
snazy:guides-hc-exit-code
Feb 3, 2026
Merged

Guides: fix setup scripts to yield correct exit code#3612
snazy merged 3 commits intoapache:mainfrom
snazy:guides-hc-exit-code

Conversation

@snazy
Copy link
Member

@snazy snazy commented Jan 29, 2026

The statements in the shell scripts for the setup services are often concatenated using ;, which means that a previous' command exit code is not propagated and the service, although it failed, is determined to be successful.

This change updates those scripts to use && for the statement concatenation.

"Final" setup services (aka "polaris-setup") now have a final sleep 120. This is due to the behavior of docker compose up --detach --wait, which considers any service (without dependants) that exits with exit code 0 as a failure, leading to that docker-compose command yielding an error code. That would break the guides testing code (#3553). That sleep 120 in "polaris-setup" services does not cause a delay of the compose starting up - it is purely a "hack around" Docker Compose not having a notion of "setup services".

To avoid merge conflicts, this change also:

apk add --no-cache jq &&
chmod +x /polaris/create-catalog.sh &&
apk add --no-cache jq &&
token=$$(curl http://keycloak:8080/realms/iceberg/protocol/openid-connect/token --user client1:s3cr3t -d 'grant_type=client_credentials' | jq -r .access_token) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do implicitly --fail-with-body here too for curl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is that the token would just be "invalid".
It's also that the pipe "eliminates" the exit code from curl, but jq would fail then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fair enough. Yeah. Just thought to be consistent. But not a blocker.

echo Creating Ceph bucket... &&
aws s3 mb s3://${S3_POLARIS_BUCKET} &&
aws s3 ls &&
echo Bucket setup complete.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return 0 as well no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYM?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have at most 2 setup services for these docker compose and you are adding sleep 120 to pass the CI based on docker-compose (as exit of status code of zero will be treat as failed). Thus, this will also return status code of zero.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I know what you're referring to.
Those setup-bucket services behave in an "interesting" way, which is different than how the "final" polaris-setup services behave (in terms of exit behavior).
An "intermediate" setup-service like setup-bucket, on which for example polaris depends using the service_completed_successfully condition, is fine to exit and does not "break" a docker compose up --detach --wait.
A "final" setup-service like some polaris-setup, a service that no other service depends on, lets docker compose up --detach --wait fail when it exits.

The "crux" with those "final" setup-services is that the behavior isn't immediately obvious and depends on the timing of the relevant services and when exactly docker compose up --detach --wait "thinks" the whole compose thing is ready.

But this ^^ is all from experiments. It's tricky to find good documentation on this behavior. However, the behavior of depends_on (... a setup service) with service_completed_successfully condition makes sense.

/polaris/create-catalog.sh realm-external $$token &&
/polaris/create-catalog.sh realm-mixed $$token
/polaris/create-catalog.sh realm-mixed $$token &&
sleep 120
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So yeah, while I was doing a local prototyping, I ran into this issue as well where --exit-code-from is not happy with this. What I ended up doing is to use tail /dev/null to keep the service up then use health check on service ready instead of completed. Sleep 120 here will work but if for whatever reason these bash scripts took more than 120s (assuming they get very complex later on), this will bite us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, docker-compose isn't a good friend at all :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither tail /dev/null nor sleep 120 is particularly great.
I wanted to stay away from the added complexity of health-checks for the setup-scripts though.
Let me think a bit about this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to let the polaris-setup services "tail forever" and have health-checks. Added that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is what I ended up doing in my local prototyping with tail forever as status code of zero is not okay for compose service.

snazy added 3 commits January 31, 2026 11:04
The statements in the shell scripts for the setup services are often concatenated using `;`, which means that a previous' command exit code is _not_ propagated and the service, although it failed, is determined to be successful.

This change updates those scripts to use `&&` for the statement concatenation.

"Final" setup services (aka "polaris-setup") now have a final `sleep 120`. This is due to the behavior of `docker compose up --detach --wait`, which considers _any_ service (without dependants) that exits with exit code 0 as a failure, leading to that docker-compose command yielding an error code. That would break the guides testing code (apache#3553). That `sleep 120` in "polaris-setup" services does **not** cause a delay of the compose starting up - it is purely a "hack around" Docker Compose not having a notion of "setup services".

To avoid merge conflicts, this change also:
* updates affected `curl` invocations (as apache#3610)
* removes superfluous `restart: "no"`
@snazy snazy force-pushed the guides-hc-exit-code branch from f2228d6 to 974046c Compare January 31, 2026 10:09
@snazy snazy requested a review from MonkeyCanCode February 2, 2026 12:15
@snazy
Copy link
Member Author

snazy commented Feb 3, 2026

@MonkeyCanCode mind taking another look?

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Feb 3, 2026
@snazy snazy merged commit 890b33a into apache:main Feb 3, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Feb 3, 2026
@snazy snazy deleted the guides-hc-exit-code branch February 3, 2026 13:59
sungwy pushed a commit to sungwy/polaris that referenced this pull request Feb 7, 2026
The statements in the shell scripts for the setup services are often concatenated using `;`, which means that a previous' command exit code is _not_ propagated and the service, although it failed, is determined to be successful.

This change updates those scripts to use `&&` for the statement concatenation.

"Final" setup services (aka "polaris-setup") now have a final `sleep 120`. This is due to the behavior of `docker compose up --detach --wait`, which considers _any_ service (without dependants) that exits with exit code 0 as a failure, leading to that docker-compose command yielding an error code. That would break the guides testing code (apache#3553). That `sleep 120` in "polaris-setup" services does **not** cause a delay of the compose starting up - it is purely a "hack around" Docker Compose not having a notion of "setup services".

To avoid merge conflicts, this change also:
* updates affected `curl` invocations (as apache#3610)
* removes superfluous `restart: "no"`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants