Use Apache Ozone as backend store in Iceberg quickstart docker environment #246
SaketaChalamchala wants to merge 4 commits into databricks:main
Conversation
Pull request overview
This PR migrates the Iceberg quickstart Docker environment from MinIO to Apache Ozone as the S3-compatible backend storage system. This change addresses MinIO entering maintenance mode and provides a stable alternative for the quickstart environment.
Changes:
- Replaced MinIO services with a complete Apache Ozone cluster consisting of multiple services (SCM, OM, Datanode, Recon, S3 Gateway)
- Updated all S3 endpoint configurations from `http://minio:9000` to `http://s3.ozone:9878` across the Spark, Flink, and PyIceberg configurations
- Updated documentation to reflect Apache Ozone as the storage backend
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| docker-compose.yml | Replaced MinIO and mc services with Apache Ozone cluster services (ozone-datanode, ozone-om, ozone-scm, ozone-recon, ozone-s3g) and updated S3 endpoint configuration |
| flink-example/docker-compose.yml | Applied identical Ozone migration changes for the Flink example environment |
| spark/spark-defaults.conf | Updated S3 endpoint from minio:9000 to s3.ozone:9878 |
| spark/.pyiceberg.yaml | Updated S3 endpoint from minio:9000 to s3.ozone:9878 |
| flink-example/src/main/java/io/tabular/flink/lor/example/LORSink.java | Updated default S3 endpoint parameter from minio:9000 to s3.ozone:9878 |
| README.md | Updated documentation to mention Apache Ozone instead of MinIO as storage backend |
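For context, the endpoint change in spark/.pyiceberg.yaml presumably looks roughly like the following sketch; only the `s3.endpoint` value is taken from this PR, while the catalog name, REST URI, and other keys are assumptions:

```yaml
# spark/.pyiceberg.yaml -- hypothetical sketch of the changed setting
catalog:
  default:                            # catalog name assumed
    uri: http://rest:8181             # REST catalog address assumed
    s3.endpoint: http://s3.ozone:9878 # was http://minio:9000
```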
docker-compose.yml (outdated)
```sh
until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
ozone sh bucket delete /s3v/warehouse || true
ozone sh bucket create /s3v/warehouse
wait "$$s3g_pid"
```
The wait command uses double dollar signs (`$$s3g_pid`), which is incorrect. This should be a single dollar sign (`$s3g_pid`) to properly reference the shell variable containing the PID of the background s3g process.
| wait "$$s3g_pid" | |
| wait "$s3g_pid" |
flink-example/docker-compose.yml (outdated)
```sh
s3g_pid=$$!
until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
ozone sh bucket delete /s3v/warehouse || true
ozone sh bucket create /s3v/warehouse
wait "$$s3g_pid"
```
The shell variable assignment uses double dollar signs (`$$!`), which is incorrect for capturing the background process PID in a YAML multi-line string. This should be a single dollar sign (`$!`) to properly capture the PID of the background process. The double dollar sign is typically used in Makefiles or when you need to escape the dollar sign, but it is not needed here in the YAML command string.
Suggested change:

```diff
-s3g_pid=$$!
+s3g_pid=$!
 until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
 ozone sh bucket delete /s3v/warehouse || true
 ozone sh bucket create /s3v/warehouse
-wait "$$s3g_pid"
+wait "$s3g_pid"
```
flink-example/docker-compose.yml (outdated)
```sh
s3g_pid=$$!
until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
ozone sh bucket delete /s3v/warehouse || true
ozone sh bucket create /s3v/warehouse
wait "$$s3g_pid"
```
The wait command uses double dollar signs (`$$s3g_pid`), which is incorrect. This should be a single dollar sign (`$s3g_pid`) to properly reference the shell variable containing the PID of the background s3g process.
Suggested change:

```diff
-s3g_pid=$$!
+s3g_pid=$!
 until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
 ozone sh bucket delete /s3v/warehouse || true
 ozone sh bucket create /s3v/warehouse
-wait "$$s3g_pid"
+wait "$s3g_pid"
```
docker-compose.yml (outdated)
```yaml
ozone-s3g:
  <<: *ozone-image
  ports:
    - 9878:9878
```
The ozone-s3g service should have explicit depends_on declarations for the other Ozone services it relies on (ozone-om, ozone-scm, ozone-datanode) to ensure proper service startup order. While the WAITFOR environment variable in ozone-om and the polling loop in the command handle some of the initialization sequencing, adding depends_on ensures Docker Compose starts the services in the correct order.
Suggested change:

```diff
   ports:
     - 9878:9878
+  depends_on:
+    - ozone-om
+    - ozone-scm
+    - ozone-datanode
```
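A stricter variant (not part of this PR or the suggestion above) would gate startup on a healthcheck rather than on container start order. A minimal sketch, assuming the Ozone image ships curl and that the OM web UI listens on its usual port 9874:

```yaml
# Hypothetical sketch: start ozone-s3g only once ozone-om is actually healthy.
services:
  ozone-om:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9874/"]
      interval: 5s
      retries: 12
  ozone-s3g:
    depends_on:
      ozone-om:
        condition: service_healthy
```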
flink-example/docker-compose.yml (outdated)
```yaml
ports:
  - 9878:9878
environment:
  <<: *ozone-common-config
```
The ozone-s3g service should have explicit depends_on declarations for the other Ozone services it relies on (ozone-om, ozone-scm, ozone-datanode) to ensure proper service startup order. While the WAITFOR environment variable in ozone-om and the polling loop in the command handle some of the initialization sequencing, adding depends_on ensures Docker Compose starts the services in the correct order.
Suggested change:

```diff
   environment:
     <<: *ozone-common-config
+  depends_on:
+    - ozone-om
+    - ozone-scm
+    - ozone-datanode
```
docker-compose.yml (outdated)
```yaml
WAITFOR: ozone-scm:9876
command: ["ozone", "om"]
networks:
  iceberg_net:
```
The ozone-om service should have an explicit depends_on declaration for ozone-scm since it uses WAITFOR: ozone-scm:9876 to ensure SCM is ready before starting. While the WAITFOR mechanism handles the runtime dependency, adding depends_on ensures Docker Compose starts ozone-scm before ozone-om.
Suggested change:

```diff
 networks:
   iceberg_net:
+depends_on:
+  - ozone-scm
```
flink-example/docker-compose.yml (outdated)
```yaml
WAITFOR: ozone-scm:9876
command: ["ozone", "om"]
networks:
  iceberg_net:
```
The ozone-om service should have an explicit depends_on declaration for ozone-scm since it uses WAITFOR: ozone-scm:9876 to ensure SCM is ready before starting. While the WAITFOR mechanism handles the runtime dependency, adding depends_on ensures Docker Compose starts ozone-scm before ozone-om.
Suggested change:

```diff
 networks:
   iceberg_net:
+depends_on:
+  - ozone-scm
```
docker-compose.yml (outdated)
```sh
s3g_pid=$$!
until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
ozone sh bucket delete /s3v/warehouse || true
ozone sh bucket create /s3v/warehouse
wait "$$s3g_pid"
```
The shell variable assignment uses double dollar signs (`$$!`), which is incorrect for capturing the background process PID in a YAML multi-line string. This should be a single dollar sign (`$!`) to properly capture the PID of the background process. The double dollar sign is typically used in Makefiles or when you need to escape the dollar sign, but it is not needed here in the YAML command string.
Suggested change:

```diff
-s3g_pid=$$!
+s3g_pid=$!
 until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
 ozone sh bucket delete /s3v/warehouse || true
 ozone sh bucket create /s3v/warehouse
-wait "$$s3g_pid"
+wait "$s3g_pid"
```
spark/spark-defaults.conf

```diff
 spark.sql.catalog.demo.io-impl     org.apache.iceberg.aws.s3.S3FileIO
 spark.sql.catalog.demo.warehouse   s3://warehouse/wh/
-spark.sql.catalog.demo.s3.endpoint http://minio:9000
+spark.sql.catalog.demo.s3.endpoint http://s3.ozone:9878
```
The spark.sql.catalog.demo.s3.endpoint is configured to use plain HTTP to reach the object store, which sends S3 requests and any associated credentials over the network without encryption. An attacker with access to the host or the Docker network could intercept or tamper with catalog and warehouse data in transit. Consider configuring this endpoint to use TLS (HTTPS) and enabling the corresponding TLS support on the Ozone S3 gateway to protect data and credentials in transit.
Suggested change:

```diff
-spark.sql.catalog.demo.s3.endpoint http://s3.ozone:9878
+spark.sql.catalog.demo.s3.endpoint https://s3.ozone:9878
```
@SaketaChalamchala Hi! Were you able to get past the Jupyter thing?
@dipankarmazumdar, I still can't find a way to get to the notebook server past the security page, even though I set the password. I just wanted to test the Flink example; I might try the Flink REST API for creating the DB. Has anybody tried to use the instructions in the …? Just clarifying that this is not a blocker for Ozone adoption. The updated …
Thanks! I was able to test the "Iceberg - Getting Started" notebook and it worked well. Question: how do I see the content of the …?
Ozone exposes both a Hadoop-compatible file system interface and an S3 client interface. S3 interface: use any S3 client, such as the AWS CLI; sample commands are below. File system interface: use …
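For example, a sketch of the kind of commands meant here, assuming the S3 gateway is published on localhost:9878 and the warehouse bucket lives in Ozone's s3v volume (both appear elsewhere in this PR); the ofs:// path and the ozone-om host name are assumptions:

```sh
# S3 interface: list the warehouse bucket through Ozone's S3 gateway
# (endpoint port and bucket name taken from this PR).
aws s3 ls s3://warehouse/ --endpoint-url http://localhost:9878

# File system interface: browse the same data via the Hadoop-compatible
# ofs:// scheme from inside an Ozone container (OM host name assumed).
ozone fs -ls ofs://ozone-om/s3v/warehouse/
```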
@dipankarmazumdar, since you asked about OzoneManager and Recon above, here is my explanation of those processes.

As @SaketaChalamchala mentioned, for specific path access you should be able to browse via the fs/S3 APIs. Thank you @SaketaChalamchala for providing the commands.
Thanks, this helps!
@dipankarmazumdar Would you be able to help get an approval for this patch?
Hi @nastra! Would you be able to review this?
Hi @nastra / @dipankarmazumdar, the CI is failing when building the Spark container. Looks like …
#247 would fix this.
liuml07 left a comment:
Replacing MinIO makes sense given its current state. Also using an Apache top-level project, Ozone, sounds good. Using an S3-compatible system is also a sensible choice.
My only question is about simplicity. I feel we may want to keep this repo as simple as possible, enabling users to start the local Spark/Flink environment with Iceberg quickly and easily. If things go wrong, it's easier to debug, and maintainers and advanced users would also find it straightforward to make changes.
For that, I have some quick questions:
- We are starting a fully-fledged Ozone system with 5 new services. Is there a way to consolidate them into fewer ones?
- There are config definitions for Ozone which are less interesting to the Iceberg quickstart. Can we fall back to defaults, or put the configs in an env_file and reference it from the docker compose file?
- Alternatively, can we encapsulate all the Ozone services in another file such as `docker-compose.ozone.yml` and `include` it in the main docker compose file? That way the main docker compose file would be more concise and relevant as a quickstart local docker setup (see the sketch after this list).
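For reference, a minimal sketch of what that include could look like, assuming Docker Compose v2.20+ (which introduced the top-level include element); the file name follows the suggestion above and the service image is assumed:

```yaml
# docker-compose.yml -- hypothetical sketch, not part of this PR
include:
  - docker-compose.ozone.yml         # all ozone-* services defined there

services:
  spark-iceberg:
    image: tabulario/spark-iceberg   # image name assumed from the quickstart
```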
Lastly, some of the auto-review comments made by Copilot seem valid, for example the depends_on ones.
README.md (outdated)

```diff
 This is a docker compose environment to quickly get up and running with a Spark environment and a local REST
-catalog, and MinIO as a storage backend.
+catalog, and Apache Ozone as a storage backend.
```
Probably we can link to the project.
@liuml07 That's a good idea. I have prepared another patch towards that end: …

This approach also allows adding the RustFS backend as an alternative. Please let me know what you think. (I have moved Ozone's S3 service to …)
Wouldn't this be confusing, as we are using Ozone storage internally but hiding it behind MinIO as the name? It does make it compatible with fewer changes, but it may get confusing if we want to change the underlying storage in the future.
I agree; for the quickstart we could just keep the …

Following the discussion on the thread apache/iceberg#14945, submitting a patch to use Apache Ozone as the backend store for Iceberg in the quickstart docker environment, since MinIO is entering maintenance mode.
Manual test using spark-shell:
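A hypothetical smoke test of the kind such a spark-shell session would run, using the demo catalog configured in spark/spark-defaults.conf (the nyc.taxis table name is made up):

```scala
// Inside spark-shell in the quickstart container; "demo" is the catalog
// from spark/spark-defaults.conf, the database and table names are
// illustrative only.
spark.sql("CREATE DATABASE IF NOT EXISTS demo.nyc")
spark.sql("CREATE TABLE IF NOT EXISTS demo.nyc.taxis (id BIGINT, fare DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.nyc.taxis VALUES (1, 12.5), (2, 8.0)")
spark.sql("SELECT * FROM demo.nyc.taxis").show()
```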