Updated spark-quickstart instructions to use Ozone as a backend store #15034
SaketaChalamchala wants to merge 1 commit into apache:main.
Conversation
Pull request overview
This pull request replaces MinIO with Apache Ozone as the backend storage for the Spark Quickstart Docker environment, addressing concerns about MinIO's maintenance mode and security issues discussed in issue #14945.
Changes:
- Replaced the MinIO service with an Apache Ozone multi-service architecture (SCM, OM, Datanode, Recon, S3 Gateway)
- Updated the S3 endpoint configuration to point to the Ozone S3 Gateway (s3.ozone:9878); a sketch of the corresponding catalog configuration appears after this list
- Added a CREATE NAMESPACE statement to the SQL example for proper namespace initialization
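For context, a minimal sketch of what the endpoint change could look like in the quickstart compose file is shown below. Only the `s3.ozone:9878` endpoint and network alias come from this PR; the image name and the `CATALOG_*` variables are assumptions based on the pre-existing quickstart setup, not part of the patch itself.

```yaml
# Sketch only: pointing the quickstart REST catalog at the Ozone S3 Gateway.
services:
  rest:
    image: apache/iceberg-rest-fixture        # assumed, as in the existing quickstart
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://s3.ozone:9878   # previously pointed at MinIO
    networks:
      iceberg_net:
```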
```diff
-      - MINIO_ROOT_USER=admin
-      - MINIO_ROOT_PASSWORD=password
-      - MINIO_DOMAIN=minio
+      <<: *ozone-common-config
```
The ozone-datanode service is missing a depends_on clause to ensure proper startup order. The datanode requires both ozone-scm (Storage Container Manager) and ozone-om (Ozone Manager) to be running before it can start successfully. Without this dependency, the datanode may fail to start or experience startup issues.
Suggested change:

```diff
       <<: *ozone-common-config
+    depends_on:
+      - ozone-scm
+      - ozone-om
```
```yaml
    ports:
      - 9888:9888
    environment:
      <<: *ozone-common-config
```
The ozone-recon service should have a depends_on clause to ensure the ozone-scm and ozone-om services are available before it starts. Recon is the monitoring and management service for Ozone and needs the other core services to be operational to function properly.
Suggested change:

```diff
       <<: *ozone-common-config
+    depends_on:
+      - ozone-scm
+      - ozone-om
```
```yaml
  ozone-s3g:
    <<: *ozone-image
    ports:
      - 9878:9878
    environment:
      <<: *ozone-common-config
    command:
      - sh
      - -c
      - |
        set -e
        ozone s3g &
        s3g_pid=$$!
        until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
        ozone sh bucket delete /s3v/warehouse || true
        ozone sh bucket create /s3v/warehouse
        wait "$$s3g_pid"
    networks:
      iceberg_net:
        aliases:
          - s3.ozone
          - warehouse.s3.ozone
```
The ozone-s3g service is missing a depends_on clause to ensure the prerequisite services are started first. The S3 Gateway requires ozone-scm, ozone-om, and ozone-datanode to be available before it can successfully handle S3 requests. While the shell script waits for ozone-om to respond, the depends_on clause is still needed to prevent the container from starting prematurely and ensure proper orchestration.
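A possible shape for that change, mirroring the `depends_on` suggestions made for the other services (a sketch, not the exact suggested edit):

```yaml
  ozone-s3g:
    <<: *ozone-image
    depends_on:
      - ozone-scm
      - ozone-om
      - ozone-datanode
    ports:
      - 9878:9878
    environment:
      <<: *ozone-common-config
```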
```yaml
    ports:
      - 9878:9878
    environment:
      <<: *ozone-common-config
```
The Ozone S3 Gateway (ozone-s3g) configuration is missing AWS credential environment variables, while the spark-iceberg and rest services are configured with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For the S3 API to work properly with authentication, the Ozone S3 Gateway needs to be configured to either accept these credentials or operate in anonymous mode. Without proper credential configuration, there may be authentication mismatches when Spark tries to access the S3 endpoint with the provided credentials.
Suggested change:

```diff
       <<: *ozone-common-config
+      - AWS_ACCESS_KEY_ID=admin
+      - AWS_SECRET_ACCESS_KEY=password
+      - AWS_REGION=us-east-1
```
```diff
+  ozone-om:
+    <<: *ozone-image
     ports:
-      - 9001:9001
-      - 9000:9000
-    command: ["server", "/data", "--console-address", ":9001"]
-  mc:
-    depends_on:
-      - minio
-    image: minio/mc
-    container_name: mc
+      - 9874:9874
     environment:
+      <<: *ozone-common-config
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+      ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+      WAITFOR: ozone-scm:9876
+    command: ["ozone", "om"]
     networks:
       iceberg_net:
```
The container name for the ozone-om service is not explicitly set, while other services in the docker-compose file (like spark-iceberg and iceberg-rest) have container_name specified. For consistency and easier debugging, consider adding explicit container names for all Ozone services, such as 'container_name: ozone-om' for the ozone-om service.
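For illustration, explicit names for the Ozone services could look like the sketch below; the names themselves are suggestions, not part of the patch.

```yaml
  ozone-om:
    <<: *ozone-image
    container_name: ozone-om
  ozone-scm:
    <<: *ozone-image
    container_name: ozone-scm
  ozone-datanode:
    <<: *ozone-image
    container_name: ozone-datanode
  # ...and likewise for ozone-recon and ozone-s3g
```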
```yaml
      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
      ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
      WAITFOR: ozone-scm:9876
```
The WAITFOR configuration on line 118 specifies 'ozone-scm:9876' to wait for the SCM service, but there's no corresponding WAITFOR configuration in the ozone-scm service to wait for its own initialization. The ozone-scm service is the first in the dependency chain but doesn't have any startup coordination. Consider verifying that the apache/ozone image handles the ENSURE_SCM_INITIALIZED check properly without needing an explicit WAITFOR, or document this dependency ordering requirement.
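If explicit coordination turns out to be necessary, one Compose-level option would be a healthcheck on ozone-scm combined with condition-based depends_on in its dependents. This is only a sketch: it assumes bash (for /dev/tcp) is available inside the apache/ozone image and reuses port 9876, the address already used by WAITFOR, as the readiness probe.

```yaml
  ozone-scm:
    <<: *ozone-image
    environment:
      <<: *ozone-common-config
      ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
    command: ["ozone", "scm"]
    healthcheck:
      # Probe the SCM port that WAITFOR targets; assumes bash exists in the image.
      test: ["CMD-SHELL", "bash -c 'exec 3<>/dev/tcp/localhost/9876'"]
      interval: 5s
      timeout: 3s
      retries: 30

  ozone-om:
    <<: *ozone-image
    depends_on:
      ozone-scm:
        condition: service_healthy
```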
```diff
+  ozone-datanode:
+    <<: *ozone-image
     ports:
+      - 9864
+    command: ["ozone", "datanode"]
     environment:
-      - MINIO_ROOT_USER=admin
-      - MINIO_ROOT_PASSWORD=password
-      - MINIO_DOMAIN=minio
+      <<: *ozone-common-config
     networks:
       iceberg_net:
         aliases:
-          - warehouse.minio

+  ozone-om:
+    <<: *ozone-image
     ports:
-      - 9001:9001
-      - 9000:9000
-    command: ["server", "/data", "--console-address", ":9001"]
-  mc:
-    depends_on:
-      - minio
-    image: minio/mc
-    container_name: mc
+      - 9874:9874
     environment:
+      <<: *ozone-common-config
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+      ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+      WAITFOR: ozone-scm:9876
+    command: ["ozone", "om"]
     networks:
       iceberg_net:

+  ozone-scm:
+    <<: *ozone-image
     ports:
+      - 9876:9876
     environment:
-      - AWS_ACCESS_KEY_ID=admin
-      - AWS_SECRET_ACCESS_KEY=password
-      - AWS_REGION=us-east-1
-    entrypoint: |
-      /bin/sh -c "
-      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
-      /usr/bin/mc rm -r --force minio/warehouse;
-      /usr/bin/mc mb minio/warehouse;
-      /usr/bin/mc policy set public minio/warehouse;
-      tail -f /dev/null
-      "
+      <<: *ozone-common-config
+      ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+    command: ["ozone", "scm"]
     networks:
       iceberg_net:

+  ozone-recon:
+    <<: *ozone-image
     ports:
+      - 9888:9888
     environment:
+      <<: *ozone-common-config
+    command: ["ozone", "recon"]
     networks:
       iceberg_net:
```
The Ozone services (ozone-datanode, ozone-om, ozone-scm, ozone-recon) store data in directories like /data/hdds and /data/metadata, but these directories are not mounted to host volumes. This means that when the containers are stopped or removed, all data will be lost. While this might be acceptable for a quickstart/demo environment, consider documenting this behavior or adding optional volume mounts for users who want to persist data across container restarts.
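For users who do want data to survive container restarts, an optional named-volume setup might look like the sketch below. The /data paths are taken from the ENSURE_*_INITIALIZED values in the diff; the volume names are only illustrative.

```yaml
volumes:
  ozone-scm-data:
  ozone-om-data:
  ozone-datanode-data:

services:
  ozone-scm:
    <<: *ozone-image
    volumes:
      - ozone-scm-data:/data        # SCM metadata (/data/metadata/scm)
  ozone-om:
    <<: *ozone-image
    volumes:
      - ozone-om-data:/data         # OM metadata (/data/metadata/om)
  ozone-datanode:
    <<: *ozone-image
    volumes:
      - ozone-datanode-data:/data   # block/container data (/data/hdds)
```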
```diff
-    container_name: mc
+      - 9874:9874
     environment:
       <<: *ozone-common-config
```
The proxy user configuration allows any host and any group to proxy as the hadoop user, which is overly permissive for a production environment. While this may be acceptable for a quickstart/demo environment, consider adding a comment in the documentation to warn users not to use these settings in production, or at minimum document that these are demo-only settings for ease of use.
Suggested change:

```diff
       <<: *ozone-common-config
+      # WARNING: The following proxyuser settings are for quickstart/demo use only.
+      # Do not use this configuration as-is in production; restrict hosts and groups appropriately.
```
```yaml
        s3g_pid=$$!
        until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
        ozone sh bucket delete /s3v/warehouse || true
        ozone sh bucket create /s3v/warehouse
```
The script starts the S3 gateway in the background and then performs administrative operations. However, there's no check to ensure the S3 gateway itself is fully operational before proceeding with bucket operations. The wait only checks if ozone-om responds to 'volume list', but doesn't verify that the S3 gateway is ready. This could lead to race conditions where bucket operations might fail if the S3 gateway hasn't fully initialized. Consider adding a health check or retry logic for the bucket operations.
Suggested change:

```diff
-        ozone sh bucket create /s3v/warehouse
+        max_retries=30
+        retry_delay=2
+        attempt=0
+        until ozone sh bucket create /s3v/warehouse >/dev/null 2>&1; do
+          attempt=$$((attempt + 1))
+          if [ "$$attempt" -ge "$$max_retries" ]; then
+            echo "Failed to create bucket /s3v/warehouse after $$max_retries attempts"
+            exit 1
+          fi
+          echo '...waiting for S3 gateway to be ready for bucket operations...'
+          sleep "$$retry_delay"
+        done
```
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Following the discussion in thread #14945, submitting a patch to use Ozone as a backend store for Iceberg in the quickstart Docker environments.

Tested manually with spark-shell. This patch will need to be merged after databricks/docker-spark-iceberg#246, which also contains updates to spark-defaults.conf.