Updated spark-quickstart instructions to use Ozone as a backend store #15034
| Diff line change |
|---|
| @@ -37,6 +37,28 @@ Once you have those, save the yaml below into a file named `docker-compose.yml`: |
| ```yaml |
| x-image: |
| &ozone-image |
| image: ${OZONE_IMAGE:-apache/ozone}:${OZONE_IMAGE_VERSION:-2.1.0}${OZONE_IMAGE_FLAVOR:-} |
| x-common-config: |
| &ozone-common-config |
| OZONE-SITE.XML_hdds.datanode.dir: "/data/hdds" |
| OZONE-SITE.XML_ozone.metadata.dirs: "/data/metadata" |
| OZONE-SITE.XML_ozone.om.address: "ozone-om" |
| OZONE-SITE.XML_ozone.om.http-address: "ozone-om:9874" |
| OZONE-SITE.XML_ozone.recon.address: "ozone-recon:9891" |
| OZONE-SITE.XML_ozone.recon.db.dir: "/data/metadata/recon" |
| OZONE-SITE.XML_ozone.replication: "1" |
| OZONE-SITE.XML_ozone.scm.block.client.address: "ozone-scm" |
| OZONE-SITE.XML_ozone.scm.client.address: "ozone-scm" |
| OZONE-SITE.XML_ozone.scm.datanode.id.dir: "/data/metadata" |
| OZONE-SITE.XML_ozone.scm.names: "ozone-scm" |
| OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "1" |
| OZONE-SITE.XML_hdds.scm.safemode.healthy.pipeline.pct: "0" |
| OZONE-SITE.XML_ozone.s3g.domain.name: "s3.ozone" |
| no_proxy: "ozone-om,ozone-recon,ozone-scm,ozone-s3g,localhost,127.0.0.1" |
| services: |
| spark-iceberg: |
| image: tabulario/spark-iceberg |
| @@ -46,7 +68,7 @@ services: |
| iceberg_net: |
| depends_on: |
| - rest |
| - minio |
| - ozone-s3g |
| volumes: |
| - ./warehouse:/home/iceberg/warehouse |
| - ./notebooks:/home/iceberg/notebooks/notebooks |
| @@ -72,41 +94,76 @@ services: |
| - AWS_REGION=us-east-1 |
| - CATALOG_WAREHOUSE=s3://warehouse/ |
| - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO |
| - CATALOG_S3_ENDPOINT=http://minio:9000 |
| minio: |
| image: minio/minio |
| container_name: minio |
| - CATALOG_S3_ENDPOINT=http://s3.ozone:9878 |
| ozone-datanode: |
| <<: *ozone-image |
| ports: |
| - 9864 |
| command: ["ozone", "datanode"] |
| environment: |
| - MINIO_ROOT_USER=admin |
| - MINIO_ROOT_PASSWORD=password |
| - MINIO_DOMAIN=minio |
| <<: *ozone-common-config |
| networks: |
| iceberg_net: |
| aliases: |
| - warehouse.minio |
| ozone-om: |
| <<: *ozone-image |
| ports: |
| - 9001:9001 |
| - 9000:9000 |
| command: ["server", "/data", "--console-address", ":9001"] |
| mc: |
| depends_on: |
| - minio |
| image: minio/mc |
| container_name: mc |
| - 9874:9874 |
| environment: |
| <<: *ozone-common-config |
Suggested change
| - <<: *ozone-common-config |
| + <<: *ozone-common-config |
| + # WARNING: The following proxyuser settings are for quickstart/demo use only. |
| + # Do not use this configuration as-is in production; restrict hosts and groups appropriately. |
Copilot AI, Jan 12, 2026
The WAITFOR configuration on line 118 specifies 'ozone-scm:9876' to wait for the SCM service, but there's no corresponding WAITFOR configuration in the ozone-scm service to wait for its own initialization. The ozone-scm service is the first in the dependency chain but doesn't have any startup coordination. Consider verifying that the apache/ozone image handles the ENSURE_SCM_INITIALIZED check properly without needing an explicit WAITFOR, or document this dependency ordering requirement.
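A minimal sketch of how the startup ordering could be made explicit, under assumptions: the upstream Apache Ozone compose examples use entrypoint environment variables named `ENSURE_SCM_INITIALIZED`, `ENSURE_OM_INITIALIZED`, and `WAITFOR`. Whether the image tag used in this PR honors them the same way is not confirmed in this thread, and the `VERSION` file paths below come from those upstream examples rather than from this diff.

```yaml
# Sketch only: variable names and VERSION paths follow upstream Ozone compose
# examples and are assumptions relative to this PR. The existing
# "<<: *ozone-common-config" merge is omitted so the fragment parses on its own.
services:
  ozone-scm:
    command: ["ozone", "scm"]
    environment:
      # Assumed behavior: initialize SCM on first start if this file is missing.
      ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
  ozone-om:
    command: ["ozone", "om"]
    environment:
      # Assumed behavior: initialize OM on first start if this file is missing.
      ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
      # Assumed behavior: delay OM startup until the SCM port accepts connections.
      WAITFOR: ozone-scm:9876
```

In this layout the SCM needs no `WAITFOR` of its own because nothing precedes it in the chain; it only needs its one-time initialization to be handled.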
Copilot AI, Jan 12, 2026
The container name for the ozone-om service is not explicitly set, while other services in the docker-compose file (like spark-iceberg and iceberg-rest) have container_name specified. For consistency and easier debugging, consider adding explicit container names for all Ozone services, such as 'container_name: ozone-om' for the ozone-om service.
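As an illustration of this suggestion, the fragment below gives every Ozone service an explicit name. The names simply mirror the service keys and are a choice made for this sketch, not something taken from the PR.

```yaml
# Sketch only: container names mirror the service keys (an assumption).
services:
  ozone-datanode:
    container_name: ozone-datanode
  ozone-om:
    container_name: ozone-om
  ozone-scm:
    container_name: ozone-scm
  ozone-recon:
    container_name: ozone-recon
  ozone-s3g:
    container_name: ozone-s3g
```

One trade-off worth weighing: Compose cannot scale a service beyond one container once it has a fixed `container_name`, which may matter for `ozone-datanode` if anyone wants to run several datanodes.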
Copilot AI, Jan 12, 2026
The ozone-recon service should have a depends_on clause to ensure the ozone-scm and ozone-om services are available before it starts. Recon is the monitoring and management service for Ozone and needs the other core services to be operational to function properly.
Suggested change
| - <<: *ozone-common-config |
| + <<: *ozone-common-config |
| + depends_on: |
| + - ozone-scm |
| + - ozone-om |
Copilot AI, Jan 12, 2026
The Ozone services (ozone-datanode, ozone-om, ozone-scm, ozone-recon) store data in directories like /data/hdds and /data/metadata, but these directories are not mounted to host volumes. This means that when the containers are stopped or removed, all data will be lost. While this might be acceptable for a quickstart/demo environment, consider documenting this behavior or adding optional volume mounts for users who want to persist data across container restarts.
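For readers who do want data to survive container removal, a hedged sketch using Compose named volumes follows. The volume names are hypothetical, and mounting the whole `/data` directory is an assumption based on the `/data/hdds` and `/data/metadata` paths in the common config; file ownership inside the image may need checking before this works as written.

```yaml
# Sketch only: volume names are hypothetical; /data is assumed to be the
# parent of every path configured in x-common-config.
services:
  ozone-datanode:
    volumes:
      - ozone-datanode-data:/data
  ozone-om:
    volumes:
      - ozone-om-data:/data
  ozone-scm:
    volumes:
      - ozone-scm-data:/data
  ozone-recon:
    volumes:
      - ozone-recon-data:/data

volumes:
  ozone-datanode-data:
  ozone-om-data:
  ozone-scm-data:
  ozone-recon-data:
```

With named volumes, `docker compose down` keeps the data and `docker compose down -v` discards it, which preserves an easy reset path for a quickstart.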
Copilot AI, Jan 12, 2026
The Ozone S3 Gateway (ozone-s3g) configuration is missing AWS credential environment variables, while the spark-iceberg and rest services are configured with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For the S3 API to work properly with authentication, the Ozone S3 Gateway needs to be configured to either accept these credentials or operate in anonymous mode. Without proper credential configuration, there may be authentication mismatches when Spark tries to access the S3 endpoint with the provided credentials.
Suggested change
| - <<: *ozone-common-config |
| + <<: *ozone-common-config |
| + - AWS_ACCESS_KEY_ID=admin |
| + - AWS_SECRET_ACCESS_KEY=password |
| + - AWS_REGION=us-east-1 |
Copilot AI, Jan 12, 2026
The script starts the S3 gateway in the background and then performs administrative operations. However, there's no check to ensure the S3 gateway itself is fully operational before proceeding with bucket operations. The wait only checks if ozone-om responds to 'volume list', but doesn't verify that the S3 gateway is ready. This could lead to race conditions where bucket operations might fail if the S3 gateway hasn't fully initialized. Consider adding a health check or retry logic for the bucket operations.
Suggested change
| - ozone sh bucket create /s3v/warehouse |
| + max_retries=30 |
| + retry_delay=2 |
| + attempt=0 |
| + until ozone sh bucket create /s3v/warehouse >/dev/null 2>&1; do |
| + attempt=$$((attempt + 1)) |
| + if [ "$$attempt" -ge "$$max_retries" ]; then |
| + echo "Failed to create bucket /s3v/warehouse after $$max_retries attempts" |
| + exit 1 |
| + fi |
| + echo '...waiting for S3 gateway to be ready for bucket operations...' |
| + sleep "$$retry_delay" |
| + done |
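The health-check alternative mentioned in this comment could look roughly like the sketch below. It assumes the `ozone` CLI is on the path inside the `ozone-s3g` container and treats the existence of `/s3v/warehouse` as the readiness signal; the interval and retry values are illustrative, not tuned.

```yaml
# Sketch only: timing values are illustrative assumptions.
services:
  ozone-s3g:
    healthcheck:
      # Reports healthy once the warehouse bucket can be looked up.
      test: ["CMD-SHELL", "ozone sh bucket info /s3v/warehouse >/dev/null 2>&1"]
      interval: 5s
      timeout: 10s
      retries: 30
      start_period: 20s
  spark-iceberg:
    depends_on:
      rest:
        condition: service_started
      ozone-s3g:
        # Spark starts only after the health check above passes.
        condition: service_healthy
```

This check confirms that the bucket exists in the Ozone Manager, not that the gateway's HTTP port is serving requests, so it complements the retry loop rather than replacing it.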
Copilot AI, Jan 12, 2026
The ozone-s3g service is missing a depends_on clause to ensure the prerequisite services are started first. The S3 Gateway requires ozone-scm, ozone-om, and ozone-datanode to be available before it can successfully handle S3 requests. While the shell script waits for ozone-om to respond, the depends_on clause is still needed to prevent the container from starting prematurely and ensure proper orchestration.
The ozone-datanode service is missing a depends_on clause to ensure proper startup order. The datanode requires both ozone-scm (Storage Container Manager) and ozone-om (Ozone Manager) to be running before it can start successfully. Without this dependency, the datanode may fail to start or experience startup issues.
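A sketch of the `depends_on` clauses described in the two comments above, using the service names from this PR. The dependency lists follow the reviewer's description and have not been verified against Ozone's actual startup requirements.

```yaml
# Sketch only: dependency lists mirror the review comments, not a verified
# statement of what each Ozone component needs at startup.
services:
  ozone-datanode:
    depends_on:
      - ozone-scm
      - ozone-om
  ozone-s3g:
    depends_on:
      - ozone-scm
      - ozone-om
      - ozone-datanode
```

The short list form of `depends_on` only controls the order in which containers are started; it does not wait for a service to be ready, so any in-script waiting would still be needed.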