116 changes: 87 additions & 29 deletions site/docs/spark-quickstart.md
@@ -37,6 +37,28 @@ Once you have those, save the yaml below into a file named `docker-compose.yml`:

````diff
 
 ```yaml
 
+x-image:
+  &ozone-image
+  image: ${OZONE_IMAGE:-apache/ozone}:${OZONE_IMAGE_VERSION:-2.1.0}${OZONE_IMAGE_FLAVOR:-}
+
+x-common-config:
+  &ozone-common-config
+  OZONE-SITE.XML_hdds.datanode.dir: "/data/hdds"
+  OZONE-SITE.XML_ozone.metadata.dirs: "/data/metadata"
+  OZONE-SITE.XML_ozone.om.address: "ozone-om"
+  OZONE-SITE.XML_ozone.om.http-address: "ozone-om:9874"
+  OZONE-SITE.XML_ozone.recon.address: "ozone-recon:9891"
+  OZONE-SITE.XML_ozone.recon.db.dir: "/data/metadata/recon"
+  OZONE-SITE.XML_ozone.replication: "1"
+  OZONE-SITE.XML_ozone.scm.block.client.address: "ozone-scm"
+  OZONE-SITE.XML_ozone.scm.client.address: "ozone-scm"
+  OZONE-SITE.XML_ozone.scm.datanode.id.dir: "/data/metadata"
+  OZONE-SITE.XML_ozone.scm.names: "ozone-scm"
+  OZONE-SITE.XML_hdds.scm.safemode.min.datanode: "1"
+  OZONE-SITE.XML_hdds.scm.safemode.healthy.pipeline.pct: "0"
+  OZONE-SITE.XML_ozone.s3g.domain.name: "s3.ozone"
+  no_proxy: "ozone-om,ozone-recon,ozone-scm,ozone-s3g,localhost,127.0.0.1"
+
 services:
   spark-iceberg:
     image: tabulario/spark-iceberg
````
@@ -46,7 +68,7 @@ services:

```diff
       iceberg_net:
     depends_on:
       - rest
-      - minio
+      - ozone-s3g
     volumes:
       - ./warehouse:/home/iceberg/warehouse
       - ./notebooks:/home/iceberg/notebooks/notebooks
```
@@ -72,41 +94,76 @@ services:

```diff
       - AWS_REGION=us-east-1
       - CATALOG_WAREHOUSE=s3://warehouse/
       - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
-      - CATALOG_S3_ENDPOINT=http://minio:9000
-  minio:
-    image: minio/minio
-    container_name: minio
+      - CATALOG_S3_ENDPOINT=http://s3.ozone:9878
+
+  ozone-datanode:
+    <<: *ozone-image
+    ports:
+      - 9864
+    command: ["ozone", "datanode"]
     environment:
-      - MINIO_ROOT_USER=admin
-      - MINIO_ROOT_PASSWORD=password
-      - MINIO_DOMAIN=minio
+      <<: *ozone-common-config
```
**Copilot AI** (Jan 12, 2026):

The ozone-datanode service is missing a `depends_on` clause to ensure proper startup order. The datanode requires both ozone-scm (Storage Container Manager) and ozone-om (Ozone Manager) to be running before it can start successfully. Without this dependency, the datanode may fail to start or experience startup issues.

Suggested change:

```diff
       <<: *ozone-common-config
+    depends_on:
+      - ozone-scm
+      - ozone-om
```
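A plain `depends_on` list only orders container *start*, not service readiness. If stronger gating is wanted, a Compose healthcheck can be combined with `condition: service_healthy`. A minimal sketch, assuming a TCP probe (`nc`) is available inside the image, which the apache/ozone image does not guarantee:

```yaml
  ozone-scm:
    <<: *ozone-image
    command: ["ozone", "scm"]
    healthcheck:
      # hypothetical probe: passes once SCM's RPC port accepts connections
      test: ["CMD-SHELL", "nc -z localhost 9876"]
      interval: 5s
      retries: 30

  ozone-datanode:
    <<: *ozone-image
    command: ["ozone", "datanode"]
    depends_on:
      ozone-scm:
        condition: service_healthy
```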
```diff
     networks:
       iceberg_net:
-        aliases:
-          - warehouse.minio
+
+  ozone-om:
+    <<: *ozone-image
     ports:
-      - 9001:9001
-      - 9000:9000
-    command: ["server", "/data", "--console-address", ":9001"]
-  mc:
-    depends_on:
-      - minio
-    image: minio/mc
-    container_name: mc
+      - 9874:9874
     environment:
+      <<: *ozone-common-config
```
**Copilot AI** (Jan 12, 2026):

The proxy user configuration allows any host and any group to proxy as the hadoop user, which is overly permissive for a production environment. While this may be acceptable for a quickstart/demo environment, consider adding a comment in the documentation to warn users not to use these settings in production, or at minimum document that these are demo-only settings for ease of use.

Suggested change:

```diff
       <<: *ozone-common-config
+      # WARNING: The following proxyuser settings are for quickstart/demo use only.
+      # Do not use this configuration as-is in production; restrict hosts and groups appropriately.
```
```diff
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
+      CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
+      ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
+      WAITFOR: ozone-scm:9876
```
**Copilot AI** (Jan 12, 2026):

The WAITFOR configuration on line 118 specifies 'ozone-scm:9876' to wait for the SCM service, but there's no corresponding WAITFOR configuration in the ozone-scm service to wait for its own initialization. The ozone-scm service is the first in the dependency chain but doesn't have any startup coordination. Consider verifying that the apache/ozone image handles the ENSURE_SCM_INITIALIZED check properly without needing an explicit WAITFOR, or document this dependency ordering requirement.
```diff
+    command: ["ozone", "om"]
+    networks:
+      iceberg_net:
```
Comment on lines +109 to 121:

**Copilot AI** (Jan 12, 2026):

The container name for the ozone-om service is not explicitly set, while other services in the docker-compose file (like spark-iceberg and iceberg-rest) have container_name specified. For consistency and easier debugging, consider adding explicit container names for all Ozone services, such as 'container_name: ozone-om' for the ozone-om service.
```diff
+
+  ozone-scm:
+    <<: *ozone-image
+    ports:
+      - 9876:9876
     environment:
-      - AWS_ACCESS_KEY_ID=admin
-      - AWS_SECRET_ACCESS_KEY=password
-      - AWS_REGION=us-east-1
-    entrypoint: |
-      /bin/sh -c "
-      until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
-      /usr/bin/mc rm -r --force minio/warehouse;
-      /usr/bin/mc mb minio/warehouse;
-      /usr/bin/mc policy set public minio/warehouse;
-      tail -f /dev/null
-      "
+      <<: *ozone-common-config
+      ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
+    command: ["ozone", "scm"]
+    networks:
+      iceberg_net:
+
+  ozone-recon:
+    <<: *ozone-image
+    ports:
+      - 9888:9888
+    environment:
+      <<: *ozone-common-config
```
**Copilot AI** (Jan 12, 2026):

The ozone-recon service should have a depends_on clause to ensure the ozone-scm and ozone-om services are available before it starts. Recon is the monitoring and management service for Ozone and needs the other core services to be operational to function properly.

Suggested change:

```diff
       <<: *ozone-common-config
+    depends_on:
+      - ozone-scm
+      - ozone-om
```
```diff
+    command: ["ozone", "recon"]
+    networks:
+      iceberg_net:
```
Comment on lines +99 to +142:

**Copilot AI** (Jan 12, 2026):

The Ozone services (ozone-datanode, ozone-om, ozone-scm, ozone-recon) store data in directories like /data/hdds and /data/metadata, but these directories are not mounted to host volumes. This means that when the containers are stopped or removed, all data will be lost. While this might be acceptable for a quickstart/demo environment, consider documenting this behavior or adding optional volume mounts for users who want to persist data across container restarts.
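For local experiments where Ozone data should survive `docker compose down`, named volumes can be mounted at the data paths this file configures (`/data/hdds` and `/data/metadata`). A sketch; the volume names `ozone-data` and `ozone-metadata` are illustrative, not from the PR:

```yaml
  ozone-datanode:
    <<: *ozone-image
    command: ["ozone", "datanode"]
    volumes:
      - ozone-data:/data/hdds
      - ozone-metadata:/data/metadata

volumes:
  ozone-data:
  ozone-metadata:
```

Each Ozone service would need its own volume pair, since the services write independent metadata under `/data/metadata`.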
```diff
+
+  ozone-s3g:
+    <<: *ozone-image
+    ports:
+      - 9878:9878
+    environment:
+      <<: *ozone-common-config
```
**Copilot AI** (Jan 12, 2026):

The Ozone S3 Gateway (ozone-s3g) configuration is missing AWS credential environment variables, while the spark-iceberg and rest services are configured with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For the S3 API to work properly with authentication, the Ozone S3 Gateway needs to be configured to either accept these credentials or operate in anonymous mode. Without proper credential configuration, there may be authentication mismatches when Spark tries to access the S3 endpoint with the provided credentials.

Suggested change:

```diff
       <<: *ozone-common-config
+      - AWS_ACCESS_KEY_ID=admin
+      - AWS_SECRET_ACCESS_KEY=password
+      - AWS_REGION=us-east-1
```
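Note that this `environment:` block is a mapping (it uses the `<<: *ozone-common-config` merge key), so the list-style `- KEY=value` entries in the suggestion above cannot be mixed in as written. If credentials were added, they would need mapping form. A sketch using the same demo values found elsewhere in this file; whether the S3 Gateway actually consumes these variables is not established here, this only shows merge-compatible syntax:

```yaml
  ozone-s3g:
    <<: *ozone-image
    environment:
      <<: *ozone-common-config
      AWS_ACCESS_KEY_ID: admin
      AWS_SECRET_ACCESS_KEY: password
      AWS_REGION: us-east-1
```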
```diff
+    command:
+      - sh
+      - -c
+      - |
+        set -e
+        ozone s3g &
+        s3g_pid=$$!
+        until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
+        ozone sh bucket delete /s3v/warehouse || true
+        ozone sh bucket create /s3v/warehouse
```
**Copilot AI** (Jan 12, 2026):

The script starts the S3 gateway in the background and then performs administrative operations. However, there's no check to ensure the S3 gateway itself is fully operational before proceeding with bucket operations. The wait only checks if ozone-om responds to 'volume list', but doesn't verify that the S3 gateway is ready. This could lead to race conditions where bucket operations might fail if the S3 gateway hasn't fully initialized. Consider adding a health check or retry logic for the bucket operations.

Suggested change:

```diff
-        ozone sh bucket create /s3v/warehouse
+        max_retries=30
+        retry_delay=2
+        attempt=0
+        until ozone sh bucket create /s3v/warehouse >/dev/null 2>&1; do
+          attempt=$$((attempt + 1))
+          if [ "$$attempt" -ge "$$max_retries" ]; then
+            echo "Failed to create bucket /s3v/warehouse after $$max_retries attempts"
+            exit 1
+          fi
+          echo '...waiting for S3 gateway to be ready for bucket operations...'
+          sleep "$$retry_delay"
+        done
```
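The retry pattern in the suggestion above can be factored into a small reusable helper. A sketch in plain POSIX sh; the `retry` helper and its arguments are illustrative and not part of the Ozone image:

```shell
#!/bin/sh
# retry MAX DELAY CMD...  -- run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts; returns 1 if every attempt fails.
retry() {
  max=$1; delay=$2; shift 2
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $max attempts" >&2
      return 1
    fi
    sleep "$delay"
  done
}

# A command that succeeds passes through immediately...
retry 3 0 true && echo "created"
# ...while one that never succeeds fails after MAX attempts.
retry 3 0 false 2>/dev/null || echo "gave up"
```

Inside the compose file the call would look like `retry 30 2 ozone sh bucket create /s3v/warehouse`, with `$` doubled to `$$` so Compose does not interpolate the shell variables.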
```diff
+        wait "$$s3g_pid"
+    networks:
+      iceberg_net:
+        aliases:
+          - s3.ozone
+          - warehouse.s3.ozone
```
Comment on lines +144 to +165:

**Copilot AI** (Jan 12, 2026):

The ozone-s3g service is missing a depends_on clause to ensure the prerequisite services are started first. The S3 Gateway requires ozone-scm, ozone-om, and ozone-datanode to be available before it can successfully handle S3 requests. While the shell script waits for ozone-om to respond, the depends_on clause is still needed to prevent the container from starting prematurely and ensure proper orchestration.
```diff
 
 networks:
   iceberg_net:
```
@@ -149,6 +206,7 @@ using `demo.nyc.taxis` where `demo` is the catalog name, `nyc` is the database name

````diff
 === "SparkSQL"
 
     ```sql
+    CREATE NAMESPACE IF NOT EXISTS demo.nyc;
     CREATE TABLE demo.nyc.taxis
     (
       vendor_id bigint,
````