Description
Background
The discussion in #14638 is about replacing MinIO in the Spark Quickstart, because the MinIO open-source repository has entered maintenance mode and has unpatched security issues. PR #14928 currently proposes RustFS as the replacement. Building on that, I would like to propose an additional solution: adding Apache Ozone as an optional storage backend in the Docker Compose Spark Quickstart example.
Motivation
Apache Ozone is an actively maintained Apache top-level project (its latest release, 2.0.0, came out in 2025). It is fully open source under the Apache 2.0 license and is governed by the Apache Software Foundation rather than a single vendor, so it is free from commercialization risk. It provides an S3-compatible API (usable via s3a://) as well as native Hadoop file system semantics (ofs://), which makes it a good fit for the Iceberg + Spark/Hadoop ecosystem. Ozone is used in production for big data and AI workloads, where it scales well and integrates with the surrounding tooling.
Adding Ozone as an option would better demonstrate that Iceberg is storage-agnostic and would give users more choices for local testing, making the Quickstart more flexible.
Implementation Plan
The storage backend will be switched via an environment variable, for example:
STORAGE_BACKEND=minio|rustfs|ozone # Default is minio
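As a rough illustration of how such a variable could be read and validated, here is a minimal Python sketch. Only the variable name, the three values, and the minio default come from this proposal; the function name `resolve_backend` and the choice to treat an empty value as unset are assumptions.

```python
import os

# The three values and the default come from the proposal above.
SUPPORTED_BACKENDS = ("minio", "rustfs", "ozone")
DEFAULT_BACKEND = "minio"


def resolve_backend(env=None):
    """Return the selected backend name, falling back to the default.

    `env` is any mapping; it defaults to the process environment.
    An unset or empty STORAGE_BACKEND is treated as the default (assumption).
    """
    env = os.environ if env is None else env
    name = (env.get("STORAGE_BACKEND") or DEFAULT_BACKEND).strip().lower()
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported STORAGE_BACKEND {name!r}; "
            f"expected one of {', '.join(SUPPORTED_BACKENDS)}"
        )
    return name
```

Failing fast on an unknown value keeps a typo such as `STORAGE_BACKEND=ozon` from silently starting the default backend.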
The implementation includes:
- Adding an Ozone service: Use the official apache/ozone Docker image in a single-node freestyle mode, which starts quickly and has low resource usage.
- Configuring Spark conditionally: Generate spark-defaults.conf based on the selected backend, setting values such as the warehouse path and the S3 endpoint.
- Updating the README: Provide clear instructions in the Quickstart README on how to enable Ozone as a storage backend, so users can easily configure and test it.
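The conditional Spark configuration step might look roughly like the following Python sketch, which maps a backend name to spark-defaults.conf lines. The `warehouse`, `s3.endpoint`, and `s3.path-style-access` keys are Iceberg catalog properties; the catalog name `demo`, the service hostnames, the bucket name, and the RustFS port are assumptions for illustration, not values taken from the Quickstart. Ports 9000 and 9878 are the usual defaults for MinIO and the Ozone S3 Gateway.

```python
# Hypothetical per-backend settings. Hostnames and the warehouse bucket are
# placeholders; adjust them to the service names used in docker-compose.yml.
BACKEND_SETTINGS = {
    "minio": {"endpoint": "http://minio:9000", "warehouse": "s3://warehouse/"},
    "rustfs": {"endpoint": "http://rustfs:9000", "warehouse": "s3://warehouse/"},
    "ozone": {"endpoint": "http://ozone-s3g:9878", "warehouse": "s3://warehouse/"},
}


def spark_conf_lines(backend, catalog="demo"):
    """Build the backend-specific spark-defaults.conf lines for one catalog."""
    if backend not in BACKEND_SETTINGS:
        raise ValueError(f"Unsupported backend: {backend!r}")
    settings = BACKEND_SETTINGS[backend]
    prefix = f"spark.sql.catalog.{catalog}"
    return [
        f"{prefix}.warehouse {settings['warehouse']}",
        f"{prefix}.s3.endpoint {settings['endpoint']}",
        # Path-style access is typically needed for S3-compatible stores
        # that are addressed by hostname rather than by bucket subdomain.
        f"{prefix}.s3.path-style-access true",
    ]


if __name__ == "__main__":
    # Print the fragment that would be appended to spark-defaults.conf.
    print("\n".join(spark_conf_lines("ozone")))
```

Keeping the per-backend differences in one table means that adding another S3-compatible store only requires a new entry, while the rest of the Spark configuration stays shared.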
Next Steps
If the community finds this proposal valuable, I plan to submit a PR that includes the following:
- Update docker-compose.yml: Add the Ozone service and dynamically adjust service configurations based on the STORAGE_BACKEND environment variable.
- Modify the spark-defaults.conf template: Dynamically configure the relevant parameters (e.g., warehouse path, S3 endpoint) based on the selected storage backend.
- Update the README: Clearly explain how to switch between storage backends and provide steps for enabling Ozone.