### Describe the bug
I am using Polaris to connect to an internal S3 service. The internal S3 does not provide Credential Vending / STS. I tried to configure Polaris so it can connect to the internal S3 without STS, but with the current setup Polaris still tries to access the internet and resolves the internal S3 domain externally.
Below are the exact configurations I used (Docker Compose, Spark image Dockerfile, catalog payload and Spark configuration).
#### Docker Compose (relevant parts)

```yaml
services:
  postgres:
    image: postgres:17
    container_name: polaris-postgres
    ports:
      - "5433:5432"
    shm_size: 128mb
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: polaris
      POSTGRES_INITDB_ARGS: "--encoding UTF8 --data-checksums"
    volumes:
      - type: bind
        source: ./pg-config/postgresql.conf
        target: /etc/postgresql/postgresql.conf
    command:
      - "postgres"
      - "-c"
      - "config_file=/etc/postgresql/postgresql.conf"
    healthcheck:
      test: "pg_isready -U postgres"
      interval: 5s
      timeout: 2s
      retries: 15
    networks:
      - polaris_net

  polaris-bootstrap:
    image: apache/polaris-admin-tool:latest
    container_name: polaris-bootstrap
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
    command:
      - "bootstrap"
      - "--realm=polaris"
      - "--credential=polaris,root,s3cr3t"
    networks:
      - polaris_net

  polaris:
    image: apache/polaris:latest
    container_name: polaris-container
    ports:
      - "8181:8181" # API
      - "8182:8182" # management (metrics/health)
      - "5005:5005" # optional JVM debugger
    environment:
      # Skip credential subscoping indirection
      polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
      polaris.features.realm-overrides."polaris"."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
      # Java debug
      JAVA_DEBUG: true
      JAVA_DEBUG_PORT: "*:5005"
      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES: 5
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS: 100
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DURATION_IN_MS: 5000
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
      POLARIS_REALM_CONTEXT_REALMS: polaris
      QUARKUS_OTEL_SDK_DISABLED: true
      POLARIS_BOOTSTRAP_CREDENTIALS: polaris,root,s3cr3t
      polaris.features."ALLOW_INSECURE_STORAGE_TYPES": "false"
      polaris.features."SUPPORTED_CATALOG_STORAGE_TYPES": "[\"S3\",\"GCS\",\"AZURE\"]"
      polaris.readiness.ignore-severe-issues: "true"
      # S3 configuration
      polaris.features."ALLOW_SETTING_S3_ENDPOINTS": "true"
      polaris.storage.aws.access-key: "abcd"
      polaris.storage.aws.secret-key: "abcd"
      AWS_REGION: "us-east-1"
      AWS_ACCESS_KEY_ID: "abcd"
      AWS_SECRET_ACCESS_KEY: "abcd"
      AWS_ENDPOINT_URL: "https://s3-it-hn.company.net"
      # Proxy / no-proxy
      HTTP_PROXY: http://proxy.company.vn:80
      HTTPS_PROXY: http://proxy.company.vn:80
      JAVA_OPTS_APPEND: '-Dhttp.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net" -Dhttps.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"'
      NO_PROXY: localhost,127.0.0.1,.s3-it-hn.company.net,s3-it-hn.company.net
    depends_on:
      polaris-bootstrap:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8182/q/health"]
      interval: 2s
      timeout: 10s
      retries: 10
      start_period: 10s
    networks:
      - polaris_net

  polaris-setup:
    image: alpine/curl
    container_name: polaris-setup
    depends_on:
      polaris:
        condition: service_healthy
    environment:
      HTTP_PROXY: http://proxy.company.vn:80
      HTTPS_PROXY: http://proxy.company.vn:80
      NO_PROXY: localhost,127.0.0.1,.s3-it-hn.company.net,s3-it-hn.company.net
      STORAGE_LOCATION: "s3://it-demo-data/testxxx"
    volumes:
      - ./setup-polaris:/polaris
    entrypoint: '/bin/sh -c "chmod +x /polaris/create-catalog.sh && /polaris/create-catalog.sh"'
    networks:
      - polaris_net

  spark-jupyter:
    build: .
    container_name: polaris-spark-jupyter
    depends_on:
      polaris-setup:
        condition: service_completed_successfully
    ports:
      - "8888:8888"
      - "4040:4040"
    volumes:
      - ./notebooks:/home/jovyan/work
    healthcheck:
      test: "curl localhost:8888"
      interval: 5s
      retries: 15
    networks:
      - polaris_net

networks:
  polaris_net:
    external: true
    name: open-metadata_app_net
```

#### Dockerfile (Spark image)
```dockerfile
FROM apache/spark:3.5.8-java17-python3
USER root

ENV HTTP_PROXY=http://proxy.company.vn:80/
ENV HTTPS_PROXY=http://proxy.company.vn:80/
ENV NO_PROXY=localhost,127.0.0.1,polaris-postgres,polaris-container

RUN pip install --no-cache-dir \
        jupyterlab \
        notebook \
        pyspark==3.5.8 \
    && mkdir -p /home/jovyan/work

RUN mkdir -p /opt/spark/jars && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
        https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.10.0/iceberg-spark-runtime-3.5_2.12-1.10.0.jar -O /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
        https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.0/iceberg-aws-bundle-1.10.0.jar -O /opt/spark/jars/iceberg-aws-bundle-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
        https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -O /opt/spark/jars/hadoop-aws-3.3.4.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
        https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.797/aws-java-sdk-bundle-1.12.797.jar -O /opt/spark/jars/aws-java-sdk-bundle-1.12.797.jar

# Clear proxy environment variables so they won't be present in later layers / at runtime
ENV HTTP_PROXY=""
ENV HTTPS_PROXY=""
ENV NO_PROXY=""

ENV JUPYTER_TOKEN=polaris
ENV JUPYTER_ENABLE_LAB=yes

RUN useradd -m -s /bin/bash jovyan && chown -R jovyan /home/jovyan
USER jovyan
WORKDIR /home/jovyan/work
EXPOSE 8888 4040
CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=$JUPYTER_TOKEN
```

#### Catalog creation payload (create-catalog.sh / jq payload)
```sh
PAYLOAD=$(jq -n \
  --arg name "quickstart_catalog" \
  --arg defaultBase "$STORAGE_LOCATION" \
  '{
    catalog: {
      name: $name,
      type: "INTERNAL",
      properties: { "default-base-location": $defaultBase },
      storageConfigInfo: {
        storageType: "S3",
        pathStyleAccess: true,
        stsUnavailable: true,
        endpoint: "https://s3-it-hn.company.net",
        endpointInternal: "https://s3-it-hn.company.net"
      }
    }
  }'
)
```

#### Spark configuration (Python snippet)
```python
import os
import sys

from pyspark.sql import SparkSession

S3_HOST = "s3-it-hn.company.net"
current_python = sys.executable
os.environ["PYSPARK_PYTHON"] = current_python
os.environ["PYSPARK_DRIVER_PYTHON"] = current_python
os.environ["NO_PROXY"] = S3_HOST
os.environ["no_proxy"] = S3_HOST

# ---- JVM proxy settings (driver + executor) ----
proxy_host = "proxy.company.vn"  # bare hostname: -Dhttp.proxyHost must not include a scheme
proxy_port = "80"
java_gc_opts = (
    " -XX:+UseG1GC"
    " -XX:InitiatingHeapOccupancyPercent=50"
    " -XX:OnOutOfMemoryError='kill -9 %p'"
)
jvm_opts = (
    f"-Dhttp.proxyHost={proxy_host} -Dhttp.proxyPort={proxy_port} "
    f"-Dhttps.proxyHost={proxy_host} -Dhttps.proxyPort={proxy_port} "
    f"-Dhttp.nonProxyHosts={S3_HOST} -Dhttps.nonProxyHosts={S3_HOST} "
    "-Djava.net.useSystemProxies=false "
    "-Daws.java.v1.disableDeprecationAnnouncement=true"
    f"{java_gc_opts}"
)

spark = (
    SparkSession.builder
    .appName("SparkIceberg")
    .master("local[12]")
    .config("spark.driver.memory", "12g")
    .config("spark.executor.memory", "12g")
    .config("spark.driver.memoryOverhead", "1024m")
    # JARs and JVM options
    .config("spark.driver.extraJavaOptions", jvm_opts)
    .config("spark.executor.extraJavaOptions", jvm_opts)
    # Iceberg catalog
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.iceberg.vectorization.enabled", "false")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    .config("spark.sql.catalog.polaris.token-refresh-enabled", "false")
    .config("spark.sql.catalog.polaris.credential", f"{polaris_client_id}:{polaris_client_secret}")
    .config("spark.sql.catalog.polaris.warehouse", catalog_name)
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    # S3A settings
    .config("spark.hadoop.fs.s3a.endpoint", S3_HOST)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
print("Spark Session created successfully!")
```

### To Reproduce
When I run these commands:

```python
spark.sql("USE polaris")
spark.sql("SHOW NAMESPACES").show()
spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST")
spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST.PUBLIC")
spark.sql("SHOW NAMESPACES IN COLLADO_TEST").show()
spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
    id bigint NOT NULL COMMENT 'unique id',
    data string)
USING iceberg;
""")
```

This Spark configuration can connect directly to S3 without problems. However, when I connect through Polaris, creating the namespaces succeeds, but creating a table in the namespace fails with:
```
Py4JJavaError                             Traceback (most recent call last)
Cell In[5], line 2
      1 spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
----> 2 spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
      3     id bigint NOT NULL COMMENT 'unique id',
      4     data string)
      5     USING iceberg;
      6 """)
: org.apache.iceberg.exceptions.ServiceFailureException: Server error: SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: it-demo-data.s3-it-hn.company.net: Name or service not known (SDK Attempt Count: 6)
```
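One detail from the stack trace: the failing hostname is `it-demo-data.s3-it-hn.company.net`, i.e. the bucket name has been prepended to the endpoint (virtual-host-style addressing), even though the catalog was created with `pathStyleAccess: true`. A minimal sketch of the two addressing styles, as an illustration of what I mean (my own helper function, not Polaris code):

```python
def s3_request_host(endpoint_host: str, bucket: str, path_style: bool) -> str:
    """Return the host an S3 client contacts for a given bucket."""
    if path_style:
        # Path-style: the bucket goes in the URL path; the host stays the endpoint.
        return endpoint_host
    # Virtual-host style: the bucket becomes an extra DNS label on the endpoint.
    return f"{bucket}.{endpoint_host}"

endpoint = "s3-it-hn.company.net"
print(s3_request_host(endpoint, "it-demo-data", path_style=True))   # s3-it-hn.company.net
print(s3_request_host(endpoint, "it-demo-data", path_style=False))  # it-demo-data.s3-it-hn.company.net
```

The second form is exactly the host in the `UnknownHostException`, which suggests the path-style setting is not being applied on the Polaris side.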
When I uncomment the proxy settings (HTTP_PROXY, HTTPS_PROXY) in polaris-container, I also encounter a similar error:

```yaml
# HTTP_PROXY: http://proxy.company.vn:80
# HTTPS_PROXY: http://proxy.company.vn:80
```

### Actual Behavior
Despite the configuration above, Polaris (or a component in the stack) still attempts to reach the public internet and resolve the internal S3 domain externally. This fails because the internal S3 host is only resolvable and reachable from within our internal network, and the setup should not require STS.
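For reference, the resolution behaviour can be checked from inside the running container; a diagnostic sketch (I have not verified that `getent` and `curl` are present in the `apache/polaris` image):

```shell
# Does the Polaris container resolve the internal S3 host at all?
docker exec polaris-container getent hosts s3-it-hn.company.net

# Does it resolve the virtual-host-style name from the stack trace?
docker exec polaris-container getent hosts it-demo-data.s3-it-hn.company.net

# Can it reach the endpoint directly, bypassing any proxy?
docker exec polaris-container curl -sk --noproxy '*' \
    https://s3-it-hn.company.net -o /dev/null -w '%{http_code}\n'
```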
### Expected Behavior
Polaris should connect to the internal S3 endpoint https://s3-it-hn.company.net directly, using the provided static access/secret keys and `stsUnavailable: true`, without attempting to resolve or contact anything on the public internet (and without using an STS / credential-vending service).
NO_PROXY / no_proxy and the JVM -Dhttp.nonProxyHosts / -Dhttps.nonProxyHosts options are set wherever applicable to exclude the internal S3 host from the corporate proxy.
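Note that the two exclusion formats have different semantics: JVM `nonProxyHosts` is `|`-separated with glob patterns, while `NO_PROXY` is comma-separated with leading-dot suffix matching. A small self-contained check of my lists against the failing host, written as an illustration of the standard semantics (my own helper functions, not SDK code):

```python
import fnmatch

def jvm_bypasses(host: str, non_proxy_hosts: str) -> bool:
    """JVM http.nonProxyHosts semantics: '|'-separated glob patterns."""
    return any(fnmatch.fnmatch(host, pat) for pat in non_proxy_hosts.split("|"))

def no_proxy_bypasses(host: str, no_proxy: str) -> bool:
    """curl-style NO_PROXY semantics: comma-separated entries; a leading dot
    means suffix match, otherwise exact match."""
    for entry in no_proxy.split(","):
        entry = entry.strip()
        if entry.startswith("."):
            if host.endswith(entry) or host == entry[1:]:
                return True
        elif host == entry:
            return True
    return False

jvm_list = "localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"
no_proxy_list = "localhost,127.0.0.1,.s3-it-hn.company.net,s3-it-hn.company.net"
failing_host = "it-demo-data.s3-it-hn.company.net"  # from the stack trace

print(jvm_bypasses(failing_host, jvm_list))            # True
print(no_proxy_bypasses(failing_host, no_proxy_list))  # True
```

So at least on paper both exclusion lists cover the virtual-host name that fails to resolve.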
### Additional context
Is there any documentation covering this scenario? I have read through all the related issues but have not been able to resolve the problem.
### System information
Docker / Docker Compose