Polaris attempts to resolve internal S3 host via the internet — cannot connect to internal S3 without STS / Credential Vending #3640

@tuanit03

Description

Describe the bug

I am using Polaris to connect to an internal S3 service. The internal S3 does not provide Credential Vending / STS. I tried to configure Polaris so it can connect to the internal S3 without STS, but with the current setup Polaris still tries to reach the internet and resolve the internal S3 domain externally.

Below are the exact configurations I used (Docker Compose, Spark image Dockerfile, catalog payload and Spark configuration).


Docker Compose (relevant parts)

services:
  postgres:
    image: postgres:17
    container_name: polaris-postgres
    ports:
      - "5433:5432"
    shm_size: 128mb
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: polaris
      POSTGRES_INITDB_ARGS: "--encoding UTF8 --data-checksums"
    volumes:
      - type: bind
        source: ./pg-config/postgresql.conf
        target: /etc/postgresql/postgresql.conf
    command:
      - "postgres"
      - "-c"
      - "config_file=/etc/postgresql/postgresql.conf"
    healthcheck:
      test: "pg_isready -U postgres"
      interval: 5s
      timeout: 2s
      retries: 15
    networks:
      - polaris_net

  polaris-bootstrap:
    image: apache/polaris-admin-tool:latest
    container_name: polaris-bootstrap
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
    command:
      - "bootstrap"
      - "--realm=polaris"
      - "--credential=polaris,root,s3cr3t"
    networks:
      - polaris_net

  polaris:
    image: apache/polaris:latest
    container_name: polaris-container
    ports:
      - "8181:8181"   # API
      - "8182:8182"   # management (metrics/health)
      - "5005:5005"   # optional JVM debugger
    environment:
      # Skip credential subscoping indirection
      polaris.features."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"
      polaris.features.realm-overrides."polaris"."SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION": "true"

      # Java debug
      JAVA_DEBUG: true
      JAVA_DEBUG_PORT: "*:5005"

      POLARIS_PERSISTENCE_TYPE: relational-jdbc
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_RETRIES: 5
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_INITIAL_DELAY_IN_MS: 100
      POLARIS_PERSISTENCE_RELATIONAL_JDBC_MAX_DURATION_IN_MS: 5000
      QUARKUS_DATASOURCE_JDBC_URL: jdbc:postgresql://postgres:5432/polaris
      QUARKUS_DATASOURCE_USERNAME: postgres
      QUARKUS_DATASOURCE_PASSWORD: postgres
      POLARIS_REALM_CONTEXT_REALMS: polaris
      QUARKUS_OTEL_SDK_DISABLED: true
      POLARIS_BOOTSTRAP_CREDENTIALS: polaris,root,s3cr3t
      polaris.features."ALLOW_INSECURE_STORAGE_TYPES": "false"
      polaris.features."SUPPORTED_CATALOG_STORAGE_TYPES": "[\"S3\",\"GCS\",\"AZURE\"]"
      polaris.readiness.ignore-severe-issues: "true"

      # S3 Configuration
      polaris.features."ALLOW_SETTING_S3_ENDPOINTS": "true"
      polaris.storage.aws.access-key: "abcd"
      polaris.storage.aws.secret-key: "abcd"
      AWS_REGION: "us-east-1"
      AWS_ACCESS_KEY_ID: "abcd"
      AWS_SECRET_ACCESS_KEY: "abcd"
      AWS_ENDPOINT_URL: "https://s3-it-hn.company.net"

      # Proxy / no-proxy
      HTTP_PROXY: http://proxy.company.vn:80
      HTTPS_PROXY: http://proxy.company.vn:80
      JAVA_OPTS_APPEND: '-Dhttp.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net" -Dhttps.nonProxyHosts="localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"'
      NO_PROXY: localhost,127.0.0.1,.s3-it-hn.company.net,s3-it-hn.company.net
    depends_on:
      polaris-bootstrap:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "curl", "http://localhost:8182/q/health"]
      interval: 2s
      timeout: 10s
      retries: 10
      start_period: 10s
    networks:
      - polaris_net

  polaris-setup:
    image: alpine/curl
    container_name: polaris-setup
    depends_on:
      polaris:
        condition: service_healthy
    environment:
      HTTP_PROXY: http://proxy.company.vn:80
      HTTPS_PROXY: http://proxy.company.vn:80
      NO_PROXY: localhost,127.0.0.1,.s3-it-hn.company.net,s3-it-hn.company.net
      STORAGE_LOCATION: "s3://it-demo-data/testxxx"
    volumes:
      - ./setup-polaris:/polaris
    entrypoint: '/bin/sh -c "chmod +x /polaris/create-catalog.sh && /polaris/create-catalog.sh"'
    networks:
      - polaris_net

  spark-jupyter:
    build: .
    container_name: polaris-spark-jupyter
    depends_on:
      polaris-setup:
        condition: service_completed_successfully
    ports:
      - "8888:8888"
      - "4040:4040"
    volumes:
      - ./notebooks:/home/jovyan/work
    healthcheck:
      test: "curl localhost:8888"
      interval: 5s
      retries: 15
    networks:
      - polaris_net

networks:
  polaris_net:
    external: true
    name: open-metadata_app_net

Dockerfile (Spark image)

FROM apache/spark:3.5.8-java17-python3

USER root

ENV HTTP_PROXY=http://proxy.company.vn:80/
ENV HTTPS_PROXY=http://proxy.company.vn:80/
ENV NO_PROXY=localhost,127.0.0.1,polaris-postgres,polaris-container

RUN pip install --no-cache-dir \
    jupyterlab \
    notebook \
    pyspark==3.5.8 \
    && mkdir -p /home/jovyan/work

RUN mkdir -p /opt/spark/jars && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
    https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.10.0/iceberg-spark-runtime-3.5_2.12-1.10.0.jar -O /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
    https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-aws-bundle/1.10.0/iceberg-aws-bundle-1.10.0.jar -O /opt/spark/jars/iceberg-aws-bundle-1.10.0.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
    https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -O /opt/spark/jars/hadoop-aws-3.3.4.jar && \
    wget -e use_proxy=yes -e http_proxy=$HTTP_PROXY -e https_proxy=$HTTPS_PROXY \
    https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.797/aws-java-sdk-bundle-1.12.797.jar -O /opt/spark/jars/aws-java-sdk-bundle-1.12.797.jar

# Clear proxy environment variables so they won't be present in later layers / at runtime
ENV HTTP_PROXY=""
ENV HTTPS_PROXY=""
ENV NO_PROXY=""

ENV JUPYTER_TOKEN=polaris
ENV JUPYTER_ENABLE_LAB=yes
RUN useradd -m -s /bin/bash jovyan && chown -R jovyan /home/jovyan
USER jovyan
WORKDIR /home/jovyan/work

EXPOSE 8888 4040

CMD jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=$JUPYTER_TOKEN

Catalog creation payload (create-catalog.sh / jq payload)

PAYLOAD=$(jq -n \
  --arg name "quickstart_catalog" \
  --arg defaultBase "$STORAGE_LOCATION" \
  '{
    catalog: {
      name: $name,
      type: "INTERNAL",
      properties: { "default-base-location": $defaultBase },
      storageConfigInfo: {
        storageType: "S3",
        pathStyleAccess: true,
        stsUnavailable: true,
        endpoint: "https://s3-it-hn.company.net",
        endpointInternal: "https://s3-it-hn.company.net"
      }
    }
  }'
)
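For reference, the same payload and its target URL can be sketched in Python. The management-API path (`/api/management/v1/catalogs`) follows the Polaris quickstart; the host and the token handling hinted at in the comment are assumptions based on this compose setup, not taken from `create-catalog.sh`:

```python
import json

# Hypothetical sketch: rebuild the jq payload above in Python and show the
# management-API URL it would be POSTed to. POLARIS_HOST is assumed from the
# compose file; the bearer token would come from the OAuth token endpoint.
POLARIS_HOST = "http://polaris:8181"
CATALOGS_URL = f"{POLARIS_HOST}/api/management/v1/catalogs"

def build_catalog_payload(name: str, default_base: str) -> dict:
    return {
        "catalog": {
            "name": name,
            "type": "INTERNAL",
            "properties": {"default-base-location": default_base},
            "storageConfigInfo": {
                "storageType": "S3",
                "pathStyleAccess": True,
                "stsUnavailable": True,
                "endpoint": "https://s3-it-hn.company.net",
                "endpointInternal": "https://s3-it-hn.company.net",
            },
        }
    }

payload = build_catalog_payload("quickstart_catalog", "s3://it-demo-data/testxxx")
body = json.dumps(payload)
# e.g. requests.post(CATALOGS_URL, data=body,
#                    headers={"Authorization": f"Bearer {token}",
#                             "Content-Type": "application/json"})
```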

Spark configuration (Python snippet)

import os
import sys

from pyspark.sql import SparkSession

S3_HOST = "s3-it-hn.company.net"

current_python = sys.executable
os.environ["PYSPARK_PYTHON"] = current_python
os.environ["PYSPARK_DRIVER_PYTHON"] = current_python

os.environ["NO_PROXY"] = S3_HOST
os.environ["no_proxy"] = S3_HOST

# ---- JVM proxy settings (driver + executor) ----
proxy_host = "proxy.company.vn"  # host name only: -Dhttp.proxyHost must not include a scheme
proxy_port = "80"

java_gc_opts = (
    " -XX:+UseG1GC"
    " -XX:InitiatingHeapOccupancyPercent=50"
    " -XX:OnOutOfMemoryError='kill -9 %p'"
)

jvm_opts = (
    f"-Dhttp.proxyHost={proxy_host} -Dhttp.proxyPort={proxy_port} "
    f"-Dhttps.proxyHost={proxy_host} -Dhttps.proxyPort={proxy_port} "
    f"-Dhttp.nonProxyHosts={S3_HOST} -Dhttps.nonProxyHosts={S3_HOST} "
    "-Djava.net.useSystemProxies=false "  # trailing space needed between flags
    "-Daws.java.v1.disableDeprecationAnnouncement=true"
    f"{java_gc_opts}"  # java_gc_opts begins with a leading space
)

spark = (
    SparkSession.builder
    .appName("SparkIceberg")
    .master("local[12]")
    .config("spark.driver.memory", "12g")
    .config("spark.executor.memory", "12g")
    .config("spark.driver.memoryOverhead", "1024m")
    # JARs and JVM options
    .config("spark.driver.extraJavaOptions", jvm_opts)
    .config("spark.executor.extraJavaOptions", jvm_opts)

    # Iceberg Catalog
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config('spark.sql.iceberg.vectorization.enabled', 'false')
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
    .config("spark.sql.catalog.polaris.token-refresh-enabled", "false")
    .config("spark.sql.catalog.polaris.credential", f"{polaris_client_id}:{polaris_client_secret}")
    .config("spark.sql.catalog.polaris.warehouse", catalog_name)
    .config("spark.sql.catalog.polaris.scope", 'PRINCIPAL_ROLE:ALL')

    # S3A settings
    .config("spark.hadoop.fs.s3a.endpoint", S3_HOST)
    .config("spark.hadoop.fs.s3a.access.key", S3_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", S3_SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
print("Spark Session created successfully!")

To Reproduce

When I run these commands:

spark.sql("USE polaris")
spark.sql("SHOW NAMESPACES").show()

spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST")
spark.sql("CREATE NAMESPACE IF NOT EXISTS COLLADO_TEST.PUBLIC")
spark.sql("SHOW NAMESPACES IN COLLADO_TEST").show()

spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
    id bigint NOT NULL COMMENT 'unique id',
    data string)
USING iceberg;
""")

With this Spark configuration, Spark can read and write the internal S3 directly without issue. When going through Polaris, however, creating the namespaces succeeds, but creating a table in them fails with:

Py4JJavaError                             Traceback (most recent call last)
Cell In[5], line 2
      1 spark.sql("USE NAMESPACE COLLADO_TEST.PUBLIC")
----> 2 spark.sql("""CREATE TABLE IF NOT EXISTS TEST_TABLE (
      3     id bigint NOT NULL COMMENT 'unique id',
      4     data string)
      5 USING iceberg;
      6 """)
: org.apache.iceberg.exceptions.ServiceFailureException: Server error: SdkClientException: Received an UnknownHostException when attempting to interact with a service. See cause for the exact endpoint that is failing to resolve. If this is happening on an endpoint that previously worked, there may be a network connectivity issue or your DNS cache could be storing endpoints for too long.
Caused by: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: it-demo-data.s3-it-hn.company.net: Name or service not known (SDK Attempt Count: 6)
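Note that the failing hostname in the stack trace, `it-demo-data.s3-it-hn.company.net`, is the virtual-hosted-style form (bucket name prepended to the endpoint host), which suggests the server-side S3 client is not honoring `pathStyleAccess` here. A small illustration of the two addressing styles (the helper below is hypothetical, purely for the comparison):

```python
from urllib.parse import urlparse

# Illustrate the two S3 addressing styles. The stack trace shows the
# virtual-hosted-style host, which only resolves if a wildcard DNS record
# (*.s3-it-hn.company.net) exists; the path-style host is the bare endpoint.
ENDPOINT = "https://s3-it-hn.company.net"
BUCKET = "it-demo-data"

def s3_host(endpoint: str, bucket: str, path_style: bool) -> str:
    host = urlparse(endpoint).netloc
    return host if path_style else f"{bucket}.{host}"

print(s3_host(ENDPOINT, BUCKET, path_style=True))   # s3-it-hn.company.net
print(s3_host(ENDPOINT, BUCKET, path_style=False))  # it-demo-data.s3-it-hn.company.net
```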

When I uncomment the proxy section (HTTP_PROXY, HTTPS_PROXY) in polaris-container, I also encounter a similar error:

      # HTTP_PROXY: http://proxy.company.vn:80
      # HTTPS_PROXY: http://proxy.company.vn:80

Actual Behavior

Despite the configuration above, Polaris (or a component in the stack) still attempts to resolve the internal S3 domain via external DNS. This fails because the internal S3 host is only resolvable and reachable from within our internal network, and access to it should not require STS.

Expected Behavior

Polaris should connect to the internal S3 endpoint https://s3-it-hn.company.net directly, using the provided static access/secret keys and stsUnavailable = true, without attempting to resolve or contact anything on the public internet (or use an STS / credential-vending service).

NO_PROXY / no_proxy and the JVM -Dhttp.nonProxyHosts / -Dhttps.nonProxyHosts properties are set wherever applicable to exclude the internal S3 host from the corporate proxy.
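As a sanity check on the exclusion patterns themselves: Java's `http.nonProxyHosts` takes `|`-separated glob patterns, and the sketch below (a hypothetical helper, using Python's `fnmatch` as a stand-in for the JVM matcher) shows that even the bucket-prefixed host is covered by `*.s3-it-hn.company.net`. So the failure points at DNS resolution of the virtual-hosted-style host rather than at the proxy-bypass list:

```python
import fnmatch

# The '|'-separated glob patterns from -Dhttp.nonProxyHosts above.
NON_PROXY = "localhost|127.0.0.1|s3-it-hn.company.net|*.s3-it-hn.company.net"

def bypasses_proxy(host: str, patterns: str = NON_PROXY) -> bool:
    """Return True if the host matches any non-proxy glob pattern."""
    return any(fnmatch.fnmatch(host, p) for p in patterns.split("|"))

print(bypasses_proxy("s3-it-hn.company.net"))               # True
print(bypasses_proxy("it-demo-data.s3-it-hn.company.net"))  # True (wildcard)
print(bypasses_proxy("repo1.maven.org"))                    # False
```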

Additional context

Is there any documentation covering this scenario? I have read through all the related issues but have not been able to resolve the problem.

System information

Docker/Docker compose
