-
Notifications
You must be signed in to change notification settings - Fork 3k
Core, AWS, REST: Promote the S3 signing endpoint to the main spec #15112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ad95a85 to
f3fc095
Compare
| 5XX: | ||
| $ref: '#/components/responses/ServerErrorResponse' | ||
|
|
||
| /v1/{prefix}/namespaces/{namespace}/tables/{table}/sign/{provider}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{provide} why do we need that ? a table would ideally be in one object store ? if there are multiple thats fine too, i believe we give absolute path of the uri right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this, because if/when a catalog server eventually has remote signing available for more than one object storage provider (say, S3 and Azure), it would be good if the server could determine how exactly to sign the request. Without this path parameter, the server would need to apply some heuristics to determine the right object store provider, and hence how to sign the request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the server would need to apply some heuristics to determine the right object store provider
didn't get this part, we give the path we want to be signed from client to server as part of payload of this request right ? can't we extract that from there (Are you concerned with s3 / s3a / s3n semantics ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that easy.
As an example, a request to sign looks like the one below for S3:
PUT /warehouse/db/sales_table/data/date=2024-05/00022-44-55.parquet HTTP/1.1
Host: my-datalake.s3.us-east-1.amazonaws.com
Date: Fri, 24 May 2024 12:45:00 GMT
Content-Length: 134217728
Content-Type: application/octet-stream
A similar request to GCP would look like:
POST /upload/storage/v1/b/my-datalake-bucket/o?uploadType=media&name=warehouse/db/sales/data/file.parquet HTTP/1.1
Host: storage.googleapis.com
Date: Fri, 24 May 2024 12:45:00 GMT
Content-Length: 134217728
Content-Type: application/octet-stream
And for Azure:
PATCH /my-container/warehouse/db/sales/data/file.parquet?action=append&position=0 HTTP/1.1
Host: my-datalake.dfs.core.windows.net
x-ms-date: Fri, 24 May 2024 12:45:00 GMT
x-ms-version: 2023-11-03
Content-Length: 134217728
Content-Type: application/octet-stream
The question is: how do you know the object storage provider so that the server can pick the right signing algorithm? The only (heuristic) way is to inspect the Host header, but that's brittle. It's much simpler if the client tells the server what object storage provider to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am taking of sign request from IRC client to IRC server, i believe what you are showing is IRC server to object store sign ? am i missing something
like IRC client will do a post to /v1/{prefix}/namespaces/{namespace}/tables/{table}/sign with uri as param
https://github.com/apache/iceberg/pull/15112/changes#diff-02549ca620d020dc9ead80088cc14e311e12a69651fa8d394cd41a4308debb2eR4725
i think this would an absolute path right ? s3:////table/data/a.parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be wanna then keep this info in the payload
What do you mean by payload? If you mean in the request body that the signer sends to the signer endpoint, that would require support for request properties, cf dev ML thread:
https://lists.apache.org/thread/gz5nm2xzlhzbc2y3sfossgflnkbm6vq5
But yes, that would work. If people think that's a preferable path, I'm fine with it. (I'm preparing a PR for that as well.)
how are the location checks happening don't we wanna do prefix match ?
Indeed the signer endpoint needs to do a mapping from HTTP URI to S3/GCS/ADLS URIs, in order to validate locations. The mapping can be complex (think: S3 path-style vs virtual-host style, many domains/regions etc.). But it would be even more complex if the signer doesn't even know the target storage provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would require support for request properties
not necessarily, mainly because its something client can infer on its own, for example we wanna R/W path client can just send back the s3 as the provider while making the request ? we just need to define this field in the POST then ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand your comment 😅 could you give me an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies let me know if it reads better
if path is s3:///namespace/...../a.parquet
POST <>/v1/namespace/{namespace}/table/{table}
{
provider : s3
method :
region :
uri :
}
essentialy making the provider as part of body we can also add if no provider is there catalog assumes s3
thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the example! If I get the suggestion right, you are proposing to add a new field to RemoteSignRequest:
RemoteSignRequest:
description: The request to be signed remotely.
type: object
required:
- provider
- region
- uri
- method
- headers
properties:
provider: # NEW FIELD
type: string
region:
type: string
uri:
type: string
method:
type: string
enum: ["PUT", "GET", "HEAD", "POST", "DELETE", "PATCH", "OPTIONS"]
# etc.The problem with this approach is that it's not backwards-compatible. An older client wouldn't know anything about this new field and wouldn't include it in the request body, which would make a newer server reject the request.
We could make this field optional and default to s3 if not provided.
@nastra wdyt?
open-api/rest-catalog-open-api.yaml
Outdated
|
|
||
| If remote signing for a specific storage provider is enabled, clients must respect the following configurations when creating a remote signer client: | ||
| - `signer.uri`: the base URI of the remote signer endpoint. Optional; if absent, defaults to the catalog's base URI. | ||
| - `signer.endpoint`: the path of the remote signer endpoint. Required. Should be concatenated with `signer.uri` to form the complete URI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHOULD or MUST ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's complicated 😄
The signer client impl uses org.apache.iceberg.rest.RESTUtil#resolveEndpoint to perform the concatenation of signer.uri and signer.endpoint.
So, signer.endpoint could also be an absolute URL, in which case, signer.uri would be ignored.
I will try to come up with a better wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased, lmk what you think!
| allOf: | ||
| - $ref: '#/components/schemas/Expression' | ||
|
|
||
| MultiValuedMap: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is S3Headers eq section in the s3 signer spec ? can we say like ObjectStoreProviderHeader ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went for a more generic name because there is nothing specific to remote signing here. This component could perfectly be used for something else in the spec.
| - `s3.secret-access-key`: secret for credentials that provide access to data in S3 | ||
| - `s3.session-token`: if present, this value should be used for as the session token | ||
| - `s3.remote-signing-enabled`: if `true` remote signing should be performed as described in the `s3-signer-open-api.yaml` specification | ||
| - `s3.remote-signing-enabled`: if `true` remote signing should be performed as described in the `RemoteSignRequest` schema section of this spec document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I chose to keep this property specific to S3. I think that even if the signer endpoint is now generic, enablement should be performed for each specific object storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can actually do google GCS cloud access via its s3 gateway; same signing algorithm, just a few different settings to change listing version, endpoint, &c
| public String baseSignerUri() { | ||
| return properties().getOrDefault(S3_SIGNER_URI, properties().get(CatalogProperties.URI)); | ||
| return properties() | ||
| .getOrDefault(CatalogProperties.SIGNER_URI, properties().get(CatalogProperties.URI)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this breaking existing behavior where one could have provided the s3.signer.uri but now we don't read that property anymore and rely on signer.uri. The same for the endpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, since the new client introduced in 1.11 is resilient to older servers.
Let's break this into concrete situations:
| Server | Client | Signer config | Resulting URI | Result |
|---|---|---|---|---|
| 1.10 | 1.11 | s3.signer.uri=... s3.signer.endpoint=... | s3.signer.uri + s3.signer.endpoint | ✅ |
| 1.10 | 1.11 | s3.signer.uri=... | s3.signer.uri + default endpoint | ✅ |
| 1.10 | 1.11 | s3.signer.endpoint=... | catalog URI + s3.signer.endpoint | ✅ |
| 1.10 | 1.11 | (empty) | catalog URI + default endpoint | ✅ |
| 1.11 | 1.10 | signer.uri=... signer.endpoint=... s3.signer.uri=... s3.signer.endpoint=... | s3.signer.uri + s3.signer.endpoint | ✅ |
| 1.11 | 1.10 | signer.endpoint=... s3.signer.endpoint=... | catalog URI + s3.signer.endpoint | ✅ |
| 1.11 | 1.10 | signer.uri=... signer.endpoint=... | catalog URI + default endpoint | ❌ |
| 1.11 | 1.10 | signer.endpoint=... | catalog URI + default endpoint | ❌ |
So in summary:
- 1.11 Clients won't break older servers.
- 1.11 Servers won't break older clients iif they include both
signer.*ands3.signer.*properties for backwards compatibility (which they all should do).
However, once support for the deprecated s3.signer.* properties is removed (1.12), newer clients would break older servers (<= 1.10). If that's concerning, we could e.g. wait a few more minor releases before removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we be reading S3_SIGNER_URI first and only then fall back to CatalogProperties.SIGNER_URI and then to CatalogProperties.URI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I planned for it and forgot to implement 🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, and also added a unit test to verify the precedence behavior.
| * @deprecated since 1.11.0, will be removed in 1.12.0; use {@link CatalogProperties#SIGNER_URI} | ||
| * instead. | ||
| */ | ||
| @Deprecated public static final String S3_SIGNER_URI = CatalogProperties.SIGNER_URI; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can just change the value here as that would break backwards compatibility
| "true", | ||
| CatalogProperties.URI, | ||
| uri, | ||
| CatalogProperties.SIGNER_ENDPOINT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this wasn't needed before but is needed now, which indicates that this is a breaking change for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now it's not required, but it will become required in a future release (1.12 or later).
There is a check + warning here:
I proactively updated the tests so that they don't break when we make this property required.
|
|
||
| paths: | ||
|
|
||
| /v1/aws/s3/sign: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we would want to remove this spec yet. We should probably first deprecate it
| } | ||
| } | ||
|
|
||
| public static class RemoteSignRequestSerializer<T extends RemoteSignRequest> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these all should probably just be package-private and not public
| gen.writeEndArray(); | ||
| } | ||
| gen.writeEndObject(); | ||
| public static void headersToJson( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure whether we need to make this one and the one below public
| default: | ||
| throw new UnsupportedOperationException("Unsupported grant_type: " + grantType); | ||
| protected void validateSignRequest(RemoteSignRequest request) { | ||
| if ("POST".equalsIgnoreCase(request.method()) && request.uri().getQuery().contains("delete")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor : should we use rest constants instead of raw string for "POST" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I went back to raw strings because org.apache.iceberg.rest.HttpMethod is missing constants for PUT, OPTIONS, etc. which was causing integration tests to fail.
| 5XX: | ||
| $ref: '#/components/responses/ServerErrorResponse' | ||
|
|
||
| /v1/{prefix}/namespaces/{namespace}/tables/{table}/sign/{provider}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would require support for request properties
not necessarily, mainly because its something client can infer on its own, for example we wanna R/W path client can just send back the s3 as the provider while making the request ? we just need to define this field in the POST then ?
| * @param response the HTTP response to add headers to | ||
| */ | ||
| protected void addSignResponseHeaders(RemoteSignRequest request, HttpServletResponse response) { | ||
| if (request.method().equalsIgnoreCase("GET") || request.method().equalsIgnoreCase("HEAD")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we previously had this defined in a CACHEABLE_METHODS set, so would be good to keep this for easier readability
| Preconditions.checkArgument( | ||
| properties().containsKey(S3_SIGNER_URI) || properties().containsKey(CatalogProperties.URI), | ||
| properties().containsKey(S3_SIGNER_URI) | ||
| || properties().containsKey(RESTCatalogProperties.SIGNER_URI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add some tests for these new properties to testS3RemoteSignerWithoutUri()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, but the tests look very similar to the ones in TestS3V4RestSignerClient.legacySignerProperties().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I actually meant to only add tests to testS3RemoteSignerWithoutUri(), which verifies that the error msg S3 signer service URI is required is properly thrown when any of these new props are missing. No need to duplicate TestS3V4RestSignerClient.legacySignerProperties() into TestS3FileIOProperties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the error is only thrown when all of these props are missing. So the existing test is imho already testing sufficiently. Wdyt?
| } | ||
|
|
||
| /** | ||
| * The base URI of the remote signer endpoint. Optional, defaults to {@linkplain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why not just {@link CatalogProperties#URI} here?
core/src/main/java/org/apache/iceberg/rest/requests/RemoteSignRequestParser.java
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/rest/responses/RemoteSignResponseParser.java
Show resolved
Hide resolved
|
|
||
| lint-spec: | ||
| uv run yamllint --strict rest-catalog-open-api.yaml | ||
| uv run yamllint --strict ../aws/src/main/resources/s3-signer-open-api.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still leave this in until we actually remove the openapi spec file
| * @deprecated since 1.11.0, will be removed in 1.12.0; use {@link | ||
| * org.apache.iceberg.rest.requests.RemoteSignRequestParser} instead. | ||
| */ | ||
| @Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually liked that you switched the impl here to using functionality from the new RemoteSignRequestParser for stuff, not sure why you decided to change that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK 😅 I thought it was a bit too invasive. Let me go back to the previous version.
| * @deprecated since 1.11.0, will be removed in 1.12.0; the serializers for S3 signing are now | ||
| * registered in {@link RESTSerializers}. | ||
| */ | ||
| @Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class I would probably just deprecate and not switch to using RESTSerializers.registerAll(MAPPER).
nastra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM, but left a few minor comments that would be good to address
7ee473b to
90e4eba
Compare
|
Heads up: I had to rebase because of a conflict with fec9800. |
Dev ML discussion: https://lists.apache.org/thread/2kqdqb46j7jww36wwg4txv6pl2hqq9w7 This commit promotes the S3 remote signing endpoint from an AWS-specific implementation to a first-class REST catalog API endpoint. This enables other storage providers (GCS, Azure, etc.) to eventually reuse the same signing endpoint pattern without duplicating the API definition. OpenAPI Specification changes: - Add `/v1/{prefix}/namespaces/{namespace}/tables/{table}/sign/{provider}` endpoint to the main REST catalog OpenAPI spec - Define `RemoteSignRequest`, `RemoteSignResult` and `RemoteSignResponse` schemas - Remove the separate `s3-signer-open-api.yaml` from the AWS module - Update the Python client Core Module changes (iceberg-core): - Add `RemoteSignRequest` and `RemoteSignResponse` model classes, copied from the iceberg-aws module - Add `RemoteSignRequestParser` and `RemoteSignResponseParser` for JSON serialization, copied from the iceberg-aws module - Add `SIGNER_URI` and `SIGNER_ENDPOINT` properties to `CatalogProperties` for configuring the signing endpoint - Add `V1_TABLE_REMOTE_SIGN` field and `remoteSign()` method to `ResourcePaths` - Register the new endpoint in `Endpoint.java` - Add abstract `RemoteSignerServlet` base class for remote signing tests, copied from the iceberg-aws module AWS Module changes (iceberg-aws): - Deprecate `S3SignRequest` and `S3SignResponse` for removal - Deprecate `S3SignRequestParser` and `S3SignResponseParser` for removal - Deprecate `S3ObjectMapper` for removal - Refactor `S3SignerServlet` to extend `RemoteSignerServlet` - Update `S3V4RestSignerClient`
This reverts commit 432ca04.
0fcc2da to
ee33e03
Compare
|
Rebased again because of another conflict with b2f312f. |
Dev ML discussion: https://lists.apache.org/thread/2kqdqb46j7jww36wwg4txv6pl2hqq9w7
This commit promotes the S3 remote signing endpoint from an AWS-specific implementation to a first-class REST catalog API endpoint.
This enables other storage providers (GCS, Azure, etc.) to eventually reuse the same signing endpoint pattern without duplicating the API definition.
OpenAPI Specification changes:
/v1/{prefix}/namespaces/{namespace}/tables/{table}/sign/{provider}endpoint to the main REST catalog OpenAPI specRemoteSignRequest,RemoteSignResultandRemoteSignResponseschemass3-signer-open-api.yamlfrom the AWS moduleCore Module changes (iceberg-core):
RemoteSignRequestandRemoteSignResponsemodel classes, copied from the iceberg-aws moduleRemoteSignRequestParserandRemoteSignResponseParserfor JSON serialization, copied from the iceberg-aws moduleSIGNER_URIandSIGNER_ENDPOINTproperties toCatalogPropertiesfor configuring the signing endpointV1_TABLE_REMOTE_SIGNfield andremoteSign()method toResourcePathsEndpoint.javaRemoteSignerServletbase class for remote signing tests, copied from the iceberg-aws moduleAWS Module changes (iceberg-aws):
S3SignRequestandS3SignResponsefor removalS3SignRequestParserandS3SignResponseParserfor removalS3ObjectMapperfor removalS3SignerServletto extendRemoteSignerServletS3V4RestSignerClient