feat: Add trace_id to AWS STS session tags for end-to-end correlation#3414
Conversation
This change enables deterministic correlation between:
- Catalog operations (Polaris events)
- Credential vending (AWS CloudTrail via STS session tags)
- Metrics reports from compute engines (Spark, Trino, etc.)
Changes:
1. Add traceId field to CredentialVendingContext
- Marked with @Value.Auxiliary to exclude from cache key comparison
- Every request has unique trace ID, so including it in equals/hashCode
would prevent all cache hits
- Trace ID is for correlation/audit only, not authorization
2. Extract OpenTelemetry trace ID in StorageAccessConfigProvider
- getCurrentTraceId() extracts trace ID from current span context
- Populates CredentialVendingContext.traceId for each request
3. Add trace_id to AWS STS session tags
- AwsSessionTagsBuilder includes trace_id in session tags
- Appears in CloudTrail logs for correlation with catalog operations
- Uses 'unknown' placeholder when trace ID is not available
4. Update tests to verify trace_id is included in session tags
This enables operators to correlate:
- Which catalog operation triggered credential vending
- Which data access events in CloudTrail correspond to catalog operations
- Which metrics reports correspond to specific catalog operations
|
@dimas-b as suggested by you, I have created a separate PR for the AWS STS trace_id changes. |
...service/src/main/java/org/apache/polaris/service/catalog/io/StorageAccessConfigProvider.java
Outdated
Show resolved
Hide resolved
|
side note: I think this PR will need a CHANGELOG entry (eventually) |
polaris-core/src/main/java/org/apache/polaris/core/storage/CredentialVendingContext.java
Show resolved
Hide resolved
Perhaps after #3385 is merged, I will add a |
1. Feature Flag to Disable Trace IDs in Session Tags
Added a new feature configuration flag INCLUDE_TRACE_ID_IN_SESSION_TAGS in FeatureConfiguration.java:
polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java (EXCERPT)
public static final FeatureConfiguration<Boolean> INCLUDE_TRACE_ID_IN_SESSION_TAGS =
PolarisConfiguration.<Boolean>builder()
.key("INCLUDE_TRACE_ID_IN_SESSION_TAGS")
.description("If set to true (and INCLUDE_SESSION_TAGS_IN_SUBSCOPED_CREDENTIAL is also true), ...")
.defaultValue(false)
.buildFeatureConfiguration();
2. Cache Key Correctness Solution
The solution ensures cache correctness by including trace IDs in cache keys only when they affect the vended credentials:
Key changes:
1. `StorageCredentialCacheKey` - Added a new traceIdForCaching() field that is populated only when trace IDs affect credentials:
polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCacheKey.java (EXCERPT)
@Value.Parameter(order = 10)
Optional<String> traceIdForCaching();
2. `StorageCredentialCache` - Reads both flags and includes trace ID in cache key only when both are enabled:
polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCache.java (EXCERPT)
boolean includeTraceIdInCacheKey = includeSessionTags && includeTraceIdInSessionTags;
StorageCredentialCacheKey key = StorageCredentialCacheKey.of(..., includeTraceIdInCacheKey);
3. `AwsSessionTagsBuilder` - Conditionally includes trace ID based on the new flag.
4. Tests - Updated existing tests and added a new test testSessionTagsWithTraceIdWhenBothFlagsEnabled.
How This Resolves the Cache Correctness vs. Efficiency Trade-off
| Configuration | Trace ID in Session Tags | Trace ID in Cache Key | Caching Behavior |
|---------------|--------------------------|----------------------|------------------|
| Session tags disabled | No | No | Efficient caching |
| Session tags enabled, trace ID disabled (default) | No | No | Efficient caching |
| Session tags enabled, trace ID enabled | Yes | Yes | Correct but no caching across requests |
This design ensures:
• Correctness: When trace IDs affect credentials, they're included in the cache key
• Efficiency: When trace IDs don't affect credentials, they're excluded from the cache key, allowing cache hits across requests
|
Hi @obelix74 thanks for providing support for trace ids! I have a question: should we also include the request ID? This aligns with an earlier, extensive discussion (back in October) regarding which information to include in events (OTel context vs. request IDs). The general agreement was to include request IDs when available. Reference link: https://lists.apache.org/thread/p9357rcy3d1j94w4yogtdwcf2kxzg3jr As noted in that thread:
I think it would be good to apply the same decisions for AWS STS session tags, wdyt? |
Hi Alex In an earlier PR https://github.com/apache/polaris/pull/3327/changes/BASE..11f4c58ce6f24284553e73887bd8d0d2991244ff, I had included the
Based on these two issues, we had removed What are your thoughts? Also, any guidance on a reliable way to obtain the |
polaris-core/src/main/java/org/apache/polaris/core/storage/aws/AwsSessionTagsBuilder.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCacheKey.java
Outdated
Show resolved
Hide resolved
polaris-core/src/main/java/org/apache/polaris/core/storage/CredentialVendingContext.java
Outdated
Show resolved
Hide resolved
|
|
||
| public static final FeatureConfiguration<Boolean> INCLUDE_TRACE_ID_IN_SESSION_TAGS = | ||
| PolarisConfiguration.<Boolean>builder() | ||
| .key("INCLUDE_TRACE_ID_IN_SESSION_TAGS") |
There was a problem hiding this comment.
nit: since we're adding a feature flag, I'd prefer to mention it in CHANGELOG.md right away.
There was a problem hiding this comment.
I will add it shortly. Can you please see @adutra 's comment about request_id? If you remember, in the previous PR about session tags, we had explicitly removed request_id because we couldn't find a reliable way to get the request_id.
There was a problem hiding this comment.
From my personal POV, I do not know of real use cases for "request IDs", but I suppose some people use it.
I'd be ok leaving it out for now and adding later if concrete use cases for propagating it to STS arise.
@adutra : WDYT?
There was a problem hiding this comment.
If you are able to compute things like the principal or the otel context, it's because there is an active request context when the tags are computed. So, I'm a bit surprised that it's not possible to get the request ID as well. Have you tried this approach:
I'm not saying we absolutely must include the request id here, but I do remember that for events, not including the request ID was a friction point for some contributors.
There was a problem hiding this comment.
I have pushed a commit adding support for request_id, but my understanding is that the code that fetches the request_id will only work for HTTP requests - this was the primary reason we removed request_id from the previous PR that added request_id.
I request @dimas-b @adutra and @singhpk234 to review this and let me know if we should keep this as part of this PR or remove it.
The request_id code follows the same design suggested by @dimas-b for the trace_id. There is one change. Since the request_id is controlled by the end-user and may contain non-STS friendly characters, I added code to sanitize it - if this doesn't happen, vended credentials may fail to be issued.
There was a problem hiding this comment.
I did not review the latest changes yet, posting proactively: STS may be involved in non-HTTP requests, so if request ID is part of STS tags we need to find a way to access / generate it for async tasks too (cf. TaskManagerImpl)... I'll review actual code later ⏳
There was a problem hiding this comment.
@dimas-b thank you. Currently the code handles the absence of request-id by omitting it (request-id is optional, just like trace-id is). I will wait for your review.
There was a problem hiding this comment.
commented on related CDI aspects separately.
| * @return the sanitized value with invalid characters replaced by underscores | ||
| */ | ||
| static String sanitizeTagValue(String value) { | ||
| return INVALID_TAG_VALUE_CHARS.matcher(value).replaceAll("_"); |
There was a problem hiding this comment.
What's the point in including trace ID / request ID if their values are modified? They will become useless for automated processing / correlation, I think 🤔
There was a problem hiding this comment.
I was trying to be defensive in that an invalid request_id can derail vended credential grant (note that this is not an issue with trace_ids), but I see your point.
There was a problem hiding this comment.
From my POV if the trace ID value is not compliant with STS API requirements, I'd prefer to drop this tag with a WARN log message.
There was a problem hiding this comment.
If trace_id is coming from OpenTelemetry (Span.current().getSpanContext().getTraceId()), then in practice it shouldn’t ever be “invalid” for AWS STS session tags.
• Format/charset: OpenTelemetry trace IDs are represented as a 32‑character lowercase hex string ([0-9a-f]{32}), and AWS session tag values allow letters/numbers plus a small set of symbols. Hex is entirely within that allowed set.
• Length: 32 characters is far below the STS tag value limit (256).
• Validity gating in Polaris: In StorageAccessConfigProvider#getCurrentTraceId(), we only include it when spanContext.isValid() is true; if the context is invalid (including “all zeros” or malformed), we return Optional.empty() and the trace_id tag won’t be added at all.
The code above was specifically added for request_ids since they can be arbitrary strings - it has been removed in the revert of the last commit.
| */ | ||
| private Optional<String> getCurrentRequestId() { | ||
| // See org.jboss.resteasy.reactive.server.injection.ContextProducers | ||
| ResteasyReactiveRequestContext context = CurrentRequestManager.get(); |
There was a problem hiding this comment.
Sorry, but I do not think this is a totally correct way to get request IDs. IIRC, CurrentRequestManager.get() will fail in runtime if the this method is called outside an HTTP (REST) request context, which is possible in async tasks.
Please see how TaskExecutorImpl deals with realm IDs. A similar CDI pattern is probably necessary for request IDs, except that we may not have to produce a request ID for background tasks. I do not have an opinion on whether request IDs should be propagated from REST requests to related async tasks or not. However, CDI must be solid and not cause runtime exceptions even if the request ID is not propagated.
There was a problem hiding this comment.
Can I file another feature request to wire the request IDs through a CDI pattern similar to realm IDs and fix it cleanly in that PR? For this PR, I can roll back this last commit and merge it with just the trace_id. Once we have a reliable way to inject request_ids, we can add it to the STS tags then.
There was a problem hiding this comment.
From my POV it's totally fine to leave request IDs out of this PR and add in a follow-up PR (if there's demand).
There was a problem hiding this comment.
@obelix74 : if you feel like continuing working on request IDs in STS tags, I think it might be preferable to start with a dev email about that to gauge interest (not all people watch all PRs 😉 ).
This reverts commit 4b62347.
|
Behaviour changes in this PR are covered by a feature flag (with default being "no change"). The PR has been in review for a substantial period of time. I'm merging. If concerns are raised later, let's address in follow-up PRs. |
…apache#3414) * feat: Add trace_id to AWS STS session tags for end-to-end correlation This change enables deterministic correlation between: - Catalog operations (Polaris events) - Credential vending (AWS CloudTrail via STS session tags) - Metrics reports from compute engines (Spark, Trino, etc.) Changes: 1. Add traceId field to CredentialVendingContext - Marked with @Value.Auxiliary to exclude from cache key comparison - Every request has unique trace ID, so including it in equals/hashCode would prevent all cache hits - Trace ID is for correlation/audit only, not authorization 2. Extract OpenTelemetry trace ID in StorageAccessConfigProvider - getCurrentTraceId() extracts trace ID from current span context - Populates CredentialVendingContext.traceId for each request 3. Add trace_id to AWS STS session tags - AwsSessionTagsBuilder includes trace_id in session tags - Appears in CloudTrail logs for correlation with catalog operations - Uses 'unknown' placeholder when trace ID is not available 4. Update tests to verify trace_id is included in session tags This enables operators to correlate: - Which catalog operation triggered credential vending - Which data access events in CloudTrail correspond to catalog operations - Which metrics reports correspond to specific catalog operations * Update AwsCredentialsStorageIntegrationTest.java * Review comments 1. Feature Flag to Disable Trace IDs in Session Tags Added a new feature configuration flag INCLUDE_TRACE_ID_IN_SESSION_TAGS in FeatureConfiguration.java: polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java (EXCERPT) public static final FeatureConfiguration<Boolean> INCLUDE_TRACE_ID_IN_SESSION_TAGS = PolarisConfiguration.<Boolean>builder() .key("INCLUDE_TRACE_ID_IN_SESSION_TAGS") .description("If set to true (and INCLUDE_SESSION_TAGS_IN_SUBSCOPED_CREDENTIAL is also true), ...") .defaultValue(false) .buildFeatureConfiguration(); 2. Cache Key Correctness Solution The solution ensures cache correctness by including trace IDs in cache keys only when they affect the vended credentials: Key changes: 1. `StorageCredentialCacheKey` - Added a new traceIdForCaching() field that is populated only when trace IDs affect credentials: polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCacheKey.java (EXCERPT) @Value.Parameter(order = 10) Optional<String> traceIdForCaching(); 2. `StorageCredentialCache` - Reads both flags and includes trace ID in cache key only when both are enabled: polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCache.java (EXCERPT) boolean includeTraceIdInCacheKey = includeSessionTags && includeTraceIdInSessionTags; StorageCredentialCacheKey key = StorageCredentialCacheKey.of(..., includeTraceIdInCacheKey); 3. `AwsSessionTagsBuilder` - Conditionally includes trace ID based on the new flag. 4. Tests - Updated existing tests and added a new test testSessionTagsWithTraceIdWhenBothFlagsEnabled. How This Resolves the Cache Correctness vs. Efficiency Trade-off | Configuration | Trace ID in Session Tags | Trace ID in Cache Key | Caching Behavior | |---------------|--------------------------|----------------------|------------------| | Session tags disabled | No | No | Efficient caching | | Session tags enabled, trace ID disabled (default) | No | No | Efficient caching | | Session tags enabled, trace ID enabled | Yes | Yes | Correct but no caching across requests | This design ensures: • Correctness: When trace IDs affect credentials, they're included in the cache key • Efficiency: When trace IDs don't affect credentials, they're excluded from the cache key, allowing cache hits across requests * Update CHANGELOG.md Co-authored-by: Anand Kumar Sankaran <anand.sankaran@workday.com>
Checklist
CHANGELOG.md(if needed)site/content/in-dev/unreleased(if needed)Fixes a part of #3337
This change enables deterministic correlation between:
Changes:
Add traceId field to CredentialVendingContext
Extract OpenTelemetry trace ID in StorageAccessConfigProvider
Add trace_id to AWS STS session tags
Update tests to verify trace_id is included in session tags
This enables operators to correlate: