feat: Add trace_id to AWS STS session tags for end-to-end correlation #3414

Merged
dimas-b merged 8 commits into apache:main from obelix74:feat/aws-sts-session-tags-trace-id
Jan 16, 2026

@obelix74 obelix74 commented Jan 11, 2026

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues: Fixes #
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)

Fixes part of #3337

This change enables deterministic correlation between:

  • Catalog operations (Polaris events)
  • Credential vending (AWS CloudTrail via STS session tags)
  • Metrics reports from compute engines (Spark, Trino, etc.)

Changes:

  1. Add traceId field to CredentialVendingContext
    • Marked with @Value.Auxiliary to exclude it from cache key comparison
    • Every request has a unique trace ID, so including it in equals/hashCode would prevent all cache hits
    • The trace ID is for correlation/audit only, not authorization

  2. Extract OpenTelemetry trace ID in StorageAccessConfigProvider
    • getCurrentTraceId() extracts the trace ID from the current span context
    • Populates CredentialVendingContext.traceId for each request

  3. Add trace_id to AWS STS session tags
    • AwsSessionTagsBuilder includes trace_id in session tags
    • Appears in CloudTrail logs for correlation with catalog operations
    • Uses 'unknown' placeholder when the trace ID is not available

  4. Update tests to verify trace_id is included in session tags

This enables operators to correlate:

  • Which catalog operation triggered credential vending
  • Which data access events in CloudTrail correspond to catalog operations
  • Which metrics reports correspond to specific catalog operations
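
The tag-building step described above could be sketched roughly as follows. This is only an illustration in plain Java: `SessionTagsSketch`, its `build` method, and the `catalog` tag key are hypothetical names, not the actual `AwsSessionTagsBuilder` API, and the `unknown` placeholder follows the description above rather than any later revision.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a session tag builder; not the actual Polaris class. */
final class SessionTagsSketch {
  static final String TRACE_ID_TAG = "trace_id";
  static final String UNKNOWN = "unknown";

  /** Builds tag key/value pairs, always adding trace_id (or the placeholder). */
  static Map<String, String> build(String catalogName, Optional<String> traceId) {
    Map<String, String> tags = new LinkedHashMap<>();
    tags.put("catalog", catalogName); // hypothetical tag key, for illustration only
    // Fall back to the placeholder when no trace ID is available.
    tags.put(TRACE_ID_TAG, traceId.orElse(UNKNOWN));
    return tags;
  }
}
```

With a trace ID present, the resulting map carries it under `trace_id`, which is the value an operator would then search for in CloudTrail.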

@obelix74
Contributor Author

@dimas-b as suggested by you, I have created a separate PR for the AWS STS trace_id changes.

Contributor

@dimas-b dimas-b left a comment

Thanks for your contribution, @obelix74 !

The feature itself LGTM, but I have some concerns about current impl. (below).

@dimas-b
Contributor

dimas-b commented Jan 12, 2026

side note: I think this PR will need a CHANGELOG entry (eventually)

@obelix74
Contributor Author

> side note: I think this PR will need a CHANGELOG entry (eventually)

Perhaps after #3385 is merged, I will add a CHANGELOG entry summarizing how end-to-end telemetry works? A TL;DR of the Google doc that was shared?

@obelix74 obelix74 requested a review from dimas-b January 12, 2026 15:51
1. Feature Flag to Disable Trace IDs in Session Tags

   Added a new feature configuration flag INCLUDE_TRACE_ID_IN_SESSION_TAGS in FeatureConfiguration.java:

   polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java (EXCERPT)

       public static final FeatureConfiguration<Boolean> INCLUDE_TRACE_ID_IN_SESSION_TAGS =
           PolarisConfiguration.<Boolean>builder()
               .key("INCLUDE_TRACE_ID_IN_SESSION_TAGS")
               .description("If set to true (and INCLUDE_SESSION_TAGS_IN_SUBSCOPED_CREDENTIAL is also true), ...")
               .defaultValue(false)
               .buildFeatureConfiguration();

2. Cache Key Correctness Solution

   The solution ensures cache correctness by including trace IDs in cache keys only when they affect the vended credentials:

   Key changes:

   1. `StorageCredentialCacheKey` - Added a new traceIdForCaching() field that is populated only when trace IDs affect credentials:

      polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCacheKey.java (EXCERPT)

          @Value.Parameter(order = 10)
          Optional<String> traceIdForCaching();

   2. `StorageCredentialCache` - Reads both flags and includes trace ID in cache key only when both are enabled:

      polaris-core/src/main/java/org/apache/polaris/core/storage/cache/StorageCredentialCache.java (EXCERPT)

          boolean includeTraceIdInCacheKey = includeSessionTags && includeTraceIdInSessionTags;
          StorageCredentialCacheKey key = StorageCredentialCacheKey.of(..., includeTraceIdInCacheKey);

   3. `AwsSessionTagsBuilder` - Conditionally includes trace ID based on the new flag.

   4. Tests - Updated existing tests and added a new test testSessionTagsWithTraceIdWhenBothFlagsEnabled.

   How This Resolves the Cache Correctness vs. Efficiency Trade-off

   | Configuration | Trace ID in Session Tags | Trace ID in Cache Key | Caching Behavior |
   |---------------|--------------------------|----------------------|------------------|
   | Session tags disabled | No | No | Efficient caching |
   | Session tags enabled, trace ID disabled (default) | No | No | Efficient caching |
   | Session tags enabled, trace ID enabled | Yes | Yes | Correct but no caching across requests |

   This design ensures:
     • Correctness: When trace IDs affect credentials, they're included in the cache key
     • Efficiency: When trace IDs don't affect credentials, they're excluded from the cache key, allowing cache hits across requests
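
The table above can be illustrated with a simplified cache key. This is a sketch under assumptions: `CacheKeySketch` and its `realmId`/`location` fields are hypothetical and far smaller than the real `StorageCredentialCacheKey`; only the two-flag condition is taken from the excerpt above.

```java
import java.util.Optional;

/** Hypothetical, simplified cache key; the real StorageCredentialCacheKey has more fields. */
record CacheKeySketch(String realmId, String location, Optional<String> traceIdForCaching) {

  /** Keeps the trace ID in the key only when both flags are enabled. */
  static CacheKeySketch of(
      String realmId,
      String location,
      Optional<String> traceId,
      boolean includeSessionTags,
      boolean includeTraceIdInSessionTags) {
    boolean includeTraceIdInCacheKey = includeSessionTags && includeTraceIdInSessionTags;
    // When the trace ID does not end up in the credentials, leave it out of the key
    // so that requests with different trace IDs can share a cache entry.
    return new CacheKeySketch(
        realmId, location, includeTraceIdInCacheKey ? traceId : Optional.empty());
  }
}
```

Because record equality compares all components, two requests that differ only in trace ID produce equal keys unless both flags are on, which matches the three rows of the table.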
@adutra
Contributor

adutra commented Jan 12, 2026

Hi @obelix74 thanks for providing support for trace ids!

I have a question: should we also include the request ID?

This aligns with an earlier, extensive discussion (back in October) regarding which information to include in events (OTel context vs. request IDs). The general agreement was to include request IDs when available.

Reference link: https://lists.apache.org/thread/p9357rcy3d1j94w4yogtdwcf2kxzg3jr

As noted in that thread:

"Request ID should remain the canonical identifier for every request handled by Polaris. [...] OTel context is OPTIONAL."

I think it would be good to apply the same decisions for AWS STS session tags, wdyt?

@obelix74 obelix74 requested a review from snazy January 12, 2026 18:03
@obelix74
Contributor Author

> Hi @obelix74 thanks for providing support for trace ids!
>
> I have a question: should we also include the request ID?
>
> This aligns with an earlier, extensive discussion (back in October) regarding which information to include in events (OTel context vs. request IDs). The general agreement was to include request IDs when available.
>
> Reference link: https://lists.apache.org/thread/p9357rcy3d1j94w4yogtdwcf2kxzg3jr
>
> As noted in that thread:
>
> "Request ID should remain the canonical identifier for every request handled by Polaris. [...] OTel context is OPTIONAL."
>
> I think it would be good to apply the same decisions for AWS STS session tags, wdyt?

Hi Alex

In an earlier PR https://github.com/apache/polaris/pull/3327/changes/BASE..11f4c58ce6f24284553e73887bd8d0d2991244ff, I had included the request_id. I ran into a few issues and, based on @singhpk234's advice, removed it from that PR.

  1. I couldn't figure out a reliable way to obtain the request_id - I was using the RESTContext but @dimas-b pointed out that won't work for API calls. I then replaced it with SLF4J's MDC - but that felt hacky.
  2. My understanding is that Polaris generates a new request_id per incoming call, whereas the trace_id is more "atomic" and ensures that if a single atomic operation results in multiple REST calls, all get the same ID.

Based on these two issues, we had removed request_id from that pull request. Perhaps, this is a good time to add it back.

What are your thoughts? Also, any guidance on a reliable way to obtain the request_id would be helpful.

@obelix74 obelix74 requested a review from dimas-b January 12, 2026 19:00
dimas-b previously approved these changes Jan 12, 2026
Contributor

@dimas-b dimas-b left a comment

LGTM 👍 Thanks, @obelix74 !

Let's give this PR a couple more review days before merging given the potential perf. and functionality impact.

@github-project-automation github-project-automation bot moved this from PRs In Progress to Ready to merge in Basic Kanban Board Jan 12, 2026

public static final FeatureConfiguration<Boolean> INCLUDE_TRACE_ID_IN_SESSION_TAGS =
    PolarisConfiguration.<Boolean>builder()
        .key("INCLUDE_TRACE_ID_IN_SESSION_TAGS")
Contributor

@dimas-b dimas-b Jan 12, 2026

nit: since we're adding a feature flag, I'd prefer to mention it in CHANGELOG.md right away.

Contributor Author

I will add it shortly. Can you please see @adutra 's comment about request_id? If you remember, in the previous PR about session tags, we had explicitly removed request_id because we couldn't find a reliable way to get the request_id.

Contributor

From my personal POV, I do not know of real use cases for "request IDs", but I suppose some people use it.

I'd be ok leaving it out for now and adding later if concrete use cases for propagating it to STS arise.

@adutra : WDYT?

Contributor

If you are able to compute things like the principal or the otel context, it's because there is an active request context when the tags are computed. So, I'm a bit surprised that it's not possible to get the request ID as well. Have you tried this approach:

private Optional<String> getRequestId() {
  // See org.jboss.resteasy.reactive.server.injection.ContextProducers
  ResteasyReactiveRequestContext context = CurrentRequestManager.get();
  if (context != null) {
    ContainerRequestContextImpl request = context.getContainerRequestContext();
    String requestId = (String) request.getProperty(RequestIdFilter.REQUEST_ID_KEY);
    return Optional.ofNullable(requestId);
  }
  return Optional.empty();
}

I'm not saying we absolutely must include the request id here, but I do remember that for events, not including the request ID was a friction point for some contributors.

Contributor Author

Thanks @adutra. I am on it.

Contributor Author

@obelix74 obelix74 Jan 13, 2026

@adutra and @dimas-b, does the aforementioned code snippet handle non-HTTP clients or programmatic clients as well?

Contributor Author

I have pushed a commit adding support for request_id, but my understanding is that the code that fetches the request_id will only work for HTTP requests. This was the primary reason we removed request_id from the previous PR.

I request @dimas-b @adutra and @singhpk234 to review this and let me know if we should keep this as part of this PR or remove it.

The request_id code follows the same design suggested by @dimas-b for the trace_id. There is one change. Since the request_id is controlled by the end-user and may contain non-STS friendly characters, I added code to sanitize it - if this doesn't happen, vended credentials may fail to be issued.

Contributor

I did not review the latest changes yet, posting proactively: STS may be involved in non-HTTP requests, so if request ID is part of STS tags we need to find a way to access / generate it for async tasks too (cf. TaskManagerImpl)... I'll review actual code later ⏳

Contributor Author

@dimas-b thank you. Currently the code handles the absence of request-id by omitting it (request-id is optional, just like trace-id is). I will wait for your review.

Contributor

commented on related CDI aspects separately.

@obelix74 obelix74 requested a review from dimas-b January 12, 2026 19:25
dimas-b previously approved these changes Jan 12, 2026
@obelix74 obelix74 requested review from adutra and dimas-b January 13, 2026 19:42
   * @return the sanitized value with invalid characters replaced by underscores
   */
  static String sanitizeTagValue(String value) {
    return INVALID_TAG_VALUE_CHARS.matcher(value).replaceAll("_");
Contributor

What's the point in including trace ID / request ID if their values are modified? They will become useless for automated processing / correlation, I think 🤔

Contributor Author

I was trying to be defensive in that an invalid request_id can derail vended credential grant (note that this is not an issue with trace_ids), but I see your point.

Contributor

From my POV if the trace ID value is not compliant with STS API requirements, I'd prefer to drop this tag with a WARN log message.
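
The "drop with a WARN" approach suggested here might look roughly like the sketch below. It is not code from the PR: `TagValueCheckSketch` and `compliantOrEmpty` are hypothetical names, `java.util.logging` stands in for whatever logger the project uses, and the character set and 256-character limit are assumed from the AWS STS tag documentation.

```java
import java.util.Optional;
import java.util.logging.Logger;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps a tag value unchanged or drops it, never rewrites it. */
final class TagValueCheckSketch {
  private static final Logger LOGGER = Logger.getLogger(TagValueCheckSketch.class.getName());

  // Assumption based on the AWS STS Tag documentation:
  // letters, whitespace, digits and the symbols _ . : / = + - @
  private static final Pattern VALID_TAG_VALUE =
      Pattern.compile("[\\p{L}\\p{Z}\\p{N}_.:/=+\\-@]*");
  private static final int MAX_TAG_VALUE_LENGTH = 256;

  /** Returns the value as-is when it is compliant, otherwise logs a warning and returns empty. */
  static Optional<String> compliantOrEmpty(String tagKey, String value) {
    if (value.length() <= MAX_TAG_VALUE_LENGTH && VALID_TAG_VALUE.matcher(value).matches()) {
      return Optional.of(value);
    }
    LOGGER.warning("Dropping session tag '" + tagKey + "': value is not a valid STS tag value");
    return Optional.empty();
  }
}
```

Unlike sanitizing, this keeps any emitted value byte-for-byte identical to the original ID, so it stays usable for automated correlation.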

Contributor Author

@dimas-b

If trace_id is coming from OpenTelemetry (Span.current().getSpanContext().getTraceId()), then in practice it shouldn’t ever be “invalid” for AWS STS session tags:

  • Format/charset: OpenTelemetry trace IDs are represented as a 32‑character lowercase hex string ([0-9a-f]{32}), and AWS session tag values allow letters/numbers plus a small set of symbols. Hex is entirely within that allowed set.
  • Length: 32 characters is far below the STS tag value limit (256).
  • Validity gating in Polaris: In StorageAccessConfigProvider#getCurrentTraceId(), we only include it when spanContext.isValid() is true; if the context is invalid (including “all zeros” or malformed), we return Optional.empty() and the trace_id tag won’t be added at all.

The code above was specifically added for request_ids since they can be arbitrary strings - it has been removed in the revert of the last commit.
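
The format and validity points in this comment can be expressed as a small plain-Java check. This is a sketch only: `TraceIdCheckSketch` and `validTraceId` are hypothetical names, and the check is a stand-in for the validity gating described above, not the OpenTelemetry API itself.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the trace ID validity gating; does not call OpenTelemetry. */
final class TraceIdCheckSketch {
  private static final Pattern HEX_32 = Pattern.compile("[0-9a-f]{32}");
  private static final String ALL_ZEROS = "0".repeat(32);

  /** Returns the trace ID only if it is 32 lowercase hex characters and not all zeros. */
  static Optional<String> validTraceId(String candidate) {
    if (candidate == null
        || !HEX_32.matcher(candidate).matches()
        || ALL_ZEROS.equals(candidate)) {
      return Optional.empty(); // the trace_id tag would simply be omitted
    }
    return Optional.of(candidate);
  }
}
```

Any value that passes this check is 32 characters of lowercase hex, which is within the length limit and character set discussed for STS tag values.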

   */
  private Optional<String> getCurrentRequestId() {
    // See org.jboss.resteasy.reactive.server.injection.ContextProducers
    ResteasyReactiveRequestContext context = CurrentRequestManager.get();
Contributor

@dimas-b dimas-b Jan 14, 2026

Sorry, but I do not think this is a totally correct way to get request IDs. IIRC, CurrentRequestManager.get() will fail at runtime if this method is called outside an HTTP (REST) request context, which is possible in async tasks.

Please see how TaskExecutorImpl deals with realm IDs. A similar CDI pattern is probably necessary for request IDs, except that we may not have to produce a request ID for background tasks. I do not have an opinion on whether request IDs should be propagated from REST requests to related async tasks or not. However, CDI must be solid and not cause runtime exceptions even if the request ID is not propagated.

Contributor Author

Can I file another feature request to wire the request IDs through a CDI pattern similar to realm IDs and fix it cleanly in that PR? For this PR, I can roll back this last commit and merge it with just the trace_id. Once we have a reliable way to inject request_ids, we can add it to the STS tags then.

Contributor

From my POV it's totally fine to leave request IDs out of this PR and add in a follow-up PR (if there's demand).

Contributor

@dimas-b dimas-b Jan 14, 2026

@obelix74 : if you feel like continuing to work on request IDs in STS tags, I think it might be preferable to start with a dev email about that to gauge interest (not all people watch all PRs 😉 ).

Contributor Author

@dimas-b Raised #3444. Reverted my last commit so this PR is back to what you approved. Once I merge this PR and the metrics persistence PR, I can work on the request ID PR, and will start a discussion with the ML.

@obelix74 obelix74 requested a review from dimas-b January 14, 2026 21:45
Contributor

@dimas-b dimas-b left a comment

LGTM 👍 Thanks again, @obelix74 !

@obelix74
Contributor Author

> LGTM 👍 Thanks again, @obelix74 !

Hi @dimas-b can this PR be merged?

@dimas-b
Contributor

dimas-b commented Jan 16, 2026

Behaviour changes in this PR are covered by a feature flag (with default being "no change"). The PR has been in review for a substantial period of time. I'm merging.

If concerns are raised later, let's address in follow-up PRs.

@dimas-b dimas-b merged commit f1b51c7 into apache:main Jan 16, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Jan 16, 2026
evindj pushed a commit to evindj/polaris that referenced this pull request Jan 26, 2026
…apache#3414)

* feat: Add trace_id to AWS STS session tags for end-to-end correlation

* Update AwsCredentialsStorageIntegrationTest.java

* Review comments

* Update CHANGELOG.md

Co-authored-by: Anand Kumar Sankaran <anand.sankaran@workday.com>