-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Implement Cloud Run Logging trace correlation with contextvars #235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
lisad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest question I have is if we're really going to copy all this and have 3 copies (one in schemaindex, one in trust registry, one in activitypub testbed. That feels like quite a bit of code we're going to have to maintain separately. I'm really annoyed at Google that this isn't easier and part of their libraries for google logging.
- I can see a strong case for the investment for the Data Trust Registry. if we have errors there, we have to be able to figure out what happened. Especially with complicated things like domain verification.
- I don't think we should rush to add it to SchemaIndex. That's a very simple site with simple interactions and we don't know if it's going to take off.
- I am really conflicted about whether this goes in Testbed. On the one hand, debugging OAuth interactions with a trace ID is quite useful. On the other hand, it's an open source project. If anybody else were to run it, it is more than likely not on Google Cloud. They might only run it on localhost. I hate to add a bunch of stuff that is not directly related to ActivityPub and probably not useful to other users of the project. If we hand over the testbed to some other party to maintain, this gets in the way. I'd like to not have the cost of running the testbed forever under DTI.
- If we refactor the middleware to be responsible for the context vars, then we perhaps have something we can maintain across 2 different projects - e.g. a note that explains what the middleware is for and that the copy in Data Trust Registry is the same as the copy elsewhere (if we have any elsewheres, that is). I also am considering what it would look like for the middleware (and possibly also the logging filter) to be packaged so it's the same code across 2 different projects, but that's complicated by the fact that one is open source and one isn't.
Let's discuss... these are all my thoughts for the moment
| if len(x_cloud_trace_context_parts) >= 1: | ||
| trace_id = x_cloud_trace_context_parts[0] | ||
| if trace_id: | ||
| record.trace = f"projects/{self.project_id}/traces/{trace_id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the record is being passed by value instead of by reference, then changes to the object wouldn't be carried along. Are we sure that this is an intent of the logging.Filter interface? It seems more like taking advantage of a (possibly temporary or situational) side effect of the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.python.org/3/library/logging.html says :
Changed in version 3.12: You can now return a LogRecord instance from filters to replace the log record rather than modifying it in place. This allows filters attached to a Handler to modify the log record before it is emitted, without having side effects on other handlers.
| _request_method_var.set(request_method) | ||
|
|
||
|
|
||
| def get_trace_context(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer used
|
|
||
| # Thread-local storage for trace context | ||
| _trace_local = threading.local() | ||
| _trace_context_var = contextvars.ContextVar("trace_context", default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move the contextvars and management of them to Middleware. The reason is that the middleware can be the utility that other things depend on. If we needed the request method, request path or trace_id anywhere else that we didn't have it, a ContextMiddleware could be called by other code as well as LoggingFilter.
It also just seems a little more robust for the logging filter to get answers from the middleware, rather than for the middleware to import the logging filter to push data to it. So this would reverse the import relationship.
|
UPDATE. We agreed on not implementing Cloud Run logging in this project and we moved forward to iterate over this implementation in Trust Registry and a minimal logging implementation in Schemas Pub. Both implementation are working as expected now. Trust Registry has a middleware + filter architecture and it uses contextvars. While doing some research about Google Cloud logging documentation I found this I found that we won't be able to add some custom fields that we included in our Trust Registry implementation but it should handle the trace correlation automatically. This new implementation sits in the middle in our logging implementations, being Trust Registry the most complete, and Schemas Pub the most minimal. When we call If this works as I hope, I would like to keep this for tracking OAuth, and other useful information. |
Closes #234
This PR enables Cloud Run trace correlation and enhanced Google Cloud logs metadata.
In the previous PR I was able to successfully get our loggers in Cloud Run so this is a iteration over that with a slightly different approach but adding more metadata to the logs.
Cloud Run automatically injects the X-Cloud-Trace-Context header into every request. And it has this format:
TRACE_ID/SPAN_ID;o=TRACE_ENABLED.I decided to somehow capture this header to get some useful information out of it. By doing so we can add important information such as
request.path,request.methodand some very useful labels but most importantly is that we can get the full request trace and get the entire request flow which will make debugging way more easier and complete since now we can see the complete request context.Our current implementation (from the previous PR) achieved the task of displaying loggers into Cloud Logging but they are isolated logs.
For example, in the following screenshot we see that there is a log that indicates that
OAuth callback received with no state parameter. This is a logger that exists in our codebase.And the same user could generates different loggers such as the following but it's hard to understand how they are connected.
With Trace Correlation we'll have the following logging structure:
All logs grouped under the request and expandable. By clicking on "View Trace" we'll see full timeline with spans.
ContextVars
One of the key decision I had to make in this feature was how I'll get the X-Cloud-Trace-Context since google-cloud-logging library does not support this and it's known that trace correlation has to be done manually for these cases.
To achieve the goal of having better cloud logging structure we need to get per-request context (trace header, request info, etc). In my previous PR I started working with
threading.local()but the implementation was not yet completely since I wanted to get the first phase of the implementation ready which was getting Cloud Logging to work as expected.So I started researching about which approach could work the best for us and I took into consideration that we are in a WSGI environment.
ContextVars is thought for ASGI/async environments but it works perfectly in WSGI environments too since contextvars behave essentially like threading.
There are several pros that took me to consider contextvars instead of threading:
Implementation
LoggingTraceContextMiddlewareserves as the entry and exit point for the trace correlation process. It captures trace information when a request first arrives and ensure that this information is cleaned up when the request completes.When Cloud Run receives an HTTP request it automatically injects the X-Cloud-Trace-Context header into every incoming request. This header contains a unique trace identifier that Cloud Logging uses to group all logs related to that specific request. The middleware's first job is to extract this header value at the very beginning of request processing, before any application code has a chance to execute.
First,
process_requestextracts the trace header from the incoming request metadata and also the request path and method (we can capture more information if we want to). These information are then stored in contextvars.Then
process_responseis called after view has executed and returned a response. The middleware job now is to clean up by clearing all the context variables that it set earlier. This avoids that trace information from one request could leak into subsequent requests.Once the middleware handles the process of capturing and storing trace information,
CloudRunTraceFilteris responsible for the actual enrichment of logs records. It access the contextvars that contains the trace information that was gathered previously.This filter parses the header and extracts trace id and span id and formats it according to what Cloud Logging expects.
And lastly, the
get_cloud_logging_handlerchecks if Cloud Logging should be enabled in the current environment by searching for USE_GCLOUD_LOGGING environment variable which is a feature flag that controls where we should use Cloud Logging or local console logging.Google Cloud Logging is enabled only in Staging and Production. When it detects USE_GCLOUD_LOGGING first imports the necessary google cloud logging libraries and then it creates a CloudLoggingClient that makes the connection to Google Cloud Logging's API. This client handles authentication automatically via the service account that is in charge of executing the Cloud Run service (but it has to have the right permissions on Google Cloud IAM section). And lastly, it creates an instance of CloudRunTraceFilter and is where the whole process comes together.