feat: add data model for client side metrics #1187
base: main
Conversation
This reverts commit 6b48242.
Before merging, we should re-run the benchmarking code to make sure we are satisfied with the performance.

```python
# default values for zone and cluster data, if not captured
DEFAULT_ZONE = "global"
DEFAULT_CLUSTER_ID = "unspecified"
```
Suggested change:

```diff
- DEFAULT_CLUSTER_ID = "unspecified"
+ DEFAULT_CLUSTER_ID = "<unspecified>"
```
```python
# keys for parsing metadata blobs
BIGTABLE_METADATA_KEY = "x-goog-ext-425905942-bin"
```
nit: maybe a more descriptive name, like `BIGTABLE_LOCATION_METADATA_KEY`?
```python
class OperationType(Enum):
    """Enum for the type of operation being performed."""

    READ_ROWS = "ReadRows"
```
There should also be a `READ_ROW`, so we know whether it's a point read or a scan.
```python
    MUTATE_ROW = "MutateRow"
    CHECK_AND_MUTATE = "CheckAndMutateRow"
    READ_MODIFY_WRITE = "ReadModifyWriteRow"
```
How about adding a `BulkMutateRows` entry for the write batcher?
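Taken together, the two review suggestions above might extend the enum roughly like this. This is only a sketch: `READ_ROW` and `BULK_MUTATE_ROWS` are the reviewer-suggested additions (their exact names and values are assumptions), while the remaining members come from the diff under review.

```python
from enum import Enum


class OperationType(Enum):
    """Enum for the type of operation being performed.

    Sketch only: READ_ROW and BULK_MUTATE_ROWS are hypothetical
    additions suggested in review; the rest mirror the PR's diff.
    """

    READ_ROW = "ReadRow"                  # suggested: point reads
    READ_ROWS = "ReadRows"                # scans
    MUTATE_ROW = "MutateRow"
    BULK_MUTATE_ROWS = "BulkMutateRows"   # suggested: for the write batcher
    CHECK_AND_MUTATE = "CheckAndMutateRow"
    READ_MODIFY_WRITE = "ReadModifyWriteRow"
```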
```python
    backoff_before_attempt_ns: int = 0
    # time waiting on grpc channel, in nanoseconds
    # TODO: capture grpc_throttling_time
    grpc_throttling_time_ns: int = 0
```
FYI: we realized that in Java this metric also doesn't capture the time a request is queued on the channel. So if it's hard to get in Python, we can skip it.
```python
op_type=self.op_type,
uuid=self.uuid,
completed_attempts=self.completed_attempts,
duration_ns=time.monotonic_ns() - self.start_time_ns,
```
Same here: can we add a sanity check to make sure it's >= 0 (or, if it's negative, use 0), so that if there's a bug in the code, CSM won't break the client?
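A minimal sketch of the suggested clamp. The class and field names here follow the diff above, but the `max(0, ...)` guard and the `duration_ns` helper are assumptions about how the fix could look, not the PR's actual code:

```python
import time


class ActiveOperationMetric:
    """Sketch: only the fields needed to illustrate the clamp."""

    def __init__(self):
        self.start_time_ns = time.monotonic_ns()

    def duration_ns(self) -> int:
        # Clamp to zero so a clock anomaly or bookkeeping bug can never
        # produce a negative duration and break the metrics pipeline.
        return max(0, time.monotonic_ns() - self.start_time_ns)
```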
```python
    @CrossSync.convert
    async def intercept_unary_unary(self, continuation, client_call_details, request):
        @_with_active_operation
```
Where is this called? Can you point me to the code location? It feels a bit weird that starting an attempt is triggered from an interceptor.
```python
    def __init__(self, **kwargs):
        pass

    def on_operation_complete(self, op: CompletedOperationMetric) -> None:
```
How is `on_operation_complete` called, vs `end_with_status` in `ActiveOperationMetric`?
```python
# record metadata from failed rpc
if isinstance(exc, GoogleAPICallError) and exc.errors:
    rpc_error = exc.errors[-1]
    metadata = list(rpc_error.trailing_metadata()) + list(
```
Should this call `metrics_interceptor._get_metadata()`?
```python
# record ending attempt for timeout failures
attempt_exc = exc_list[-1]
_track_retryable_error(operation)(attempt_exc)
operation.end_with_status(source_exc)
```
Where is `end_with_success` called?
Blocked on #1206
This PR revives #923, which was de-prioritized in order to work on the sync client. This PR brings it back, working with both the async and sync clients. It also adds a grpc interceptor as an improved way to capture metadata across both clients.
Design
The main architecture looks like this:
Most of the work is done by the `ActiveOperationMetric` class, which is instantiated with each rpc call and updated throughout the lifecycle of the call. When the rpc is complete, it calls `on_operation_complete` and `on_attempt_complete` on the `MetricsHandler`, which can then log the completed data into OpenTelemetry (or, theoretically, other locations if needed).

Note that there are separate classes for active vs completed metrics (`ActiveOperationMetric`, `ActiveAttemptMetric`, `CompletedOperationMetric`, `CompletedAttemptMetric`). This is so that we can keep fields mutable and optional while the request is ongoing, but pass down static immutable copies once the attempt is completed and no new data is coming.
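The active-to-completed hand-off described above can be sketched like this. The class and field names are simplified from the description, and using a `frozen=True` dataclass is an assumption about how immutability is achieved, not necessarily what the PR does:

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletedOperationMetric:
    """Immutable snapshot handed to the metrics handler."""
    op_type: str
    duration_ns: int
    completed_attempts: int


class ActiveOperationMetric:
    """Mutable state while the rpc is in flight."""

    def __init__(self, op_type: str):
        self.op_type = op_type
        self.start_time_ns = time.monotonic_ns()
        self.completed_attempts = 0

    def end_with_success(self, handler) -> None:
        # Freeze the mutable state into an immutable record and hand it
        # off; no new data arrives after this point.
        completed = CompletedOperationMetric(
            op_type=self.op_type,
            duration_ns=max(0, time.monotonic_ns() - self.start_time_ns),
            completed_attempts=self.completed_attempts,
        )
        handler.on_operation_complete(completed)
```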