Skip to content

Conversation

@daniel-sanche
Copy link
Contributor

@daniel-sanche daniel-sanche commented Aug 11, 2025

Blocked on #1206

This PR revives #923, which was de-priotirized to work on the sync client. This PR brings it back, working with both async and sync. It also adds a grpc interceptor, as an improved way to capture metadata across both clients


Design

The main architecture looks like this:

300137129-bebbb05a-20f0-45c2-9d38-e95a314edf64 drawio (1)

Most of the work is done by the ActiveOperationMetric class, which is instantiated with each rpc call, and updated through the lifecycle of the call. When the rpc is complete, it will call on_operation_complete and on_attempt_complete on the MetricsHandler, which can then log the completed data into OpenTelemetry (or theoretically, other locations if needed)

Note that there are separate classes for active vs completed metrics (ActiveOperationMetric, ActiveAttemptMetric, CompletedOperationMetric, CompletedAttemptMetric). This is so that we can keep fields mutable and optional while the request is ongoing, but pass down static immutable copies once the attempt is completed and no new data is coming

@daniel-sanche daniel-sanche added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 26, 2025
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 26, 2025
@daniel-sanche daniel-sanche added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 26, 2025
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Nov 26, 2025
@daniel-sanche daniel-sanche marked this pull request as ready for review November 27, 2025 00:57
@daniel-sanche daniel-sanche requested review from a team as code owners November 27, 2025 00:57
@daniel-sanche
Copy link
Contributor Author

Before merging, we should re-run the benchmarking code to make sure we are satisfied with the performance

@vermas2012 vermas2012 assigned mutianf and unassigned vermas2012 Dec 5, 2025

# default values for zone and cluster data, if not captured
DEFAULT_ZONE = "global"
DEFAULT_CLUSTER_ID = "unspecified"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DEFAULT_CLUSTER_ID = "unspecified"
DEFAULT_CLUSTER_ID = "<unspecified>"

DEFAULT_CLUSTER_ID = "unspecified"

# keys for parsing metadata blobs
BIGTABLE_METADATA_KEY = "x-goog-ext-425905942-bin"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a more descriptive name like BIGTABLE_LOCATION_METADATA_KEY?

class OperationType(Enum):
"""Enum for the type of operation being performed."""

READ_ROWS = "ReadRows"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should also be a READ_ROW so we know if it's a point read or a scan

MUTATE_ROW = "MutateRow"
CHECK_AND_MUTATE = "CheckAndMutateRow"
READ_MODIFY_WRITE = "ReadModifyWriteRow"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about BulkMutateRows for write batcher?

backoff_before_attempt_ns: int = 0
# time waiting on grpc channel, in nanoseconds
# TODO: capture grpc_throttling_time
grpc_throttling_time_ns: int = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi: we realized that in java this metric also doesn't capture the time a request queued on the channel. So if it's hard to get it in python we can skip it.

op_type=self.op_type,
uuid=self.uuid,
completed_attempts=self.completed_attempts,
duration_ns=time.monotonic_ns() - self.start_time_ns,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, can we add a sanity check to make sure it's >= 0 ( or if its negative use 0 ) so that in case there's a bug in the code csm won't break the client.


@CrossSync.convert
async def intercept_unary_unary(self, continuation, client_call_details, request):
@_with_active_operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this called? Can you point me to the code location? It feels a bit weird that starting an attempt is called from an interceptor

def __init__(self, **kwargs):
pass

def on_operation_complete(self, op: CompletedOperationMetric) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is on_operation_complete called vs end_with_status in ActiveOperationMetric?

# record metadata from failed rpc
if isinstance(exc, GoogleAPICallError) and exc.errors:
rpc_error = exc.errors[-1]
metadata = list(rpc_error.trailing_metadata()) + list(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this call metrics_interceptor._get_metadata()?

# record ending attempt for timeout failures
attempt_exc = exc_list[-1]
_track_retryable_error(operation)(attempt_exc)
operation.end_with_status(source_exc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is end_with_success called?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: bigtable Issues related to the googleapis/python-bigtable API. size: xl Pull request size is extra large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants