feat: add data model for client side metrics #1187

Open

daniel-sanche wants to merge 106 commits into main from csm_1_data_model

Contributor

daniel-sanche commented Aug 11, 2025 (edited)

Blocked on #1206

This PR revives #923, which was deprioritized in favor of work on the sync client. It now works with both the async and sync clients. It also adds a gRPC interceptor as an improved way to capture metadata across both clients.

Design

The main architecture looks like this:

[Figure: architecture diagram]

Most of the work is done by the ActiveOperationMetric class, which is instantiated for each RPC call and updated throughout the lifecycle of the call. When the RPC is complete, it calls on_operation_complete and on_attempt_complete on the MetricsHandler, which can then log the completed data to OpenTelemetry (or, in theory, to other destinations if needed).

Note that there are separate classes for active and completed metrics (ActiveOperationMetric, ActiveAttemptMetric, CompletedOperationMetric, CompletedAttemptMetric). This lets us keep fields mutable and optional while the request is ongoing, but pass down immutable copies once the attempt is completed and no new data is coming.
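The active/completed split described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the field names, the `start_attempt` and `end_attempt_with_status` methods, the handler signatures, and the `RecordingHandler` class are assumptions made for the example, and the separate ActiveAttemptMetric class is folded into a single timestamp for brevity.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CompletedAttemptMetric:
    """Immutable snapshot of one finished attempt (fields are illustrative)."""
    duration_ns: int
    end_status: str


@dataclass(frozen=True)
class CompletedOperationMetric:
    """Immutable snapshot of a finished operation and all of its attempts."""
    op_type: str
    duration_ns: int
    final_status: str
    completed_attempts: Tuple[CompletedAttemptMetric, ...]


class MetricsHandler:
    """Base handler; a subclass would export the data, e.g. to OpenTelemetry."""

    def on_attempt_complete(self, attempt, op) -> None:
        pass

    def on_operation_complete(self, op) -> None:
        pass


@dataclass
class ActiveOperationMetric:
    """Mutable state that is updated while the RPC is in flight."""
    op_type: str
    handlers: List[MetricsHandler] = field(default_factory=list)
    start_time_ns: int = field(default_factory=time.monotonic_ns)
    attempt_start_ns: Optional[int] = None  # stands in for ActiveAttemptMetric
    completed_attempts: List[CompletedAttemptMetric] = field(default_factory=list)

    def start_attempt(self) -> None:
        # hypothetical method name
        self.attempt_start_ns = time.monotonic_ns()

    def end_attempt_with_status(self, status: str) -> None:
        # hypothetical method name; freezes the current attempt, if any
        if self.attempt_start_ns is None:
            return
        attempt = CompletedAttemptMetric(
            duration_ns=time.monotonic_ns() - self.attempt_start_ns,
            end_status=status,
        )
        self.attempt_start_ns = None
        self.completed_attempts.append(attempt)
        for handler in self.handlers:
            handler.on_attempt_complete(attempt, self)

    def end_with_status(self, status: str) -> None:
        # close any open attempt, then hand an immutable copy to the handlers
        self.end_attempt_with_status(status)
        completed = CompletedOperationMetric(
            op_type=self.op_type,
            duration_ns=time.monotonic_ns() - self.start_time_ns,
            final_status=status,
            completed_attempts=tuple(self.completed_attempts),
        )
        for handler in self.handlers:
            handler.on_operation_complete(completed)


class RecordingHandler(MetricsHandler):
    """Hypothetical handler that keeps completed operations in memory."""

    def __init__(self) -> None:
        self.operations: List[CompletedOperationMetric] = []

    def on_operation_complete(self, op) -> None:
        self.operations.append(op)


handler = RecordingHandler()
op = ActiveOperationMetric(op_type="ReadRows", handlers=[handler])
op.start_attempt()
op.end_attempt_with_status("UNAVAILABLE")  # first attempt fails
op.start_attempt()
op.end_with_status("OK")  # second attempt succeeds and ends the operation
```

Because the completed classes are frozen dataclasses, a handler cannot accidentally mutate the data it receives, while the active object stays free to fill in fields as the call progresses.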

daniel-sanche and others added 30 commits

July 25, 2025 16:12


          use replaceable channel wrapper

d2175f1


          got unit tests working

5e107fc


          put back in cache invalidation

c4a97e1


          added wrapped multicallables to avoid cache invalidation

e71b1d5


          added crosssync, moved close logic back to client

b81a9be


          generated sync code

a1dffb5


          got tests running

e3ec02b


          fixed tests

4e13783


          remove extra wrapper; added invalidate_stubs helper

7d90a04


          fixed lint

26cd601


          fixed lint

375332f


          renamed replaceablechannel to swappablechannel

428d75a


          added tests

4b39bc5


          added docstrings

3f090c2


          Merge branch 'main' into refactor_refresh

883ceab


          initial commit

04c762a


          added back interceptor

29dff4d


          added metrics to client

e4f8238


          fixed lint

fcb062e


          Merge branch 'refactor_refresh' into csm_1_data_model

ac8dbe4


          set up channel interceptions

d155f8a


          added TrackedBackoffGenerator

9fece96


          fixed lint

aec2577


          fixed import

ec4e847


          added operation.cancel

8f99e4e


          added operation cancelled to interceptor

f8e6603


          gave each operation a uuid

f5e057e


          return attempt metric on new attempt

8c397bb


          use standard context manager

2c34198


          use default backoff generator

9bd1e07

daniel-sanche added the kokoro:force-run label

yoshi-kokoro removed the kokoro:force-run label

daniel-sanche added the kokoro:force-run label

yoshi-kokoro removed the kokoro:force-run label

daniel-sanche added 15 commits

November 26, 2025 14:15


          loosened test tolerances

6a4d742


          removed metrics superclass from interceptor

a474560


          fixed lint


          improved comments

3e0d134


          moved interceptor into _metrics

6b48242


          pulled out tracking into new file

dcf3d0a


          simplified wrapper method

c94e4ff


          Revert "moved interceptor into _metrics"

3a87a35

This reverts commit 6b48242.


          moved tracked retry out of autogen folder

b5c361b


          fixed typing

1b0b857


          added tests

ac315d0


          removed unneeded imports

87e78d1


          ran blacken

54b7208


          Moved retry trackers into own file

819e1ae


          added docstring

22eb2e1

daniel-sanche marked this pull request as ready for review

November 27, 2025 00:57

daniel-sanche requested review from a team as code owners

November 27, 2025 00:57

blunderbuss-gcf bot assigned vermas2012

daniel-sanche added 2 commits

November 26, 2025 16:58


          fixed type

0ec8d14


          import annotations

fa25c2b

Contributor Author

daniel-sanche commented Dec 3, 2025

Before merging, we should re-run the benchmarking code to make sure we are satisfied with the performance

vermas2012 assigned mutianf and unassigned vermas2012

mutianf reviewed

google/cloud/bigtable/data/_metrics/data_model.py

    
              # default values for zone and cluster data, if not captured

              DEFAULT_ZONE = "global"

              DEFAULT_CLUSTER_ID = "unspecified"

Contributor

mutianf Dec 4, 2025

Suggested change

- DEFAULT_CLUSTER_ID = "unspecified"
+ DEFAULT_CLUSTER_ID = "<unspecified>"

google/cloud/bigtable/data/_metrics/data_model.py

    
              DEFAULT_CLUSTER_ID = "unspecified"

              # keys for parsing metadata blobs

              BIGTABLE_METADATA_KEY = "x-goog-ext-425905942-bin"

Contributor

mutianf Dec 4, 2025

nit: maybe a more descriptive name like BIGTABLE_LOCATION_METADATA_KEY?

google/cloud/bigtable/data/_metrics/data_model.py

    
              class OperationType(Enum):

                  """Enum for the type of operation being performed."""

                  READ_ROWS = "ReadRows"

Contributor

mutianf Dec 4, 2025

there should also be a READ_ROW so we know if it's a point read or a scan

google/cloud/bigtable/data/_metrics/data_model.py

    
                  MUTATE_ROW = "MutateRow"

                  CHECK_AND_MUTATE = "CheckAndMutateRow"

                  READ_MODIFY_WRITE = "ReadModifyWriteRow"

Contributor

mutianf Dec 4, 2025

how about BulkMutateRows for write batcher?
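For illustration only, the two enum suggestions in this thread (a separate point-read type and a bulk-mutation type) might look like the sketch below. The member names READ_ROW and BULK_MUTATE_ROWS, their string values, and the `is_point_read` helper are placeholders assumed for the example; they are not taken from the PR.

```python
from enum import Enum


class OperationType(Enum):
    """Enum for the type of operation being performed."""

    READ_ROWS = "ReadRows"
    # Hypothetical member for point reads; the string value is a placeholder.
    READ_ROW = "ReadRow"
    MUTATE_ROW = "MutateRow"
    # Hypothetical member for the write batcher; the string value is a placeholder.
    BULK_MUTATE_ROWS = "BulkMutateRows"
    CHECK_AND_MUTATE = "CheckAndMutateRow"
    READ_MODIFY_WRITE = "ReadModifyWriteRow"


def is_point_read(op_type: OperationType) -> bool:
    """Hypothetical helper: distinguish a point read from a scan."""
    return op_type is OperationType.READ_ROW
```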

google/cloud/bigtable/data/_metrics/data_model.py

    
                  backoff_before_attempt_ns: int = 0

                  # time waiting on grpc channel, in nanoseconds

                  # TODO: capture grpc_throttling_time

                  grpc_throttling_time_ns: int = 0

Contributor

mutianf Dec 4, 2025

fyi: we realized that in Java this metric also doesn't capture the time a request spends queued on the channel. So if it's hard to get in Python, we can skip it.

google/cloud/bigtable/data/_metrics/data_model.py

    
                          op_type=self.op_type,

                          uuid=self.uuid,

                          completed_attempts=self.completed_attempts,

                          duration_ns=time.monotonic_ns() - self.start_time_ns,

Contributor

mutianf Dec 6, 2025

same here, can we add a sanity check to make sure it's >= 0 (or use 0 if it's negative), so that if there's a bug in the code, csm won't break the client?
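A minimal sketch of the clamp requested here, assuming a hypothetical helper named `non_negative_duration_ns` (the PR itself computes the duration inline with `time.monotonic_ns() - self.start_time_ns`):

```python
import time
from typing import Optional


def non_negative_duration_ns(start_time_ns: int, end_time_ns: Optional[int] = None) -> int:
    """Return the elapsed nanoseconds, clamped to zero if the result is negative.

    end_time_ns defaults to the current monotonic clock reading.
    """
    if end_time_ns is None:
        end_time_ns = time.monotonic_ns()
    # A negative value would indicate a bookkeeping bug; report 0 instead
    # so that metrics collection never raises or exports invalid data.
    return max(0, end_time_ns - start_time_ns)
```

With this helper, a start timestamp that was accidentally recorded after the end timestamp yields a duration of 0 rather than a negative number.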

google/cloud/bigtable/data/_async/metrics_interceptor.py

    
                  @CrossSync.convert

                  async def intercept_unary_unary(self, continuation, client_call_details, request):

                  @_with_active_operation

Contributor

mutianf Dec 8, 2025

where is this called? Can you point me to the code location? It feels a bit weird that starting an attempt is called from an interceptor

google/cloud/bigtable/data/_metrics/handlers/_base.py

    
                  def __init__(self, **kwargs):

                      pass

                  def on_operation_complete(self, op: CompletedOperationMetric) -> None:

Contributor

mutianf Dec 8, 2025

how is on_operation_complete called vs end_with_status in ActiveOperationMetric?

google/cloud/bigtable/data/_metrics/tracked_retry.py

    
                          # record metadata from failed rpc

                          if isinstance(exc, GoogleAPICallError) and exc.errors:

                              rpc_error = exc.errors[-1]

                              metadata = list(rpc_error.trailing_metadata()) + list(

Contributor

mutianf Dec 8, 2025

should this call metrics_interceptor._get_metadata()?

google/cloud/bigtable/data/_metrics/tracked_retry.py

    
                          # record ending attempt for timeout failures

                          attempt_exc = exc_list[-1]

                          _track_retryable_error(operation)(attempt_exc)

                      operation.end_with_status(source_exc)

Contributor

mutianf Dec 8, 2025

where is end_with_success called?

Labels

api: bigtable size: xl