From 0cc7bd1ebf84d036fb46e863f978b60993a960be Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Tue, 6 May 2025 10:24:19 -0700
Subject: [PATCH 1/8] x

---
 A98-debug-protocol.md | 191 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 A98-debug-protocol.md

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
new file mode 100644
index 000000000..7e1b38496
--- /dev/null
+++ b/A98-debug-protocol.md
@@ -0,0 +1,191 @@
+Title
+----
+* Author(s): ctiller
+* Approver: markdroth
+* Status: Draft
+* Implemented in:
+* Last updated: 2025/05/06
+* Discussion at: (filled after thread exists)
+
+## Abstract
+
+Add a generalized debug interface for gRPC services.
+
+## Background
+
+In A14 we added channelz.
+This protocol mixes some lightweight monitoring with some tracing and debuggability features.
+It suffers from being relatively rigid in the topology it presents, and the set of node types available.
+
+
+### Related Proposals:
+* A14 - channelz.
+
+## Proposal
+
+A new debug service proto will be added, `debug/debug.proto` in the `grpc.debug.v1` namespace.
+
+### Entities
+
+The fundamental building block of the protocol is a queryable entity.
+An entity describes one live object inside of a gRPC implementation.
+This might be a call, a channel, a transport, or a TCP socket. It might also describe a resource quota or a watched configuration file.
+
+One entity is described as such:
+
+```
+message Entity {
+  // The identifier for this entity.
+  int64 id = 1;
+  // The kind of this entity.
+  string kind = 2;
+  // Parents for this entity.
+  repeated int64 parents = 3;
+  // Instantaneous data for this entity.
+  repeated google.protobuf.Any data = 4;
+  // Historical trace information for the entity.
+  repeated TraceEvent trace = 5;
+}
+```
+
+Entities have state and configuration, and tend be be relatively long lived - such that querying them makes sense.
+
+**id**: An entity is identified by an id.
+These ids are allocated sequentially per the rationale in A14, and implementations should use the same id space for debug entities and channelz objects.
+
+**kind**: An entity has a kind. This is a string descriptor of the general category of object that this entity describes.
+We use strings here rather than enums so that implementations are free to extend the kind space with their own objects.
+Common entity kinds will see some level of standardization across stacks, but we expect many kinds to be specific per gRPC implementation.
+Initially implementations are expected to match kinds with channelz object types:
+kind `"channel"` -> `channelz.v1.Channel`, `"subchannel"` -> `channelz.v1.Subchannel`, `"server"` -> `channelz.v1.Server`, `"socket"` -> `channelz.v1.Socket`.
+
+
+**parents**: It's useful to be able to associate entities in parent/child hierarchies.
+For example, a channel has many subchannel children.
+Channelz listed specific kinds of children in its various node types - this tracking (and the need to produce it when sending an object) has caused contention issues in implementations in the past.
+Should the list of children (optionally of a particular kind) for an entity be desired - and that's common - a separate paginated service call can be made.
+Instead, this protocol lists only parents (as that set is far more stable).
+Multiple parents are allowed - to handle at least the case of C++ subchannels being owned by multiple channels.
+
+**data**: This is a list of protobuf Any objects describing the current state of this entity.
+Implementations may define their own protobufs to be exposed here, and common sets will be standardized separately.
+
+**trace**: Finally, an entity may supply a small summary of its history as a trace.
+Each event is defined as:
+
+```
+message TraceEvent {
+  // High level description of the event.
+  string description = 1;
+  // When this event occurred.
+  google.protobuf.Timestamp timestamp = 2;
+  // Any additional supporting data.
+  repeated google.protobuf.Any data = 3;
+  // Any referenced entities.
+  repeated int64 referenced_entities = 4;
+}
+```
+
+These are made available to all entities (in contrast to channelz that selected which nodes had traces, and which did not) - though an implementation need not make that facility available in its implementation of entities.
+
+Also note that any notion of severity has been removed from the protocol (in contrast to channelz) - in practice this has not been a useful field.
+When converting this protocol to channelz all trace events should be taken as CT_INFO.
+
+Again, there is a facility for implementations to provide their own additional information in the **data** field.
+
+### Queries
+
+Queries are made available via the `Debug` service:
+
+```
+service Debug {
+  // Gets all entities of a given kind, optionally with a given parent.
+  rpc QueryEntities(QueryEntitiesRequest) returns (QueryEntitiesResponse);
+  // Gets information for a specific entity.
+  rpc GetEntity(GetEntityRequest) returns (GetEntityResponse);
+  // Query a named trace from an entity.
+  // These query live information from the system, and run for as long
+  // as the query time is made for.
+  rpc QueryTrace(QueryTraceRequest) returns (QueryTraceResponse);
+}
+```
+
+**QueryEntities** allows the full database of entities to be queried for entries that match a set of criteria.
+The allowed criteria may be extended over time with additional gRFCs.
+
+```
+message QueryEntitiesRequest {
+  // The kind of entities to query.
+  // If this is set to empty then all kinds will be queried.
+  string kind = 1;
+  // Filter the entities so that only children of this parent are returned.
+  // If this is 0 then no parent filter is applied.
+  int64 parent = 2;
+}
+
+message QueryEntitiesResponse {
+  // List of entities that match the query.
+  repeated Entity entities = 1;
+}
+```
+
+**GetEntity** allows polling a single entity's state.
+
+```
+message GetEntityRequest {
+  // The identifier of the entity to get.
+  int64 id = 1;
+}
+
+message GetEntityResponse {
+  // The Entity that corresponds to the requested id. This field
+  // should be set.
+  Entity entity = 1;
+}
+```
+
+Finally, **QueryTrace** allows rich queries of live trace data to be pulled from an instance.
+
+```
+message QueryTraceRequest {
+  // The identifier of the entity to query.
+  int64 id = 1;
+  // The name of the trace to query.
+  string name = 2;
+  // The amount of time to query. Implementations may arbitrarily cap this.
+  google.protobuf.Duration duration = 3;
+  // Implementation defined query arguments.
+  repeated google.protobuf.Any args = 4;
+}
+
+message QueryTraceResponse {
+  // The events in the trace.
+  repeated TraceEvent events = 1;
+  // Number of events matched by the trace.
+  // This may be higher than the number returned if memory limits were exceeded.
+  int64 num_events_matched = 2;
+}
+```
+
+The idea here is that these traces will be very detailed - beyond what can feasibly be stored in the historical traces.
+So instead of needing to store this data in historical traces *in case* it is needed, we query and collect this data on demand for small windows of time.
+
+## Rationale
+
+The new interface is heavily inspired by channelz, but removes rigid node associations and predefined data sets.
+Much of the infrastructure required for channelz can be shared with the debug interface, and indeed one can be implemented atop the other with good fidelity (in either direction).
+
+Node ids are consistent between the protocols so that one can take a node id from the debug interface and query channelz data directly (and vice versa).
+This will allow consumers of the protocols to gradually transition from one to the other.
+It also allows re-using the book-keeping already in existence for channelz in implementations, without adding another kind of book-keeping to the mix.
+
+Why not just extend channelz?
+
+There are some fundamental data model issues in that protocol that I'd like to address going forward:
+
+* chaotic-good (a transport currently implemented and in production in C++) has multiple TCP sockets per transport. This might be representable as a subchannel with multiple sockets on the client side (though we've already represented the HTTP2 transport as a socket there), but server side makes no allowance for properly describing the object hierarchy.
+* Further, the set of node types ("kinds" here-in) and their relationships are pre-baked into the protocol, and as we evolve and improve gRPC new node types are needed and new relationships will be added or removed. We should not need to update channelz each time this happens.
+
+## Implementation
+
+ctiller will implement this for C++. Other languages may pick this up as needed.

From 4fb593d00f55e8d25c147663e85edc2f1cadab73 Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Tue, 6 May 2025 10:25:36 -0700
Subject: [PATCH 2/8] x

---
 A98-debug-protocol.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index 7e1b38496..4dd2bf6ec 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -1,4 +1,4 @@
-Title
+A98: Debug protocol
 ----
 * Author(s): ctiller
 * Approver: markdroth

From 709a7ba7cde5946a452abbee8ad89356453bd380 Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Tue, 6 May 2025 10:41:07 -0700
Subject: [PATCH 3/8] x

---
 A98-debug-protocol.md | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index 4dd2bf6ec..69a431622 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -170,6 +170,35 @@ message QueryTraceResponse {
 The idea here is that these traces will be very detailed - beyond what can feasibly be stored in the historical traces.
 So instead of needing to store this data in historical traces *in case* it is needed, we query and collect this data on demand for small windows of time.
 
+### Well Known Data
+
+A separate `well_known_data.proto` file will be maintained, with debug state that's common across implementations described within it.
+
+The initial state will be:
+
+```
+// Channel connectivity state - attached to kind "channel" and "subchannel".
+// These come from the specified states in this document:
+// https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
+message ChannelConnectivityState {
+  enum State {
+    UNKNOWN = 0;
+    IDLE = 1;
+    CONNECTING = 2;
+    READY = 3;
+    TRANSIENT_FAILURE = 4;
+    SHUTDOWN = 5;
+  }
+  State state = 1;
+}
+
+// Channel target information. Attached to kind "channel" and "subchannel".
+message ChannelTarget {
+  // The target this channel originally tried to connect to.  May be absent
+  string target = 1;
+}
+```
+
 ## Rationale
 
 The new interface is heavily inspired by channelz, but removes rigid node associations and predefined data sets.
@@ -186,6 +215,10 @@ There are some fundamental data model issues in that protocol that I'd like to a
 * chaotic-good (a transport currently implemented and in production in C++) has multiple TCP sockets per transport. This might be representable as a subchannel with multiple sockets on the client side (though we've already represented the HTTP2 transport as a socket there), but server side makes no allowance for properly describing the object hierarchy.
 * Further, the set of node types ("kinds" here-in) and their relationships are pre-baked into the protocol, and as we evolve and improve gRPC new node types are needed and new relationships will be added or removed. We should not need to update channelz each time this happens.
 
+Specific metrics have been not been carried forward from channelz as required parts of the protocol.
+Users that need call or message counts from the system are encouraged to use the telemetry features of gRPC.
+Implementations however are encouraged to publish whatever metrics they have available at query time to some `data` protobuf.
+
 ## Implementation
 
 ctiller will implement this for C++. Other languages may pick this up as needed.

From 4ab3630ff6b9a5dfd1bef2714d70f740bd7a4046 Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Tue, 6 May 2025 10:45:28 -0700
Subject: [PATCH 4/8] Update A98-debug-protocol.md

---
 A98-debug-protocol.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index 69a431622..d24539c59 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -5,7 +5,7 @@ A98: Debug protocol
 * Status: Draft
 * Implemented in:
 * Last updated: 2025/05/06
-* Discussion at: (filled after thread exists)
+* Discussion at: https://groups.google.com/g/grpc-io/c/XrOzA4akIHo
 
 ## Abstract
 

From a4e82f2da1a94e58b08e11ae4afa0330ceab290c Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Tue, 6 May 2025 12:33:31 -0700
Subject: [PATCH 5/8] Update A98-debug-protocol.md

Co-authored-by: Yuri Golobokov
---
 A98-debug-protocol.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index d24539c59..00208d829 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -215,7 +215,7 @@ There are some fundamental data model issues in that protocol that I'd like to a
 * chaotic-good (a transport currently implemented and in production in C++) has multiple TCP sockets per transport. This might be representable as a subchannel with multiple sockets on the client side (though we've already represented the HTTP2 transport as a socket there), but server side makes no allowance for properly describing the object hierarchy.
 * Further, the set of node types ("kinds" here-in) and their relationships are pre-baked into the protocol, and as we evolve and improve gRPC new node types are needed and new relationships will be added or removed. We should not need to update channelz each time this happens.
 
-Specific metrics have been not been carried forward from channelz as required parts of the protocol.
+Specific metrics have not been carried forward from channelz as required parts of the protocol.
 Users that need call or message counts from the system are encouraged to use the telemetry features of gRPC.
 Implementations however are encouraged to publish whatever metrics they have available at query time to some `data` protobuf.
 

From e804e700890e21ab1461310310afc5534f1f8e30 Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Fri, 30 May 2025 09:27:36 -0700
Subject: [PATCH 6/8] x

---
 A98-debug-protocol.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index d24539c59..285665fc8 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -41,14 +41,17 @@ message Entity {
   string kind = 2;
   // Parents for this entity.
   repeated int64 parents = 3;
+  // Has this entity been orphaned?
+  bool orphaned = 4;
   // Instantaneous data for this entity.
-  repeated google.protobuf.Any data = 4;
+  repeated google.protobuf.Any data = 5;
   // Historical trace information for the entity.
-  repeated TraceEvent trace = 5;
+  repeated TraceEvent trace = 6;
 }
 ```
 
 Entities have state and configuration, and tend be be relatively long lived - such that querying them makes sense.
+It's allowed for this protocol to return entities who's active object has already been deleted.
 
 **id**: An entity is identified by an id.
 These ids are allocated sequentially per the rationale in A14, and implementations should use the same id space for debug entities and channelz objects.
@@ -67,6 +70,8 @@ Should the list of children (optionally of a particular kind) for an entity be d
 Instead, this protocol lists only parents (as that set is far more stable).
 Multiple parents are allowed - to handle at least the case of C++ subchannels being owned by multiple channels.
 
+**orphaned**: If the gRPC object that this entity represents has been deleted, then this field MUST be set to true to represent that this is potentially stale data.
+
 **data**: This is a list of protobuf Any objects describing the current state of this entity.
 Implementations may define their own protobufs to be exposed here, and common sets will be standardized separately.
 

From 1ec11aebe9420ec8f1dc0d47a7d3ac45755294ee Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Fri, 30 May 2025 10:36:21 -0700
Subject: [PATCH 7/8] x

---
 A98-debug-protocol.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index 2dcde1616..aed711029 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -92,6 +92,12 @@ message TraceEvent {
 ```
 
 These are made available to all entities (in contrast to channelz that selected which nodes had traces, and which did not) - though an implementation need not make that facility available in its implementation of entities.
+Implementations may limit memory per entity, or impose an overall system limit to the amount of traces collected.
+
+The entity trace is a historical snapshot of important events in the entities history.
+Which is to say it's not intended to be a high fidelity log of every event that occurred.
+It's recommended that implementations limit this trace to high value historical data (a channel disconnected for this reason), and additionally provide some higher fidelity on currently ongoing operations.
+For in-depth examination of all events on an entity the QueryTrace API provides live tracing of entities.
 
 Also note that any notion of severity has been removed from the protocol (in contrast to channelz) - in practice this has not been a useful field.
 When converting this protocol to channelz all trace events should be taken as CT_INFO.
@@ -111,7 +117,7 @@ service Debug {
   // Query a named trace from an entity.
   // These query live information from the system, and run for as long
   // as the query time is made for.
-  rpc QueryTrace(QueryTraceRequest) returns (QueryTraceResponse);
+  rpc QueryTrace(QueryTraceRequest) returns (stream QueryTraceResponse);
 }
 ```
 
@@ -157,14 +163,14 @@ message QueryTraceRequest {
   int64 id = 1;
   // The name of the trace to query.
   string name = 2;
-  // The amount of time to query. Implementations may arbitrarily cap this.
-  google.protobuf.Duration duration = 3;
   // Implementation defined query arguments.
   repeated google.protobuf.Any args = 4;
 }
 
 message QueryTraceResponse {
   // The events in the trace.
+  // If multiple events occurred between the last message in the stream being
+  // sent and this one being sent, this can contain more than one event.
   repeated TraceEvent events = 1;
   // Number of events matched by the trace.
   // This may be higher than the number returned if memory limits were exceeded.

From 9c3e25b178d53a90fb02338bd9910ab4d944c90c Mon Sep 17 00:00:00 2001
From: Craig Tiller
Date: Fri, 30 May 2025 11:40:16 -0700
Subject: [PATCH 8/8] x

---
 A98-debug-protocol.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/A98-debug-protocol.md b/A98-debug-protocol.md
index aed711029..9f4ff06cb 100644
--- a/A98-debug-protocol.md
+++ b/A98-debug-protocol.md
@@ -1,4 +1,4 @@
-A98: Debug protocol
+A98: Channelz v2
 ----
 * Author(s): ctiller
 * Approver: markdroth
@@ -16,6 +16,7 @@ Add a generalized debug interface for gRPC services.
 In A14 we added channelz.
 This protocol mixes some lightweight monitoring with some tracing and debuggability features.
 It suffers from being relatively rigid in the topology it presents, and the set of node types available.
+This update improves the protocols flexibility.
 
 
 ### Related Proposals:
@@ -23,7 +24,7 @@ It suffers from being relatively rigid in the topology it presents, and the set
 
 ## Proposal
 
-A new debug service proto will be added, `debug/debug.proto` in the `grpc.debug.v1` namespace.
+A new debug service proto will be added, `channelz/v2/channelz.proto` in the `grpc.channelz.v2` namespace.
 
 ### Entities
 
@@ -106,10 +107,10 @@ Again, there is a facility for implementations to provide their own additional i
 
 ### Queries
 
-Queries are made available via the `Debug` service:
+Queries are made available via the `Channelz` service:
 
 ```
-service Debug {
+service Channelz {
   // Gets all entities of a given kind, optionally with a given parent.
   rpc QueryEntities(QueryEntitiesRequest) returns (QueryEntitiesResponse);
   // Gets information for a specific entity.
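
The filter semantics that `QueryEntitiesRequest` describes in this series (an empty `kind` matches every kind, and a `parent` of 0 disables the parent filter) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entity` dataclass, the `query_entities` helper, and the sample ids are hypothetical stand-ins written for this note, not generated protobuf code and not part of any gRPC implementation.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Entity:
    """Hypothetical local stand-in for the proposed Entity message."""
    id: int
    kind: str
    parents: List[int] = field(default_factory=list)
    orphaned: bool = False


def query_entities(entities: Iterable[Entity], kind: str = "", parent: int = 0) -> List[Entity]:
    """Apply the two QueryEntitiesRequest filters as the proposal words them.

    An empty kind matches every kind; a parent of 0 applies no parent filter.
    """
    return [
        e for e in entities
        if (kind == "" or e.kind == kind)
        and (parent == 0 or parent in e.parents)
    ]


# Hypothetical sample data: one subchannel owned by two channels, which is
# the multiple-parents case the proposal mentions for C++.
ENTITIES = [
    Entity(id=1, kind="channel"),
    Entity(id=2, kind="channel"),
    Entity(id=3, kind="subchannel", parents=[1, 2]),
    Entity(id=4, kind="socket", parents=[3]),
]
```

Because only parents are recorded on each entity, listing the children of entity 3 is a query with `parent=3` rather than a field read, which mirrors the proposal's choice to keep child lists out of the `Entity` message.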