From 7712e424bb723131dd97fad45dd1941c14129fcb Mon Sep 17 00:00:00 2001
From: "Mark D. Roth"
Date: Tue, 7 Oct 2025 18:50:45 +0000
Subject: [PATCH 01/33] WIP

---
 ...x_concurrent_streams-connection-scaling.md | 53 +++++++++++++++++++
 1 file changed, 53 insertions(+)
 create mode 100644 A105-max_concurrent_streams-connection-scaling.md

diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md
new file mode 100644
index 000000000..1f6dac840
--- /dev/null
+++ b/A105-max_concurrent_streams-connection-scaling.md
@@ -0,0 +1,53 @@
+A105: MAX_CONCURRENT_STREAMS Connection Scaling
+----
+* Author(s): @markdroth, @dfawley
+* Approver: @ejona86
+* Status: {Draft, In Review, Ready for Implementation, Implemented}
+* Implemented in:
+* Last updated: 2025-10-06
+* Discussion at: (filled after thread exists)
+
+## Abstract
+
+We propose allowing gRPC clients to automatically establish new
+connections to the same endpoint address when they hit the HTTP/2
+MAX_CONCURRENT_STREAMS limit for an existing connection.
+
+## Background
+
+HTTP/2 contains a connection-level setting called
+[MAX_CONCURRENT_STREAMS][H2MCS] that limits the number of streams a peer
+may initiate. This is often set by servers or proxies to protect against
+a single client using too many resources. However, in an environment with
+reverse proxies or virtual IPs, it may be possible to create multiple
+connections to the same IP address that lead to different physical
+servers, and this can be a feasible way for a client to achieve more
+throughput (QPS) without overloading any server.
+
+### Related Proposals:
+* A list of proposals this proposal builds on or supersedes.
+
+[H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS
+
+## Proposal
+
+[A precise statement of the proposed change.]
+
+
+
+### Temporary environment variable protection
+
+[Name the environment variable(s) used to enable/disable the feature(s) this proposal introduces and their default(s). Generally, features that are enabled by I/O should include this type of control until they have passed some testing criteria, which should also be detailed here. This section may be omitted if there are none.]
+
+## Rationale
+
+[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.]
+
+
+## Implementation
+
+[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.]
+
+## Open issues (if applicable)
+
+[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.]

From 72b6c81c308fd7cb29c3c495bc9c45d2ca0bd8e5 Mon Sep 17 00:00:00 2001
From: "Mark D. 
Roth" Date: Fri, 10 Oct 2025 01:05:46 +0000 Subject: [PATCH 02/33] more WIP --- ...x_concurrent_streams-connection-scaling.md | 96 +++++++++++++++++-- 1 file changed, 89 insertions(+), 7 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 1f6dac840..83e2343f4 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-06 +* Last updated: 2025-10-09 * Discussion at: (filled after thread exists) ## Abstract @@ -24,29 +24,111 @@ connections to the same IP address that lead to different physical servers, and this can be a feasible way for a client to achieve more throughput (QPS) without overloading any server. +Today, the MAX_CONCURRENT_STREAMS limit is visible only inside the +transport, not to anything at the channel layer. This means that the +channel may dispatch an RPC to a subchannel that is already at its +MAX_CONCURRENT_STREAMS limit, even if there are other subchannels that +are not at that limit and would be able to accept the RPC. In this +situation, the transport normally queues the RPC until it can start +another stream (i.e., either when one of the existing streams ends or +when the server increases its MAX_CONCURRENT_STREAMS limit). + +Applications today are typically forced to deal with this problem +by creating multiple gRPC channels to the same target (and disabling +subchannel sharing, for implementations that share subchannels between +channels) and doing their own load balancing across those channels. +This is an obvious flaw in the gRPC channel abstraction, because the +channel is intended to hide connection management details like this so +that the application doesn't have to know about it. It is also not a +very flexible solution for the application in a case where the server is +dynamically tuning its MAX_CONCURRENT_STREAMS setting, because the +application cannot see that setting in order to tune the number of +channels it uses. + ### Related Proposals: -* A list of proposals this proposal builds on or supersedes. +* [A9: Server-side Connection Management][A9] +* [A61: IPv4 and IPv6 Dualstack Backend Support][A61] +[A9]: A9-server-side-conn-mgt.md +[A61]: A61-IPv4-IPv6-dualstack-backends.md [H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS ## Proposal -[A precise statement of the proposed change.] - - +We will add a mechanism for the gRPC client to dynamically scale up the +number of connections to a particular endpoint address when it hits the +MAX_CONCURRENT_STREAMS limit. This functionality will be implemented at +the subchannel layer. + +There are several parts to this proposal: +- Configuration for the max number of connections per subchannel, via + either the [service + config](https://github.com/grpc/grpc/blob/master/doc/service_config.md) + or via a per-endpoint attribute in the LB policy tree. +- A mechanism for the transport to report the current MAX_CONCURRENT_STREAMS + setting to the subchannel layer. +- Connection scaling functionality in the subchannel. +- Functionality to pick a connection for each RPC in the subchannel, + including the ability to queue RPCs while waiting for new connections + to be established. 
+ +### Configuration + +- service config +- per-endpoint attribute in the LB policy tree + - needs to NOT affect subchannel identity + - if subchannels are shared between channels, will use the highest + value of any channel sharing the subchannel + - in C-core, will just have a channel arg that the channel will treat + specially -- this works because "updating" a subchannel is done by + "recreating" it + - in java/go, consider some sort of injector approach? + - note: requires implementations to have only one address per subchannel, + as per A61 (right?) +- xDS story (per-host circuit breaker) +- disabled by default (max connections per subchannel == 1) + +### Transport Reporting Current MAX_CONCURRENT_STREAMS + +- transport needs to report this to subchannel +- in C-core, maybe revamp the connectivity state API between transport + and subchannel? was thinking about doing this anyway for subchannel + metrics disconnection reason -- but need to figure out a plan for + direct channels + +### Connection Scaling Within the Subchannel + +- conditions for scaling up +- connections are removed when they terminate (depends on A9 configured + on server side) +- need to document connectivity state behavior (and update client + channel spec!) + +### Picking a Connection for Each RPC + +- algorithm for picking a connection for each RPC + - show picker pseudo-code +- RPC queueing in the subchannel +- describe synchronization ### Temporary environment variable protection -[Name the environment variable(s) used to enable/disable the feature(s) this proposal introduces and their default(s). Generally, features that are enabled by I/O should include this type of control until they have passed some testing criteria, which should also be detailed here. This section may be omitted if there are none.] +Enabling this feature via either the gRPC service config or xDS will +initially be guarded via the environment variable +`GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. The +feature will be enabled by default once it has passed interop tests. + +- TODO: define interop tests? ## Rationale [A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] +- TODO: document reasons why not doing this in LB policy tree ## Implementation -[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] +Will be implemented in C-core, Java, and Go. ## Open issues (if applicable) From 71b747313c684ed538fddf476991beaa54ea1d3d Mon Sep 17 00:00:00 2001 From: "Mark D. 
Roth" Date: Tue, 14 Oct 2025 00:04:08 +0000 Subject: [PATCH 03/33] fleshed out most of it, just a few TODOs left --- ...x_concurrent_streams-connection-scaling.md | 483 ++++++++++++++++-- 1 file changed, 445 insertions(+), 38 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 83e2343f4..3c204f071 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-09 +* Last updated: 2025-10-13 * Discussion at: (filled after thread exists) ## Abstract @@ -45,12 +45,20 @@ dynamically tuning its MAX_CONCURRENT_STREAMS setting, because the application cannot see that setting in order to tune the number of channels it uses. -### Related Proposals: +### Related Proposals: +* [A6: gRPC Retry Design][A6] * [A9: Server-side Connection Management][A9] +* [A32: xDS Circuit Breaking][A32] * [A61: IPv4 and IPv6 Dualstack Backend Support][A61] +* [A79: Non-per-call Metrics Architecture][A79] +* [A94: OTel Metrics for Subchannels][A94] +[A6]: A6-client-retries.md [A9]: A9-server-side-conn-mgt.md +[A32]: A32-xds-circuit-breaking.md [A61]: A61-IPv4-IPv6-dualstack-backends.md +[A79]: A79-non-per-call-metrics-architecture.md +[A94]: A94-subchannel-otel-metrics.md [H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS ## Proposal @@ -74,42 +82,389 @@ There are several parts to this proposal: ### Configuration -- service config -- per-endpoint attribute in the LB policy tree - - needs to NOT affect subchannel identity - - if subchannels are shared between channels, will use the highest - value of any channel sharing the subchannel - - in C-core, will just have a channel arg that the channel will treat - specially -- this works because "updating" a subchannel is done by - "recreating" it - - in java/go, consider some sort of injector approach? - - note: requires implementations to have only one address per subchannel, - as per A61 (right?) -- xDS story (per-host circuit breaker) -- disabled by default (max connections per subchannel == 1) +Connection scaling will be configured via a new service config field, +as follows (schema shown in protobuf form, although gRPC actually accepts +the service config in JSON form): + +```proto +message ServiceConfig { + // ...existing fields... + + // Settings to control dynamic connection scaling. + message ConnectionScaling { + // Maximum connections gRPC will maintain for each subchannel in + // this channel. When no streams are available for an RPC in a + // subchannel, gRPC will automatically create new connections up + // to this limit. If this value changes during the life of a + // channel, existing subchannels will be updated to reflect + // the change. No connections will be closed as a result of + // lowering this value; down-scaling will only happen as + // connections are lost naturally. + // + // Values higher than the client-enforced limit (by default, 10) + // will be clamped to that limit. + google.protobuf.UInt32Value max_connections_per_subchannel = 1; + } + ConnectionScaling connection_scaling = N; +} +``` + +In the subchannel, connection scaling will be configured via a parameter +called max_connections_per_subchannel, which be passed into the subchannel +via a per-endpoint attribute. 
Configuring this via the service config +will effectively set that per-endpoint attribute before passing the list +of endpoints into the LB policy, but the attribute can also be set or +modified by the LB policy. + +As indicated in the comment above, the channel will enforce a maximum +limit for the max_connections_per_subchannel attribute. This limit +will be 10 by default, but gRPC will provide a channel-level setting to +allow a client application to raise or lower that limit. Whenever the +max_connections_per_subchannel attribute is larger than the channel's +limit, it will be capped to that limit. This capping will be performed +in the subchannel itself, so that it will apply regardless of where the +attribute is set. + +The max_connections_per_subchannel attribute can change with each resolver +update, regardless of whether it is set via the service config or via an +LB policy. When this happens, we do not want to throw away the subchannel +and create a new one, since that would cause unnecessary connection churn. +This means that the max_connections_per_subchannel attribute must not +be considered part of the subchannel's unique identity that is set only +at subchannel creation time; instead, it must be an attribute that can +be changed over the life of a subchannel. The approach for this will be +different in C-core than in Java and Go. + +If the max_connections_per_subchannel attribute is unset, the subchannel +will assume a default of 1, which effectively means the same behavior +as before this gRFC. + +#### C-core + +In C-core, every time there is a resolver update, the LB policy +calls `CreateSubchannel()` for every address in the new address list. +The `CreateSubchannel()` call returns a subchannel wrapper that holds +a ref to the underlying subchannel. The channel uses a subchannel +pool to store the set of currently existing subchannels: the requested +subchannel is created only if it doesn't already exist in the pool; +otherwise, the returned subchannel wrapper will hold a new ref to the +existing subchannel, so that it doesn't actually wind up creating a new +subchannel (only a new subchannel wrapper). This means that we do not +want the max_connections_per_subchannel attribute to be part of the +subchannel's key in the subchannel pool, or else we will wind up +recreating the subchannel whenever the attribute's value changes. + +In addition, by default, C-core's subchannel pool is shared between +channels, meaning that if two channels attempt to create the same +subchannel, they will wind up sharing a single subchannel. In this +case, each channel using a given subchannel may have a different value +for the max_connections_per_subchannel attribute. The subchannel will +use the maximum value set for this attribute across all channels. + +To support this, the implementation will be as follows: +- The subchannel will store a map from max_connections_per_subchannel + value to the number of subchannel wrappers currently holding a ref to + it with that value. Entries are added to the map whenever the first + ref for a given value of max_connections_per_subchannel is taken, and + entries are removed from the map whenever the last ref for a given + value of max_connections_per_subchannel is removed. Whenever the set + of entries changes, the subchannel will do a pass over the map to find + the new max value to use. +- We will create a new channel arg called + `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL`, which will be treated + specially by the channel. 
Specifically, when this attribute is passed + to `CreateSubchannel()`, it will be excluded from the channel args + that are used as a key in the subchannel pool. Instead, when the + subchannel wrapper is instantiated, it will call a new API on the + underlying subchannel to tell it that a new ref is being held for this + value of max_connections_per_subchannel. When the subchannel wrapper + is orphaned, it will call a new API on the underlying subchannel + to tell it that the ref is going away for a particular value of + max_connections_per_subchannel. + +#### Java and Go + +In Java and Go, there is no subchannel pool, and LB policies will not +call `CreateSubchannel()` for any address for which they already have a +subchannel from the previous address list. There is also no need to +deal with the case of multiple channels sharing the same subchannel. +Therefore, a different approach is called for. However, it seems +desirable to design this API such that it leaves open the possibility of +Java and Go introducing a subchannel pool at some point in the future. + +TODO: flesh this out, maybe use some sort of injector approach? + +Note that this approach assumes that Java and Go have switched to a +model where there is only one address per subchannel, as per [A61]. + +#### xDS Configuration + +In xDS, the max_connections_per_subchannel value will be configured via +a per-host circuit breaker in the CDS resource. This uses a similar +structure to the circuit breaker described in [A32]. + +In the CDS resource, in the +[circuit_breakers](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/cluster.proto#L885) +field, we will now add support for the following field: +- [per_host_thresholds](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L120): + As in [A32], gRPC will look only at the first entry for priority + [DEFAULT](https://github.com/envoyproxy/envoy/blob/6ab1e7afbfda48911e187c9d653a46b8bca98166/api/envoy/config/core/v3/base.proto#L39). + If that entry is found, then within that entry: + - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): + If this field is set, then its value will be used to set the + max_connections_per_subchannel attribute for all endpoints for that + xDS cluster. If it is unset, then the + max_connections_per_subchannel attribute will remain unset. A value + of 0 will be rejected at resource validation time. + +A new field will be added to the parsed CDS resource representation +containing the value of this field. ### Transport Reporting Current MAX_CONCURRENT_STREAMS -- transport needs to report this to subchannel -- in C-core, maybe revamp the connectivity state API between transport - and subchannel? was thinking about doing this anyway for subchannel - metrics disconnection reason -- but need to figure out a plan for - direct channels +In order for the subchannel to know when to create a new connection, the +transport will need to report the current value of the peer's +MAX_CONCURRENT_STREAMS setting up to the subchannel. + +TODO: in C-core, maybe revamp the connectivity state API between transport +and subchannel? 
was thinking about doing this anyway for subchannel +metrics disconnection reason -- but need to figure out a plan for +direct channels + +### Subchannel Behavior + +The connection scaling functionality in the subchannel will be used if +the max_connections_per_subchannel attribute is greater than 1. + +If the value is 1 (or unset), then implementations must not impose any +additional per-RPC overhead at this layer beyond what already exists +today. In other words, the connection scaling feature must not affect +performance unless it is actually enabled. + +#### Connection Management + +When max_connections_per_subchannel is greater than 1, the subchannel will +contain up to that number of established connections. The connections +will be stored in a list ordered by the time at which the connection was +established, so that the oldest connections are at the start of the list. + +Each connection will store the peer's MAX_CONCURRENT_STREAMS setting +reported by the transport. It will also track how many RPCs are +currently in flight on the connection. + +The subchannel will also track one in-flight connection attempt, which +will be unset if no connection attempt is currently in flight. + +A new connection attempt will be started when all of the +following are true: +- One or more RPCs are queued in the subchannel, waiting for a + connection to be sent on. +- No existing connection in the subchannel has any available streams -- + i.e., the number of RPCs currently in flight on each connection is + greater than or equal to the connection's MAX_CONCURRENT_STREAMS. +- The number of existing connections in the subchannel is fewer than the + max_connections_per_subchannel value. +- There is no connection attempt currently in flight on the subchannel. + +The subchannel will never close a connection once it has been established. +However, when a connection is closed for any reason, it is removed from +the subchannel. If the application wishes to garbage collect unused +connections, it should configure MAX_CONNECTION_IDLE on the server side, +as described in [A9]. + +Some examples of cases that can trigger a new connection attempt to be +started: +- A new RPC is started on the subchannel and all existing connections + are already at their MAX_CONCURRENT_STREAMS limit. +- RPCs were already queued on the subchannel and a connection was lost + or a connection attempt fails. +- RPCs were already queued on the subchannel and a new connection was just + created that did not provide enough available streams for all pending RPCs. +- The value of the max_connections_per_subchannel attribute increases. + +#### Backoff Behavior and Connectivity State -### Connection Scaling Within the Subchannel +The subchannel must have no more than one connection +attempt in progress at any given time. The [backoff +state](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md) +will be used for all connection attempts in the subchannel, regardless +of how many established connections there are. If a connection attempt +fails, the backoff period must be respected and scale accordingly before +starting the next attempt on any connection. When a connection attempt +succeeds, backoff state will be reset, just as it is today. -- conditions for scaling up -- connections are removed when they terminate (depends on A9 configured - on server side) -- need to document connectivity state behavior (and update client - channel spec!) 
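The scale-up conditions and the single-attempt/backoff gating above can
be condensed into a short sketch, written in the same style as the
fuller pseudo-code later in this document (illustrative, not normative;
the method name is an assumption, while the field names follow the
pseudo-code below):

```
# Sketch of the scale-up gating described above. Note that a non-empty
# queue implies that no existing connection had an available stream,
# since RPCs are queued only when no stream is available.
def MaybeStartConnectionAttempt(self):
  if len(self.queue) == 0:
    return  # no RPCs waiting for a connection
  if len(self.connections) >= self.max_connections_per_subchannel:
    return  # already at the configured connection limit
  if self.connection_attempt is not None:
    return  # at most one attempt in flight per subchannel
  if self.pending_backoff_timer is not None:
    return  # still in backoff from the last failed attempt
  self.StartConnectionAttempt()
```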
+For example, if the subchannel has two existing connections and a
+pending connection attempt for a third connection, and one of the
+original connections fails, the subchannel may not start a new
+connection attempt immediately, because it already has a connection
+attempt in progress. Instead, it must wait for the in-flight connection
+attempt to finish. If that attempt fails, then backoff must be performed
+before starting the next connection attempt. But if that attempt
+succeeds, backoff state will be reset, so if there are still enough
+queued RPCs to warrant a second connection, then the subchannel may
+immediately start another connection attempt.
 
-### Picking a Connection for Each RPC
+The connectivity state of the subchannel should be determined as follows
+(first match wins):
+- If there is at least one connection established, report READY.
+- If there is at least one connection attempt in flight, report
+  CONNECTING.
+- If the subchannel is in backoff after a failed connection attempt,
+  report TRANSIENT_FAILURE.
+- Otherwise, report IDLE.
 
-- algorithm for picking a connection for each RPC
-  - show picker pseudo-code
-- RPC queueing in the subchannel
-- describe synchronization
+Note that subchannels may exhibit two new state transitions that have not
+previously been possible. Today, the only possible transition from state
+READY is to state IDLE, but with this design, the following additional
+transitions are possible:
+- If the subchannel has an existing connection and has a connection
+  attempt in flight for a second connection, and then the first
+  connection fails before the in-flight connection attempt completes,
+  then the subchannel will transition from READY to CONNECTING.
+- If the subchannel has an existing connection but is in backoff after
+  failing to establish a second connection attempt, and then the original
+  connection fails, the subchannel will transition from READY directly
+  to TRANSIENT_FAILURE.
+
+Implementations should ensure that LB policies can handle these state
+transitions.
+
+TODO: What happens if there are RPCs queued in the subchannel when the
+last connection fails? then we report non-READY to the LB policy, but
+we need to deal with the queued RPCs somehow...
+
+TODO: update client channel spec with this info
+
+#### Picking a Connection for Each RPC
+
+When an RPC is started and a subchannel is picked for that RPC, the
+subchannel will find the first available connection for the RPC, in
+order of connection creation.
+
+The subchannel must ensure that races while dispatching RPCs to a
+connection do not lead to one or more RPCs being queued in the
+connection despite available stream quota elsewhere. For example,
+if two RPCs are initiated at the same time and one stream is available
+in a connection, both RPCs must not choose the same connection, or else
+one will queue. This race can be avoided with locks or atomics (see the
+sketch below).
+
+One race that may lead to RPCs being queued in a connection is if the
+MAX_CONCURRENT_STREAMS setting of a connection is lowered by the server
+after RPCs are dispatched to the connection. This race can be avoided
+if the connection is modified to not queue RPCs but instead report the
+scenario back to the subchannel, or to coordinate the SETTINGS frame ACK
+with the subchannel. Such changes are out of scope for this design,
+but may be considered in the future. For the purposes of this design,
+it is acceptable to queue RPCs on a connection due to this race, which
+is expected to be rare. 
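One way to implement the race-free dispatch described above is to make
the per-connection stream-slot reservation a single atomic
check-and-increment. The sketch below uses assumed names (the field
names follow the pseudo-code later in this gRFC) and a lock; an atomic
compare-and-swap on `rpcs_in_flight` would work equally well.

```python
import threading

class Connection:
    # Illustrative per-connection bookkeeping; not an existing gRPC API.
    def __init__(self, max_concurrent_streams):
        self.max_concurrent_streams = max_concurrent_streams
        self.rpcs_in_flight = 0
        self._lock = threading.Lock()

    # Atomically claims a stream slot, so that two RPCs racing for the
    # last available stream cannot both succeed.
    def TryReserveStream(self):
        with self._lock:
            if self.rpcs_in_flight >= self.max_concurrent_streams:
                return False
            self.rpcs_in_flight += 1
            return True

    # Releases the slot when an RPC completes.
    def ReleaseStream(self):
        with self._lock:
            self.rpcs_in_flight -= 1
```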
+ +If no connection is available for an RPC, the RPC must be queued in the +subchannel until a connection is available for it. This queue must be +roughly fair: RPCs must be dispatched in the order in which they are +received into the queue, acknowledging that timing between threads may +lead to concurrent RPCs being added to the queue in an arbitrary order. +See [Rationale](#rationale) below for a possible adjustment to this +queuing strategy. + +When a connection attempt fails and the subchannel is in backoff, all +RPCs (both those already queued and any new RPC that is started on the +subchannel after that) will be failed with UNAVAILABLE status. These +RPCs are be eligible for transparent retries (see [A6]), because no wire +traffic was produced for them. + +When an RPC completes, if there are queued RPCs in the subchannel, the +subchannel should check to see if the connection that that RPC was sent +on is below its MAX_CONCURRENT_STREAMS limit. If so, it should start +the next queued RPC on that connection. Note that it's possible that +the connection is not actually below its MAX_CONCURRENT_STREAMS limit, +because the peer may have lowered the MAX_CONCURRENT_STREAMS limit after +that RPC was started. + +#### Subchannel Pseudo-Code + +The following pseudo-code illustrates the expected functionality in the +subchannel: + +``` +# Starts an RPC on the subchannel. +def StartRpc(self, rpc): + # Use the oldest connection that can accept a new stream, if any. + for connection in self.connections: + if connection.rpcs_in_flight < connection.max_concurrent_streams: + connection.rpcs_in_flight += 1 + connection.StartRpc(rpc) + return + # If we aren't yet at the max number of connections, see if we can + # create a new one. + if len(self.connections) < self.max_connections_per_subchannel: + # If we're in backoff delay, fail the RPC. + if self.pending_backoff_timer is not None: + rpc.Fail(UNAVAILABLE) + return + # If there is no connection attempt in flight, start one. + if self.connection_attempt is None: + self.StartConnectionAttempt() + # Queue the RPC until we have a connection to send it on. + self.queue.append(rpc) + +# Retries RPCs from the queue, in order. +def RetryRpcsFromQueue(self): + queue = self.queue + self.queue = [] + for rpc in queue: + self.StartRpc(rpc) + +# Starts a new connection attempt. +def StartConnectionAttempt(self): + self.backoff_state = BackoffState() + self.connection_attempt = ConnectionAttempt() + +# Called when a connection attempt succeeds. +def OnConnectionAttemptSucceeded(self, new_connection): + self.connections.append(new_connection) + self.pending_backoff_timer = None + self.backoff_state = None + self.RetryRpcsFromQueue() + +# Called when a connection attempt fails. This puts us in backoff. +def OnConnectionAttemptFailed(self): + self.connection_attempt = None + self.pending_backoff_timer = Timer(self.backoff_state.NextBackoffDelay()) + self.RetryRpcsFromQueue() + +# Called when the backoff timer fires. Will trigger a new connection +# attempt if there are RPCs in the queue. +def OnBackoffTimer(self): + self.pending_backoff_timer = None + self.RetryRpcsFromQueue() + +# Called when an established connection fails. Will trigger a new +# connection attempt if there are RPCs in the queue. +def OnConnectionFailed(self, failed_connection): + self.connections.remove(failed_connection) + self.RetryRpcsFromQueue() + +# Called when a connection reports a new MAX_CONCURRENT_STREAMS value. 
+# May send RPCs on the connection if there are any queued and the +# MAX_CONCURRENT_STREAMS value has increased. +def OnConnectionReportsNewMaxConcurrentStreams( + self, connection, max_concurrent_streams): + connection.max_concurrent_streams = max_concurrent_streams + self.RetryRpcsFromQueue() + +# Called when an RPC completes on one of the subchannel's connections. +def OnRpcComplete(self): + self.RetryRpcsFromQueue() + +# Called when the max_connections_per_subchannel value changes. +def OnMaxConnectionsPerSubchannelChanged(self, max_connections_per_subchannel): + self.max_connections_per_subchannel = max_connections_per_subchannel + self.RetryRpcsFromQueue() +``` + +### Metrics + +TODO: define metrics ### Temporary environment variable protection @@ -118,18 +473,70 @@ initially be guarded via the environment variable `GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. The feature will be enabled by default once it has passed interop tests. -- TODO: define interop tests? +TODO: define interop tests? ## Rationale -[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] +We considered several different approaches as part of this design process. -- TODO: document reasons why not doing this in LB policy tree +We considered putting the connection scaling functionality in an LB +policy instead of in the subchannel, but we rejected that approach for +several reasons: +- The LB policy API is designed around the idea that returning a new + picker will cause queued RPCs to be replayed, which works fine when an + updated picker is likely to allow most of the queued RPCs to proceed. + However, in a case where the workload is hitting the + MAX_CONCURRENT_STREAMS limit across all connections, we would wind up + updating the picker as each RPC finishes, which would allow only one + of the queued RPCs to continue, thus exhibiting O(n^2) behavior. We + concluded that the LB policy API is simply not well-suited to solve + this particular problem. +- There were some concerns about possibly exposing the + MAX_CONCURRENT_STREAMS setting from the transport to the LB policy + API. However, this particular concern could have been ameliorated + by initially using an internal or experimental API. +- Putting this functionality in the LB policy would have added + complexity due to subchannels being shared between channels in C-core, + where each channel can see only the RPCs that that channel is sending + to the subchannel. -## Implementation +Given that we decided to add this new functionality in the subchannel, +we intentionally elected to keep it simple, at least to start with. +We considered a design incorporating a more sophisticated -- or pluggable +-- connection selection method but rejected it due to the complexity +of such a mechanism and the desire to avoid creating a duplicate load +balancing ecosystem within the subchannel. -Will be implemented in C-core, Java, and Go. +Potential future improvements that we could consider: +- Additional configuration knobs: + - min_connections_per_subchannel to force clients to create a minimum + number of connections, even if they are not necessary. + - min_available_streams_per_subchannel to allow clients to create new + connections before the hard MAX_CONCURRENT_STREAMS setting is reached. +- Channel arg to limit streams per connection lower than + MAX_CONCURRENT_STREAMS. 
+- Instead of queuing RPCs in the subchannel, it may be possible to improve + aggregate performance by failing the RPC, resulting in transparent retry + and a re-pick. Without other systems in place, this would lead to a + busy-loop if the same subchannel is picked repeatedly, so this is not + included in this design. + +This design does not handle connection affinity; there is no way to +ensure related RPCs end up on the same connection without setting +max_connections_per_subchannel to 1. For use cases where connection +affinity is important, multiple channels will continue to be necessary +for now. -## Open issues (if applicable) +This design also does not attempt to maximize throughput of connections, +which would be a far more complex problem. To maximize throughput +effectively, more information about the nature of RPCs would need to +be exposed; e.g., how much bandwidth they may require and how long they +might be expected to last. And unlike this connection scaling design, +that functionality might actually want to be implemented in the LB policy +layer, since it would be totally compatible with the LB policy API. +The goal of *this* design is to simply overcome the stream limits on +connections, hence the simple and greedy connection selection mechanism. -[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.] +## Implementation + +Will be implemented in C-core, Java, and Go. From 23e6f33c9c47239a0fdd41a2db769471b647510e Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Wed, 15 Oct 2025 00:04:58 +0000 Subject: [PATCH 04/33] fail RPCs when no connections left, add PF behavior, etc --- ...x_concurrent_streams-connection-scaling.md | 118 ++++++++++++------ 1 file changed, 82 insertions(+), 36 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 3c204f071..be920d08c 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-13 +* Last updated: 2025-10-14 * Discussion at: (filled after thread exists) ## Abstract @@ -79,6 +79,8 @@ There are several parts to this proposal: - Functionality to pick a connection for each RPC in the subchannel, including the ability to queue RPCs while waiting for new connections to be established. +- Modify pick_first to handle the subchannel transitioning from READY to + CONNECTING state. ### Configuration @@ -246,7 +248,7 @@ performance unless it is actually enabled. When max_connections_per_subchannel is greater than 1, the subchannel will contain up to that number of established connections. The connections will be stored in a list ordered by the time at which the connection was -established, so that the oldest connections are at the start of the list. +established, so that the oldest connection is at the start of the list. Each connection will store the peer's MAX_CONCURRENT_STREAMS setting reported by the transport. It will also track how many RPCs are @@ -280,7 +282,9 @@ started: or a connection attempt fails. - RPCs were already queued on the subchannel and a new connection was just created that did not provide enough available streams for all pending RPCs. -- The value of the max_connections_per_subchannel attribute increases. 
+- The value of the max_connections_per_subchannel attribute increases, + all existing connections are already at their MAX_CONCURRENT_STREAMS + limit, and there are queued RPCs. #### Backoff Behavior and Connectivity State @@ -326,12 +330,8 @@ transitions are possible: connection fails, the subchannel will transition from READY directly to TRANSIENT_FAILURE. -Implementations should ensure that LB policies can handle these state -transitions. - -TODO: What happens if there are RPCs queued in the subchannel when the -last connection fails? then we report non-READY to the LB policy, but -we need to deal with the queued RPCs somehow... +See [PickFirst Changes](#pickfirst-changes) below for details on how +pick_first will handle these new transitions. TODO: update client channel spec with this info @@ -358,27 +358,45 @@ but may be considered in the future. For the purposes of this design, it is acceptable to queue RPCs on a connection due to this race, which is expected to be rare. -If no connection is available for an RPC, the RPC must be queued in the -subchannel until a connection is available for it. This queue must be -roughly fair: RPCs must be dispatched in the order in which they are -received into the queue, acknowledging that timing between threads may -lead to concurrent RPCs being added to the queue in an arbitrary order. +When choosing a connection for an RPC within a subchannel, the following +algorithm will be used (first match wins): +1. If the subchannel has no working connections, then the RPC will be + failed with status UNAVAILABLE. +2. Look through all working connections in order from the oldest to the + newest. For each connection, if the number of RPCs in flight on the + connection is lower than the peer's MAX_CONCURRENT_STREAMS setting, + then the RPC will be dispatched on that connection. +3. If the number of existing connections is equal to the + max_connections_per_subchannel value, the RPC will be queued. +4. If the number of existing connections is less than the + max_connections_per_subchannel value and the subchannel is in backoff + delay due to the last connection attempt failing, the RPC will be + failed with status UNAVAILABLE. +5. If the number of existing connections is less than the + max_connections_per_subchannel value and no connection attempt is + currently in flight, a new connection attempt will be started, and + the RPC will be queued. + +When queueing an RPC, the queue must be roughly fair: RPCs must +be dispatched in the order in which they are received into the +queue, acknowledging that timing between threads may lead to +concurrent RPCs being added to the queue in an arbitrary order. See [Rationale](#rationale) below for a possible adjustment to this queuing strategy. -When a connection attempt fails and the subchannel is in backoff, all -RPCs (both those already queued and any new RPC that is started on the -subchannel after that) will be failed with UNAVAILABLE status. These -RPCs are be eligible for transparent retries (see [A6]), because no wire -traffic was produced for them. +When retrying a queued RPC, the subchannel will use the same algorithm +described above that it will when it first sees the RPC. RPCs will be +drained from the queue upon the following events: +- When a connection attempt completes, whether successfully or not. +- When the backoff timer fires. +- When an existing connection fails. +- When the transport for a connection reports a new value for + MAX_CONCURRENT_STREAMS. 
+- When an RPC dispatched on one of the connections completes.
+
+When failing an RPC due to all connections failing or due to being in
+backoff, note that the RPC will be eligible for transparent retries (see
+[A6]), because no wire traffic was produced for it.
 
 #### Subchannel Pseudo-Code
 
 The following pseudo-code illustrates the expected functionality in the
 subchannel:
 
 ```
-# Starts an RPC on the subchannel.
-def StartRpc(self, rpc):
+# Returns True if the RPC has been handled, False if it needs to be queued.
+def MaybeDispatchRpc(self, rpc):
+  # If there are no connections, fail the RPC.
+  # Note that the RPC will be eligible for transparent retries.
+  if len(self.connections) == 0:
+    rpc.Fail(UNAVAILABLE)
+    return True
   # Use the oldest connection that can accept a new stream, if any.
   for connection in self.connections:
     if connection.rpcs_in_flight < connection.max_concurrent_streams:
       connection.rpcs_in_flight += 1
       connection.StartRpc(rpc)
-      return
+      return True
   # If we aren't yet at the max number of connections, see if we can
   # create a new one.
   if len(self.connections) < self.max_connections_per_subchannel:
     # If we're in backoff delay, fail the RPC.
+    # Note that the RPC will be eligible for transparent retries.
     if self.pending_backoff_timer is not None:
       rpc.Fail(UNAVAILABLE)
-      return
+      return True
     # If there is no connection attempt in flight, start one.
     if self.connection_attempt is None:
       self.StartConnectionAttempt()
-  # Queue the RPC until we have a connection to send it on.
-  self.queue.append(rpc)
+  # RPC not handled -- needs to be queued.
+  return False
+
+# Starts an RPC on the subchannel.
+def StartRpc(self, rpc):
+  if not self.MaybeDispatchRpc(rpc):
+    self.queue.append(rpc)
 
 # Retries RPCs from the queue, in order.
 def RetryRpcsFromQueue(self):
-  queue = self.queue
-  self.queue = []
-  for rpc in queue:
-    self.StartRpc(rpc)
+  while len(self.queue) > 0:
+    if not self.MaybeDispatchRpc(self.queue[0]):
+      break
+    self.queue.pop(0)
 
 # Starts a new connection attempt.
 def StartConnectionAttempt(self):
@@ -462,6 +491,23 @@ def OnMaxConnectionsPerSubchannelChanged(self, max_connections_per_subchannel):
   self.RetryRpcsFromQueue()
 ```
 
+### PickFirst Changes
+
+As mentioned above, the pick_first LB policy will need to handle two new
+state transitions from subchannels. Previously, a subchannel in READY
+state could transition only to IDLE state; now it will be possible for
+the subchannel to instead transition to CONNECTING or TRANSIENT_FAILURE
+states.
+
+If the selected (READY) subchannel transitions to CONNECTING state,
+then pick_first will go back into CONNECTING state. It will start the
+happy eyeballs pass across all subchannels, as described in [A61].
+
+If the selected (READY) subchannel transitions to TRANSIENT_FAILURE
+state, then pick_first will...
+
+TODO: figure out TF behavior
+
 ### Metrics
 
 TODO: define metrics

From 770aedbd3350c8ba87b6087097e1e81ca16f6201 Mon Sep 17 00:00:00 2001
From: "Mark D. 
Roth" Date: Thu, 16 Oct 2025 00:42:05 +0000 Subject: [PATCH 05/33] fix pseudo-code --- A105-max_concurrent_streams-connection-scaling.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index be920d08c..d630ccfd1 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-14 +* Last updated: 2025-10-15 * Discussion at: (filled after thread exists) ## Abstract @@ -482,7 +482,8 @@ def OnConnectionReportsNewMaxConcurrentStreams( self.RetryRpcsFromQueue() # Called when an RPC completes on one of the subchannel's connections. -def OnRpcComplete(self): +def OnRpcComplete(self, connection): + connection.rpcs_in_flight -= 1 self.RetryRpcsFromQueue() # Called when the max_connections_per_subchannel value changes. From 946a8cf24276edd56079dd30ebbdb880ebdd2519 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 20 Oct 2025 19:30:12 +0000 Subject: [PATCH 06/33] - don't fail RPCs while in backoff - PF goes into CONNECTING when subchannel goes from READY to CONNECTING or TF - started section on interactions between channel and subchannel (WIP) --- ...x_concurrent_streams-connection-scaling.md | 65 ++++++++++++------- 1 file changed, 41 insertions(+), 24 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index d630ccfd1..c6618aa1f 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-15 +* Last updated: 2025-10-20 * Discussion at: (filled after thread exists) ## Abstract @@ -371,7 +371,7 @@ algorithm will be used (first match wins): 4. If the number of existing connections is less than the max_connections_per_subchannel value and the subchannel is in backoff delay due to the last connection attempt failing, the RPC will be - failed with status UNAVAILABLE. + queued. 5. If the number of existing connections is less than the max_connections_per_subchannel value and no connection attempt is currently in flight, a new connection attempt will be started, and @@ -394,9 +394,33 @@ drained from the queue upon the following events: MAX_CONCURRENT_STREAMS. - When an RPC dispatched on one of the connections completes. -When failing an RPC due to all connections failing or due to being in -backoff, note that the RPC will be eligible for transparent retries (see -[A6]), because no wire traffic was produced for it. +When failing an RPC due to the subchannel not having any established +connections, note that the RPC will be eligible for transparent retries +(see [A6]), because no wire traffic was produced for it. + +#### Interaction Between Channel and Subchannel + +Today, when the channel does an LB pick and gets back a subchannel, +it calls a method on that subchannel to get its underlying connection. +There are only two possible results: + +1. The subchannel returns a ref to the underlying connection. +2. The subchannel returns null to indicate that it no longer has a working + connection. 
This case can happen due to a race between the LB policy
+   picking the subchannel and the transport seeing a GOAWAY or disconnection;
+   when that occurs, the channel will queue the RPC (i.e., it is treated
+   the same as if the picker indicated to queue the RPC) in the expectation
+   that the LB policy will soon see the subchannel report the disconnection
+   and will return a new picker, at which point a new pick will be done for
+   the queued RPC.
+
+With this design, the API used for this interaction between the channel
+and the subchannel will need to change.
+
+TODO: figure out details, then revise pseudo-code below
+
+TODO: interaction with circuit breaking -- i.e., when we do call the
+call tracker to indicate that the RPC has started?
 
 #### Subchannel Pseudo-Code
 
@@ -419,16 +443,11 @@ def MaybeDispatchRpc(self, rpc):
     return True
   # If we aren't yet at the max number of connections, see if we can
   # create a new one.
-  if len(self.connections) < self.max_connections_per_subchannel:
-    # If we're in backoff delay, fail the RPC.
-    # Note that the RPC will be eligible for transparent retries.
-    if self.pending_backoff_timer is not None:
-      rpc.Fail(UNAVAILABLE)
-      return True
-    # If there is no connection attempt in flight, start one.
-    if self.connection_attempt is None:
-      self.StartConnectionAttempt()
-    # RPC not handled -- needs to be queued.
+  if (len(self.connections) < self.max_connections_per_subchannel and
+      self.connection_attempt is None and
+      self.pending_backoff_timer is None):
+    self.StartConnectionAttempt()
+  # Didn't find a connection for the RPC, so queue it.
   return False
 
 # Starts an RPC on the subchannel.
@@ -439,13 +458,15 @@ def StartRpc(self, rpc):
 # Retries RPCs from the queue, in order.
 def RetryRpcsFromQueue(self):
   while len(self.queue) > 0:
+    # Stop at the first RPC that gets queued.
     if not self.MaybeDispatchRpc(self.queue[0]):
       break
     self.queue.pop(0)
 
 # Starts a new connection attempt.
 def StartConnectionAttempt(self):
-  self.backoff_state = BackoffState()
+  if self.backoff_state is None:
+    self.backoff_state = BackoffState()
   self.connection_attempt = ConnectionAttempt()
 
 # Called when a connection attempt succeeds.
@@ -500,14 +521,10 @@ state could transition only to IDLE state; now it will be possible for
 the subchannel to instead transition to CONNECTING or TRANSIENT_FAILURE
 states.
 
-If the selected (READY) subchannel transitions to CONNECTING state,
-then pick_first will go back into CONNECTING state. It will start the
-happy eyeballs pass across all subchannels, as described in [A61].
-
-If the selected (READY) subchannel transitions to TRANSIENT_FAILURE
-state, then pick_first will...
-
-TODO: figure out TF behavior
+If the selected (READY) subchannel transitions to CONNECTING or
+TRANSIENT_FAILURE state, then pick_first will go back into CONNECTING
+state. It will start the happy eyeballs pass across all subchannels,
+as described in [A61].
 
 ### Metrics

From 9c8c5cec78f938e6230b834ab1bf3afa40df3e76 Mon Sep 17 00:00:00 2001
From: "Mark D. 
Roth" Date: Mon, 20 Oct 2025 22:51:36 +0000 Subject: [PATCH 07/33] queue inside subchannel, and document circuit breaker behavior --- ...x_concurrent_streams-connection-scaling.md | 35 +++++++++++++++---- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index c6618aa1f..f7e86fe86 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -415,12 +415,35 @@ There are only two possible results: the queued RPC. With this design, the API used for this interaction between the channel -and the subchannel will need to change. - -TODO: figure out details, then revise pseudo-code below - -TODO: interaction with circuit breaking -- i.e., when we do call the -call tracker to indicate that the RPC has started? +and the subchannel will need to change. Instead of the channel getting +a connection from the subchannel, the channel will simply send the RPC +to the subchannel, and the subchannel will be responsible for picking a +connection or queuing the RPC if there is no connection immediately +available. + +One consequence of this is that when we hit the race condition between +the LB policy picking the subchannel and the transport seeing a GOWAWAY +or disconnection, we will now rely on transparent retries (see [A6]) +instead of just having the channel re-queue the LB pick to be tried +again the next time the LB policy updates the picker. + +#### Interaction with xDS Circuit Breaking + +gRPC currently supports xDS circuit breaking as described in [A32]. +Specifically, we support configuring the max number of RPCs in flight +to each cluster. This is done by having the xds_cluster_impl LB policy +increment an atomic counter that tracks the number of RPCs currently in +flight to the cluster, which is later decremented when the RPC ends. + +Some gRPC implementations don't actually increment the counter until +after a connection is chosen for the RPC, so that they won't erroneously +increment the counter when the picker returns a subchannel for which no +connection is available (the race condition described above). However, +now that there could be a longer delay between when the picker returns +and when a connection is chosen for the RPC, implementations will need +to increment the counter in the picker. It will then be necessary to +decrement the counter when the RPC finishes, regardless of whether it +was actually sent out or whether the subchannel failed it. #### Subchannel Pseudo-Code From 56147707158160a3bf3f96bc943bf36cd867ef0f Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Thu, 23 Oct 2025 00:16:51 +0000 Subject: [PATCH 08/33] channel arg is internal-only --- A105-max_concurrent_streams-connection-scaling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index f7e86fe86..562d0ab2e 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -172,7 +172,7 @@ To support this, the implementation will be as follows: value of max_connections_per_subchannel is removed. Whenever the set of entries changes, the subchannel will do a pass over the map to find the new max value to use. 
-- We will create a new channel arg called +- We will create a new internal-only channel arg called `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL`, which will be treated specially by the channel. Specifically, when this attribute is passed to `CreateSubchannel()`, it will be excluded from the channel args From b567c11dac32ae5d09065870332babbc3616c1d3 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 24 Oct 2025 23:02:52 +0000 Subject: [PATCH 09/33] fill in java/go details --- ...x_concurrent_streams-connection-scaling.md | 69 ++++++++++++------- 1 file changed, 45 insertions(+), 24 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 562d0ab2e..a23b6e6c7 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-20 +* Last updated: 2025-10-24 * Discussion at: (filled after thread exists) ## Abstract @@ -112,35 +112,34 @@ message ServiceConfig { ``` In the subchannel, connection scaling will be configured via a parameter -called max_connections_per_subchannel, which be passed into the subchannel -via a per-endpoint attribute. Configuring this via the service config -will effectively set that per-endpoint attribute before passing the list -of endpoints into the LB policy, but the attribute can also be set or -modified by the LB policy. - -As indicated in the comment above, the channel will enforce a maximum -limit for the max_connections_per_subchannel attribute. This limit -will be 10 by default, but gRPC will provide a channel-level setting to -allow a client application to raise or lower that limit. Whenever the -max_connections_per_subchannel attribute is larger than the channel's -limit, it will be capped to that limit. This capping will be performed -in the subchannel itself, so that it will apply regardless of where the -attribute is set. +called max_connections_per_subchannel. That parameter will be set +either via the service config or via a per-endpoint attribute in the LB +policy tree. The approach for plumbing this parameter into the +subchannel will be different in C-core than in Java and Go; see below +for details. The max_connections_per_subchannel attribute can change with each resolver update, regardless of whether it is set via the service config or via an LB policy. When this happens, we do not want to throw away the subchannel and create a new one, since that would cause unnecessary connection churn. -This means that the max_connections_per_subchannel attribute must not +This means that the max_connections_per_subchannel parameter must not be considered part of the subchannel's unique identity that is set only -at subchannel creation time; instead, it must be an attribute that can -be changed over the life of a subchannel. The approach for this will be -different in C-core than in Java and Go. +at subchannel creation time; instead, it must be changeable over the life +of a subchannel. If the max_connections_per_subchannel attribute is unset, the subchannel will assume a default of 1, which effectively means the same behavior as before this gRFC. +As indicated in the comment above, the channel will enforce a maximum +limit for the max_connections_per_subchannel attribute. 
This limit +will be 10 by default, but gRPC will provide a channel-level setting to +allow a client application to raise or lower that limit. Whenever the +max_connections_per_subchannel attribute is larger than the channel's +limit, it will be capped to that limit. This capping will be performed +in the subchannel itself, so that it will apply regardless of where the +attribute is set. + #### C-core In C-core, every time there is a resolver update, the LB policy @@ -190,11 +189,33 @@ In Java and Go, there is no subchannel pool, and LB policies will not call `CreateSubchannel()` for any address for which they already have a subchannel from the previous address list. There is also no need to deal with the case of multiple channels sharing the same subchannel. -Therefore, a different approach is called for. However, it seems -desirable to design this API such that it leaves open the possibility of -Java and Go introducing a subchannel pool at some point in the future. - -TODO: flesh this out, maybe use some sort of injector approach? +Therefore, a different approach is called for. + +A notification object will be used to notify the subchannel of the value +of the max_connections_per_subchannel attribute. This object will be +passed into the subchannel at creation time via a resolver attribute. + +When a channel is constructed, it will create a single notification +object to be used for all subchannels for the lifetime of the channel. +This notification object will be added to the resolver attributes before +the resolver update is passed to the LB policy, so that it will propagate +through to the `CreateSubchannel()` call. Whenever the resolver returns +a service config that sets max_connections_per_subchannel, the channel will +tell the notification object to use that value, which will then be +communicated to the subchannels. + +For the case where an LB policy wants to override the value of +max_connections_per_subchannel, it can create its own notification +object and replace the channel's notification object with its own in the +resolver attributes. It can then tell the notification object what +value to use whenever it gets a config update. + +Because the notification object is passed to the subchannel only at +creation time, which means that the LB policy must decide whether +to inject its own notification object at that time. If an LB policy +initially wants to just use the default value from the channel, it would +need to proxy that value from the original notification object that was +passed down from the channel. Note that this approach assumes that Java and Go have switched to a model where there is only one address per subchannel, as per [A61]. From 9fd8768ae7eacc43ea09767051866ce924f08eab Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 24 Oct 2025 23:12:46 +0000 Subject: [PATCH 10/33] handle xDS case in xds_cluster_impl policy --- A105-max_concurrent_streams-connection-scaling.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index a23b6e6c7..0dc3ee063 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -50,6 +50,8 @@ channels it uses. 
* [A9: Server-side Connection Management][A9] * [A32: xDS Circuit Breaking][A32] * [A61: IPv4 and IPv6 Dualstack Backend Support][A61] +* [A74: xDS Config Tears][A74] +* [A75: xDS Aggregate Cluster Behavior Fixes][A75] * [A79: Non-per-call Metrics Architecture][A79] * [A94: OTel Metrics for Subchannels][A94] @@ -57,6 +59,8 @@ channels it uses. [A9]: A9-server-side-conn-mgt.md [A32]: A32-xds-circuit-breaking.md [A61]: A61-IPv4-IPv6-dualstack-backends.md +[A74]: A74-xds-config-tears.md +[A75]: A75-xds-aggregate-cluster-behavior-fixes.md [A79]: A79-non-per-call-metrics-architecture.md [A94]: A94-subchannel-otel-metrics.md [H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS @@ -243,6 +247,15 @@ field, we will now add support for the following field: A new field will be added to the parsed CDS resource representation containing the value of this field. +The xds_cluster_impl LB policy will be responsible for setting the +max_connections_per_subchannel attribute based on this xDS configuration. +Note that it makes sense to do this in the xds_cluster_impl LB policy +instead of the cds policy for two reasons: first, this is where circuit +breaking is already configured, and second, this policy is in the right +location in the LB policy tree regardless of whether [A75] has been +implemented yet. Note that post-[A74], this will not require adding +any new fields in the xds_cluster_impl LB policy configuration. ### Transport Reporting Current MAX_CONCURRENT_STREAMS In order for the subchannel to know when to create a new connection, the From 21f5a167d1303e195c54bbb9ddbec02babb239bb Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 24 Oct 2025 23:18:17 +0000 Subject: [PATCH 11/33] metrics --- A105-max_concurrent_streams-connection-scaling.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 0dc3ee063..288755d6f 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -52,7 +52,6 @@ channels it uses. * [A61: IPv4 and IPv6 Dualstack Backend Support][A61] * [A74: xDS Config Tears][A74] * [A75: xDS Aggregate Cluster Behavior Fixes][A75] -* [A79: Non-per-call Metrics Architecture][A79] * [A94: OTel Metrics for Subchannels][A94] [A6]: A6-client-retries.md [A9]: A9-server-side-conn-mgt.md [A32]: A32-xds-circuit-breaking.md [A61]: A61-IPv4-IPv6-dualstack-backends.md [A74]: A74-xds-config-tears.md [A75]: A75-xds-aggregate-cluster-behavior-fixes.md -[A79]: A79-non-per-call-metrics-architecture.md [A94]: A94-subchannel-otel-metrics.md [H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS @@ -585,7 +583,9 @@ as described in [A61]. ### Metrics -TODO: define metrics +No new metrics will be defined for this feature. However, +note that the additional connections will show up in the +`grpc.subchannel.open_connections` metric defined in [A94]. ### Temporary environment variable protection From e855a63f2e5b92857b87a1c7d9e08d1181d0c2fb Mon Sep 17 00:00:00 2001 From: "Mark D.
Roth" Date: Sat, 25 Oct 2025 00:15:41 +0000 Subject: [PATCH 12/33] flesh out interactions between channel and subchannel, and between subchannel and transport --- ...x_concurrent_streams-connection-scaling.md | 105 ++++++++++++------ 1 file changed, 70 insertions(+), 35 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 288755d6f..2fd9f11b6 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -184,6 +184,11 @@ To support this, the implementation will be as follows: is orphaned, it will call a new API on the underlying subchannel to tell it that the ref is going away for a particular value of max_connections_per_subchannel. +- If max_connections_per_subchannel is configured via the service + config, the `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL` channel arg will + be set by the channel before passing the resolver update to the LB + policy tree. Individual LB policies may override this channel arg + before `CreateSubchannel()` is called. #### Java and Go @@ -254,17 +259,6 @@ location in the LB policy tree regardless of whether [A75] has been implemented yet. Note that post-[A74], this will not require adding any new fields in the xds_cluster_impl LB policy configuration. -### Transport Reporting Current MAX_CONCURRENT_STREAMS - -In order for the subchannel to know when to create a new connection, the -transport will need to report the current value of the peer's -MAX_CONCURRENT_STREAMS setting up to the subchannel. - -TODO: in C-core, maybe revamp the connectivity state API between transport -and subchannel? was thinking about doing this anyway for subchannel -metrics disconnection reason -- but need to figure out a plan for -direct channels - ### Subchannel Behavior The connection scaling functionality in the subchannel will be used if @@ -283,8 +277,10 @@ will be stored in a list ordered by the time at which the connection was established, so that the oldest connection is at the start of the list. Each connection will store the peer's MAX_CONCURRENT_STREAMS setting -reported by the transport. It will also track how many RPCs are -currently in flight on the connection. +reported by the transport (see [Interaction Between Transport and +Subchannel](#interaction-between-transport-and-subchannel) below for +details). It will also track how many RPCs are currently in flight on +the connection. The subchannel will also track one in-flight connection attempt, which will be unset if no connection attempt is currently in flight. @@ -430,34 +426,73 @@ When failing an RPC due to the subchannel not having any established connections, note that the RPC will be eligible for transparent retries (see [A6]), because no wire traffic was produced for it. +### Interaction Between Transport and Subchannel + +In order for the subchannel to know when to create a new connection, the +transport will need to report the peer's MAX_CONCURRENT_STREAMS setting +to the subchannel. The transport will need to first report this value +immediately after receiving the initial SETTINGS frame from the peer, +and then it will need to send an update any time it receives a SETTINGS +frame from the peer that changes this particular setting. + +Implementations should consider providing some mechanism for flow +control for these updates, to prevent sending updates to the subchannel +faster than it can process them. 
One possible simple approach would be +for the transport to have only one in-flight update to the subchannel at +any given time, which could be implemented as follows: +- When sending an update to the subchannel, the transport will record + the MAX_CONCURRENT_STREAMS value that it is reporting to the + subchannel. It will include a callback with the update, which the + subchannel must invoke when it has finished processing the update. +- If the transport receives a new MAX_CONCURRENT_STREAMS value from the + peer before it receives the callback from the subchannel, it will + not send another update to the subchannel. +- When the transport receives the callback from the subchannel, it will + check whether the peer's current MAX_CONCURRENT_STREAMS value is + different from the last value it reported to the subchannel. If so, + it will start a new update. + +Note that this approach means that there may be a noticeable delay +between when the transport sees an update and when the subchannel sees +the update. If the update is an *increase* in MAX_CONCURRENT_STREAMS, +that won't cause any problems (the worst case is that the subchannel +may create a new connection that it might have been able to avoid). +If the update is a *decrease* in MAX_CONCURRENT_STREAMS, then the +subchannel may dispatch RPCs to the connection that will wind up being +queued in the transport, which is sub-optimal. However, this design +already has a race condition in that case (see [Picking a Connection +for Each RPC](#picking-a-connection-for-each-rpc) above for details), +so this is not an additional problem. + #### Interaction Between Channel and Subchannel Today, when the channel does an LB pick and gets back a subchannel, it calls a method on that subchannel to get its underlying connection. There are only two possible results: -1. The subchannel returns a ref to the underlying connection. -2. The subchannel returns null to indicate that it no longer has a working - connection. This case can happen due to a race between the LB policy - picking the subchannel and the transport seeing a GOAWAY or disconnection; - when that occurs, the channel will queue the RPC (i.e., it is treated - the same as if the picker indicated to queue the RPC) in the expectation - that the LB policy will soon see the subchannel report the disconnection - and will return a new picker, at which point a new pick will be done for - the queued RPC. - -With this design, the API used for this interaction between the channel -and the subchannel will need to change. Instead of the channel getting -a connection from the subchannel, the channel will simply send the RPC -to the subchannel, and the subchannel will be responsible for picking a -connection or queuing the RPC if there is no connection immediately -available. - -One consequence of this is that when we hit the race condition between -the LB policy picking the subchannel and the transport seeing a GOWAWAY -or disconnection, we will now rely on transparent retries (see [A6]) -instead of just having the channel re-queue the LB pick to be tried -again the next time the LB policy updates the picker. +1. The subchannel returns a ref to the underlying connection. In this + case, the channel starts a stream on the returned connection and + forwards the RPC to that stream. +2. The subchannel returns null to indicate that it no longer has a + working connection. This case can happen due to a race between the + LB policy picking the subchannel and the transport seeing a GOAWAY + or disconnection. 
When this occurs, the channel will queue the RPC + (i.e., it is treated the same as if the picker indicated to queue + the RPC) in the expectation that the LB policy will soon see the + subchannel report the disconnection and will return a new picker, + at which point a new pick will be done for the queued RPC. + +With this design, there is a third outcome possible: the subchannel may +not have a set of working connections, but they may all already be at +their MAX_CONCURRENT_STREAMS limits, so the RPC may need to be queued +until the subchannel has a connection that it can send it on. This case +can be handled by having the subchannel return a fake connection object +that queues the RPC in the subchannel. + +Note that the race condition described in case 2 above will now happen +only if the subchannel has no working connections. If there is at least +one working connection, then even if the RPC cannot be sent immediately, +the RPC will be queued in the subchannel instead. #### Interaction with xDS Circuit Breaking From d87a971347976127dd0b188eb7c675fc683ed549 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 25 Oct 2025 00:17:58 +0000 Subject: [PATCH 13/33] reorder a bit --- ...x_concurrent_streams-connection-scaling.md | 38 +++++++++---------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 2fd9f11b6..037687473 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -426,7 +426,7 @@ When failing an RPC due to the subchannel not having any established connections, note that the RPC will be eligible for transparent retries (see [A6]), because no wire traffic was produced for it. -### Interaction Between Transport and Subchannel +#### Interaction Between Transport and Subchannel In order for the subchannel to know when to create a new connection, the transport will need to report the peer's MAX_CONCURRENT_STREAMS setting @@ -494,24 +494,6 @@ only if the subchannel has no working connections. If there is at least one working connection, then even if the RPC cannot be sent immediately, the RPC will be queued in the subchannel instead. -#### Interaction with xDS Circuit Breaking - -gRPC currently supports xDS circuit breaking as described in [A32]. -Specifically, we support configuring the max number of RPCs in flight -to each cluster. This is done by having the xds_cluster_impl LB policy -increment an atomic counter that tracks the number of RPCs currently in -flight to the cluster, which is later decremented when the RPC ends. - -Some gRPC implementations don't actually increment the counter until -after a connection is chosen for the RPC, so that they won't erroneously -increment the counter when the picker returns a subchannel for which no -connection is available (the race condition described above). However, -now that there could be a longer delay between when the picker returns -and when a connection is chosen for the RPC, implementations will need -to increment the counter in the picker. It will then be necessary to -decrement the counter when the RPC finishes, regardless of whether it -was actually sent out or whether the subchannel failed it. - #### Subchannel Pseudo-Code The following pseudo-code illustrates the expected functionality in the @@ -616,6 +598,24 @@ TRANSIENT_FAILURE state, then pick_first will go back into CONNECTING state. 
It will start the happy eyeballs pass across all subchannels, as described in [A61]. +### Interaction with xDS Circuit Breaking + +gRPC currently supports xDS circuit breaking as described in [A32]. +Specifically, we support configuring the max number of RPCs in flight +to each cluster. This is done by having the xds_cluster_impl LB policy +increment an atomic counter that tracks the number of RPCs currently in +flight to the cluster, which is later decremented when the RPC ends. + +Some gRPC implementations don't actually increment the counter until +after a connection is chosen for the RPC, so that they won't erroneously +increment the counter when the picker returns a subchannel for which no +connection is available (the race condition described above). However, +now that there could be a longer delay between when the picker returns +and when a connection is chosen for the RPC, implementations will need +to increment the counter in the picker. It will then be necessary to +decrement the counter when the RPC finishes, regardless of whether it +was actually sent out or whether the subchannel failed it. + ### Metrics No new metrics will be defined for this feature. However, From 87489d4a233a4ea85aca006c06371a5cf826b48d Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 25 Oct 2025 00:23:18 +0000 Subject: [PATCH 14/33] use consistent terminology throughout --- ...x_concurrent_streams-connection-scaling.md | 60 +++++++++---------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 037687473..cd8e29166 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -113,34 +113,34 @@ message ServiceConfig { } ``` -In the subchannel, connection scaling will be configured via a parameter -called max_connections_per_subchannel. That parameter will be set +In the subchannel, connection scaling will be configured via a setting +called max_connections_per_subchannel. That setting will be set either via the service config or via a per-endpoint attribute in the LB -policy tree. The approach for plumbing this parameter into the +policy tree. The approach for plumbing this setting into the subchannel will be different in C-core than in Java and Go; see below for details. -The max_connections_per_subchannel attribute can change with each resolver -update, regardless of whether it is set via the service config or via an -LB policy. When this happens, we do not want to throw away the subchannel -and create a new one, since that would cause unnecessary connection churn. -This means that the max_connections_per_subchannel parameter must not -be considered part of the subchannel's unique identity that is set only -at subchannel creation time; instead, it must be changeable over the life -of a subchannel. +The max_connections_per_subchannel setting for a given subchannel can +change with each resolver update, regardless of whether it is set +via the service config or via an LB policy. When this happens, we +do not want to throw away the subchannel and create a new one, since +that would cause unnecessary connection churn. This means that the +max_connections_per_subchannel setting must not be considered part of +the subchannel's unique identity that is set only at subchannel creation +time; instead, it must be changeable over the life of a subchannel. 
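To make the intended lifecycle concrete, the following is a minimal pseudo-code sketch, in the same style as the subchannel pseudo-code later in this gRFC, of how a subchannel might apply an updated setting in place; the method and field names here are illustrative only, not part of any gRPC API:

```
# Hypothetical sketch: apply a new max_connections_per_subchannel value
# to a live subchannel without recreating it or closing its connections.
def UpdateMaxConnections(self, new_value):
  # Cap to the channel-enforced limit (10 by default), as described below.
  self.max_connections_per_subchannel = min(new_value, self.channel_limit)
  # If the value was lowered, existing connections are left alone; the
  # subchannel simply stops creating connections beyond the new cap.
  # If the value was raised and RPCs are queued, draining the queue may
  # now trigger a new connection attempt.
  self.RetryRpcsFromQueue()
```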
-If the max_connections_per_subchannel attribute is unset, the subchannel +If the max_connections_per_subchannel setting is unset, the subchannel will assume a default of 1, which effectively means the same behavior as before this gRFC. As indicated in the comment above, the channel will enforce a maximum -limit for the max_connections_per_subchannel attribute. This limit +limit for the max_connections_per_subchannel setting. This limit will be 10 by default, but gRPC will provide a channel-level setting to allow a client application to raise or lower that limit. Whenever the -max_connections_per_subchannel attribute is larger than the channel's +max_connections_per_subchannel setting is larger than the channel's limit, it will be capped to that limit. This capping will be performed in the subchannel itself, so that it will apply regardless of where the -attribute is set. +setting is set. #### C-core @@ -153,7 +153,7 @@ subchannel is created only if it doesn't already exist in the pool; otherwise, the returned subchannel wrapper will hold a new ref to the existing subchannel, so that it doesn't actually wind up creating a new subchannel (only a new subchannel wrapper). This means that we do not -want the max_connections_per_subchannel attribute to be part of the +want the max_connections_per_subchannel setting to be part of the subchannel's key in the subchannel pool, or else we will wind up recreating the subchannel whenever the attribute's value changes. @@ -161,12 +161,12 @@ In addition, by default, C-core's subchannel pool is shared between channels, meaning that if two channels attempt to create the same subchannel, they will wind up sharing a single subchannel. In this case, each channel using a given subchannel may have a different value -for the max_connections_per_subchannel attribute. The subchannel will -use the maximum value set for this attribute across all channels. +for the max_connections_per_subchannel setting. The subchannel will +use the maximum value set for this setting across all channels. To support this, the implementation will be as follows: - The subchannel will store a map from max_connections_per_subchannel - value to the number of subchannel wrappers currently holding a ref to + setting to the number of subchannel wrappers currently holding a ref to it with that value. Entries are added to the map whenever the first ref for a given value of max_connections_per_subchannel is taken, and entries are removed from the map whenever the last ref for a given @@ -199,7 +199,7 @@ deal with the case of multiple channels sharing the same subchannel. Therefore, a different approach is called for. A notification object will be used to notify the subchannel of the value -of the max_connections_per_subchannel attribute. This object will be +of the max_connections_per_subchannel setting. This object will be passed into the subchannel at creation time via a resolver attribute. When a channel is constructed, it will create a single notification @@ -229,7 +229,7 @@ model where there is only one address per subchannel, as per [A61]. #### xDS Configuration -In xDS, the max_connections_per_subchannel value will be configured via +In xDS, the max_connections_per_subchannel setting will be configured via a per-host circuit breaker in the CDS resource. This uses a similar structure to the circuit breaker described in [A32]. 
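To make this concrete, a CDS resource that enables connection scaling might contain a fragment like the following. This is shown in the YAML form conventionally used for xDS examples and assumes the standard `CircuitBreakers.per_host_thresholds` field from the Envoy API referenced below; it is illustrative only:

```
circuit_breakers:
  per_host_thresholds:
  # Per-host thresholds entry with priority DEFAULT; allows the client
  # to open up to 4 connections to each endpoint in this cluster.
  - priority: DEFAULT
    max_connections: 4
```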
@@ -242,16 +242,16 @@ field, we will now add support for the following field: If that entry is found, then within that entry: - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): If this field is set, then its value will be used to set the - max_connections_per_subchannel attribute for all endpoints for that + max_connections_per_subchannel setting for all endpoints for that xDS cluster. If it is unset, then the - max_connections_per_subchannel attribute will remain unset. A value + max_connections_per_subchannel setting will remain unset. A value of 0 will be rejected at resource validation time. A new field will be added to the parsed CDS resource representation containing the value of this field. The xds_cluster_impl LB policy will be responsible for setting the -max_connections_per_subchannel attribute based on this xDS configuration. +max_connections_per_subchannel setting based on this xDS configuration. Note that it makes sense to do this in the xds_cluster_impl LB policy instead of the cds policy for two reasons: first, this is where circuit breaking is already configured, and second, this policy is in the right @@ -262,7 +262,7 @@ any new fields in the xds_cluster_impl LB policy configuration. ### Subchannel Behavior The connection scaling functionality in the subchannel will be used if -the max_connections_per_subchannel attribute is greater than 1. +the max_connections_per_subchannel setting is greater than 1. If the value is 1 (or unset), then implementations must not impose any additional per-RPC overhead at this layer beyond what already exists @@ -293,7 +293,7 @@ following are true: i.e., the number of RPCs currently in flight on each connection is greater than or equal to the connection's MAX_CONCURRENT_STREAMS. - The number of existing connections in the subchannel is fewer than the - max_connections_per_subchannel value. + max_connections_per_subchannel setting. - There is no connection attempt currently in flight on the subchannel. The subchannel will never close a connection once it has been established. @@ -310,7 +310,7 @@ started: or a connection attempt fails. - RPCs were already queued on the subchannel and a new connection was just created that did not provide enough available streams for all pending RPCs. -- The value of the max_connections_per_subchannel attribute increases, +- The value of the max_connections_per_subchannel setting increases, all existing connections are already at their MAX_CONCURRENT_STREAMS limit, and there are queued RPCs. @@ -395,13 +395,13 @@ algorithm will be used (first match wins): connection is lower than the peer's MAX_CONCURRENT_STREAMS setting, then the RPC will be dispatched on that connection. 3. If the number of existing connections is equal to the - max_connections_per_subchannel value, the RPC will be queued. + max_connections_per_subchannel setting, the RPC will be queued. 4. If the number of existing connections is less than the - max_connections_per_subchannel value and the subchannel is in backoff + max_connections_per_subchannel setting and the subchannel is in backoff delay due to the last connection attempt failing, the RPC will be queued. 5. 
If the number of existing connections is less than the - max_connections_per_subchannel value and no connection attempt is + max_connections_per_subchannel setting and no connection attempt is currently in flight, a new connection attempt will be started, and the RPC will be queued. From a3d4885b60d019d02111a450ed2aed64346b4cb6 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 25 Oct 2025 00:27:35 +0000 Subject: [PATCH 15/33] remove the last remaining TODOs --- A105-max_concurrent_streams-connection-scaling.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index cd8e29166..39b85ad5a 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -361,8 +361,6 @@ transitions are possible: See [PickFirst Changes](#pickfirst-changes) below for details on how pick_first will handle these new transitions. -TODO: update client channel spec with this info - #### Picking a Connection for Each RPC When an RPC is started and a subchannel is picked for that RPC, the @@ -629,8 +627,6 @@ initially be guarded via the environment variable `GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. The feature will be enabled by default once it has passed interop tests. -TODO: define interop tests? - ## Rationale We considered several different approaches as part of this design process. From 34bed1bf940aba19b414400ab7114f4d69ad7e29 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 25 Oct 2025 00:46:38 +0000 Subject: [PATCH 16/33] add mailing list link --- A105-max_concurrent_streams-connection-scaling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 39b85ad5a..b4f1c0923 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -5,7 +5,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: * Last updated: 2025-10-24 -* Discussion at: (filled after thread exists) +* Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract From 0cdd21de08038be130845849b46e3ef100b6463b Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 27 Oct 2025 22:29:53 +0000 Subject: [PATCH 17/33] fix grammar --- A105-max_concurrent_streams-connection-scaling.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index b4f1c0923..e46809856 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -218,11 +218,11 @@ resolver attributes. It can then tell the notification object what value to use whenever it gets a config update. Because the notification object is passed to the subchannel only at -creation time, which means that the LB policy must decide whether -to inject its own notification object at that time. If an LB policy -initially wants to just use the default value from the channel, it would -need to proxy that value from the original notification object that was -passed down from the channel. +creation time, the LB policy must decide whether to inject its own +notification object at that time. 
If an LB policy initially wants to +just use the default value from the channel, it would need to proxy that +value from the original notification object that was passed down from +the channel. Note that this approach assumes that Java and Go have switched to a model where there is only one address per subchannel, as per [A61]. From 76db4c123a375fb9baf3b2d2cb64cb4c5c7fa621 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 27 Oct 2025 22:57:34 +0000 Subject: [PATCH 18/33] fix example --- A105-max_concurrent_streams-connection-scaling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index e46809856..68517429a 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -333,7 +333,7 @@ Instead, it must wait for the in-flight connection attempt to finish. If that attempt fails, then backoff must be performed before starting the next connection attempt. But if that attempt succeeds, backoff state will be reset, so if there are still enough queued RPCs to warrant a -second connection, then the subchannel may immediately start another +third connection, then the subchannel may immediately start another connection attempt. The connectivity state of the subchannel should be determined as follows From 4e051a2f7c16fe5d36297a9f16cb6ee0f5e9f1eb Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Wed, 29 Oct 2025 18:37:30 +0000 Subject: [PATCH 19/33] improve pseudo-code, and don't drain queue when connection attempt fails --- A105-max_concurrent_streams-connection-scaling.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 68517429a..4b74f532d 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -413,7 +413,7 @@ queuing strategy. When retrying a queued RPC, the subchannel will use the same algorithm described above that it will when it first sees the RPC. RPCs will be drained from the queue upon the following events: -- When a connection attempt completes, whether successfully or not. +- When a connection attempt completes successfully. - When the backoff timer fires. - When an existing connection fails. - When the transport for a connection reports a new value for @@ -523,13 +523,13 @@ def MaybeDispatchRpc(self, rpc): # Starts an RPC on the subchannel. def StartRpc(self, rpc): if not self.MaybeDispatchRpc(rpc): - self.queue.append(rpc) + self.queue.add(rpc) # Retries RPCs from the queue, in order. def RetryRpcsFromQueue(self): while len(self.queue) > 0: # Stop at the first RPC that gets queued. - if not self.MaybeDispatchRpc(self.queue[-1]): + if not self.MaybeDispatchRpc(self.queue.front()): break self.queue.pop() @@ -550,7 +550,6 @@ def OnConnectionAttemptSucceeded(self, new_connection): def OnConnectionAttemptFailed(self): self.connection_attempt = None self.pending_backoff_timer = Timer(self.backoff_state.NextBackoffDelay()) - self.RetryRpcsFromQueue() # Called when the backoff timer fires. Will trigger a new connection # attempt if there are RPCs in the queue. From 3e3f4af57c205143a7794723010e6f5a36a59a7f Mon Sep 17 00:00:00 2001 From: "Mark D. 
Roth" Date: Wed, 29 Oct 2025 18:38:16 +0000 Subject: [PATCH 20/33] update date --- A105-max_concurrent_streams-connection-scaling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 4b74f532d..54b570e1e 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-24 +* Last updated: 2025-10-29 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract From cc50f5d251638ac6dcf700ca7609263aa25a1268 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 31 Oct 2025 19:34:37 +0000 Subject: [PATCH 21/33] update pseudo-code --- ...x_concurrent_streams-connection-scaling.md | 45 ++++++++++++------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 54b570e1e..660853d6d 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-29 +* Last updated: 2025-10-31 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract @@ -498,19 +498,14 @@ The following pseudo-code illustrates the expected functionality in the subchannel: ``` -# Returns True if the RPC has been handled, False if it needs to be queued. -def MaybeDispatchRpc(self, rpc): - # If there are no connections, fail the RPC. - # Note that the RPC will be eligible for transparent retries. - if len(self.connections) == 0: - rpc.Fail(UNAVAILABLE) - return True +# Returns a connection to use for an RPC, or None if no connection is +# currently available to send an RPC on. +def ChooseConnection(self): # Use the oldest connection that can accept a new stream, if any. for connection in self.connections: if connection.rpcs_in_flight < connection.max_concurrent_streams: connection.rpcs_in_flight += 1 - connection.StartRpc(rpc) - return True + return connection # If we aren't yet at the max number of connections, see if we can # create a new one. if (len(self.connections) < self.max_connections_per_subchannel and @@ -518,19 +513,29 @@ def MaybeDispatchRpc(self, rpc): self.pending_backoff_timer is None): self.StartConnectionAttempt() # Didn't find a connection for the RPC, so queue it. - return False + return None -# Starts an RPC on the subchannel. -def StartRpc(self, rpc): - if not self.MaybeDispatchRpc(rpc): - self.queue.add(rpc) +# Get a connection to start a new RPC. +def GetConnection(self): + # If there are no connections, tell the channel to queue the LB pick. + if len(self.connections) == 0: + return None + connection = self.ChooseConnection() + # If we didn't find a connection to use, return a fake connection that + # adds all RPCs to self.queue. + if connection is None: + return FakeConnectionThatQueuesRpcs() + return connection # Retries RPCs from the queue, in order. def RetryRpcsFromQueue(self): while len(self.queue) > 0: + connection = self.ChooseConnection() # Stop at the first RPC that gets queued. 
- if not self.MaybeDispatchRpc(self.queue.front()): + if connection is None: break + # Otherwise, send the RPC on the connection. + connection.SendRpc(self.queue.front()) self.queue.pop() # Starts a new connection attempt. @@ -561,7 +566,13 @@ def OnBackoffTimer(self): # connection attempt if there are RPCs in the queue. def OnConnectionFailed(self, failed_connection): self.connections.remove(failed_connection) - self.RetryRpcsFromQueue() + if len(self.connections) == 0: + for rpc in self.queue: + # RPC will be eligible for transparent retries. + rpc.Fail(UNAVAILABLE) + self.queue = [] + else: + self.RetryRpcsFromQueue() # Maybe trigger new connection attempt. # Called when a connection reports a new MAX_CONCURRENT_STREAMS value. # May send RPCs on the connection if there are any queued and the From b562b6de364acbc58e58d448d5608660e4028613 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 31 Oct 2025 19:40:08 +0000 Subject: [PATCH 22/33] clarify behavior when there are no working connections --- ...x_concurrent_streams-connection-scaling.md | 31 +++++++++---------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 660853d6d..a5a22c916 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -296,12 +296,6 @@ following are true: max_connections_per_subchannel setting. - There is no connection attempt currently in flight on the subchannel. -The subchannel will never close a connection once it has been established. -However, when a connection is closed for any reason, it is removed from -the subchannel. If the application wishes to garbage collect unused -connections, it should configure MAX_CONNECTION_IDLE on the server side, -as described in [A9]. - Some examples of cases that can trigger a new connection attempt to be started: - A new RPC is started on the subchannel and all existing connections @@ -314,6 +308,17 @@ started: all existing connections are already at their MAX_CONCURRENT_STREAMS limit, and there are queued RPCs. +The subchannel will never close a connection once it has been established. +However, when a connection is closed for any reason, it is removed from +the subchannel. If the application wishes to garbage collect unused +connections, it should configure MAX_CONNECTION_IDLE on the server side, +as described in [A9]. + +If all working connections are terminated, the subchannel will fail +all queued RPCs with status UNAVAILABLE. Note that these RPCs will be +eligible for transparent retries (see [A6]), because no wire traffic +was produced for them. + #### Backoff Behavior and Connectivity State The subchannel must have no more than one connection @@ -386,19 +391,17 @@ is expected to be rare. When choosing a connection for an RPC within a subchannel, the following algorithm will be used (first match wins): -1. If the subchannel has no working connections, then the RPC will be - failed with status UNAVAILABLE. -2. Look through all working connections in order from the oldest to the +1. Look through all working connections in order from the oldest to the newest. For each connection, if the number of RPCs in flight on the connection is lower than the peer's MAX_CONCURRENT_STREAMS setting, then the RPC will be dispatched on that connection. -3. If the number of existing connections is equal to the +2. 
If the number of existing connections is equal to the max_connections_per_subchannel setting, the RPC will be queued. -4. If the number of existing connections is less than the +3. If the number of existing connections is less than the max_connections_per_subchannel setting and the subchannel is in backoff delay due to the last connection attempt failing, the RPC will be queued. -5. If the number of existing connections is less than the +4. If the number of existing connections is less than the max_connections_per_subchannel setting and no connection attempt is currently in flight, a new connection attempt will be started, and the RPC will be queued. @@ -420,10 +423,6 @@ drained from the queue upon the following events: MAX_CONCURRENT_STREAMS. - When an RPC dispatched on one of the connections completes. -When failing an RPC due to the subchannel not having any established -connections, note that the RPC will be eligible for transparent retries -(see [A6]), because no wire traffic was produced for it. - #### Interaction Between Transport and Subchannel In order for the subchannel to know when to create a new connection, the From 37bfa760acda1c68ce75046900d1a2c7dfa4b3bc Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 3 Nov 2025 22:14:46 +0000 Subject: [PATCH 23/33] channelz changes --- ...x_concurrent_streams-connection-scaling.md | 43 ++++++++++++++++++- 1 file changed, 42 insertions(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index a5a22c916..6f8886b80 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-10-31 +* Last updated: 2025-11-03 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract @@ -48,6 +48,7 @@ channels it uses. ### Related Proposals: * [A6: gRPC Retry Design][A6] * [A9: Server-side Connection Management][A9] +* [A14: Channelz][A14] * [A32: xDS Circuit Breaking][A32] * [A61: IPv4 and IPv6 Dualstack Backend Support][A61] * [A74: xDS Config Tears][A74] @@ -56,6 +57,7 @@ channels it uses. [A6]: A6-client-retries.md [A9]: A9-server-side-conn-mgt.md +[A14]: A14-channelz.md [A32]: A32-xds-circuit-breaking.md [A61]: A61-IPv4-IPv6-dualstack-backends.md [A74]: A74-xds-config-tears.md @@ -623,6 +625,45 @@ to increment the counter in the picker. It will then be necessary to decrement the counter when the RPC finishes, regardless of whether it was actually sent out or whether the subchannel failed it. +### Channelz Changes + +The channelz API defined in [A14] will need some changes to expose the +state of connection scaling. + +In the `ChannelData` message, we will add a new `uint32 +max_connections_per_subchannel` field, which will be populated only for +subchannels, not channels. + +The channelz data model already supports [multiple sockets per +subchannel](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L80). +Today, a subchannel can have more than one socket when a connection +receives a GOAWAY but remains open until already-in-flight RPCs finish, +while at the same time a new connection is established to send new RPCs. 
+The current channelz data model does not explicitly indicate which connection has received the GOAWAY, although that can be inferred from the channelz socket ID, since the lower ID is older and will therefore be the one that received the GOAWAY. With this design, that inference will no longer be possible, because there can now be more than one established connection in a subchannel. Therefore, we will add a new `google.protobuf.UInt32Value received_goaway_error` field to the channelz `SocketData` message, which will be populated if a GOAWAY has been received, in which case its value will be the HTTP/2 error code from the received GOAWAY. The `SocketData` message already contains [`streams_started`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L252), [`streams_succeeded`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L256), and [`streams_failed`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L260) fields, which enable computing the number of streams currently in flight. However, we will add a new `uint32 peer_max_concurrent_streams` field that will report the value of the peer's MAX_CONCURRENT_STREAMS setting; note that this field does not need to be populated in the gRPC server, only the client. Note that all data in `SocketNode` will be reported from the perspective of the transport, not the perspective of the subchannel. ### Metrics No new metrics will be defined for this feature. However, From 986ce3da88e0decbf903478cfc715a621b22e101 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 3 Nov 2025 23:38:15 +0000 Subject: [PATCH 24/33] clarify queuing behavior and overhead comments --- ...x_concurrent_streams-connection-scaling.md | 66 ++++++++++--------- 1 file changed, 35 insertions(+), 31 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 6f8886b80..6eaa17a57 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -263,19 +263,18 @@ any new fields in the xds_cluster_impl LB policy configuration. ### Subchannel Behavior -The connection scaling functionality in the subchannel will be used if -the max_connections_per_subchannel setting is greater than 1. -If the value is 1 (or unset), then implementations must not impose any -additional per-RPC overhead at this layer beyond what already exists -today. In other words, the connection scaling feature must not affect -performance unless it is actually enabled. +If max_connections_per_subchannel is 1, the subchannel will provide +essentially the same behavior as before this gRFC. In that case, +implementations should minimize any additional per-RPC overhead at this +layer beyond what already existed prior to this design. In other words, +the connection scaling feature should ideally not affect performance +unless it is actually enabled. #### Connection Management -When max_connections_per_subchannel is greater than 1, the subchannel will -contain up to that number of established connections. The connections -will be stored in a list ordered by the time at which the connection was +The subchannel will establish up to the number of connections +specified by max_connections_per_subchannel.
The connections will +be stored in a list ordered by the time at which the connection was established, so that the oldest connection is at the start of the list. Each connection will store the peer's MAX_CONCURRENT_STREAMS setting @@ -370,27 +369,6 @@ pick_first will handle these new transitions. #### Picking a Connection for Each RPC -When an RPC is started and a subchannel is picked for that RPC, the -subchannel will find the first available connection for the RPC, in -order of connection creation. - -The subchannel must ensure that races do not happen while dispatching -RPCs to a connection that will lead to one or more RPCs being queued in -the connection despite having available quota elsewhere. For example, -if two RPCs are initiated at the same time and one stream is available -in a connection, both RPCs must not choose the same connection, or else -one will queue. This race can be avoided with locks or atomics. - -One race that may lead to RPCs being queued in a connection is if the -MAX_CONCURRENT_STREAMS setting of a connection is lowered by the server -after RPCs are dispatched to the connection. This race can be avoided -if the connection is modified to not queue RPCs but instead report the -scenario back to the subchannel, or to coordinate the SETTINGS frame ACK -with the subchannel. Such changes are out of scope for this design, -but may be considered in the future. For the purposes of this design, -it is acceptable to queue RPCs on a connection due to this race, which -is expected to be rare. - When choosing a connection for an RPC within a subchannel, the following algorithm will be used (first match wins): 1. Look through all working connections in order from the oldest to the @@ -425,6 +403,32 @@ drained from the queue upon the following events: MAX_CONCURRENT_STREAMS. - When an RPC dispatched on one of the connections completes. +Because we are now handling queuing in the subchannel, transport +implementations no longer need to handle this queuing. Instead, when +a transport sees an RPC and does not have quota to send it on the wire, +the transport may fail the RPC in a way that is eligible for transparent +retries (see [A6]). Note that implementations should make sure that they +do not introduce an infinite loop here: the transparent retry must not +block the subchannel from processing the subsequent MAX_CONCURRENT_STREAMS +update from the transport, although no explicit synchronization is needed +to ensure that the first transparent retry must not happen until the +subchannel has seen that update. + +The subchannel must ensure that races do not happen while dispatching +RPCs to a connection. For example, if two RPCs are initiated at the same +time and one stream is available in a connection, both RPCs must not choose +the same connection. This race can be avoided with locks or atomics. + +One race that may lead to an RPC being sent on a connection with +insufficient quota is if the MAX_CONCURRENT_STREAMS setting of a +connection is lowered by the server after RPCs are dispatched to the +connection. This is expected to be a rare case, so for this design, +we do not attempt to address this race, with the understanding that +transport implementations will either queue the RPC or fail it in a +way that causes it to be transparently retried. In the future, we may +improve the handling of this race by coordinating the SETTINGS frame +ACK with the subchannel. 
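As an illustration of the locking discipline described above, the following sketch (in the same style as the subchannel pseudo-code elsewhere in this gRFC, with illustrative names) shows how a subchannel might claim a stream slot atomically so that two concurrent RPCs can never both take the last available stream on a connection:

```
# Hypothetical sketch: dispatch under a subchannel-wide lock so that
# per-connection stream accounting never races between concurrent RPCs.
def DispatchOrQueue(self, rpc):
  with self.lock:
    chosen = None
    # Scan connections oldest-first, matching the algorithm above.
    for connection in self.connections:
      if connection.rpcs_in_flight < connection.max_concurrent_streams:
        connection.rpcs_in_flight += 1  # slot claimed while holding the lock
        chosen = connection
        break
    if chosen is None:
      # No stream slot available; queue in arrival order for fairness.
      self.queue.append(rpc)
      return
  # The actual I/O happens outside the lock.
  chosen.StartRpc(rpc)
```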
+ #### Interaction Between Transport and Subchannel In order for the subchannel to know when to create a new connection, the From c96c640af9f05e271dec0e97066129bacbd14e35 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 3 Nov 2025 23:53:31 +0000 Subject: [PATCH 25/33] improve wording --- ...x_concurrent_streams-connection-scaling.md | 41 ++++++++----------- 1 file changed, 17 insertions(+), 24 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 6eaa17a57..93e59d2b8 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -386,6 +386,11 @@ algorithm will be used (first match wins): currently in flight, a new connection attempt will be started, and the RPC will be queued. +The subchannel must ensure that races do not happen while dispatching +RPCs to a connection. For example, if two RPCs are initiated at the same +time and one stream is available in a connection, both RPCs must not choose +the same connection. This race can be avoided with locks or atomics. + When queueing an RPC, the queue must be roughly fair: RPCs must be dispatched in the order in which they are received into the queue, acknowledging that timing between threads may lead to @@ -404,30 +409,18 @@ drained from the queue upon the following events: - When an RPC dispatched on one of the connections completes. Because we are now handling queuing in the subchannel, transport -implementations no longer need to handle this queuing. Instead, when -a transport sees an RPC and does not have quota to send it on the wire, -the transport may fail the RPC in a way that is eligible for transparent -retries (see [A6]). Note that implementations should make sure that they -do not introduce an infinite loop here: the transparent retry must not -block the subchannel from processing the subsequent MAX_CONCURRENT_STREAMS -update from the transport, although no explicit synchronization is needed -to ensure that the first transparent retry must not happen until the -subchannel has seen that update. - -The subchannel must ensure that races do not happen while dispatching -RPCs to a connection. For example, if two RPCs are initiated at the same -time and one stream is available in a connection, both RPCs must not choose -the same connection. This race can be avoided with locks or atomics. - -One race that may lead to an RPC being sent on a connection with -insufficient quota is if the MAX_CONCURRENT_STREAMS setting of a -connection is lowered by the server after RPCs are dispatched to the -connection. This is expected to be a rare case, so for this design, -we do not attempt to address this race, with the understanding that -transport implementations will either queue the RPC or fail it in a -way that causes it to be transparently retried. In the future, we may -improve the handling of this race by coordinating the SETTINGS frame -ACK with the subchannel. +implementations should no longer need to handle this queuing. The only +case where a transport may see an RPC that it does not have quota to send +is if the MAX_CONCURRENT_STREAMS setting of a connection is lowered by +the server after RPCs are dispatched to the connection. Transports are +encouraged to handle this case by failing RPCs in a way that is eligible +for transparent retries (see [A6]) rather than by queueing the RPC. 
Note that implementations should make sure that they do not introduce an infinite loop here: the transparent retry must not block the subchannel from processing the subsequent MAX_CONCURRENT_STREAMS update from the transport; the first transparent retry must not happen until the subchannel has seen that update, although no explicit synchronization is needed to ensure this. #### Interaction Between Transport and Subchannel In order for the subchannel to know when to create a new connection, the From d1fe071f34719271c0df7f13e5edb641b36fd03d Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 7 Nov 2025 20:00:59 +0000 Subject: [PATCH 26/33] 0 is not a valid value --- A105-max_concurrent_streams-connection-scaling.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 93e59d2b8..1d43fad33 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-11-03 +* Last updated: 2025-11-07 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract @@ -109,7 +109,7 @@ message ServiceConfig { // // Values higher than the client-enforced limit (by default, 10) // will be clamped to that limit. - google.protobuf.UInt32Value max_connections_per_subchannel = 1; + uint32 max_connections_per_subchannel = 1; } ConnectionScaling connection_scaling = N; } From f6855b02a720fe892607bac6c70b28ffdb57519d Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 7 Nov 2025 20:43:23 +0000 Subject: [PATCH 27/33] separate env vars for service config and xDS --- A105-max_concurrent_streams-connection-scaling.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 1d43fad33..3236b3ef3 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -669,10 +669,16 @@ note that the additional connections will show up in the ### Temporary environment variable protection -Enabling this feature via either the gRPC service config or xDS will -initially be guarded via the environment variable -`GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. The -feature will be enabled by default once it has passed interop tests. +Enabling this feature via the gRPC service config will initially be +guarded via the environment variable +`GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. + +Enabling this feature via xDS will initially be +guarded via the environment variable +`GRPC_EXPERIMENTAL_XDS_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. + +Each of the two configuration mechanisms will be enabled by default once +it has passed interop tests. From 5b855b9ccaf74654d59768fec3871daa62040c89 Mon Sep 17 00:00:00 2001 From: "Mark D.
Roth" Date: Fri, 7 Nov 2025 21:41:58 +0000 Subject: [PATCH 28/33] max_connections_per_subchannel cap applies only to service config --- A105-max_concurrent_streams-connection-scaling.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 3236b3ef3..df6711c82 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -140,9 +140,7 @@ limit for the max_connections_per_subchannel setting. This limit will be 10 by default, but gRPC will provide a channel-level setting to allow a client application to raise or lower that limit. Whenever the max_connections_per_subchannel setting is larger than the channel's -limit, it will be capped to that limit. This capping will be performed -in the subchannel itself, so that it will apply regardless of where the -setting is set. +limit, it will be capped to that limit. #### C-core From 54694dd93c333d354f099cc9d1e74a91e27afafa Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 7 Nov 2025 23:11:20 +0000 Subject: [PATCH 29/33] Revert "max_connections_per_subchannel cap applies only to service config" This reverts commit 5b855b9ccaf74654d59768fec3871daa62040c89. --- A105-max_concurrent_streams-connection-scaling.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index df6711c82..3236b3ef3 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -140,7 +140,9 @@ limit for the max_connections_per_subchannel setting. This limit will be 10 by default, but gRPC will provide a channel-level setting to allow a client application to raise or lower that limit. Whenever the max_connections_per_subchannel setting is larger than the channel's -limit, it will be capped to that limit. +limit, it will be capped to that limit. This capping will be performed +in the subchannel itself, so that it will apply regardless of where the +setting is set. #### C-core From 42303fecbbf552966eb74199a40921039120caf3 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 21 Nov 2025 00:04:45 +0000 Subject: [PATCH 30/33] review changes --- ...x_concurrent_streams-connection-scaling.md | 28 ++++++++++--------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 3236b3ef3..3dfedec4c 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-11-07 +* Last updated: 2025-11-20 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract @@ -244,10 +244,11 @@ field, we will now add support for the following field: If that entry is found, then within that entry: - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): If this field is set, then its value will be used to set the - max_connections_per_subchannel setting for all endpoints for that - xDS cluster. 
If it is unset, then the - max_connections_per_subchannel setting will remain unset. A value - of 0 will be rejected at resource validation time. + max_connections_per_subchannel setting for all endpoints for the + cluster. If it is unset, then no max_connections_per_subchannel + setting will be set for the cluster's endpoints (i.e., the subchannel + will assume a value of 1 by default). A value of 0 will be rejected + at resource validation time. A new field will be added to the parsed CDS resource representation containing the value of this field. @@ -398,9 +399,9 @@ concurrent RPCs being added to the queue in an arbitrary order. See [Rationale](#rationale) below for a possible adjustment to this queuing strategy. -When retrying a queued RPC, the subchannel will use the same algorithm +When dequeuing an RPC, the subchannel will use the same algorithm described above that it will when it first sees the RPC. RPCs will be -drained from the queue upon the following events: +dequeued upon the following events: - When a connection attempt completes successfully. - When the backoff timer fires. - When an existing connection fails. @@ -479,11 +480,11 @@ There are only two possible results: at which point a new pick will be done for the queued RPC. With this design, there is a third outcome possible: the subchannel may -not have a set of working connections, but they may all already be at -their MAX_CONCURRENT_STREAMS limits, so the RPC may need to be queued -until the subchannel has a connection that it can send it on. This case -can be handled by having the subchannel return a fake connection object -that queues the RPC in the subchannel. +have a set of working connections, but they may all already be at their +MAX_CONCURRENT_STREAMS limits, so the RPC may need to be queued until +the subchannel has a connection that it can send it on. This case can +be handled by having the subchannel return a fake connection object that +queues the RPC in the subchannel. Note that the race condition described in case 2 above will now happen only if the subchannel has no working connections. If there is at least @@ -602,7 +603,8 @@ states. If the selected (READY) subchannel transitions to CONNECTING or TRANSIENT_FAILURE state, then pick_first will go back into CONNECTING state. It will start the happy eyeballs pass across all subchannels, -as described in [A61]. +as described in [A61]. Note that this will trigger a re-resolution +request, just as the existing transition from READY to IDLE does. ### Interaction with xDS Circuit Breaking From 38df1e1c663739e605a6ba4da80fcd6b170617eb Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 21 Nov 2025 16:22:37 +0000 Subject: [PATCH 31/33] add updated-by header to A32 --- A32-xds-circuit-breaking.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A32-xds-circuit-breaking.md b/A32-xds-circuit-breaking.md index 3eafd74b9..908cc8f0d 100644 --- a/A32-xds-circuit-breaking.md +++ b/A32-xds-circuit-breaking.md @@ -6,7 +6,7 @@ * Implemented in: C-core, Java, Go * Last updated: 2021-02-12 * Discussion at: https://groups.google.com/g/grpc-io/c/NEx70p8mcjg - +* Updated by: [gRFC A105: MAX_CONCURRENT_STREAMS Connection Scaling](A105-max_concurrent_streams-connection-scaling.md) ## Abstract From b31071cecaff1fe9d363750a382dbb8478c2bb79 Mon Sep 17 00:00:00 2001 From: "Mark D. 
Roth" Date: Mon, 1 Dec 2025 19:24:54 +0000 Subject: [PATCH 32/33] review comments --- ...x_concurrent_streams-connection-scaling.md | 193 +++++++++--------- 1 file changed, 101 insertions(+), 92 deletions(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 3dfedec4c..345634c73 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -4,7 +4,7 @@ A105: MAX_CONCURRENT_STREAMS Connection Scaling * Approver: @ejona86 * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-11-20 +* Last updated: 2025-12-01 * Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE ## Abstract @@ -76,7 +76,7 @@ There are several parts to this proposal: - Configuration for the max number of connections per subchannel, via either the [service config](https://github.com/grpc/grpc/blob/master/doc/service_config.md) - or via a per-endpoint attribute in the LB policy tree. + or via xDS. - A mechanism for the transport to report the current MAX_CONCURRENT_STREAMS setting to the subchannel layer. - Connection scaling functionality in the subchannel. @@ -88,9 +88,39 @@ There are several parts to this proposal: ### Configuration -Connection scaling will be configured via a new service config field, -as follows (schema shown in protobuf form, although gRPC actually accepts -the service config in JSON form): +In the subchannel, connection scaling will be configured via a setting +called max_connections_per_subchannel. That setting will be set either +via the service config or via xDS. The approach for plumbing this +setting into the subchannel will be different in C-core than in Java +and Go; see below for details. + +The max_connections_per_subchannel setting for a given subchannel +can change with each resolver update, regardless of whether it is +set via the service config or via xDS. When this happens, we do +not want to throw away the subchannel and create a new one, since +that would cause unnecessary connection churn. This means that the +max_connections_per_subchannel setting must not be considered part of +the subchannel's unique identity that is set only at subchannel creation +time; instead, it must be changeable over the life of a subchannel. + +If the max_connections_per_subchannel setting is unset, the subchannel +will assume a default of 1, which effectively means the same behavior +as before this gRFC. + +The channel will enforce a maximum limit for the +max_connections_per_subchannel setting. This limit will be 10 by +default, but gRPC will provide a channel-level setting to allow +a client application to raise or lower that limit. Whenever the +max_connections_per_subchannel setting is larger than the channel's limit, +it will be capped to that limit. This capping will be performed in the +subchannel itself, so that it will apply regardless of where the setting +is set. + +#### gRPC Service Config + +In the gRPC service config, connection scaling will be configured via +a new field, as follows (schema shown in protobuf form, although gRPC +actually accepts the service config in JSON form): ```proto message ServiceConfig { @@ -115,36 +145,40 @@ message ServiceConfig { } ``` -In the subchannel, connection scaling will be configured via a setting -called max_connections_per_subchannel. That setting will be set -either via the service config or via a per-endpoint attribute in the LB -policy tree. 
The approach for plumbing this setting into the -subchannel will be different in C-core than in Java and Go; see below -for details. - -The max_connections_per_subchannel setting for a given subchannel can -change with each resolver update, regardless of whether it is set -via the service config or via an LB policy. When this happens, we -do not want to throw away the subchannel and create a new one, since -that would cause unnecessary connection churn. This means that the -max_connections_per_subchannel setting must not be considered part of -the subchannel's unique identity that is set only at subchannel creation -time; instead, it must be changeable over the life of a subchannel. +#### xDS Configuration -If the max_connections_per_subchannel setting is unset, the subchannel -will assume a default of 1, which effectively means the same behavior -as before this gRFC. +In xDS, the max_connections_per_subchannel setting will be configured via +a per-host circuit breaker in the CDS resource. This uses a similar +structure to the circuit breaker described in [A32]. + +In the CDS resource, in the +[circuit_breakers](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/cluster.proto#L885) +field, we will now add support for the following field: +- [per_host_thresholds](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L120): + As in [A32], gRPC will look only at the first entry for priority + [DEFAULT](https://github.com/envoyproxy/envoy/blob/6ab1e7afbfda48911e187c9d653a46b8bca98166/api/envoy/config/core/v3/base.proto#L39). + If that entry is found, then within that entry: + - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): + If this field is set, then its value will be used to set the + max_connections_per_subchannel setting for all endpoints for the + cluster. If it is unset, then no max_connections_per_subchannel + setting will be set for the cluster's endpoints (i.e., the subchannel + will assume a value of 1 by default). A value of 0 will be rejected + at resource validation time. + +A new field will be added to the parsed CDS resource representation +containing the value of this field. -As indicated in the comment above, the channel will enforce a maximum -limit for the max_connections_per_subchannel setting. This limit -will be 10 by default, but gRPC will provide a channel-level setting to -allow a client application to raise or lower that limit. Whenever the -max_connections_per_subchannel setting is larger than the channel's -limit, it will be capped to that limit. This capping will be performed -in the subchannel itself, so that it will apply regardless of where the -setting is set. +The xds_cluster_impl LB policy will be responsible for setting the +max_connections_per_subchannel setting based on this xDS configuration. +Note that it makes sense to do this in the xds_cluster_impl LB policy +instead of the cds policy for two reasons: first, this is where circuit +breaking is already configured, and second, this policy is in the right +location in the LB policy tree regardless of whether [A75] has been +implemented yet. Note that post-[A74], this will not require adding +any new fields in the xds_cluster_impl LB policy configuration. 
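For concreteness, a CDS resource carrying this setting might include the following fragment (an illustrative textproto sketch of `envoy.config.cluster.v3.Cluster`, following the proto definitions linked above; the value 4 is an arbitrary example):

```textproto
circuit_breakers {
  per_host_thresholds {
    # gRPC looks only at the first entry with priority DEFAULT.
    priority: DEFAULT
    # Used as max_connections_per_subchannel for all of the cluster's
    # endpoints; 0 would be rejected at resource validation time.
    max_connections { value: 4 }
  }
}
```

The equivalent setting via the gRPC service config, shown earlier in protobuf schema form, would look roughly like this in JSON, assuming the standard proto3 JSON field-name mapping:

```json
{
  "connectionScaling": {
    "maxConnectionsPerSubchannel": 4
  }
}
```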
-#### C-core +#### Config Plumbing in C-core In C-core, every time there is a resolver update, the LB policy calls `CreateSubchannel()` for every address in the new address list. @@ -189,10 +223,11 @@ To support this, the implementation will be as follows: - If max_connections_per_subchannel is configured via the service config, the `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL` channel arg will be set by the channel before passing the resolver update to the LB - policy tree. Individual LB policies may override this channel arg - before `CreateSubchannel()` is called. + policy tree. Individual LB policies (such as the xds_cluster_impl + policy) may override this channel arg before `CreateSubchannel()` + is called. -#### Java and Go +#### Config Plumbing in Java and Go In Java and Go, there is no subchannel pool, and LB policies will not call `CreateSubchannel()` for any address for which they already have a @@ -202,66 +237,40 @@ Therefore, a different approach is called for. A notification object will be used to notify the subchannel of the value of the max_connections_per_subchannel setting. This object will be -passed into the subchannel at creation time via a resolver attribute. - -When a channel is constructed, it will create a single notification -object to be used for all subchannels for the lifetime of the channel. -This notification object will be added to the resolver attributes before -the resolver update is passed to the LB policy, so that it will propagate -through to the `CreateSubchannel()` call. Whenever the resolver returns -a service config that sets max_connections_per_subchannel, the channel will -tell the notification object to use that value, which will then be -communicated to the subchannels. - -For the case where an LB policy wants to override the value of -max_connections_per_subchannel, it can create its own notification -object and replace the channel's notification object with its own in the -resolver attributes. It can then tell the notification object what -value to use whenever it gets a config update. - -Because the notification object is passed to the subchannel only at -creation time, the LB policy must decide whether to inject its own -notification object at that time. If an LB policy initially wants to -just use the default value from the channel, it would need to proxy that -value from the original notification object that was passed down from -the channel. +passed into the subchannel at creation time and will be used for the +life of the subchannel (i.e., there is no way for the subchannel to swap +to a different notification object later). Multiple subchannels can +share the same notification object; whenever the notification object is +told to use a new value for max_connections_per_subchannel, all +subchannels using that object will be notified of the change. + +The notification object can come from one of two places, depending on +whether the configuration comes from the gRPC service config or from xDS. + +For the service config case, when the channel is constructed, it will +create a single notification object to be used for all subchannels for +the lifetime of the channel. Whenever the resolver returns a service +config, the channel will tell the notification object to use the new +value of max_connections_per_subchannel. + +For the xDS case, when the xds_cluster_impl LB policy is constructed, it +will create a single notification object to used for all subchannels +for the lifetime of that xds_cluster_impl policy instance. 
Whenever the +xds_cluster_impl policy receives an xDS update, it will tell the +notification object to use the new value of +max_connections_per_subchannel. The notification object will be passed +to the child policy via an attribute that will be passed into the +`CreateSubchannel()` call. + +The channel's implementation of `CreateSubchannel()` will check to see +if the LB policy included a notification object in an attribute. If it +did, that notification object will be used when creating the subchannel; +otherwise, the channel-level notification object fed from the service +config will be used. Note that this approach assumes that Java and Go have switched to a model where there is only one address per subchannel, as per [A61]. -#### xDS Configuration - -In xDS, the max_connections_per_subchannel setting will be configured via -a per-host circuit breaker in the CDS resource. This uses a similar -structure to the circuit breaker described in [A32]. - -In the CDS resource, in the -[circuit_breakers](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/cluster.proto#L885) -field, we will now add support for the following field: -- [per_host_thresholds](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L120): - As in [A32], gRPC will look only at the first entry for priority - [DEFAULT](https://github.com/envoyproxy/envoy/blob/6ab1e7afbfda48911e187c9d653a46b8bca98166/api/envoy/config/core/v3/base.proto#L39). - If that entry is found, then within that entry: - - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): - If this field is set, then its value will be used to set the - max_connections_per_subchannel setting for all endpoints for the - cluster. If it is unset, then no max_connections_per_subchannel - setting will be set for the cluster's endpoints (i.e., the subchannel - will assume a value of 1 by default). A value of 0 will be rejected - at resource validation time. - -A new field will be added to the parsed CDS resource representation -containing the value of this field. - -The xds_cluster_impl LB policy will be responsible for setting the -max_connections_per_subchannel setting based on this xDS configuration. -Note that it makes sense to do this in the xds_cluster_impl LB policy -instead of the cds policy for two reasons: first, this is where circuit -breaking is already configured, and second, this policy is in the right -location in the LB policy tree regardless of whether [A75] has been -implemented yet. Note that post-[A74], this will not require adding -any new fields in the xds_cluster_impl LB policy configuration. - ### Subchannel Behavior If max_connections_per_subchannel is 1, the subchannel will provide @@ -496,7 +505,7 @@ the RPC will be queued in the subchannel instead. The following pseudo-code illustrates the expected functionality in the subchannel: -``` +```python # Returns a connection to use for an RPC, or None if no connection is # currently available to send an RPC on. def ChooseConnection(self): From 8f024336f779d81888ea1bdd567d23f992b6e1a2 Mon Sep 17 00:00:00 2001 From: "Mark D. 
Roth" Date: Tue, 2 Dec 2025 16:27:27 +0000 Subject: [PATCH 33/33] fix typo --- A105-max_concurrent_streams-connection-scaling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md index 345634c73..6f7a96d0b 100644 --- a/A105-max_concurrent_streams-connection-scaling.md +++ b/A105-max_concurrent_streams-connection-scaling.md @@ -254,7 +254,7 @@ config, the channel will tell the notification object to use the new value of max_connections_per_subchannel. For the xDS case, when the xds_cluster_impl LB policy is constructed, it -will create a single notification object to used for all subchannels +will create a single notification object to use for all subchannels for the lifetime of that xds_cluster_impl policy instance. Whenever the xds_cluster_impl policy receives an xDS update, it will tell the notification object to use the new value of