diff --git a/A105-max_concurrent_streams-connection-scaling.md b/A105-max_concurrent_streams-connection-scaling.md
new file mode 100644
index 000000000..6f7a96d0b
--- /dev/null
+++ b/A105-max_concurrent_streams-connection-scaling.md
@@ -0,0 +1,758 @@
A105: MAX_CONCURRENT_STREAMS Connection Scaling
----
* Author(s): @markdroth, @dfawley
* Approver: @ejona86
* Status: {Draft, In Review, Ready for Implementation, Implemented}
* Implemented in:
* Last updated: 2025-12-01
* Discussion at: https://groups.google.com/g/grpc-io/c/n9Mi7ZODReE

## Abstract

We propose allowing gRPC clients to automatically establish new
connections to the same endpoint address when they hit the HTTP/2
MAX_CONCURRENT_STREAMS limit for an existing connection.

## Background

HTTP/2 contains a connection-level setting called
[MAX_CONCURRENT_STREAMS][H2MCS] that limits the number of streams a peer
may initiate. This is often set by servers or proxies to protect against
a single client using too many resources. However, in an environment with
reverse proxies or virtual IPs, it may be possible to create multiple
connections to the same IP address that lead to different physical
servers, and this can be a feasible way for a client to achieve more
throughput (QPS) without overloading any server.

Today, the MAX_CONCURRENT_STREAMS limit is visible only inside the
transport, not to anything at the channel layer. This means that the
channel may dispatch an RPC to a subchannel that is already at its
MAX_CONCURRENT_STREAMS limit, even if there are other subchannels that
are not at that limit and would be able to accept the RPC. In this
situation, the transport normally queues the RPC until it can start
another stream (i.e., either when one of the existing streams ends or
when the server increases its MAX_CONCURRENT_STREAMS limit).

Applications today are typically forced to deal with this problem
by creating multiple gRPC channels to the same target (and disabling
subchannel sharing, for implementations that share subchannels between
channels) and doing their own load balancing across those channels.
This is an obvious flaw in the gRPC channel abstraction, because the
channel is intended to hide connection management details like this so
that the application doesn't have to know about them. It is also not a
very flexible solution for the application in a case where the server is
dynamically tuning its MAX_CONCURRENT_STREAMS setting, because the
application cannot see that setting in order to tune the number of
channels it uses.

### Related Proposals:
* [A6: gRPC Retry Design][A6]
* [A9: Server-side Connection Management][A9]
* [A14: Channelz][A14]
* [A32: xDS Circuit Breaking][A32]
* [A61: IPv4 and IPv6 Dualstack Backend Support][A61]
* [A74: xDS Config Tears][A74]
* [A75: xDS Aggregate Cluster Behavior Fixes][A75]
* [A94: OTel Metrics for Subchannels][A94]

[A6]: A6-client-retries.md
[A9]: A9-server-side-conn-mgt.md
[A14]: A14-channelz.md
[A32]: A32-xds-circuit-breaking.md
[A61]: A61-IPv4-IPv6-dualstack-backends.md
[A74]: A74-xds-config-tears.md
[A75]: A75-xds-aggregate-cluster-behavior-fixes.md
[A94]: A94-subchannel-otel-metrics.md
[H2MCS]: https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_CONCURRENT_STREAMS

## Proposal

We will add a mechanism for the gRPC client to dynamically scale up the
number of connections to a particular endpoint address when it hits the
MAX_CONCURRENT_STREAMS limit. This functionality will be implemented at
the subchannel layer.
+ +There are several parts to this proposal: +- Configuration for the max number of connections per subchannel, via + either the [service + config](https://github.com/grpc/grpc/blob/master/doc/service_config.md) + or via xDS. +- A mechanism for the transport to report the current MAX_CONCURRENT_STREAMS + setting to the subchannel layer. +- Connection scaling functionality in the subchannel. +- Functionality to pick a connection for each RPC in the subchannel, + including the ability to queue RPCs while waiting for new connections + to be established. +- Modify pick_first to handle the subchannel transitioning from READY to + CONNECTING state. + +### Configuration + +In the subchannel, connection scaling will be configured via a setting +called max_connections_per_subchannel. That setting will be set either +via the service config or via xDS. The approach for plumbing this +setting into the subchannel will be different in C-core than in Java +and Go; see below for details. + +The max_connections_per_subchannel setting for a given subchannel +can change with each resolver update, regardless of whether it is +set via the service config or via xDS. When this happens, we do +not want to throw away the subchannel and create a new one, since +that would cause unnecessary connection churn. This means that the +max_connections_per_subchannel setting must not be considered part of +the subchannel's unique identity that is set only at subchannel creation +time; instead, it must be changeable over the life of a subchannel. + +If the max_connections_per_subchannel setting is unset, the subchannel +will assume a default of 1, which effectively means the same behavior +as before this gRFC. + +The channel will enforce a maximum limit for the +max_connections_per_subchannel setting. This limit will be 10 by +default, but gRPC will provide a channel-level setting to allow +a client application to raise or lower that limit. Whenever the +max_connections_per_subchannel setting is larger than the channel's limit, +it will be capped to that limit. This capping will be performed in the +subchannel itself, so that it will apply regardless of where the setting +is set. + +#### gRPC Service Config + +In the gRPC service config, connection scaling will be configured via +a new field, as follows (schema shown in protobuf form, although gRPC +actually accepts the service config in JSON form): + +```proto +message ServiceConfig { + // ...existing fields... + + // Settings to control dynamic connection scaling. + message ConnectionScaling { + // Maximum connections gRPC will maintain for each subchannel in + // this channel. When no streams are available for an RPC in a + // subchannel, gRPC will automatically create new connections up + // to this limit. If this value changes during the life of a + // channel, existing subchannels will be updated to reflect + // the change. No connections will be closed as a result of + // lowering this value; down-scaling will only happen as + // connections are lost naturally. + // + // Values higher than the client-enforced limit (by default, 10) + // will be clamped to that limit. + uint32 max_connections_per_subchannel = 1; + } + ConnectionScaling connection_scaling = N; +} +``` + +#### xDS Configuration + +In xDS, the max_connections_per_subchannel setting will be configured via +a per-host circuit breaker in the CDS resource. This uses a similar +structure to the circuit breaker described in [A32]. 
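
For illustration, the relevant portion of a CDS resource configured this
way might look like the following sketch, shown in protobuf text format
(the limit of 4 is an arbitrary example value); the precise field
semantics are specified below.

```textproto
# Illustrative fragment of an envoy.config.cluster.v3.Cluster resource.
circuit_breakers {
  per_host_thresholds {
    # gRPC looks only at the first entry with priority DEFAULT.
    priority: DEFAULT
    # Sets max_connections_per_subchannel for the cluster's endpoints.
    max_connections { value: 4 }
  }
}
```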
+ +In the CDS resource, in the +[circuit_breakers](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/cluster.proto#L885) +field, we will now add support for the following field: +- [per_host_thresholds](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L120): + As in [A32], gRPC will look only at the first entry for priority + [DEFAULT](https://github.com/envoyproxy/envoy/blob/6ab1e7afbfda48911e187c9d653a46b8bca98166/api/envoy/config/core/v3/base.proto#L39). + If that entry is found, then within that entry: + - [max_connections](https://github.com/envoyproxy/envoy/blob/ed76c2e81f428248f682a9a380a4eef476ea4349/api/envoy/config/cluster/v3/circuit_breaker.proto#L59): + If this field is set, then its value will be used to set the + max_connections_per_subchannel setting for all endpoints for the + cluster. If it is unset, then no max_connections_per_subchannel + setting will be set for the cluster's endpoints (i.e., the subchannel + will assume a value of 1 by default). A value of 0 will be rejected + at resource validation time. + +A new field will be added to the parsed CDS resource representation +containing the value of this field. + +The xds_cluster_impl LB policy will be responsible for setting the +max_connections_per_subchannel setting based on this xDS configuration. +Note that it makes sense to do this in the xds_cluster_impl LB policy +instead of the cds policy for two reasons: first, this is where circuit +breaking is already configured, and second, this policy is in the right +location in the LB policy tree regardless of whether [A75] has been +implemented yet. Note that post-[A74], this will not require adding +any new fields in the xds_cluster_impl LB policy configuration. + +#### Config Plumbing in C-core + +In C-core, every time there is a resolver update, the LB policy +calls `CreateSubchannel()` for every address in the new address list. +The `CreateSubchannel()` call returns a subchannel wrapper that holds +a ref to the underlying subchannel. The channel uses a subchannel +pool to store the set of currently existing subchannels: the requested +subchannel is created only if it doesn't already exist in the pool; +otherwise, the returned subchannel wrapper will hold a new ref to the +existing subchannel, so that it doesn't actually wind up creating a new +subchannel (only a new subchannel wrapper). This means that we do not +want the max_connections_per_subchannel setting to be part of the +subchannel's key in the subchannel pool, or else we will wind up +recreating the subchannel whenever the attribute's value changes. + +In addition, by default, C-core's subchannel pool is shared between +channels, meaning that if two channels attempt to create the same +subchannel, they will wind up sharing a single subchannel. In this +case, each channel using a given subchannel may have a different value +for the max_connections_per_subchannel setting. The subchannel will +use the maximum value set for this setting across all channels. + +To support this, the implementation will be as follows: +- The subchannel will store a map from max_connections_per_subchannel + setting to the number of subchannel wrappers currently holding a ref to + it with that value. 
Entries are added to the map whenever the first + ref for a given value of max_connections_per_subchannel is taken, and + entries are removed from the map whenever the last ref for a given + value of max_connections_per_subchannel is removed. Whenever the set + of entries changes, the subchannel will do a pass over the map to find + the new max value to use. +- We will create a new internal-only channel arg called + `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL`, which will be treated + specially by the channel. Specifically, when this attribute is passed + to `CreateSubchannel()`, it will be excluded from the channel args + that are used as a key in the subchannel pool. Instead, when the + subchannel wrapper is instantiated, it will call a new API on the + underlying subchannel to tell it that a new ref is being held for this + value of max_connections_per_subchannel. When the subchannel wrapper + is orphaned, it will call a new API on the underlying subchannel + to tell it that the ref is going away for a particular value of + max_connections_per_subchannel. +- If max_connections_per_subchannel is configured via the service + config, the `GRPC_ARG_MAX_CONNECTIONS_PER_SUBCHANNEL` channel arg will + be set by the channel before passing the resolver update to the LB + policy tree. Individual LB policies (such as the xds_cluster_impl + policy) may override this channel arg before `CreateSubchannel()` + is called. + +#### Config Plumbing in Java and Go + +In Java and Go, there is no subchannel pool, and LB policies will not +call `CreateSubchannel()` for any address for which they already have a +subchannel from the previous address list. There is also no need to +deal with the case of multiple channels sharing the same subchannel. +Therefore, a different approach is called for. + +A notification object will be used to notify the subchannel of the value +of the max_connections_per_subchannel setting. This object will be +passed into the subchannel at creation time and will be used for the +life of the subchannel (i.e., there is no way for the subchannel to swap +to a different notification object later). Multiple subchannels can +share the same notification object; whenever the notification object is +told to use a new value for max_connections_per_subchannel, all +subchannels using that object will be notified of the change. + +The notification object can come from one of two places, depending on +whether the configuration comes from the gRPC service config or from xDS. + +For the service config case, when the channel is constructed, it will +create a single notification object to be used for all subchannels for +the lifetime of the channel. Whenever the resolver returns a service +config, the channel will tell the notification object to use the new +value of max_connections_per_subchannel. + +For the xDS case, when the xds_cluster_impl LB policy is constructed, it +will create a single notification object to use for all subchannels +for the lifetime of that xds_cluster_impl policy instance. Whenever the +xds_cluster_impl policy receives an xDS update, it will tell the +notification object to use the new value of +max_connections_per_subchannel. The notification object will be passed +to the child policy via an attribute that will be passed into the +`CreateSubchannel()` call. + +The channel's implementation of `CreateSubchannel()` will check to see +if the LB policy included a notification object in an attribute. 
If it did, that notification object will be used when creating the
subchannel; otherwise, the channel-level notification object fed from
the service config will be used.

Note that this approach assumes that Java and Go have switched to a
model where there is only one address per subchannel, as per [A61].

### Subchannel Behavior

If max_connections_per_subchannel is 1, the subchannel will provide
essentially the same behavior as before this gRFC. In that case,
implementations should minimize any additional per-RPC overhead at this
layer beyond what already existed prior to this design. In other words,
the connection scaling feature should ideally not affect performance
unless it is actually enabled.

#### Connection Management

The subchannel will establish up to the number of connections specified
by max_connections_per_subchannel. The connections will be stored in a
list ordered by the time at which the connection was established, so
that the oldest connection is at the start of the list.

Each connection will store the peer's MAX_CONCURRENT_STREAMS setting
reported by the transport (see [Interaction Between Transport and
Subchannel](#interaction-between-transport-and-subchannel) below for
details). It will also track how many RPCs are currently in flight on
the connection.

The subchannel will also track at most one in-flight connection attempt,
which will be unset if no connection attempt is currently in flight.

A new connection attempt will be started when all of the
following are true:
- One or more RPCs are queued in the subchannel, waiting for a
  connection to send them on.
- No existing connection in the subchannel has any available streams --
  i.e., the number of RPCs currently in flight on each connection is
  greater than or equal to the connection's MAX_CONCURRENT_STREAMS.
- The number of existing connections in the subchannel is fewer than the
  max_connections_per_subchannel setting.
- There is no connection attempt currently in flight on the subchannel.

Some examples of cases that can trigger a new connection attempt to be
started:
- A new RPC is started on the subchannel and all existing connections
  are already at their MAX_CONCURRENT_STREAMS limit.
- RPCs were already queued on the subchannel and a connection is lost
  or a connection attempt fails.
- RPCs were already queued on the subchannel and a new connection was just
  created that did not provide enough available streams for all pending RPCs.
- The value of the max_connections_per_subchannel setting increases,
  all existing connections are already at their MAX_CONCURRENT_STREAMS
  limit, and there are queued RPCs.

The subchannel will never proactively close a connection once it has
been established. However, when a connection is closed for any reason
(e.g., by the peer), it is removed from the subchannel. If the
application wishes to garbage collect unused connections, it should
configure MAX_CONNECTION_IDLE on the server side, as described in [A9].

If all working connections are terminated, the subchannel will fail
all queued RPCs with status UNAVAILABLE. Note that these RPCs will be
eligible for transparent retries (see [A6]), because no wire traffic
was produced for them.

#### Backoff Behavior and Connectivity State

The subchannel must have no more than one connection
attempt in progress at any given time.
The [backoff
state](https://github.com/grpc/grpc/blob/master/doc/connection-backoff.md)
will be used for all connection attempts in the subchannel, regardless
of how many established connections there are. If a connection attempt
fails, the backoff period must be respected, and must continue to grow
across consecutive failures, before the next attempt is started on any
connection. When a connection attempt succeeds, backoff state will be
reset, just as it is today.

For example, if the subchannel has two existing connections and a
pending connection attempt for a third connection, and one of the
original connections then fails, the subchannel may not start a new
connection attempt immediately, because it already has a connection
attempt in progress. Instead, it must wait for the in-flight connection
attempt to finish. If that attempt fails, then backoff must be performed
before starting the next connection attempt. But if that attempt
succeeds, backoff state will be reset, so if there are still enough
queued RPCs to warrant a third connection, then the subchannel may
immediately start another connection attempt.

The connectivity state of the subchannel should be determined as follows
(first match wins):
- If there is at least one connection established, report READY.
- If there is at least one connection attempt in flight, report
  CONNECTING.
- If the subchannel is in backoff after a failed connection attempt,
  report TRANSIENT_FAILURE.
- Otherwise, report IDLE.

Note that subchannels may exhibit two new state transitions that have not
previously been possible. Today, the only possible transition from state
READY is to state IDLE, but with this design, the following additional
transitions are possible:
- If the subchannel has an existing connection and has a connection
  attempt in flight for a second connection, and then the first
  connection fails before the in-flight connection attempt completes,
  then the subchannel will transition from READY to CONNECTING.
- If the subchannel has an existing connection but is in backoff after
  a failed attempt to establish a second connection, and then the
  original connection fails, the subchannel will transition from READY
  directly to TRANSIENT_FAILURE.

See [PickFirst Changes](#pickfirst-changes) below for details on how
pick_first will handle these new transitions.

#### Picking a Connection for Each RPC

When choosing a connection for an RPC within a subchannel, the following
algorithm will be used (first match wins):
1. Look through all working connections in order from the oldest to the
   newest. For each connection, if the number of RPCs in flight on the
   connection is lower than the peer's MAX_CONCURRENT_STREAMS setting,
   then the RPC will be dispatched on that connection.
2. If the number of existing connections is equal to the
   max_connections_per_subchannel setting, the RPC will be queued.
3. If the number of existing connections is less than the
   max_connections_per_subchannel setting and the subchannel is in
   backoff because the last connection attempt failed, the RPC will be
   queued.
4. If the number of existing connections is less than the
   max_connections_per_subchannel setting and no connection attempt is
   currently in flight, a new connection attempt will be started, and
   the RPC will be queued.

The subchannel must ensure that races do not happen while dispatching
RPCs to a connection.
For example, if two RPCs are initiated at the same
time and only one stream is available on a connection, the two RPCs must
not both be dispatched on that connection. This race can be avoided with
locks or atomics.

When queueing an RPC, the queue must be roughly fair: RPCs must
be dispatched in the order in which they are received into the
queue, acknowledging that timing between threads may lead to
concurrent RPCs being added to the queue in an arbitrary order.
See [Rationale](#rationale) below for a possible adjustment to this
queuing strategy.

When dequeuing an RPC, the subchannel will use the same algorithm
described above that it uses when it first sees the RPC. RPCs will be
dequeued upon the following events:
- When a connection attempt completes successfully.
- When the backoff timer fires.
- When an existing connection fails.
- When the transport for a connection reports a new value for
  MAX_CONCURRENT_STREAMS.
- When an RPC dispatched on one of the connections completes.

Because we are now handling queuing in the subchannel, transport
implementations should no longer need to handle this queuing. The only
case where a transport may see an RPC that it does not have quota to send
is if the MAX_CONCURRENT_STREAMS setting of a connection is lowered by
the server after RPCs are dispatched to the connection. Transports are
encouraged to handle this case by failing RPCs in a way that is eligible
for transparent retries (see [A6]) rather than by queueing the RPC.
Note that implementations should make sure that they do not introduce an
infinite loop here: the transparent retry must not block the subchannel
from processing the subsequent MAX_CONCURRENT_STREAMS update from the
transport. However, no explicit synchronization is needed to guarantee
that the transparent retry happens only after the subchannel has seen
that update.

#### Interaction Between Transport and Subchannel

In order for the subchannel to know when to create a new connection, the
transport will need to report the peer's MAX_CONCURRENT_STREAMS setting
to the subchannel. The transport will need to first report this value
immediately after receiving the initial SETTINGS frame from the peer,
and then it will need to send an update any time it receives a SETTINGS
frame from the peer that changes this particular setting.

Implementations should consider providing some mechanism for flow
control for these updates, to prevent sending updates to the subchannel
faster than it can process them. One possible simple approach would be
for the transport to have only one in-flight update to the subchannel at
any given time, which could be implemented as follows:
- When sending an update to the subchannel, the transport will record
  the MAX_CONCURRENT_STREAMS value that it is reporting to the
  subchannel. It will include a callback with the update, which the
  subchannel must invoke when it has finished processing the update.
- If the transport receives a new MAX_CONCURRENT_STREAMS value from the
  peer before it receives the callback from the subchannel, it will
  not send another update to the subchannel.
- When the transport receives the callback from the subchannel, it will
  check whether the peer's current MAX_CONCURRENT_STREAMS value is
  different from the last value it reported to the subchannel. If so,
  it will start a new update.

Note that this approach means that there may be a noticeable delay
between when the transport sees an update and when the subchannel sees
the update.
If the update is an *increase* in MAX_CONCURRENT_STREAMS, +that won't cause any problems (the worst case is that the subchannel +may create a new connection that it might have been able to avoid). +If the update is a *decrease* in MAX_CONCURRENT_STREAMS, then the +subchannel may dispatch RPCs to the connection that will wind up being +queued in the transport, which is sub-optimal. However, this design +already has a race condition in that case (see [Picking a Connection +for Each RPC](#picking-a-connection-for-each-rpc) above for details), +so this is not an additional problem. + +#### Interaction Between Channel and Subchannel + +Today, when the channel does an LB pick and gets back a subchannel, +it calls a method on that subchannel to get its underlying connection. +There are only two possible results: + +1. The subchannel returns a ref to the underlying connection. In this + case, the channel starts a stream on the returned connection and + forwards the RPC to that stream. +2. The subchannel returns null to indicate that it no longer has a + working connection. This case can happen due to a race between the + LB policy picking the subchannel and the transport seeing a GOAWAY + or disconnection. When this occurs, the channel will queue the RPC + (i.e., it is treated the same as if the picker indicated to queue + the RPC) in the expectation that the LB policy will soon see the + subchannel report the disconnection and will return a new picker, + at which point a new pick will be done for the queued RPC. + +With this design, there is a third outcome possible: the subchannel may +have a set of working connections, but they may all already be at their +MAX_CONCURRENT_STREAMS limits, so the RPC may need to be queued until +the subchannel has a connection that it can send it on. This case can +be handled by having the subchannel return a fake connection object that +queues the RPC in the subchannel. + +Note that the race condition described in case 2 above will now happen +only if the subchannel has no working connections. If there is at least +one working connection, then even if the RPC cannot be sent immediately, +the RPC will be queued in the subchannel instead. + +#### Subchannel Pseudo-Code + +The following pseudo-code illustrates the expected functionality in the +subchannel: + +```python +# Returns a connection to use for an RPC, or None if no connection is +# currently available to send an RPC on. +def ChooseConnection(self): + # Use the oldest connection that can accept a new stream, if any. + for connection in self.connections: + if connection.rpcs_in_flight < connection.max_concurrent_streams: + connection.rpcs_in_flight += 1 + return connection + # If we aren't yet at the max number of connections, see if we can + # create a new one. + if (len(self.connections) < self.max_connections_per_subchannel and + self.connection_attempt is None and + self.pending_backoff_timer is None): + self.StartConnectionAttempt() + # Didn't find a connection for the RPC, so queue it. + return None + +# Get a connection to start a new RPC. +def GetConnection(self): + # If there are no connections, tell the channel to queue the LB pick. + if len(self.connections) == 0: + return None + connection = self.ChooseConnection() + # If we didn't find a connection to use, return a fake connection that + # adds all RPCs to self.queue. + if connection is None: + return FakeConnectionThatQueuesRpcs() + return connection + +# Retries RPCs from the queue, in order. 
def RetryRpcsFromQueue(self):
  while len(self.queue) > 0:
    connection = self.ChooseConnection()
    # Stop at the first RPC that gets queued.
    if connection is None:
      break
    # Otherwise, send the RPC on the connection.
    connection.SendRpc(self.queue.pop(0))

# Starts a new connection attempt.
def StartConnectionAttempt(self):
  if self.backoff_state is None:
    self.backoff_state = BackoffState()
  self.connection_attempt = ConnectionAttempt()

# Called when a connection attempt succeeds.
def OnConnectionAttemptSucceeded(self, new_connection):
  self.connections.append(new_connection)
  self.connection_attempt = None  # The attempt is no longer in flight.
  self.pending_backoff_timer = None
  self.backoff_state = None
  self.RetryRpcsFromQueue()

# Called when a connection attempt fails. This puts us in backoff.
def OnConnectionAttemptFailed(self):
  self.connection_attempt = None
  self.pending_backoff_timer = Timer(self.backoff_state.NextBackoffDelay())

# Called when the backoff timer fires. Will trigger a new connection
# attempt if there are RPCs in the queue.
def OnBackoffTimer(self):
  self.pending_backoff_timer = None
  self.RetryRpcsFromQueue()

# Called when an established connection fails. Will trigger a new
# connection attempt if there are RPCs in the queue.
def OnConnectionFailed(self, failed_connection):
  self.connections.remove(failed_connection)
  if len(self.connections) == 0:
    for rpc in self.queue:
      # RPC will be eligible for transparent retries.
      rpc.Fail(UNAVAILABLE)
    self.queue = []
  else:
    self.RetryRpcsFromQueue()  # Maybe trigger new connection attempt.

# Called when a connection reports a new MAX_CONCURRENT_STREAMS value.
# May send RPCs on the connection if there are any queued and the
# MAX_CONCURRENT_STREAMS value has increased.
def OnConnectionReportsNewMaxConcurrentStreams(
    self, connection, max_concurrent_streams):
  connection.max_concurrent_streams = max_concurrent_streams
  self.RetryRpcsFromQueue()

# Called when an RPC completes on one of the subchannel's connections.
def OnRpcComplete(self, connection):
  connection.rpcs_in_flight -= 1
  self.RetryRpcsFromQueue()

# Called when the max_connections_per_subchannel value changes.
def OnMaxConnectionsPerSubchannelChanged(self, max_connections_per_subchannel):
  self.max_connections_per_subchannel = max_connections_per_subchannel
  self.RetryRpcsFromQueue()
```

### PickFirst Changes

As mentioned above, the pick_first LB policy will need to handle two new
state transitions from subchannels. Previously, a subchannel in READY
state could transition only to IDLE state; now it will be possible for
the subchannel to instead transition to CONNECTING or TRANSIENT_FAILURE
states.

If the selected (READY) subchannel transitions to CONNECTING or
TRANSIENT_FAILURE state, then pick_first will go back into CONNECTING
state. It will start the happy eyeballs pass across all subchannels,
as described in [A61]. Note that this will trigger a re-resolution
request, just as the existing transition from READY to IDLE does.

### Interaction with xDS Circuit Breaking

gRPC currently supports xDS circuit breaking as described in [A32].
Specifically, we support configuring the max number of RPCs in flight
to each cluster. This is done by having the xds_cluster_impl LB policy
increment an atomic counter that tracks the number of RPCs currently in
flight to the cluster, which is later decremented when the RPC ends.
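
As a rough illustration of that existing counter mechanism, in the same
pseudo-code style as the subchannel sketch above (the class and method
names here are illustrative only, not actual gRPC APIs):

```python
import threading

# Per-cluster counter of in-flight RPCs, shared by all pickers created
# for the cluster by the xds_cluster_impl policy (illustrative sketch).
class ClusterCallCounter:
  def __init__(self, max_requests):
    self.max_requests = max_requests
    self.in_flight = 0
    self.lock = threading.Lock()  # Stands in for an atomic counter.

  # Returns True if the RPC may proceed, or False if the circuit
  # breaking limit has been reached and the RPC should be failed.
  def TryStartRpc(self):
    with self.lock:
      if self.in_flight >= self.max_requests:
        return False
      self.in_flight += 1
      return True

  # Called when the RPC ends, regardless of its status.
  def OnRpcComplete(self):
    with self.lock:
      self.in_flight -= 1
```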
+ +Some gRPC implementations don't actually increment the counter until +after a connection is chosen for the RPC, so that they won't erroneously +increment the counter when the picker returns a subchannel for which no +connection is available (the race condition described above). However, +now that there could be a longer delay between when the picker returns +and when a connection is chosen for the RPC, implementations will need +to increment the counter in the picker. It will then be necessary to +decrement the counter when the RPC finishes, regardless of whether it +was actually sent out or whether the subchannel failed it. + +### Channelz Changes + +The channelz API defined in [A14] will need some changes to expose the +state of connection scaling. + +In the `ChannelData` message, we will add a new `uint32 +max_connections_per_subchannel` field, which will be populated only for +subchannels, not channels. + +The channelz data model already supports [multiple sockets per +subchannel](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L80). +Today, a subchannel can have more than one socket when a connection +receives a GOAWAY but remains open until already-in-flight RPCs finish, +while at the same time a new connection is established to send new RPCs. +The current channelz data model does not explicitly indicate which +connection has received the GOAWAY, although that can be inferred from +the channelz socket ID, since the lower ID is older and will therefore +be the one that received the GOAWAY. With this design, that inference +will no longer be possible, because there can now be more than one +established connection in a subchannel. Therefore, we will add a new +`google.protobuf.UInt32Value received_goaway_error` field to the channelz +`SocketData` message, which will be populated if a GOAWAY has been +received, in which case its value will be the HTTP/2 error code from +the received GOAWAY. + +The `SocketData` message already contains +[`streams_started`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L252), +[`streams_succeeded`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L256), +and +[`streams_failed`](https://github.com/grpc/grpc-proto/blob/23f5b568eefcb876e6ebc3b01725f1f20cff999e/grpc/channelz/v1/channelz.proto#L260) +fields, which enable computing the number of streams currently in +flight. However, we will add a new `uint32 peer_max_concurrent_streams` +field that will report the value of the peer's MAX_CONCURRENT_STREAMS +setting; note that this field does not need to be populated in the gRPC +server, only the client. + +Note that all data in `SocketNode` will be reported from the perspective +of the transport, not the perspective of the subchannel. + +### Metrics + +No new metrics will be defined for this feature. However, +note that the additional connections will show up in the +`grpc.subchannel.open_connections` metric defined in [A94]. + +### Temporary environment variable protection + +Enabling this feature via the gRPC service config will initially be +guarded via the environment variable +`GRPC_EXPERIMENTAL_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. + +Enabling this feature via xDS will initially be +guarded via the environment variable +`GRPC_EXPERIMENTAL_XDS_MAX_CONCURRENT_STREAMS_CONNECTION_SCALING`. 
+ +Each of the two configuration mechanisms will be enabled by default once +it has passed interop tests. + +## Rationale + +We considered several different approaches as part of this design process. + +We considered putting the connection scaling functionality in an LB +policy instead of in the subchannel, but we rejected that approach for +several reasons: +- The LB policy API is designed around the idea that returning a new + picker will cause queued RPCs to be replayed, which works fine when an + updated picker is likely to allow most of the queued RPCs to proceed. + However, in a case where the workload is hitting the + MAX_CONCURRENT_STREAMS limit across all connections, we would wind up + updating the picker as each RPC finishes, which would allow only one + of the queued RPCs to continue, thus exhibiting O(n^2) behavior. We + concluded that the LB policy API is simply not well-suited to solve + this particular problem. +- There were some concerns about possibly exposing the + MAX_CONCURRENT_STREAMS setting from the transport to the LB policy + API. However, this particular concern could have been ameliorated + by initially using an internal or experimental API. +- Putting this functionality in the LB policy would have added + complexity due to subchannels being shared between channels in C-core, + where each channel can see only the RPCs that that channel is sending + to the subchannel. + +Given that we decided to add this new functionality in the subchannel, +we intentionally elected to keep it simple, at least to start with. +We considered a design incorporating a more sophisticated -- or pluggable +-- connection selection method but rejected it due to the complexity +of such a mechanism and the desire to avoid creating a duplicate load +balancing ecosystem within the subchannel. + +Potential future improvements that we could consider: +- Additional configuration knobs: + - min_connections_per_subchannel to force clients to create a minimum + number of connections, even if they are not necessary. + - min_available_streams_per_subchannel to allow clients to create new + connections before the hard MAX_CONCURRENT_STREAMS setting is reached. +- Channel arg to limit streams per connection lower than + MAX_CONCURRENT_STREAMS. +- Instead of queuing RPCs in the subchannel, it may be possible to improve + aggregate performance by failing the RPC, resulting in transparent retry + and a re-pick. Without other systems in place, this would lead to a + busy-loop if the same subchannel is picked repeatedly, so this is not + included in this design. + +This design does not handle connection affinity; there is no way to +ensure related RPCs end up on the same connection without setting +max_connections_per_subchannel to 1. For use cases where connection +affinity is important, multiple channels will continue to be necessary +for now. + +This design also does not attempt to maximize throughput of connections, +which would be a far more complex problem. To maximize throughput +effectively, more information about the nature of RPCs would need to +be exposed; e.g., how much bandwidth they may require and how long they +might be expected to last. And unlike this connection scaling design, +that functionality might actually want to be implemented in the LB policy +layer, since it would be totally compatible with the LB policy API. +The goal of *this* design is to simply overcome the stream limits on +connections, hence the simple and greedy connection selection mechanism. 
+ +## Implementation + +Will be implemented in C-core, Java, and Go. diff --git a/A32-xds-circuit-breaking.md b/A32-xds-circuit-breaking.md index 3eafd74b9..908cc8f0d 100644 --- a/A32-xds-circuit-breaking.md +++ b/A32-xds-circuit-breaking.md @@ -6,7 +6,7 @@ * Implemented in: C-core, Java, Go * Last updated: 2021-02-12 * Discussion at: https://groups.google.com/g/grpc-io/c/NEx70p8mcjg - +* Updated by: [gRFC A105: MAX_CONCURRENT_STREAMS Connection Scaling](A105-max_concurrent_streams-connection-scaling.md) ## Abstract