Title: Envoy as ingress L7 load balancer cannot reach full throughput when it receives all (HTTP/2) traffic over a single TCP connection
Description:
Hi Team,
We are running Istio Ingress Gateways as our ingress L7 load balancer. Our typical call flow looks like this:
Client ---> Nginx L4 LB ----> Istio Ingress Gateway L7 LB --> Backend Application.
When clients spread their HTTP/2 traffic across multiple TCP connections, the Istio IGW handles it well.
But a few clients have a TCP connection limit and send all of their heavy HTTP/2 traffic over a single TCP connection, which lands on only a few Istio IGW pods.
These few TCP connections are long-lived, so the Istio IGW eventually hits a bottleneck and we see delays while requests are processed at the Istio IGW.
The Istio IGW here runs with CPU request/limit of 2/2 vCPU and memory request/limit of 2/2 Gi.
Concurrency is left at its default of 2.
Only 2 worker threads are created; when a single TCP connection comes in, it is picked up by one worker thread, and all HTTP/2 calls on that connection are handled entirely by that same thread. This creates a bottleneck where the process never uses more than 1 vCPU even though there is headroom left.
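For reference, the worker-thread count comes from the proxy concurrency setting. A minimal sketch of raising it for the gateway pods via the proxy.istio.io/config annotation is shown below (the Deployment name and the value 4 are illustrative, not our current setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway        # illustrative name
spec:
  template:
    metadata:
      annotations:
        # Per-pod ProxyConfig override; concurrency sets the number of Envoy
        # worker threads (0 lets the proxy size itself to the available cores).
        proxy.istio.io/config: |
          concurrency: 4
    # (rest of the gateway Deployment spec unchanged)

Note that more workers only help when connections are spread across them; per the Envoy threading model (linked at the end), all streams of a single HTTP/2 connection stay pinned to one worker. The ps and top output below show the two workers and the single saturated core: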
[cloud-user@cna9042212mp2-worker-18 ~]$ ps -L -p 9228
PID LWP TTY TIME CMD
9228 9228 ? 00:00:04 envoy
9228 9244 ? 00:00:00 default-executo
9228 9245 ? 00:00:00 resolver-execut
9228 9246 ? 00:00:00 grpc_global_tim
9228 9247 ? 00:00:00 GrpcGoogClient
9228 9248 ? 00:00:00 dog:main_thread
9228 9249 ? 00:00:00 dog:workers_gua
9228 10685 ? 00:00:42 wrk:worker_0
9228 10686 ? 00:10:30 wrk:worker_1
9228 10687 ? 00:00:00 GrpcGoogClient
9228 10688 ? 00:00:00 GrpcGoogClient
9228 10798 ? 00:00:00 AccessLogFlush
9228 12255 ? 00:00:01 AccessLogFlush
[cloud-user@cna9042212mp2-worker-18 ~]$ ps -L -p 9228 -o pid,tid,psr,pcpu,comm
PID TID PSR %CPU COMMAND
9228 9228 26 0.1 envoy
9228 9244 26 0.0 default-executo
9228 9245 27 0.0 resolver-execut
9228 9246 26 0.0 grpc_global_tim
9228 9247 26 0.0 GrpcGoogClient
9228 9248 26 0.0 dog:main_thread
9228 9249 27 0.0 dog:workers_gua
9228 10685 26 1.0 wrk:worker_0
9228 10686 27 16.4 wrk:worker_1
9228 10687 27 0.0 GrpcGoogClient
9228 10688 26 0.0 GrpcGoogClient
9228 10798 26 0.0 AccessLogFlush
9228 12255 26 0.0 AccessLogFlush
[cloud-user@cna9042212mp2-worker-18 ~]$ top
top - 10:24:09 up 2 days, 10:58, 1 user, load average: 1.32, 1.37, 1.24
Tasks: 472 total, 1 running, 309 sleeping, 0 stopped, 0 zombie
%Cpu0 : 2.0 us, 1.6 sy, 0.0 ni, 95.4 id, 0.3 wa, 0.3 hi, 0.3 si, 0.0 st
%Cpu1 : 2.7 us, 2.0 sy, 0.0 ni, 94.7 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 4.3 us, 2.7 sy, 0.0 ni, 92.4 id, 0.0 wa, 0.3 hi, 0.3 si, 0.0 st
%Cpu3 : 1.3 us, 1.3 sy, 0.0 ni, 97.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu4 : 1.3 us, 0.0 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 1.3 us, 0.0 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 1.3 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu7 : 1.0 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 1.0 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.3 us, 0.0 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.3 hi, 2.7 si, 0.0 st
%Cpu11 : 1.7 us, 0.3 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 0.0 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu17 : 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu19 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu22 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 0.0 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
%Cpu24 : 1.0 us, 0.7 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu25 : 0.3 us, 0.7 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 0.3 us, 0.3 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu27 : 96.0 us, 2.3 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.7 hi, 0.7 si, 0.0 st >>>>>>>>>> Check this core
%Cpu28 : 1.0 us, 0.7 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 : 0.3 us, 0.7 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65682864 total, 45634676 free, 8120736 used, 11927452 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 56765072 avail Mem
- Wanted to know whether there is any way to handle these few long-lived TCP connections at the Istio ingress gateway (L7).
- How can the other vCPU cores be used so that the full capacity of the IGW is utilized?
- In this situation, vertically scaling the Istio IGW will not help either.
NOTE: In another architecture of ours, without the Istio service mesh, we use this model:
Client ---> Nginx L4 LB ---->Envoy L7 LB (as ingress L7 LB)--> Backend Application.
Envoy Version used:
"user_agent_name": "envoy",
"user_agent_build_version": {
"version": {
"major_number": 1,
"minor_number": 32,
"patch": 4
},
Already tried updating the parameters below in the listener section:
"max_concurrent_streams": 65535,
"initial_stream_window_size": 33554432,
"initial_connection_window_size": 33554432,
But there was no improvement in performance.
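For context, a minimal sketch of one way such options can be applied to the gateway listener, assuming an EnvoyFilter merge patch on the HttpConnectionManager (the resource name and selector labels are illustrative; the values are the ones listed above):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: ingressgateway-http2-tuning     # illustrative name
  namespace: istio-system
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          http2_protocol_options:
            max_concurrent_streams: 65535
            initial_stream_window_size: 33554432      # 32 MiB
            initial_connection_window_size: 33554432  # 32 MiB

These options raise per-connection stream and flow-control limits, but every stream on a single connection is still decoded by the same worker thread, so by themselves they do not spread the load across cores.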
Hence I would like to know whether there are any known limitations in Envoy when handling maximum-throughput (HTTP/2) traffic on a single TCP connection.
Also, are there any recommended tuning parameters that could improve performance?
Relevant links:
Reference ticket in Istio : istio/istio#58114
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/intro/threading_model