This project evaluates Java ByteBuffer allocation and de-allocation performance, comparing native allocation with pool-based allocation. It also compares single-threaded and multi-threaded workloads.
With a single thread, pool-based allocation is about 4x faster than native allocation for random capacities and about 13x faster for a fixed capacity.
With multiple threads, pool-based allocation is about 13x faster for random capacities and about 79x faster for a fixed capacity.
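Here, pool-based allocation means reusing previously allocated direct buffers instead of allocating a fresh one for every request and leaving it to the garbage collector. A minimal sketch of the idea, assuming a single fixed capacity (class and method names are illustrative, not this project's actual code):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative sketch: a trivial pool that reuses direct buffers of one
// fixed capacity. Real pools (e.g. with per-size buckets) are more involved.
public class SimpleBufferPool {
    private final ConcurrentLinkedDeque<ByteBuffer> free = new ConcurrentLinkedDeque<>();
    private final int capacity;

    public SimpleBufferPool(int capacity) {
        this.capacity = capacity;
    }

    // The "native" path would instead call ByteBuffer.allocateDirect(capacity)
    // on every use and let GC reclaim the buffer.
    public ByteBuffer acquire() {
        ByteBuffer buf = free.pollFirst();
        return buf != null ? buf : ByteBuffer.allocateDirect(capacity);
    }

    public void release(ByteBuffer buf) {
        buf.clear();          // reset position/limit before reuse
        free.offerFirst(buf); // return to the free list
    }
}
```

A real pool for random capacities would typically bucket free buffers by size; this sketch only shows the reuse pattern being measured.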
The test results are below.
Single thread:

| Scenario | Native (ms) | Pool-based (ms) | Improvement |
|---|---|---|---|
| Random capacity | 7107 | 1714 | 4x |
| Fixed capacity | 14114 | 1082 | 13x |
Multiple threads:

| Scenario | Native (ms) | Pool-based (ms) | Improvement |
|---|---|---|---|
| Random capacity | 969873 | 73861 | 13x |
| Fixed capacity | 1007342 | 12719 | 79x |
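The timings above come from repeatedly allocating (or acquiring and releasing) buffers and measuring wall-clock time. A rough sketch of such a harness, reusing the illustrative SimpleBufferPool above (the iteration count and API are assumptions; the linked project contains the real code):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AllocBenchmark {
    static final int ITERATIONS = 1_000_000; // per thread; illustrative count

    // Runs `alloc` ITERATIONS times on each of `threads` threads and
    // returns the elapsed wall-clock time in milliseconds.
    static long time(int threads, Runnable alloc) throws InterruptedException {
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        long start = System.currentTimeMillis();
        for (int t = 0; t < threads; t++) {
            exec.submit(() -> {
                for (int i = 0; i < ITERATIONS; i++) {
                    alloc.run(); // native allocateDirect, or pool acquire/release
                }
            });
        }
        exec.shutdown();
        exec.awaitTermination(1, TimeUnit.HOURS);
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleBufferPool pool = new SimpleBufferPool(4096);
        long nativeMs = time(8, () -> ByteBuffer.allocateDirect(4096));
        long pooledMs = time(8, () -> pool.release(pool.acquire()));
        System.out.printf("native %d ms, pooled %d ms%n", nativeMs, pooledMs);
    }
}
```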
Please refer to this project: https://github.com/heyuanliu-intel/AirliftPerformance
While benchmarking the integration of wildfly-openssl with Airlift, we found that the bottleneck was ByteBuffer allocation and de-allocation.
System CPU utilization was only about 20%, and most threads were blocked. See the thread dump: https://raw.githubusercontent.com/heyuanliu-intel/AirliftPerformance/main/threaddump/thread1.txt
The thread dump shows that most threads were blocked on ByteBuffer allocation and de-allocation, so we changed the code from native allocation/de-allocation to pool-based allocation. A sketch of the change is below, followed by the benchmarking results for both approaches.
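A sketch of what this change can look like at the allocation call site, again reusing the illustrative SimpleBufferPool (the actual Airlift/wildfly-openssl integration code differs):

```java
import java.nio.ByteBuffer;

public class AllocationCallSite {
    // Before: allocate a fresh direct buffer per operation; GC reclaims it.
    static ByteBuffer nativeAllocate(int capacity) {
        return ByteBuffer.allocateDirect(capacity);
    }

    // After: borrow a buffer from the pool and return it when finished.
    static void pooledAllocate(SimpleBufferPool pool) {
        ByteBuffer buf = pool.acquire();
        try {
            // ... fill buf and hand it to the SSL engine ...
        } finally {
            pool.release(buf); // reuse instead of re-allocating
        }
    }
}
```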
With the native approach, the benchmark result is below. System CPU utilization was about 20%.
```
./run_wrk.sh
Running 2m test @ https://localhost:9300/v1/service
  128 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.03ms   33.08ms    2.00s    99.79%
    Req/Sec     1.46k   817.09     5.63k    60.69%
  11666325 requests in 2.00m, 1.71GB read
  Socket errors: connect 0, read 0, write 0, timeout 3936
Requests/sec:  97139.29
Transfer/sec:     14.54MB
```
With the pool-based approach, the benchmark result is below. System CPU utilization was about 80%.
```
./run_wrk.sh
Running 2m test @ https://localhost:9300/v1/service
  128 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    46.23ms   66.43ms  558.99ms   82.74%
    Req/Sec     3.30k     1.76k    19.07k    70.27%
  50158401 requests in 2.00m, 7.33GB read
Requests/sec: 417646.32
Transfer/sec:     62.53MB
```
Based on these results, pool-based allocation yields roughly a 4x throughput improvement (417646.32 / 97139.29 ≈ 4.3).
| Metric | Native | Pool-based | Improvement |
|---|---|---|---|
| Requests/sec | 97139.29 | 417646.32 | 4x |