[fix][broker] Fix data corruption issues when TLS is enabled and optimize TLS between Pulsar client and brokers #22810
|
This PR doesn't yet address the root cause, although I had assumed that it did. |
|
It's possible that netty/netty#14086 is needed to address the issues. Tested with Pulsar v3.2.3 patched to use Netty 4.1.111.Final-SNAPSHOT (without macOS or arm64 native libraries). The reproducer https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env doesn't reproduce the problem with the Netty PR 14086 changes. |
|
Finally found the root cause of the TLS data corruption and IndexOutOfBounds issues: netty/netty#14086 (comment) |
|
I'll resume this PR after Netty 4.1.111.Final has been released with the required fixes. |
|
Closing this PR since Netty 4.1.111.Final will address the problematic issue: #22892. I'll create a separate PR to remove the obsolete CopyingEncoder once that is merged.
Fixes #22601 #21892 #19460
This PR replaces #22760
Motivation
In Pulsar, there are multiple reported issues where the transferred output gets corrupted and fails with exceptions around an invalid reader and writer index. One class of these issues occurs only when TLS is enabled between clients and the Pulsar broker, or between the Pulsar broker and bookies.
In Pulsar, ByteBuf instances are shared in this case at least via the broker cache (RangeEntryCacheManagerImpl) and the pending reads manager (PendingReadsManager).
The SslHandler-related issue was originally reported in Pulsar in 2018 in #2401. The fix at that time was #2464.
The ByteBuf `.copy()` method was used to copy the ByteBuf. There hasn't been a similar solution in BookKeeper or the BookKeeper client to address corruption that is caused by the Netty SslHandler.
One of the problems with `.copy()` is that it's inefficient. I have also created a PR to Netty to make SslHandler not mutate input buffers. The PR is netty/netty#14086.
The `Failed to peek sticky key from the message metadata java.lang.IllegalArgumentException: Invalid unknonwn tag type: 4` exceptions are caused by the SslHandler mutation issue between broker and bookies. It also corrupts the data that gets written to BookKeeper, since BookKeeper doesn't check the checksum at write time, only when the data is retrieved from storage. `java.lang.IndexOutOfBoundsException: readerIndex: 31215, writerIndex: 21324 (expected: 0 <= readerIndex <= writerIndex <= capacity(65536))` type of exceptions on the broker side are also symptoms of the same problem.
The root cause of such exceptions could also be different. A shared Netty ByteBuf must at least have an independent view created with `duplicate`, `slice` or `retainedDuplicate` if the readerIndex is mutated.
The ByteBuf instance must also be properly shared in a thread-safe way. Failing to do that could result in similar symptoms, and this PR doesn't fix that.
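To illustrate why a consumer that advances the read position needs its own view of a shared buffer, here is a minimal sketch. It is not Pulsar or Netty code: it uses the JDK's `java.nio.ByteBuffer` as a stand-in for Netty's `ByteBuf` so that it runs without extra dependencies. `ByteBuffer.duplicate()` plays the role of `ByteBuf.duplicate()` here, since it shares the content but keeps its own position (the counterpart of readerIndex). Unlike Netty, the JDK buffer has no reference counting, so there is no equivalent of `retainedDuplicate`. The class name, the `consumeAll` helper (a hypothetical consumer standing in for a handler that mutates its input) and the payload are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SharedBufferViews {

    // Hypothetical consumer: reads everything by advancing the buffer's position,
    // similar in spirit to a handler that mutates the readerIndex of its input.
    static int consumeAll(ByteBuffer buf) {
        int consumed = buf.remaining();
        buf.position(buf.limit());
        return consumed;
    }

    // Hands the shared buffer itself to the consumer and reports how many
    // bytes another reader of the same instance would still see afterwards.
    static int remainingAfterDirectHandoff(ByteBuffer shared) {
        consumeAll(shared);
        return shared.remaining();
    }

    // Hands an independent view to the consumer. The view has its own
    // position, so the shared instance is left untouched.
    static int remainingAfterViewHandoff(ByteBuffer shared) {
        consumeAll(shared.duplicate());
        return shared.remaining();
    }

    // Shows that a view is not a deep copy: a write through the view is
    // visible through the original buffer. The byte is restored afterwards.
    static boolean viewSharesContent(ByteBuffer shared) {
        ByteBuffer view = shared.duplicate();
        byte original = shared.get(0);
        byte changed = (byte) (original + 1);
        view.put(0, changed);
        boolean visible = shared.get(0) == changed;
        view.put(0, original);
        return visible;
    }

    public static void main(String[] args) {
        byte[] payload = "entry-payload".getBytes(StandardCharsets.UTF_8);
        ByteBuffer cachedA = ByteBuffer.wrap(payload.clone());
        ByteBuffer cachedB = ByteBuffer.wrap(payload.clone());

        System.out.println("direct handoff leaves "
                + remainingAfterDirectHandoff(cachedA) + " readable bytes");
        System.out.println("view handoff leaves "
                + remainingAfterViewHandoff(cachedB) + " readable bytes");
        System.out.println("view shares content: " + viewSharesContent(cachedB));
    }
}
```

In this sketch the direct handoff leaves the shared buffer with no readable bytes, while the view handoff leaves all of them readable without copying the payload. An independent view only protects the indices: it does not make concurrent access to the shared content thread-safe.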
Modifications
A `.retainedSlice()` of the ByteBuf needs to be passed to SslHandler so that it doesn't get mutated. A deep copy isn't required.
Use `.retainedSlice()` for the input ByteBuf.
Documentation
doc
doc-required
doc-not-needed
doc-complete