Optimized text for full unicode and some escape sequences (#129169)#7
Optimized text for full unicode and some escape sequences (#129169)#7MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
Conversation
…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| // For now, we can use `len` for `stringLength` because we only support ascii-encoded unescaped strings, | ||
| // which means each character uses exactly 1 byte. | ||
| return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), len); | ||
| return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength); |
There was a problem hiding this comment.
Second call to getValueAsText returns raw bytes with escapes
Medium Severity
When getValueAsText() is called a second time on a string containing escape sequences (\", \\, \/), the cached path at line 54-56 returns raw bytes from _inputBuffer instead of the processed buffer with backslashes removed. The first call creates a new buffer via _finishAndReturnText(), but subsequent calls bypass this and return unprocessed data. This also causes a mismatch between bytes.length() and stringLength, triggering an assertion failure in Text.string().
| } | ||
| ptr += bytesToSkip; | ||
| ++stringLength; | ||
| } |
There was a problem hiding this comment.
Supplementary characters cause stringLength mismatch with String.length()
Medium Severity
The parser increments stringLength by 1 for 4-byte UTF-8 sequences (case 4), but these represent supplementary Unicode characters (codepoints > U+FFFF) which decode to surrogate pairs in Java, making String.length() return 2. This causes the assertion string.length() == stringLength in Text.string() to fail. The test doesn't catch this because randomCodepoint() only generates BMP characters (0-0xFFFF), never 4-byte UTF-8 sequences.


PR_017
Note
Cursor Bugbot is generating a summary for commit 96300a9. Configure here.