Skip to content

Comments

Optimized text for full unicode and some escape sequences (#129169)#7

Open
MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
pr_017_after
Open

Optimized text for full unicode and some escape sequences (#129169)#7
MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
pr_017_after

Conversation

@MitchLewis930
Copy link

PR_017

…9169)

Follow-up to elastic#126492 to apply the json parsing optimization to strings
containing unicode characters and some backslash-escaped characters.

Supporting backslash-escaped strings is tricky as it requires modifying the
string. There are two types of modification: some just remove the backslash
(e.g. \", \\), and some replace the whole escape sequence with a new
character (e.g. \n, \r, \u00e5). In this implementation, the optimization
only supports the first case--removing the backslash. This is done by
making a copy of the data, skipping the backslash. It should still be more
optimized than full String decoding, but it won't be as fast as 
non-backslashed strings where we can directly reference the input bytes.

Relates to elastic#129072.
@MitchLewis930 MitchLewis930 requested a review from Copilot January 31, 2026 00:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes text parsing for JSON strings containing full Unicode characters and certain escape sequences. The optimization allows the parser to return a direct byte reference to the underlying buffer instead of allocating new strings, extending this capability beyond simple ASCII strings.

Changes:

  • Extended the optimized path to support Unicode characters (UTF-8 encoded) and three specific escape sequences (\", \/, \\)
  • Modified the parser logic to track string length separately from byte length to correctly handle multi-byte UTF-8 characters
  • Added escape sequence handling that removes backslashes from supported sequences while maintaining buffer references

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
ESUTF8StreamJsonParser.java Implemented optimized text parsing for Unicode and select escape sequences by tracking character length separately, handling UTF-8 multi-byte sequences, and copying buffers with backslash removal
ESUTF8StreamJsonParserTests.java Added test cases for Unicode strings and escape sequences, replaced test helper with more comprehensive randomized input generation supporting various character types and escape sequences

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

protected int stringEnd = -1;
protected int stringLength;

private final List<Integer> backslashes = new ArrayList<>();
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an ArrayList as an instance field that's cleared and reused could lead to memory retention if large strings with many backslashes are parsed. Consider using a smaller initial capacity or switching to a primitive int array with manual size tracking to reduce allocation overhead and memory footprint for the common case of few or no backslashes.

Suggested change
private final List<Integer> backslashes = new ArrayList<>();
private final List<Integer> backslashes = new ArrayList<>(4);

Copilot uses AI. Check for mistakes.

private int randomCodepoint(boolean includeAscii) {
while (true) {
char val = Character.toChars(randomInt(0xFFFF))[0];
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Character.toChars() returns an array that could contain two elements for supplementary characters (codepoints > 0xFFFF), but the code only takes the first element. Since randomInt(0xFFFF) limits values to the BMP, this works, but the method name 'randomCodepoint' suggests it should support all codepoints. Either rename the method to clarify it only handles BMP characters, or extend the range and handle supplementary characters properly.

Suggested change
char val = Character.toChars(randomInt(0xFFFF))[0];
char val = (char) randomInt(0xFFFF);

Copilot uses AI. Check for mistakes.
boolean doesSupportOptimized = true;
for (int i = 0; i < length; ++i) {
if (forceSupportOptimized == false && randomBoolean()) {
switch (randomInt(9)) {
Copy link

Copilot AI Jan 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 9 in the switch statement is unclear. This should be documented or replaced with a named constant that explains why 10 possible values (0-9) are used, where case 0 handles escape sequences, case 1 handles unicode escapes, and default handles unicode characters. The probability distribution (1/10 for escapes, 1/10 for unicode escapes, 8/10 for unicode chars) appears intentional but is not obvious.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants