Optimized text for full unicode and some escape sequences (#129169) by MitchLewis930 · Pull Request #7 · Signal65/elasticsearch-Bugbot

MitchLewis930 · 2026-01-30T22:06:11Z

PR_017

Note

^{Cursor Bugbot is generating a summary for commit 96300a9. Configure here.}

…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-01-30T22:17:10Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java

-                // For now, we can use `len` for `stringLength` because we only support ascii-encoded unescaped strings,
-                // which means each character uses exactly 1 byte.
-                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), len);
+                return new Text(new XContentString.UTF8Bytes(_inputBuffer, _inputPtr, len), stringLength);


Second call to getValueAsText returns raw bytes with escapes

Medium Severity

When getValueAsText() is called a second time on a string containing escape sequences (\", \\, \/), the cached path at line 54-56 returns raw bytes from _inputBuffer instead of the processed buffer with backslashes removed. The first call creates a new buffer via _finishAndReturnText(), but subsequent calls bypass this and return unprocessed data. This also causes a mismatch between bytes.length() and stringLength, triggering an assertion failure in Text.string().

cursor · 2026-01-30T22:17:10Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java

+                    }
+                    ptr += bytesToSkip;
+                    ++stringLength;
+                }


Supplementary characters cause stringLength mismatch with String.length()

Medium Severity

The parser increments stringLength by 1 for 4-byte UTF-8 sequences (case 4), but these represent supplementary Unicode characters (codepoints > U+FFFF) which decode to surrogate pairs in Java, making String.length() return 2. This causes the assertion string.length() == stringLength in Text.string() to fail. The test doesn't catch this because randomCodepoint() only generates BMP characters (0-0xFFFF), never 4-byte UTF-8 sequences.

cursor bot reviewed Jan 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Optimized text for full unicode and some escape sequences (#129169)#7

Optimized text for full unicode and some escape sequences (#129169)#7
MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
pr_017_after

MitchLewis930 commented Jan 30, 2026 •

edited by cursor bot

Loading

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Jan 30, 2026

Uh oh!

cursor bot Jan 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

MitchLewis930 commented Jan 30, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Jan 30, 2026

Choose a reason for hiding this comment

Second call to getValueAsText returns raw bytes with escapes

Uh oh!

cursor bot Jan 30, 2026

Choose a reason for hiding this comment

Supplementary characters cause stringLength mismatch with String.length()

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MitchLewis930 commented Jan 30, 2026 •

edited by cursor bot

Loading

Second call to `getValueAsText` returns raw bytes with escapes