PEP 822: d-string draft implementation #108
base: main
Conversation
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @Parser/action_helpers.c:
- Around line 1295-1392: unicodewriter_write_line currently writes trailing
backslash+newline sequences verbatim for non-raw d-strings, causing newline
suppression to be ignored; modify unicodewriter_write_line to, when is_raw is
false, detect a final "\\\n" at the end of the provided segment and drop the
backslash and newline before writing (so non-raw d-strings and the dedent logic
in _PyPegen_dedent_string_part preserve line-continuation behavior), while
leaving the raw path unchanged (the early is_raw short-circuit should still
bypass this trimming).
In @Parser/string_parser.c:
- Line 284: Update the comment that currently reads "checks the prefix is
consistant." to fix the typo by changing "consistant" to "consistent" so the
comment reads "checks the prefix is consistent." This change is in the comment
string present in Parser/string_parser.c.
- Around line 337-344: The code passes raw=1 to _PyPegen_decode_string in the
non-raw branch so escapes are not processed and also fails to discard the
PyUnicodeWriter on the error return; change the call to
_PyPegen_decode_string(p, 0, line_start, line_end - line_start, token) (or use
the existing is_raw false variable) so escape sequences are handled, and on the
error path before returning NULL call PyUnicodeWriter_Discard(w) to free writer
state (ensure you still Py_XDECREF(line) as currently done).
- Around line 280-292: The code reads endline[0] without ensuring indent_len>0
which can read past the buffer for cases like d"""\nfoo\n"""; add an early check
for indent_len == 0 before computing indent_char/endline loop and raise a syntax
error using RAISE_ERROR_KNOWN_LOCATION (use PyExc_SyntaxError or equivalent)
with the same token position info (token->end_lineno, token->end_col_offset
offsets) and a message like "d-strings require trailing indentation to determine
dedent level" so the function returns NULL instead of reading out-of-bounds.
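For context on the first item, the continuation rule being restored matches today's string literals: a trailing backslash suppresses the newline in non-raw triple-quoted strings but is kept verbatim in raw ones. A quick check with plain Python (d-strings themselves are not parseable yet):

```python
# A backslash immediately before a newline is a line continuation in a
# non-raw literal: both characters are dropped from the value.
src_nonraw = '"""a\\\nb"""'   # source text: a, backslash, newline, b
assert eval(src_nonraw) == "ab"

# In a raw literal the same two characters are kept verbatim.
src_raw = 'r"""a\\\nb"""'
assert eval(src_raw) == "a\\\nb"
```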
🧹 Nitpick comments (3)
Parser/action_helpers.c (1)
1394-1572: Dedent-aware f-/t-string plumbing is structurally sound, but `is_raw` is overloaded

The integration between `_get_resized_exprs()` and `_PyPegen_decode_fstring_part()` (handling `is_first`, `indent_char`, and `dedent_count`) looks correct: enforcing
- "d-string must start with a newline" via the first Constant,
- "d-string must end with an indent line" and computing a uniform `indent_char`/`indent_count` from the last Constant,
- and reusing that dedent context across all constant pieces of the f-/t-string.

One readability nit is that `is_raw` now encodes both the actual `r`-prefix and the "no backslashes in this piece" optimization:

  is_raw = is_raw || strchr(bstr, '\\') == NULL;

This is fine functionally, but separating these concerns into `has_raw_prefix` vs. `no_backslashes` (and deriving an internal `decode_raw` flag from them) would make the control flow clearer, especially as more d-string edge cases accumulate.

Lib/test/test_dstring.py (1)
4-13: Avoid shadowing built-in `str` in the helper

In `assertAllRaise`, the loop variable is named `str`, which shadows the built-in type. It's harmless here but slightly obscures tracebacks and REPL debugging; renaming it to something like `s` or `expr` would be clearer.

Small naming tweak

- def assertAllRaise(self, exception_type, regex, error_strings):
-     for str in error_strings:
-         with self.subTest(str=str):
-             with self.assertRaisesRegex(exception_type, regex) as cm:
-                 eval(str)
+ def assertAllRaise(self, exception_type, regex, error_strings):
+     for expr in error_strings:
+         with self.subTest(expr=expr):
+             with self.assertRaisesRegex(exception_type, regex) as cm:
+                 eval(expr)

Parser/string_parser.c (1)
381-384: Minor style inconsistency

Line 382 is missing a space after `=` (compare line 371: `quote =(unsigned char)*++s;` vs `quote = (unsigned char)*++s;`). This matches the existing inconsistency on line 371, so it may be intentional to match, but worth noting.
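The suggested separation of `is_raw` can be sketched in a few lines of Python (the names `has_raw_prefix`, `no_backslashes`, and `decode_raw` are the reviewer's proposed identifiers, not code in the PR):

```python
def decode_raw(has_raw_prefix: bool, piece: str) -> bool:
    # Keep the r-prefix flag and the "no backslashes" fast-path separate,
    # then derive the single flag that controls escape decoding.
    no_backslashes = "\\" not in piece
    return has_raw_prefix or no_backslashes

assert decode_raw(False, "plain text") is True   # fast path: nothing to decode
assert decode_raw(False, "tab:\\t") is False     # escapes must be processed
assert decode_raw(True, "tab:\\t") is True       # r-prefix wins regardless
```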
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- Lib/test/test_dstring.py
- Lib/test/test_tokenize.py
- Lib/tokenize.py
- Parser/action_helpers.c
- Parser/lexer/lexer.c
- Parser/string_parser.c
🧰 Additional context used
🧬 Code graph analysis (3)
Parser/action_helpers.c (2)
  Parser/string_parser.c (1)
    _PyPegen_decode_string (241-248)
  Parser/pegen.h (1)
    RAISE_ERROR_KNOWN_LOCATION (187-200)
Parser/string_parser.c (3)
  Parser/pegen.h (1)
    RAISE_ERROR_KNOWN_LOCATION (187-200)
  Objects/unicode_writer.c (6)
    PyUnicodeWriter_Create (157-185)
    PyUnicodeWriter_Discard (188-195)
    PyUnicodeWriter_WriteChar (316-326)
    PyUnicodeWriter_WriteUTF8 (531-548)
    PyUnicodeWriter_WriteStr (360-380)
    PyUnicodeWriter_Finish (625-632)
  Include/refcount.h (1)
    Py_XDECREF (510-515)
Parser/lexer/lexer.c (1)
  Parser/tokenizer/helpers.c (1)
    _PyTokenizer_syntaxerror (65-74)
🪛 GitHub Actions: Tests
Lib/test/test_dstring.py
[error] 41-41: test_simple_dstring failed. Expected ' hello world' but got ' hello world\\n'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: Sanitizers / TSan
- GitHub Check: Sanitizers / UBSan
- GitHub Check: Sanitizers / TSan (free-threading)
- GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04-arm)
- GitHub Check: Ubuntu / build and test (ubuntu-24.04-arm)
- GitHub Check: Windows / Build and test (Win32)
- GitHub Check: Windows (free-threading) / Build and test (x64)
- GitHub Check: Windows / Build and test (x64)
- GitHub Check: Windows (free-threading) / Build and test (arm64)
- GitHub Check: Windows / Build and test (arm64)
- GitHub Check: WASI / build and test
- GitHub Check: Hypothesis tests on Ubuntu
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
- GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
- GitHub Check: Check if generated files are up to date
🔇 Additional comments (9)
Lib/tokenize.py (1)
85-92: d-string prefixes list looks consistent with lexer behavior

The extended `_valid_string_prefixes` set (adding `d` and `df`/`dt`/`dr`/`dfr`/`dtr`) matches the new dedent-string semantics and should stay in sync with the prefix handling in `Parser/lexer/lexer.c`. No functional issues spotted here.

Lib/test/test_tokenize.py (1)
3415-3455: Prefix-discovery test correctly updated for multiline-only d-strings

Switching `eval` to use `"""\n"""` ensures `determine_valid_prefixes()` only treats `d`-prefixed forms as valid in the multiline (triple-quoted) case, which matches the d-string requirements and the updated prefix set in `tokenize.StringPrefix`. The change is sound.

Parser/lexer/lexer.c (2)
456-505: Prefix compatibility matrix correctly extended for d-strings

The updated `maybe_raise_syntax_error_for_string_prefixes()`:
- keeps existing constraints (no `u` combos, no `b` with `f`/`t`, no `f`+`t`),
- newly forbids `u`+`d` and `b`+`d`,

which matches the intended valid prefix set (`d`, `df`, `dt`, `dr`, `dfr`, `dtr` plus permutations, but no `ud`, `bd`, etc.) and stays aligned with `Lib/tokenize._all_string_prefixes`. The behavior here looks correct.

747-792: f/t/d-string prefix parsing and "d must be triple-quoted" check look correct

The identifier/prefix scanner now tracks `saw_d` and passes it into `maybe_raise_syntax_error_for_string_prefixes`, then:
- routes only `f`/`t`-containing prefixes into the `f_string_quote` path (`df`, `dt`, `dfr`, `dtr`, plus permutations),
- leaves `d`/`dr` prefixes for plain d-strings to the regular string path.

Inside `f_string_quote`, the new prefix walk:
- sets `string_kind` to `TSTRING` when a `t`/`T` is present,
- sets `raw` when an `r`/`R` is present,
- enforces that any `d`/`D` in the prefix requires `quote_size == 3`, otherwise returning a syntax error ("d-string must be a multiline string").

This neatly enforces the syntactic constraints for dedented f-/t-strings without regressing existing f-string behavior.

Also applies to: 1061-1138
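Both observations are easy to model. The sketch below enumerates prefix spellings the way the tokenize prefix helper does (the `prefix_variants` name is illustrative), and checks that today's compiler already rejects the analogous invalid combinations; the new `u`+`d` and `b`+`d` rejections cannot be exercised until `d` lands:

```python
import itertools

def prefix_variants(prefix):
    # Every ordering of the letters crossed with every per-character casing,
    # mirroring how the accepted-prefix set is built.
    variants = set()
    for perm in itertools.permutations(prefix):
        for casing in itertools.product(*[(c.lower(), c.upper()) for c in perm]):
            variants.add("".join(casing))
    return variants

assert prefix_variants("rb") == {"rb", "rB", "Rb", "RB", "br", "bR", "Br", "BR"}
assert len(prefix_variants("dfr")) == 48  # 3! orderings x 2^3 casings

# Combinations the lexer rejects today; the review extends the same matrix.
for bad in ('ub""', 'bf""', 'fu""'):
    try:
        compile(bad, "<prefix-test>", "eval")
    except SyntaxError:
        pass
    else:
        raise AssertionError(f"{bad} unexpectedly compiled")
```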
Parser/string_parser.c (5)
250-264: LGTM! The initial validation correctly checks for empty string before accessing `s[0]`, and the error message provides clear feedback about the newline requirement.

266-278: LGTM! The backwards search for the last newline is correctly guarded by the assert, and non-whitespace characters after the last newline are properly rejected.

316-323: LGTM! The empty line handling correctly allows lines with fewer characters than `indent_len` if they contain only a newline, which permits blank lines within the d-string content.

424-429: LGTM! Correctly enforces that d-strings must be triple-quoted with a clear error message.

433-454: LGTM! The assert correctly documents that bytes mode and dedent mode are mutually exclusive, and the routing to `_PyPegen_dedent_string` is properly placed.
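The blank-line allowance reviewed at 316-323 matches the behavior of `textwrap.dedent`, which is a useful mental model for the dedent semantics:

```python
import textwrap

# The empty middle line is shorter than the 4-space margin, yet it does not
# break dedenting -- whitespace-only lines are simply passed through.
text = "    a\n\n    b\n"
assert textwrap.dedent(text) == "a\n\nb\n"
```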
Parser/string_parser.c
Outdated
  // Now, prefix is both the dedent indentation and the end of the d-string body.
  Py_ssize_t indent_len = s + len - endline;
  int indent_char = endline[0]; // ' ', '\t', or '\0'.

  // checks the prefix is consistant.
  for (Py_ssize_t i = 1; i < indent_len; i++) {
      if (endline[i] != indent_char) {
          RAISE_ERROR_KNOWN_LOCATION(
              p, PyExc_TabError, token->end_lineno, token->end_col_offset - 3, token->end_lineno, token->end_col_offset - 2,
              "inconsistent use of tabs and spaces in indentation");
          return NULL;
      }
  }
Edge case: indent_len == 0 leads to undefined behavior.
If the d-string ends with just a newline and no trailing whitespace (e.g., d"""\nfoo\n"""), then indent_len is 0 and endline points to s + len. Accessing endline[0] on line 282 reads past the string boundary.
Consider adding a check for indent_len == 0 and raising an appropriate syntax error (d-strings require trailing indentation to determine dedent level).
🔎 Proposed fix
  // Now, prefix is both the dedent indentation and the end of the d-string body.
  Py_ssize_t indent_len = s + len - endline;
+ if (indent_len == 0) {
+     RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError,
+         token->end_lineno, token->end_col_offset - 3, token->end_lineno, token->end_col_offset - 2,
+         "d-string must have trailing indentation before closing quotes");
+     return NULL;
+ }
  int indent_char = endline[0]; // ' ', '\t', or '\0'.
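The dedent-level computation and the `indent_len == 0` edge case can be modeled in Python (the `trailing_indent` helper is purely illustrative, not part of the PR):

```python
def trailing_indent(body):
    # Take the run of characters after the last newline; that run is both the
    # dedent prefix and the indentation of the closing quotes.
    last_nl = body.rfind("\n")
    assert last_nl != -1, "d-string body must contain a newline"
    indent = body[last_nl + 1:]
    if not indent:
        return None  # closing quotes at column 0: nothing to read at indent[0]
    assert set(indent) <= {" ", "\t"}
    return indent

assert trailing_indent("\n    foo\n    ") == "    "
assert trailing_indent("\nfoo\n") is None  # the flagged edge case
```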
Parser/string_parser.c
Outdated
  Py_ssize_t indent_len = s + len - endline;
  int indent_char = endline[0]; // ' ', '\t', or '\0'.

  // checks the prefix is consistant.
Typo: "consistant" → "consistent".
- // checks the prefix is consistant.
+ // checks the prefix is consistent.
Force-pushed 5ce8afc to 21cc35a
📝 Walkthrough

Adds support for a new dedented d-string literal: the lexer recognizes the d-string prefixes, and the parser dedents the string body.

Changes

Sequence Diagram(s)

sequenceDiagram
participant L as Lexer (lexer.c)
participant P as Parser (string_parser.c)
participant A as ActionHelpers (action_helpers.c)
participant U as UnicodeHelper (unicodeobject.c)
L->>P: Emit string token with saw_d flag
P->>P: Validate triple-quote & dedent preconditions
alt invalid d-string
P->>P: Raise SyntaxError ("d-string must be triple-quoted" / "must start with a newline" / indentation error)
else valid d-string
P->>U: Call _Py_search_longest_common_leading_whitespace(src, end)
U-->>P: Return common indent start/length
P->>P: Build dedented bytes (_PyPegen_dedent_string)
P->>A: Decode parts via _PyPegen_decode_fstring_part(is_first, indent, indent_len, ...)
A->>U: Use _Py_search_longest_common_leading_whitespace for part lines as needed
A-->>P: Return decoded/concatenated parts
P-->>L: Produce final string token / AST node
end
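A pure-Python model of what `_Py_search_longest_common_leading_whitespace` is described as computing (the real helper works on raw buffers in `Objects/unicodeobject.c`; this sketch only illustrates the idea):

```python
def longest_common_leading_whitespace(lines):
    # Shrink a candidate prefix to the longest leading run of spaces/tabs
    # shared by every non-blank line.
    common = None
    for line in lines:
        if line.strip() == "":
            continue  # blank lines do not constrain the result
        ws = line[:len(line) - len(line.lstrip(" \t"))]
        if common is None:
            common = ws
            continue
        limit = min(len(common), len(ws))
        i = 0
        while i < limit and common[i] == ws[i]:
            i += 1
        common = common[:i]
    return common or ""

assert longest_common_leading_whitespace(["    a", "  b", ""]) == "  "
assert longest_common_leading_whitespace(["\t\ta", "\tb"]) == "\t"
```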
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Parser/lexer/lexer.c (1)
483-495: Missing incompatibility check between 'b' and 'd' prefixes

The code adds incompatibility checks for `u` and `d`, but doesn't check for `b` and `d` combinations. The `d` prefix is for dedented strings which should produce `str`, not `bytes`. A `bd` or `db` prefix combination should be invalid.

🐛 Proposed fix

  if (saw_u && saw_d) {
      RETURN_SYNTAX_ERROR("u", "d");
  }
+ if (saw_b && saw_d) {
+     RETURN_SYNTAX_ERROR("b", "d");
+ }
  if (saw_b && saw_f) {
      RETURN_SYNTAX_ERROR("b", "f");
🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1475-1517: After creating temp_bytes via PyBytesWriter_Finish,
check the return value of _PyArena_AddPyObject(p->arena, temp_bytes); if it
fails (non-zero) then DECREF/clean up temp_bytes (e.g., Py_DECREF(temp_bytes) or
PyBytesWriter_Discard equivalent) and return NULL instead of proceeding to use
temp_bytes; ensure you only call PyBytes_AsString(temp_bytes) after a successful
_PyArena_AddPyObject to avoid leaking temp_bytes on arena insertion failure and
to propagate the existing exception.
♻️ Duplicate comments (2)
Parser/action_helpers.c (1)
1295-1311: Process escapes in non-raw dedented lines

This branch runs when `is_raw` is false, but `raw=1` skips escape decoding (e.g., `\n` stays literal).

Proposed fix

- PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
+ PyObject *line = _PyPegen_decode_string(p, 0, line_start, line_end - line_start, token);

Lib/test/test_dstring.py (1)
29-34: Tests correctly verify d-string dedentation semantics

The test expectations align with the d-string specification: dedentation is based on the minimum common leading whitespace. Lines 33-34 correctly distinguish between non-raw (backslash-newline as continuation) and raw (backslash-newline preserved) behavior. The failing test at line 33 reflects the implementation bug noted in previous reviews.
🧹 Nitpick comments (1)
Lib/test/test_dstring.py (1)
5-12: Avoid shadowing built-in `str` and remove debug code

The parameter name `str` shadows Python's built-in `str` type, which can cause confusion and potential issues. Also, the commented debug print statements should be removed before merging.

♻️ Proposed fix

- def assertAllRaise(self, exception_type, regex, error_strings):
-     for str in error_strings:
-         with self.subTest(str=str):
+ def assertAllRaise(self, exception_type, regex, error_strings):
+     for expr in error_strings:
+         with self.subTest(expr=expr):
          with self.assertRaisesRegex(exception_type, regex) as cm:
-             eval(str)
-             # print("Testing expression:", repr(str))
-             # print(repr(cm.exception))
-             # print(repr(cm.exception.text))
+             eval(expr)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@Parser/string_parser.c`:
- Line 392: Rename the local variable _prev_call_invald to _prev_call_invalid in
Parser/string_parser.c and update all its usages (the assignments/reads
currently referencing _prev_call_invald around the earlier usages) so they
reference _prev_call_invalid instead (also update the declaration "int
_prev_call_invald = p->call_invalid_rules;" and every subsequent use at the
three places noted).
- Around line 269-290: The inner loop in Parser/string_parser.c that compares
s[i] to indent (inside the for (; s < end; lineno++) block) can read past the
buffer when the remaining bytes are shorter than indent_len; modify the loop in
the function to check bounds (ensure s + i < end) before accessing s[i] and
break/handle as a short-line case, and also guard the subsequent check that uses
s[i] after the loop (the `if (s[i] == '\n')` branch) so it only runs when i <
remaining_length; on short lines call the same error/empty-line handling path
(using PyBytesWriter_Discard, RAISE_ERROR_KNOWN_LOCATION, or writing the newline
via PyBytesWriter_WriteBytes) as appropriate.
  for (; s < end; lineno++) {
      Py_ssize_t i;
      for (i = 0; i < indent_len; i++) {
          if (s[i] != indent[i]) {
              if (s[i] == '\n') {
                  break; // empty line
              }
              PyBytesWriter_Discard(w);
              RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
                  "d-string missing valid indentation");
              return NULL;
          }
      }

      if (s[i] == '\n') { // found an empty line with newline.
          if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
              PyBytesWriter_Discard(w);
              return NULL;
          }
          s += i+1;
          continue;
      }
Potential buffer overread when processing lines shorter than indent_len.
The inner loop accesses s[i] without verifying s + i < end. If the remaining content is shorter than indent_len (e.g., an incomplete last line or edge cases), this reads past the buffer. Similarly, line 283 accesses s[i] where i == indent_len after the loop completes, which could also exceed bounds.
🔎 Proposed fix: Add bounds check in inner loop
for (; s < end; lineno++) {
Py_ssize_t i;
- for (i = 0; i < indent_len; i++) {
+ for (i = 0; i < indent_len && s + i < end; i++) {
if (s[i] != indent[i]) {
if (s[i] == '\n') {
break; // empty line
}
PyBytesWriter_Discard(w);
RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
"d-string missing valid indentation");
return NULL;
}
}
- if (s[i] == '\n') { // found an empty line with newline.
+ if (s + i >= end) {
+ break; // reached end of content
+ }
+ else if (s[i] == '\n') { // found an empty line with newline.
if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
PyBytesWriter_Discard(w);
return NULL;
}
s += i+1;
continue;
      }
  /* Avoid invoking escape decoding routines if possible. */
  rawmode = rawmode || strchr(s, '\\') == NULL;

  int _prev_call_invald = p->call_invalid_rules;
Typo: _prev_call_invald → _prev_call_invalid.
- int _prev_call_invald = p->call_invalid_rules;
+ int _prev_call_invalid = p->call_invalid_rules;

Update all usages on lines 404, 452, and 467 accordingly.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1295-1311: The bug is that unicodewriter_write_line calls
_PyPegen_decode_string with a hardcoded raw=1, preventing escape processing for
non-raw strings; update the call in unicodewriter_write_line to pass the correct
raw flag (use 0 or the is_raw variable) instead of 1 so escapes (e.g.
backslash-newline) are decoded; ensure the call to
_PyPegen_decode_string(line_start, line_end - line_start, token) uses the proper
raw parameter consistent with is_raw.
In `@Parser/lexer/lexer.c`:
- Around line 483-485: The parser's prefix validation misses the incompatible
combination of byte and dedent prefixes: add a check that if saw_b && saw_d then
call RETURN_SYNTAX_ERROR("b", "d") (place it alongside the existing saw_b &&
saw_f / saw_b && saw_t checks in the same validation function), and update the
nearby comment that lists unsupported prefix combos to include "bd" so the
supported/incompatible list accurately reflects this case; reference the boolean
flags saw_b and saw_d and the macro/function RETURN_SYNTAX_ERROR to locate where
to insert the check and update the comment.
♻️ Duplicate comments (1)
Parser/action_helpers.c (1)
1538-1543: Check `_PyArena_AddPyObject` return value to prevent memory leak

If arena insertion fails, `temp_bytes` leaks while an exception is set.

Proposed fix

  PyObject *temp_bytes = PyBytesWriter_Finish(w);
  if (temp_bytes == NULL) {
      return NULL;
  }
- _PyArena_AddPyObject(p->arena, temp_bytes);
+ if (_PyArena_AddPyObject(p->arena, temp_bytes) < 0) {
+     Py_DECREF(temp_bytes);
+     return NULL;
+ }
  const char *temp_str = PyBytes_AsString(temp_bytes);
🧹 Nitpick comments (1)
Parser/action_helpers.c (1)
1448-1454: Consider moving the extern declaration to a shared header

Declaring `extern` functions inline in `.c` files works but risks signature drift if the definition in `Objects/unicodeobject.c` changes. Consider adding this declaration to `pycore_unicodeobject.h` or a similar internal header for better maintainability.
  static int
  unicodewriter_write_line(Parser *p, PyUnicodeWriter *w, const char *line_start, const char *line_end,
                           int is_raw, Token* token)
  {
      if (is_raw || memchr(line_start, '\\', line_end - line_start) == NULL) {
          return PyUnicodeWriter_WriteUTF8(w, line_start, line_end - line_start);
      }
      else {
          PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
          if (line == NULL || PyUnicodeWriter_WriteStr(w, line) < 0) {
              Py_XDECREF(line);
              return -1;
          }
          Py_DECREF(line);
      }
      return 0;
  }
Critical: Wrong raw parameter passed to _PyPegen_decode_string.
Line 1303 passes 1 (raw=true) to _PyPegen_decode_string, but this is in the else branch where is_raw is known to be false. This means escape sequences like \n, \t, etc. won't be processed for non-raw d-strings - they'll be written as literal characters.
This is likely the root cause of the line continuation issue flagged in past reviews. When escape processing is skipped, backslash-newline continuation won't work.
Proposed fix
static int
unicodewriter_write_line(Parser *p, PyUnicodeWriter *w, const char *line_start, const char *line_end,
int is_raw, Token* token)
{
if (is_raw || memchr(line_start, '\\', line_end - line_start) == NULL) {
return PyUnicodeWriter_WriteUTF8(w, line_start, line_end - line_start);
}
else {
- PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
+ PyObject *line = _PyPegen_decode_string(p, 0, line_start, line_end - line_start, token);
if (line == NULL || PyUnicodeWriter_WriteStr(w, line) < 0) {
Py_XDECREF(line);
return -1;
}
Py_DECREF(line);
}
return 0;
}
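The step the wrong flag skips is essentially `unicode_escape` decoding, which is easy to see in isolation:

```python
import codecs

# With escape processing, the two characters backslash + 'n' collapse into a
# newline; passing raw=1 leaves them untouched, which is the reported bug.
assert codecs.decode("a\\nb", "unicode_escape") == "a\nb"
assert "a\\nb" != "a\nb"  # raw-style passthrough is observably different
```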
  if (saw_u && saw_d) {
      RETURN_SYNTAX_ERROR("u", "d");
  }
🧩 Analysis chain
🏁 Script executed:

  #!/bin/bash
  # Search for any existing tests or documentation about b+d prefix combination
  rg -n "bd\"|db\"" --type py

Repository: methane/cpython
Length of output: 8394

🏁 Script executed:

  #!/bin/bash
  # Search in the lexer.c file for how b prefix is handled in validation
  rg -n "saw_b" Parser/lexer/lexer.c | head -20

Repository: methane/cpython
Length of output: 471

🏁 Script executed:

  #!/bin/bash
  # Look at the full validation function around lines 456-485
  sed -n '450,490p' Parser/lexer/lexer.c

Repository: methane/cpython
Length of output: 1455

🏁 Script executed:

  #!/bin/bash
  # Get the complete validation function - expand the range to see all checks
  sed -n '450,520p' Parser/lexer/lexer.c

Repository: methane/cpython
Length of output: 2017
Add incompatibility check for b and d prefixes.
The validation function checks for u+d, b+f, and b+t incompatibilities, but b+d is missing. A bd"..." combination would silently fall through to regular string handling without dedentation. Since b (bytes) and f/t (formatting) are already marked incompatible, b+d (dedented strings) should also be explicitly validated as incompatible.
Add a check after line 490:
if (saw_b && saw_d) {
RETURN_SYNTAX_ERROR("b", "d");
}
Also update the comment at line 460 to include bd in the unsupported list for clarity.
Force-pushed 57a3451 to 8d6981d
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1365-1377: The code reads line_start[i] after the
indentation-match loop without ensuring the pointer is still within bounds; add
a guard after the loop to ensure (line_start + i) < end before any dereference.
Specifically, in the block containing variables line_start, end, i, indent_len
and the calls to PyUnicodeWriter_WriteChar, insert a check like if (line_start +
i >= end) break; (or otherwise handle end-of-buffer) before testing
line_start[i] == '\0' or '\n' so you never read past end.
  while (line_start < end) {
      lineno++;

      Py_ssize_t i = 0;
      while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
          i++;
      }

      if (line_start[i] == '\0') { // found an empty line without newline.
          break;
      }
      if (line_start[i] == '\n') { // found an empty line with newline.
          if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {
Add bounds check before reading line_start[i].
If the remaining line is shorter than indent_len, line_start[i] can read past end. Guard the access after the loop.
🔧 Proposed fix
- while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
+ while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
      i++;
  }
-
- if (line_start[i] == '\0') { // found an empty line without newline.
+ if (line_start + i >= end) {
+     break; // reached end of content
+ }
+ if (line_start[i] == '\0') { // found an empty line without newline.
      break;
  }
Force-pushed 8d6981d to fbaa98e
Force-pushed fbaa98e to 733c2d0
Discourse thread: https://discuss.python.org/t/pep-822-dedented-multiline-string-d-string/105519
PEP PR (not merged yet): python/peps#4768
Summary by CodeRabbit
New Features
Bug Fixes
Tests