
Conversation


@methane methane commented Jan 5, 2026

Discourse thread: https://discuss.python.org/t/pep-822-dedented-multiline-string-d-string/105519

PEP PR (not merged yet): python/peps#4768

Summary by CodeRabbit

  • New Features

    • Adds a new "d" dedent string prefix for triple-quoted multiline strings that strips common leading indentation and integrates with f-strings.
  • Bug Fixes

    • Enforces triple-quoted form for dedent strings, disallows incompatible prefix combinations, and provides clearer syntax errors for single-quoted or empty forms.
    • Ensures prefix parsing validates combinations involving d and exercises multiline prefix checks.
  • Tests

    • Adds tests covering dedent-string behavior, indentation/dedent rules, and related error messages.

✏️ Tip: You can customize this high-level summary in your review settings.
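
For readers new to the proposal, the sketch below illustrates the intended dedent semantics. The d-string spelling is hypothetical (it needs an interpreter built with this PR); the textwrap.dedent call is a runnable approximation of the same common-leading-whitespace behavior, though the exact PEP 822 rules (leading newline, closing indent line) may differ in detail.

import textwrap

# Hypothetical d-string per this PR:
#     greeting = d"""
#         hello
#         world
#         """
# Runnable approximation with today's stdlib:
greeting = textwrap.dedent("""\
    hello
    world
    """)
print(repr(greeting))  # 'hello\nworld\n': the common 4-space indent is stripped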


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Fix all issues with AI Agents 🤖
In @Parser/action_helpers.c:
- Around line 1295-1392: unicodewriter_write_line currently writes trailing
backslash+newline sequences verbatim for non-raw d-strings, causing newline
suppression to be ignored; modify unicodewriter_write_line to, when is_raw is
false, detect a final "\\\n" at the end of the provided segment and drop the
backslash and newline before writing (so non-raw d-strings and the dedent logic
in _PyPegen_dedent_string_part preserve line-continuation behavior), while
leaving the raw path unchanged (the early is_raw short-circuit should still
bypass this trimming).

In @Parser/string_parser.c:
- Line 284: Update the comment that currently reads "checks the prefix is
consistant." to fix the typo by changing "consistant" to "consistent" so the
comment reads "checks the prefix is consistent." This change is in the comment
string present in Parser/string_parser.c.
- Around line 337-344: The code passes raw=1 to _PyPegen_decode_string in the
non-raw branch so escapes are not processed and also fails to discard the
PyUnicodeWriter on the error return; change the call to
_PyPegen_decode_string(p, 0, line_start, line_end - line_start, token) (or use
the existing is_raw false variable) so escape sequences are handled, and on the
error path before returning NULL call PyUnicodeWriter_Discard(w) to free writer
state (ensure you still Py_XDECREF(line) as currently done).
- Around line 280-292: The code reads endline[0] without ensuring indent_len>0
which can read past the buffer for cases like d"""\nfoo\n"""; add an early check
for indent_len == 0 before computing indent_char/endline loop and raise a syntax
error using RAISE_ERROR_KNOWN_LOCATION (use PyExc_SyntaxError or equivalent)
with the same token position info (token->end_lineno, token->end_col_offset
offsets) and a message like "d-strings require trailing indentation to determine
dedent level" so the function returns NULL instead of reading out-of-bounds.
🧹 Nitpick comments (3)
Parser/action_helpers.c (1)

1394-1572: Dedent-aware f-/t-string plumbing is structurally sound, but is_raw is overloaded

The integration between _get_resized_exprs() and _PyPegen_decode_fstring_part() (handling is_first, indent_char, and dedent_count) looks correct. It:

  • enforces “d-string must start with a newline” via the first Constant,
  • enforces “d-string must end with an indent line”, computing a uniform indent_char/indent_count from the last Constant,
  • and reuses that dedent context across all constant pieces of the f-/t-string.

One readability nit is that is_raw now encodes both the actual r-prefix and the “no backslashes in this piece” optimization:

is_raw = is_raw || strchr(bstr, '\\') == NULL;

This is fine functionally, but separating these concerns into has_raw_prefix vs. no_backslashes (and deriving an internal decode_raw flag from them) would make the control flow clearer, especially as more d-string edge cases accumulate.

Lib/test/test_dstring.py (1)

4-13: Avoid shadowing built-in str in the helper

In assertAllRaise, the loop variable is named str, which shadows the built-in type. It’s harmless here but slightly obscures tracebacks and REPL debugging; renaming it to something like s or expr would be clearer.

Small naming tweak
-    def assertAllRaise(self, exception_type, regex, error_strings):
-        for str in error_strings:
-            with self.subTest(str=str):
-                with self.assertRaisesRegex(exception_type, regex) as cm:
-                    eval(str)
+    def assertAllRaise(self, exception_type, regex, error_strings):
+        for expr in error_strings:
+            with self.subTest(expr=expr):
+                with self.assertRaisesRegex(exception_type, regex) as cm:
+                    eval(expr)
Parser/string_parser.c (1)

381-384: Minor style inconsistency.

Line 382 is missing a space after = (quote =(unsigned char)*++s; compare line 371's quote = (unsigned char)*++s;). This may be intentional to match existing inconsistencies elsewhere in the file, but it is worth noting.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61fc72a and 5ce8afc.

📒 Files selected for processing (6)
  • Lib/test/test_dstring.py
  • Lib/test/test_tokenize.py
  • Lib/tokenize.py
  • Parser/action_helpers.c
  • Parser/lexer/lexer.c
  • Parser/string_parser.c
🧰 Additional context used
🧬 Code graph analysis (3)
Parser/action_helpers.c (2)
Parser/string_parser.c (1)
  • _PyPegen_decode_string (241-248)
Parser/pegen.h (1)
  • RAISE_ERROR_KNOWN_LOCATION (187-200)
Parser/string_parser.c (3)
Parser/pegen.h (1)
  • RAISE_ERROR_KNOWN_LOCATION (187-200)
Objects/unicode_writer.c (6)
  • PyUnicodeWriter_Create (157-185)
  • PyUnicodeWriter_Discard (188-195)
  • PyUnicodeWriter_WriteChar (316-326)
  • PyUnicodeWriter_WriteUTF8 (531-548)
  • PyUnicodeWriter_WriteStr (360-380)
  • PyUnicodeWriter_Finish (625-632)
Include/refcount.h (1)
  • Py_XDECREF (510-515)
Parser/lexer/lexer.c (1)
Parser/tokenizer/helpers.c (1)
  • _PyTokenizer_syntaxerror (65-74)
🪛 GitHub Actions: Tests
Lib/test/test_dstring.py

[error] 41-41: test_simple_dstring failed. Expected ' hello world' but got ' hello world\\n'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Sanitizers / TSan
  • GitHub Check: Sanitizers / UBSan
  • GitHub Check: Sanitizers / TSan (free-threading)
  • GitHub Check: Ubuntu (free-threading) / build and test (ubuntu-24.04-arm)
  • GitHub Check: Ubuntu / build and test (ubuntu-24.04-arm)
  • GitHub Check: Windows / Build and test (Win32)
  • GitHub Check: Windows (free-threading) / Build and test (x64)
  • GitHub Check: Windows / Build and test (x64)
  • GitHub Check: Windows (free-threading) / Build and test (arm64)
  • GitHub Check: Windows / Build and test (arm64)
  • GitHub Check: WASI / build and test
  • GitHub Check: Hypothesis tests on Ubuntu
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 1.1.1w)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.3.5)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.2.6)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.0.18)
  • GitHub Check: Ubuntu SSL tests with OpenSSL (ubuntu-24.04, 3.5.4)
  • GitHub Check: Check if generated files are up to date
🔇 Additional comments (9)
Lib/tokenize.py (1)

85-92: d-string prefixes list looks consistent with lexer behavior

The extended _valid_string_prefixes set (adding d and df/dt/dr/dfr/dtr) matches the new dedent-string semantics and should stay in sync with the prefix handling in Parser/lexer/lexer.c. No functional issues spotted here.
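
A quick way to eyeball the resulting set is sketched below; this assumes an interpreter built with this PR, and note that _all_string_prefixes is a private tokenize helper, so relying on it outside tests is an implementation detail.

import tokenize

# Collect every accepted prefix spelling that contains d/D:
d_prefixes = sorted(p for p in tokenize._all_string_prefixes() if "d" in p.lower())
print(d_prefixes)  # expected: case/order permutations of d, dr, df, dt, dfr, dtr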

Lib/test/test_tokenize.py (1)

3415-3455: Prefix-discovery test correctly updated for multiline-only d-strings

Switching eval to use """\n""" ensures determine_valid_prefixes() only treats d-prefixed forms as valid in the multiline (triple-quoted) case, which matches the d-string requirements and the updated prefix set in tokenize.StringPrefix. The change is sound.

Parser/lexer/lexer.c (2)

456-505: Prefix compatibility matrix correctly extended for d-strings

The updated maybe_raise_syntax_error_for_string_prefixes():

  • keeps existing constraints (no u combos, no b with f/t, no f+t),
  • newly forbids u+d and b+d,

which matches the intended valid prefix set (d, df, dt, dr, dfr, dtr plus permutations, but no ud, bd, etc.) and stays aligned with Lib/tokenize._all_string_prefixes. The behavior here looks correct.


747-792: f/t/d-string prefix parsing and “d must be triple-quoted” check look correct

The identifier/prefix scanner now tracks saw_d and passes it into maybe_raise_syntax_error_for_string_prefixes, then:

  • routes only f/t-containing prefixes into the f_string_quote path (df, dt, dfr, dtr, plus permutations),
  • routes plain d/dr-prefixed strings through the regular string path.

Inside f_string_quote, the new prefix walk:

  • sets string_kind to TSTRING when a t/T is present,
  • sets raw when an r/R is present,
  • enforces that any d/D in the prefix requires quote_size == 3, otherwise returning a syntax error (“d-string must be a multiline string”).

This neatly enforces the syntactic constraints for dedented f-/t-strings without regressing existing f-string behavior.

Also applies to: 1061-1138
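
As an illustrative smoke test of the triple-quote rule (hypothetical: it assumes an interpreter built with this PR; compile() only parses, so nothing is executed):

# Single-quoted d-string: should hit the new lexer error.
try:
    compile('d"not multiline"', "<test>", "eval")
except SyntaxError as exc:
    print(exc.msg)  # expected: d-string must be a multiline string

# Triple-quoted form starting with a newline and ending on an indent line: should parse.
compile('d"""\n    body\n    """', "<test>", "eval")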

Parser/string_parser.c (5)

250-264: LGTM!

The initial validation correctly checks for an empty string before accessing s[0], and the error message provides clear feedback about the newline requirement.


266-278: LGTM!

The backwards search for the last newline is correctly guarded by the assert, and non-whitespace characters after the last newline are properly rejected.


316-323: LGTM!

The empty line handling correctly allows lines with fewer characters than indent_len if they contain only a newline, which permits blank lines within the d-string content.


424-429: LGTM!

Correctly enforces that d-strings must be triple-quoted with a clear error message.


433-454: LGTM!

The assert correctly documents that bytes mode and dedent mode are mutually exclusive, and the routing to _PyPegen_dedent_string is properly placed.

Comment on lines 280 to 292
// Now, prefix is both the dedent indentation and the end of the d-string body.
Py_ssize_t indent_len = s + len - endline;
int indent_char = endline[0]; // ' ', '\t', or '\0'.

// checks the prefix is consistant.
for (Py_ssize_t i = 1; i < indent_len; i++) {
if (endline[i] != indent_char) {
RAISE_ERROR_KNOWN_LOCATION(
p, PyExc_TabError, token->end_lineno, token->end_col_offset - 3, token->end_lineno, token->end_col_offset -2,
"inconsistent use of tabs and spaces in indentation");
return NULL;
}
}

⚠️ Potential issue | 🟠 Major

Edge case: indent_len == 0 leads to undefined behavior.

If the d-string ends with just a newline and no trailing whitespace (e.g., d"""\nfoo\n"""), then indent_len is 0 and endline points to s + len. Accessing endline[0] on line 282 reads past the string boundary.

Consider adding a check for indent_len == 0 and raising an appropriate syntax error (d-strings require trailing indentation to determine dedent level).

🔎 Proposed fix
     // Now, prefix is both the dedent indentation and the end of the d-string body.
     Py_ssize_t indent_len = s + len - endline;
+    if (indent_len == 0) {
+        RAISE_ERROR_KNOWN_LOCATION(p, PyExc_SyntaxError,
+            token->end_lineno, token->end_col_offset - 3, token->end_lineno, token->end_col_offset - 2,
+            "d-string must have trailing indentation before closing quotes");
+        return NULL;
+    }
     int indent_char = endline[0];  // ' ', '\t', or '\0'.
🤖 Prompt for AI Agents
In @Parser/string_parser.c around lines 280-292, The code reads endline[0]
without ensuring indent_len>0 which can read past the buffer for cases like
d"""\nfoo\n"""; add an early check for indent_len == 0 before computing
indent_char/endline loop and raise a syntax error using
RAISE_ERROR_KNOWN_LOCATION (use PyExc_SyntaxError or equivalent) with the same
token position info (token->end_lineno, token->end_col_offset offsets) and a
message like "d-strings require trailing indentation to determine dedent level"
so the function returns NULL instead of reading out-of-bounds.

Py_ssize_t indent_len = s + len - endline;
int indent_char = endline[0]; // ' ', '\t', or '\0'.

// checks the prefix is consistant.

⚠️ Potential issue | 🟡 Minor

Typo: "consistant" → "consistent".

-    // checks the prefix is consistant.
+    // checks the prefix is consistent.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// checks the prefix is consistant.
// checks the prefix is consistent.
🤖 Prompt for AI Agents
In @Parser/string_parser.c around line 284, Update the comment that currently
reads "checks the prefix is consistant." to fix the typo by changing
"consistant" to "consistent" so the comment reads "checks the prefix is
consistent." This change is in the comment string present in
Parser/string_parser.c.

Repository owner deleted a comment from coderabbitai bot Jan 20, 2026
@methane methane force-pushed the peps/0822-d-string branch from 5ce8afc to 21cc35a on January 20, 2026 at 16:40

coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Walkthrough

Adds support for a new dedented d-string literal: the lexer recognizes d prefixes; the parser and action helpers implement dedentation and validation; a unicode helper for computing common leading whitespace is renamed and exported; tokenize tests are adjusted; and a new unittest validates d-string behavior.

Changes

  • Tests (Lib/test/test_dstring.py, Lib/test/test_tokenize.py): New Lib/test/test_dstring.py with DStringTestCase and helpers testing d-string errors, empty/start/newline rules, and dedentation. Updated Lib/test/test_tokenize.py eval checks to use multiline string literals (eval(f'{letter}"""\n"""'), eval(f'{p}"""\n"""')).
  • Tokenizer prefixes (Lib/tokenize.py): Extended _all_string_prefixes with d and combined prefixes (bd, rd, fd, td, brd, frd, trd).
  • Lexer (Parser/lexer/lexer.c): Track saw_d; extend maybe_raise_syntax_error_for_string_prefixes to accept saw_d; validate incompatible combos with d; require triple-quoted d-strings; integrate d into f/t-string detection and token initialization.
  • String parsing / dedent (Parser/string_parser.c, Parser/action_helpers.c): Added _PyPegen_dedent_string and dedent helpers; refactored _PyPegen_decode_fstring_part to accept is_first, indent, and indent_len; compute common leading whitespace, validate per-line indentation, build dedented bytes, and use dedented content in parsing/decoding flows.
  • Unicode helper rename / export (Objects/unicodeobject.c, Parser/...): Renamed search_longest_common_leading_whitespace to _Py_search_longest_common_leading_whitespace, exported it via extern, and updated call sites to use the new symbol.
  • API / signatures (Parser/action_helpers.c, Parser/lexer/lexer.c): Updated the _PyPegen_decode_fstring_part signature to include dedent-related params; updated maybe_raise_syntax_error_for_string_prefixes to accept saw_d.
  • Memory / error handling (Parser/..., Objects/...): Added temporary buffers and arena tracking for dedent computation; ensured proper Py_XDECREF of dedented bytes; introduced d-string-specific SyntaxError messages (e.g., "d-string must be triple-quoted", "d-string must start with a newline") and per-line indentation validation.

Sequence Diagram(s)

sequenceDiagram
    participant L as Lexer (lexer.c)
    participant P as Parser (string_parser.c)
    participant A as ActionHelpers (action_helpers.c)
    participant U as UnicodeHelper (unicodeobject.c)

    L->>P: Emit string token with saw_d flag
    P->>P: Validate triple-quote & dedent preconditions
    alt invalid d-string
        P->>P: Raise SyntaxError ("d-string must be triple-quoted" / "must start with a newline" / indentation error)
    else valid d-string
        P->>U: Call _Py_search_longest_common_leading_whitespace(src, end)
        U-->>P: Return common indent start/length
        P->>P: Build dedented bytes (_PyPegen_dedent_string)
        P->>A: Decode parts via _PyPegen_decode_fstring_part(is_first, indent, indent_len, ...)
        A->>U: Use _Py_search_longest_common_leading_whitespace for part lines as needed
        A-->>P: Return decoded/concatenated parts
        P-->>L: Produce final string token / AST node
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I found a tiny d in the quotes,
Nibbled margins, lined up all the notes,
Triple-quote gates kept the messy at bay,
Lexer hummed, parser danced, dedent led the way,
Rabbit hops on—clean strings for today.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 31.58%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately and specifically summarizes the main change: implementation of PEP 822's d-string feature across multiple files and subsystems.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Parser/lexer/lexer.c (1)

483-495: Missing incompatibility check between 'b' and 'd' prefixes.

The code adds incompatibility checks for u and d, but doesn't check for b and d combinations. The d prefix is for dedented strings which should produce str, not bytes. A bd or db prefix combination should be invalid.

🐛 Proposed fix
     if (saw_u && saw_d) {
         RETURN_SYNTAX_ERROR("u", "d");
     }
+    if (saw_b && saw_d) {
+        RETURN_SYNTAX_ERROR("b", "d");
+    }

     if (saw_b && saw_f) {
         RETURN_SYNTAX_ERROR("b", "f");
🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1475-1517: After creating temp_bytes via PyBytesWriter_Finish,
check the return value of _PyArena_AddPyObject(p->arena, temp_bytes); if it
fails (non-zero) then DECREF/clean up temp_bytes (e.g., Py_DECREF(temp_bytes) or
PyBytesWriter_Discard equivalent) and return NULL instead of proceeding to use
temp_bytes; ensure you only call PyBytes_AsString(temp_bytes) after a successful
_PyArena_AddPyObject to avoid leaking temp_bytes on arena insertion failure and
to propagate the existing exception.
♻️ Duplicate comments (2)
Parser/action_helpers.c (1)

1295-1311: Process escapes in non‑raw dedented lines.
This branch runs when is_raw is false, but raw=1 skips escape decoding (e.g., \n stays literal).

Proposed fix
-        PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
+        PyObject *line = _PyPegen_decode_string(p, 0, line_start, line_end - line_start, token);
Lib/test/test_dstring.py (1)

29-34: Tests correctly verify d-string dedentation semantics.

The test expectations align with the d-string specification: dedentation is based on the minimum common leading whitespace. Lines 33-34 correctly distinguish between non-raw (backslash-newline as continuation) and raw (backslash-newline preserved) behavior. The failing test at line 33 reflects the implementation bug noted in the previous review.
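
For readers following along, here is a hedged sketch of the distinction those two test lines encode (hypothetical syntax; the exact literals and expected values in the test file may differ):

# Non-raw: after dedenting, backslash-newline acts as a line continuation.
#     d"""
#         a \
#         b
#         """      evaluates to "a b\n" (the continuation is consumed)
# Raw: the backslash and newline survive verbatim.
#     dr"""
#         a \
#         b
#         """      evaluates to "a \\\nb\n" (the continuation is preserved)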

🧹 Nitpick comments (1)
Lib/test/test_dstring.py (1)

5-12: Avoid shadowing built-in str and remove debug code.

The parameter name str shadows Python's built-in str type, which can cause confusion and potential issues. Also, the commented debug print statements should be removed before merging.

♻️ Proposed fix
-    def assertAllRaise(self, exception_type, regex, error_strings):
-        for str in error_strings:
-            with self.subTest(str=str):
+    def assertAllRaise(self, exception_type, regex, error_strings):
+        for expr in error_strings:
+            with self.subTest(expr=expr):
                 with self.assertRaisesRegex(exception_type, regex) as cm:
-                    eval(str)
-                # print("Testing expression:", repr(str))
-                # print(repr(cm.exception))
-                # print(repr(cm.exception.text))
+                    eval(expr)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@Parser/string_parser.c`:
- Line 392: Rename the local variable _prev_call_invald to _prev_call_invalid in
Parser/string_parser.c and update all its usages (the assignments/reads
currently referencing _prev_call_invald around the earlier usages) so they
reference _prev_call_invalid instead (also update the declaration "int
_prev_call_invald = p->call_invalid_rules;" and every subsequent use at the
three places noted).
- Around line 269-290: The inner loop in Parser/string_parser.c that compares
s[i] to indent (inside the for (; s < end; lineno++) block) can read past the
buffer when the remaining bytes are shorter than indent_len; modify the loop in
the function to check bounds (ensure s + i < end) before accessing s[i] and
break/handle as a short-line case, and also guard the subsequent check that uses
s[i] after the loop (the `if (s[i] == '\n')` branch) so it only runs when i <
remaining_length; on short lines call the same error/empty-line handling path
(using PyBytesWriter_Discard, RAISE_ERROR_KNOWN_LOCATION, or writing the newline
via PyBytesWriter_WriteBytes) as appropriate.

Comment on lines +269 to +290
for (; s < end; lineno++) {
Py_ssize_t i;
for (i = 0; i < indent_len; i++) {
if (s[i] != indent[i]) {
if (s[i] == '\n') {
break; // empty line
}
PyBytesWriter_Discard(w);
RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
"d-string missing valid indentation");
return NULL;
}
}

if (s[i] == '\n') { // found an empty line with newline.
if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
PyBytesWriter_Discard(w);
return NULL;
}
s += i+1;
continue;
}

⚠️ Potential issue | 🟠 Major

Potential buffer overread when processing lines shorter than indent_len.

The inner loop accesses s[i] without verifying s + i < end. If the remaining content is shorter than indent_len (e.g., an incomplete last line or edge cases), this reads past the buffer. Similarly, line 283 accesses s[i] where i == indent_len after the loop completes, which could also exceed bounds.

🔎 Proposed fix: Add bounds check in inner loop
     for (; s < end; lineno++) {
         Py_ssize_t i;
-        for (i = 0; i < indent_len; i++) {
+        for (i = 0; i < indent_len && s + i < end; i++) {
             if (s[i] != indent[i]) {
                 if (s[i] == '\n') {
                     break; // empty line
                 }
                 PyBytesWriter_Discard(w);
                 RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
                     "d-string missing valid indentation");
                 return NULL;
             }
         }

-        if (s[i] == '\n') {  // found an empty line with newline.
+        if (s + i >= end) {
+            break;  // reached end of content
+        }
+        else if (s[i] == '\n') {  // found an empty line with newline.
             if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
                 PyBytesWriter_Discard(w);
                 return NULL;
             }
             s += i+1;
             continue;
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for (; s < end; lineno++) {
Py_ssize_t i;
for (i = 0; i < indent_len; i++) {
if (s[i] != indent[i]) {
if (s[i] == '\n') {
break; // empty line
}
PyBytesWriter_Discard(w);
RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
"d-string missing valid indentation");
return NULL;
}
}
if (s[i] == '\n') { // found an empty line with newline.
if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
PyBytesWriter_Discard(w);
return NULL;
}
s += i+1;
continue;
}
for (; s < end; lineno++) {
Py_ssize_t i;
for (i = 0; i < indent_len && s + i < end; i++) {
if (s[i] != indent[i]) {
if (s[i] == '\n') {
break; // empty line
}
PyBytesWriter_Discard(w);
RAISE_ERROR_KNOWN_LOCATION(p, PyExc_IndentationError, lineno, i, lineno, i+1,
"d-string missing valid indentation");
return NULL;
}
}
if (s + i >= end) {
break; // reached end of content
}
else if (s[i] == '\n') { // found an empty line with newline.
if (PyBytesWriter_WriteBytes(w, "\n", 1) < 0) {
PyBytesWriter_Discard(w);
return NULL;
}
s += i+1;
continue;
}
🤖 Prompt for AI Agents
In `@Parser/string_parser.c` around lines 269 - 290, The inner loop in
Parser/string_parser.c that compares s[i] to indent (inside the for (; s < end;
lineno++) block) can read past the buffer when the remaining bytes are shorter
than indent_len; modify the loop in the function to check bounds (ensure s + i <
end) before accessing s[i] and break/handle as a short-line case, and also guard
the subsequent check that uses s[i] after the loop (the `if (s[i] == '\n')`
branch) so it only runs when i < remaining_length; on short lines call the same
error/empty-line handling path (using PyBytesWriter_Discard,
RAISE_ERROR_KNOWN_LOCATION, or writing the newline via PyBytesWriter_WriteBytes)
as appropriate.

/* Avoid invoking escape decoding routines if possible. */
rawmode = rawmode || strchr(s, '\\') == NULL;

int _prev_call_invald = p->call_invalid_rules;

⚠️ Potential issue | 🟡 Minor

Typo: _prev_call_invald → _prev_call_invalid.

-    int _prev_call_invald = p->call_invalid_rules;
+    int _prev_call_invalid = p->call_invalid_rules;

Update all usages on lines 404, 452, and 467 accordingly.

🤖 Prompt for AI Agents
In `@Parser/string_parser.c` at line 392, Rename the local variable
_prev_call_invald to _prev_call_invalid in Parser/string_parser.c and update all
its usages (the assignments/reads currently referencing _prev_call_invald around
the earlier usages) so they reference _prev_call_invalid instead (also update
the declaration "int _prev_call_invald = p->call_invalid_rules;" and every
subsequent use at the three places noted).


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1295-1311: The bug is that unicodewriter_write_line calls
_PyPegen_decode_string with a hardcoded raw=1, preventing escape processing for
non-raw strings; update the call in unicodewriter_write_line to pass the correct
raw flag (use 0 or the is_raw variable) instead of 1 so escapes (e.g.
backslash-newline) are decoded; ensure the call to
_PyPegen_decode_string(line_start, line_end - line_start, token) uses the proper
raw parameter consistent with is_raw.

In `@Parser/lexer/lexer.c`:
- Around line 483-485: The parser's prefix validation misses the incompatible
combination of byte and dedent prefixes: add a check that if saw_b && saw_d then
call RETURN_SYNTAX_ERROR("b", "d") (place it alongside the existing saw_b &&
saw_f / saw_b && saw_t checks in the same validation function), and update the
nearby comment that lists unsupported prefix combos to include "bd" so the
supported/incompatible list accurately reflects this case; reference the boolean
flags saw_b and saw_d and the macro/function RETURN_SYNTAX_ERROR to locate where
to insert the check and update the comment.
♻️ Duplicate comments (1)
Parser/action_helpers.c (1)

1538-1543: Check _PyArena_AddPyObject return value to prevent memory leak.

If arena insertion fails, temp_bytes leaks while an exception is set.

Proposed fix
         PyObject *temp_bytes = PyBytesWriter_Finish(w);
         if (temp_bytes == NULL) {
             return NULL;
         }
-        _PyArena_AddPyObject(p->arena, temp_bytes);
+        if (_PyArena_AddPyObject(p->arena, temp_bytes) < 0) {
+            Py_DECREF(temp_bytes);
+            return NULL;
+        }
         const char *temp_str = PyBytes_AsString(temp_bytes);
🧹 Nitpick comments (1)
Parser/action_helpers.c (1)

1448-1454: Consider moving extern declaration to a shared header.

Declaring extern functions inline in .c files works but risks signature drift if the definition in Objects/unicodeobject.c changes. Consider adding this declaration to pycore_unicodeobject.h or a similar internal header for better maintainability.

Comment on lines +1295 to +1311
static int
unicodewriter_write_line(Parser *p, PyUnicodeWriter *w, const char *line_start, const char *line_end,
int is_raw, Token* token)
{
if (is_raw || memchr(line_start, '\\', line_end - line_start) == NULL) {
return PyUnicodeWriter_WriteUTF8(w, line_start, line_end - line_start);
}
else {
PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
if (line == NULL || PyUnicodeWriter_WriteStr(w, line) < 0) {
Py_XDECREF(line);
return -1;
}
Py_DECREF(line);
}
return 0;
}

⚠️ Potential issue | 🔴 Critical

Critical: Wrong raw parameter passed to _PyPegen_decode_string.

Line 1303 passes 1 (raw=true) to _PyPegen_decode_string, but this is in the else branch where is_raw is known to be false. This means escape sequences like \n, \t, etc. won't be processed for non-raw d-strings - they'll be written as literal characters.

This is likely the root cause of the line continuation issue flagged in past reviews. When escape processing is skipped, backslash-newline continuation won't work.

Proposed fix
 static int
 unicodewriter_write_line(Parser *p, PyUnicodeWriter *w, const char *line_start, const char *line_end,
                          int is_raw, Token* token)
 {
     if (is_raw || memchr(line_start, '\\', line_end - line_start) == NULL) {
         return PyUnicodeWriter_WriteUTF8(w, line_start, line_end - line_start);
     }
     else {
-        PyObject *line = _PyPegen_decode_string(p, 1, line_start, line_end - line_start, token);
+        PyObject *line = _PyPegen_decode_string(p, 0, line_start, line_end - line_start, token);
         if (line == NULL || PyUnicodeWriter_WriteStr(w, line) < 0) {
             Py_XDECREF(line);
             return -1;
         }
         Py_DECREF(line);
     }
     return 0;
 }
🤖 Prompt for AI Agents
In `@Parser/action_helpers.c` around lines 1295 - 1311, The bug is that
unicodewriter_write_line calls _PyPegen_decode_string with a hardcoded raw=1,
preventing escape processing for non-raw strings; update the call in
unicodewriter_write_line to pass the correct raw flag (use 0 or the is_raw
variable) instead of 1 so escapes (e.g. backslash-newline) are decoded; ensure
the call to _PyPegen_decode_string(line_start, line_end - line_start, token)
uses the proper raw parameter consistent with is_raw.

Comment on lines +483 to +485
if (saw_u && saw_d) {
RETURN_SYNTAX_ERROR("u", "d");
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for any existing tests or documentation about b+d prefix combination
rg -n "bd\"|db\"" --type py

Repository: methane/cpython

Length of output: 8394


🏁 Script executed:

#!/bin/bash
# Search in the lexer.c file for how b prefix is handled in validation
rg -n "saw_b" Parser/lexer/lexer.c | head -20

Repository: methane/cpython

Length of output: 471


🏁 Script executed:

#!/bin/bash
# Look at the full validation function around lines 456-485
sed -n '450,490p' Parser/lexer/lexer.c

Repository: methane/cpython

Length of output: 1455


🏁 Script executed:

#!/bin/bash
# Get the complete validation function - expand the range to see all checks
sed -n '450,520p' Parser/lexer/lexer.c

Repository: methane/cpython

Length of output: 2017


Add incompatibility check for b and d prefixes.

The validation function checks for u+d, b+f, and b+t incompatibilities, but b+d is missing. A bd"..." combination would silently fall through to regular string handling without dedentation. Since b (bytes) and f/t (formatting) are already marked incompatible, b+d (dedented strings) should also be explicitly validated as incompatible.

Add a check after line 490:

if (saw_b && saw_d) {
    RETURN_SYNTAX_ERROR("b", "d");
}

Also update the comment at line 460 to include bd in the unsupported list for clarity.
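
If the check is added, a hedged smoke test (hypothetical, against a build carrying this PR plus the proposed fix) might look like:

# Both orderings of the incompatible prefixes should be rejected at parse time:
for src in ('bd"""\n"""', 'db"""\n"""'):
    try:
        compile(src, "<test>", "eval")
        print(f"{src!r} unexpectedly accepted")
    except SyntaxError:
        print(f"{src!r} rejected as expected")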

🤖 Prompt for AI Agents
In `@Parser/lexer/lexer.c` around lines 483 - 485, The parser's prefix validation
misses the incompatible combination of byte and dedent prefixes: add a check
that if saw_b && saw_d then call RETURN_SYNTAX_ERROR("b", "d") (place it
alongside the existing saw_b && saw_f / saw_b && saw_t checks in the same
validation function), and update the nearby comment that lists unsupported
prefix combos to include "bd" so the supported/incompatible list accurately
reflects this case; reference the boolean flags saw_b and saw_d and the
macro/function RETURN_SYNTAX_ERROR to locate where to insert the check and
update the comment.

@methane methane force-pushed the peps/0822-d-string branch from 57a3451 to 8d6981d on January 27, 2026 at 09:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@Parser/action_helpers.c`:
- Around line 1365-1377: The code reads line_start[i] after the
indentation-match loop without ensuring the pointer is still within bounds; add
a guard after the loop to ensure (line_start + i) < end before any dereference.
Specifically, in the block containing variables line_start, end, i, indent_len
and the calls to PyUnicodeWriter_WriteChar, insert a check like if (line_start +
i >= end) break; (or otherwise handle end-of-buffer) before testing
line_start[i] == '\0' or '\n' so you never read past end.

Comment on lines +1365 to +1377
while (line_start < end) {
lineno++;

Py_ssize_t i = 0;
while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
i++;
}

if (line_start[i] == '\0') { // found an empty line without newline.
break;
}
if (line_start[i] == '\n') { // found an empty line with newline.
if (PyUnicodeWriter_WriteChar(w, '\n') < 0) {

⚠️ Potential issue | 🟠 Major

Add bounds check before reading line_start[i].

If the remaining line is shorter than indent_len, line_start[i] can read past end. Guard the access after the loop.

🔧 Proposed fix
-        while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
+        while (line_start + i < end && i < indent_len && line_start[i] == indent[i]) {
             i++;
         }
-
-        if (line_start[i] == '\0') {  // found an empty line without newline.
+        if (line_start + i >= end) {
+            break;  // reached end of content
+        }
+        if (line_start[i] == '\0') {  // found an empty line without newline.
             break;
         }
🤖 Prompt for AI Agents
In `@Parser/action_helpers.c` around lines 1365 - 1377, The code reads
line_start[i] after the indentation-match loop without ensuring the pointer is
still within bounds; add a guard after the loop to ensure (line_start + i) < end
before any dereference. Specifically, in the block containing variables
line_start, end, i, indent_len and the calls to PyUnicodeWriter_WriteChar,
insert a check like if (line_start + i >= end) break; (or otherwise handle
end-of-buffer) before testing line_start[i] == '\0' or '\n' so you never read
past end.

@methane methane force-pushed the peps/0822-d-string branch from 8d6981d to fbaa98e on January 27, 2026 at 10:23
@methane methane force-pushed the peps/0822-d-string branch from fbaa98e to 733c2d0 on January 27, 2026 at 12:28