
Conversation

@hardglitch

Fix UTF-8 safety in string slicing

Problem:
The original code indexed &str using byte offsets (bytes[i] as char and slices like &s[a..b]), which breaks on UTF-8 input containing multi-byte characters (e.g., Cyrillic), causing panics like: byte index N is not a char boundary

Solution:

  • Loops rewritten to use char_indices() so all slice indices always land on valid UTF-8 character boundaries.
  • Slices computed safely from i and c.len_utf8() (sketched below).
  • Functions scan() and split_first_whitespace() now handle Unicode correctly.
  • Added is_only_whitespace() implemented via chars().all(|c| c.is_whitespace()).
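
A minimal sketch of the boundary-safe pattern (function names follow the PR description; the exact signatures and bodies are assumptions, not the crate's actual code):

/// Split at the first whitespace character, returning the text before it
/// and the text after it. Iterating with char_indices() guarantees that
/// `i` is always a valid UTF-8 character boundary, so the slices cannot panic.
fn split_first_whitespace(s: &str) -> Option<(&str, &str)> {
    for (i, c) in s.char_indices() {
        if c.is_whitespace() {
            // i + c.len_utf8() lands on the boundary just past `c`,
            // even when `c` is a multi-byte character.
            return Some((&s[..i], &s[i + c.len_utf8()..]));
        }
    }
    None
}

/// True for strings consisting entirely of whitespace
/// (also true for the empty string, since all() is vacuously true).
fn is_only_whitespace(s: &str) -> bool {
    s.chars().all(|c| c.is_whitespace())
}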

Effect:
The library no longer panics on RTF containing Cyrillic or other multi-byte characters, while preserving correct behavior for ASCII.
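
For illustration, this is the class of panic being fixed (a standalone example, not code from the PR):

fn main() {
    let s = "привет world";
    // Byte-offset slicing panics on multi-byte text:
    // let _broken = &s[1..]; // panic: byte index 1 is not a char boundary
    // Boundary-safe alternative via char_indices():
    let (i, c) = s.char_indices().next().unwrap();
    assert_eq!(&s[i + c.len_utf8()..], "ривет world"); // skips the 2-byte 'п'
}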

@d0rianb (Owner) left a comment


Thanks for the work,
Just a few comments on linting below

@d0rianb (Owner) commented Oct 7, 2025

This could fix #18.
However, one of the parser tests is not passing; it seems a whitespace is missing.
Maybe you could try this test in the lexer:

#[test]
fn should_lex_unicode() {
  let rtf = r#"{\u21834  \u21834 }"#;
  let tokens = Lexer::scan(rtf).unwrap();
  assert_eq!(
      tokens,
      vec![OpeningBracket, ControlSymbol((Unicode, Value(21834))), PlainText(" "), ControlSymbol((Unicode, Value(21834))), ClosingBracket]
  );
}
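
(For context: in RTF, a single space immediately following a control word such as \u21834 acts as its delimiter and is consumed by the lexer, which is why the two spaces after the first \u21834 are expected to produce only one PlainText(" ") token.)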
