
Conversation

@hardglitch

Fix UTF-8 safety in string slicing

Problem:
The original code indexed &str using byte offsets (bytes[i] as char and slices like &s[a..b]), which breaks on UTF-8 input containing multi-byte characters (e.g., Cyrillic), causing panics like: byte index N is not a char boundary

Solution:

  • Loops rewritten to use char_indices() so all slice indices always land on valid UTF-8 character boundaries.
  • Slices computed safely from i and c.len_utf8() (sketched below).
  • Functions scan() and split_first_whitespace() now handle Unicode correctly.
  • Added is_only_whitespace() implemented via chars().all(|c| c.is_whitespace()).
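
A minimal sketch of the boundary-safe pattern (function names follow the PR description; the exact signatures and bodies are assumptions, not the crate's actual code):

/// Split at the first whitespace character, returning the text before it
/// and the text after it. Iterating with char_indices() guarantees that
/// `i` is always a valid UTF-8 character boundary, so the slices cannot panic.
fn split_first_whitespace(s: &str) -> Option<(&str, &str)> {
    for (i, c) in s.char_indices() {
        if c.is_whitespace() {
            // i + c.len_utf8() lands on the boundary just past `c`,
            // even when `c` is a multi-byte character.
            return Some((&s[..i], &s[i + c.len_utf8()..]));
        }
    }
    None
}

/// True for strings consisting entirely of whitespace
/// (also true for the empty string, since all() is vacuously true).
fn is_only_whitespace(s: &str) -> bool {
    s.chars().all(|c| c.is_whitespace())
}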

Effect:
The library no longer panics on RTF containing Cyrillic or other multi-byte characters, while preserving correct behavior for ASCII.
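
For illustration, this is the class of panic being fixed (a standalone example, not code from the PR):

fn main() {
    let s = "привет world";
    // Byte-offset slicing panics on multi-byte text:
    // let _broken = &s[1..]; // panic: byte index 1 is not a char boundary
    // Boundary-safe alternative via char_indices():
    let (i, c) = s.char_indices().next().unwrap();
    assert_eq!(&s[i + c.len_utf8()..], "ривет world"); // skips the 2-byte 'п'
}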

@d0rianb (Owner) left a comment


Thanks for the work,
Just a few comments on linting below

@d0rianb (Owner) commented Oct 7, 2025

This could fix #18.
However, one of the parser tests is not passing; it seems a whitespace is missing.
Maybe you could try this test in the lexer:

#[test]
fn should_lex_unicode() {
  let rtf = r#"{\u21834  \u21834 }"#;
  let tokens = Lexer::scan(rtf).unwrap();
  assert_eq!(
      tokens,
      vec![OpeningBracket, ControlSymbol((Unicode, Value(21834))), PlainText(" "), ControlSymbol((Unicode, Value(21834))), ClosingBracket]
  );
}
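
(For context: in RTF, a single space immediately following a control word such as \u21834 acts as its delimiter and is consumed by the lexer, which is why the two spaces after the first \u21834 are expected to produce only one PlainText(" ") token.)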
