Skip to content

Conversation

@UtkarshSahay123
Copy link
Contributor

What does this PR do?

This PR fixes UTF-8 boundary validation in substring kernels for sliced
Utf8 and LargeUtf8 arrays.

Previously, UTF-8 boundary checks were performed against the full underlying
buffer, which could lead to incorrect validation when arrays were sliced.
This change ensures boundaries are validated relative to each value.

Why is this change needed?

Substring kernels operate on value-relative offsets. Validating offsets
against the global buffer can incorrectly reject valid boundaries or accept
invalid ones when arrays are sliced. This fix aligns validation with
value-level semantics.

What changes were made?

  • Perform UTF-8 boundary validation relative to per-value slices
  • Preserve existing behavior for unsliced arrays
  • No API changes

Tests

  • Existing substring tests cover this behavior
  • No new tests were required

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 18, 2025
Copy link
Contributor

@mhilton mhilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this change is necessary. Slicing a (Large)StringArray doesn't change the data buffer, so the offsets into the data buffer also do not change. Do you have an example of a StringArray where the existing code produces incorrect results?

@UtkarshSahay123
Copy link
Contributor Author

UtkarshSahay123 commented Dec 18, 2025 via email

@alamb
Copy link
Contributor

alamb commented Dec 19, 2025

That said, I agree that this distinction is subtle, and without a concrete
example where the current implementation produces incorrect results, the
change may not be justified. I will try to construct a minimal reproducer or
add a targeted test demonstrating this behavior. If I’m unable to do so, I’m
happy to drop or revise the change.

This sounds like a good plan

@alamb alamb marked this pull request as draft December 19, 2025 20:09
@alamb
Copy link
Contributor

alamb commented Dec 19, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

@UtkarshSahay123 UtkarshSahay123 marked this pull request as ready for review December 20, 2025 16:36
@UtkarshSahay123
Copy link
Contributor Author

Thanks for the note! I’ve marked the PR as ready for review now. Please let me know if any further changes are needed.

@alamb
Copy link
Contributor

alamb commented Dec 23, 2025

I don't think we can accept this PR without a test demonstrating what issue it is fixing

@Jefffrey Jefffrey marked this pull request as draft December 23, 2025 15:43
@UtkarshSahay123 UtkarshSahay123 marked this pull request as ready for review December 24, 2025 17:05
@Jefffrey Jefffrey marked this pull request as draft December 25, 2025 02:05
@Jefffrey
Copy link
Contributor

@UtkarshSahay123 Could you please refrain from marking this PR as ready for review until the above comments have been addressed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants