Skip to content

fix(cache): validate cache_index schema collisions in worker dat…#3216

Open
Hitanshi7556 wants to merge 2 commits intokubeflow:masterfrom
Hitanshi7556:fix/data-cache-cache-index-collision-validation
Open

fix(cache): validate cache_index schema collisions in worker dat…#3216
Hitanshi7556 wants to merge 2 commits intokubeflow:masterfrom
Hitanshi7556:fix/data-cache-cache-index-collision-validation

Conversation

@Hitanshi7556
Copy link

What this PR does
Adds schema validation in WorkerDataSource::try_new to reject source tables that already define cache_index.

Why
cache_index is appended by data-cache worker logic; collisions can lead to ambiguous schema behavior.

Scope

  • Validation + error path in worker datasource
  • Unit tests for collision / non-collision
  • No projection/subset-scan changes (left for follow-up)

Testing

  • make test-rust
  • focused worker datasource tests

fixes #3174

Copilot AI review requested due to automatic review settings February 17, 2026 11:43
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign akshaychitneni for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

…asource

Signed-off-by: Hitanshi Goklani <hitanshigoklani33@gmail.com>
@Hitanshi7556 Hitanshi7556 force-pushed the fix/data-cache-cache-index-collision-validation branch from 94a4c9d to f3fa59d Compare February 17, 2026 11:45
@Hitanshi7556 Hitanshi7556 changed the title fix(data-cache): validate cache_index schema collisions in worker dat… fix(dcache): validate cache_index schema collisions in worker dat… Feb 17, 2026
@Hitanshi7556 Hitanshi7556 changed the title fix(dcache): validate cache_index schema collisions in worker dat… fix(cache): validate cache_index schema collisions in worker dat… Feb 17, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds schema validation to prevent column name collisions when the data cache worker appends its cache_index column to source table schemas. The validation rejects source tables that already define a cache_index column, addressing the TODO comment that was previously in the code.

Changes:

  • Added validation in WorkerDataSource::new to check for cache_index column collision before appending it to the schema
  • Removed the TODO comment about validating name collisions
  • Added two unit tests to verify the schema collision detection logic

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Hitanshi7556 <hitanshigoklani33@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support local iceberg table with parquet files to local validation and e2es

1 participant

Comments