Skip to content

Conversation

@arushi297
Copy link
Collaborator

@arushi297 arushi297 commented Aug 7, 2025

This PR addresses the issue where the reader doesn't give reproducible data order for multiple workers when shuffling is disabled or seed is set. The following changes are made to address the issue:

  1. Ventilator Queue Modifications: Each worker now has a separate ventilator queue, ensuring deterministic data ordering through a round-robin strategy and eliminating race conditions.

  2. np.default_rng(): The switch from np.random.RandomState() to np.default_rng() provides reproducible shuffling, ensuring consistent outcomes when a seed is provided.

  3. Result Queue Improvements: Individual result queues for each worker have been implemented. Results are collected in a round-robin manner.

The following behavior is expected as a result of this change:

  1. Shuffling = False, Deterministically natural order
  2. Shuffling = True, Seed = Provided, Deterministically randomized order
  3. Shuffling = True, Seed = Not Provided (or 0), Non-deterministically randomized order

Testing and Validation: Added unit tests to cover the 3 scenario stated. Also, performed experiments that confirmed that data order remains consistent when shuffling is disabled and is reproducible with a seed, improving the reliability of dataloader.

@arushi297 arushi297 force-pushed the arushi.arora/consistent-data-order branch 7 times, most recently from 3958c2b to 87fd4f7 Compare August 8, 2025 02:46
@arushi297 arushi297 force-pushed the arushi.arora/consistent-data-order branch 5 times, most recently from 82479d5 to 4039b27 Compare August 8, 2025 06:59
@codecov
Copy link

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 94.64286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.22%. Comparing base (5f2797c) to head (70d42ea).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
petastorm/workers_pool/thread_pool.py 91.17% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #810      +/-   ##
==========================================
- Coverage   86.26%   86.22%   -0.05%     
==========================================
  Files          84       84              
  Lines        5090     5109      +19     
  Branches      790      801      +11     
==========================================
+ Hits         4391     4405      +14     
- Misses        560      562       +2     
- Partials      139      142       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arushi297 arushi297 force-pushed the arushi.arora/consistent-data-order branch from 4039b27 to ae21eb3 Compare August 8, 2025 19:28
@arushi297 arushi297 force-pushed the arushi.arora/consistent-data-order branch from ae21eb3 to 9def508 Compare August 8, 2025 19:36
@arushi297 arushi297 force-pushed the arushi.arora/consistent-data-order branch from fa270df to 0f24c93 Compare August 8, 2025 20:07
@arushi297 arushi297 requested review from diyu2025 and irasit August 8, 2025 23:24
@arushi297 arushi297 changed the title [WIP] Support consistent ordering in Petastorm Support reproducible ordering in Petastorm Aug 8, 2025
@diyu2025 diyu2025 closed this Aug 9, 2025
@diyu2025 diyu2025 reopened this Aug 9, 2025
@arushi297 arushi297 merged commit 6a710cf into uber:master Aug 9, 2025
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants