Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety by rwkeane · Pull Request #170 · rwkeane/tsercom

rwkeane · 2025-06-19T05:38:32Z

This commit refactors DelegatingMultiprocessQueueFactory and its associated sink/source wrappers to address a FileNotFoundError that occurred when using the 'fork' multiprocessing start method with PyTorch tensors.

Key changes:

Eager Queue Creation:
- DelegatingMultiprocessQueueFactory now eagerly creates two sets
  of underlying queues in its constructor (default and Torch-specific
  if PyTorch is available). This ensures queue resources are
  initialized in the parent process before forking.
Lazy Queue Selection:
- DelegatingMultiprocessQueueSink and Source now use a
  coordination mechanism on the default queue. The first item's type
  determines if the Torch or default queue pair is used for actual
  data transfer for that and all subsequent items.
Unit Test Updates:
- The test suite in delegating_multiprocess_queue_factory_unittest.py
  has been extensively updated.
- The 'fork' + tensor test (test_mp_correctness_fork_tensor_item_first)
  now passes.
- A new minimal test (test_minimal_torch_mp_queue_spawn_tensor)
  was added to help diagnose issues with 'spawn' and PyTorch tensors.

Outcome:

The primary goal of fixing the FileNotFoundError with the 'fork'
start method and PyTorch tensors is achieved. All 'fork' method
tests, including tensor scenarios, now pass.
'Spawn' method tests with non-tensor items also pass.

Investigation of 'spawn' + PyTorch Tensor Failures:

Extensive investigation was conducted into why tests using the 'spawn'
start method with PyTorch Tensors fail with EOFError/FileNotFoundError
(tests: test_mp_correctness_spawn_tensor_item_first,
test_mp_correctness_spawn_mixed_items_tensor_first, and the
diagnostic test_minimal_torch_mp_queue_spawn_tensor).
The minimal test, which uses torch.multiprocessing.Queue directly
(bypassing all tsercom factory/wrapper logic), also fails under
'spawn' with tensors.
Experiments included:
- Explicitly setting PyTorch's sharing strategy to 'file_system'.
- Explicitly calling tensor.contiguous().share_memory_() before
  putting tensors into the raw torch.multiprocessing.Queue in
  the minimal test.
None of these experiments resolved the failures in the minimal test
case for 'spawn' + tensor.

Conclusion on 'spawn' + PyTorch Tensor Issue:

The persistent failures in the minimal torch.multiprocessing.Queue
test strongly indicate that the root cause is external to the
tsercom.DelegatingMultiprocessQueueFactory code. It is likely an
underlying issue or limitation within PyTorch's multiprocessing
module regarding tensor sharing with the 'spawn' start method in the
current testing environment.
The tsercom code now correctly handles 'fork' scenarios and 'spawn'
for non-tensor data. The spawn+tensor issue is documented via the
failing minimal test included in the test suite.

This commit refactors `DelegatingMultiprocessQueueFactory` and its associated sink/source wrappers to address a `FileNotFoundError` that occurred when using the 'fork' multiprocessing start method with PyTorch tensors. Key changes: 1. **Eager Queue Creation:** * `DelegatingMultiprocessQueueFactory` now eagerly creates two sets of underlying queues in its constructor (default and Torch-specific if PyTorch is available). This ensures queue resources are initialized in the parent process before forking. 2. **Lazy Queue Selection:** * `DelegatingMultiprocessQueueSink` and `Source` now use a coordination mechanism on the default queue. The first item's type determines if the Torch or default queue pair is used for actual data transfer for that and all subsequent items. 3. **Unit Test Updates:** * The test suite in `delegating_multiprocess_queue_factory_unittest.py` has been extensively updated. * The 'fork' + tensor test (`test_mp_correctness_fork_tensor_item_first`) now passes. * A new minimal test (`test_minimal_torch_mp_queue_spawn_tensor`) was added to help diagnose issues with 'spawn' and PyTorch tensors. **Outcome:** * The primary goal of fixing the `FileNotFoundError` with the 'fork' start method and PyTorch tensors is **achieved**. All 'fork' method tests, including tensor scenarios, now pass. * 'Spawn' method tests with non-tensor items also pass. **Investigation of 'spawn' + PyTorch Tensor Failures:** * Extensive investigation was conducted into why tests using the 'spawn' start method with PyTorch Tensors fail with EOFError/FileNotFoundError (tests: `test_mp_correctness_spawn_tensor_item_first`, `test_mp_correctness_spawn_mixed_items_tensor_first`, and the diagnostic `test_minimal_torch_mp_queue_spawn_tensor`). * The minimal test, which uses `torch.multiprocessing.Queue` directly (bypassing all `tsercom` factory/wrapper logic), also fails under 'spawn' with tensors. * Experiments included: * Explicitly setting PyTorch's sharing strategy to 'file_system'. * Explicitly calling `tensor.contiguous().share_memory_()` before putting tensors into the raw `torch.multiprocessing.Queue` in the minimal test. * None of these experiments resolved the failures in the minimal test case for 'spawn' + tensor. **Conclusion on 'spawn' + PyTorch Tensor Issue:** * The persistent failures in the minimal `torch.multiprocessing.Queue` test strongly indicate that the root cause is external to the `tsercom.DelegatingMultiprocessQueueFactory` code. It is likely an underlying issue or limitation within PyTorch's `multiprocessing` module regarding tensor sharing with the 'spawn' start method in the current testing environment. * The `tsercom` code now correctly handles 'fork' scenarios and 'spawn' for non-tensor data. The `spawn`+tensor issue is documented via the failing minimal test included in the test suite.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170

Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170
rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
fix/ipc-fork-safety-d

rwkeane commented Jun 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

rwkeane commented Jun 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant