Skip to content

Comments

Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170

Open
rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
fix/ipc-fork-safety-d
Open

Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170
rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
fix/ipc-fork-safety-d

Conversation

@rwkeane
Copy link
Owner

@rwkeane rwkeane commented Jun 19, 2025

This commit refactors DelegatingMultiprocessQueueFactory and its associated sink/source wrappers to address a FileNotFoundError that occurred when using the 'fork' multiprocessing start method with PyTorch tensors.

Key changes:

  1. Eager Queue Creation:

    • DelegatingMultiprocessQueueFactory now eagerly creates two sets
      of underlying queues in its constructor (default and Torch-specific
      if PyTorch is available). This ensures queue resources are
      initialized in the parent process before forking.
  2. Lazy Queue Selection:

    • DelegatingMultiprocessQueueSink and Source now use a
      coordination mechanism on the default queue. The first item's type
      determines if the Torch or default queue pair is used for actual
      data transfer for that and all subsequent items.
  3. Unit Test Updates:

    • The test suite in delegating_multiprocess_queue_factory_unittest.py
      has been extensively updated.
    • The 'fork' + tensor test (test_mp_correctness_fork_tensor_item_first)
      now passes.
    • A new minimal test (test_minimal_torch_mp_queue_spawn_tensor)
      was added to help diagnose issues with 'spawn' and PyTorch tensors.

Outcome:

  • The primary goal of fixing the FileNotFoundError with the 'fork'
    start method and PyTorch tensors is achieved. All 'fork' method
    tests, including tensor scenarios, now pass.
  • 'Spawn' method tests with non-tensor items also pass.

Investigation of 'spawn' + PyTorch Tensor Failures:

  • Extensive investigation was conducted into why tests using the 'spawn'
    start method with PyTorch Tensors fail with EOFError/FileNotFoundError
    (tests: test_mp_correctness_spawn_tensor_item_first,
    test_mp_correctness_spawn_mixed_items_tensor_first, and the
    diagnostic test_minimal_torch_mp_queue_spawn_tensor).
  • The minimal test, which uses torch.multiprocessing.Queue directly
    (bypassing all tsercom factory/wrapper logic), also fails under
    'spawn' with tensors.
  • Experiments included:
    • Explicitly setting PyTorch's sharing strategy to 'file_system'.
    • Explicitly calling tensor.contiguous().share_memory_() before
      putting tensors into the raw torch.multiprocessing.Queue in
      the minimal test.
  • None of these experiments resolved the failures in the minimal test
    case for 'spawn' + tensor.

Conclusion on 'spawn' + PyTorch Tensor Issue:

  • The persistent failures in the minimal torch.multiprocessing.Queue
    test strongly indicate that the root cause is external to the
    tsercom.DelegatingMultiprocessQueueFactory code. It is likely an
    underlying issue or limitation within PyTorch's multiprocessing
    module regarding tensor sharing with the 'spawn' start method in the
    current testing environment.
  • The tsercom code now correctly handles 'fork' scenarios and 'spawn'
    for non-tensor data. The spawn+tensor issue is documented via the
    failing minimal test included in the test suite.

This commit refactors `DelegatingMultiprocessQueueFactory` and its
associated sink/source wrappers to address a `FileNotFoundError` that
occurred when using the 'fork' multiprocessing start method with
PyTorch tensors.

Key changes:
1.  **Eager Queue Creation:**
    *   `DelegatingMultiprocessQueueFactory` now eagerly creates two sets
        of underlying queues in its constructor (default and Torch-specific
        if PyTorch is available). This ensures queue resources are
        initialized in the parent process before forking.

2.  **Lazy Queue Selection:**
    *   `DelegatingMultiprocessQueueSink` and `Source` now use a
        coordination mechanism on the default queue. The first item's type
        determines if the Torch or default queue pair is used for actual
        data transfer for that and all subsequent items.

3.  **Unit Test Updates:**
    *   The test suite in `delegating_multiprocess_queue_factory_unittest.py`
        has been extensively updated.
    *   The 'fork' + tensor test (`test_mp_correctness_fork_tensor_item_first`)
        now passes.
    *   A new minimal test (`test_minimal_torch_mp_queue_spawn_tensor`)
        was added to help diagnose issues with 'spawn' and PyTorch tensors.

**Outcome:**
*   The primary goal of fixing the `FileNotFoundError` with the 'fork'
    start method and PyTorch tensors is **achieved**. All 'fork' method
    tests, including tensor scenarios, now pass.
*   'Spawn' method tests with non-tensor items also pass.

**Investigation of 'spawn' + PyTorch Tensor Failures:**
*   Extensive investigation was conducted into why tests using the 'spawn'
    start method with PyTorch Tensors fail with EOFError/FileNotFoundError
    (tests: `test_mp_correctness_spawn_tensor_item_first`,
    `test_mp_correctness_spawn_mixed_items_tensor_first`, and the
    diagnostic `test_minimal_torch_mp_queue_spawn_tensor`).
*   The minimal test, which uses `torch.multiprocessing.Queue` directly
    (bypassing all `tsercom` factory/wrapper logic), also fails under
    'spawn' with tensors.
*   Experiments included:
    *   Explicitly setting PyTorch's sharing strategy to 'file_system'.
    *   Explicitly calling `tensor.contiguous().share_memory_()` before
      putting tensors into the raw `torch.multiprocessing.Queue` in
      the minimal test.
*   None of these experiments resolved the failures in the minimal test
    case for 'spawn' + tensor.

**Conclusion on 'spawn' + PyTorch Tensor Issue:**
*   The persistent failures in the minimal `torch.multiprocessing.Queue`
    test strongly indicate that the root cause is external to the
    `tsercom.DelegatingMultiprocessQueueFactory` code. It is likely an
    underlying issue or limitation within PyTorch's `multiprocessing`
    module regarding tensor sharing with the 'spawn' start method in the
    current testing environment.
*   The `tsercom` code now correctly handles 'fork' scenarios and 'spawn'
    for non-tensor data. The `spawn`+tensor issue is documented via the
    failing minimal test included in the test suite.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant