Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170
Open
rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
Open
Fix: Refactor DelegatingMultiprocessQueueFactory for fork safety#170rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
rwkeane wants to merge 1 commit intofeature/shared-memory-ipc-coordinationfrom
Conversation
This commit refactors `DelegatingMultiprocessQueueFactory` and its
associated sink/source wrappers to address a `FileNotFoundError` that
occurred when using the 'fork' multiprocessing start method with
PyTorch tensors.
Key changes:
1. **Eager Queue Creation:**
* `DelegatingMultiprocessQueueFactory` now eagerly creates two sets
of underlying queues in its constructor (default and Torch-specific
if PyTorch is available). This ensures queue resources are
initialized in the parent process before forking.
2. **Lazy Queue Selection:**
* `DelegatingMultiprocessQueueSink` and `Source` now use a
coordination mechanism on the default queue. The first item's type
determines if the Torch or default queue pair is used for actual
data transfer for that and all subsequent items.
3. **Unit Test Updates:**
* The test suite in `delegating_multiprocess_queue_factory_unittest.py`
has been extensively updated.
* The 'fork' + tensor test (`test_mp_correctness_fork_tensor_item_first`)
now passes.
* A new minimal test (`test_minimal_torch_mp_queue_spawn_tensor`)
was added to help diagnose issues with 'spawn' and PyTorch tensors.
**Outcome:**
* The primary goal of fixing the `FileNotFoundError` with the 'fork'
start method and PyTorch tensors is **achieved**. All 'fork' method
tests, including tensor scenarios, now pass.
* 'Spawn' method tests with non-tensor items also pass.
**Investigation of 'spawn' + PyTorch Tensor Failures:**
* Extensive investigation was conducted into why tests using the 'spawn'
start method with PyTorch Tensors fail with EOFError/FileNotFoundError
(tests: `test_mp_correctness_spawn_tensor_item_first`,
`test_mp_correctness_spawn_mixed_items_tensor_first`, and the
diagnostic `test_minimal_torch_mp_queue_spawn_tensor`).
* The minimal test, which uses `torch.multiprocessing.Queue` directly
(bypassing all `tsercom` factory/wrapper logic), also fails under
'spawn' with tensors.
* Experiments included:
* Explicitly setting PyTorch's sharing strategy to 'file_system'.
* Explicitly calling `tensor.contiguous().share_memory_()` before
putting tensors into the raw `torch.multiprocessing.Queue` in
the minimal test.
* None of these experiments resolved the failures in the minimal test
case for 'spawn' + tensor.
**Conclusion on 'spawn' + PyTorch Tensor Issue:**
* The persistent failures in the minimal `torch.multiprocessing.Queue`
test strongly indicate that the root cause is external to the
`tsercom.DelegatingMultiprocessQueueFactory` code. It is likely an
underlying issue or limitation within PyTorch's `multiprocessing`
module regarding tensor sharing with the 'spawn' start method in the
current testing environment.
* The `tsercom` code now correctly handles 'fork' scenarios and 'spawn'
for non-tensor data. The `spawn`+tensor issue is documented via the
failing minimal test included in the test suite.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit refactors
DelegatingMultiprocessQueueFactoryand its associated sink/source wrappers to address aFileNotFoundErrorthat occurred when using the 'fork' multiprocessing start method with PyTorch tensors.Key changes:
Eager Queue Creation:
DelegatingMultiprocessQueueFactorynow eagerly creates two setsof underlying queues in its constructor (default and Torch-specific
if PyTorch is available). This ensures queue resources are
initialized in the parent process before forking.
Lazy Queue Selection:
DelegatingMultiprocessQueueSinkandSourcenow use acoordination mechanism on the default queue. The first item's type
determines if the Torch or default queue pair is used for actual
data transfer for that and all subsequent items.
Unit Test Updates:
delegating_multiprocess_queue_factory_unittest.pyhas been extensively updated.
test_mp_correctness_fork_tensor_item_first)now passes.
test_minimal_torch_mp_queue_spawn_tensor)was added to help diagnose issues with 'spawn' and PyTorch tensors.
Outcome:
FileNotFoundErrorwith the 'fork'start method and PyTorch tensors is achieved. All 'fork' method
tests, including tensor scenarios, now pass.
Investigation of 'spawn' + PyTorch Tensor Failures:
start method with PyTorch Tensors fail with EOFError/FileNotFoundError
(tests:
test_mp_correctness_spawn_tensor_item_first,test_mp_correctness_spawn_mixed_items_tensor_first, and thediagnostic
test_minimal_torch_mp_queue_spawn_tensor).torch.multiprocessing.Queuedirectly(bypassing all
tsercomfactory/wrapper logic), also fails under'spawn' with tensors.
tensor.contiguous().share_memory_()beforeputting tensors into the raw
torch.multiprocessing.Queueinthe minimal test.
case for 'spawn' + tensor.
Conclusion on 'spawn' + PyTorch Tensor Issue:
torch.multiprocessing.Queuetest strongly indicate that the root cause is external to the
tsercom.DelegatingMultiprocessQueueFactorycode. It is likely anunderlying issue or limitation within PyTorch's
multiprocessingmodule regarding tensor sharing with the 'spawn' start method in the
current testing environment.
tsercomcode now correctly handles 'fork' scenarios and 'spawn'for non-tensor data. The
spawn+tensor issue is documented via thefailing minimal test included in the test suite.