feat: Add queue configuration options and suggest future params (#201)
Merged
This commit introduces several new configuration parameters to Tsercom's RuntimeConfig, enhancing control over queue behaviors.
Phase 1: Configurable Response Queue Size
- Added `max_queued_responses_per_endpoint` to `RuntimeConfig`.
- This parameter limits the size of the `asyncio.Queue` used by `AsyncPoller` within `RuntimeDataHandlerBase` for each remote endpoint.
- Plumbed this parameter from `RuntimeConfig` through `RuntimeFactory` and `runtime_main.py` to the data handlers.
- Updated relevant tests and added a new test to verify queue size respect.
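As a minimal sketch of what such a bound enforces, the following uses a plain `asyncio.Queue` with `maxsize` set from the new parameter. The drop-on-full policy shown here is an assumption for illustration, not necessarily the exact behavior of `AsyncPoller`:

```python
import asyncio


async def demo() -> list[int]:
    # Hypothetical per-endpoint response queue bounded by
    # max_queued_responses_per_endpoint (here: 2).
    max_queued_responses_per_endpoint = 2
    responses: asyncio.Queue[int] = asyncio.Queue(
        maxsize=max_queued_responses_per_endpoint
    )
    dropped: list[int] = []
    for item in [1, 2, 3]:
        try:
            responses.put_nowait(item)
        except asyncio.QueueFull:
            dropped.append(item)  # queue is full; this response is lost
    return dropped


print(asyncio.run(demo()))  # prints [3]
```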
Phase 2: Configurable IPC Queue Behavior
- Added `max_ipc_queue_size` and `is_ipc_blocking` to `RuntimeConfig`.
- `max_ipc_queue_size` controls the `maxsize` of `multiprocessing.Queue` (or `torch.multiprocessing.Queue`) instances used for core IPC.
- `is_ipc_blocking` determines if `put` operations on these IPC queues should block or be lossy when full.
- Plumbed these parameters from `RuntimeManager` (when creating default factories) through `SplitRuntimeFactoryFactory` to the queue factory constructors (`DefaultMultiprocessQueueFactory`, `TorchMultiprocessQueueFactory`, `TorchMemcpyQueueFactory`).
- Queue factories now use `max_ipc_queue_size` when creating queues.
- `MultiprocessQueueSink` now accepts an `is_blocking` flag and its `put_blocking` method honors this flag.
- Updated relevant tests and added new tests for blocking/non-blocking IPC queue behavior.
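The blocking-versus-lossy distinction can be sketched as follows. `QueueSink` here is an illustrative stand-in, not Tsercom's actual `MultiprocessQueueSink` API:

```python
import multiprocessing as mp
import queue


class QueueSink:
    """Illustrative sketch of a sink whose put honors an is_blocking flag."""

    def __init__(self, q, is_blocking: bool, timeout: float = 0.1):
        self._queue = q
        self._is_blocking = is_blocking
        self._timeout = timeout

    def put_blocking(self, item) -> bool:
        if self._is_blocking:
            # Blocking mode: wait (up to timeout) for space; queue.Full
            # propagates if the timeout elapses.
            self._queue.put(item, block=True, timeout=self._timeout)
            return True
        try:
            # Lossy mode: drop the item when the queue is full.
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False


sink = QueueSink(mp.Queue(maxsize=1), is_blocking=False)
print(sink.put_blocking("a"))  # prints True
print(sink.put_blocking("b"))  # prints False (queue full, item dropped)
```

Returning a boolean from the lossy path lets callers count drops without catching `queue.Full` themselves.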
Phase 3: Suggestions for Future Configuration Parameters
The following parameters could be added to `RuntimeConfig` in the future to further enhance configurability:
1. `runtime_manager_process_join_timeout_seconds`:
* Affects: `tsercom.api.runtime_manager.RuntimeManager`
* Why: Configure timeout for joining the out-of-process runtime process during shutdown.
2. `runtime_command_bridge_stop_timeout_seconds`:
* Affects: `tsercom.api.local_process.runtime_command_bridge.RuntimeCommandBridge`
* Why: Configure timeout for runtime's `stop` method completion.
3. `data_reader_source_poll_timeout_seconds`:
* Affects: `tsercom.api.split_process.data_reader_source.DataReaderSource`
* Why: Tune responsiveness of the polling thread to stop signals.
4. `data_reader_source_join_timeout_seconds`:
* Affects: `tsercom.api.split_process.data_reader_source.DataReaderSource`
* Why: Configure join timeout for the polling thread.
5. `event_source_poll_timeout_seconds`:
* Affects: `tsercom.api.split_process.event_source.EventSource`
* Why: Tune responsiveness of the event polling thread.
6. `event_source_join_timeout_seconds`:
* Affects: `tsercom.api.split_process.event_source.EventSource`
* Why: Configure join timeout for the event polling thread.
7. `runtime_command_source_poll_timeout_seconds`:
* Affects: `tsercom.api.split_process.runtime_command_source.RuntimeCommandSource`
* Why: Tune responsiveness of the command watching thread.
8. `runtime_command_source_join_timeout_seconds`:
* Affects: `tsercom.api.split_process.runtime_command_source.RuntimeCommandSource`
* Why: Configure join timeout for the command source thread.
9. `split_error_watcher_source_poll_timeout_seconds`:
* Affects: `tsercom.api.split_process.split_process_error_watcher_source.SplitProcessErrorWatcherSource`
* Why: Tune responsiveness of the error watching thread.
10. `split_error_watcher_source_join_timeout_seconds`:
* Affects: `tsercom.api.split_process.split_process_error_watcher_source.SplitProcessErrorWatcherSource`
* Why: Configure join timeout for the error watcher thread.
11. `async_poller_wait_timeout_seconds`:
* Affects: `tsercom.threading.aio.async_poller.AsyncPoller` (via `IsRunningTracker`)
* Why: Tune internal polling interval for item/stop checks.
12. `default_thread_pool_max_workers_local_factory`:
* Affects: `tsercom.api.runtime_manager.RuntimeManager`
* Why: Configure `max_workers` for default `LocalRuntimeFactoryFactory` thread pool.
13. `default_thread_pool_max_workers_split_factory`:
* Affects: `tsercom.api.runtime_manager.RuntimeManager`
* Why: Configure `max_workers` for default `SplitRuntimeFactoryFactory` thread pool.
14. `data_reader_sink_is_lossy_default`:
* Affects: `tsercom.api.split_process.data_reader_sink.DataReaderSink`
* Why: Global default for whether data sinks are lossy or error on full queue.
15. `remote_process_main_stop_timeout_seconds`:
* Affects: `tsercom.runtime.runtime_main`
* Why: Configure timeout for stopping all runtimes in a remote process.
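For illustration, a hypothetical dataclass showing how a few of these suggested parameters, alongside the options added in Phases 1 and 2, might be declared with defaults. This is a sketch only, not Tsercom's actual `RuntimeConfig`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeConfigSketch:
    """Hypothetical sketch of a config with the new and suggested fields."""

    max_queued_responses_per_endpoint: Optional[int] = None  # None = unbounded
    max_ipc_queue_size: Optional[int] = None                 # None = unbounded
    is_ipc_blocking: bool = True
    # A sampling of the suggested future timeouts (defaults are invented):
    runtime_manager_process_join_timeout_seconds: float = 5.0
    event_source_poll_timeout_seconds: float = 0.1


config = RuntimeConfigSketch(max_ipc_queue_size=128, is_ipc_blocking=False)
print(config.max_ipc_queue_size)  # prints 128
```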
This commit adds the `data_reader_sink_is_lossy` parameter to `RuntimeConfig`.
- Added `data_reader_sink_is_lossy: bool` to `RuntimeConfig` with a default of `True`.
- Updated `RuntimeConfig.__init__` and clone logic.
- Added a corresponding property to `RuntimeFactory`.
- Updated `RemoteRuntimeFactory._remote_data_reader()` to pass this `is_lossy` flag to the `DataReaderSink` constructor.
- Updated `FakeRuntimeInitializer` in relevant test files (`runtime_config_unittest.py`, `remote_runtime_factory_unittest.py`, `local_runtime_factory_unittest.py`, `local_runtime_factory_factory_unittest.py`) to include the new parameter and property for consistency.
- Updated `FakeDataReaderSink` in `remote_runtime_factory_unittest.py` to accept `is_lossy` and updated tests to verify its propagation.
This commit incorporates feedback to improve the configuration handling and code quality:
- Refactored IPC configuration:
- `RuntimeManager` no longer takes IPC queue parameters in its constructor.
- `SplitRuntimeFactoryFactory` now retrieves IPC settings (`max_ipc_queue_size`, `is_ipc_blocking`) directly from the `RuntimeInitializer` instance it processes.
- `RuntimeConfig.max_ipc_queue_size` now uses `Optional[int]` with a default of `None` to signify unbounded queues. Queue factories correctly interpret `None` or non-positive values as `maxsize=0` for the underlying `multiprocessing.Queue`.
- Removed redundant delegating properties from `RuntimeFactory` as these are inherited from `RuntimeConfig` via `RuntimeInitializer`.
- Ensured new private instance variables in queue factories and `MultiprocessQueueSink` use the `__` prefix (e.g., `__max_ipc_queue_size`).
- Removed unnecessary/meta comments from modified files.
- Corrected test logic for `queue.Empty` exceptions in queue factory unit tests.
- Updated `FakeRuntimeInitializer` in test files to align with `RuntimeConfig` changes, ensuring all new config properties are present.
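The mapping from the config value to the queue's `maxsize` convention can be sketched with a small helper (the function name is illustrative):

```python
from typing import Optional


def effective_maxsize(max_ipc_queue_size: Optional[int]) -> int:
    """Map the config value to multiprocessing.Queue's maxsize convention:
    None or any non-positive value means unbounded, which the queue
    constructors spell as maxsize=0."""
    if max_ipc_queue_size is None or max_ipc_queue_size <= 0:
        return 0
    return max_ipc_queue_size


print(effective_maxsize(None))  # prints 0
print(effective_maxsize(-5))   # prints 0
print(effective_maxsize(16))   # prints 16
```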
This commit incorporates further feedback to improve configuration handling, private variable naming, and comment clarity.
- Refactored IPC configuration:
- Removed IPC queue parameters (`max_ipc_queue_size`, `is_ipc_blocking`) from `RuntimeManager.__init__`.
- `SplitRuntimeFactoryFactory` now retrieves these IPC settings directly from the `RuntimeInitializer` instance it processes (as `RuntimeInitializer` inherits from `RuntimeConfig`).
- `RuntimeConfig.max_ipc_queue_size` now correctly uses `Optional[int]` with a default of `None` to signify unbounded queues. Queue factories now interpret `None` or non-positive values as `maxsize=0` for the underlying `multiprocessing.Queue` and `torch.multiprocessing.Queue`.
- Removed redundant delegating properties from `RuntimeFactory` for `max_queued_responses_per_endpoint`, `max_ipc_queue_size`, `is_ipc_blocking`, and `data_reader_sink_is_lossy`, as these are directly inherited from `RuntimeConfig`.
- Ensured new private instance variables in queue factories (`DefaultMultiprocessQueueFactory`, `TorchMultiprocessQueueFactory`, `TorchMemcpyQueueFactory`) and `MultiprocessQueueSink` consistently use the `__` prefix.
- Conducted a thorough manual review and cleanup of comments in all modified application files to remove meta-comments, "what" comments, and ensure only necessary "why" comments remain.
- Corrected test logic for `queue.Empty` exceptions in queue factory unit tests, ensuring they expect `None` from `get_blocking` or correctly catch `queue.Empty`.
- Updated `FakeRuntimeInitializer` in all relevant test files to ensure full compatibility with `RuntimeConfig`'s expected attributes and properties, especially for the cloning mechanism.
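The `get_blocking` contract the corrected tests rely on, returning `None` on timeout rather than propagating `queue.Empty`, can be sketched as below. `QueueSource` is an illustrative stand-in for `MultiprocessQueueSource`:

```python
import queue


class QueueSource:
    """Illustrative sketch of a source whose get_blocking returns None
    on timeout/empty instead of raising queue.Empty."""

    def __init__(self, q: queue.Queue):
        self._queue = q

    def get_blocking(self, timeout: float = 0.05):
        try:
            return self._queue.get(block=True, timeout=timeout)
        except queue.Empty:
            return None  # callers check for None instead of catching queue.Empty


q: queue.Queue = queue.Queue()
src = QueueSource(q)
print(src.get_blocking())  # prints None (nothing queued; timeout elapsed)
q.put("item")
print(src.get_blocking())  # prints item
```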
This commit applies the final set of fixes based on feedback to ensure robust configuration handling, correct private member naming, and adherence to commenting standards.
- Corrected `RuntimeManager` and `SplitRuntimeFactoryFactory` to ensure IPC configuration (`max_ipc_queue_size`, `is_ipc_blocking`) is consistently sourced from `RuntimeInitializer` (which is a `RuntimeConfig`) rather than being passed through `RuntimeManager`'s constructor.
- Ensured `max_ipc_queue_size` in `RuntimeConfig` is `Optional[int]` defaulting to `None`, and that queue factories correctly interpret `None` or non-positive values as `maxsize=0` (unbounded) for underlying multiprocessing queues.
- Removed now-redundant properties from `RuntimeFactory` as they are inherited.
- Enforced the `__` prefix for all newly introduced private instance variables in factories and sinks.
- Performed a final thorough pass to remove all non-essential comments (meta-comments, "what" comments) from all modified application files, retaining only crucial "why" comments.
- Corrected test assertions for `queue.Empty` in queue factory unit tests to align with `MultiprocessQueueSource.get_blocking`'s behavior of returning `None` on timeout/empty.
(#206) Phase 1: Fix Existing Tests
- Resolved initial 12 test failures.
- Root cause of the E2E TypeErrors was `SplitRuntimeFactoryFactory` passing IPC params to `queue_factory.create_queues()`, which didn't accept them.
- Refactored the `MultiprocessQueueFactory` ABC and its concrete implementations (Default, Torch, TorchMemcpy) to accept `max_ipc_queue_size` and `is_ipc_blocking` in their `create_queues()` method, not `__init__`.
- Updated `SplitRuntimeFactoryFactory` to use the provider's `queue_factory` and call the updated `create_queues` method with IPC params.
- Fixed an `AttributeError` in `multiprocessing_context_provider_unittest.py`.
- Updated unit tests for all affected factories and `SplitRuntimeFactoryFactory` to align with the new signatures and mocking strategies.
- All 981 tests now pass.
Phase 2: Add New IPC Config Tests
- Added `test_create_pair_interaction_with_provider_and_factory` to `split_runtime_factory_factory_unittest.py`, verifying that IPC params from `RuntimeInitializer` are passed to the queue factory's `create_queues` method.
- Added `test_factory_with_non_blocking_queue_is_lossy` to `split_runtime_factory_factory_unittest.py`, confirming that a queue with `max_ipc_queue_size=1` and `is_ipc_blocking=False` raises `queue.Full` on the second item put to the underlying `mp.Queue`.
Other:
- Ran static analysis (black, ruff, mypy); fixed a mypy error in `TorchMemcpyQueueFactory`.
- Partially completed comment cleanup in modified files to adhere to the "Why, not What" principle; some minor cleanup might still be needed.
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
The unit test `test_create_pair_interaction_with_provider_and_factory` in `tsercom/api/split_process/split_runtime_factory_factory_unittest.py` incorrectly asserted that `create_queues` would be called 3 times on the mocked `queue_factory_instance`. Analysis of `SplitRuntimeFactoryFactory._create_pair` shows that the mocked `queue_factory_instance` (from `self.__mp_context_provider.queue_factory`) is used for creating event queues and data queues (2 calls); the command queues are created using a new, separate instance of `DefaultMultiprocessQueueFactory`. This commit updates the assertion to expect 2 calls to `mock_queue_factory_instance.create_queues`, aligning the test with the actual application behavior, and removes the corresponding assertion for the third call's arguments.
Verification:
- The modified unit test now passes.
- Full E2E test suites (`runtime_e2etest.py`, `rpc_e2etest.py`) were run and passed, confirming no regressions. An initial E2E test failure was investigated and attributed to test flakiness, not this change.
- Static analysis (Black, Ruff, Mypy) and formatting checks pass.
- The full test suite (`pytest --timeout=120`) passed twice consecutively.
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>