Conversation
Pull request overview
This pull request introduces memory management functionality for scida through a new init_resources() function. The primary goal is to allow users to limit memory usage when working with large datasets, particularly in shared environments like TNGLab or HPC clusters.
Changes:
- Added new `src/scida/init.py` module with `init_resources()` function for configuring Dask distributed computing with memory limits
- Extended `parse_humansize()` utility to support both decimal (KB, MB, GB) and binary (KiB, MiB, GiB) unit formats
- Added comprehensive test suite in `tests/test_init.py` covering various initialization scenarios
- Updated documentation to explain memory management and provide usage guidance
- Updated dependencies (mkdocstrings, matplotlib, pint, griffe, and others)
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/scida/init.py | New module implementing resource initialization with memory limits and TNGLab environment detection |
| src/scida/helpers_misc.py | Extended units dictionary to support decimal units (KB, MB, GB, TB, PB) alongside binary units |
| src/scida/__init__.py | Exported init_resources function for public API |
| tests/test_init.py | Comprehensive test suite for init_resources function and helper functions |
| docs/largedatasets.md | Updated documentation with new Preparation section and revised LocalCluster setup guidance |
| docs/tutorial/simulations.md | Added tip about memory management for large datasets |
| docs/visualization.md | Added note about known issues with recent dask versions |
| pyproject.toml | Updated mkdocstrings version from ^0.24.0 to ^0.30.0 |
| poetry.lock | Updated lock file with new dependency versions |
tests/test_init.py
Outdated
```python
# Should be 50% of 16GB = 8GB, which is at the cap
assert result == "8589934592B"  # 8GB in bytes
```
The test comment says "Should be 50% of 16GB = 8GB, which is at the cap" but the implementation doesn't cap at 8GB. For a 16GB system, 50% would be exactly 8GB, but this is coincidental, not because of capping. The comment is misleading because it implies there's a cap when there isn't one. If the intent is to have a cap, the implementation should enforce it; otherwise, the comment should be updated to accurately reflect that this is just 50% of system memory without capping.
src/scida/init.py
Outdated
```python
else:
    cluster_args["dashboard_address"] = None

memory_limit_bytes = parse_humansize(memory_limit)
```
The function signature specifies that memory_limit can be either a string or an integer (Union[str, int]), but the parse_humansize function only handles string input. If an integer is passed (representing bytes), calling .upper() on line 176 will fail with an AttributeError. The function should either handle integer input by checking the type and returning it directly if it's already an int, or the type hint should be updated to only accept strings.
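The int-handling fix this comment asks for can be sketched with a simplified stand-in for `parse_humansize` (not scida's actual implementation): integers are treated as byte counts and returned unchanged, and only strings are parsed, covering both decimal and binary units as described in the PR overview.

```python
def parse_humansize(size):
    """Parse a human-readable size into bytes.

    Simplified stand-in for scida's helper: supports decimal (KB, MB, GB)
    and binary (KiB, MiB, GiB) units, and passes integers through unchanged.
    """
    # Integers already represent bytes; returning them directly avoids
    # the AttributeError from calling .upper() on an int.
    if isinstance(size, int):
        return size
    units = {
        "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
        "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
        "B": 1,
    }
    s = size.strip().upper()
    # Check longer suffixes first so "GIB" is not matched as "B"
    for unit in sorted(units, key=len, reverse=True):
        if s.endswith(unit):
            return int(float(s[: -len(unit)]) * units[unit])
    raise ValueError(f"Could not parse size: {size!r}")

print(parse_humansize("4GB"))   # 4000000000
print(parse_humansize("4GiB"))  # 4294967296
print(parse_humansize(1024))    # 1024
```

Alternatively, the type hint could be narrowed to `str` only, pushing the conversion responsibility to the caller.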
docs/largedatasets.md
Outdated
```markdown
scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
**Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
For TNGLab, we impose a default memory limit of 4GB via this mechanism. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
```
The documentation states "we impose a default memory limit of 4GB via this mechanism" for TNGLab, but the code shows that users can override this default by passing their own memory_limit and n_workers values. The documentation should clarify that this is a default that can be overridden if needed, or it should match the actual enforcement logic in the code.
Suggested change:
```diff
- For TNGLab, we impose a default memory limit of 4GB via this mechanism. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
+ For TNGLab, this mechanism configures a default memory limit of 4GB, which you can override by passing `memory_limit` (and optionally `n_workers`) to `scida.init_resources()`. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
```
```markdown
### Manual LocalCluster setup

While the result is eventually computed, it is a bit slow, primarily because the actual reading of the data off disk is the limiting factor, and we can only use resources available on our local machine.
If you need fine-grained control beyond what `scida.init_resources()` provides, you can manually configure the Local distributed scheduler:
```
The documentation section title changed from "Running a LocalCluster" to "Manual LocalCluster setup", but the explanation text "If you need fine-grained control beyond what scida.init_resources() provides" now refers to a function that wasn't discussed yet in this section. Consider adding a brief introduction about scida.init_resources() before this subsection, or adding a forward reference explaining that this is an alternative to the init_resources() function introduced in the Preparation section.
Suggested change:
```diff
- If you need fine-grained control beyond what `scida.init_resources()` provides, you can manually configure the Local distributed scheduler:
+ If you need fine-grained control beyond what `scida.init_resources()` (introduced in the Preparation section above) provides, you can manually configure the local distributed scheduler:
```
src/scida/init.py
Outdated
```python
-----
- Once initialized, all subsequent dask operations will use the configured scheduler
- Call this function before loading datasets for results
- On TNGLab, we force a memory limit of 4GB and 4 workers.
```
The docstring states "we force a memory limit of 4GB and 4 workers" but the implementation uses the `or` operator, which means user-provided values are respected. The code at lines 64-65 only sets defaults when values are `None`, not enforcing fixed values. The docstring should say "we default to a memory limit of 4GB and 4 workers" or "we set default values of 4GB memory limit and 4 workers if not specified" to accurately reflect the implementation.
```python
total_memory = psutil.virtual_memory().total
# Use about 50% of system memory, but cap at 8GB per worker
memory_per_worker = total_memory * 0.5
```
The comment says "cap at 8GB per worker" but the implementation doesn't actually cap the memory at 8GB. The function returns 50% of total memory without any capping logic. This means on a system with 64GB of RAM, it would return 32GB, not 8GB as the comment suggests. Either add the capping logic (e.g., min(memory_per_worker, 8 * 1024**3)) or update the comment to accurately describe the behavior.
Suggested change:
```diff
- memory_per_worker = total_memory * 0.5
+ half_memory = total_memory * 0.5
+ memory_per_worker = min(half_memory, 8 * 1024**3)
```
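The suggested capping behavior can be shown in a self-contained sketch; the function name is hypothetical, and total memory is passed in explicitly rather than read via `psutil` so the logic is easy to verify in isolation:

```python
def default_memory_limit(total_memory_bytes, cap_bytes=8 * 1024**3):
    """Per-worker memory limit: 50% of system memory, capped at 8 GiB."""
    return int(min(total_memory_bytes * 0.5, cap_bytes))

print(default_memory_limit(16 * 1024**3))  # 8589934592 (8 GiB, exactly at the cap)
print(default_memory_limit(64 * 1024**3))  # 8589934592 (capped, not 32 GiB)
print(default_memory_limit(8 * 1024**3))   # 4294967296 (4 GiB, below the cap)
```

With this version, the 16GB test case lands exactly at the cap, which is the coincidence the first review comment points out.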
tests/test_init.py
Outdated
```python
# Should be capped at 8GB per worker
assert result == "8589934592B"  # 8GB in bytes
```
The test expects 8GB (8589934592 bytes) as the result for a 64GB system, but the implementation returns 50% of system memory (32GB) without any capping logic. This test will fail because the implementation doesn't match the expected behavior documented in the comments.
Suggested change:
```diff
- # Should be capped at 8GB per worker
- assert result == "8589934592B"  # 8GB in bytes
+ # Should be 50% of 64GB = 32GB (no cap applied)
+ assert result == "34359738368B"  # 32GB in bytes
```
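The raw byte values in these test expectations can be sanity-checked in two lines; note that both are powers-of-two (GiB-based) counts even though the comments label them "GB":

```python
# Verify the byte counts used in the test expectations
assert 8 * 1024**3 == 8589934592     # the "8GB" expectation (actually 8 GiB)
assert 32 * 1024**3 == 34359738368   # 50% of a 64GB system (actually 32 GiB)
print("byte counts verified")
```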
tests/test_init.py
Outdated
```python
def test_init_disabled(self):
    """Test init with distributed computing disabled."""
    result = init_resources(use_distributed=False)
```
The test passes use_distributed=False to init_resources(), but the function signature doesn't include a use_distributed parameter. This test will fail with a TypeError.
Suggested change:
```diff
- result = init_resources(use_distributed=False)
+ result = init_resources()
```
docs/largedatasets.md
Outdated
```markdown
By default, scida would use all available memory and CPU cores, which is not always desired, particularly on shared systems such as HPC clusters or [TNGLab](https://www.tng-project.org/data/lab/).

scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
**Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
```
Missing space before the period. The sentence should read "Call scida.init_resources() immediately after importing scida and before any other dask operations."
Suggested change:
```diff
- **Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
+ **Call `scida.init_resources()` immediately after importing scida** and before any other dask operations.
```
docs/largedatasets.md
Outdated
```markdown
First, we need to inform scida/dask about the resources it is allowed to use or allocate.
By default, scida would use all available memory and CPU cores, which is not always desired, particularly on shared systems such as HPC clusters or [TNGLab](https://www.tng-project.org/data/lab/).

scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
```
Grammar issue: "set up" should be "sets up" to maintain subject-verb agreement with "function" (singular).
Suggested change:
```diff
- scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
+ scida provides a convenient `init_resources()` function that sets up reasonable local resource defaults.
```
- Move psutil import from inside _get_default_memory_limit() to module level
- Add PSUTIL_AVAILABLE flag for fallback handling
- Fix test patches to use scida.init.psutil path
- Remove incorrect 8GB cap expectation from large system test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Handle integer memory_limit input (fixes AttributeError on .upper())
- Fix docstring/log: "force" -> "default to... (can be overridden)"
- Fix docs: grammar ("set up" -> "sets up"), remove extra space
- Add code example for init_resources() in docs
- Clarify TNGLab defaults can be overridden
- Fix test using non-existent use_distributed parameter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix test_init_no_dask_distributed to properly simulate missing module using sys.modules patching instead of broken side_effect=ImportError
- Add unit tests for integer memory_limit, **cluster_kwargs forwarding, explicit threads_per_worker, and decimal-prefix parse_humansize
- Add integration tests that create real LocalCluster instances and verify memory limits are applied to workers via scheduler_info
- Integration test runs da.histogram2d under memory-constrained workers and validates correctness against numpy reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
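The sys.modules patching technique mentioned in this commit can be sketched as follows (module name illustrative): mapping an entry in `sys.modules` to `None` makes any import of that name raise `ModuleNotFoundError` (a subclass of `ImportError`), which is how a test can reliably simulate a missing optional dependency:

```python
import sys
from unittest import mock

# Patch sys.modules so the import system sees the module as "known missing";
# per Python's import machinery, a None entry triggers ModuleNotFoundError.
with mock.patch.dict(sys.modules, {"distributed": None}):
    try:
        import distributed  # noqa: F401
        import_ok = True
    except ImportError:
        import_ok = False

print(import_ok)  # False
```

Unlike `side_effect=ImportError` on a mocked import, this works regardless of whether the real package is installed, because the import system consults `sys.modules` first.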
Provide a function to limit memory usage via `scida.init_resources()`. Will fix #197
Missing: