Conversation
Pull request overview
This pull request introduces memory management functionality for scida through a new init_resources() function. The primary goal is to allow users to limit memory usage when working with large datasets, particularly in shared environments like TNGLab or HPC clusters.
Changes:
- Added new `src/scida/init.py` module with `init_resources()` function for configuring Dask distributed computing with memory limits
- Extended `parse_humansize()` utility to support both decimal (KB, MB, GB) and binary (KiB, MiB, GiB) unit formats
- Added comprehensive test suite in `tests/test_init.py` covering various initialization scenarios
- Updated documentation to explain memory management and provide usage guidance
- Updated dependencies (mkdocstrings, matplotlib, pint, griffe, and others)
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/scida/init.py | New module implementing resource initialization with memory limits and TNGLab environment detection |
| src/scida/helpers_misc.py | Extended units dictionary to support decimal units (KB, MB, GB, TB, PB) alongside binary units |
| src/scida/__init__.py | Exported init_resources function for public API |
| tests/test_init.py | Comprehensive test suite for init_resources function and helper functions |
| docs/largedatasets.md | Updated documentation with new Preparation section and revised LocalCluster setup guidance |
| docs/tutorial/simulations.md | Added tip about memory management for large datasets |
| docs/visualization.md | Added note about known issues with recent dask versions |
| pyproject.toml | Updated mkdocstrings version from ^0.24.0 to ^0.30.0 |
| poetry.lock | Updated lock file with new dependency versions |
tests/test_init.py
Outdated
```python
# Should be 50% of 16GB = 8GB, which is at the cap
assert result == "8589934592B"  # 8GB in bytes
```
The test comment says "Should be 50% of 16GB = 8GB, which is at the cap" but the implementation doesn't cap at 8GB. For a 16GB system, 50% would be exactly 8GB, but this is coincidental, not because of capping. The comment is misleading because it implies there's a cap when there isn't one. If the intent is to have a cap, the implementation should enforce it; otherwise, the comment should be updated to accurately reflect that this is just 50% of system memory without capping.
src/scida/init.py
Outdated
```python
else:
    cluster_args["dashboard_address"] = None

memory_limit_bytes = parse_humansize(memory_limit)
```
The function signature specifies that memory_limit can be either a string or an integer (Union[str, int]), but the parse_humansize function only handles string input. If an integer is passed (representing bytes), calling .upper() on line 176 will fail with an AttributeError. The function should either handle integer input by checking the type and returning it directly if it's already an int, or the type hint should be updated to only accept strings.
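The int-handling fix this comment asks for can be sketched with a simplified stand-in for `parse_humansize` (not scida's actual implementation): integers are treated as byte counts and returned unchanged, and only strings are parsed, covering both decimal and binary units as described in the PR overview.

```python
def parse_humansize(size):
    """Parse a human-readable size into bytes.

    Simplified stand-in for scida's helper: supports decimal (KB, MB, GB)
    and binary (KiB, MiB, GiB) units, and passes integers through unchanged.
    """
    # Integers already represent bytes; returning them directly avoids
    # the AttributeError from calling .upper() on an int.
    if isinstance(size, int):
        return size
    units = {
        "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
        "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40, "PIB": 2**50,
        "B": 1,
    }
    s = size.strip().upper()
    # Check longer suffixes first so "GIB" is not matched as "B"
    for unit in sorted(units, key=len, reverse=True):
        if s.endswith(unit):
            return int(float(s[: -len(unit)]) * units[unit])
    raise ValueError(f"Could not parse size: {size!r}")

print(parse_humansize("4GB"))   # 4000000000
print(parse_humansize("4GiB"))  # 4294967296
print(parse_humansize(1024))    # 1024
```

Alternatively, the type hint could be narrowed to `str` only, pushing the conversion responsibility to the caller.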
docs/largedatasets.md
Outdated
```markdown
scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
**Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
For TNGLab, we impose a default memory limit of 4GB via this mechanism. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
```
The documentation states "we impose a default memory limit of 4GB via this mechanism" for TNGLab, but the code shows that users can override this default by passing their own memory_limit and n_workers values. The documentation should clarify that this is a default that can be overridden if needed, or it should match the actual enforcement logic in the code.
Suggested change:
```diff
- For TNGLab, we impose a default memory limit of 4GB via this mechanism. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
+ For TNGLab, this mechanism configures a default memory limit of 4GB, which you can override by passing `memory_limit` (and optionally `n_workers`) to `scida.init_resources()`. Monitor memory usage via the [dask dashboard](https://docs.dask.org/en/latest/dashboard.html).
```
```markdown
### Manual LocalCluster setup

While the result is eventually computed, it is a bit slow, primarily because the actual reading of the data off disk is the limiting factor, and we can only use resources available on our local machine.
If you need fine-grained control beyond what `scida.init_resources()` provides, you can manually configure the Local distributed scheduler:
```
The documentation section title changed from "Running a LocalCluster" to "Manual LocalCluster setup", but the explanation text "If you need fine-grained control beyond what scida.init_resources() provides" now refers to a function that wasn't discussed yet in this section. Consider adding a brief introduction about scida.init_resources() before this subsection, or adding a forward reference explaining that this is an alternative to the init_resources() function introduced in the Preparation section.
Suggested change:
```diff
- If you need fine-grained control beyond what `scida.init_resources()` provides, you can manually configure the Local distributed scheduler:
+ If you need fine-grained control beyond what `scida.init_resources()` (introduced in the Preparation section above) provides, you can manually configure the local distributed scheduler:
```
src/scida/init.py
Outdated
```python
-----
- Once initialized, all subsequent dask operations will use the configured scheduler
- Call this function before loading datasets for results
- On TNGLab, we force a memory limit of 4GB and 4 workers.
```
The docstring states "we force a memory limit of 4GB and 4 workers" but the implementation uses the `or` operator, which means user-provided values are respected. The code at lines 64-65 only sets defaults when values are `None`, not enforcing fixed values. The docstring should say "we default to a memory limit of 4GB and 4 workers" or "we set default values of 4GB memory limit and 4 workers if not specified" to accurately reflect the implementation.
```python
total_memory = psutil.virtual_memory().total
# Use about 50% of system memory, but cap at 8GB per worker
memory_per_worker = total_memory * 0.5
```
The comment says "cap at 8GB per worker" but the implementation doesn't actually cap the memory at 8GB. The function returns 50% of total memory without any capping logic. This means on a system with 64GB of RAM, it would return 32GB, not 8GB as the comment suggests. Either add the capping logic (e.g., min(memory_per_worker, 8 * 1024**3)) or update the comment to accurately describe the behavior.
Suggested change:
```diff
- memory_per_worker = total_memory * 0.5
+ half_memory = total_memory * 0.5
+ memory_per_worker = min(half_memory, 8 * 1024**3)
```
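The suggested capping behavior can be shown in a self-contained sketch; the function name is hypothetical, and total memory is passed in explicitly rather than read via `psutil` so the logic is easy to verify in isolation:

```python
def default_memory_limit(total_memory_bytes, cap_bytes=8 * 1024**3):
    """Per-worker memory limit: 50% of system memory, capped at 8 GiB."""
    return int(min(total_memory_bytes * 0.5, cap_bytes))

print(default_memory_limit(16 * 1024**3))  # 8589934592 (8 GiB, exactly at the cap)
print(default_memory_limit(64 * 1024**3))  # 8589934592 (capped, not 32 GiB)
print(default_memory_limit(8 * 1024**3))   # 4294967296 (4 GiB, below the cap)
```

With this version, the 16GB test case lands exactly at the cap, which is the coincidence the first review comment points out.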
tests/test_init.py
Outdated
```python
# Should be capped at 8GB per worker
assert result == "8589934592B"  # 8GB in bytes
```
The test expects 8GB (8589934592 bytes) as the result for a 64GB system, but the implementation returns 50% of system memory (32GB) without any capping logic. This test will fail because the implementation doesn't match the expected behavior documented in the comments.
Suggested change:
```diff
- # Should be capped at 8GB per worker
- assert result == "8589934592B"  # 8GB in bytes
+ # Should be 50% of 64GB = 32GB (no cap applied)
+ assert result == "34359738368B"  # 32GB in bytes
```
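The raw byte values in these test expectations can be sanity-checked in two lines; note that both are powers-of-two (GiB-based) counts even though the comments label them "GB":

```python
# Verify the byte counts used in the test expectations
assert 8 * 1024**3 == 8589934592     # the "8GB" expectation (actually 8 GiB)
assert 32 * 1024**3 == 34359738368   # 50% of a 64GB system (actually 32 GiB)
print("byte counts verified")
```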
tests/test_init.py
Outdated
```python
def test_init_disabled(self):
    """Test init with distributed computing disabled."""
    result = init_resources(use_distributed=False)
```
The test passes use_distributed=False to init_resources(), but the function signature doesn't include a use_distributed parameter. This test will fail with a TypeError.
Suggested change:
```diff
- result = init_resources(use_distributed=False)
+ result = init_resources()
```
docs/largedatasets.md
Outdated
```markdown
By default, scida would use all available memory and CPU cores, which is not always desired, particularly on shared systems such as HPC clusters or [TNGLab](https://www.tng-project.org/data/lab/).

scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
**Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
```
Missing space before the period. The sentence should read "Call scida.init_resources() immediately after importing scida and before any other dask operations."
Suggested change:
```diff
- **Call `scida.init_resources()` immediately after importing scida** and before any other dask operations .
+ **Call `scida.init_resources()` immediately after importing scida** and before any other dask operations.
```
docs/largedatasets.md
Outdated
```markdown
First, we need to inform scida/dask about the resources it is allowed to use or allocate.
By default, scida would use all available memory and CPU cores, which is not always desired, particularly on shared systems such as HPC clusters or [TNGLab](https://www.tng-project.org/data/lab/).

scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
```
Grammar issue: "set up" should be "sets up" to maintain subject-verb agreement with "function" (singular).
Suggested change:
```diff
- scida provides a convenient `init_resources()` function that set up reasonable local resource defaults.
+ scida provides a convenient `init_resources()` function that sets up reasonable local resource defaults.
```
- Move psutil import from inside _get_default_memory_limit() to module level
- Add PSUTIL_AVAILABLE flag for fallback handling
- Fix test patches to use scida.init.psutil path
- Remove incorrect 8GB cap expectation from large system test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Handle integer memory_limit input (fixes AttributeError on .upper())
- Fix docstring/log: "force" -> "default to... (can be overridden)"
- Fix docs: grammar ("set up" -> "sets up"), remove extra space
- Add code example for init_resources() in docs
- Clarify TNGLab defaults can be overridden
- Fix test using non-existent use_distributed parameter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix test_init_no_dask_distributed to properly simulate missing module using sys.modules patching instead of broken side_effect=ImportError
- Add unit tests for integer memory_limit, **cluster_kwargs forwarding, explicit threads_per_worker, and decimal-prefix parse_humansize
- Add integration tests that create real LocalCluster instances and verify memory limits are applied to workers via scheduler_info
- Integration test runs da.histogram2d under memory-constrained workers and validates correctness against numpy reference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
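The sys.modules patching technique mentioned in this commit can be sketched as follows (module name illustrative): mapping an entry in `sys.modules` to `None` makes any import of that name raise `ModuleNotFoundError` (a subclass of `ImportError`), which is how a test can reliably simulate a missing optional dependency:

```python
import sys
from unittest import mock

# Patch sys.modules so the import system sees the module as "known missing";
# per Python's import machinery, a None entry triggers ModuleNotFoundError.
with mock.patch.dict(sys.modules, {"distributed": None}):
    try:
        import distributed  # noqa: F401
        import_ok = True
    except ImportError:
        import_ok = False

print(import_ok)  # False
```

Unlike `side_effect=ImportError` on a mocked import, this works regardless of whether the real package is installed, because the import system consults `sys.modules` first.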
Provide a function to limit memory usage via `scida.init_resources()`. Will fix #197
Missing: