Conversation

@mateus-cardoso-reef

No description provided.


```python
try:
    logger.debug(f"Requesting volume preparation from Volume Manager for job {job_uuid}")
    response = await self.volume_manager_client.prepare_volume(
```

Contributor

I think the timeout for completing the volume download should be passed here, or is it done somewhere implicitly?

The _make_request method in VolumeManagerClient has a default timeout value - which was quite high at 300 seconds - so I’m reducing it to 60. Since you brought it up, I’m also adding the timeout parameter to the prepare_volume method (which calls _make_request), to make it easier to understand for anyone using it.
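
For illustration, a minimal sketch of that intermediate change (the thread revises the approach further down); the `_make_request` signature and the `/prepare_volume` endpoint are taken from elsewhere in this PR, while the constructor, the request-helper body, and the constant name are simplified stand-ins, not the real client:

```python
import httpx

DEFAULT_VOLUME_MANAGER_TIMEOUT = 60.0  # hypothetical constant; lowered from the old 300 s default


class VolumeManagerClient:
    def __init__(self, base_url: str, headers: dict[str, str] | None = None):
        self.base_url = base_url
        self.headers = headers or {}

    async def _make_request(self, url: str, payload: dict, operation: str, timeout: float) -> dict:
        # Simplified stand-in for the real request helper.
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.post(url, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json()

    async def prepare_volume(
        self, job_uuid: str, volume: dict, timeout: float = DEFAULT_VOLUME_MANAGER_TIMEOUT
    ) -> list:
        # The timeout is now an explicit parameter instead of being buried inside _make_request.
        payload = {"job_uuid": job_uuid, "volume": volume}
        response_data = await self._make_request(
            f"{self.base_url}/prepare_volume", payload, "prepare_volume", timeout=timeout
        )
        return response_data["mounts"]
```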

Contributor

a job request has a timeout for performing the downloads, and I think that should be passed here.

Contributor

If you mean the total timeout to download the volumes for the job, as specified by the consumer in the job spec: this is achieved with an overarching timeout for the whole download stage in the job driver. If that timeout triggers, the task will be cancelled; we don't pass the timeout value around. If the cancellation is somehow swallowed, the validator will give up waiting for the download anyway, and there is (or was supposed to be) a "reaper" for the executor process in case it hangs because of this.
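
For illustration only, a rough sketch of that overarching download-stage timeout with made-up names (the real job driver is more involved); it relies on standard asyncio cancellation semantics:

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_download_stage(
    download_volumes: Callable[[], Awaitable[None]],
    download_time_limit: float,
) -> None:
    # One timeout wraps the whole download stage. If it fires, the inner task
    # (including any in-flight HTTP request to the volume manager) is cancelled,
    # so no per-call timeout needs to be passed down to prepare_volume.
    try:
        await asyncio.wait_for(download_volumes(), timeout=download_time_limit)
    except asyncio.TimeoutError:
        # The validator gives up waiting on its side as well; the executor
        # just reports the failure (and a "reaper" covers the hang case).
        raise
```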

Contributor

@kkowalski-reef so you're saying the timeout should not be passed here? I mean, before it was 300s and now it's 60s. What if the total timeout for the download as sent by the consumer is 600s? What if it's 9999999s?

@mpnowacki-reef @kkowalski-reef okay, so I understand the best approach is to remove the explicit timeout from prepare_volume and _make_request, which means that the client.post method will use its default USE_CLIENT_DEFAULT timeout instead. Let me know if you see any issues

Contributor

okay, let me think about this:

  • The consumer currently tells us what the expected volume download time should be. This is a combined timeout between when the executor is informed about the volumes to when the volumes are ready and the job container can be started, whatever happens in between.
  • It makes no sense for them to give us a stupidly high number there, because this is deducted from the allowance. (I'm also pretty sure we have a ~20 minute safety net somewhere?)
  • Normally we would simply download the volumes and unpack them in that timeframe; the volume manager client will instead ask an external service to do the hard work and receive docker mount options in return
  • This means that it's now the volume manager that will potentially take a significant time to prepare the volumes
    • ... meaning the http response from the manager will likely come back in around the same time it used to take the executor to do the download+unpack. A value that will change depending on the job.
  • We already wait for that exact amount of time in the job driver a couple of levels above
  • When that timeout fires, the whole job task is cancelled
  • If we're currently awaiting on the http response from the manager, that will be the cancellation point
  • I'd be surprised if HTTPX swallowed the cancellation, so it will likely just disconnect from the manager

So this makes me think that:

  • /prepare_volume should have no overall timeout for the response from the manager service, or it should be the same as the download stage timeout (redundant IMO), or we can just set the "connect" timeout to some constant low value for safety
  • the manager service should deal with the executor disconnecting while waiting for the volume - maybe stop and clean up the download, or keep it for some time just in case the job is retried?

Contributor

Would be useful to also mention this in the docs - the manager service must not time out on its own during the preparation of the volumes.

Contributor

this is coming from their allowance so if they want to and can afford it...

I mean, if you wanna start a streaming job that will last 10h, and you'll keep topping it up, and it requires a 100 GB download to start, what can you do? The volume manager client explicitly timing out after 5m or 1m or whatever value other than what the consumer ordered is a straight path to unexpected failures.

Author

Okay, so I will be removing the timeout from the volume manager client, leaving just the connect=30.0 for safety. I'm also adding a note in the docs about the issue.
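
For illustration, a sketch of what that could look like with httpx (which the thread mentions the client uses); the connect=30.0 value is the one stated above, and the free function here is a simplified stand-in rather than the real client method:

```python
import httpx

# No overall/read timeout: the manager may legitimately take as long as the
# job's download stage allows, and the overarching job-driver timeout will
# cancel the awaiting task if needed. Only the connect timeout is bounded.
VOLUME_MANAGER_TIMEOUT = httpx.Timeout(None, connect=30.0)


async def prepare_volume(client: httpx.AsyncClient, url: str, payload: dict) -> dict:
    # Hypothetical free-function form; the real method lives on VolumeManagerClient.
    response = await client.post(url, json=payload, timeout=VOLUME_MANAGER_TIMEOUT)
    response.raise_for_status()
    return response.json()
```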

```json
{
    "job_uuid": "7b522daa-e807-4094-8d96-99b9a863f960",
    "volume": {
        "volume_type": "huggingface_volume",
```

Contributor

So if a user requests several volumes, it's (unfortunately, and through no fault of yours) wrapped in MultiVolume and will be sent nested to the manager?

Exactly, the volumes will be nested, since it wouldn’t make much sense to send multiple consecutive requests to the manager.

Author

Just a quick update: I double-checked, and it actually always sends the volume wrapped in MultiVolume; that happens in models.py in the facilitator.
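
For illustration only, the kind of nested payload this implies, written as a Python dict; the multi_volume discriminator and the inner volume fields are assumptions extrapolated from the single-volume example above, not a confirmed schema:

```python
# Hypothetical request body when a job declares several volumes; the field
# names inside the inner volumes are placeholders.
prepare_volume_payload = {
    "job_uuid": "7b522daa-e807-4094-8d96-99b9a863f960",
    "volume": {
        "volume_type": "multi_volume",  # assumed wrapper discriminator
        "volumes": [
            {"volume_type": "huggingface_volume", "repo_id": "some-org/model-a"},
            {"volume_type": "huggingface_volume", "repo_id": "some-org/model-b"},
        ],
    },
}
```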

```
@@ -0,0 +1,74 @@
import asyncio
```

Contributor

This script isn't used anywhere in tests or whatever, right? I think it's gonna rot in that case.

I originally implemented this script thinking it might be useful for anyone testing volume downloads, but it turns out it wasn’t mentioned in any README.md and had a fairly narrow use case. So I’m removing the script and replacing it with an argparse --test-volume flag. I’ve also updated the README.md with instructions on how to use it.
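
For illustration, a minimal sketch of how such a flag could be wired into an entry point; the --test-volume name and the run_volume_test coroutine are from this PR, everything else here is illustrative:

```python
import argparse
import asyncio


async def run_volume_test() -> None:
    # Stand-in for the coroutine that replaced the standalone script.
    ...


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--test-volume",
        action="store_true",
        help="Run a one-off volume download test against the volume manager and exit.",
    )
    args = parser.parse_args()

    if args.test_volume:
        asyncio.run(run_volume_test())
        return
    # ...normal entry point continues here...


if __name__ == "__main__":
    main()
```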

```python
return VolumeManagerResponse(mounts=mounts)
response_data = await self._make_request(url, payload, "prepare_volume", timeout=timeout)

# Convert string mount types to MountType objects
```

Contributor

I don't see any such conversion here, what am I missing?

Comment on lines 86 to 90

```python
mounts = []
for mount_data in response_data["mounts"]:
    mounts.append(mount_data)

return mounts
```

Contributor

Suggested change

```diff
-mounts = []
-for mount_data in response_data["mounts"]:
-    mounts.append(mount_data)
-return mounts
+return response_data["mounts"]
```

You're right. This originally came from the previous version where I was constructing VolumeManagerResponse with VolumeManagerMount objects; the loop is now unnecessary.
Cleaning it up.

```python
assert isinstance(result, VolumeManagerResponse)
assert len(result.mounts) == 1
assert result.mounts[0].type == "bind"
assert isinstance(result, list)
```

Contributor

is this AI? this looks like an unsupervised agent left for testing. why isn't this just

```python
assert result == [["-v", "/host/path:/container/path"]]
```

Good point, I tried to retain the original assertion pattern but the code ended up more complex than needed.
Your version is better, updating it

```python
assert result.mounts[0].type == "bind"
assert result.mounts[0].source == "/host/models"
assert result.mounts[0].target == "/volume/models"
assert result[0] == ["-v", "/host/models:/volume/models"]
```

Contributor

what's happening here?

Same pattern as the other test, I over-complicated the assertion logic after the VolumeManagerResponse removal.
I'm also updating this test.


```python
os.system("docker pull backenddevelopersltd/compute-horde-streaming-job-test:v0-latest")

compute_horde_streaming_job_spec = ComputeHordeJobSpec(
async def run_volume_test():
```

Contributor

I think you should include the POC volume manager in this repo (in the right place, with a path explaining that this is a POC), start it in this test, and then have the test run in CI. End to end.

I'll review how to include everything in the repo and get back to you on this tomorrow.


```bash
# Authentication header
export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer dupadupakupa'
```

Contributor

Suggested change

```diff
-export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer dupadupakupa'
+export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer tokentokentoken'
```
