Conversation

@mateus-cardoso-reef

No description provided.


```python
try:
    logger.debug(f"Requesting volume preparation from Volume Manager for job {job_uuid}")
    response = await self.volume_manager_client.prepare_volume(
```

Contributor

I think the timeout for completing the volume download should be passed here, or is it done somewhere implicitly?

The _make_request method in VolumeManagerClient has a default timeout value - which was quite high at 300 seconds - so I’m reducing it to 60. Since you brought it up, I’m also adding the timeout parameter to the prepare_volume method (which calls _make_request), to make it easier to understand for anyone using it.
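
For illustration, a minimal sketch of that intermediate change (the thread revises the approach further down); the `_make_request` signature and the `/prepare_volume` endpoint are taken from elsewhere in this PR, while the constructor, the request-helper body, and the constant name are simplified stand-ins, not the real client:

```python
import httpx

DEFAULT_VOLUME_MANAGER_TIMEOUT = 60.0  # hypothetical constant; lowered from the old 300 s default


class VolumeManagerClient:
    def __init__(self, base_url: str, headers: dict[str, str] | None = None):
        self.base_url = base_url
        self.headers = headers or {}

    async def _make_request(self, url: str, payload: dict, operation: str, timeout: float) -> dict:
        # Simplified stand-in for the real request helper.
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.post(url, json=payload, headers=self.headers)
            response.raise_for_status()
            return response.json()

    async def prepare_volume(
        self, job_uuid: str, volume: dict, timeout: float = DEFAULT_VOLUME_MANAGER_TIMEOUT
    ) -> list:
        # The timeout is now an explicit parameter instead of being buried inside _make_request.
        payload = {"job_uuid": job_uuid, "volume": volume}
        response_data = await self._make_request(
            f"{self.base_url}/prepare_volume", payload, "prepare_volume", timeout=timeout
        )
        return response_data["mounts"]
```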

Contributor

a job request has a timeout for performing the downloads, and I think that should be passed here.

Contributor

If you mean the total timeout to download the volumes for the job, as specified by the consumer in the job spec: this is achieved with an overarching timeout for the whole download stage in the job driver. If that timeout triggers, the task will be cancelled; we don't pass the timeout value around. If the cancellation is somehow swallowed, the validator will give up waiting for the download anyway, and there is (or was supposed to be) a "reaper" for the executor process in case it hangs because of this.
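
For illustration only, a rough sketch of that overarching download-stage timeout with made-up names (the real job driver is more involved); it relies on standard asyncio cancellation semantics:

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_download_stage(
    download_volumes: Callable[[], Awaitable[None]],
    download_time_limit: float,
) -> None:
    # One timeout wraps the whole download stage. If it fires, the inner task
    # (including any in-flight HTTP request to the volume manager) is cancelled,
    # so no per-call timeout needs to be passed down to prepare_volume.
    try:
        await asyncio.wait_for(download_volumes(), timeout=download_time_limit)
    except asyncio.TimeoutError:
        # The validator gives up waiting on its side as well; the executor
        # just reports the failure (and a "reaper" covers the hang case).
        raise
```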

Contributor

@kkowalski-reef so you're saying the timeout should not be passed here? I mean, before it was 300s and now it's 60s. What if the total timeout for the download as sent by the consumer is 600s? What if it's 9999999s?

@mpnowacki-reef @kkowalski-reef okay, so I understand the best approach is to remove the explicit timeout from prepare_volume and _make_request, which means that the client.post method will use its default USE_CLIENT_DEFAULT timeout instead. Let me know if you see any issues

Contributor

okay, let me think about this:

  • The consumer currently tells us what the expected volume download time should be. This is a combined timeout between when the executor is informed about the volumes to when the volumes are ready and the job container can be started, whatever happens in between.
  • It makes no sense for them to give us a stupidly high number there, because this is deducted from the allowance. (I'm also pretty sure we have a ~20 minute safety net somewhere?)
  • Normally we would simply download the volumes and unpack them in that timeframe; the volume manager client will instead ask an external service to do the hard work and receive docker mount options in return
  • This means that it's now the volume manager that will potentially take a significant time to prepare the volumes
    • ... meaning the http response from the manager will likely come back in around the same time it used to take the executor to do the download+unpack. A value that will change depending on the job.
  • We already wait for that exact amount of time in the job driver a couple of levels above
  • When that timeout fires, the whole job task is cancelled
  • If we're currently awaiting on the http response from the manager, that will be the cancellation point
  • I'd be surprised if HTTPX swallowed the cancellation, so it will likely just disconnect from the manager

So this makes me think that:

  • /prepare_volume should have no overall timeout for the response from the manager service, or it should be the same as the download stage timeout (redundant IMO), or we can just set the "connect" timeout to some constant low value for safety
  • the manager service should deal with the executor disconnecting while waiting for the volume - maybe stop and clean up the download, or keep it for some time just in case the job is retried?

Contributor

Would be useful to also mention this in the docs - the manager service must not time out on its own during the preparation of the volumes.

Contributor

this is coming from their allowance so if they want to and can afford it...

I mean, if you wanna start a streaming job that will last 10h, and you'll keep topping it up, and it requires a 100 GB download to start, what can you do? The volume manager client explicitly timing out after 5m or 1m or whatever value other than what the consumer ordered is a straight path to unexpected failures.

Author

Okay, so I will be removing the timeout from the volume manager client, leaving just the connect=30.0 for safety. I'm also adding a note in the docs about the issue.
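
For illustration, a sketch of what that could look like with httpx (which the thread mentions the client uses); the connect=30.0 value is the one stated above, and the free function here is a simplified stand-in rather than the real client method:

```python
import httpx

# No overall/read timeout: the manager may legitimately take as long as the
# job's download stage allows, and the overarching job-driver timeout will
# cancel the awaiting task if needed. Only the connect timeout is bounded.
VOLUME_MANAGER_TIMEOUT = httpx.Timeout(None, connect=30.0)


async def prepare_volume(client: httpx.AsyncClient, url: str, payload: dict) -> dict:
    # Hypothetical free-function form; the real method lives on VolumeManagerClient.
    response = await client.post(url, json=payload, timeout=VOLUME_MANAGER_TIMEOUT)
    response.raise_for_status()
    return response.json()
```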

```json
{
    "job_uuid": "7b522daa-e807-4094-8d96-99b9a863f960",
    "volume": {
        "volume_type": "huggingface_volume",
```

Contributor

So if a user requests several volumes, it's (unfortunately, and through no fault of yours) wrapped in MultiVolume and will be sent nested to the manager?

Exactly, the volumes will be nested, since it wouldn’t make much sense to send multiple consecutive requests to the manager.

Author

Just a quick update: I double-checked, and it actually always sends the volume wrapped in MultiVolume; that happens in models.py in the facilitator.
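
For illustration only, the kind of nested payload this implies, written as a Python dict; the multi_volume discriminator and the inner volume fields are assumptions extrapolated from the single-volume example above, not a confirmed schema:

```python
# Hypothetical request body when a job declares several volumes; the field
# names inside the inner volumes are placeholders.
prepare_volume_payload = {
    "job_uuid": "7b522daa-e807-4094-8d96-99b9a863f960",
    "volume": {
        "volume_type": "multi_volume",  # assumed wrapper discriminator
        "volumes": [
            {"volume_type": "huggingface_volume", "repo_id": "some-org/model-a"},
            {"volume_type": "huggingface_volume", "repo_id": "some-org/model-b"},
        ],
    },
}
```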

```
@@ -0,0 +1,74 @@
import asyncio
```

Contributor

This script isn't used anywhere in tests or whatever, right? I think it's gonna rot in that case.

I originally implemented this script thinking it might be useful for anyone testing volume downloads, but it turns out it wasn’t mentioned in any README.md and had a fairly narrow use case. So I’m removing the script and replacing it with an argparse --test-volume flag. I’ve also updated the README.md with instructions on how to use it.
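
For illustration, a minimal sketch of how such a flag could be wired into an entry point; the --test-volume name and the run_volume_test coroutine are from this PR, everything else here is illustrative:

```python
import argparse
import asyncio


async def run_volume_test() -> None:
    # Stand-in for the coroutine that replaced the standalone script.
    ...


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--test-volume",
        action="store_true",
        help="Run a one-off volume download test against the volume manager and exit.",
    )
    args = parser.parse_args()

    if args.test_volume:
        asyncio.run(run_volume_test())
        return
    # ...normal entry point continues here...


if __name__ == "__main__":
    main()
```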

```python
return VolumeManagerResponse(mounts=mounts)
response_data = await self._make_request(url, payload, "prepare_volume", timeout=timeout)

# Convert string mount types to MountType objects
```

Contributor

I don't see any such conversion here, what am I missing?

Comment on lines 86 to 90

```python
mounts = []
for mount_data in response_data["mounts"]:
    mounts.append(mount_data)

return mounts
```

Contributor

Suggested change

```diff
-mounts = []
-for mount_data in response_data["mounts"]:
-    mounts.append(mount_data)
-return mounts
+return response_data["mounts"]
```

You're right. This originally came from the previous version where I was constructing VolumeManagerResponse with VolumeManagerMount objects; the loop is now unnecessary.
Cleaning it up.

```python
assert isinstance(result, VolumeManagerResponse)
assert len(result.mounts) == 1
assert result.mounts[0].type == "bind"
assert isinstance(result, list)
```

Contributor

is this AI? this looks like an unsupervised agent left for testing. why isn't this just

```python
assert result == [["-v", "/host/path:/container/path"]]
```

Good point, I tried to retain the original assertion pattern but the code ended up more complex than needed.
Your version is better, updating it

```python
assert result.mounts[0].type == "bind"
assert result.mounts[0].source == "/host/models"
assert result.mounts[0].target == "/volume/models"
assert result[0] == ["-v", "/host/models:/volume/models"]
```

Contributor

what's happening here?

Same pattern as the other test, I over-complicated the assertion logic after the VolumeManagerResponse removal.
I'm also updating this test.


```python
os.system("docker pull backenddevelopersltd/compute-horde-streaming-job-test:v0-latest")

compute_horde_streaming_job_spec = ComputeHordeJobSpec(
async def run_volume_test():
```

Contributor

I think you should include the POC volume manager in this repo (in the right place, with a path explaining that this is a POC), start it in this test, and then have the test run in CI. End to end.

I'll review how to include everything in the repo and get back to you on this tomorrow.


```bash
# Authentication header
export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer dupadupakupa'
```

Contributor

Suggested change

```diff
-export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer dupadupakupa'
+export VOLUME_MANAGER_HEADER_AUTHORIZATION='Bearer tokentokentoken'
```
