@linmx0130
Contributor

During my work, I found that a model launch can easily fail for several reasons (not enough memory, vLLM being incompatible with the model, etc.). This is confusing, because the Python CLI only prints that the model has started and asks users to wait until it is ready.

This PR makes the run model command wait for the vLLM container to become ready, as reported by the container's health check command.

If the model container fails, the Python CLI shows the command for checking the container logs:

$ liquidai model run-model-image --name lfm-7b-e --image "liquidai/lfm-7b-e:0.0.1"
Creating volume for model data: lfm-7b-e
Loading model data from image: liquidai/lfm-7b-e:0.0.1
Launching model container: lfm-7b-e
Model 'lfm-7b-e' started successfully
Waiting for model 'lfm-7b-e' to be healthy. This may take a 1-2 minutes...
Container lfm-7b-e is not healthy yet. Status: starting
Container lfm-7b-e is not healthy yet. Status: starting
Container lfm-7b-e is not healthy yet. Status: starting
Error: Model 'lfm-7b-e' failed to start serving requests
Use `docker logs 4d6a1c51978b` to obtain container loggings.

A successful launch looks like this:

$ liquidai model run-model-image --name "lfm-3b-e" --image "liquidai/lfm-3b-e:0.0.6"
Creating volume for model data: lfm-3b-e
Loading model data from image: liquidai/lfm-3b-e:0.0.6
Launching model container: lfm-3b-e
Model 'lfm-3b-e' started successfully
Waiting for model 'lfm-3b-e' to be healthy. This may take a 1-2 minutes...
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Model 'lfm-3b-e' has started serving requests.
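
For readers who want a picture of the polling behaviour shown in the two transcripts, here is a minimal sketch. It is not the code merged in this PR. It assumes the container defines a Docker health check and reads its status with `docker inspect`. The function names, the 120-second timeout, and the 5-second interval are all hypothetical.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str) -> str:
    """Return Docker's health status for a container.

    Docker reports 'starting', 'healthy' or 'unhealthy' for containers
    that define a health check.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_until_healthy(
    container: str,
    get_status: Callable[[str], str] = docker_health_status,
    timeout_s: float = 120.0,   # hypothetical default, not taken from the PR
    interval_s: float = 5.0,    # hypothetical default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health status until it is 'healthy'.

    Returns True once the container is healthy, and False if it turns
    'unhealthy' or the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(container)
        if status == "healthy":
            return True
        if status == "unhealthy":
            return False
        print(f"Container {container} is not healthy yet. Status: {status}")
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller could print a `docker logs` hint when `wait_until_healthy` returns `False`, as in the failing transcript. Passing `get_status` and `sleep` as parameters keeps the loop testable without a running Docker daemon.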

@linmx0130 linmx0130 requested a review from tuliren May 6, 2025 15:03
Collaborator

@tuliren tuliren left a comment

Excellent!

@linmx0130 linmx0130 merged commit c61833e into python-cli May 6, 2025
3 checks passed
@linmx0130 linmx0130 deleted the mengxiao/model-run-wait-for-health branch May 6, 2025 19:46
3 participants