Wait for model container to be healthy when running models #32

linmx0130 · 2025-04-29T22:26:51Z

During my work, I found that a model launch can easily fail for several reasons (not enough memory, vLLM not compatible with the model, etc.), but the failure is confusing because the Python CLI only prints that the model has started and asks users to wait for it to be ready.

This PR makes the run model command wait for the vLLM container to become ready, based on the container's health check command.

If the model container fails, the Python CLI shows the command for checking the container logs:

$ liquidai model run-model-image --name lfm-7b-e --image "liquidai/lfm-7b-e:0.0.1"
Creating volume for model data: lfm-7b-e
Loading model data from image: liquidai/lfm-7b-e:0.0.1
Launching model container: lfm-7b-e
Model 'lfm-7b-e' started successfully
Waiting for model 'lfm-7b-e' to be healthy. This may take a 1-2 minutes...
Container lfm-7b-e is not healthy yet. Status: starting
Container lfm-7b-e is not healthy yet. Status: starting
Container lfm-7b-e is not healthy yet. Status: starting
Error: Model 'lfm-7b-e' failed to start serving requests
Use `docker logs 4d6a1c51978b` to obtain container loggings.

A successful launch looks like this:

$ liquidai model run-model-image --name "lfm-3b-e" --image "liquidai/lfm-3b-e:0.0.6"
Creating volume for model data: lfm-3b-e
Loading model data from image: liquidai/lfm-3b-e:0.0.6
Launching model container: lfm-3b-e
Model 'lfm-3b-e' started successfully
Waiting for model 'lfm-3b-e' to be healthy. This may take a 1-2 minutes...
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Container lfm-3b-e is not healthy yet. Status: starting
Model 'lfm-3b-e' has started serving requests.
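
For readers who want to see the shape of the waiting logic, here is a minimal sketch of polling a container's Docker health status until it is healthy, unhealthy, or a timeout expires. It is not the code from this PR: the function names, the 120-second timeout, and the 5-second polling interval are assumptions made for illustration. It relies only on `docker inspect` with the `{{.State.Health.Status}}` template, which reports `starting`, `healthy`, or `unhealthy` for containers that define a health check.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str) -> str:
    """Return the Docker health status of a container.

    Assumes the container defines a health check; otherwise
    `docker inspect` fails on this template and an exception is raised.
    """
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_healthy(
    container: str,
    get_status: Callable[[str], str] = docker_health_status,
    timeout_s: float = 120.0,   # hypothetical default, not taken from the PR
    interval_s: float = 5.0,    # hypothetical default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the container is healthy.

    Returns True once the status is 'healthy', and False if the status
    becomes 'unhealthy' or the timeout expires first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(container)
        if status == "healthy":
            return True
        if status == "unhealthy":
            return False
        print(f"Container {container} is not healthy yet. Status: {status}")
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

When `wait_for_healthy` returns False, a caller could print a hint such as `docker logs <container>` so the user can inspect why the model failed to start, mirroring the error output shown above.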

tuliren

Excellent!

linmx0130 added 4 commits April 29, 2025 22:18

Wait for model container to be health in running models

f089901

Error message format fix

0391c89

code format

024e81e

Wait for model container to be health in huggingface/checkpoint run

d6aa9f6

linmx0130 requested a review from tuliren May 6, 2025 15:03

tuliren approved these changes May 6, 2025

linmx0130 merged commit c61833e into python-cli May 6, 2025
3 checks passed

linmx0130 deleted the mengxiao/model-run-wait-for-health branch May 6, 2025 19:46
