What I tried
- I tried to run Failure Mode Analysis (TrajFM).
- The pipeline reaches `failure_mode_reduction`, then tries to embed ~27k titles using `SentenceTransformer('all-MiniLM-L6-v2')`.
- But the Docker container cannot see any GPU, so embedding is done on CPU (or it hangs while trying to download/load the model).
Where the issue happens
`AssetOpsBench/src/TrajFM/failure_mode_reduction.py`, around lines 113–120 (`SentenceTransformer` init + `encode`).
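For reference, I believe the relevant code is roughly the following (a paraphrase, not the exact source; the variable names are mine):

```python
# Rough paraphrase of failure_mode_reduction.py around lines 113–120 (names are approximate).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')   # downloads into the HF cache on first use
embeddings = model.encode(titles)                 # ~27k titles; falls back to CPU when no GPU is visible
```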
Evidence / logs
- From the container logs:
  - `torch.cuda.is_available` is `False`
  - The default HF cache looks like `/root/.cache/huggingface/hub` (not pre-populated)
- On my host:
  - `nvidia-smi` → command not found
So currently, the container cannot access an NVIDIA GPU, and installing/using the embedding model becomes problematic.
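For completeness, a minimal check along these lines is how I confirmed the above inside the container (sketch only):

```python
# Minimal check of GPU visibility and HF cache state inside the container.
import os
import torch

print("torch.cuda.is_available:", torch.cuda.is_available())   # False in my setup
print("HF_HOME:", os.environ.get("HF_HOME"))                    # unset -> defaults to ~/.cache/huggingface
print("hub cache exists:", os.path.isdir(os.path.expanduser("~/.cache/huggingface/hub")))
```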
Expected behavior
- Failure mode reduction should finish reasonably fast (minutes, not “forever”), ideally using GPU when available.
Actual behavior
- The embedding step is extremely slow / appears stuck (likely CPU-only + a large batch).
- GPU is not visible from Docker.
Questions / help needed
- What is the recommended way to run TrajFM failure mode reduction with GPU?
- If my environment cannot provide a GPU to Docker, what is the recommended workaround (pre-download the model into the image, mount the HF cache, reduce the batch size, etc.)? A rough sketch of what I have in mind is below.
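To make the second question concrete, this is the kind of CPU-only workaround I'm imagining; the `batch_size` value is an assumption, and `titles` stands in for the ~27k strings:

```python
# Sketch of a CPU-only workaround, assuming the model is already pre-downloaded
# into the image or available via a mounted HF cache.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
embeddings = model.encode(
    titles,                    # ~27k titles
    batch_size=32,             # smaller batches to keep memory bounded on CPU
    show_progress_bar=True,    # makes a slow run visibly progress instead of looking stuck
)
```

Pre-downloading could be as simple as instantiating `SentenceTransformer('all-MiniLM-L6-v2')` once during the image build, or mounting the host's `~/.cache/huggingface` into the container.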
Optional request (if possible)
- If feasible, could we avoid local `SentenceTransformer` embedding entirely by adding an embedding option to the existing `watsonx_llm` flow (so embeddings can be computed on that service's side / its GPU), instead of requiring a GPU inside Docker?
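To illustrate the idea only (the model id, credential handling, and `Embeddings` usage below are my assumptions based on the public `ibm-watsonx-ai` SDK, not an existing option in this repo):

```python
# Illustrative sketch: offload embedding to watsonx.ai instead of a local SentenceTransformer.
# All identifiers below are placeholders / assumptions, not part of AssetOpsBench today.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings

embedder = Embeddings(
    model_id="ibm/slate-125m-english-rtrvr",   # placeholder embedding model id
    credentials=Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="..."),
    project_id="<project-id>",
)
vectors = embedder.embed_documents(texts=titles)   # would replace the local model.encode(titles)
```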