
Failure mode reduction is extremely slow / cannot use GPU inside Docker (SentenceTransformer embeddings) #146

@ozatamago

Description


What I tried

  1. I tried to run Failure Mode Analysis (TrajFM).
  2. The pipeline reaches failure_mode_reduction, then tries to embed ~27k titles using SentenceTransformer('all-MiniLM-L6-v2').
  3. However, the Docker container cannot see any GPU, so embedding runs on CPU (or it hangs while trying to download/load the model).

Where the issue happens
AssetOpsBench/src/TrajFM/failure_mode_reduction.py, around lines 113–120 (SentenceTransformer init + encode).
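
For context, a minimal sketch of what that section roughly looks like (paraphrased from my reading of the report, not a verbatim copy of the repo code), with explicit device selection and a smaller batch size that I would expect to help on CPU:

```python
import torch
from sentence_transformers import SentenceTransformer

# Use the GPU only if the container can actually see one; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Model name taken from failure_mode_reduction.py as described above.
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

titles = ["bearing overheating", "pump seal leak"]  # placeholder for the ~27k titles

embeddings = model.encode(
    titles,
    batch_size=64,           # smaller batches keep memory pressure manageable on CPU
    show_progress_bar=True,  # makes it obvious whether the step is slow or actually stuck
    convert_to_numpy=True,
)
```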

Evidence / logs

  • From the container logs:
    • torch.cuda.is_available=False
    • Default HF cache looks like /root/.cache/huggingface/hub (not pre-populated)
  • On my host:
    • nvidia-smi: command not found

So currently, the container cannot access an NVIDIA GPU, and installing/using the embedding model becomes problematic.
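
For completeness, the checks I ran inside the container were along these lines (plain diagnostics, not project code):

```python
import os
import torch

print("torch.cuda.is_available:", torch.cuda.is_available())  # prints False here
print("cuda device count:", torch.cuda.device_count())
# HF_HOME / HUGGINGFACE_HUB_CACHE control the hub cache location; when unset,
# the default is /root/.cache/huggingface/hub inside the container.
print("HF_HOME:", os.environ.get("HF_HOME"))
print("HUGGINGFACE_HUB_CACHE:", os.environ.get("HUGGINGFACE_HUB_CACHE"))
```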

Expected behavior

  • Failure mode reduction should finish reasonably fast (minutes, not “forever”), ideally using GPU when available.

Actual behavior

  • The embedding step is extremely slow / appears stuck (likely CPU-only + a large batch).
  • The GPU is not visible from inside Docker.

Questions / help needed

  • What is the recommended way to run TrajFM failure mode reduction with GPU?
  • If my environment cannot provide GPU to Docker, what is the recommended workaround (pre-download the model into the image, mount the HF cache, reduce batch size, etc.)? A sketch of the pre-download approach is below.
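
For the pre-download idea, I imagine something like the following run at image build time (a hypothetical helper script invoked from the Dockerfile; the file name is my own, not an existing project file):

```python
# prefetch_model.py (hypothetical): run once during docker build so the image
# ships with the model already cached and no hub download happens at runtime.
from sentence_transformers import SentenceTransformer

# The download lands under /root/.cache/huggingface by default, or under
# HF_HOME if that is set for the build stage. Mounting that same directory
# from the host at runtime would be the "mount HF cache" variant.
SentenceTransformer("all-MiniLM-L6-v2")
```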

Optional request (if possible)

  • If feasible, could we avoid local SentenceTransformer embedding entirely by adding an embedding option to the existing watsonx_llm flow, so that embeddings are computed on the service side (on its GPUs) instead of requiring a GPU inside Docker? A rough sketch of what that might look like follows.
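
To make the request concrete, something along these lines, assuming the ibm-watsonx-ai SDK's Embeddings class as described in its public docs (the model id, credential handling, and method names below are my assumptions, not existing AssetOpsBench code):

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import Embeddings

# Credentials and project id would presumably come from the same configuration
# the existing watsonx_llm flow already uses; values here are placeholders.
creds = Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<WATSONX_APIKEY>")

embedder = Embeddings(
    model_id="ibm/slate-125m-english-rtrvr",  # any embedding model available to the project
    credentials=creds,
    project_id="<WATSONX_PROJECT_ID>",
)

titles = ["bearing overheating", "pump seal leak"]  # placeholder for the ~27k titles
vectors = embedder.embed_documents(texts=titles)    # computed service-side, no GPU needed in Docker
```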
