
Conversation

@dmaniloff (Collaborator) commented Jan 7, 2026

Integration tests against multiple versions of LLS

Add configurable mocking for Llama Stack inference and a new CI job that runs the suite against multiple versions of LLS, including main.

New Features:

  • Introduce library client-based integration tests for inline Ragas evaluation using real llama-stack internals with optional mocked inference.
  • Add direct Ragas evaluation tests using remote LLM and embeddings wrappers to validate metric computation end to end.
  • Provide e2e Kubernetes testing assets using Kind, including cluster lifecycle scripts and manifests for LlamaStackDistribution and a vLLM emulator.
    • This will be left for a separate PR.

Enhancements:

  • Extend test fixtures to support switchable real vs mocked Llama Stack clients and inference, including consistent mock responses for embeddings and completions.
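
For illustration, here is a minimal sketch of how such a switchable fixture could be wired in conftest.py, assuming the `--no-mock-inference` flag discussed in the review below; the fixture and helper names are hypothetical, not the PR's actual code.

```python
# Illustrative sketch only -- fixture and helper names are hypothetical.
import pytest


def pytest_addoption(parser):
    # Mocked inference is the default; pass --no-mock-inference to exercise a real stack.
    parser.addoption(
        "--no-mock-inference",
        action="store_true",
        default=False,
        help="Run tests against real Llama Stack inference instead of mocked responses.",
    )


@pytest.fixture
def use_mock_inference(request):
    return not request.config.getoption("--no-mock-inference")


@pytest.fixture
def llama_stack_client(use_mock_inference):
    if use_mock_inference:
        return build_mocked_client()  # hypothetical helper returning canned completions/embeddings
    return build_real_client()        # hypothetical helper pointing at a live distribution
```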

@sourcery-ai (bot) left a comment


Hey - I've found 5 issues and left some high-level feedback:

  • In conftest.py, the option name is --no-mock-inference but several docstrings and help text still reference --mock-inference; update these references so the documentation matches the actual CLI flag.
  • In tests/e2e/start-kind-cluster.sh, the script prints PORTFORWARD_PID and refers to background port forwarding, but no kubectl port-forward is started and PORTFORWARD_PID is never set; either add the port-forward command and PID capture or remove these messages.
  • The unconditional alias docker="podman" in tests/e2e/start-kind-cluster.sh may break environments that use Docker without Podman; consider detecting which runtime is available or making this alias optional/configurable.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `conftest.py`, the option name is `--no-mock-inference` but several docstrings and help text still reference `--mock-inference`; update these references so the documentation matches the actual CLI flag.
- In `tests/e2e/start-kind-cluster.sh`, the script prints `PORTFORWARD_PID` and refers to background port forwarding, but no `kubectl port-forward` is started and `PORTFORWARD_PID` is never set; either add the port-forward command and PID capture or remove these messages.
- The unconditional `alias docker="podman"` in `tests/e2e/start-kind-cluster.sh` may break environments that use Docker without Podman; consider detecting which runtime is available or making this alias optional/configurable.

## Individual Comments

### Comment 1
<location> `distribution/run.yaml:97-105` </location>
<code_context>
     model_type: llm
+  - metadata: {}
+    model_id: llama-3.1-8b-instruct
+    provider_id: vllm
+    provider_model_id: meta-llama/Llama-3.1-8B-Instruct
+    model_type: llm
+  - metadata:
+      embedding_dimension: 1024
+    model_id: nomic-embed-text
+    provider_id: vllm
+    provider_model_id: nomic-ai/nomic-embed-text-v1.5
+    model_type: embedding
   shields: []
   vector_dbs: []
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard `vllm` models so they don’t reference a provider that may not be configured.

The `vllm` provider is only created when `VLLM_URL` is set, but these `registered_resources` entries are always added. In setups without `VLLM_URL`, this can leave models referencing a non-existent provider. Please either register these models only when `VLLM_URL` is configured, or define a safe fallback to avoid configuration-time failures.
</issue_to_address>

### Comment 2
<location> `distribution/run.yaml:43-48` </location>
<code_context>
         backend: kv_default
   inference:
-    - provider_id: ollama
+    - provider_id: ${env.VLLM_URL:+vllm}
+      provider_type: remote::vllm
+      config:
+        url: ${env.VLLM_URL}
+        max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
+        api_token: ${env.VLLM_API_TOKEN:=fake}
+        tls_verify: ${env.VLLM_TLS_VERIFY:=true}
+    - provider_id: ${env.OLLAMA_URL:+ollama}
</code_context>

<issue_to_address>
**question:** Double-check how an empty `provider_id` entry is handled when `VLLM_URL` is unset.

Using `${env.VLLM_URL:+vllm}` disables this provider when `VLLM_URL` is unset, which is likely intended. However, depending on how the config is rendered, this could still leave a list entry with an empty `provider_id` or an otherwise invalid provider block. Please verify that the templating engine omits the entire list item when `VLLM_URL` is unset, or update the template so the whole provider entry is conditional rather than just the `provider_id` field.
</issue_to_address>

### Comment 3
<location> `tests/test_inline_evaluation.py:274-280` </location>
<code_context>

-    assert hasattr(job, "job_id")
-    assert hasattr(job, "status")
     assert job.job_id is not None
+    assert job.status == "in_progress"
+
+    job = library_client.alpha.eval.jobs.status(
+        benchmark_id=benchmark_id, job_id=job.job_id
+    )
+    assert job.status == "completed"
</code_context>

<issue_to_address>
**issue (testing):** Poll for job completion with timeout instead of assuming a single status call transitions from in_progress to completed

Right after `run_eval`, the job is correctly `in_progress`, but the test assumes the very next `jobs.status` call will already be `completed`. In a real (non-mocked) stack this will be flaky. Please poll `jobs.status` with a short sleep and an overall timeout, and assert that it reaches a terminal state (e.g. `completed`) within that window, so the integration test remains reliable with `--no-mock-inference`.
</issue_to_address>
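
A possible shape for that polling loop, reusing the `jobs.status` call already shown in the test; the timeout, sleep interval, and set of terminal states are assumptions:

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}  # assumed terminal states
deadline = time.monotonic() + 120  # assumed overall timeout (seconds)

while time.monotonic() < deadline:
    job = library_client.alpha.eval.jobs.status(
        benchmark_id=benchmark_id, job_id=job.job_id
    )
    if job.status in TERMINAL_STATES:
        break
    time.sleep(2)

assert job.status == "completed"
```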

### Comment 4
<location> `tests/test_inline_evaluation.py:166-172` </location>
<code_context>
+):
+    completion_text = mocked_llm_response
+
+    async def _fake_openai_embeddings(req):  # noqa: ANN001
+        embedding_input = getattr(req, "input", None)
+        n = len(embedding_input) if isinstance(embedding_input, list) else 1
+        data = [
+            SimpleNamespace(embedding=[0.1, 0.2, 0.3], index=i, object="embedding")
+            for i in range(n)
+        ]
</code_context>

<issue_to_address>
**suggestion (testing):** Mocked embedding vectors should respect the configured embedding dimension to avoid hiding dimension-related issues

The `_fake_openai_embeddings` mock always returns 3-dimensional vectors, while the configured model uses `embedding_dimension: 384`. This mismatch can mask dimension-related issues or cause misleading test failures. Please derive the embedding length from the configured model metadata, or at least return vectors of length 384 so tests more accurately reflect real usage.

```suggestion
    async def _fake_openai_embeddings(req):  # noqa: ANN001
        embedding_input = getattr(req, "input", None)
        n = len(embedding_input) if isinstance(embedding_input, list) else 1
        embedding_dim = 384  # match configured embedding dimension
        data = [
            SimpleNamespace(
                embedding=[0.1] * embedding_dim,
                index=i,
                object="embedding",
            )
            for i in range(n)
        ]
```
</issue_to_address>

### Comment 5
<location> `tests/test_remote_wrappers.py:133-135` </location>
<code_context>
-    assert isinstance(result, EvaluationResult)
-    pandas_result = result.to_pandas()
-    logger.info(render_dataframe_as_table(pandas_result))
-    assert metric_to_test.name in pandas_result.columns
-    assert len(pandas_result) == len(evaluation_dataset)
-    assert pandas_result[metric_to_test.name].dtype == float
-
-    # Use small tolerance for floating point comparisons
</code_context>

<issue_to_address>
**issue (testing):** Data type assertion on pandas column is too strict and may fail even when the series is a float dtype

`pandas_result[metric_to_test.name].dtype == float` only matches `float64` exactly; it fails for `float32` or pandas' nullable `Float64` dtype even though those columns are floating-point. To assert the column is floating-point, use `pandas.api.types.is_float_dtype(pandas_result[metric_to_test.name])` or check `pandas_result[metric_to_test.name].dtype.kind == "f"` instead, to avoid spurious test failures.
</issue_to_address>
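
A small sketch of the suggested checks against a hypothetical metric column (the DataFrame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"faithfulness": [0.9, 0.75, 1.0]})  # hypothetical metric column

# dtype == float only matches float64 exactly; float32 or pandas' nullable Float64
# columns are floating-point but would fail that comparison.
assert pd.api.types.is_float_dtype(df["faithfulness"])
assert df["faithfulness"].dtype.kind == "f"
```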

