
Conversation

@dmaniloff (Collaborator) commented Jan 7, 2026

Integration tests against multiple versions of LLS

Add configurable mocking for Llama Stack inference and a new CI job that runs the suite against multiple versions of LLS, including main.

New Features:

  • Introduce library client-based integration tests for inline Ragas evaluation using real llama-stack internals with optional mocked inference.
  • Add direct Ragas evaluation tests using remote LLM and embeddings wrappers to validate metric computation end to end.
  • Provide e2e Kubernetes testing assets using Kind, including cluster lifecycle scripts and manifests for LlamaStackDistribution and a vLLM emulator.
    • This will be left for a separate PR.

Enhancements:

  • Extend test fixtures to support switchable real vs mocked Llama Stack clients and inference, including consistent mock responses for embeddings and completions.
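
For illustration, here is a minimal sketch of how such a switchable fixture could be wired in conftest.py, assuming the `--no-mock-inference` flag discussed in the review below; the fixture and helper names are hypothetical, not the PR's actual code.

```python
# Illustrative sketch only -- fixture and helper names are hypothetical.
import pytest


def pytest_addoption(parser):
    # Mocked inference is the default; pass --no-mock-inference to exercise a real stack.
    parser.addoption(
        "--no-mock-inference",
        action="store_true",
        default=False,
        help="Run tests against real Llama Stack inference instead of mocked responses.",
    )


@pytest.fixture
def use_mock_inference(request):
    return not request.config.getoption("--no-mock-inference")


@pytest.fixture
def llama_stack_client(use_mock_inference):
    if use_mock_inference:
        return build_mocked_client()  # hypothetical helper returning canned completions/embeddings
    return build_real_client()        # hypothetical helper pointing at a live distribution
```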

@sourcery-ai (bot) left a comment


Hey - I've found 5 issues and left some high-level feedback:

  • In conftest.py, the option name is --no-mock-inference but several docstrings and help text still reference --mock-inference; update these references so the documentation matches the actual CLI flag.
  • In tests/e2e/start-kind-cluster.sh, the script prints PORTFORWARD_PID and refers to background port forwarding, but no kubectl port-forward is started and PORTFORWARD_PID is never set; either add the port-forward command and PID capture or remove these messages.
  • The unconditional alias docker="podman" in tests/e2e/start-kind-cluster.sh may break environments that use Docker without Podman; consider detecting which runtime is available or making this alias optional/configurable.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `conftest.py`, the option name is `--no-mock-inference` but several docstrings and help text still reference `--mock-inference`; update these references so the documentation matches the actual CLI flag.
- In `tests/e2e/start-kind-cluster.sh`, the script prints `PORTFORWARD_PID` and refers to background port forwarding, but no `kubectl port-forward` is started and `PORTFORWARD_PID` is never set; either add the port-forward command and PID capture or remove these messages.
- The unconditional `alias docker="podman"` in `tests/e2e/start-kind-cluster.sh` may break environments that use Docker without Podman; consider detecting which runtime is available or making this alias optional/configurable.

## Individual Comments

### Comment 1
<location> `distribution/run.yaml:97-105` </location>
<code_context>
     model_type: llm
+  - metadata: {}
+    model_id: llama-3.1-8b-instruct
+    provider_id: vllm
+    provider_model_id: meta-llama/Llama-3.1-8B-Instruct
+    model_type: llm
+  - metadata:
+      embedding_dimension: 1024
+    model_id: nomic-embed-text
+    provider_id: vllm
+    provider_model_id: nomic-ai/nomic-embed-text-v1.5
+    model_type: embedding
   shields: []
   vector_dbs: []
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard `vllm` models so they don’t reference a provider that may not be configured.

The `vllm` provider is only created when `VLLM_URL` is set, but these `registered_resources` entries are always added. In setups without `VLLM_URL`, this can leave models referencing a non-existent provider. Please either register these models only when `VLLM_URL` is configured, or define a safe fallback to avoid configuration-time failures.
</issue_to_address>

### Comment 2
<location> `distribution/run.yaml:43-48` </location>
<code_context>
         backend: kv_default
   inference:
-    - provider_id: ollama
+    - provider_id: ${env.VLLM_URL:+vllm}
+      provider_type: remote::vllm
+      config:
+        url: ${env.VLLM_URL}
+        max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
+        api_token: ${env.VLLM_API_TOKEN:=fake}
+        tls_verify: ${env.VLLM_TLS_VERIFY:=true}
+    - provider_id: ${env.OLLAMA_URL:+ollama}
</code_context>

<issue_to_address>
**question:** Double-check how an empty `provider_id` entry is handled when `VLLM_URL` is unset.

Using `${env.VLLM_URL:+vllm}` disables this provider when `VLLM_URL` is unset, which is likely intended. However, depending on how the config is rendered, this could still leave a list entry with an empty `provider_id` or an otherwise invalid provider block. Please verify that the templating engine omits the entire list item when `VLLM_URL` is unset, or update the template so the whole provider entry is conditional rather than just the `provider_id` field.
</issue_to_address>

### Comment 3
<location> `tests/test_inline_evaluation.py:274-280` </location>
<code_context>

-    assert hasattr(job, "job_id")
-    assert hasattr(job, "status")
     assert job.job_id is not None
+    assert job.status == "in_progress"
+
+    job = library_client.alpha.eval.jobs.status(
+        benchmark_id=benchmark_id, job_id=job.job_id
+    )
+    assert job.status == "completed"
</code_context>

<issue_to_address>
**issue (testing):** Poll for job completion with timeout instead of assuming a single status call transitions from in_progress to completed

Right after `run_eval`, the job is correctly `in_progress`, but the test assumes the very next `jobs.status` call will already be `completed`. In a real (non-mocked) stack this will be flaky. Please poll `jobs.status` with a short sleep and an overall timeout, and assert that it reaches a terminal state (e.g. `completed`) within that window, so the integration test remains reliable with `--no-mock-inference`.
</issue_to_address>
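
A possible shape for that polling loop, reusing the `jobs.status` call already shown in the test; the timeout, sleep interval, and set of terminal states are assumptions:

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}  # assumed terminal states
deadline = time.monotonic() + 120  # assumed overall timeout (seconds)

while time.monotonic() < deadline:
    job = library_client.alpha.eval.jobs.status(
        benchmark_id=benchmark_id, job_id=job.job_id
    )
    if job.status in TERMINAL_STATES:
        break
    time.sleep(2)

assert job.status == "completed"
```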

### Comment 4
<location> `tests/test_inline_evaluation.py:166-172` </location>
<code_context>
+):
+    completion_text = mocked_llm_response
+
+    async def _fake_openai_embeddings(req):  # noqa: ANN001
+        embedding_input = getattr(req, "input", None)
+        n = len(embedding_input) if isinstance(embedding_input, list) else 1
+        data = [
+            SimpleNamespace(embedding=[0.1, 0.2, 0.3], index=i, object="embedding")
+            for i in range(n)
+        ]
</code_context>

<issue_to_address>
**suggestion (testing):** Mocked embedding vectors should respect the configured embedding dimension to avoid hiding dimension-related issues

The `_fake_openai_embeddings` mock always returns 3-dimensional vectors, while the configured model uses `embedding_dimension: 384`. This mismatch can mask dimension-related issues or cause misleading test failures. Please derive the embedding length from the configured model metadata, or at least return vectors of length 384 so tests more accurately reflect real usage.

```suggestion
    async def _fake_openai_embeddings(req):  # noqa: ANN001
        embedding_input = getattr(req, "input", None)
        n = len(embedding_input) if isinstance(embedding_input, list) else 1
        embedding_dim = 384  # match configured embedding dimension
        data = [
            SimpleNamespace(
                embedding=[0.1] * embedding_dim,
                index=i,
                object="embedding",
            )
            for i in range(n)
        ]
```
</issue_to_address>

### Comment 5
<location> `tests/test_remote_wrappers.py:133-135` </location>
<code_context>
-    assert isinstance(result, EvaluationResult)
-    pandas_result = result.to_pandas()
-    logger.info(render_dataframe_as_table(pandas_result))
-    assert metric_to_test.name in pandas_result.columns
-    assert len(pandas_result) == len(evaluation_dataset)
-    assert pandas_result[metric_to_test.name].dtype == float
-
-    # Use small tolerance for floating point comparisons
</code_context>

<issue_to_address>
**issue (testing):** Data type assertion on pandas column is too strict and may fail even when the series is a float dtype

`pandas_result[metric_to_test.name].dtype == float` only matches `float64` exactly; it fails for `float32` or pandas' nullable `Float64` dtype even though those columns are floating-point. To assert the column is floating-point, use `pandas.api.types.is_float_dtype(pandas_result[metric_to_test.name])` or check `pandas_result[metric_to_test.name].dtype.kind == "f"` instead, to avoid spurious test failures.
</issue_to_address>
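
A small sketch of the suggested checks against a hypothetical metric column (the DataFrame here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"faithfulness": [0.9, 0.75, 1.0]})  # hypothetical metric column

# dtype == float only matches float64 exactly; float32 or pandas' nullable Float64
# columns are floating-point but would fail that comparison.
assert pd.api.types.is_float_dtype(df["faithfulness"])
assert df["faithfulness"].dtype.kind == "f"
```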

