122 changes: 117 additions & 5 deletions .devx/3-agent-evaluation/evaluation_data.md
@@ -1,21 +1,133 @@
# Creating Evaluation Datasets

Effective evaluation metrics require high-quality test data. When creating evaluation datasets, here are some key considerations and best practices to keep in mind:
<img src="_static/robots/operator.png" alt="Dataset Design" style="float:right;max-width:300px;margin:25px;" />

You've learned about the metrics you'll use to evaluate your agents. But how do you measure those metrics in practice?

To produce high-quality, meaningful evaluation metrics, you need a well-designed evaluation dataset. In this lesson, you'll learn how to create evaluation datasets for the two agents you built earlier in this course:

- **IT Help Desk RAG Agent** (Module 2)
- **Report Generation Agent** (Module 1)

We'll start with the RAG agent since it has a simpler evaluation structure, then move to the more complex Report Generation agent.

<!-- fold:break -->

## Dataset Design for Your Agents

Evaluation datasets work best when they align with the following principles:

1. **Cover Diverse Scenarios**: Include common cases, edge cases, and failure modes
2. **Include Ground Truth**: Where possible, provide correct answers for comparison
3. **Represent Real Usage**: Base test cases on actual user interactions
4. **Start Small**: Begin with 20-30 high-quality examples and expand over time
4. **Start Small**: Begin with a small set of high-quality examples and expand over time
5. **Version Control**: Track your datasets alongside your code

In addition to these general principles, you'll need to make sure your data is tailored to your particular evaluation use case. Different agent tasks, and different metrics, require different types of evaluation data.

To understand dataset design, let's look at what you'll need to evaluate the agents you've built.

<!-- fold:break -->

### Dataset for the IT Help Desk RAG Agent (Module 2)

For the IT Help Desk RAG agent from Module 2, each evaluation test case should include:

- **Question**: The user's query (common IT help desk questions)
- **Ground Truth Answer**: The expected response
- **Expected Context Keywords**: Keywords that should appear in retrieved documents
- **Category**: The type of question (password_management, vpn_access, network_issues, etc.)

In your evaluation pipeline, you'll query the RAG agent with each test **Question**. The agent's response can then be compared against the ground truth answer and expected context keywords to evaluate multiple dimensions of its performance.
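
For illustration, a single test case might look like the Python dictionary below. The field names and values here are hypothetical; check the starter JSON file for the exact schema.

```python
# One illustrative RAG evaluation test case (keys and values are hypothetical).
rag_test_case = {
    "question": "How do I reset my VPN password?",
    "ground_truth_answer": (
        "Open the IT self-service portal, choose 'VPN Access', and follow the "
        "password reset workflow. Contact the help desk if your account is locked."
    ),
    "expected_context_keywords": ["VPN", "password reset", "self-service portal"],
    "category": "vpn_access",
}
```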

**Want to see an example?** Check out the <button onclick="openOrCreateFileInJupyterLab('data/evaluation/rag_agent_test_cases.json');"><i class="fa-brands fa-python"></i> RAG Agent Evaluation Dataset</button> to see the structure of the starter dataset we've provided.

### Dataset for the Report Generation Agent (Module 1)

For the Report Generation agent from Module 1, each evaluation test case should include:
- **Topic**: The report subject
- **Expected Sections**: Sections that should appear in the report
- **Quality Criteria**: Custom metrics that define what makes a "good" report on this topic

As with the IT Help Desk agent, in your evaluation pipeline you'll query the Report Generation agent with each test **Topic**. However, note that the Report Generation dataset does not contain a complete ground truth answer: the length and variability of reports make this form of evaluation impractical.

Instead, the test cases include expected section names and quality criteria, which we'll use to evaluate each report's structure and content. You'll dive into using these quality criteria in the next lesson, [Running Evaluations](running_evaluations.md).
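
As a sketch, a report test case might be structured like this; again, the keys shown are illustrative rather than the exact schema of the starter file.

```python
# One illustrative report-generation test case (keys and values are hypothetical).
report_test_case = {
    "topic": "The impact of GPU acceleration on data center energy efficiency",
    "expected_sections": ["Introduction", "Key Findings", "Recommendations", "Conclusion"],
    "quality_criteria": [
        "Includes at least one concrete example or statistic",
        "Discusses trade-offs rather than only benefits",
        "Stays focused on the stated topic",
    ],
}
```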

**Want to see an example?** Check out the <button onclick="openOrCreateFileInJupyterLab('data/evaluation/report_agent_test_cases.json');"><i class="fa-brands fa-python"></i> Report Agent Evaluation Dataset</button> to see the structure of the starter dataset we've provided.

<!-- fold:break -->

## How to Create Your Evaluation Datasets

<img src="_static/robots/study.png" alt="Building Datasets" style="float:right;max-width:300px;margin:25px;" />

Now that you understand what evaluation datasets look like for each agent, let's explore strategies for creating them.

<!-- fold:break -->

TODO: Content + Code Notebooks for Creating Evaluation Datasets with SDG
### Strategy 1: Real-World Data Collection

The first strategy is to use existing, real-world data. This approach is often considered the "gold standard" for realism because it captures authentic user experiences and needs. You can either collect this data yourself or leverage existing datasets from online repositories like HuggingFace.

However, real-world data comes with trade-offs. Manual collection requires substantial human labor and time. Privacy concerns may limit what data you can use, and quality issues, like biased or inappropriate content, can compromise your evaluation. Further, real-world data may simply not exist for your specific use case, particularly for complex or niche applications.

**Use When**: You have access to existing user interactions or logs, need maximum realism and authenticity, have resources for manual curation and privacy review, or when your application is already deployed and collecting real usage data.

<!-- fold:break -->

### Strategy 2: Synthetic Data Generation (SDG)

The second strategy is to generate synthetic data using an LLM. This type of data is highly versatile: it enables developers to rapidly create large datasets, control diversity and coverage, and support use cases where real-world data is difficult or impossible to obtain.

However, synthetic data has its weaknesses as well. It typically requires careful human validation before use, and it may fail to accurately capture real-world behaviors, edge cases, or distributions, which can reduce the reliability of any evaluations that rely on it.

**Use When**: You need to rapidly create test cases, want to control for specific scenarios or edge cases, lack access to real-world data, or need to expand coverage beyond what real data provides.
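
To make the idea concrete, here is a minimal sketch of LLM-based test-case generation. It uses the `openai` Python client against an OpenAI-compatible endpoint as a stand-in; it is not the NeMo Data Designer workflow used in the notebooks, and the prompt, model name, and field names are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

PROMPT = (
    "Generate {n} realistic IT help desk questions about {category}. "
    "Return only a JSON list of objects with the keys: question, "
    "ground_truth_answer, expected_context_keywords, category."
)

def generate_test_cases(category: str, n: int = 5) -> list[dict]:
    """Ask the LLM for synthetic test cases in the target schema."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(n=n, category=category)}],
    )
    # The model is instructed to return JSON; always inspect and validate the
    # output before adding it to your evaluation dataset.
    return json.loads(response.choices[0].message.content)

synthetic_cases = generate_test_cases("password_management")
```

In practice, you would still review and edit the generated cases by hand; that human-validation step is what keeps synthetic data trustworthy.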

<!-- fold:break -->

TODO
### Strategy 3: Hybrid Approach

A hybrid approach combines real-world and synthetic data to leverage the strengths of both while mitigating their weaknesses. It allows developers to ground systems in realistic data, then augment coverage with synthetic examples.

**Use When**: You have some real-world data but need more coverage, want to balance realism with controlled test scenarios, need to fill gaps in your real-world dataset, or want the most robust evaluation approach combining authenticity with comprehensive coverage.

<!-- fold:break -->

TODO
## Generate Evaluation Data for Your Agents

<img src="_static/robots/typewriter.png" alt="Tools for Generation" style="float:left;max-width:250px;margin:25px;" />

Now it's time to create evaluation datasets for the agents you built!

For this workshop, we'll use **Synthetic Data Generation** with **NVIDIA NeMo Data Designer**, an open source tool for generating high-quality synthetic data. This approach is ideal for getting started with agent evaluation and learning the fundamentals.

When you're ready, follow the first notebook to generate data for your RAG agent from Module 2 and get familiar with SDG concepts:
<button onclick="openOrCreateFileInJupyterLab('code/3-agent-evaluation/generate_rag_eval_dataset.ipynb');"><i class="fa-solid fa-flask"></i> RAG Evaluation Data Notebook</button>

Next, generate data for your Report Generation agent from Module 1:
<button onclick="openOrCreateFileInJupyterLab('code/3-agent-evaluation/generate_report_eval_dataset.ipynb');"><i class="fa-solid fa-flask"></i> Report Generation Evaluation Data Notebook</button>

<details>
<summary>💡 NEED SOME HELP?</summary>

We recommend generating your own datasets using the notebooks above to get hands-on experience with the synthetic data generation process. However, if you're running into issues or want to move ahead quickly, we've provided starter datasets you can use:

- <button onclick="openOrCreateFileInJupyterLab('data/evaluation/rag_agent_test_cases.json');"><i class="fa-brands fa-python"></i> RAG Agent Test Cases</button> - 12 IT help desk questions across common categories
- <button onclick="openOrCreateFileInJupyterLab('data/evaluation/report_agent_test_cases.json');"><i class="fa-brands fa-python"></i> Report Agent Test Cases</button> - 6 report topics with quality criteria

These pre-made datasets can also serve as reference examples when you create your own.

</details>

<!-- fold:break -->

## What's Next

Once you have your evaluation datasets ready, you're prepared to run comprehensive evaluations!

In the next lesson, [Running Evaluations](running_evaluations.md), you'll:
- Load your evaluation datasets
- Run your agents on test cases
- Use LLM-as-a-Judge to score responses
- Analyze results and identify improvement areas

65 changes: 25 additions & 40 deletions .devx/3-agent-evaluation/running_evaluations.md
@@ -2,46 +2,38 @@

<img src="_static/robots/blueprint.png" alt="Running Evaluations" style="float:left;max-width:300px;margin:25px;" />

Now it's time to put everything together and run comprehensive evaluations on your agents! In this hands-on lesson, you'll build evaluation pipelines for both the RAG agent from Module 2 and the report generation agent from Module 1.
It's time to put everything together and run comprehensive evaluations on your agents! In this hands-on lesson, you'll build evaluation pipelines for both the RAG agent from Module 2 and the report generation agent from Module 1.

<!-- fold:break -->

## Evaluation Architecture
## Prerequisites

A robust evaluation pipeline consists of several key components:
Before starting, ensure you have:

1. **Agent Under Test**: The agent you're evaluating
2. **Judging Mechanism**: The LLM (or human) that will judge the agent
3. **Test Dataset**: Collection of test cases with inputs and expected outputs
4. **Evaluation Prompts and Metrics**: Instructions on how to score agent outputs
5. **Analysis of Results**: Interpret results for areas of improvement
✅ **Evaluation datasets ready** - You should have created or reviewed test datasets from the previous lesson ([Creating Evaluation Datasets](evaluation_data.md)). You can use:
- Your own datasets generated with the `generate_rag_eval_dataset.ipynb` and `generate_report_eval_dataset.ipynb` notebooks
- The pre-made datasets: `rag_agent_test_cases.json` and `report_agent_test_cases.json`

In Module 1 ("Build an Agent") and Module 2 ("Agentic RAG"), you learned about RAG and agents, building two agents of your own. In this module, you've learned about building trust, defining evaluation metrics, and using LLM-as-a-Judge. Let's put those pieces together and build out your first full evaluation pipeline!
✅ **Agents built** - Your RAG agent from Module 2 and Report Generation agent from Module 1 should be functional

✅ **Metrics selected** - You've learned about evaluation metrics and LLM-as-a-Judge techniques

<!-- fold:break -->
<img src="_static/robots/operator.png" alt="Dataset Design" style="float:right;max-width:300px;margin:25px;" />

In this module, you will work with two evaluation datasets, one for each agent. Each dataset contains a series of test cases that compare the agent’s output to a human-authored “ground truth” or ideal answer, or to clearly defined quality criteria.
Now let's execute the evaluations!

For the IT Help Desk RAG agent from Module 2, each of our evaluation test cases should include:
- Question
- Ground truth answer (if available)
- Expected retrieved contexts (if available)
- Success criteria
<!-- fold:break -->

Check out the <button onclick="openOrCreateFileInJupyterLab('data/evaluation/rag_agent_test_cases.json');"><i class="fa-brands fa-python"></i> RAG Agent Evaluation Dataset</button> to take a look at some examples we will use in our first evaluation pipeline.
## Evaluation Architecture

<!-- fold:break -->
A robust evaluation pipeline consists of several key components:

For the Report Generation task agent from Module 1, each of our evaluation test cases should include:
- Report Topic
- Expected Report Sections
- Output Quality Criteria
1. **Agent Under Test**: The agent you're evaluating
2. **Judging Mechanism**: The LLM that will judge the agent
3. **Test Dataset**: Collection of test cases with inputs and expected outputs (from previous lesson)
4. **Evaluation Prompts and Metrics**: Instructions on how to score agent outputs
5. **Analysis of Results**: Interpret results for areas of improvement

Note that the Report Generation dataset does not contain a complete ground truth answer. The length and variability of generated reports make this impractical for evaluation. Instead, the test cases include expected section names and quality criteria, which we'll use to evaluate each report's structure and content.
Let's start by crafting effective evaluation prompts!
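
As a rough illustration of how the judging mechanism, evaluation prompt, and metric fit together, the sketch below scores a single response with an LLM judge. It uses the `openai` client against an OpenAI-compatible endpoint with a made-up prompt and scale; the actual judge prompts and scoring rubric live in the evaluation notebooks.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint is configured

JUDGE_PROMPT = """You are grading an IT help desk answer.
Question: {question}
Ground truth answer: {ground_truth}
Agent answer: {answer}

Score the agent answer from 1 (incorrect) to 5 (fully correct and helpful).
Reply with only the number."""

def judge_response(question: str, ground_truth: str, answer: str) -> int:
    """Return a 1-5 LLM-as-a-Judge correctness score for one agent response."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, ground_truth=ground_truth, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())
```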

Check out the <button onclick="openOrCreateFileInJupyterLab('data/evaluation/report_agent_test_cases.json');"><i class="fa-brands fa-python"></i> Report Agent Evaluation Dataset</button> to take a look at some examples we will use in our second evaluation pipeline.

<!-- fold:break -->

@@ -150,11 +142,9 @@ Let's start by evaluating the IT Help Desk agent from Module 2. Open the <button

### Load the Test Dataset

We'll need to load our IT Help Desk test dataset for use in evaluation. Recall that each test case includes:
- **Question**: The user's query (common IT help desk questions)
- **Ground Truth Answer**: The expected response
- **Expected Contexts**: Documents that should be retrieved
- **Category**: The type of question (password_management, network_access, etc.)
Load your evaluation dataset - either the one you generated in the previous lesson or the pre-made dataset provided.

The dataset should be located at `data/evaluation/rag_agent_test_cases.json` (or use your custom-generated file).

Load the dataset under the <button onclick="goToLineAndSelect('code/3-agent-evaluation/evaluate_rag_agent.ipynb', 'test_dataset =');"><i class="fas fa-code"></i> Load Test Dataset</button> section.
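
Loading the dataset is plain JSON handling. A minimal sketch, assuming the file is a JSON list of test-case objects (the notebook cell may differ slightly):

```python
import json
from pathlib import Path

# Point this at your own generated file if you created one in the previous lesson.
dataset_path = Path("data/evaluation/rag_agent_test_cases.json")

with dataset_path.open() as f:
    test_dataset = json.load(f)

print(f"Loaded {len(test_dataset)} test cases")
print(test_dataset[0])  # inspect one case to confirm the field names
```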

@@ -263,16 +253,11 @@ Now that you've thoroughly evaluated your IT Help Desk agent using both standard

Open the <button onclick="openOrCreateFileInJupyterLab('code/3-agent-evaluation/evaluate_report_agent.ipynb');"><i class="fa-solid fa-flask"></i> Report Agent Evaluation</button> notebook.

### Define Test Cases

The report generation agent will need a different test dataset than the IT Help Desk agent. Our test cases will include:
- **Topic**: The report subject
- **Expected Sections**: Sections that should appear in the report
- **Quality Criteria**: Custom metrics that define what makes a "good" report on the given topic
### Load Test Cases

We've set up some example test cases for you to use in your evaluation. Run <button onclick="goToLineAndSelect('code/3-agent-evaluation/evaluate_report_agent.ipynb', 'report_test_cases =');"><i class="fas fa-code"></i> Load Test Cases </button> to view details about the test cases.
Load your report generation test dataset - either one you created or the pre-made dataset at `data/evaluation/report_agent_test_cases.json`.

Check which sections we expect the agent to include/exclude, and what quality criteria we've defined.
Run <button onclick="goToLineAndSelect('code/3-agent-evaluation/evaluate_report_agent.ipynb', 'report_test_cases =');"><i class="fas fa-code"></i> Load Test Cases </button> to load and view the test cases.
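
Once the test cases are loaded, one simple structural check is to verify that each expected section heading actually appears in a generated report. A sketch, assuming each test case exposes an `expected_sections` list (the real key name may differ):

```python
def section_coverage(report_text: str, expected_sections: list[str]) -> float:
    """Return the fraction of expected section headings found in the report."""
    if not expected_sections:
        return 1.0
    found = [s for s in expected_sections if s.lower() in report_text.lower()]
    return len(found) / len(expected_sections)

# Hypothetical usage with one loaded test case and its generated report:
# coverage = section_coverage(generated_report, report_test_cases[0]["expected_sections"])
```

Quality criteria, by contrast, lend themselves to LLM-as-a-Judge scoring rather than simple string checks.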

<!-- fold:break -->

@@ -397,7 +382,7 @@ Evaluation results should drive improvements. Here's how to act on common findin

## NeMo Evaluator (Optional)

Congratulations! You have successfully ran evaluations manually using Python scripts to test your agents. This works great for development and experimentation. But what if you need to evaluate not just a few, but <i>thousands</i> of responses?
Congratulations! You have successfully run evaluations manually using Python scripts to test your agents. This works great for development and experimentation. But what if you need to evaluate not just a few, but <i>thousands</i> of responses?

Once you’re scoring large volumes of outputs, comparing multiple agent versions, or running evaluations on a schedule in production, you need a more robust, reproducible, and scalable workflow.
