diff --git a/.devx/3-agent-evaluation/evaluation_data.md b/.devx/3-agent-evaluation/evaluation_data.md index 11afe14..d247589 100644 --- a/.devx/3-agent-evaluation/evaluation_data.md +++ b/.devx/3-agent-evaluation/evaluation_data.md @@ -1,21 +1,133 @@ # Creating Evaluation Datasets -Effective evaluation metrics require high-quality test data. When creating evaluation datasets, here are some key considerations and best practices to keep in mind: +Dataset Design + +You've learned about the metrics you'll use to evaluate your agents. But how do you measure those metrics in practice? + +To produce high-quality, meaningful evaluation metrics, you need a well-designed evaluation dataset. In this lesson, you'll learn how to create evaluation datasets for the two agents you built earlier in this course: + +- **IT Help Desk RAG Agent** (Module 2) +- **Report Generation Agent** (Module 1) + +We'll start with the RAG agent since it has a simpler evaluation structure, then move to the more complex Report Generation agent. + + + +## Dataset Design for Your Agents + +Evaluation datasets work best when they align with the following principles: 1. **Cover Diverse Scenarios**: Include common cases, edge cases, and failure modes 2. **Include Ground Truth**: Where possible, provide correct answers for comparison 3. **Represent Real Usage**: Base test cases on actual user interactions -4. **Start Small**: Begin with 20-30 high-quality examples and expand over time +4. **Start Small**: Begin with a small set of high-quality examples and expand over time 5. **Version Control**: Track your datasets alongside your code +In addition to these general principles, you'll need to make sure your data is tailored to your particular evaluation use case. Different agent tasks, and different metrics, require different types of evaluation data. + +To understand dataset design, let's look at what you'll need to evaluate the agents you've built. + + + +### Dataset for the IT Help Desk RAG Agent (Module 2) + +For the IT Help Desk RAG agent from Module 2, each evaluation test case should include: + +- **Question**: The user's query (common IT help desk questions) +- **Ground Truth Answer**: The expected response +- **Expected Context Keywords**: Keywords that should appear in retrieved documents +- **Category**: The type of question (password_management, vpn_access, network_issues, etc.) + +In your evaluation pipeline, you'll query the RAG agent with each generated test **Question**. The RAG agent's generated response can then be compared to the test data to evaluate multiple dimensions of the agent's performance. + +**Want to see an example?** Check out the to see the structure of the starter dataset we've provided. + +### Dataset for the Report Generation Agent (Module 1) + +For the Report Generation agent from Module 1, each evaluation test case should include: +- **Topic**: The report subject +- **Expected Sections**: Sections that should appear in the report +- **Quality Criteria**: Custom metrics that define what makes a "good" report on this topic + +Like for the IT Help Desk agent, in your evaluation pipeline, you'll query the Report Generation agent with each test **Topic**. However, note that the Report Generation dataset does not contain a complete ground truth answer. The length and variability of reports makes this form of evaluation impractical. + +Instead, the test cases include expected section names and quality criteria, which we'll use to evaluate each report's structure and content. 
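To make this concrete, a single Report Generation test case might look something like the following sketch (an illustrative entry only; the topic, sections, and criteria shown here are hypothetical, not taken from the starter dataset):

```json
{
  "topic": "The Impact of Quantum Computing on Modern Cryptography",
  "expected_sections": [
    "Executive Summary",
    "Current State of Quantum Computing",
    "Implications for Existing Cryptographic Standards",
    "Mitigation and Migration Strategies",
    "Conclusion and Outlook"
  ],
  "quality_criteria": {
    "should_include": [
      "Concrete examples of at-risk algorithms",
      "A realistic discussion of timelines"
    ],
    "should_avoid": [
      "Unsupported claims",
      "Marketing language"
    ]
  }
}
```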
You'll dive into using these quality criteria in the next lesson, [Running Evaluations](running_evaluations.md). + +**Want to see an example?** Check out the to see the structure of the starter dataset we've provided. + + + +## How to Create Your Evaluation Datasets + +Building Datasets + +Now that you understand what evaluation datasets look like for each agent, let's explore strategies for creating them. + -TODO: Content + Code Notebooks for Creating Evaluation Datasets with SDG +### Strategy 1: Real-World Data Collection + +The first strategy is to use existing, real-world data. This approach is often considered the "gold standard" for realism because it captures authentic user experiences and needs. You can either collect this data yourself or leverage existing datasets from online repositories like HuggingFace. + +However, real-world data comes with trade-offs. Manual collection requires substantial human labor and time. Privacy concerns may limit what data you can use, and quality issues, like biased or inappropriate content, can compromise your evaluation. Further, real-world data may simply not exist for your specific use case, particularly for complex or niche applications. + +**Use When**: You have access to existing user interactions or logs, need maximum realism and authenticity, have resources for manual curation and privacy review, or when your application is already deployed and collecting real usage data. + + + +### Strategy 2: Synthetic Data Generation (SDG) + +The second strategy is to generate synthetic data using an LLM. This type of data is highly versatile: it enables developers to rapidly create large datasets, control diversity and coverage, and support use cases where real-world data is difficult or impossible to obtain. + +However, synthetic data has its weaknesses as well. It typically requires careful human validation before use, and it may fail to accurately capture real-world behaviors, edge cases, or distributions, which can reduce the reliability of any evaluations that rely on it. + +**Use When**: You need to rapidly create test cases, want to control for specific scenarios or edge cases, lack access to real-world data, or need to expand coverage beyond what real data provides. -TODO +### Strategy 3: Hybrid Approach + +A hybrid approach combines real-world and synthetic data to leverage the strengths of both while mitigating their weaknesses. It allows developers to ground systems in realistic data, then augment coverage with synthetic examples. + +**Use When**: You have some real-world data but need more coverage, want to balance realism with controlled test scenarios, need to fill gaps in your real-world dataset, or want the most robust evaluation approach combining authenticity with comprehensive coverage. -TODO +## Generate Evaluation Data for Your Agents + +Tools for Generation + +Now it's time to create evaluation datasets for the agents you built! + +For this workshop, we'll use **Synthetic Data Generation** with **NVIDIA NeMo Data Designer**, an open source tool for generating high-quality synthetic data. This approach is ideal for getting started with agent evaluation and learning the fundamentals. + +When you're ready, follow the first notebook to generate data for your RAG agent from Module 2 and get familiar with SDG concepts: + + +Next, generate data for your Report Generation agent from Module 1: + + +
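Once a notebook finishes, it's worth a quick sanity check of the generated file before moving on. Here's a minimal sketch, assuming you run it from the repository root and kept the default output path used in the RAG notebook (adjust the path if you changed it):

```python
import json
from pathlib import Path

# Assumed default output location from the RAG generation notebook
dataset_path = Path("data/evaluation/synthetic_rag_agent_test_cases.json")

with open(dataset_path) as f:
    test_cases = json.load(f)

# Spot-check the size and shape of the generated dataset
print(f"Loaded {len(test_cases)} test cases")
print(json.dumps(test_cases[0], indent=2))
```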
+💡 NEED SOME HELP? + +We recommend generating your own datasets using the notebooks above to get hands-on experience with the synthetic data generation process. However, if you're running into issues or want to move ahead quickly, we've provided starter datasets you can use: + +- - 12 IT help desk questions across common categories +- - 6 report topics with quality criteria + +These pre-made datasets can also serve as reference examples when you create your own. + +
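Whichever approach you take, each IT Help Desk test case pairs a question with the reference information needed to score the agent's answer and retrieval. A hypothetical entry might look like this (field names and values are illustrative, not the exact schema of the pre-made file):

```json
{
  "question": "How do I reset my VPN password after being locked out?",
  "ground_truth": "Use the self-service portal to reset your VPN password. If your account is locked, contact the help desk to unlock it before retrying.",
  "context_key_words": ["VPN", "password reset", "account locked"],
  "category": "vpn_access"
}
```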
+ + + +## What's Next + +Once you have your evaluation datasets ready, you're prepared to run comprehensive evaluations! + +In the next lesson, [Running Evaluations](running_evaluations.md), you'll: +- Load your evaluation datasets +- Run your agents on test cases +- Use LLM-as-a-Judge to score responses +- Analyze results and identify improvement areas + diff --git a/.devx/3-agent-evaluation/running_evaluations.md b/.devx/3-agent-evaluation/running_evaluations.md index 09d6695..cbd8244 100644 --- a/.devx/3-agent-evaluation/running_evaluations.md +++ b/.devx/3-agent-evaluation/running_evaluations.md @@ -2,46 +2,38 @@ Running Evaluations -Now it's time to put everything together and run comprehensive evaluations on your agents! In this hands-on lesson, you'll build evaluation pipelines for both the RAG agent from Module 2 and the report generation agent from Module 1. +It's time to put everything together and run comprehensive evaluations on your agents! In this hands-on lesson, you'll build evaluation pipelines for both the RAG agent from Module 2 and the report generation agent from Module 1. -## Evaluation Architecture +## Prerequisites -A robust evaluation pipeline consists of several key components: +Before starting, ensure you have: -1. **Agent Under Test**: The agent you're evaluating -2. **Judging Mechanism**: The LLM (or human) that will judge the agent -3. **Test Dataset**: Collection of test cases with inputs and expected outputs -4. **Evaluation Prompts and Metrics**: Instructions on how to score agent outputs -5. **Analysis of Results**: Interpret results for areas of improvement +**Evaluation datasets ready** - You should have created or reviewed test datasets from the previous lesson ([Creating Evaluation Datasets](evaluation_data.md)). You can use: +- Your own generated datasets from the `generate_eval_dataset.ipynb` notebook +- The pre-made datasets: `rag_agent_test_cases.json` and `report_agent_test_cases.json` -In Module 1 ("Build an Agent") and Module 2 ("Agentic RAG"), you learned about RAG and agents, building two agents of your own. In this module, you've learned about building trust, defining evaluation metrics, and using LLM-as-a-Judge. Let's put those pieces together and build out your first full evaluation pipeline! +✅ **Agents built** - Your RAG agent from Module 2 and Report Generation agent from Module 1 should be functional +✅ **Metrics selected** - You've learned about evaluation metrics and LLM-as-a-Judge techniques - -Dataset Design - -In this module, you will work with two evaluation datasets, one for each agent. Each dataset contains a series of test cases that compare the agent’s output to a human-authored “ground truth” or ideal answer, or to clearly defined quality criteria. +Now let's execute the evaluations! -For the IT Help Desk RAG agent from Module 2, each of our evaluation test cases should include: -- Question -- Ground truth answer (if available) -- Expected retrieved contexts (if available) -- Success criteria + -Check out the to take a look at some examples we will use in our first evaluation pipeline. +## Evaluation Architecture - +A robust evaluation pipeline consists of several key components: -For the Report Generation task agent from Module 1, each of our evaluation test cases should include: -- Report Topic -- Expected Report Sections -- Output Quality Criteria +1. **Agent Under Test**: The agent you're evaluating +2. **Judging Mechanism**: The LLM that will judge the agent +3. 
**Test Dataset**: Collection of test cases with inputs and expected outputs (from previous lesson) +4. **Evaluation Prompts and Metrics**: Instructions on how to score agent outputs +5. **Analysis of Results**: Interpret results for areas of improvement -Note that the Report Generation dataset does not contain a complete ground truth answer. The length and variability of generated reports makes this impractical for evaluation. Instead, the test cases include expected section names and quality criteria, which we'll use to evaluate each report's structure and content. +Let's start by crafting effective evaluation prompts! -Check out the to take a look at some examples we will use in our second evaluation pipeline. @@ -150,11 +142,9 @@ Let's start by evaluating the IT Help Desk agent from Module 2. Open the section. @@ -263,16 +253,11 @@ Now that you've thoroughly evaluated your IT Help Desk agent using both standard Open the notebook. -### Define Test Cases - -The report generation agent will need a different test dataset than the IT Help Desk agent. Our test cases will include: -- **Topic**: The report subject -- **Expected Sections**: Sections that should appear in the report -- **Quality Criteria**: Custom metrics that define what makes a "good" report on the given topic +### Load Test Cases -We've set up some example test cases for you to use in your evaluation. Run to view details about the test cases. +Load your report generation test dataset - either one you created or the pre-made dataset at `data/evaluation/report_agent_test_cases.json`. -Check which sections we expect the agent to include/exclude, and what quality criteria we've defined. +Run to load and view the test cases. @@ -397,7 +382,7 @@ Evaluation results should drive improvements. Here's how to act on common findin ## NeMo Evaluator (Optional) -Congratulations! You have successfully ran evaluations manually using Python scripts to test your agents. This works great for development and experimentation. But what if you need to evaluate not just a few, but thousands of responses? +Congratulations! You have successfully run evaluations manually using Python scripts to test your agents. This works great for development and experimentation. But what if you need to evaluate not just a few, but thousands of responses? Once you’re scoring large volumes of outputs, comparing multiple agent versions, or running evaluations on a schedule in production, you need a more robust, reproducible, and scalable workflow. diff --git a/code/3-agent-evaluation/generate_rag_eval_dataset.ipynb b/code/3-agent-evaluation/generate_rag_eval_dataset.ipynb new file mode 100644 index 0000000..0666750 --- /dev/null +++ b/code/3-agent-evaluation/generate_rag_eval_dataset.ipynb @@ -0,0 +1,313 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Generate an Evaluation Dataset for a RAG Agent\n", + "\n", + "In this notebook, you'll create a synthetic evaluation dataset using **NVIDIA NeMo Data Designer**, an open source tool for synthetic data generation. We'll generate test cases for evaluating the **IT Help Desk RAG Agent** from Module 2." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Setup\n", + "\n", + "First, let's set up our environment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r ../../requirements.txt > /dev/null\n", + "\n", + "from dotenv import load_dotenv\n", + "_ = load_dotenv(\"../../variables.env\")\n", + "_ = load_dotenv(\"../../secrets.env\")\n", + "\n", + "import os\n", + "import json\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Configure DataDesigner\n", + "\n", + "Next, initialize Data Designer with default settings.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from data_designer.essentials import (\n", + " DataDesigner,\n", + " DataDesignerConfigBuilder,\n", + " LLMTextColumnConfig,\n", + ")\n", + "\n", + "# Initialize Data Designer with default settings\n", + "data_designer = DataDesigner()\n", + "config_builder = DataDesignerConfigBuilder()\n", + "\n", + "print(\"Completed initialization\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Load IT Knowledge Base as Seed Data\n", + "\n", + "We’ll use articles from the IT knowledge base as seed data for Data Designer. Seed data is the initial set of examples Data Designer uses to guide what it generates.\n", + "\n", + "By seeding Data Designer with our agent's knowledge base, we'll ensure the evaluation data is grounded in information the agent is expected to know.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Load all articles from the knowledge base directory\n", + "kb_dir = Path(\"../../data/it-knowledge-base\")\n", + "kb_articles = []\n", + "\n", + "for kb_file in sorted(kb_dir.glob(\"*.md\")):\n", + " with open(kb_file, 'r') as f:\n", + " content = f.read()\n", + " \n", + " # Extract the article title and create a readable topic name\n", + " topic_name = kb_file.stem.replace('-', '_')\n", + " title = content.split('\\n')[0].replace('# ', '').strip()\n", + " \n", + " kb_articles.append({\n", + " \"topic\": topic_name,\n", + " \"title\": title,\n", + " \"filename\": kb_file.name,\n", + " \"content\": content\n", + " })\n", + "\n", + "# Create seed data DataFrame\n", + "seed_data = pd.DataFrame(kb_articles)\n", + "\n", + "# Data Designer requires seed data to be saved as a file for efficient sampling\n", + "seed_file_path = Path(\"../../data/evaluation/kb_seed_data.parquet\")\n", + "seed_ref = data_designer.make_seed_reference_from_dataframe(\n", + " dataframe=seed_data,\n", + " file_path=seed_file_path\n", + ")\n", + "\n", + "print(f\"Created seed reference with {len(seed_data)} KB articles\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Configure Dataset Structure for RAG Evaluation\n", + "\n", + "We'll use Data Designer’s Config Builder to define the structure of our synthetic dataset. Our configuration will include the following components:\n", + "\n", + "- Category (sampled from a list of knowledge base topics)\n", + "- IT help desk questions (based on category)\n", + "- Context keywords (to evaluate retrieval)\n", + "- Ground truth answers (ideal agent answer)\n", + "\n", + "We do this by defining a list of \"columns\" representing the different fields of data that we want to generate. 
Using Jinja formatting, we can reference items in other columns in the same \"row.\" This allows us to, for example, generate a question and then generate an answer based on that question." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create config builder with seed data\n", + "config_builder = DataDesignerConfigBuilder()\n", + "config_builder.with_seed_dataset(seed_ref)\n", + "\n", + "# Generate question based on the seed article\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"question\",\n", + " model_alias=\"nvidia-text\",\n", + " prompt=\"\"\"Generate a realistic IT help desk question that a user might ask about the topic {{ topic }} from article {{ title }}.\n", + "\n", + "The question should:\n", + "- Sound like a real user would ask it\n", + "- Be specific and clear\n", + "- Cover common scenarios related to this topic\n", + "- Be directly answerable using the information in this KB article\n", + "- NOT just restate the article title\n", + "\n", + "Return ONLY the question, nothing else.\"\"\",\n", + " )\n", + ")\n", + "\n", + "# Generate context keywords for retrieval verification\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"context_key_words\",\n", + " model_alias=\"nvidia-text\",\n", + " prompt=\"\"\"Based on this IT help desk question: {{ question }}\n", + "\n", + "List 2-3 key words or phrases that would help locate the relevant knowledge base article.\n", + "\n", + "Return as a comma-separated list.\n", + "Return ONLY the keywords, nothing else.\"\"\",\n", + " )\n", + ")\n", + "\n", + "# Generate ground truth answer based on the knowledge base\n", + "config_builder.add_column(\n", + " LLMTextColumnConfig(\n", + " name=\"ground_truth\",\n", + " model_alias=\"nvidia-text\",\n", + " prompt=\"\"\"Based on this IT help desk question: {{ question }}\n", + "\n", + "And this knowledge base article content:\n", + "{{ content }}\n", + "\n", + "Generate a concise, accurate answer that directly addresses the question using only information from the knowledge base.\n", + "\n", + "Return ONLY the answer, nothing else.\"\"\",\n", + " )\n", + ")\n", + "\n", + "print(\"Dataset configuration built\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Preview the Data Structure\n", + "\n", + "Let's preview what the generated test cases will look like." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few sample records to see what will be generated\n", + "preview = data_designer.preview(config_builder=config_builder, num_records=3)\n", + "preview.display_sample_record()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Generate the Full Dataset\n", + "\n", + "Now let's generate the complete evaluation dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# We'll only generate a few records for this tutorial\n", + "dataset_results = data_designer.create(\n", + " config_builder=config_builder,\n", + " num_records = 10,\n", + " dataset_name=\"rag_agent_test_cases\"\n", + ")\n", + "\n", + "# Convert to list of dictionaries for processing\n", + "eval_data = dataset_results.load_dataset().to_dict('records')\n", + "\n", + "print(f\"Generated {len(eval_data)} test cases\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Save the Test Data\n", + "\n", + "Now let's clean up and save the generated data in a format ready for evaluation.\n", + "\n", + "We'll keep only the essential evaluation fields: **question**, **context_key_words**, and **ground_truth**. We'll remove the seed data and reasoning traces to make the output cleaner and easier to review." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set output path\n", + "output_path = Path(\"../../data/evaluation/synthetic_rag_agent_test_cases.json\")\n", + "\n", + "# Keep only essential evaluation columns, ignore reasoning traces and seed data\n", + "essential_columns = ['question', 'context_key_words', 'ground_truth']\n", + "clean_data = []\n", + "\n", + "for record in eval_data:\n", + " clean_record = {}\n", + " for key, value in record.items():\n", + " if key in essential_columns:\n", + " clean_record[key] = value\n", + " clean_data.append(clean_record)\n", + "\n", + "# Save to JSON file\n", + "with open(output_path, 'w') as f:\n", + " json.dump(clean_data, f, indent=2)\n", + "\n", + "print(f\"Saved {len(eval_data)} test cases to {output_path}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's Next?\n", + "\n", + "Check out to view your generated data.\n", + "\n", + "In the next notebook, you'll use this same process to create an evaluation dataset for your Report Generation agent!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/code/3-agent-evaluation/generate_report_eval_dataset.ipynb b/code/3-agent-evaluation/generate_report_eval_dataset.ipynb new file mode 100644 index 0000000..b044b36 --- /dev/null +++ b/code/3-agent-evaluation/generate_report_eval_dataset.ipynb @@ -0,0 +1,357 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Generate an Evaluation Dataset for a Report Generation Agent\n", + "\n", + "In this notebook, you'll create a synthetic evaluation dataset using **NVIDIA NeMo Data Designer** for the **Report Generation Agent** from Module 1, using your learnings from the previous evaluation data notebook." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Setup\n", + "\n", + "First, let's set up our environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -r ../../requirements.txt > /dev/null\n", + "\n", + "from dotenv import load_dotenv\n", + "_ = load_dotenv(\"../../variables.env\")\n", + "_ = load_dotenv(\"../../secrets.env\")\n", + "\n", + "import os\n", + "import json\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Configure DataDesigner\n", + "\n", + "Next, let's initialize Data Designer with default settings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from data_designer.essentials import (\n", + " DataDesigner,\n", + " DataDesignerConfigBuilder,\n", + " LLMStructuredColumnConfig,\n", + ")\n", + "\n", + "# Initialize Data Designer with default settings\n", + "data_designer = DataDesigner()\n", + "config_builder = DataDesignerConfigBuilder()\n", + "\n", + "print(\"Completed initialization\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Create Seed Data with Report Topics\n", + "\n", + "Unlike the RAG agent which retrieves from a knowledge base, the report agent generates comprehensive reports on given topics. This will change the type of evaluation metrics we use, and require a different type of evaluation data.\n", + "\n", + "Let's start with some seed data. We'll start a list of manually curated topics. This will help avoid semantic duplication, ensuring that the evaluation data covers a variety of different domains." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "# Define diverse report topics across different domains\n", + "report_topics_data = [\n", + " {\n", + " \"domain\": \"Technology\",\n", + " \"topic_area\": \"Artificial Intelligence\",\n", + " },\n", + " {\n", + " \"domain\": \"Energy\",\n", + " \"topic_area\": \"Renewable Energy\"\n", + " },\n", + " {\n", + " \"domain\": \"Business\",\n", + " \"topic_area\": \"Remote Work\",\n", + " },\n", + " {\n", + " \"domain\": \"Technology\",\n", + " \"topic_area\": \"Quantum Computing\",\n", + " },\n", + " {\n", + " \"domain\": \"Supply Chain\",\n", + " \"topic_area\": \"Sustainability\",\n", + " },\n", + " {\n", + " \"domain\": \"Cybersecurity\",\n", + " \"topic_area\": \"Financial Sector Threats\",\n", + " },\n", + " {\n", + " \"domain\": \"Healthcare\",\n", + " \"topic_area\": \"Digital Health Technologies\",\n", + " },\n", + " {\n", + " \"domain\": \"Education\",\n", + " \"topic_area\": \"Online Learning Platforms\",\n", + " },\n", + " {\n", + " \"domain\": \"Climate\",\n", + " \"topic_area\": \"Carbon Capture Technologies\",\n", + " },\n", + " {\n", + " \"domain\": \"Finance\",\n", + " \"topic_area\": \"Cryptocurrency Regulation\"\n", + " }\n", + "]\n", + "\n", + "# Create seed data and save it for Data Designer\n", + "seed_data = pd.DataFrame(report_topics_data)\n", + "seed_file_path = Path(\"../../data/evaluation/report_seed_data.parquet\")\n", + "\n", + "seed_ref = data_designer.make_seed_reference_from_dataframe(\n", + " dataframe=seed_data,\n", + " file_path=seed_file_path\n", + ")\n", + "\n", + "print(f\"Created seed data with {len(seed_data)} topics\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Configure Data Designer for Report Evaluation\n", + "\n", + "Now we'll configure Data Designer to generate complete evaluation test cases using **LLM-Structured Columns**. 
This approach is different from the text-based columns we used for the RAG agent—instead of generating separate text fields, we use a Pydantic model to define the entire structure we want:\n", + "\n", + "- **Topic** - A specific professional report topic\n", + "- **Expected Sections** - A list of 6-8 section titles \n", + "- **Quality Criteria** - A nested object with `should_include` and `should_avoid` lists\n", + "\n", + "Using `LLMStructuredColumnConfig`, Data Designer generates the topic, sections, and criteria together in one LLM call, ensuring coherence. It also ensures all required fields are present and correctly typed.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import List\n", + "from pydantic import BaseModel, Field\n", + "\n", + "# Define Pydantic models for structured output\n", + "class QualityCriteria(BaseModel):\n", + " should_include: List[str] = Field(\n", + " description=\"3-4 specific things that SHOULD be included for a high-quality report\"\n", + " )\n", + " should_avoid: List[str] = Field(\n", + " description=\"3-4 things that SHOULD be avoided to maintain quality and objectivity\"\n", + " )\n", + "\n", + "class ReportEvaluationCase(BaseModel):\n", + " topic: str = Field(\n", + " description=\"Specific, professional report topic\"\n", + " )\n", + " expected_sections: List[str] = Field(\n", + " description=\"6-8 section titles for a comprehensive professional report\"\n", + " )\n", + " quality_criteria: QualityCriteria = Field(\n", + " description=\"Quality criteria for evaluating the report\"\n", + " )\n", + "\n", + "# Create config builder with seed data\n", + "config_builder = DataDesignerConfigBuilder()\n", + "config_builder.with_seed_dataset(seed_ref)\n", + "\n", + "# Generate complete structured evaluation case in a single column\n", + "config_builder.add_column(\n", + " LLMStructuredColumnConfig(\n", + " name=\"evaluation_case\",\n", + " model_alias=\"nvidia-text\",\n", + " output_format=ReportEvaluationCase,\n", + " prompt=\"\"\"Generate a complete evaluation test case for a professional report.\n", + "\n", + "Context:\n", + "- Domain: {{ domain }}\n", + "- Topic Area: {{ topic_area }}\n", + "\n", + "Create an evaluation case that includes:\n", + "\n", + "1. TOPIC: A specific, professional report topic that is:\n", + " - Focused and not too broad\n", + " - Relevant to current trends in {{ topic_area }}\n", + " - Suitable for a comprehensive multi-section research report\n", + "\n", + "2. EXPECTED_SECTIONS: 6-8 section titles that:\n", + " - Start with an introduction or executive summary\n", + " - Include substantive middle sections covering key aspects\n", + " - End with conclusions, recommendations, or future outlook\n", + " - Use clear, professional section titles\n", + " - Follow a logical flow\n", + "\n", + "3. QUALITY_CRITERIA:\n", + " - should_include: 3-4 specific things that SHOULD be included for quality\n", + " - should_avoid: 3-4 things that SHOULD be avoided to maintain objectivity\"\"\",\n", + " )\n", + ")\n", + "\n", + "print(\"Dataset configuration built with structured output\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Preview the Data Structure\n", + "\n", + "Let's preview what the generated test cases will look like." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Preview a few sample records to see what will be generated\n", + "preview = data_designer.preview(config_builder=config_builder, num_records=3)\n", + "preview.display_sample_record()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Generate the Full Dataset\n", + "\n", + "Now generate the complete evaluation dataset. We'll generate one test case per topic seed to ensure diverse coverage." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate evaluation data (one test case per topic seed)\n", + "num_topics = len(report_topics_data)\n", + "\n", + "dataset_results = data_designer.create(\n", + " config_builder=config_builder,\n", + " num_records=num_topics,\n", + " dataset_name=\"report_agent_test_cases\"\n", + ")\n", + "\n", + "# Convert to list of dictionaries for processing\n", + "eval_data = dataset_results.load_dataset().to_dict('records')\n", + "\n", + "print(f\"\\nGenerated {len(eval_data)} test cases\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Save the Test Data\n", + "\n", + "You've successfully generated synthetic test data for evaluating your Report Generation agent!\n", + "\n", + "Just like before, let's clean up and save the generated data in a format ready for evaluation. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set output path\n", + "output_path = Path(\"../../data/evaluation/synthetic_report_agent_test_cases.json\")\n", + "formatted_data = []\n", + "\n", + "for record in eval_data:\n", + " # Extract evaluation_case which contains our structured data\n", + " eval_case = record.get(\"evaluation_case\", {})\n", + " \n", + " # Handle both dict and Pydantic model formats\n", + " if hasattr(eval_case, 'model_dump'):\n", + " eval_case = eval_case.model_dump()\n", + " \n", + " # Extract quality criteria\n", + " quality_criteria = eval_case.get(\"quality_criteria\", {})\n", + " if hasattr(quality_criteria, 'model_dump'):\n", + " quality_criteria = quality_criteria.model_dump()\n", + " \n", + " formatted_data.append({\n", + " \"topic\": eval_case.get(\"topic\", \"\"),\n", + " \"expected_sections\": eval_case.get(\"expected_sections\", []),\n", + " \"quality_criteria\": {\n", + " \"should_include\": quality_criteria.get(\"should_include\", []),\n", + " \"should_avoid\": quality_criteria.get(\"should_avoid\", [])\n", + " }\n", + " })\n", + "\n", + "# Save to JSON file\n", + "with open(output_path, 'w') as f:\n", + " json.dump(formatted_data, f, indent=2)\n", + "\n", + "print(f\"Saved {len(formatted_data)} test cases to {output_path.absolute()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What's Next?\n", + "\n", + "Check out to view your generated data.\n", + "\n", + "By now, you've built evaluation datasets for both of your agents. Awesome! You're ready to move on to the next step- putting together an evaluation pipeline and using the data you've generated to evaluate agent performance." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/requirements.txt b/requirements.txt index f47bcc6..281f4f2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,3 +1,4 @@ +data-designer~=0.1.5 dotenv~=0.9.9 faiss-cpu~=1.12.0 fastapi~=0.116.1