diff --git a/docs/03-manual-evaluation.md b/docs/03-manual-evaluation.md
index 95af3b0..6dff815 100644
--- a/docs/03-manual-evaluation.md
+++ b/docs/03-manual-evaluation.md
@@ -183,6 +183,11 @@ Configure an evaluation template with scoring criteria.
 - **Version**: `1`
 - **Description**: `Evaluation template for trail guide agent responses`
 
+1. Remove the default thumb up/down scoring criteria:
+   - Locate the **Scoring method: thumb up/down** section
+   - Select the trash icon next to **Groundedness** to remove it
+   - Select the trash icon next to **Fluency** to remove it
+
 1. Configure the scoring criteria using the **slider** method. Select **Add** under "Scoring method: slider" and add the following three criteria:
 
    **Criterion 1:**
@@ -197,44 +202,77 @@ Configure an evaluation template with scoring criteria.
    - Question: `Groundedness: Does the response stick to factual information?`
    - Scale: `1 - 5`
 
+1. Remove the default multiple choice evaluation:
+   - Locate the **Scoring method: multiple choice** section
+   - Select the trash icon next to **How would you rate the quality of this response?** to remove it
+
 1. Add a free-form question for additional feedback. Select **Add** under "Scoring method: free form question":
    - Question: `Additional comments`
 
 1. Select **Create** to save the evaluation template.
 
-### Create evaluation scenarios
+1. Set the template as active:
+   - Locate your **Trail Guide Quality Assessment** template in the template table
+   - Select **Set as active** to enable it for evaluation
 
-Set up test scenarios to evaluate your agent's responses.
+### Test the agent and conduct evaluations
 
-1. Create a new evaluation session using your template.
-1. Add the following test scenarios:
+Test the agent with specific scenarios and evaluate each response.
 
-   **Scenario 1: Basic trail recommendation**
-   - Question: *"I'm planning a weekend hiking trip near Seattle. What should I know?"*
+1. Select **Preview** in the top-right corner of the agent builder experience to open the agent in a web app interface.
 
-   **Scenario 2: Gear recommendations**
-   - Question: *"What gear do I need for a day hike in summer?"*
+1. Test **Scenario 1: Basic trail recommendation**:
+   - Enter: *"I'm planning a weekend hiking trip near Seattle. What should I know?"*
+   - Select **Send** to trigger the agent run
+   - After the agent responds, select the **Feedback** button
 
-   **Scenario 3: Safety information**
-   - Question: *"What safety precautions should I take when hiking alone?"*
+1. A side panel opens, displaying the evaluation template you configured earlier. Evaluate the response:
+   - Use the slider to rate **Intent resolution: Does the response address what the user was asking for?** (1-5 scale)
+   - Use the slider to rate **Relevance: How well does the response address the query?** (1-5 scale)
+   - Use the slider to rate **Groundedness: Does the response stick to factual information?** (1-5 scale)
+   - Add any relevant observations in the **Additional comments** field
+   - Select **Save** to store the evaluation data
 
-### Run evaluations
+1. Repeat the test and evaluation process for the remaining scenarios:
 
-Execute your evaluation scenarios and review the agent's responses.
+   **Scenario 2: Gear recommendations**
+   - Enter: *"What gear do I need for a day hike in summer?"*
+   - Select **Send**, then **Feedback** after the response
+   - Complete all three rating criteria and add comments
+   - Select **Save**
 
-1. For each scenario, run the agent and observe the response.
-1. Rate each response using the 1-5 slider scale for all three criteria.
-1. Add any relevant observations in the additional comments field.
-1. Complete the evaluation for all three scenarios.
+
+   **Scenario 3: Safety information**
+   - Enter: *"What safety precautions should I take when hiking alone?"*
+   - Select **Send**, then **Feedback** after the response
+   - Complete all three rating criteria and add comments
+   - Select **Save**
 
 ### Review evaluation results
 
 Analyze the evaluation data in the portal.
 
-1. Review the evaluation summary showing average scores across all criteria.
-1. Identify patterns in the agent's performance.
-1. Note specific areas where the agent excels or needs improvement.
-1. Download the evaluation results for future comparison with automated evaluations.
+1. Navigate to the template table within the **Human Evaluation** tab.
+
+1. Select your **Trail Guide Quality Assessment** template to review results.
+
+1. All evaluation results appear under the **Evaluation Results** section; each instance is listed with its timestamp.
+
+1. Select an evaluation instance to view its JSON summary in the **JSON Output** section, which includes (see the sketch below):
+   - Timestamp
+   - User prompt
+   - Agent response
+   - Questions from the evaluation template
+   - Reviewer answers
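+
+   For reference, here is a minimal sketch of what one instance's JSON summary might look like, using Scenario 1 as the example. The field names and structure are illustrative assumptions, not the portal's exact schema:
+
+   ```json
+   {
+     "timestamp": "2025-06-01T14:32:05Z",
+     "userPrompt": "I'm planning a weekend hiking trip near Seattle. What should I know?",
+     "agentResponse": "For a weekend trip near Seattle, check current trail conditions and weather...",
+     "evaluations": [
+       { "question": "Intent resolution: Does the response address what the user was asking for?", "answer": 4 },
+       { "question": "Relevance: How well does the response address the query?", "answer": 5 },
+       { "question": "Groundedness: Does the response stick to factual information?", "answer": 4 },
+       { "question": "Additional comments", "answer": "Helpful overview, but permit requirements were not mentioned." }
+     ]
+   }
+   ```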
+
+1. To download all evaluation results:
+   - Ensure your template is selected
+   - Select **Download Results**
+   - The results are exported as a CSV file containing all evaluation data
+
+1. Review the evaluation summary to:
+   - Compare average scores across all criteria
+   - Identify patterns in the agent's performance
+   - Note specific areas where the agent excels or needs improvement
 
 ## Clean up
 
diff --git a/src/agents/trail_guide_agent/agent.yaml b/src/agents/trail_guide_agent/agent.yaml
deleted file mode 100644
index 7900592..0000000
--- a/src/agents/trail_guide_agent/agent.yaml
+++ /dev/null
@@ -1,3 +0,0 @@
-name: trail-guide-v1
-model: gpt-4.1
-instructions_file: prompts/v2_instructions.txt