- Python 3.7 or higher
- pip package manager
- Create a virtual environment using .venv:

  ```bash
  python -m venv .venv
  ```

- Activate the virtual environment:

  On macOS/Linux:

  ```bash
  source .venv/bin/activate
  ```

  On Windows:

  ```bash
  .venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements_parallel.txt
  ```

Before running the evaluation, you need to configure the YAML files based on whether you're using a local GPU or API services.
- src/configs/parallel_evaluation.yaml - Main evaluation configuration
- src/configs/parallel_evaluation_prompt_generator.yaml - LLM-generated attack configuration
If you're using a local GPU with vLLM, configure each agent in the YAML files as follows:

```yaml
agent:
  tutor:
    model_name: "Qwen/Qwen2.5-7B-Instruct"
    use_vllm: true
    vllm_port: 8000
    base_url: "http://localhost:8000/v1"
    device: "cuda:0"  # Specify GPU device
    # ... other parameters
```

Key parameters:

- `use_vllm`: Set to `true`
- `vllm_port`: Port where the vLLM server is running (e.g., 8000, 8001, etc.)
- `base_url`: vLLM server URL (e.g., `http://localhost:8000/v1`)
- `device`: CUDA device (e.g., `cuda:0`, `cuda:1`, `cuda:2`)
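This configuration assumes an OpenAI-compatible vLLM server is already listening on the configured port for each agent. If you need to start one yourself, a minimal sketch looks like the following (exact flags depend on your vLLM version and hardware):

```bash
# Serve the tutor model on GPU 0 at port 8000 (matching base_url above).
# Repeat with a different port/GPU for each agent that needs its own server.
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000
```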
If you're using API services, configure each agent as follows:

```yaml
agent:
  judge:
    model_name: "meta-llama/Llama-3.3-70B-Instruct-bfloat16"
    use_vllm: true
    vllm_port: 8000          # arbitrary placeholder; not used with API services
    base_url: "<your API endpoint>"
    device: "cuda:2"         # arbitrary placeholder; not used with API services
```

and set the API key in src/agents/llm_agent.py:
```python
self.client = OpenAI(
    api_key="EMPTY",
    base_url=vllm_base_url,
)
```
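Hard-coding `api_key="EMPTY"` is only appropriate for local vLLM servers; a hosted API service will usually require a real key. As an optional sanity check before launching a full run (assuming the endpoint is OpenAI-compatible; the URL and key below are placeholders), you can list the models it serves:

```bash
# Placeholder values -- replace with your actual endpoint and key
export API_BASE_URL="https://your-api-endpoint/v1"
export API_KEY="your-key-here"

# An OpenAI-compatible endpoint should list its served models here
curl -H "Authorization: Bearer $API_KEY" "$API_BASE_URL/models"
```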
The script directory contains bash scripts for running different evaluation scenarios:

script/base_adversarial_agent.sh
Tests basic adversarial student attacks against different tutor configurations:
- Base adversarial student vs. base in-context tutor
- Base adversarial student vs. tutor with reasoning
- Base adversarial student vs. multi-agent tutor
```bash
bash script/base_adversarial_agent.sh
```

script/student_with_reasoning.sh
Evaluates student agents with reasoning capabilities against various tutors:
- Student with reasoning vs. base in-context tutor
- Student with reasoning vs. tutor with reasoning
- Student with reasoning vs. multi-agent tutor
```bash
bash script/student_with_reasoning.sh
```

script/multi_agent_student.sh
Tests multi-agent student attacks with reflection capabilities:
- Multi-agent student vs. base in-context tutor
- Multi-agent student vs. tutor with reasoning
- Multi-agent student vs. multi-agent tutor
```bash
bash script/multi_agent_student.sh
```

script/llm_generated_attacks.sh
Runs LLM-generated adversarial prompts against all tutor types. This script iterates through multiple attack strategies:
- contextual_manipulation
- direct_request
- emotional_threat
- interpersonal_influence
- intentional_wrong_answer
- request_shaping
```bash
bash script/llm_generated_attacks.sh
```

script/manually_defined_attacks.sh
Evaluates manually refined adversarial prompts from the collected_prompts directory:
- Manual prompts vs. base in-context tutor
- Manual prompts vs. tutor with reasoning
- Manual prompts vs. multi-agent tutor
```bash
bash script/manually_defined_attacks.sh
```

- All scripts use Hydra configuration override syntax (see the example below)
- Results are logged to Weights & Biases (wandb)
- Make sure to configure your vLLM ports and model paths before running
- Each script can be run independently based on your evaluation needs
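Hydra's override syntax lets you change any value in the YAML configuration from the command line without editing the files. The entry point below is hypothetical (check the bash scripts in the script directory for the actual command); the `key=value` pairs illustrate the override form against the agent config shown above:

```bash
# Hypothetical entry point -- see the scripts in script/ for the real invocation.
# Each key=value pair overrides a field in parallel_evaluation.yaml.
python src/parallel_evaluation.py \
    agent.tutor.vllm_port=8001 \
    agent.tutor.base_url=http://localhost:8001/v1 \
    agent.tutor.device=cuda:1
```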
In this repository, we provide examples of the prompts and the dataset for fine-tuning. We plan to release the full data upon paper acceptance.