ManagerBench is a comprehensive evaluation framework designed to assess the safety-pragmatism trade-off in autonomous Large Language Models (LLMs). This repository contains the implementation, benchmark, and evaluation tools presented in our paper.
As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions.
- Clone the repository:

```bash
git clone https://github.com/technion-cs-nlp/ManagerBench.git
cd ManagerBench
```

- Create and activate the conda environment:

```bash
conda env create -f environment.yml
conda activate manager_bench
```

- Ensure the benchmark data is properly placed in the `bench/` directory (an optional sanity check follows below).
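Before running an evaluation, a short Python snippet can confirm the data is in place. This is only a sketch: it assumes nothing about the specific file names inside `bench/` and simply checks that the directory exists and is non-empty.

```python
# Optional sanity check: confirm bench/ exists and contains data files.
# This makes no assumptions about specific file names or formats.
from pathlib import Path

bench_dir = Path("bench")
assert bench_dir.is_dir(), "bench/ directory is missing"
files = sorted(p.name for p in bench_dir.iterdir())
assert files, "bench/ directory is empty"
print(f"Found {len(files)} entries in bench/: {files[:5]}")
```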
To run ManagerBench on a specific model:
```bash
python ManagerBench.py --model_name <model_name> --full_evaluation
```

Replace `<model_name>` with your desired model identifier. For example:

```bash
python ManagerBench.py --model_name gpt-4o-2024-08-06 --full_evaluation
```

Results are saved automatically in the `results/` folder and printed to the terminal, in a format comparable to Table 1 in the paper.
If you're evaluating a model that requires an API key, configure it in the api_key.py file.
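A minimal sketch of what `api_key.py` might look like is shown below. The variable names (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) are assumptions for illustration only; check the `api_key.py` shipped with the repository for the fields `ManagerBench.py` actually reads, and prefer environment variables over hard-coded secrets.

```python
# api_key.py -- illustrative sketch only. The variable names below are
# assumptions; the repository's api_key.py defines the fields that
# ManagerBench.py actually imports.
import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")        # hypothetical name
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", "")  # hypothetical name
```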
The following optional parameters allow you to customize the evaluation:
| Parameter | Description | Default |
|---|---|---|
| `--full_evaluation` | Enable comprehensive evaluation as shown in Table 1 | `False` |
| `--benefit_percentage` | Specify the percentage of benefit scenarios | 50% |
| `--harm_percentage` | Specify the percentage of harm scenarios | 5% |
| `--push_evaluation` | Enable goal-oriented prompt evaluation | `False` |
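The flags above map naturally onto a standard command-line interface. The sketch below illustrates one way they could be defined with `argparse`; it is not the actual parser in `ManagerBench.py`, and the real types and defaults may differ.

```python
# Illustrative argparse sketch of the documented flags; not the parser
# used by ManagerBench.py itself.
import argparse

parser = argparse.ArgumentParser(description="Run a ManagerBench evaluation")
parser.add_argument("--model_name", required=True, help="Model identifier, e.g. gpt-4o-2024-08-06")
parser.add_argument("--full_evaluation", action="store_true", help="Comprehensive evaluation (Table 1)")
parser.add_argument("--benefit_percentage", type=int, default=50, help="Percentage of benefit scenarios")
parser.add_argument("--harm_percentage", type=int, default=5, help="Percentage of harm scenarios")
parser.add_argument("--push_evaluation", action="store_true", help="Goal-oriented prompt evaluation")

# Example: parse a sample command line and inspect the resulting settings.
args = parser.parse_args(["--model_name", "gpt-4o-2024-08-06", "--harm_percentage", "10"])
print(args)
```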
```bash
python ManagerBench.py \
    --model_name gpt-4o-2024-08-06 \
    --benefit_percentage 60 \
    --harm_percentage 10 \
    --push_evaluation
```

This evaluates GPT-4o with the goal-oriented nudge prompt, a human harm percentage of 10%, and a benefit percentage of 60%.
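To compare several settings, the same command can be scripted. The sketch below sweeps over a few harm/benefit combinations using only the flags documented above; the specific values are arbitrary examples.

```python
# Sketch: run the evaluation script over a few harm/benefit settings.
# Uses only the command-line flags documented above.
import subprocess

MODEL = "gpt-4o-2024-08-06"
for harm, benefit in [(5, 50), (10, 60)]:
    cmd = [
        "python", "ManagerBench.py",
        "--model_name", MODEL,
        "--harm_percentage", str(harm),
        "--benefit_percentage", str(benefit),
        "--push_evaluation",
    ]
    subprocess.run(cmd, check=True)
```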
- Output location: All evaluation results are saved in the `results/` directory.
- Format: Final results are printed to the terminal output.
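Since the on-disk format of the saved results is not documented here, the snippet below only lists what a run wrote to `results/`; it is a convenience sketch, not part of the evaluation pipeline.

```python
# Sketch: enumerate the files an evaluation run wrote under results/.
# The README does not specify their format, so this only lists paths.
from pathlib import Path

for path in sorted(Path("results").rglob("*")):
    if path.is_file():
        print(path)
```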
If you use ManagerBench in your research, please cite our paper:
```bibtex
@article{simhi2025managerbench,
  title={ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs},
  author={Simhi, Adi and Herzig, Jonathan and Tutek, Martin and Itzhak, Itay and Szpektor, Idan and Belinkov, Yonatan},
  journal={arXiv preprint arXiv:2510.00857},
  year={2025}
}
```

Repository structure:

```
ManagerBench/
├── bench/              # Benchmark datasets
├── results/            # Evaluation results
├── ManagerBench.py     # Main evaluation script
├── api_key.py          # API configuration
├── environment.yml     # Conda environment specification
└── README.md           # This file
```

