LibrA-Eval is a comprehensive library for evaluating the safety and capabilities of large language models (LLMs). It powers the Libra-Leaderboard, where you can compare different LLMs across various safety and robustness metrics. For details about our evaluation tasks and datasets, check out our documentation.
## Installation

```bash
git clone https://github.com/LibrAIResearch/libra-eval
cd libra-eval
pip install -e .
```
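To confirm the editable install worked, a quick sanity check (our suggestion, not an official step):

```bash
# The package should import without errors after `pip install -e .`
python -c "import libra_eval; print('libra_eval is importable')"
```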
## Configuration

Before running evaluations, set up your API keys in `libra_eval/config/api_config.json` based on your needs.

**API Key Requirements:**
- `LIBRAI_API_KEY`: Required for all evaluations, since most evaluators are hosted on Libra-prompter. Get your key here
- `OPENAI_API_KEY`: Only required if you want to evaluate OpenAI-hosted models
- `NEXT_API_KEY`: Only required if you want to evaluate models hosted on OpenAI-Next. Get your key here
- Local models: No API keys other than `LIBRAI_API_KEY` are required when evaluating locally hosted models
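The config file might look roughly like the sketch below; the exact field layout is an assumption based on the key names above, so check the template shipped in `libra_eval/config/` for the authoritative schema:

```json
{
    "LIBRAI_API_KEY": "your-librai-key",
    "OPENAI_API_KEY": "your-openai-key",
    "NEXT_API_KEY": "your-next-key"
}
```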
## Basic Usage

```bash
# Run a single evaluation task (requires OPENAI_API_KEY)
python -m libra_eval.run_eval \
    --client openai \
    --models gpt-4o-mini-2024-07-18 \
    --tasks do_not_answer \
    --debug
```
**Options:**

- `--client`: LLM provider (`openai`, `local`, `next`)
- `--models`: Model names (comma-separated, or `all`)
- `--tasks`: Task names (comma-separated, or `all`)
- `--debug`: Enable debug mode
- `--rewrite_cache`: Force a fresh evaluation instead of using cached results
- `--n_samples_per_task`: Number of samples to evaluate per task
- `--output_dir`: Output directory (default: `./outputs`)
- `--mode`: Execution mode (see the split-pipeline sketch after this list)
  - `full`: Complete pipeline (default)
  - `inference`: Only generate responses
  - `evaluation`: Only evaluate existing responses
```bash
# To run the following evaluation with the Next client, NEXT_API_KEY is required
python -m libra_eval.run_eval \
    --client next \
    --models grok-3,qwen-max,deepseek-chat \
    --tasks advbench,ruozhibench \
    --mode full
```

Results are organized in three directories:
- `outputs/evaluations/`: Evaluation results
- `outputs/responses/`: Model responses
- `outputs/results/`: Processed final results
```bash
# Collect and summarize results
python -m libra_eval.utils.results [--output_dir ./outputs] [--models all] [--tasks all]
```
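For example, to summarize only the single-task run above (the flags are as documented; the specific values are illustrative):

```bash
python -m libra_eval.utils.results \
    --output_dir ./outputs \
    --models gpt-4o-mini-2024-07-18 \
    --tasks do_not_answer
```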
```bash
# List available models
python -m libra_eval.utils.models
```
```bash
# List available tasks
python -m libra_eval.utils.tasks
```

## Notes

- Never commit API keys.
- Don't include the `config/` and `outputs/` folders in commits.
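One way to enforce this is a `.gitignore` entry (our suggestion, not part of the documented setup):

```gitignore
# Keep API keys and run artifacts out of version control
config/
outputs/
```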
## Citation

If you use LibrA-Eval in your research, please cite:
```bibtex
@misc{li2024libraleaderboardresponsibleaibalanced,
      title={Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability},
      author={Haonan Li and Xudong Han and [... other authors]},
      year={2024},
      eprint={2412.18551},
      archivePrefix={arXiv}
}
```

## Contact

- Email: team@librai.tech
- GitHub: https://github.com/LibrAIResearch/libra-eval
## Contributing

We welcome contributions! See our contributing guidelines for details.

## License

Licensed under the LIBRAI TECHNOLOGIES LTD Software License Agreement. See LICENSE.md for details.