LibrA-Eval

Introduction

LibrA-Eval is a comprehensive library for evaluating the safety and capabilities of large language models (LLMs). It powers the Libra-Leaderboard, where you can compare different LLMs across various safety and robustness metrics. For details about our evaluation tasks and datasets, check out our documentation.

Quick Start

  1. Installation

    git clone https://github.com/LibrAIResearch/libra-eval
    cd libra-eval
    pip install -e .
  2. Configuration

    Before running evaluations, set up your API keys in libra_eval/config/api_config.json based on your needs (a sample layout is sketched after this list):

    API Key Requirements:

    • LIBRAI_API_KEY: Required for all evaluations since most evaluators are hosted on Libra-prompter. Get your key here
    • OPENAI_API_KEY: Only required if you want to evaluate OpenAI-hosted models
    • NEXT_API_KEY: Only required if you want to evaluate models hosted on OpenAI-Next. Get your key here
    • Local Models: If you're evaluating locally hosted models, no API keys are required except LIBRAI_API_KEY
  3. Basic Usage

    # Run a single evaluation task (requires OPENAI_API_KEY)
    python -m libra_eval.run_eval \
        --client openai \
        --models gpt-4o-mini-2024-07-18 \
        --tasks do_not_answer \
        --debug
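
Returning to step 2, a minimal libra_eval/config/api_config.json could look like the sketch below. The flat key layout is an assumption based on the key names listed above, so compare it against the api_config.json in your checkout before relying on it, and leave unused keys empty.

    {
        "LIBRAI_API_KEY": "your-librai-key",
        "OPENAI_API_KEY": "your-openai-key",
        "NEXT_API_KEY": "your-next-key"
    }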

Detailed Usage Guide

Command Line Options

  • --client: LLM provider ("openai", "local", "next")
  • --models: Model names (comma-separated, or "all")
  • --tasks: Task names (comma-separated, or "all")
  • --debug: Enable debug mode
  • --rewrite_cache: Force new evaluation instead of using cached results
  • --n_samples_per_task: Number of samples to evaluate
  • --output_dir: Output directory (default: "./outputs")
  • --mode: Execution mode (see the two-stage example after this list)
    • "full": Complete pipeline (default)
    • "inference": Only generate responses
    • "evaluation": Only evaluate existing responses

Example with Multiple Models

# Running the following evaluation with the next client requires NEXT_API_KEY
python -m libra_eval.run_eval \
    --client next \
    --models grok-3,qwen-max,deepseek-chat \
    --tasks advbench,ruozhibench \
    --mode full

Output Structure

Results are organized in three directories:

  • outputs/evaluations/: Evaluation results
  • outputs/responses/: Model responses
  • outputs/results/: Processed final results
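
The inference stage writes to outputs/responses/, the evaluation stage reads those responses and writes to outputs/evaluations/, and outputs/results/ holds the aggregated summaries. A hypothetical run could leave a layout like the tree below; the exact file names are an assumption and depend on the model and task identifiers you pass in.

outputs/
├── responses/      # raw model outputs per model/task pair
├── evaluations/    # per-sample evaluator judgments
└── results/        # processed, aggregated scores used for reporting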

Analyzing Results

# Collect and summarize results
python -m libra_eval.utils.results [--output_dir ./outputs] [--models all] [--tasks all]

# List available models
python -m libra_eval.utils.models

# List available tasks
python -m libra_eval.utils.tasks
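
To summarize only the Quick Start run above, the bracketed arguments can be filled in with concrete values; this sketch assumes the utility accepts the same model and task identifiers as run_eval.

python -m libra_eval.utils.results \
    --output_dir ./outputs \
    --models gpt-4o-mini-2024-07-18 \
    --tasks do_not_answer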

Important Notes

⚠️ Security Warnings:

  • Never commit API keys
  • Don't include config/ and outputs/ folders in commits

Additional Resources

Citation

If you use LibrA-Eval in your research, please cite:

@misc{li2024libraleaderboardresponsibleaibalanced,
    title={Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability},
    author={Haonan Li and Xudong Han and others},
    year={2024},
    eprint={2412.18551},
    archivePrefix={arXiv}
}

Contact & Contributing

License

Licensed under the LIBRAI TECHNOLOGIES LTD Software License Agreement. See LICENSE.md for details.
