README.md (97 additions, 93 deletions)
@@ -1,147 +1,151 @@
# OpenEuroLLM CLI (oellm)

A lightweight CLI for scheduling LLM evaluations across multiple HPC clusters using SLURM job arrays and Singularity containers.

## Features

- **Schedule evaluations** on multiple models and tasks: `oellm schedule-eval`
- **Collect results** and check for missing evaluations: `oellm collect-results`
- **Task groups** for pre-defined evaluation suites with automatic dataset pre-downloading
- **Multi-cluster support** with auto-detection (Leonardo, LUMI, JURECA)

## Quick Start

**Prerequisites:** Install [uv](https://docs.astral.sh/uv/#installation)

```bash
# Install the package
uv tool install -p 3.12 git+https://github.com/OpenEuroLLM/oellm-cli.git

# Run evaluations using a task group (recommended)
oellm schedule-eval \
--models "microsoft/DialoGPT-medium,EleutherAI/pythia-160m" \
--task_groups "open-sci-0.01"

# Or specify individual tasks
oellm schedule-eval \
--models "EleutherAI/pythia-160m" \
--tasks "hellaswag,mmlu" \
--n_shot 5
```

This will automatically:
- Detect your current HPC cluster (Leonardo, LUMI, or JURECA)
- Download and cache the specified models
- Pre-download datasets for known tasks (see warning below)
- Generate and submit a SLURM job array with appropriate cluster-specific resources

## Task Groups

Task groups are pre-defined evaluation suites in [`task-groups.yaml`](oellm/resources/task-groups.yaml). Each group specifies tasks, their n-shot settings, and HuggingFace dataset mappings.
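
For orientation, here is a hand-written sketch of what an entry might look like. The field names are illustrative assumptions, not the authoritative schema; consult the YAML file itself:

```yaml
# Hypothetical sketch of a task-groups.yaml entry; field names are
# illustrative assumptions. See oellm/resources/task-groups.yaml for
# the real schema.
open-sci-0.01:
  tasks:
    - name: hellaswag
      n_shot: 0
      hf_dataset: hellaswag
    - name: mmlu
      n_shot: 5
      hf_dataset: cais/mmlu
```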

Available task groups:
- `open-sci-0.01` - Standard benchmarks (COPA, MMLU, HellaSwag, ARC, etc.)
- `belebele-eu-5-shot` - Belebele European language tasks
- `flores-200-eu-to-eng` / `flores-200-eng-to-eu` - Translation tasks
- `global-mmlu-eu` - Global MMLU in EU languages
- `mgsm-eu` - Multilingual GSM benchmarks
- `generic-multilingual` - XWinograd, XCOPA, XStoryCloze
- `include` - INCLUDE benchmarks

In case you meet HuggingFace quotas issues, make sure you are logged in by setting your `HF_TOKEN` and that you are part of [OpenEuroLLM](https://huggingface.co/OpenEuroLLM) organization.
Super groups combine multiple task groups:
- `oellm-multilingual` - All multilingual benchmarks combined

```bash
# Use a task group
oellm schedule-eval --models "model-name" --task_groups "open-sci-0.01"

# Use multiple task groups
oellm schedule-eval --models "model-name" --task_groups "belebele-eu-5-shot,global-mmlu-eu"

# Use a super group
oellm schedule-eval --models "model-name" --task_groups "oellm-multilingual"
```

## ⚠️ Dataset Pre-Download Warning

**Datasets are only automatically pre-downloaded for tasks defined in [`task-groups.yaml`](oellm/resources/task-groups.yaml).**

If you use custom tasks via `--tasks` that are not in the task groups registry, the CLI will attempt to look them up but **cannot guarantee the datasets will be cached**. This may cause failures on compute nodes that don't have network access.

**Recommendation:** Use `--task_groups` when possible, or ensure your custom task datasets are already cached in `$HF_HOME` before scheduling.
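
For example, to warm the cache for a custom task's dataset on a login node, a minimal sketch (assuming the task uses a standard HuggingFace dataset; the dataset ID and path below are illustrative):

```python
import os

# Point HF_HOME at shared storage BEFORE importing datasets,
# so the cache lands where compute nodes expect it.
os.environ.setdefault("HF_HOME", "/path/to/shared/hf-home")  # adjust for your cluster

from datasets import load_dataset

# Illustrative dataset ID; substitute the dataset your custom task needs.
load_dataset("cais/mmlu", "all")  # downloads and caches all splits under $HF_HOME
```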

## Collecting Results

After evaluations complete, collect results into a CSV:

```bash
export UV_CACHE_DIR="<some-workspace-dir>/.cache/uv-cache"
export UV_INSTALL_DIR="<some-workspace>/.local"
export UV_PYTHON_INSTALL_DIR="<some-workspace>/.local/share/uv/python"
export UV_TOOL_DIR="<some-workspace-dir>/.cache/uv-tool-cache"
# Basic collection
oellm collect-results /path/to/eval-output-dir

# Check for missing evaluations and create a CSV for re-running them
oellm collect-results /path/to/eval-output-dir --check --output_csv results.csv
```

The `--check` flag compares completed results against `jobs.csv` and outputs a `results_missing.csv` that can be used to re-schedule failed jobs:

```bash
oellm schedule-eval --eval_csv_path results_missing.csv
```

## CSV-Based Scheduling

For full control, provide a CSV file with columns: `model_path`, `task_path`, `n_shot`, and optionally `eval_suite`:

```bash
oellm schedule-eval --eval_csv_path custom_evals.csv
```
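
For example, a minimal `custom_evals.csv` could look like this (rows are illustrative; the `eval_suite` value `lm-eval-harness` mirrors the suite name used internally):

```csv
model_path,task_path,n_shot,eval_suite
EleutherAI/pythia-160m,hellaswag,0,lm-eval-harness
EleutherAI/pythia-160m,mmlu,5,lm-eval-harness
```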

## Installation

### General Installation

```bash
uv tool install -p 3.12 git+https://github.com/OpenEuroLLM/oellm-cli.git
```

Update to latest:
```bash
uv tool upgrade oellm
```

## High-Level Evaluation Workflow

The `oellm` package orchestrates distributed LLM evaluations through the following workflow:

### 1. **Cluster Auto-Detection**
- Automatically detects the current HPC cluster based on hostname patterns (see the sketch after this list)
- Loads cluster-specific configurations from [`clusters.yaml`](oellm/resources/clusters.yaml) including:
- SLURM partition and account settings
- Shared storage paths for models, datasets, and results
- GPU allocation and queue limits
- Singularity container specifications
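
A minimal sketch of how hostname-based detection can work. This is illustrative only; the actual patterns and configuration schema live in `clusters.yaml`:

```python
# Illustrative sketch of hostname-based cluster detection,
# not the actual oellm implementation.
import re
import socket

# Hypothetical patterns; the real ones live in oellm/resources/clusters.yaml.
CLUSTER_PATTERNS = {
    "leonardo": r"leonardo",
    "lumi": r"\.lumi",
    "jureca": r"jureca",
}

def detect_cluster() -> str | None:
    hostname = socket.getfqdn()
    for name, pattern in CLUSTER_PATTERNS.items():
        if re.search(pattern, hostname):
            return name
    return None  # unknown host: caller can fall back to explicit configuration
```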

### 2. **Resource Preparation**
- **Model Handling**: Processes both local model checkpoints and Hugging Face Hub models
- For local paths: Automatically discovers and expands training checkpoint directories
- For HF models: Pre-downloads to shared cache (`$HF_HOME`) for offline access on compute nodes (see the sketch after this list)
- **Dataset Caching**: Pre-downloads all evaluation datasets using lm-evaluation-harness TaskManager
- **Container Management**: Ensures the appropriate Singularity container is available for the target cluster
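
For the HF Hub case, pre-downloading into the shared cache can be done with `huggingface_hub`. A minimal sketch, assuming `$HF_HOME` already points at shared storage; the model ID is illustrative:

```python
# Sketch: pre-fetch a model repository into the shared cache for offline use.
from huggingface_hub import snapshot_download

# With HF_HOME set to shared storage, the snapshot lands where
# compute nodes expect to find it.
snapshot_download(repo_id="EleutherAI/pythia-160m")
```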

### 3. **Job Generation & Scheduling**
- Creates a comprehensive CSV manifest of all model-task-shot combinations
- Generates a SLURM batch script from a template with cluster-specific parameters
- Submits a job array where each array task processes a subset of evaluations
- Respects queue limits and current user load to avoid overwhelming the scheduler

### 4. **Distributed Execution**
- Each SLURM array job runs in a Singularity container with:
- GPU access (NVIDIA CUDA or AMD ROCm as appropriate)
- Mounted shared storage for models, datasets, and output
- Offline execution using pre-cached resources
- Uses `lm-evaluation-harness` for the actual model evaluation
- Outputs results as JSON files

### 5. **Output Organization**
Results are organized in timestamped directories under `$EVAL_OUTPUT_DIR/$USER/`:
```
2024-01-15-14-30-45/
├── jobs.csv # Complete evaluation manifest
├── submit_evals.sbatch # Generated SLURM script
├── slurm_logs/ # SLURM output/error logs
└── results/ # Evaluation JSON outputs
```
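
This layout is what `collect-results` consumes; a simplified sketch of the aggregation idea (illustrative, not the actual implementation):

```python
# Illustrative sketch: gather evaluation JSON outputs into one DataFrame.
import json
from pathlib import Path

import pandas as pd

def gather_results(run_dir: Path) -> pd.DataFrame:
    rows = []
    for path in (run_dir / "results").rglob("*.json"):
        with path.open() as f:
            rows.append({"file": str(path), **json.load(f)})
    return pd.DataFrame(rows)
```
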
### JURECA/JSC Specifics

Due to limited space in `$HOME` on JSC clusters, set these environment variables (e.g., in your `.bashrc` or `.zshrc`):

```bash
export UV_CACHE_DIR="/p/project1/<project>/$USER/.cache/uv-cache"
export UV_INSTALL_DIR="/p/project1/<project>/$USER/.local"
export UV_PYTHON_INSTALL_DIR="/p/project1/<project>/$USER/.local/share/uv/python"
export UV_TOOL_DIR="/p/project1/<project>/$USER/.cache/uv-tool-cache"
```

## Supported Clusters

- **LEONARDO** - NVIDIA A100 GPUs (CUDA)
- **LUMI** - AMD MI250X GPUs (ROCm)
- **JURECA** - NVIDIA A100 GPUs (CUDA)

Each cluster comes pre-configured with:
- Shared evaluation directories with appropriate quotas
- Optimized Singularity containers with evaluation dependencies
- Account and partition settings for the OpenEuroLLM project

## CLI Options

```bash
oellm schedule-eval --help
```

## Development

```bash
# Clone and install in dev mode
git clone https://github.com/OpenEuroLLM/oellm-cli.git
cd oellm-cli
uv sync --extra dev

# Run dataset validation tests
uv run pytest tests/test_datasets.py -v

# Download-only mode for testing
uv run oellm schedule-eval --models "EleutherAI/pythia-160m" --task_groups "open-sci-0.01" --download_only
```

## Troubleshooting

**HuggingFace quota issues**: Ensure you're logged in with `HF_TOKEN` and are part of the [OpenEuroLLM](https://huggingface.co/OpenEuroLLM) organization.

**Dataset download failures on compute nodes**: Use `--task_groups` for automatic dataset caching, or pre-download datasets manually before scheduling.
oellm/main.py (27 additions, 19 deletions)
@@ -1,4 +1,5 @@
import logging
import math
import os
import re
import subprocess
@@ -8,20 +9,21 @@
from pathlib import Path
from string import Template

import pandas as pd
from jsonargparse import auto_cli

from oellm.task_groups import (
    _collect_dataset_specs,
    _expand_task_groups,
    _lookup_dataset_specs_for_tasks,
)
from oellm.utils import (
    _ensure_singularity_image,
    _expand_local_model_paths,
    _filter_warnings,
    _load_cluster_env,
    _num_jobs_in_queue,
    _pre_download_datasets_from_specs,
    _process_model_paths,
    _setup_logging,
    capture_third_party_output_from_kwarg,
@@ -199,17 +201,24 @@ def schedule_evals(
    # Ensure that all datasets required by the tasks are cached locally to avoid
    # network access on compute nodes.
    if not skip_checks:
        dataset_specs = []
        if task_groups:
            dataset_specs = _collect_dataset_specs(
                [g.strip() for g in task_groups.split(",")]
            )
        else:
            # Look up individual tasks in task groups registry
            all_tasks = df["task_path"].unique().tolist()
            dataset_specs = _lookup_dataset_specs_for_tasks(all_tasks)
            if not dataset_specs:
                logging.info(
                    "No dataset specs found for tasks; skipping dataset pre-download"
                )

        if dataset_specs:
            _pre_download_datasets_from_specs(
                dataset_specs, trust_remote_code=trust_remote_code
            )
    else:
        logging.info("Skipping dataset pre-download (--skip-checks enabled)")

@@ -254,15 +263,15 @@ def schedule_evals(
    minutes_per_eval = 10  # Budget 10 minutes per eval
    total_minutes = total_evals * minutes_per_eval
    max_minutes_per_job = 18 * 60  # 18 hours
    min_array_size_for_time = max(1, int(math.ceil(total_minutes / max_minutes_per_job)))
    desired_array_size = min(128, total_evals) if total_evals >= 128 else total_evals
    if desired_array_size < min_array_size_for_time:
        desired_array_size = min_array_size_for_time
    actual_array_size = min(remaining_queue_capacity, desired_array_size, total_evals)
    evals_per_job = max(1, int(math.ceil(total_evals / actual_array_size)))
    minutes_per_job = evals_per_job * minutes_per_eval
    minutes_with_margin = int(minutes_per_job * 1.2)
    hours_with_margin = max(1, int(math.ceil(minutes_with_margin / 60)))
    hours_with_margin = max(hours_with_margin, 3)
    hours_with_margin = min(hours_with_margin, 23)
    time_limit = f"{hours_with_margin:02d}:59:00"
@@ -633,7 +642,6 @@ def main():
        {
            "schedule-eval": schedule_evals,
            "collect-results": collect_results,
        },
        as_positional=False,
        description="OELLM: Multi-cluster evaluation tool for language models",
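
To make the sizing logic above concrete, here is a worked example with illustrative numbers (in the real code, `remaining_queue_capacity` is derived from the user's current SLURM queue load):

```python
import math

total_evals = 500                 # illustrative: 500 model-task-shot combinations
minutes_per_eval = 10             # budget from the scheduling code above
max_minutes_per_job = 18 * 60     # 18-hour cap per array job

total_minutes = total_evals * minutes_per_eval                                      # 5000
min_array_size_for_time = max(1, math.ceil(total_minutes / max_minutes_per_job))    # 5
desired_array_size = min(128, total_evals) if total_evals >= 128 else total_evals   # 128
desired_array_size = max(desired_array_size, min_array_size_for_time)               # 128
remaining_queue_capacity = 200    # illustrative queue headroom
actual_array_size = min(remaining_queue_capacity, desired_array_size, total_evals)  # 128
evals_per_job = max(1, math.ceil(total_evals / actual_array_size))                  # 4
minutes_with_margin = int(evals_per_job * minutes_per_eval * 1.2)                   # 48
hours = min(max(max(1, math.ceil(minutes_with_margin / 60)), 3), 23)                # 3
print(f"{hours:02d}:59:00")  # -> 03:59:00, the SLURM time limit
```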