Conversation

@shreyashkar-ml (Collaborator) commented Aug 10, 2025

This commit adds support for the lm-eval library and moves quantization inference and model context-window extraction to the common utils module.

  1. Added adapter for transforming lm-eval outputs to the unified schema format.
  2. Added converter for running lm-eval and dumping outputs to the unified schema format.
  3. Added test for the adapter and converter, with test config for the lm-eval library in config/lm_eval_test_config.yaml.
  4. Added _infer_quantization and _extract_context_window_from_config functions to the common utils module.

Complete pipeline: YAML → LMEvalRunner → lm-eval → LMEvalAdapter → Unified Schema
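For orientation, a rough sketch of how those stages could be wired together. The import paths and the run/transform method names are assumptions for illustration; the actual interfaces are whatever this PR's runner and adapter modules define.

import yaml

from eval_lmeval.runner import LMEvalRunner     # import paths are assumptions
from eval_lmeval.adapter import LMEvalAdapter

# Load the YAML config that drives the run (test config path from this PR).
with open("config/lm_eval_test_config.yaml") as f:
    config = yaml.safe_load(f)

runner = LMEvalRunner(config)              # wraps the lm-eval invocation
output_dir = runner.run()                  # hypothetical: lm-eval writes its raw logs here

adapter = LMEvalAdapter()                  # reads the lm-eval outputs from disk
records = adapter.transform(output_dir)    # hypothetical: emits unified-schema records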

@damian1996 (Collaborator) commented Aug 14, 2025

The naming changes (for example, helm to eval_helm) are unnecessary because Andrew's PR will clean this up.

@damian1996 (Collaborator) left a comment

Comments for now; I will probably add more on the adapter file, around extracting scores and samples.


# Helpers

def _infer_quantization(model_name_or_path: str) -> tuple[BitPrecision, Method]:
Collaborator

What do you think about moving this function to common/utils.py? There are usually issues with extracting quantization info directly from evaluation logs, so this could be useful as a shared way to get that info.

Collaborator Author

Yes, this is used in more than a few places; I will probably move this and the get_context_size part to utils.py.
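For reference, a minimal, self-contained sketch of what the shared helper could look like after the move. The BitPrecision and Method stand-ins and their members below are assumptions; the real enums live in eval_types.py and may differ.

import re
from enum import Enum


class BitPrecision(Enum):  # stand-in for the real enum in eval_types.py
    None_ = "none"
    INT4 = "int4"
    INT8 = "int8"


class Method(Enum):  # stand-in for the real enum in eval_types.py
    None_ = "none"
    GPTQ = "gptq"
    AWQ = "awq"
    BNB = "bitsandbytes"


def _infer_quantization(model_name_or_path: str) -> tuple[BitPrecision, Method]:
    """Best-effort guess of quantization settings from a model name or path."""
    name = model_name_or_path.lower()

    # Bit precision from common naming patterns such as "4bit", "4-bit", or "int8".
    precision = BitPrecision.None_
    match = re.search(r"int(4|8)|(4|8)[-_]?bit", name)
    if match:
        bits = match.group(1) or match.group(2)
        precision = BitPrecision.INT4 if bits == "4" else BitPrecision.INT8

    # Quantization method from well-known markers in the model name.
    method_map = {"gptq": Method.GPTQ, "awq": Method.AWQ, "bnb": Method.BNB}
    method = next((m for key, m in method_map.items() if key in name), Method.None_)

    return precision, method


# e.g. _infer_quantization("llama-2-7b-awq-int4") -> (BitPrecision.INT4, Method.AWQ)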

pyproject.toml Outdated

[tool.setuptools.packages.find]
include = ["helm*", "schema*", "common*", "config*"]
include = ["eval_helm*", "eval_lmeval*", "schema*", "common*", "config*"]
Collaborator

remove eval_ prefixes

}

method = method_map.get(method_key, Method.None_)
return precision, method
Collaborator

What about adding the quant type (gptq, awq, ...) to the output of this function and handling it in our schema as well?

Collaborator Author

Can do; would you recommend editing the Method class in eval_types.py, or adding a new class that maps from the quant type to the existing Method class?
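As a hypothetical sketch of the second option (all names below are illustrative stand-ins for whatever eval_types.py actually defines): keep the existing Method enum untouched and add a small QuantType enum plus one explicit mapping onto it.

from enum import Enum


class Method(Enum):        # stand-in for the existing class in eval_types.py
    None_ = "none"
    GPTQ = "gptq"
    AWQ = "awq"


class QuantType(Enum):     # new enum for the raw quant type detected from the model name
    NONE = "none"
    GPTQ = "gptq"
    AWQ = "awq"


# One place that states how each quant type maps onto the existing Method values.
QUANT_TYPE_TO_METHOD = {
    QuantType.NONE: Method.None_,
    QuantType.GPTQ: Method.GPTQ,
    QuantType.AWQ: Method.AWQ,
}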


class LMEvalAdapter(BaseEvaluationAdapter):

    CONFIG_FILE = "config.yaml"
Collaborator

Would you always have access to these four files in lm-eval logs?

Collaborator Author

The file names may change; I will add some checks to ensure naming consistency is maintained. But lm-eval works with a YAML config file (unless everything is passed via command-line arguments), so yes.
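A small sketch of the kind of naming-consistency check mentioned above; the non-config file names are placeholders rather than the adapter's actual constants.

from pathlib import Path

# Placeholder names; the adapter defines its own constants (CONFIG_FILE, RESULTS_FILE, ...).
EXPECTED_FILES = ("config.yaml", "results.json", "samples.jsonl")


def check_expected_files(dir_path: Path) -> None:
    """Fail early if an lm-eval output directory is missing any expected file."""
    missing = [name for name in EXPECTED_FILES if not (dir_path / name).exists()]
    if missing:
        raise FileNotFoundError(f"{dir_path} is missing expected lm-eval files: {missing}")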

if not dir_path.is_dir():
    raise FileNotFoundError(f"Directory {dir_path} does not exist")

cfg_path = dir_path / self.CONFIG_FILE
Collaborator

Have you tested it? I received TypeError: unsupported operand type(s) for /: 'str' and 'str'
Maybe use:
cfg_path = os.path.join(dir_path, self.CONFIG_FILE)
or
cfg_path = f'{dir_path}/{self.CONFIG_FILE}'

Collaborator Author

Agreed, I will change this to ensure robustness; it was working fine on my system, so I didn't think much of it at the time.
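For the record, a minimal sketch of that fix: coerce the argument to a pathlib.Path before any `/` joins, so both str and Path inputs work (the TypeError comes from str / str, not from pathlib itself). The function name here is illustrative.

from pathlib import Path
from typing import Union


def resolve_config_path(dir_path: Union[str, Path], config_file: str = "config.yaml") -> Path:
    dir_path = Path(dir_path)  # str or Path -> Path, so the `/` join below always works
    if not dir_path.is_dir():
        raise FileNotFoundError(f"Directory {dir_path} does not exist")
    return dir_path / config_file  # Path / str is supported by pathlib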


# Load task-level metrics
task_scores: Dict[str, Dict[str, float]] = {}
results_path = dir_path / self.RESULTS_FILE
Collaborator

Same check as above.

@@ -0,0 +1,66 @@
from __future__ import annotations
Collaborator

What is the point of this file? We probably don't need it because we won't run any experiments; we only convert users' eval logs to our unified schema.

Collaborator Author

will remove

# generative tasks often expose `exact_match` / `bleu` - handled ad-hoc
}

def detect_prompt_class(task_name: str) -> PromptClass:
Collaborator

Did you consider filling this in later, in the adapter file? Even if we don't have a field like HELM's question type, we can still confirm whether a task is multiple_choice when reading the responses for each question from the eval log.
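A hedged sketch of that idea: classify the prompt type from each per-sample record while the adapter reads it, rather than from the task name alone. The sample keys used here ("doc", "choices") are assumptions about lm-eval's per-sample output, and the real adapter would return PromptClass members instead of strings.

def detect_prompt_class_from_sample(sample: dict) -> str:
    """Classify one eval-log sample as multiple_choice or generation."""
    doc = sample.get("doc", {})
    # Multiple-choice samples typically carry an explicit list of answer options.
    choices = doc.get("choices")
    if isinstance(choices, list) and len(choices) > 1:
        return "multiple_choice"
    return "generation"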

@@ -0,0 +1,340 @@
import pytest
Collaborator

You can remove this file, because we don't run anything here.

@@ -0,0 +1,10 @@
model: hf
Collaborator

Where do you use it?

@shreyashkar-ml (Collaborator Author)

Thanks @damian1996, I will wait until Andrew's PR for the file cleanup and renaming is merged, then commit as per your recommendations to avoid merge conflicts.

…ntization and model context window extraction to the common utils module.

1. Added adapter for transforming lm-eval outputs to the unified schema format.
2. Added converter for running lm-eval and dumping outputs to the unified schema format.
3. Added test for the adapter and converter, with test config for the lm-eval library in config/lm_eval_test_config.yaml.
4. Added _infer_quantization and _extract_context_window_from_config functions to the common utils module.
@shreyashkar-ml force-pushed the shreyashkar/lm-eval-implementation branch from ca040da to b3dd337 on September 6, 2025 08:56
@shreyashkar-ml (Collaborator Author)

@damian1996, I have updated the lm-eval integration accordingly; kindly check.
