- Paper: InFoBench: Evaluating Instruction Following Ability in Large Language Models
- Dataset: InFoBench Dataset
- Generation and Annotation: InFoBench Generation and Annotation
```bibtex
@article{qin2024infobench,
  title={InFoBench: Evaluating Instruction Following Ability in Large Language Models},
  author={Yiwei Qin and Kaiqiang Song and Yebowen Hu and Wenlin Yao and Sangwoo Cho and Xiaoyang Wang and Xuansheng Wu and Fei Liu and Pengfei Liu and Dong Yu},
  year={2024},
  eprint={2401.03601},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
You can download it directly with Hugging Face `datasets`:

```python
from datasets import load_dataset

dataset = load_dataset("kqsong/InFoBench")
```

Provide an output file in `model/output.json`.
Each data entry should be a JSON object on its own line, containing all the fields from the input format.
The generated response should be included in the JSON object as a new field named `output`.
We suggest using greedy decoding to avoid randomness in decoding.
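A minimal sketch of this generation step is shown below, assuming a Hugging Face causal LM. The model name, prompt construction, the `instruction`/`input` field names, and the `train` split are illustrative assumptions, not prescribed by the repo; substitute whatever matches the model you are evaluating.

```python
import json
import os

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; replace with the model under evaluation.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# "train" split name is an assumption; check dataset.keys() if it differs.
dataset = load_dataset("kqsong/InFoBench")["train"]

os.makedirs("model", exist_ok=True)
with open("model/output.json", "w") as f:
    for example in dataset:
        # Prompt construction is an assumption: concatenate the instruction
        # with the (possibly empty) input field.
        prompt = example["instruction"] + "\n" + (example.get("input") or "")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        generated = model.generate(
            **inputs,
            max_new_tokens=1024,
            do_sample=False,  # greedy decoding, as suggested above
        )
        # Decode only the newly generated tokens.
        response = tokenizer.decode(
            generated[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        # Keep all original fields and add the response under "output".
        entry = dict(example)
        entry["output"] = response
        f.write(json.dumps(entry) + "\n")
```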
Evaluate the LLM's outputs on the decomposed questions. This research uses GPT-4-0314 as the default evaluator:
```bash
python evaluation.py \
  --api_key <OPENAI KEY> \
  --eval_model gpt-4-0314 \
  --input model/output.json \
  --output_dir evaluation/ \
  --temperature 0
```

Each data entry will include an "eval" key in the format of `List[bool]`, which represents "Yes" or "No" answers to each decomposed question.
The final evaluation file will be saved in JSON format under `<output_dir>/<eval_model>/`.
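For an aggregate score, the paper's Decomposed Requirements Following Ratio (DRFR) is the fraction of decomposed questions answered "Yes". Below is a minimal sketch for computing it from the saved evaluation file; the exact filename under `<output_dir>/<eval_model>/` and whether it is a JSON array or JSON Lines are assumptions, so point the path at the file the script actually produces.

```python
import json

def drfr(path):
    """Compute the Decomposed Requirements Following Ratio:
    the fraction of decomposed questions judged "Yes" (True)."""
    total, satisfied = 0, 0
    with open(path) as f:
        entries = json.load(f)  # adjust if the file is JSON Lines instead
    for entry in entries:
        evals = entry["eval"]  # List[bool], one per decomposed question
        total += len(evals)
        satisfied += sum(evals)
    return satisfied / total if total else 0.0

# Hypothetical path; use the file written under <output_dir>/<eval_model>/.
print(f"DRFR: {drfr('evaluation/gpt-4-0314/output.json'):.4f}")
```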