# [Cookbook Recipe of Using Rubric] #166
Merged
**Commits (18)**

- `d27b7c7` a very simple example based on rubric grading (ruiqi-zhong)
- `aa0c6de` b (ruiqi-zhong)
- `40e5374` b (ruiqi-zhong)
- `c51822e` b (ruiqi-zhong)
- `cf812c7` b (ruiqi-zhong)
- `163d816` b (ruiqi-zhong)
- `f370ab8` b (ruiqi-zhong)
- `4292086` b (ruiqi-zhong)
- `56f9beb` adding readme (ruiqi-zhong)
- `4213c9f` adding readme (ruiqi-zhong)
- `456a03c` b (ruiqi-zhong)
- `d073bf8` b (ruiqi-zhong)
- `843fce5` Merge branch 'main' into rubric_example (ruiqi-zhong)
- `0fafd14` b (ruiqi-zhong)
- `07a54ef` b (ruiqi-zhong)
- `99feeba` . (joschu)
- `74a2eb8` . (joschu)
- `29a6fc3` . (joschu)
---

**`README.md`**
# Rubric-based Grading for LLMs

- [`data.py`](./data.py) contains the definition of the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- [`generate_data.py`](./generate_data.py) generates some example datapoints if you want to run our demo on addition.
- [`env.py`](./env.py) determines what each rollout does. It lets the policy read the prefix and generate a response, asks a grader LLM to grade the response against each rubric item, and finally computes the reward by summing the graders' scores (see the sketch after this list).
- [`train.py`](./train.py) lets you train LLMs on any dataset saved in our format (specified in `data.py`). The default script trains on the addition task, whose data is generated by `generate_data.py`.
- [`prometheus_experimental.py`](./prometheus_experimental.py) contains a script to train LLMs on the rubrics from the [`prometheus-eval/Feedback-Collection`](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/viewer/default/train?row=0&views%5B%5D=train) dataset. It is experimental, though: even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point; more research is needed.
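Conceptually, the reward that `env.py` assigns to a rollout is just the sum of per-rubric scores. Here is a minimal sketch of that loop using the `Rubric` class from `data.py` (shown in full below); `call_grader` is a hypothetical stand-in for your grader LLM call, not part of the recipe's API:

```python
from typing import Callable

from tinker_cookbook.recipes.rubric.data import Conversation, Rubric


def rubric_reward(
    convo_with_response: Conversation,
    rubric_items: list[Rubric],
    call_grader: Callable[[Conversation], str],  # hypothetical: returns the grader's text reply
) -> float:
    """Sum the grader's score over all rubric items (env.py does this during rollouts)."""
    total = 0.0
    for rubric in rubric_items:
        grader_prompt = rubric.get_grader_prompt(convo_with_response)  # a single user message
        grader_reply = call_grader(grader_prompt)  # e.g. "<score>1.0</score>"
        total += rubric.extract_score(grader_reply)  # regex extraction; 0.0 on failure
    return total
```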
## A simple example of using a grader LLM with rubrics

We show how to use a grader LLM with a rubric to provide a reward for an addition task, e.g.

```
**User**: What's 233 + 100?
**Assistant**: 333
```

Usually, this could be graded by matching the number against the ground truth 333, without needing an LLM. However, for pedagogical purposes, we will grade the response using a language model with a rubric. That is, we will ask a language model "Does the assistant answer 333?"
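In code, that question is just a `Rubric` item; with the defaults defined in `data.py`, one line is enough (a sketch, not a snippet from the recipe itself):

```python
from tinker_cookbook.recipes.rubric.data import Rubric

# extraction_regex and grader_output_format_instruction keep their defaults,
# so the grader is asked for a 0-to-1 score wrapped in <score> ... </score>.
rubric = Rubric(rubric_str="Does the assistant answer 333?")
```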

### Generate an example dataset

To run this, first generate a dataset:

```
python -m tinker_cookbook.recipes.rubric.generate_data
```

You will then see two `jsonl` files generated, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of
- `convo`: the conversation prefix that the policy sees
- `rubric_items`: a list of rubric items that specify what counts as a good response, how the grader should format its response, and how the grading result should be extracted.

```json
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
```
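To load such a file back into memory, you can use the `RubricDatapointListBuilderFromJsonl` builder defined in `data.py` (shown in full below); a minimal sketch:

```python
from tinker_cookbook.recipes.rubric.data import RubricDatapointListBuilderFromJsonl

# Reads one RubricBasedDatapoint per line; raises a helpful error if the
# file is missing because generate_data has not been run yet.
builder = RubricDatapointListBuilderFromJsonl(
    jsonl_path="tinker_cookbook/example_data/example_rubric_train.jsonl"
)
datapoints = builder()
print(len(datapoints), datapoints[0].rubric_items[0].rubric_str)
```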

### Debugging and printing what happens during rollouts

Run
```
python -m tinker_cookbook.recipes.rubric.debug_env
```

You can then see the messages that the policy sees, its response, the grader input, and the grader output.

<img width="1168" height="771" alt="Debug output showing the conversation context, policy response, grader prompt, and extracted score" src="https://github.com/user-attachments/assets/9f4e3c89-f21e-49b0-96d6-e2f27bd21b43" />

### An example training run

To train the LLM on addition, with a rubric-based grader LLM providing the reward, run
```
python -m tinker_cookbook.recipes.rubric.train
```

You can see that the reward quickly goes up.

<img width="705" height="279" alt="Training metrics showing reward increasing over training steps for the addition task" src="https://github.com/user-attachments/assets/2f825805-20a7-4cf3-8d06-55d5e9a98098" />

### A more realistic dataset

We take the `prometheus-eval/Feedback-Collection` dataset from [Hugging Face](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/), which contains rubrics for grading general chat responses. Run the following to kick off training:

```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs steadily.

<img width="1086" height="514" alt="Training metrics showing reward climbing steadily over training steps for the Prometheus dataset" src="https://github.com/user-attachments/assets/8877ea6c-b9ea-46da-b995-046bbd3e7c80" />

Note that this training recipe is experimental: to improve performance, we may also need to fine-tune the grader LLM. We hope our code serves as a starting point for you to improve rubric-based grading for training LLMs!

---

**`data.py`**

```python
import json
import os
import re
from dataclasses import dataclass
from typing import Any, Sequence, TypeAlias

import chz

from tinker_cookbook.renderers import (
    Message,
    Role,
)

Conversation: TypeAlias = list[Message]


@dataclass
class Rubric:
    """
    A rubric should specify 1) what counts as a good response, 2) how the grader language model should output the score, and 3) how to extract the score from the grader's response.
    """

    rubric_str: str
    extraction_regex: str = r"<score>(.*)</score>"
    grader_output_format_instruction: str = (
        "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    )

    def _convert_role(self, role: Role) -> str:
        return "Human" if role in ("user", "system") else "Chatbot"

    def _flatten_convo(self, convo: Conversation) -> str:
        """
        Convert the whole conversation (user's turns + assistant's turns) into a single string. E.g.
        \n\nHuman: ...
        \n\nChatbot: ...
        \n\nHuman: ...
        \n\nChatbot: ...
        """
        return "\n\n".join(
            [f"{self._convert_role(message['role'])}: {message['content']}" for message in convo]
        )

    def get_grader_prompt(self, convo: Conversation) -> Conversation:
        """
        Create a prompt for the grader to grade the conversation based on the rubric.
        The prompt separates the context (prior turns) from the completion (last assistant message)
        so the grader focuses on grading the most recent response.
        """
        # Separate context from the completion to grade
        context = convo[:-1]
        completion = convo[-1]

        lines = [
            "I will show you a conversation context, a chatbot completion to grade, and a rubric.",
            "Please grade the chatbot's completion based on the rubric.",
            "",
            "<context>",
            self._flatten_convo(context) if context else "(No prior context)",
            "</context>",
            "",
            "<completion_to_grade>",
            f"Chatbot: {completion['content']}",
            "</completion_to_grade>",
            "",
            "<rubric>",
            self.rubric_str,
            "</rubric>",
            "",
            f"Please grade the chatbot's completion based on the rubric. {self.grader_output_format_instruction}",
        ]
        return [
            {
                "role": "user",
                "content": "\n".join(lines),
            }
        ]

    def extract_score(self, response: str) -> float:
        match = re.search(self.extraction_regex, response, re.DOTALL)
        if match is not None:
            try:
                return float(match.group(1))
            except ValueError:
                print(f"Warning: Failed to extract score from grader response: {response}")
                return 0.0
        else:
            print(f"Warning: Failed to extract score from grader response: {response}")
            return 0.0

    def to_dict(self) -> dict[str, str]:
        return {
            "rubric_str": self.rubric_str,
            "extraction_regex": self.extraction_regex,
            "grader_output_format_instruction": self.grader_output_format_instruction,
        }

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @staticmethod
    def from_dict(d: dict[str, str]) -> "Rubric":
        return Rubric(
            rubric_str=d["rubric_str"],
            extraction_regex=d["extraction_regex"],
            grader_output_format_instruction=d["grader_output_format_instruction"],
        )

    @staticmethod
    def from_json(json_str: str) -> "Rubric":
        return Rubric.from_dict(json.loads(json_str))


@dataclass(frozen=True)
class RubricBasedDatapoint:
    """
    A rubric-based datapoint contains a conversation and a rubric.
    In this task, the policy model sees the conversation, creates a response, and then the grader language model grades the response based on the rubric.
    """

    convo: Conversation
    rubric_items: Sequence[Rubric]

    def to_json(self) -> str:
        return json.dumps(
            {
                "convo": self.convo,
                "rubric_items": [rubric.to_dict() for rubric in self.rubric_items],
            }
        )

    @staticmethod
    def from_json(json_str: str) -> "RubricBasedDatapoint":
        d = json.loads(json_str)
        return RubricBasedDatapoint(
            convo=d["convo"],
            rubric_items=[Rubric.from_dict(rubric) for rubric in d["rubric_items"]],
        )


@chz.chz
class RubricDatapointListBuilder:
    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        """Load and return a sequence of rubric-based datapoints."""
        raise NotImplementedError("Subclass must implement this method")


@chz.chz
class RubricDatapointListBuilderFromJsonl(RubricDatapointListBuilder):
    jsonl_path: str

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        if not os.path.exists(self.jsonl_path):
            raise FileNotFoundError(
                f"Data file not found: {self.jsonl_path}\n"
                f"Please generate the example data first by running:\n"
                f"  python -m tinker_cookbook.recipes.rubric.generate_data"
            )
        datapoints = []
        with open(self.jsonl_path, "r") as f:
            for line in f:
                datapoints.append(RubricBasedDatapoint.from_json(line))
        return datapoints


@chz.chz
class PrometheusDatapointListBuilder(RubricDatapointListBuilder):
    data_path: str = "prometheus-eval/Feedback-Collection"

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        from datasets import load_dataset

        train_dataset = load_dataset(self.data_path)["train"]
        return [self.build_rubric_datapoint(item) for item in train_dataset]  # type: ignore

    def build_rubric_datapoint(self, item: dict[str, Any]) -> RubricBasedDatapoint:
        convo: Conversation = [
            {"role": "user", "content": item["orig_instruction"]},
        ]

        rubric_lines = [
            f"Your job is to evaluate the following: {item['orig_criteria']}. Your response should be a score between 1 to 5.",
            "Here is the calibration for each score:",
        ]
        for i in range(1, 6):
            rubric_lines.append(f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}")
        rubric_lines.append(
            f"Here is a reference response that achieved a score of 5: {item['orig_reference_answer']}"
        )
        rubric_text = "\n".join(rubric_lines)

        rubric = Rubric(
            rubric_str=rubric_text,
            extraction_regex=r"<score>(.*)</score>",
            grader_output_format_instruction="Please output your score between 1 and 5 wrapped in <score> ... </score>",
        )

        return RubricBasedDatapoint(
            convo=convo,
            rubric_items=[rubric],
        )
```
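As a quick sanity check of the pieces above, here is a small self-contained usage sketch (the grader replies are made up for illustration; nothing here calls an actual LLM):

```python
from tinker_cookbook.recipes.rubric.data import Rubric, RubricBasedDatapoint

rubric = Rubric(rubric_str="Does the chatbot correctly get the answer 134?")
datapoint = RubricBasedDatapoint(
    convo=[
        {"role": "user", "content": "What is 122 + 12?"},
        {"role": "assistant", "content": "134"},
    ],
    rubric_items=[rubric],
)

# Round-trip through the jsonl serialization used by the dataset files.
assert RubricBasedDatapoint.from_json(datapoint.to_json()).convo == datapoint.convo

# Build the grader prompt (a single user message) and parse hypothetical grader replies.
grader_prompt = rubric.get_grader_prompt(datapoint.convo)
print(grader_prompt[0]["content"])                  # the full grading request
print(rubric.extract_score("<score>1.0</score>"))   # -> 1.0
print(rubric.extract_score("no score tags here"))   # -> 0.0, with a warning
```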

---

**`debug_env.py`**

```python
from tinker_cookbook import model_info
from tinker_cookbook.recipes.rubric.env import RubricGradedEnv, RubricBasedDatapoint, Rubric
from tinker_cookbook.completers import TinkerMessageCompleter, TinkerTokenCompleter
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.tokenizer_utils import get_tokenizer
import tinker
from tinker_cookbook.rl.rollouts import do_single_rollout
import asyncio


def get_addition_datapoint() -> RubricBasedDatapoint:
    datapoint = RubricBasedDatapoint(
        convo=[
            {"role": "user", "content": "What is 4 + 5?"},
            {"role": "assistant", "content": "9"},
            {"role": "user", "content": "What is 125 + 311?"},
        ],
        rubric_items=[
            Rubric(rubric_str="Does the chatbot correctly get the answer 436?"),
            Rubric(rubric_str="Does the chatbot provide an answer without saying anything else?"),
        ],
    )

    return datapoint


def get_prometheus_datapoint() -> RubricBasedDatapoint:
    from tinker_cookbook.recipes.rubric.data import PrometheusDatapointListBuilder

    # Build all Prometheus datapoints, then take the first one for debugging.
    datapoints = PrometheusDatapointListBuilder()()
    return datapoints[0]


async def main(datapoint: RubricBasedDatapoint):
    # Configuration parameters
    policy_name = "meta-llama/Llama-3.1-8B-Instruct"
    grader_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
    policy_max_tokens = 64
    grader_max_tokens = 64

    service_client = tinker.ServiceClient()
    # Policy: a token-level completer; its prompts are rendered by policy_renderer.
    policy = TinkerTokenCompleter(
        sampling_client=service_client.create_sampling_client(base_model=policy_name),
        max_tokens=policy_max_tokens,
    )
    policy_renderer = get_renderer(
        model_info.get_recommended_renderer_name(policy_name), get_tokenizer(policy_name)
    )
    # Grader: a message-level completer that scores the policy's response against the rubric.
    grader = TinkerMessageCompleter(
        sampling_client=service_client.create_sampling_client(base_model=grader_name),
        renderer=get_renderer(
            model_info.get_recommended_renderer_name(grader_name), get_tokenizer(grader_name)
        ),
        max_tokens=grader_max_tokens,
    )

    env = RubricGradedEnv(
        renderer=policy_renderer,
        datapoint=datapoint,
        grader_llm=grader,
        debug=True,  # print the policy's context, its response, and the grader's input/output
    )

    await do_single_rollout(policy, env)


if __name__ == "__main__":
    dataset = "addition"

    if dataset == "addition":
        datapoint = get_addition_datapoint()
        asyncio.run(main(datapoint))
    elif dataset == "prometheus":
        datapoint = get_prometheus_datapoint()
        asyncio.run(main(datapoint))
    else:
        raise ValueError(f"Unknown dataset: {dataset}")
```
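To debug a Prometheus datapoint instead of the addition one, change `dataset = "addition"` at the bottom of the script to `dataset = "prometheus"`.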

---

**Review comment:** it would be nice to show that the model is learning something meaningful