94 changes: 94 additions & 0 deletions tinker_cookbook/recipes/rubric/README.md
@@ -0,0 +1,94 @@
# Rubric-based Grading for LLMs

- [`data.py`](./data.py) defines the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- [`generate_data.py`](./generate_data.py) generates example datapoints so you can run our demo on addition.
- [`env.py`](./env.py) determines what each rollout does: the policy reads the prefix and generates a response, a grader LLM scores the response against each rubric item, and the reward is the sum of the scores extracted from the graders' responses (see the sketch after this list).
- [`debug_env.py`](./debug_env.py) runs a single rollout with debug printing so you can inspect what the policy and grader see (see the debugging section below).
- [`train.py`](./train.py) lets you train LLMs on any dataset saved in our format (specified in `data.py`). The default script trains on the addition task, whose data is generated by `generate_data.py`.
- [`prometheus_experimental.py`](./prometheus_experimental.py) contains a script to train LLMs on the rubrics from the [`prometheus-eval/Feedback-Collection`](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/viewer/default/train?row=0&views%5B%5D=train) dataset. It is experimental: even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point; more research is needed.
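
Conceptually, the reward in `env.py` is computed by looping over the rubric items. Below is a minimal sketch of that loop built on the `Rubric` API from `data.py`; `grader_llm` is a hypothetical stand-in for the message completer the actual env uses, and the real `env.py` (not shown in this diff) may differ in details:

```
from typing import Awaitable, Callable, Sequence

from tinker_cookbook.recipes.rubric.data import Conversation, Rubric
from tinker_cookbook.renderers import Message


# Hypothetical sketch of the per-rollout grading loop.
async def compute_reward(
    grader_llm: Callable[[Conversation], Awaitable[Message]],
    convo: Conversation,
    rubric_items: Sequence[Rubric],
) -> float:
    total = 0.0
    for rubric in rubric_items:
        # Each rubric builds a one-message grader prompt from the conversation.
        grader_prompt = rubric.get_grader_prompt(convo)
        grader_response = await grader_llm(grader_prompt)
        # Parse the <score> ... </score> tag; parse failures fall back to 0.0.
        total += rubric.extract_score(grader_response["content"])
    return total
```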


## A simple example of using a grader LLM with rubrics

We show how to use a grader LLM with a rubric to provide a reward for an addition task. For example:

```
**User**: What's 233 + 100?
**Assistant**: 333
```

Usually, this could be graded by matching the number against the ground truth 333, without needing an LLM. For pedagogical purposes, however, we grade the response using a language model with a rubric: we ask the grader, "Does the assistant answer 333?"
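
Concretely, using the `Rubric` class defined in `data.py` (shown later in this diff), the grading step looks like this:

```
from tinker_cookbook.recipes.rubric.data import Rubric

rubric = Rubric(rubric_str="Does the assistant answer 333?")

# Build the grader prompt; the last message in the conversation is treated
# as the completion to grade.
grader_prompt = rubric.get_grader_prompt(
    [
        {"role": "user", "content": "What's 233 + 100?"},
        {"role": "assistant", "content": "333"},
    ]
)

# The grader is instructed to wrap its score in <score> ... </score>;
# extract_score parses it out, falling back to 0.0 on a parse failure.
score = rubric.extract_score("<score>1.0</score>")
assert score == 1.0
```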

### Generate an example dataset

To run this, first generate a dataset:

```
python -m tinker_cookbook.recipes.rubric.generate_data
```

This generates two `jsonl` files, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of
- `convo`: the conversation prefix that the policy sees
- `rubric_items`: a list of rubric items specifying what counts as a good response, how the grader should format its response, and how the grading result should be extracted

```
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
```
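
You can load these files back with `RubricDatapointListBuilderFromJsonl` from `data.py`. A small sketch, assuming `chz` classes are constructed with keyword arguments as elsewhere in the cookbook:

```
from tinker_cookbook.recipes.rubric.data import RubricDatapointListBuilderFromJsonl

builder = RubricDatapointListBuilderFromJsonl(
    jsonl_path="tinker_cookbook/example_data/example_rubric_train.jsonl"
)
datapoints = builder()
print(len(datapoints), datapoints[0].rubric_items[0].rubric_str)
```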

### Debugging and printing what happens during rollouts

Run:
```
python -m tinker_cookbook.recipes.rubric.debug_env
```

This prints the messages that the policy sees, its response, the grader input, and the grader output.

<img width="1168" height="771" alt="Debug output showing the conversation context, policy response, grader prompt, and extracted score" src="https://github.com/user-attachments/assets/9f4e3c89-f21e-49b0-96d6-e2f27bd21b43" />


### An example training run

To train the LLM on the addition task with a rubric-based grader, run:
```
python -m tinker_cookbook.recipes.rubric.train
```

You can see that the reward quickly goes up.

<img width="705" height="279" alt="Training metrics showing reward increasing over training steps for the addition task" src="https://github.com/user-attachments/assets/2f825805-20a7-4cf3-8d06-55d5e9a98098" />

### A more realistic dataset

We take the `prometheus-eval/Feedback-Collection` dataset from [Hugging Face](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/), which contains rubrics to grade general chat responses. Run the following to kick off training:

```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs steadily.

> **Reviewer comment:** it would be nice to show that the model is learning something meaningful

<img width="1086" height="514" alt="Training metrics showing reward climbing steadily over training steps for the Prometheus dataset" src="https://github.com/user-attachments/assets/8877ea6c-b9ea-46da-b995-046bbd3e7c80" />

Note that this training recipe is experimental -- to improve performance further, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for improving rubric-based grading for training LLMs!
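
To train on a task of your own, write datapoints in the same `jsonl` format and point `train.py` at the resulting file. A minimal sketch, where the task and rubric are made up for illustration:

```
from tinker_cookbook.recipes.rubric.data import Rubric, RubricBasedDatapoint

datapoint = RubricBasedDatapoint(
    convo=[{"role": "user", "content": "Summarize this paragraph in one sentence: ..."}],
    rubric_items=[
        Rubric(rubric_str="Is the summary a single sentence that covers the main point?"),
    ],
)

# Append one JSON object per line, as in the generated example files.
with open("my_rubric_train.jsonl", "w") as f:
    f.write(datapoint.to_json() + "\n")
```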
201 changes: 201 additions & 0 deletions tinker_cookbook/recipes/rubric/data.py
@@ -0,0 +1,201 @@
import json
import os
import re
from dataclasses import dataclass
from typing import Any, Sequence, TypeAlias

import chz

from tinker_cookbook.renderers import (
    Message,
    Role,
)

Conversation: TypeAlias = list[Message]


@dataclass
class Rubric:
    """
    A rubric should specify 1) what counts as a good response, 2) how the grader
    language model should output the score, and 3) how to extract the score from
    the grader's response.
    """

    rubric_str: str
    extraction_regex: str = r"<score>(.*)</score>"
    grader_output_format_instruction: str = (
        "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    )

    def _convert_role(self, role: Role) -> str:
        return "Human" if role in ("user", "system") else "Chatbot"

    def _flatten_convo(self, convo: Conversation) -> str:
        """
        Convert the whole conversation (user's turns + assistant's turns) into a
        single string, with turns separated by blank lines:

            Human: ...

            Chatbot: ...

            Human: ...

            Chatbot: ...
        """
        return "\n\n".join(
            [f"{self._convert_role(message['role'])}: {message['content']}" for message in convo]
        )

    def get_grader_prompt(self, convo: Conversation) -> Conversation:
        """
        Create a prompt for the grader to grade the conversation based on the rubric.
        The prompt separates the context (prior turns) from the completion (last
        assistant message) so the grader focuses on grading the most recent response.
        """
        # Separate context from the completion to grade
        context = convo[:-1]
        completion = convo[-1]

        lines = [
            "I will show you a conversation context, a chatbot completion to grade, and a rubric.",
            "Please grade the chatbot's completion based on the rubric.",
            "",
            "<context>",
            self._flatten_convo(context) if context else "(No prior context)",
            "</context>",
            "",
            "<completion_to_grade>",
            f"Chatbot: {completion['content']}",
            "</completion_to_grade>",
            "",
            "<rubric>",
            self.rubric_str,
            "</rubric>",
            "",
            f"Please grade the chatbot's completion based on the rubric. {self.grader_output_format_instruction}",
        ]
        return [
            {
                "role": "user",
                "content": "\n".join(lines),
            }
        ]

    def extract_score(self, response: str) -> float:
        match = re.search(self.extraction_regex, response, re.DOTALL)
        if match is not None:
            try:
                return float(match.group(1))
            except ValueError:
                print(f"Warning: Failed to extract score from grader response: {response}")
                return 0.0
        else:
            print(f"Warning: Failed to extract score from grader response: {response}")
            return 0.0

    def to_dict(self) -> dict[str, str]:
        return {
            "rubric_str": self.rubric_str,
            "extraction_regex": self.extraction_regex,
            "grader_output_format_instruction": self.grader_output_format_instruction,
        }

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @staticmethod
    def from_dict(d: dict[str, str]) -> "Rubric":
        return Rubric(
            rubric_str=d["rubric_str"],
            extraction_regex=d["extraction_regex"],
            grader_output_format_instruction=d["grader_output_format_instruction"],
        )

    @staticmethod
    def from_json(json_str: str) -> "Rubric":
        return Rubric.from_dict(json.loads(json_str))


@dataclass(frozen=True)
class RubricBasedDatapoint:
    """
    A rubric-based datapoint contains a conversation and a list of rubric items.
    In this task, the policy model sees the conversation and creates a response,
    and then the grader language model grades the response based on the rubric.
    """

    convo: Conversation
    rubric_items: Sequence[Rubric]

    def to_json(self) -> str:
        return json.dumps(
            {
                "convo": self.convo,
                "rubric_items": [rubric.to_dict() for rubric in self.rubric_items],
            }
        )

    @staticmethod
    def from_json(json_str: str) -> "RubricBasedDatapoint":
        d = json.loads(json_str)
        return RubricBasedDatapoint(
            convo=d["convo"],
            rubric_items=[Rubric.from_dict(rubric) for rubric in d["rubric_items"]],
        )


@chz.chz
class RubricDatapointListBuilder:
    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        """Load and return a sequence of rubric-based datapoints."""
        raise NotImplementedError("Subclass must implement this method")


@chz.chz
class RubricDatapointListBuilderFromJsonl(RubricDatapointListBuilder):
    jsonl_path: str

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        if not os.path.exists(self.jsonl_path):
            raise FileNotFoundError(
                f"Data file not found: {self.jsonl_path}\n"
                f"Please generate the example data first by running:\n"
                f"  python -m tinker_cookbook.recipes.rubric.generate_data"
            )
        datapoints = []
        with open(self.jsonl_path, "r") as f:
            for line in f:
                datapoints.append(RubricBasedDatapoint.from_json(line))
        return datapoints


@chz.chz
class PrometheusDatapointListBuilder(RubricDatapointListBuilder):
    data_path: str = "prometheus-eval/Feedback-Collection"

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        from datasets import load_dataset

        train_dataset = load_dataset(self.data_path)["train"]
        return [self.build_rubric_datapoint(item) for item in train_dataset]  # type: ignore

    def build_rubric_datapoint(self, item: dict[str, Any]) -> RubricBasedDatapoint:
        convo: Conversation = [
            {"role": "user", "content": item["orig_instruction"]},
        ]

        rubric_lines = [
            f"Your job is to evaluate the following: {item['orig_criteria']}. Your response should be a score between 1 and 5.",
            "Here is the calibration for each score:",
        ]
        for i in range(1, 6):
            rubric_lines.append(f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}")
        rubric_lines.append(
            f"Here is a reference response that achieved a score of 5: {item['orig_reference_answer']}"
        )
        rubric_text = "\n".join(rubric_lines)

        rubric = Rubric(
            rubric_str=rubric_text,
            extraction_regex=r"<score>(.*)</score>",
            grader_output_format_instruction="Please output your score between 1 and 5 wrapped in <score> ... </score>",
        )

        return RubricBasedDatapoint(
            convo=convo,
            rubric_items=[rubric],
        )
78 changes: 78 additions & 0 deletions tinker_cookbook/recipes/rubric/debug_env.py
@@ -0,0 +1,78 @@
import asyncio

import tinker

from tinker_cookbook import model_info
from tinker_cookbook.completers import TinkerMessageCompleter, TinkerTokenCompleter
from tinker_cookbook.recipes.rubric.env import Rubric, RubricBasedDatapoint, RubricGradedEnv
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.rl.rollouts import do_single_rollout
from tinker_cookbook.tokenizer_utils import get_tokenizer


def get_addition_datapoint() -> RubricBasedDatapoint:
    datapoint = RubricBasedDatapoint(
        convo=[
            {"role": "user", "content": "What is 4 + 5?"},
            {"role": "assistant", "content": "9"},
            {"role": "user", "content": "What is 125 + 311?"},
        ],
        rubric_items=[
            Rubric(rubric_str="Does the chatbot correctly get the answer 436?"),
            Rubric(rubric_str="Does the chatbot provide an answer without saying anything else?"),
        ],
    )

    return datapoint


def get_prometheus_datapoint() -> RubricBasedDatapoint:
    from tinker_cookbook.recipes.rubric.data import PrometheusDatapointListBuilder

    datapoints = PrometheusDatapointListBuilder()()
    return datapoints[0]


async def main(datapoint: RubricBasedDatapoint):
    # Configuration parameters
    policy_name = "meta-llama/Llama-3.1-8B-Instruct"
    grader_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
    policy_max_tokens = 64
    grader_max_tokens = 64

    service_client = tinker.ServiceClient()
    # The policy samples at the token level; its renderer is passed to the env.
    policy = TinkerTokenCompleter(
        sampling_client=service_client.create_sampling_client(base_model=policy_name),
        max_tokens=policy_max_tokens,
    )
    policy_renderer = get_renderer(
        model_info.get_recommended_renderer_name(policy_name), get_tokenizer(policy_name)
    )
    # The grader works at the message level, so it carries its own renderer.
    grader = TinkerMessageCompleter(
        sampling_client=service_client.create_sampling_client(base_model=grader_name),
        renderer=get_renderer(
            model_info.get_recommended_renderer_name(grader_name), get_tokenizer(grader_name)
        ),
        max_tokens=grader_max_tokens,
    )

    env = RubricGradedEnv(
        renderer=policy_renderer,
        datapoint=datapoint,
        grader_llm=grader,
        debug=True,  # print the policy and grader inputs/outputs
    )

    await do_single_rollout(policy, env)


if __name__ == "__main__":
    # Switch to "prometheus" to debug a Feedback-Collection datapoint instead.
    dataset = "addition"

    if dataset == "addition":
        datapoint = get_addition_datapoint()
    elif dataset == "prometheus":
        datapoint = get_prometheus_datapoint()
    else:
        raise ValueError(f"Unknown dataset: {dataset}")

    asyncio.run(main(datapoint))