94 changes: 94 additions & 0 deletions tinker_cookbook/recipes/rubric/README.md
@@ -0,0 +1,94 @@
# Rubric-based Grading for LLMs

- [`data.py`](./data.py) defines the datapoint class. Each datapoint consists of a conversation prefix and a list of rubric items.
- [`generate_data.py`](./generate_data.py) generates example datapoints so you can run our demo on addition.
- [`env.py`](./env.py) determines what each rollout does: the policy reads the prefix and generates a response, a grader LLM scores the response against each rubric item, and the reward is the sum of the scores extracted from the graders' responses (see the sketch after this list).
- [`debug_env.py`](./debug_env.py) runs a single rollout with debug printing so you can inspect what the policy and grader see (see the debugging section below).
- [`train.py`](./train.py) lets you train LLMs on any dataset saved in our format (specified in `data.py`). The default script trains on the addition task, whose data is generated by `generate_data.py`.
- [`prometheus_experimental.py`](./prometheus_experimental.py) contains a script to train LLMs on the rubrics from the [`prometheus-eval/Feedback-Collection`](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/viewer/default/train?row=0&views%5B%5D=train) dataset. It is experimental: even though the reward goes up, there is no guarantee that the model is actually better. We hope our script serves as a starting point; more research is needed.
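
Conceptually, the reward in `env.py` is computed by looping over the rubric items. Below is a minimal sketch of that loop built on the `Rubric` API from `data.py`; `grader_llm` is a hypothetical stand-in for the message completer the actual env uses, and the real `env.py` (not shown in this diff) may differ in details:

```
from typing import Awaitable, Callable, Sequence

from tinker_cookbook.recipes.rubric.data import Conversation, Rubric
from tinker_cookbook.renderers import Message


# Hypothetical sketch of the per-rollout grading loop.
async def compute_reward(
    grader_llm: Callable[[Conversation], Awaitable[Message]],
    convo: Conversation,
    rubric_items: Sequence[Rubric],
) -> float:
    total = 0.0
    for rubric in rubric_items:
        # Each rubric builds a one-message grader prompt from the conversation.
        grader_prompt = rubric.get_grader_prompt(convo)
        grader_response = await grader_llm(grader_prompt)
        # Parse the <score> ... </score> tag; parse failures fall back to 0.0.
        total += rubric.extract_score(grader_response["content"])
    return total
```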


## A simple example of using a grader LLM with rubrics

We show how to use a grader LLM with a rubric to provide a reward for an addition task. For example:

```
**User**: What's 233 + 100?
**Assistant**: 333
```

Usually, this could be graded by matching the number against the ground truth 333, without needing an LLM. For pedagogical purposes, however, we grade the response using a language model with a rubric: we ask the grader, "Does the assistant answer 333?"
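
Concretely, using the `Rubric` class defined in `data.py` (shown later in this diff), the grading step looks like this:

```
from tinker_cookbook.recipes.rubric.data import Rubric

rubric = Rubric(rubric_str="Does the assistant answer 333?")

# Build the grader prompt; the last message in the conversation is treated
# as the completion to grade.
grader_prompt = rubric.get_grader_prompt(
    [
        {"role": "user", "content": "What's 233 + 100?"},
        {"role": "assistant", "content": "333"},
    ]
)

# The grader is instructed to wrap its score in <score> ... </score>;
# extract_score parses it out, falling back to 0.0 on a parse failure.
score = rubric.extract_score("<score>1.0</score>")
assert score == 1.0
```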

### Generate an example dataset

To run this, first generate a dataset:

```
python -m tinker_cookbook.recipes.rubric.generate_data
```

This generates two `jsonl` files, one for training and one for testing. For example, if you look into `tinker_cookbook/example_data/example_rubric_train.jsonl`, each datapoint consists of
- `convo`: the conversation prefix that the policy sees
- `rubric_items`: a list of rubric items specifying what counts as a good response, how the grader should format its response, and how the grading result should be extracted

```
{
  "convo": [
    {
      "role": "user",
      "content": "What is 4 + 5?"
    },
    {
      "role": "assistant",
      "content": "9"
    },
    {
      "role": "user",
      "content": "What is 122 + 12?"
    }
  ],
  "rubric_items": [
    {
      "rubric_str": "Does the chatbot correctly get the answer 134?",
      "extraction_regex": "<score>(.*)</score>",
      "grader_output_format_instruction": "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    }
  ]
}
```
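
You can load these files back with `RubricDatapointListBuilderFromJsonl` from `data.py`. A small sketch, assuming `chz` classes are constructed with keyword arguments as elsewhere in the cookbook:

```
from tinker_cookbook.recipes.rubric.data import RubricDatapointListBuilderFromJsonl

builder = RubricDatapointListBuilderFromJsonl(
    jsonl_path="tinker_cookbook/example_data/example_rubric_train.jsonl"
)
datapoints = builder()
print(len(datapoints), datapoints[0].rubric_items[0].rubric_str)
```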

### Debugging and printing what happens during rollouts

Run:
```
python -m tinker_cookbook.recipes.rubric.debug_env
```

This prints the messages that the policy sees, its response, the grader input, and the grader output.

<img width="1168" height="771" alt="Debug output showing the conversation context, policy response, grader prompt, and extracted score" src="https://github.com/user-attachments/assets/9f4e3c89-f21e-49b0-96d6-e2f27bd21b43" />


### An example training run

To train the LLM on the addition task with a rubric-based grader, run:
```
python -m tinker_cookbook.recipes.rubric.train
```

You can see that the reward quickly goes up.

<img width="705" height="279" alt="Training metrics showing reward increasing over training steps for the addition task" src="https://github.com/user-attachments/assets/2f825805-20a7-4cf3-8d06-55d5e9a98098" />

### A more realistic dataset

We take the `prometheus-eval/Feedback-Collection` dataset from [Hugging Face](https://huggingface.co/datasets/prometheus-eval/Feedback-Collection/), which contains rubrics to grade general chat responses. Run the following to kick off training:

```
python -m tinker_cookbook.recipes.rubric.prometheus_experimental
```

We can see that the reward climbs steadily.

> **Reviewer comment:** it would be nice to show that the model is learning something meaningful

<img width="1086" height="514" alt="Training metrics showing reward climbing steadily over training steps for the Prometheus dataset" src="https://github.com/user-attachments/assets/8877ea6c-b9ea-46da-b995-046bbd3e7c80" />

Note that this training recipe is experimental -- to improve performance further, we may need to fine-tune the grader LLM as well. We hope our code serves as a starting point for improving rubric-based grading for training LLMs!
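
To train on a task of your own, write datapoints in the same `jsonl` format and point `train.py` at the resulting file. A minimal sketch, where the task and rubric are made up for illustration:

```
from tinker_cookbook.recipes.rubric.data import Rubric, RubricBasedDatapoint

datapoint = RubricBasedDatapoint(
    convo=[{"role": "user", "content": "Summarize this paragraph in one sentence: ..."}],
    rubric_items=[
        Rubric(rubric_str="Is the summary a single sentence that covers the main point?"),
    ],
)

# Append one JSON object per line, as in the generated example files.
with open("my_rubric_train.jsonl", "w") as f:
    f.write(datapoint.to_json() + "\n")
```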
201 changes: 201 additions & 0 deletions tinker_cookbook/recipes/rubric/data.py
@@ -0,0 +1,201 @@
import json
import os
import re
from dataclasses import dataclass
from typing import Any, Sequence, TypeAlias

import chz

from tinker_cookbook.renderers import (
    Message,
    Role,
)

Conversation: TypeAlias = list[Message]


@dataclass
class Rubric:
    """
    A rubric should specify 1) what counts as a good response, 2) how the grader
    language model should output the score, and 3) how to extract the score from
    the grader's response.
    """

    rubric_str: str
    extraction_regex: str = r"<score>(.*)</score>"
    grader_output_format_instruction: str = (
        "Please output your score between 0 and 1 wrapped in <score> ... </score>"
    )

    def _convert_role(self, role: Role) -> str:
        return "Human" if role in ("user", "system") else "Chatbot"

    def _flatten_convo(self, convo: Conversation) -> str:
        """
        Convert the whole conversation (user's turns + assistant's turns) into a
        single string, with turns separated by blank lines:

            Human: ...

            Chatbot: ...

            Human: ...

            Chatbot: ...
        """
        return "\n\n".join(
            [f"{self._convert_role(message['role'])}: {message['content']}" for message in convo]
        )

    def get_grader_prompt(self, convo: Conversation) -> Conversation:
        """
        Create a prompt for the grader to grade the conversation based on the rubric.
        The prompt separates the context (prior turns) from the completion (last
        assistant message) so the grader focuses on grading the most recent response.
        """
        # Separate context from the completion to grade
        context = convo[:-1]
        completion = convo[-1]

        lines = [
            "I will show you a conversation context, a chatbot completion to grade, and a rubric.",
            "Please grade the chatbot's completion based on the rubric.",
            "",
            "<context>",
            self._flatten_convo(context) if context else "(No prior context)",
            "</context>",
            "",
            "<completion_to_grade>",
            f"Chatbot: {completion['content']}",
            "</completion_to_grade>",
            "",
            "<rubric>",
            self.rubric_str,
            "</rubric>",
            "",
            f"Please grade the chatbot's completion based on the rubric. {self.grader_output_format_instruction}",
        ]
        return [
            {
                "role": "user",
                "content": "\n".join(lines),
            }
        ]

    def extract_score(self, response: str) -> float:
        match = re.search(self.extraction_regex, response, re.DOTALL)
        if match is not None:
            try:
                return float(match.group(1))
            except ValueError:
                print(f"Warning: Failed to extract score from grader response: {response}")
                return 0.0
        else:
            print(f"Warning: Failed to extract score from grader response: {response}")
            return 0.0

    def to_dict(self) -> dict[str, str]:
        return {
            "rubric_str": self.rubric_str,
            "extraction_regex": self.extraction_regex,
            "grader_output_format_instruction": self.grader_output_format_instruction,
        }

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @staticmethod
    def from_dict(d: dict[str, str]) -> "Rubric":
        return Rubric(
            rubric_str=d["rubric_str"],
            extraction_regex=d["extraction_regex"],
            grader_output_format_instruction=d["grader_output_format_instruction"],
        )

    @staticmethod
    def from_json(json_str: str) -> "Rubric":
        return Rubric.from_dict(json.loads(json_str))


@dataclass(frozen=True)
class RubricBasedDatapoint:
    """
    A rubric-based datapoint contains a conversation and a list of rubric items.
    In this task, the policy model sees the conversation and creates a response,
    and then the grader language model grades the response based on the rubric.
    """

    convo: Conversation
    rubric_items: Sequence[Rubric]

    def to_json(self) -> str:
        return json.dumps(
            {
                "convo": self.convo,
                "rubric_items": [rubric.to_dict() for rubric in self.rubric_items],
            }
        )

    @staticmethod
    def from_json(json_str: str) -> "RubricBasedDatapoint":
        d = json.loads(json_str)
        return RubricBasedDatapoint(
            convo=d["convo"],
            rubric_items=[Rubric.from_dict(rubric) for rubric in d["rubric_items"]],
        )


@chz.chz
class RubricDatapointListBuilder:
    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        """Load and return a sequence of rubric-based datapoints."""
        raise NotImplementedError("Subclass must implement this method")


@chz.chz
class RubricDatapointListBuilderFromJsonl(RubricDatapointListBuilder):
    jsonl_path: str

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        if not os.path.exists(self.jsonl_path):
            raise FileNotFoundError(
                f"Data file not found: {self.jsonl_path}\n"
                f"Please generate the example data first by running:\n"
                f"  python -m tinker_cookbook.recipes.rubric.generate_data"
            )
        datapoints = []
        with open(self.jsonl_path, "r") as f:
            for line in f:
                datapoints.append(RubricBasedDatapoint.from_json(line))
        return datapoints


@chz.chz
class PrometheusDatapointListBuilder(RubricDatapointListBuilder):
    data_path: str = "prometheus-eval/Feedback-Collection"

    def __call__(self) -> Sequence[RubricBasedDatapoint]:
        from datasets import load_dataset

        train_dataset = load_dataset(self.data_path)["train"]
        return [self.build_rubric_datapoint(item) for item in train_dataset]  # type: ignore

    def build_rubric_datapoint(self, item: dict[str, Any]) -> RubricBasedDatapoint:
        convo: Conversation = [
            {"role": "user", "content": item["orig_instruction"]},
        ]

        rubric_lines = [
            f"Your job is to evaluate the following: {item['orig_criteria']}. Your response should be a score between 1 and 5.",
            "Here is the calibration for each score:",
        ]
        for i in range(1, 6):
            rubric_lines.append(f"<score>{i}.0</score>: {item[f'orig_score{i}_description']}")
        rubric_lines.append(
            f"Here is a reference response that achieved a score of 5: {item['orig_reference_answer']}"
        )
        rubric_text = "\n".join(rubric_lines)

        rubric = Rubric(
            rubric_str=rubric_text,
            extraction_regex=r"<score>(.*)</score>",
            grader_output_format_instruction="Please output your score between 1 and 5 wrapped in <score> ... </score>",
        )

        return RubricBasedDatapoint(
            convo=convo,
            rubric_items=[rubric],
        )
78 changes: 78 additions & 0 deletions tinker_cookbook/recipes/rubric/debug_env.py
@@ -0,0 +1,78 @@
import asyncio

import tinker

from tinker_cookbook import model_info
from tinker_cookbook.completers import TinkerMessageCompleter, TinkerTokenCompleter
from tinker_cookbook.recipes.rubric.env import Rubric, RubricBasedDatapoint, RubricGradedEnv
from tinker_cookbook.renderers import get_renderer
from tinker_cookbook.rl.rollouts import do_single_rollout
from tinker_cookbook.tokenizer_utils import get_tokenizer


def get_addition_datapoint() -> RubricBasedDatapoint:
    datapoint = RubricBasedDatapoint(
        convo=[
            {"role": "user", "content": "What is 4 + 5?"},
            {"role": "assistant", "content": "9"},
            {"role": "user", "content": "What is 125 + 311?"},
        ],
        rubric_items=[
            Rubric(rubric_str="Does the chatbot correctly get the answer 436?"),
            Rubric(rubric_str="Does the chatbot provide an answer without saying anything else?"),
        ],
    )

    return datapoint


def get_prometheus_datapoint() -> RubricBasedDatapoint:
    from tinker_cookbook.recipes.rubric.data import PrometheusDatapointListBuilder

    datapoints = PrometheusDatapointListBuilder()()
    return datapoints[0]


async def main(datapoint: RubricBasedDatapoint):
    # Configuration parameters
    policy_name = "meta-llama/Llama-3.1-8B-Instruct"
    grader_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
    policy_max_tokens = 64
    grader_max_tokens = 64

    service_client = tinker.ServiceClient()
    # The policy samples at the token level; its renderer is passed to the env.
    policy = TinkerTokenCompleter(
        sampling_client=service_client.create_sampling_client(base_model=policy_name),
        max_tokens=policy_max_tokens,
    )
    policy_renderer = get_renderer(
        model_info.get_recommended_renderer_name(policy_name), get_tokenizer(policy_name)
    )
    # The grader works at the message level, so it carries its own renderer.
    grader = TinkerMessageCompleter(
        sampling_client=service_client.create_sampling_client(base_model=grader_name),
        renderer=get_renderer(
            model_info.get_recommended_renderer_name(grader_name), get_tokenizer(grader_name)
        ),
        max_tokens=grader_max_tokens,
    )

    env = RubricGradedEnv(
        renderer=policy_renderer,
        datapoint=datapoint,
        grader_llm=grader,
        debug=True,  # print the policy and grader inputs/outputs
    )

    await do_single_rollout(policy, env)


if __name__ == "__main__":
    # Switch to "prometheus" to debug a Feedback-Collection datapoint instead.
    dataset = "addition"

    if dataset == "addition":
        datapoint = get_addition_datapoint()
    elif dataset == "prometheus":
        datapoint = get_prometheus_datapoint()
    else:
        raise ValueError(f"Unknown dataset: {dataset}")

    asyncio.run(main(datapoint))