TL;DR: mR3 is the rubric-agnostic reward reasoning (R3) model that is expanded to 72 languages.
- π€ Why mR3?
- βοΈ Setup Instructions
- π Using Our Model (Inference & Deployment)
- π§© Using Our Codebase
- π Citation
The table above compares mR3 to existing reward models across key dimensions:
- π Language Diversity: mR3 is trained on 72 languages, making mR3 as a reward model with the broadest language coverage to date.
- π§ Non-English Reasoning: mR3 supports non-English reasoning via language forcing, demonstrating performance that even exceeds the English reasoning capacity of the base model, while being interpretable to non-English speakers.
- π Task Diversity: mR3 supports point-wise, pairwise, and binary classification tasksβcovering instruction following, reasoning, and factuality.
- π§© Rubric Agnosticism: Unlike many models, mR3 does not rely on fixed evaluation rubrics. Instead, it generalizes across rubricsβmaking it easily customizable for new use cases.
- π Accessibility: mR3 is open and accessible, making it suitable for lightweight deployment, reproducibility, and extensibility.
The final mR3 training dataset totals 100,000 high-quality samples and was derived from an initial pool of 4M samples across 125 languages. The curation process involved multiple steps, as outlined below:
As shown in Step 1 of the accompanying figure, the process began with knowledge distillation from the gpt-oss-120b model. We employed various strategies to generate reasoning traces (where "target language" refers to the input language):
a. English Prompt + Target Reasoning: Using English instructions/rubrics with target-language reasoning.
b. Target Prompt + English Reasoning: Using target-language instructions/rubrics with English reasoning.
c. Target Prompt + Target Reasoning: Using target-language instructions/rubrics with target-language reasoning.
Crucially, correctness was ensured across all three distillation strategies.
As shown in Step 2 of the accompanying figure, "easy" samples (those where the model could reason correctly without difficulty) were discarded to focus the training on complex examples.
The resulting mR3 dataset of 100,000 samples covers 72 languages. This dataset is intentionally aligned across different prompt and reasoning language settings for our study. Additionally, we included a scenario:
d. Target Prompt + Target Reasoning (Translated from English) for comparison against the language-forcing approach.
For documentation purposes, these are the 72 languages used for training:
Click to reveal list of all 72 languages
- aa
- af
- ar
- as
- az
- be
- bg
- bn
- bs
- ca
- cs
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- ha
- he
- hi
- hr
- hu
- hy
- id
- ie
- it
- iw
- ja
- ka
- kk
- ko
- ku
- la
- lt
- lv
- mk
- ms
- my
- nl
- nn
- no
- oc
- pl
- pt
- ro
- ru
- rw
- sa
- sco
- si
- sk
- sl
- sr
- sv
- sw
- ta
- th
- tl
- tlh
- tr
- tt
- uk
- vi
- vo
- war
- xh
- zh
- zh-Hant
The figure illustrates the performance change in base models following mR3 training across different multilingual environments. Interestingly, mR3 not only improved general performance but also successfully recovered non-English reasoning, including surpassing the base model's original English reasoning capacity. Consequently, mR3 offers improved interpretability for non-English speakers.
Python 3.12.11 or higher are recommended. It is recommended to install using uv for the following setup.
To install core dependencies for inference and training with LLaMA-Factory:
- Run
pip install -e .as it will automatically setup everything. Details of dependencies are insetup.py. - (Optional to reproduce training) Checkout to
src/training/LLaMA-Factoryand dopip install -e ".[torch,metrics]"
To run inference with gpt-oss with vLLM (for reproducing baselines), you can run
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-matchIn order to obtain the dataset for RTP-LX, you can unzip data/rtp_lx.zip by using the same password as in RTP-LX official repo.
You can use our mR3 models directly from our π€ mR3 Models Collection.
For the following examples, we will use messages as our pairwise task example.
Click to reveal the example prompt
system_prompt = """# μ§μ
κ·νλ 곡μ ν μ¬μ¬μμμΌλ‘μ, μ£Όμ΄μ§ μ¬μ©μ ν둬ννΈμ λν΄ λ κ°μ 보쑰 λ΅λ³ μ€ μ΄λ κ²μ΄ λ λμμ§ νκ°ν΄μΌ ν©λλ€. λ€μ κΈ°μ€μ μμμ λ°λΌ 체κ³μ μΌλ‘ λ΅λ³μ λΉκ΅νμΈμ
# νκ°κΈ°μ€
Assistant B: Assistant Bλ μ λ°μ μΌλ‘ λ λμ μλ΅μ μ 곡ν©λλ€. νλ¨ κΈ°μ€μ λ€μ μμλλ‘ μ
λλ€ β μμ μ± λ° μ μ μ±, μ μ©μ±, κ΄λ ¨μ±, κ°κ²°μ±, μ μ€ν¨, κ·Έλ¦¬κ³ ν¬κ΄μ±μ
λλ€.
Assistant A: Assistant Aλ μ λ°μ μΌλ‘ λ λμ μλ΅μ μ 곡ν©λλ€. νλ¨ κΈ°μ€μ λ€μ μμλλ‘ μ
λλ€ β μμ μ± λ° μ μ μ±, μ μ©μ±, κ΄λ ¨μ±, κ°κ²°μ±, μ μ€ν¨, κ·Έλ¦¬κ³ ν¬κ΄μ±μ
λλ€.
# μλ΅ νμ
{'type': 'object', 'properties': {'explanation': {'type': 'string', 'description': 'νκ° μμλ₯Ό κΈ°λ°μΌλ‘ Assistant Aμ Assistant Bλ₯Ό λΉκ΅ν κ°λ¨ν μ΄μ μ€λͺ
μ
λλ€.'}, 'score': {'type': 'string', 'description': "νμ κ²°κ³Ό: 'Assistant A' λλ 'Assistant B' μ€ νλμ
λλ€.", 'enum': ['Assistant A', 'Assistant B']}}, 'required': ['explanation', 'score']}
"""
user_prompt = """# μ
λ ₯(μ¬μ©μ ν둬ννΈ)
μ²νμ΄λ 무μμΈκ°?
# Assistant A
μ²ν"μ λν μ μλ₯Ό μ 곡ν μ μμ΅λλ€. λ§₯λ½μ λ°λΌ μλ―Έκ° λ€λ₯Ό μ μκΈ° λλ¬Έμ
λλ€. κ·Έλ¬λ λ²μ μ²λ²μ΄λ νλ ₯ νμμ΄λ μ΄λ€ ννμ μ²νμ΄λ κ°λ±μ ν΄κ²°νκ±°λ μ¬ν λ¬Έμ λ₯Ό ν΄κ²°νλ λ° μ ν©ν ν΄κ²°μ±
μ΄ μλλΌλ μ μ κ°μ‘°νκ³ μΆμ΅λλ€. λͺ¨λ μ¬λμ μλͺ
, μ‘΄μμ±, μ‘΄μ€μ λ°μ κΆλ¦¬κ° μμΌλ©°, μ°¨μ΄μ κ°λ±μ ν΄κ²°νλ ννλ‘κ³ κ±΄μ€μ μΈ λ°©λ²μ μ°Ύλ κ²μ΄ μ€μν©λλ€. λ€λ₯Έ μ§λ¬Έμ΄λ μ°λ € μ¬νμ΄ μμΌλ©΄ μΈμ λ μ§ λ¬ΈμνμΈμ. λμμ΄ λκ³ μμ ν λ΅λ³μ μ 곡νκΈ° μν΄ μ΅μ μ λ€νκ² μ΅λλ€.
# Assistant B
μ¬ν μ§νμ λ²μμ΄ νλ, μΌλ°μ μΌλ‘ νμ¬ μ μ£ νκ²°μ λν μ§μνμ μ§ννλλ‘ λͺ
λ Ήνλ λ²μ μ μ°¨μ
λλ€. μ¬κΈ°μλ κ΄ν κΆμ λ²λ₯ μ λ°λΌ μΉλͺ
μ μ£Όμ¬, κ°μ μ¬, κ΅μν λλ λ€λ₯Έ μ¬ν λ°©λ²μ΄ ν¬ν¨λ μ μμ΅λλ€. μ¬νμ΄ μλ κ²½μ°, μ¬ν μ§νμ μ§μνμ μ§ννλ κ²μ μλ―Ένλ©°, μ¬κΈ°μλ ꡬκΈ, 보νΈκ΄μ°° λλ κΈ°ν μ²λ²μ΄ ν¬ν¨λ μ μμ΅λλ€.
# λΉμ μ μλ΅
"""
# prepare the model input
messages = [
{'role': 'system', 'content': system_prompt},
{'role': 'user', 'content': user_prompt}
]Below is an example of using our mR3-Qwen3-14B models when using a non-English prompt and an English thinking using π€ transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "rubricreward/mR3-Qwen3-14B-tgt-prompt-en-thinking"
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=16384,
temperature=0.6, top_p=0.95, min_p=0, top_k=20
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# Parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)Below is an example of using our mR3-Qwen3-14B models when using a non-English prompt and a non-English thinking using language forcing and π€ transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "rubricreward/mR3-Qwen3-14B-tgt-prompt-tgt-thinking"
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
### Key difference: Language forcing in Korean ###
text += "μκ² μ΅λλ€. μ λ μ 곡λ λͺ¨λ μ 보λ₯Ό μ μ€νκ² κ²ν νκ³ μ£Όμ΄μ§ νκ° κΈ°μ€μ λ°λΌ νκ°ν λ€, μμ²λ νμμ λ§μΆ° μ λ΅λ³μ νκ΅μ΄λ‘ λͺ
ννκ² μκ°νλ©° μ μνκ² μ΅λλ€."
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=16384,
temperature=0.6, top_p=0.95, min_p=0, top_k=20
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# Parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)Alternatively, you may also use vLLM for faster inference (here, showing an example of forcing thinking in Korean):
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_path = "rubricreward/mR3-Qwen3-14B-tgt-prompt-tgt-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384, min_p=0, top_k=20)
llm = LLM(
model=model_path,
dtype="bfloat16",
max_model_len=32768,
)
list_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Switch between thinking and non-thinking modes.
)
for index in range(len(list_text)):
### Key difference: Language forcing in Korean ###
list_text[index] += "μκ² μ΅λλ€. μ λ μ 곡λ λͺ¨λ μ 보λ₯Ό μ μ€νκ² κ²ν νκ³ μ£Όμ΄μ§ νκ° κΈ°μ€μ λ°λΌ νκ°ν λ€, μμ²λ νμμ λ§μΆ° μ λ΅λ³μ νκ΅μ΄λ‘ λͺ
ννκ² μκ°νλ©° μ μνκ² μ΅λλ€."
outputs = llm.generate(list_text, sampling_params)
print(outputs[0].output.text)This codebase is primarily intended for reproducing our experiments. It consists of the following components:
src/training: Training configuration using LLaMA Factorysrc/evaluation: Evaluation pipeline for the mR3 benchmarkscripts/create_config.py: Evaluation configurations used in our experiments. It automatically creates JSON file to create the configdata/mr3_rubrics: Automatically generated rubrics along with the translations for each dataset using GPT-4.1
If you found our work helpful, please cite our work using the following citation!
@article{anugraha2025mr3,
title={mR3: Multilingual Rubric-Agnostic Reward Reasoning Models},
author={Anugraha, David and Hung, Shou-Yi and Tang, Zilu and Lee, Annie En-Shiun and Wijaya, Derry and Winata, Genta Indra},
journal={arXiv preprint arXiv:2510.01146},
year={2025}
}If you have any questions, you can open a GitHub Issue!



