From 3c332dab58d187fb3a642ea3618c6bbf24c4fea0 Mon Sep 17 00:00:00 2001
From: sky-bro
Date: Wed, 3 Dec 2025 01:28:25 +0800
Subject: [PATCH] fix grpo target function

---
 chapters/en/chapter12/3b.mdx | 2 +-
 chapters/my/chapter12/3a.mdx | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/chapters/en/chapter12/3b.mdx b/chapters/en/chapter12/3b.mdx
index a849c0b9d..e4d16a275 100644
--- a/chapters/en/chapter12/3b.mdx
+++ b/chapters/en/chapter12/3b.mdx
@@ -84,7 +84,7 @@ The final step is to use these advantage values to update our model so that it b
 
 The target function for policy update is:
 
-$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
+$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
 
 This formula might look intimidating at first, but it's built from several components that each serve an important purpose. Let's break them down one by one.
 
diff --git a/chapters/my/chapter12/3a.mdx b/chapters/my/chapter12/3a.mdx
index 1a3512452..f295a82ea 100644
--- a/chapters/my/chapter12/3a.mdx
+++ b/chapters/my/chapter12/3a.mdx
@@ -84,7 +84,7 @@ $$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_
 
 policy update အတွက် target function က...
 
-$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
+$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
 
 ဒီ formula က အစပိုင်းမှာ ကြောက်စရာကောင်းတယ်လို့ ထင်ရပေမယ့်၊ အရေးကြီးတဲ့ ရည်ရွယ်ချက်တစ်ခုစီကို ဆောင်ရွက်ပေးတဲ့ အစိတ်အပိုင်းများစွာနဲ့ တည်ဆောက်ထားတာပါ။ တစ်ခုချင်းစီကို ဖော်ပြပေးပါမယ်။
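
Note on the fix: the comma this patch adds separates the two arguments of $\min$ — the unclipped ratio-weighted advantage and its clipped counterpart. A minimal Python sketch of that per-output surrogate term, illustrative only (function and variable names are not from the course code):

```python
import math

def grpo_surrogate_term(logp_new, logp_old, advantage, eps=0.2):
    """One term inside the min(...) of J_GRPO for a single sampled output o_i.

    logp_new / logp_old are log-probabilities of o_i under the current and
    old policies; advantage is the group-normalized A_i.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_theta(o_i|q) / pi_theta_old(o_i|q)
    unclipped = ratio * advantage          # first argument of min
    clipped = min(max(ratio, 1 - eps), 1 + eps) * advantage  # second argument
    return min(unclipped, clipped)         # the pessimistic (clipped) choice
```

With the comma missing, the two arguments would be read as a single product, which is not the clipped-surrogate objective the chapter describes.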