From 3c332dab58d187fb3a642ea3618c6bbf24c4fea0 Mon Sep 17 00:00:00 2001
From: sky-bro
Date: Wed, 3 Dec 2025 01:28:25 +0800
Subject: [PATCH] fix grpo target function

---
 chapters/en/chapter12/3b.mdx | 2 +-
 chapters/my/chapter12/3a.mdx | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/chapters/en/chapter12/3b.mdx b/chapters/en/chapter12/3b.mdx
index a849c0b9d..e4d16a275 100644
--- a/chapters/en/chapter12/3b.mdx
+++ b/chapters/en/chapter12/3b.mdx
@@ -84,7 +84,7 @@ The final step is to use these advantage values to update our model so that it b
 
 The target function for policy update is:
 
-$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
+$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
 
 This formula might look intimidating at first, but it's built from several components that each serve an important purpose. Let's break them down one by one.
 
diff --git a/chapters/my/chapter12/3a.mdx b/chapters/my/chapter12/3a.mdx
index 1a3512452..f295a82ea 100644
--- a/chapters/my/chapter12/3a.mdx
+++ b/chapters/my/chapter12/3a.mdx
@@ -84,7 +84,7 @@ $$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_
 
 policy update အတွက် target function က...
 
-$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
+$$J_{GRPO}(\theta) = \left[\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i, \text{clip}\left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right)\right]- \beta D_{KL}(\pi_{\theta} \|\| \pi_{ref})$$
 
 ဒီ formula က အစပိုင်းမှာ ကြောက်စရာကောင်းတယ်လို့ ထင်ရပေမယ့်၊ အရေးကြီးတဲ့ ရည်ရွယ်ချက်တစ်ခုစီကို ဆောင်ရွက်ပေးတဲ့ အစိတ်အပိုင်းများစွာနဲ့ တည်ဆောက်ထားတာပါ။ တစ်ခုချင်းစီကို ဖော်ပြပေးပါမယ်။
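
Note on the fix: the comma this patch adds separates the two arguments of $\min$ — the unclipped ratio-weighted advantage and its clipped counterpart. A minimal Python sketch of that per-output surrogate term, illustrative only (function and variable names are not from the course code):

```python
import math

def grpo_surrogate_term(logp_new, logp_old, advantage, eps=0.2):
    """One term inside the min(...) of J_GRPO for a single sampled output o_i.

    logp_new / logp_old are log-probabilities of o_i under the current and
    old policies; advantage is the group-normalized A_i.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_theta(o_i|q) / pi_theta_old(o_i|q)
    unclipped = ratio * advantage          # first argument of min
    clipped = min(max(ratio, 1 - eps), 1 + eps) * advantage  # second argument
    return min(unclipped, clipped)         # the pessimistic (clipped) choice
```

With the comma missing, the two arguments would be read as a single product, which is not the clipped-surrogate objective the chapter describes.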