
Conversation

@Dahoas Dahoas (Collaborator) commented Jul 18, 2023

No description provided.

Collaborator commented on the diff:

I think this example got here by inertia from the previous PR

else:
scores = all_scores[0].clone().detach()
# Best-of-N Sampling.
scores_mask = scores != -1
Collaborator commented on the diff:

I think we need to merge in the changes from your last PR.

@maxreciprocate maxreciprocate (Collaborator) left a comment

Looks slick!

(One thing I've noticed is that while the score increases much faster on the training set as num_return_sequences increases, this doesn't necessarily yield a better score on the test set. Do you perhaps have an example or parameter setting where it does?)

[Screenshots: 2023-08-11 at 15:30:46 and 15:35:19]

self.push_to_store(ppo_rl_elements)

@staticmethod
def get_topk_indices(input_tensor, window_size: int, k: int, device):
Collaborator commented on the diff:

Nit: maybe a docstring should be added specifying that this isn't the same as a regular topk but rather a topk over each window of size window_size.
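For context, a windowed top-k of the kind described in the nit could look like the minimal sketch below. Note this is an illustration of the idea, not the PR's actual `get_topk_indices`; the name `windowed_topk_indices` and the assumption of non-overlapping windows are hypothetical.

```python
import torch

def windowed_topk_indices(input_tensor: torch.Tensor, window_size: int, k: int) -> torch.Tensor:
    """Return flat indices of the top-k values within each non-overlapping window.

    Unlike torch.topk over the whole tensor, this selects k entries per
    window_size-sized chunk, yielding (len(input_tensor) // window_size) * k
    indices in total. Assumes len(input_tensor) is divisible by window_size.
    """
    windows = input_tensor.reshape(-1, window_size)        # (num_windows, window_size)
    _, local_idx = torch.topk(windows, k, dim=1)           # top-k inside each window
    offsets = torch.arange(windows.shape[0]).unsqueeze(1) * window_size
    return (local_idx + offsets).reshape(-1)               # map back to flat indices
```

A docstring like the one above would make the window-wise behavior explicit at the call site.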

@Dahoas Dahoas (Collaborator, Author) commented Aug 21, 2023

> Looks slick!
>
> (One thing I've noticed is that while the score increases much faster on the training set as num_return_sequences increases, this doesn't necessarily yield a better score on the test set. Do you perhaps have an example or parameter setting where it does?)
>
> [Screenshots: 2023-08-11 at 15:30:46 and 15:35:19]

Good point, the benefit of BoN training seems to be problem-dependent. I've seen the most benefit when training on problems where the model has a low pass@1 score.
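For readers unfamiliar with the technique being discussed: best-of-N (BoN) sampling scores several completions per prompt and keeps the highest-scoring one. The sketch below illustrates only the selection step; the name `best_of_n_indices` and the convention that consecutive blocks of `num_return_sequences` scores belong to the same prompt are assumptions for illustration, not the PR's implementation.

```python
import torch

def best_of_n_indices(scores: torch.Tensor, num_return_sequences: int) -> torch.Tensor:
    """Pick the best-scoring completion out of each group of N samples.

    scores: flat tensor of reward scores, where every consecutive block of
    num_return_sequences entries comes from the same prompt.
    Returns the flat indices of the winning completion for each prompt.
    """
    grouped = scores.reshape(-1, num_return_sequences)     # (num_prompts, N)
    best_local = grouped.argmax(dim=1)                     # best sample per prompt
    offsets = torch.arange(grouped.shape[0]) * num_return_sequences
    return best_local + offsets                            # back to flat indices
```

Training then proceeds only on the selected completions, which is why the gains are largest when pass@1 is low: sampling N times raises the chance that at least one completion earns a high reward.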

@Dahoas Dahoas (Collaborator, Author) commented Aug 28, 2023

@maxreciprocate If you're happy with this do you want to merge today?

@maxreciprocate maxreciprocate (Collaborator) commented Sep 1, 2023

@Dahoas There are some run differences when using the default config without BoN sampling, most notably for the randomwalks case:
https://wandb.ai/sorry/trlx-references/reports/BoN-v-main--Vmlldzo1MjkwMzA5
Probably some minor implementation detail; I have to recheck.

@Dahoas Dahoas (Collaborator, Author) commented Sep 4, 2023

> @Dahoas There are some run differences when using the default config without BoN sampling, most notably for the randomwalks case: https://wandb.ai/sorry/trlx-references/reports/BoN-v-main--Vmlldzo1MjkwMzA5 Probably some minor implementation detail; I have to recheck.

Let me look into why.

@maxreciprocate maxreciprocate (Collaborator) commented
@Dahoas Not sure if that's the issue, however; see: https://wandb.ai/sorry/trlx/reports/Difference-due-to-the-change-in-base_trainer-decode--Vmlldzo1MzE2OTg4 (+ some non-determinism)
