Integrating Ray for Hyperparameter Tuning #10
This PR creates a child class of `CLTTrainer` called `CLTTrainerRay` that enables training the model using Ray. It primarily modifies the `train()` function to report metrics to Ray and uses Ray's approach to checkpointing. It also adds a new script in the root dir, `tune_clt_local_ray.py`, that mirrors `train_clt_local.py` but runs `CLTTrainerRay` inside a function that a Ray Tuner can call to run training for different hyperparameter combinations in parallel.
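For context, the pattern `CLTTrainerRay` follows looks roughly like the sketch below, written against Ray's function-trainable reporting API. The parent's import path and the attribute/helper names (`training_steps`, `checkpoint_interval`, `_compute_step`, `model`) are assumptions for illustration, not the exact code in this PR:

```python
# A minimal sketch of the reporting/checkpointing pattern described above.
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint

from clt.trainer import CLTTrainer  # assumed import path for the parent class


class CLTTrainerRay(CLTTrainer):
    def train(self):
        for step in range(self.training_steps):
            loss = self._compute_step(step)  # assumed per-step helper
            metrics = {"loss": float(loss), "step": step}

            if step % self.checkpoint_interval == 0:
                # Ray-style checkpointing: write to a temp dir and hand it to
                # Ray, which copies it into the trial's working dir. (The PR
                # also checkpoints the activation store alongside the model.)
                with tempfile.TemporaryDirectory() as tmpdir:
                    torch.save(self.model.state_dict(),
                               os.path.join(tmpdir, "model.pt"))
                    train.report(metrics,
                                 checkpoint=Checkpoint.from_directory(tmpdir))
            else:
                train.report(metrics)
```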
When `tune_clt_local_ray.py` is run, it will create a working dir in the specified `args.output_dir`. Within it, Ray will create a working dir for each hyperparameter combination (a "trial" in Ray terminology), named after that trial's hyperparameters. In the trial working dir, checkpoints of the model and the activation store at that training step will be saved according to `args.checkpoint_interval`, and the reported metrics for each trial can be viewed in `progress.csv` and `result.json`. See the photo below.
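The tuning script's wiring might look roughly like the following; the constructor signature and Tuner options are assumptions, while `args.output_dir` and the `hps` dict are the names described above (an example `hps` appears in the notes below):

```python
# A rough sketch of handing the trainer to a Ray Tuner; not the exact code
# in tune_clt_local_ray.py.
import os

from ray import train, tune


def train_clt(config):
    # Called once per trial; `config` holds one hyperparameter combination.
    trainer = CLTTrainerRay(**config)  # assumed constructor signature
    trainer.train()  # reports metrics and checkpoints to Ray each step


tuner = tune.Tuner(
    train_clt,
    param_space=hps,
    # storage_path is the working dir under which Ray creates one
    # subdirectory per trial, named after that trial's hyperparameters.
    run_config=train.RunConfig(storage_path=os.path.abspath(args.output_dir)),
)
results = tuner.fit()
```

With this wiring, Ray produces the layout described above: one subdirectory per trial under `storage_path`, containing that trial's checkpoints plus `progress.csv` and `result.json`.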
Some notes:

- the `hps` dict in `tune_clt_local_ray.py` must be manually edited (an illustrative example follows these notes)
- `CLTTrainerRay`
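As a hedged example of the kind of manual edit the first note refers to, the `hps` search space could look like this (keys and ranges are illustrative assumptions, not the actual contents of the dict in this PR):

```python
from ray import tune

hps = {
    "learning_rate": tune.grid_search([1e-4, 3e-4, 1e-3]),  # assumed key
    "train_batch_size": tune.choice([512, 1024]),           # assumed key
}
```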
This code was tested by first generating activations for 1000 tokens from the `monology/pile-uncopyrighted` dataset for the `EleutherAI/pythia-70m` model. Then, the script was run as follows:

```
python tune_clt_local_ray.py --activation-path ./tutorial_activations_local_1k_pythia/EleutherAI/pythia-70m/pile-uncopyrighted_train --model-name EleutherAI/pythia-70m --num-features 2048 --training-steps 5 --n-workers 2
```

To visualize results, you can use TensorBoard like so:
```
tensorboard --logdir ./clt_train_local_1746647195
```