Integrating Ray for Hyperparameter Tuning #10
This PR creates a child class of `CLTTrainer` called `CLTTrainerRay` that enables training the model using Ray. It primarily modifies the `train()` function to report metrics to Ray and uses Ray's approach to checkpointing. It also adds a new script in the root dir, `tune_clt_local_ray.py`, that mirrors `train_clt_local.py` but runs `CLTTrainerRay` inside a function that a Ray Tuner can call to run training for different hyperparameter combinations in parallel.
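For context, the pattern `CLTTrainerRay` follows looks roughly like the sketch below, written against Ray's function-trainable reporting API. The parent's import path and the attribute/helper names (`training_steps`, `checkpoint_interval`, `_compute_step`, `model`) are assumptions for illustration, not the exact code in this PR:

```python
# A minimal sketch of the reporting/checkpointing pattern described above.
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint

from clt.trainer import CLTTrainer  # assumed import path for the parent class


class CLTTrainerRay(CLTTrainer):
    def train(self):
        for step in range(self.training_steps):
            loss = self._compute_step(step)  # assumed per-step helper
            metrics = {"loss": float(loss), "step": step}

            if step % self.checkpoint_interval == 0:
                # Ray-style checkpointing: write to a temp dir and hand it to
                # Ray, which copies it into the trial's working dir. (The PR
                # also checkpoints the activation store alongside the model.)
                with tempfile.TemporaryDirectory() as tmpdir:
                    torch.save(self.model.state_dict(),
                               os.path.join(tmpdir, "model.pt"))
                    train.report(metrics,
                                 checkpoint=Checkpoint.from_directory(tmpdir))
            else:
                train.report(metrics)
```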
When `tune_clt_local_ray.py` is run, it will create a working dir in the specified `args.output_dir`. Within it, Ray will create a working dir for each hyperparameter combination (a "trial" in Ray terminology), named after that trial's hyperparameters. In the trial working dir, checkpoints of the model and the activation store at that training step will be saved according to `args.checkpoint_interval`, and the reported metrics for each trial can be viewed in `progress.csv` and `result.json`. See the photo below.
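The tuning script's wiring might look roughly like the following; the constructor signature and Tuner options are assumptions, while `args.output_dir` and the `hps` dict are the names described above (an example `hps` appears in the notes below):

```python
# A rough sketch of handing the trainer to a Ray Tuner; not the exact code
# in tune_clt_local_ray.py.
import os

from ray import train, tune


def train_clt(config):
    # Called once per trial; `config` holds one hyperparameter combination.
    trainer = CLTTrainerRay(**config)  # assumed constructor signature
    trainer.train()  # reports metrics and checkpoints to Ray each step


tuner = tune.Tuner(
    train_clt,
    param_space=hps,
    # storage_path is the working dir under which Ray creates one
    # subdirectory per trial, named after that trial's hyperparameters.
    run_config=train.RunConfig(storage_path=os.path.abspath(args.output_dir)),
)
results = tuner.fit()
```

With this wiring, Ray produces the layout described above: one subdirectory per trial under `storage_path`, containing that trial's checkpoints plus `progress.csv` and `result.json`.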
Some notes:

- the `hps` dict in `tune_clt_local_ray.py` must be manually edited (an illustrative example follows these notes)
- `CLTTrainerRay`
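As a hedged example of the kind of manual edit the first note refers to, the `hps` search space could look like this (keys and ranges are illustrative assumptions, not the actual contents of the dict in this PR):

```python
from ray import tune

hps = {
    "learning_rate": tune.grid_search([1e-4, 3e-4, 1e-3]),  # assumed key
    "train_batch_size": tune.choice([512, 1024]),           # assumed key
}
```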
This code was tested by first generating activations for 1000 tokens from the `monology/pile-uncopyrighted` dataset for the `EleutherAI/pythia-70m` model. Then, the script was run as follows:

```
python tune_clt_local_ray.py --activation-path ./tutorial_activations_local_1k_pythia/EleutherAI/pythia-70m/pile-uncopyrighted_train --model-name EleutherAI/pythia-70m --num-features 2048 --training-steps 5 --n-workers 2
```

To visualize results, you can use TensorBoard like so:
```
tensorboard --logdir ./clt_train_local_1746647195
```