Train with bin files #11

Closed
ayeganov wants to merge 10 commits into main from feat/train_with_bin_files

@ayeganov
Contributor

@ayeganov commented Sep 1, 2025

This PR adds two optimizations:

  1. Read the data only once.
  2. Once the data is tokenized, save a cached version in the experiments folder and reuse it on later runs.
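
A minimal sketch of the caching idea in the list above, not the PR's actual code. The function name `load_or_build_token_cache`, the `tokenize` callable, and the cache file naming are all hypothetical, and the standard-library `array` type stands in for whatever array format the project really uses.

```python
from array import array
from pathlib import Path
from typing import Callable, Iterable


def load_or_build_token_cache(
    text_path: Path,
    cache_dir: Path,
    tokenize: Callable[[str], Iterable[int]],
) -> array:
    """Return token ids for text_path, tokenizing only when no cache exists.

    Hypothetical helper: the names and the cache layout are assumptions.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / f"{text_path.stem}.bin"

    # Unsigned short items (16-bit on common platforms); assumes ids < 65536.
    tokens = array("H")
    if cache_path.exists():
        # Cache hit: skip reading and tokenizing the raw text entirely.
        tokens.frombytes(cache_path.read_bytes())
        return tokens

    # Cache miss: read the raw text once, tokenize it, and persist the ids.
    tokens.extend(tokenize(text_path.read_text(encoding="utf-8")))
    cache_path.write_bytes(tokens.tobytes())
    return tokens
```

Calling the function twice with the same paths tokenizes only on the first call; the second call reads the `.bin` file back. A real implementation would probably also need to invalidate the cache when the source text or the tokenizer changes, which this sketch does not attempt.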

@ayeganov self-assigned this Sep 1, 2025
@ayeganov added the enhancement (New feature or request) label Sep 1, 2025
@ayeganov changed the base branch from main to feat/large_file_handling September 3, 2025 01:29

default="cuda",
choices=["cuda", "cpu"],
)
parser.add_argument(
Contributor

I think the split should go in the config

Contributor Author

Done.

"--dtype",
type=str,
default=None,
help="NumPy dtype for pre-tokenized .bin files (e.g., 'uint16'). Required if using a .bin file.",
Contributor

How does one know the NumPy dtype?
If using the cached .bin file is favorable, it should be the default behavior, with a new one generated if it's not present.

Contributor Author

It already works the way you expect. The preprocess-data function picks the right dtype based on the tokenizer. If you start the initial training from a .bin file, though, it can't know the dtype, so the user has to say what it is.
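
A rough sketch of the behavior described in this reply, under stated assumptions. The helper names `dtype_for_vocab` and `resolve_dtype` are hypothetical, and choosing the dtype from the tokenizer's vocabulary size is a guess at what "based on the tokenizer" means here, not something the thread confirms.

```python
from typing import Optional


def dtype_for_vocab(vocab_size: int) -> str:
    """Name of the smallest unsigned NumPy dtype that holds ids 0..vocab_size-1."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    if vocab_size <= 2**16:
        return "uint16"
    if vocab_size <= 2**32:
        return "uint32"
    return "uint64"


def resolve_dtype(
    data_path: str,
    cli_dtype: Optional[str],
    vocab_size: Optional[int],
) -> str:
    """Decide which dtype to use for token ids (hypothetical helper)."""
    if data_path.endswith(".bin"):
        # A raw .bin file carries no dtype metadata, so the user must supply it.
        if cli_dtype is None:
            raise ValueError("--dtype is required when training from a .bin file")
        return cli_dtype
    # Raw text input: derive the dtype from the tokenizer's vocabulary size.
    if vocab_size is None:
        raise ValueError("vocab_size is required when tokenizing raw text")
    return dtype_for_vocab(vocab_size)
```

For example, a GPT-2 style vocabulary of 50257 entries fits in `uint16`, while any vocabulary larger than 65536 entries needs `uint32`.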

@dariocazzani
Contributor

If we use a config, I generally try to avoid argparse "settings" as much as possible.
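
One way to read this suggestion, sketched under assumptions: settings live in a config file and argparse only points at that file. The `DataConfig` class, its field names, the JSON format, and reading "the split" as a train/validation fraction are all hypothetical and not taken from the repository.

```python
import argparse
import json
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DataConfig:
    # All field names are hypothetical, for illustration only.
    data_path: str
    val_split: float = 0.1           # assumed meaning of "the split"
    bin_dtype: Optional[str] = None  # only needed for pre-tokenized .bin input


def load_config(path: str) -> DataConfig:
    """Load settings from a JSON file (the file format is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return DataConfig(**json.load(f))


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Keep the command line minimal: it only locates the config file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to a JSON config file.")
    return parser.parse_args(argv)
```

With this layout, adding a new setting means adding a field to the config rather than another `parser.add_argument` call.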

Base automatically changed from feat/large_file_handling to main September 4, 2025 13:52
@ayeganov
Contributor Author

ayeganov commented Sep 5, 2025

These changes have been subsumed by #12.

@ayeganov closed this Sep 5, 2025
@ayeganov deleted the feat/train_with_bin_files branch September 5, 2025 18:03