@mtrofin mtrofin commented Dec 18, 2025

This is the initial step. It still uses the checked-in gin files, and it doesn't yet pin a git hash for this repo (the latter because we need at least this commit to land first). Also, pre-training (warmstart) isn't implemented yet.

see also this RFC

Issue #540

@mtrofin mtrofin marked this pull request as ready for review December 18, 2025 14:54

mtrofin commented Dec 18, 2025

@DataCorrupted @ioghiban


@boomanaiden154 boomanaiden154 left a comment

LGTM.

I'm assuming the plan is to write a README at some point on how to run everything locally?

Do we have performance numbers yet for how well ES training works on this dataset and what sort of performance people should expect?

cmake -B /work/llvm-corpus \
-GNinja \
-DCMAKE_EXPORT_COMPILE_COMMANDS=On \
-DCMAKE_BUILD_TYPE=Release \
Collaborator

Should this be MinSizeRel to ensure we're compiling the modules with -Os or -Oz (I forget which one it is)?

Contributor

I agree that it should be "MinSizeRel" if you are training for inlining.
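
For reference, a minimal sketch of the configure step with the build type the reviewers suggest. It reuses only the flags visible in the hunk above; `LLVM_SRC` and its placeholder path are hypothetical, since the source directory is not shown in the hunk. On the `-Os` vs `-Oz` question: with GCC and Clang, CMake's default `MinSizeRel` flags are `-Os -DNDEBUG`, so `-Oz` would presumably have to be requested explicitly (for example through `CMAKE_C_FLAGS_MINSIZEREL` and `CMAKE_CXX_FLAGS_MINSIZEREL`). The function prints the command rather than running it.

```shell
# Sketch only, not the script from this PR.
# LLVM_SRC is a hypothetical variable: the source directory is not visible in the hunk.
LLVM_SRC="${LLVM_SRC:-/path/to/llvm-project/llvm}"

# Print the configure command instead of running it, so it can be inspected first.
configure_cmd() {
  printf '%s\n' "cmake -S ${LLVM_SRC} -B /work/llvm-corpus -GNinja -DCMAKE_EXPORT_COMPILE_COMMANDS=On -DCMAKE_BUILD_TYPE=MinSizeRel"
}

configure_cmd
```

To actually configure, the printed line can be pasted into a shell once `LLVM_SRC` points at a real checkout.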

@petrhosek
Collaborator

What's the reason for breaking each step into its own script (which is basically a one-liner)?

mtrofin commented Jan 5, 2026

> What's the reason for breaking each step into its own script (which is basically a one-liner)?

It lets someone skip or redo individual steps. If it's all in one file, that file tends to evolve into a monolithic unit, with variables and reusable things and whatnot, and in this case I want to avoid that.

mtrofin commented Jan 5, 2026

> LGTM.
>
> I'm assuming the plan is to write a README at some point on how to run everything locally?

Eventually.

> Do we have performance numbers yet for how well ES training works on this dataset and what sort of performance people should expect?

No, and we don't yet have its own gin files either.

@mtrofin mtrofin merged commit 1fdf038 into google:main Jan 5, 2026
12 checks passed
4 participants