-
Notifications
You must be signed in to change notification settings - Fork 344
Optimize Rust cache configuration for improved CI performance (Phase 1) #2765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: canary
Are you sure you want to change the base?
Conversation
Implement Phase 1 cache optimizations: - Use shared cache key based on Cargo.lock hash for better reusability - Enable cache-workspace-crates to cache compiled workspace crates - Restrict cache saves to main/canary branches to reduce cache pollution - Update cache prefix to v1-rust-shared to avoid conflicts with old caches Expected improvements: - Cache hit rate: 5% -> 70-80% - Build time reduction: 30-40% on cache hits - Better cache sharing across jobs with same dependencies
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
9df1c3e to
6807fb9
Compare
WASM builds with --no-default-features and cross-compilation to wasm32-unknown-unknown don't benefit from workspace crate caching. Testing showed 42% performance degradation with caching enabled. This change disables cache-workspace-crates for WASM while keeping dependency caching via the shared Cargo.lock-based key.
03d1174 to
75c25b6
Compare
|
Thanks! The improved clippy/linux-test numbers are interesting, but I'm concerned that this is a red herring improvement. Two sets of feedback, one for the change itself and one for the workflow. Change itself adds prefix-key: would prefer something specific to baml_language. that means i can debug it in the GHA UI and also disambiguate it from the rust cache for shared-key: why do we need to set this explicitly? I see the solution described in the PR summary. cache-workspace-crates: this seems fine save-if: I suspect this should be removed. There's subtleties around the gha cache and how it works (specifically the cache is not shared across branches) which have weird implications for how this change affects subsequent pushes to a branch vs. just new branches. I honestly don't really know how to measure those changes in a PR, instead of just experimenting with it longitudinally. I also don't entirely understand the methodology that mendral used here: what jobs did it run? Is it analyzing prior history or did it run experiments? I'm assuming it ran experiments, but I don't know what they were. I'd also like an Also mendral says it "Measured performance by temporarily removing save-if" - but that's not the only change it's making. So why is it making the other changes? mendral: swatinem/rust-cache is a third party gha that few of us really understand the config knobs of. Simply by reviewing this PR I'm having to learn it, I still haven't looked at the docs for it because I don't want to yet. But because I don't have any context on it, I want mendral to (1) have comments inline explaining the purpose of each field, why the default is bad in our case, and why the setting is better and (2) educate me about the code it's changing. Some of the context will be recoverable in the future via git blame but we're not very careful with keeping our changes fine-grained so it's much easier to just have it all as inline comments; inline comments also serve as guidance for future humans/agents about the assumptions that drove a given change (and when it's OK to unwind a change). I'm not sure how mendral should be choosing the education threshold for me, but at least for a change like this that is about configuring a third-party thing I want to spend as little time as I have to doing my own research to understand the change. Re methodology - because of the above notes on subtleties around the longitudinal implications of caching behavior I'm not sure if the data that mendral produced justifies the change. |
Pretty easy to ask the agent to amend this PR, if you prefer a specific string, let me know.
The rationale is to reuse the cache across jobs, as long as the jobs use the same dependencies / hash (hence including those info in the cache key).
I would keep it. The goal is for the cache to be saved only for jobs from Also Github's cache has a 10GB limit, so it's better to avoid PR's cache to evict the cache that was saved from
You can check all the runs here: https://github.com/BoundaryML/baml/actions?query=branch%3Amendral%2Foptimize-rust-cache-phase1 - it did several CI runs to confirm the cache was getting filled and compared the performance with and without the cache. We built more than this agent itself, we have a log ingestion system that measure the performance over time, the measures explained here were just done by the agent to confirm this PR is useful. It initially built a plan in 3 phases (briefly explained in the body), with this change being only the first step. We have the ability to observe the jobs performance impact before moving to phase 2 in a later PR.
It created a temporary commit to only removed
I can instruct it to add inline comments in the workflow to explain the purposed of each added option. Also good feedback to add it to the system prompt as well. |
Is there a syntax for that? I couldn't tag @mendral-app so couldn't tell. The prefix I'd want is
Fair enough - I'd just keep it on (Incidentally, I don't know how mendral can persuade me that my initial impression is wrong, when I read something and I think I disagree with it but I'm wrong. Your human feedback is doing that here, but without you in the loop idk what would do that 😅) Re the 10G cache limit, we were overcommitting that cache for a long time (I thought I saw up to 50G usage at one point a few months ago), but it looks like we're getting enforced on that now because the current number is 10.52G...
Ahh, gotcha. This is the part where I didn't realize what context I didn't have about what mendral does. That's neat, that it uses save-if to simulate the default branch populating the cache (I'm assuming it's doing this generically and not just for swatinem/rust-cache?) For my knowledge: is mendral targeted specifically at github actions performance improvement then? Because that changes my mental model for how much default trust I give it (the more specific, the more willing I am to take it at face value; the more general, the more scrutiny I apply; and I was definitely coming from the latter side). Incidentally, our cache storage being at limit will probably screw with mendral's performance numbers... unless your fork has its own independent cache? I don't know how that GH quota policy works... 😵💫 |
Summary
Implements Phase 1 of Rust caching optimization to improve CI build times and cache reusability.
Problem
Current cache hit rate: ~5% (48/50 recent jobs had cache misses)
Solution
1. Shared cache key based on Cargo.lock hash
v1-rust-shared-<cargo-lock-hash>-<os>-<arch>2. Enable workspace crate caching (except WASM)
--no-default-featuresdoesn't benefit from workspace crate caching3. Selective cache saves
Changes
.github/workflows/cargo-tests.reusable.yaml: Updated 7 jobs.github/workflows/ci.yaml: Updated benchmarks jobPerformance Validation
Measured performance by temporarily removing
save-ifto enable cache testing on the PR branch. Ran CI multiple times to populate cache and measure impact:Results (comparing cached vs non-cached runs):
Cache verification: 100% cache hit rate confirmed with shared Cargo.lock-based key.
WASM exclusion rationale: Initial testing showed WASM builds were 42% slower with workspace crate caching enabled (55s → 78s). Cross-compilation to
wasm32-unknown-unknownwith--no-default-featurescreates different artifacts that don't benefit from cached workspace crates. WASM still benefits from dependency caching via the shared key.Follow-up
This is Phase 1 of 3: