Add real-js subset that passes the parser and bench the first 10 files. #338

Draft
zbraniecki wants to merge 1 commit into mozilla-spidermonkey:master from zbraniecki:criterion-real-js

Conversation

@zbraniecki
Contributor

I'm not yet arguing whether this is worth merging, but let's talk about it!

I pulled https://github.com/nbp/real-js-samples/ and then looped over the direct files in the 20190416 folder (no subfolders), testing which of them parse without errors.

It left me with ~600 files which I pulled into benches/real-js.
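
In rough terms, the filtering step looks something like the sketch below; parse_script is just a placeholder name for whatever entry point the parser exposes, not the crate's actual API.

use std::fs;
use std::path::Path;

// Placeholder for the real parser entry point; swap in the actual call.
fn parse_script(_source: &str) -> Result<(), String> { unimplemented!() }

fn main() -> std::io::Result<()> {
    fs::create_dir_all("benches/real-js")?;
    for entry in fs::read_dir("real-js-samples/20190416")? {
        let path = entry?.path();
        // Direct files only; skip subfolders.
        if !path.is_file() {
            continue;
        }
        // Skip files that aren't valid UTF-8.
        let source = match fs::read_to_string(&path) {
            Ok(s) => s,
            Err(_) => continue,
        };
        // Keep only the files that currently parse without errors.
        if parse_script(&source).is_ok() {
            fs::copy(&path, Path::new("benches/real-js").join(path.file_name().unwrap()))?;
        }
    }
    Ok(())
}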

Unfortunately, parsing all of them once takes 544.38 ms on my machine.

I can conclude the test in ~30s (10 samples) with:

let mut c = Criterion::default().sample_size(10);
Benchmarking parser_real_js_bench: Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 30.0s.
Benchmarking parser_real_js_bench: Collecting 10 samples in estimated 30.016 s (55 iterations)
Benchmarking parser_real_js_bench: Analyzing
parser_real_js_bench    time:   [542.29 ms 544.38 ms 545.86 ms]
                        change: [-0.6367% +1.5335% +3.4020%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
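
The harness itself is essentially a loop over the preloaded sources inside b.iter(), roughly the sketch below (again with parse_script as a placeholder, not the crate's real API); the sample_size(10) config is what produces the ~30s run above.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::fs;

// Placeholder for the real parser entry point.
fn parse_script(_source: &str) -> Result<(), String> { unimplemented!() }

fn parser_real_js_bench(c: &mut Criterion) {
    // Read every file in benches/real-js up front so I/O stays out of the timing.
    let sources: Vec<String> = fs::read_dir("benches/real-js")
        .unwrap()
        .map(|entry| fs::read_to_string(entry.unwrap().path()).unwrap())
        .collect();

    c.bench_function("parser_real_js_bench", |b| {
        b.iter(|| {
            for source in &sources {
                let _ = parse_script(black_box(source));
            }
        })
    });
}

criterion_group! {
    name = benches;
    // 10 samples keeps the full ~600-file run around 30s.
    config = Criterion::default().sample_size(10);
    targets = parser_real_js_bench
}
criterion_main!(benches);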

or, I can instead .take(10) paths and conclude the test in ~10s (5050 samples):

Benchmarking parser_real_js_bench: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s or reduce sample count to 50.
parser_real_js_bench    time:   [1.9195 ms 1.9214 ms 1.9236 ms]                                  
                        change: [-99.647% -99.645% -99.643%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe
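
The only change in the ~10s variant is how the file list is built, e.g. (note read_dir gives no guaranteed order, so "first 10" means filesystem order):

// Bench only the first 10 files instead of all ~600.
let sources: Vec<String> = fs::read_dir("benches/real-js")
    .unwrap()
    .take(10)
    .map(|entry| fs::read_to_string(entry.unwrap().path()).unwrap())
    .collect();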

10 samples is likely way too low to establish any reasonable statistical significance, while testing only 10 files doesn't give us the whole picture, of course.

My recommendation would be to land it with the 10-file model first and, as the parser improves, increase the number of files we incorporate.
This should give us good statistical significance on a subset of real-world JS files and allow engineers to verify whether their changes affect that sample.

@nbp
Collaborator

nbp commented Mar 2, 2020

I have a stupid question, independent of what JS source is being benchmarked: why warm up?

Except for lazy parsing (the second parse), the syntax parse (first parse) would always be a cold parse: the source input has only just made it to memory but might be consumed by a different thread, and the parser code has likely been evicted from the L1 and maybe L2 caches, since other code ran in between.

@zbraniecki
Contributor Author

I think it's a good question!

Here's the explanation of analysis from the criterion crate: https://github.com/bheisler/criterion.rs/blob/master/book/src/analysis.md#warmup

Quote:

In this phase, the routine is executed repeatedly to give the OS, CPU and JIT time to adapt to the new workload. This helps prevent things like cold caches and JIT compilation time from throwing off the measurements later.

the syntax parse (first parse) would always be a cold parse: the source input has only just made it to memory but might be consumed by a different thread, and the parser code has likely been evicted from the L1 and maybe L2 caches, since other code ran in between.

That's what I would expect to happen as well in the real world. For that reason, no microbenchmark can be representative of real world usage.

The value of a microbenchmark lies in its ability to reproduce the same result multiple times and provide feedback on the impact of a single change on the performance of the code in the microbenchmark scenario.

My interpretation of that explanation, applied to our use case, is that warming up lets us minimize several external factors that would otherwise make statistical-significance analysis impossible.
If you run the sample 5000 times but the first 100 run before the CPU ramps up to a higher frequency state, or before the OS prioritizes the thread, you'll always get a high stdev and unreliable results.
Warming up maximizes the odds that all measured samples are captured in the same environment, so their stdev should remain low and the resulting p-value should end up quite high when nothing has really changed.

That allows you to reason about the impact of your patch on that particular benchmark, and Criterion tries to help by reporting whether the change explains the variance sufficiently.

To put my thinking in more lay terms: Criterion doesn't tell you how fast your code is. Criterion tells you whether the change you're making makes your parser faster.
In my experience, the vast majority of decisions when writing a lexer/parser are not affected by L1/L2, cold or warm; they're driven by tradeoffs: will a peek improve or slow down this scenario? Should I match on bytes or chars? How do I write skip_whitespace to make it fast?
Criterion, I think, is really good at answering that within 5 seconds, which encourages a quick write/test loop.
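
For completeness, the warm-up and measurement windows are tunable through Criterion's builder (these are real Criterion methods; the values just mirror the runs pasted above):

use std::time::Duration;
use criterion::Criterion;

// 3s warm-up (the default) plus a 30s measurement window and 10 samples,
// matching the first run above.
let mut c = Criterion::default()
    .warm_up_time(Duration::from_secs(3))
    .measurement_time(Duration::from_secs(30))
    .sample_size(10);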

@nbp
Collaborator

nbp commented Mar 4, 2020

Ok, so Criterion is good for measuring the impact of changes on the throughput of code which is expected to be executed frequently and remains hot in the parser.

Therefore I do not think the real-js-samples would satisfy these conditions and provide us with useful data when run under Criterion, as it would focus on the throughput of code which might not be limited the same way once used in production. For example, to improve the benchmark we might switch to SIMD instructions (fast execution) instead of making the code smaller (fast load time). It might still be useful if we were careful about the limiting factor, which in practice we do not do.

Another reason, unrelated to benchmarking, is that real-js-samples is a bunch of extracted code with various licenses. Thus we should probably not embed these files in this repository. @Yoric might know better.

For example, git added code to trash the disk cache to make its benchmarks more reliable. Maybe there is another way to warm up the machine-specific parts, such as the CPU frequency, and to preload the binary containing the code to avoid disk accesses, while also ensuring that the code gets evicted from the L1 instruction cache and so on.

@jorendorff
Contributor

@nbp Interesting, I haven't seen anything like this critique of Criterion before.

An upside of using Criterion is, of course, that it provides very steady, reproducible numbers... it's actionable data. But if those numbers don't represent anything relevant to end users then there's no point.

@Yoric

Yoric commented Mar 12, 2020

We do not have licenses for the files inside real-js-samples. I don't think anybody would sue us for using their publicly available files for testing, but we'd need at least approval from legal.

@zbraniecki
Contributor Author

I agree that the model of a microbenchmark is not perfectly applicable to "real life", but I think that for the vast majority of developer use cases a Criterion benchmark is a useful approximation of a change's impact on the code's performance.

@Yoric

Yoric commented Mar 13, 2020

@zbraniecki I think that this should be decided by running actual tests :)

Regardless of Legal, we can have a non-public machine running benchmarks based on real-js-samples. We can have a public machine running micro-benchmarks. If trends match, we can eventually get rid of the non-public machine. If they don't, we need to get authorization to publish the real tests.
