Add real-js subset that passes the parser and bench the first 10 files. #338
zbraniecki wants to merge 1 commit into mozilla-spidermonkey:master
Conversation
I have a stupid question, independent of what JS source is being benchmarked: why warm up? Except for lazy parsing (the second parse), the syntax parse (first parse) would always be a cold parse: the source input has just barely made it into memory, it might be consumed by a different thread, and the parser code has likely been evicted from the L1 and maybe L2 caches since other code ran in between.
I think it's a good question! Here's the explanation of the analysis and warm-up phases from the criterion crate's book: https://github.com/bheisler/criterion.rs/blob/master/book/src/analysis.md#warmup Quote:
That's what I would expect to happen in the real world as well. For that reason, no microbenchmark can be representative of real-world usage. The value of a microbenchmark lies in its ability to reproduce the same result multiple times and to provide feedback on the impact of a single change on the performance of the code in the microbenchmark scenario.

My interpretation of that explanation, applied to our use case, is that warming up lets us minimize the external factors that would otherwise make statistical-significance analysis impossible. That lets you reason about the impact of your patch on that particular benchmark, and criterion tries to help by telling you whether the change explains the variance sufficiently. To put my thinking in more lay terms: criterion doesn't tell you how fast your code is; criterion tells you whether the change you're making makes your parser faster.
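As an illustration of where that warm-up fits, here is a minimal sketch of a Criterion bench with an explicit `warm_up_time`. The `parse_script` stand-in and the file path are assumptions for the example, not the project's actual API, and the real benchmark code in this repository may configure things differently.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::time::Duration;

// Hypothetical stand-in for the real parser entry point.
fn parse_script(source: &str) -> usize {
    black_box(source.len())
}

fn warm_parse(c: &mut Criterion) {
    // Load the source outside the measured closure so only parsing is timed.
    let source = std::fs::read_to_string("benches/real-js/example.js").unwrap();
    c.bench_function("warm-parse", |b| b.iter(|| parse_script(&source)));
}

criterion_group! {
    name = benches;
    // Criterion runs the routine for this long before sampling starts, so
    // caches, branch predictors, and CPU frequency settle into a steady state.
    config = Criterion::default().warm_up_time(Duration::from_secs(3));
    targets = warm_parse
}
criterion_main!(benches);
```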
Ok, so Criterion is good for measuring the impact of changes on the throughput of code that is expected to be executed frequently and to stay hot in the parser. I therefore do not think the real-js-samples would satisfy these conditions and provide us with useful data when executed under Criterion, as it would focus on the throughput of code that might not be limited the same way once used in production. For example, to improve the benchmarks we might switch to SIMD instructions (fast execution) instead of making the code smaller (fast load time). It might still be useful if we were careful about the limiting factor, which in practice we do not do.

Another reason, unrelated to benchmarking, is that real-js-samples is a bunch of extracted code under various licenses. We should probably not embed these files in this repository; @Yoric might know better.

For example, git added code to trash the disk cache to make their benchmarks more reliable. Maybe there is another way to warm up the relevant parts of the machine, such as the CPU frequency, to preload the binary containing the code to avoid disk accesses, and also to ensure that the code got evicted from the L1 instruction cache and so on.
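To make the cold-cache concern concrete, here is one way it could be approximated with Criterion: a sketch (not from this PR) that uses `iter_batched` with a setup closure walking a large scratch buffer before every measured iteration, so the parser's working set is evicted first. The `parse_script` stand-in, the file path, and the 64 MiB buffer size are assumptions; this only evicts the data/unified caches and does nothing about the OS disk cache or the L1 instruction cache mentioned above.

```rust
use criterion::{black_box, criterion_group, criterion_main, BatchSize, Criterion};
use std::fs;

// Hypothetical stand-in for the real parser entry point.
fn parse_script(source: &str) -> usize {
    black_box(source.len())
}

// Walk a buffer assumed to be larger than the last-level cache so previously
// cached data (including the parser's working set) gets evicted.
fn thrash_caches(scratch: &mut [u8]) {
    for (i, byte) in scratch.iter_mut().enumerate() {
        *byte = byte.wrapping_add(i as u8);
    }
}

fn cold_ish_parse(c: &mut Criterion) {
    // Hypothetical input file; any of the real-js samples would do.
    let source = fs::read_to_string("benches/real-js/example.js").unwrap();
    let mut scratch = vec![0u8; 64 * 1024 * 1024];

    c.bench_function("cold-ish-parse", |b| {
        b.iter_batched(
            || {
                // The setup closure is not measured.
                thrash_caches(&mut scratch);
                source.clone()
            },
            |src| parse_script(&src),
            BatchSize::PerIteration,
        )
    });
}

criterion_group!(benches, cold_ish_parse);
criterion_main!(benches);
```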
@nbp Interesting, I haven't seen anything like this critique of Criterion before. An upside of using Criterion is, of course, that it provides very steady, reproducible numbers... it's actionable data. But if those numbers don't represent anything relevant to end users, then there's no point.
We do not have licenses for the files inside real-js-samples.
I agree that the model of a microbenchmark is not perfectly applicable to "real life", but I think that for the vast majority of developer use cases a criterion benchmark is a useful approximation of the impact of a change on the code's performance.
@zbraniecki I think that this should be decided by running actual tests :) Regardless of Legal, we can have a non-public machine running benchmarks based on real-js-samples. We can have a public machine running micro-benchmarks. If trends match, we can eventually get rid of the non-public machine. If they don't, we need to get authorization to publish the real tests. |
I'm not yet arguing whether this is worth merging, but let's talk about it!
I pulled https://github.com/nbp/real-js-samples/ and then looped over the direct files in the `20190416` folder (no subfolders), testing which of them parse without errors. That left me with ~600 files, which I pulled into `benches/real-js`.

Unfortunately, parsing all of them once takes 544.38 ms on my machine. I can conclude the test in ~30s (10 samples) with:
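The exact snippet is not reproduced in this thread; the following is a minimal sketch of what such a bench could look like, assuming a recent Criterion API (`benchmark_group` plus `sample_size`) and a hypothetical `parse_script` stand-in for the parser entry point. Reading the files up front keeps file I/O out of the measured loop.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::fs;

// Hypothetical stand-in for the real parser entry point.
fn parse_script(source: &str) -> usize {
    black_box(source.len())
}

fn real_js(c: &mut Criterion) {
    // Read every file from benches/real-js up front so only parsing is measured.
    let sources: Vec<String> = fs::read_dir("benches/real-js")
        .unwrap()
        .filter_map(|entry| fs::read_to_string(entry.unwrap().path()).ok())
        .collect();

    let mut group = c.benchmark_group("real-js");
    // One pass over all ~600 files takes ~544 ms, so cap the sample count at
    // Criterion's minimum to keep the whole run around 30 s.
    group.sample_size(10);
    group.bench_function("parse-all", |b| {
        b.iter(|| {
            for source in &sources {
                parse_script(source);
            }
        })
    });
    group.finish();
}

criterion_group!(benches, real_js);
criterion_main!(benches);
```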
Or I can instead `.take(10)` paths and conclude the test in ~10s (5050 samples); a sketch of that variant follows below.

10 samples is likely way too low to establish reasonable statistical significance, while testing only 10 files doesn't give us the whole picture, of course.
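For the 10-file variant only the way the sources are collected changes; a sketch of that difference, reusing the `parse_script` stand-in above (the 5050 figure most likely refers to Criterion's total iteration count at its defaults: 100 samples with linearly growing iteration counts, and 1 + 2 + … + 100 = 5050):

```rust
// Variant: bench only the first 10 files that read_dir yields (directory
// order is not guaranteed) and keep Criterion's default sample count.
let sources: Vec<String> = fs::read_dir("benches/real-js")
    .unwrap()
    .take(10)
    .filter_map(|entry| fs::read_to_string(entry.unwrap().path()).ok())
    .collect();
// The rest of the bench body is identical, minus the `sample_size(10)` cap.
```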
My recommendation would be to land it with the `10 files` model first and, as the parser improves, increase the number of files we incorporate. This should give us good statistical significance on a subset of real-world JS files and allow engineers to verify whether or not their changes affect that sample.