PEG grammar optimizer and formatter. Supports any grammar supported by the PackCC parser generator.
Pegof parses a PEG grammar from an input file and extracts its AST. Then, depending on the command, it either prints the grammar in a clean, consistent format or dumps the AST directly. It can also run a multi-pass optimization process that walks the AST and tries to simplify it as much as possible, reducing the number of rules and terms.
pegof [<options>] [--] [<input_file> ...]
-h/--help Show help (this text)
-V/--version Show version and exit
-c/--conf FILE Use given configuration file
-v/--verbose [LEVEL] Increase verbosity of logging by LEVEL (defaults to 1), may be repeated
-d/--debug Output very verbose debug info, implies max verbosity
-S/--skip-validation Skip result validation (useful only for debugging purposes)
-b/--benchmark SCRIPT Benchmarking script, see documentation for details
-D/--debug-script SCRIPT Debugging script, see documentation for details
-f/--format Output formatted grammar (default)
-a/--ast Output abstract syntax tree representation
-g/--graph Output description of the grammar in GraphViz format
-p/--packcc Output source files as if the grammar was passed to packcc
-P/--packcc-options Additional comma separated options passed to packcc
Supported options are 'lines', 'ascii' and 'debug' and also their short forms 'a', 'l' and 'd'
Note: --lines might not work as expected, because temporary file is used
-n/--inplace Modify the input files, use with caution
-i/--input FILE Path to file with PEG grammar, multiple paths can be given
Value "-" can be used to specify standard input
Mainly useful in config file
If no file or --input is given, read standard input
-o/--output FILE Output to file (should be repeated if there are multiple inputs)
Value "-" can be used to specify standard output
-I/--import PATH Directory where to search for import files
May be repeated for multiple locations
-H/--header Whether to write "Generated by pegof" header
Possible values are 'never', 'always' or 'auto' (which is the default)
-q/--quotes single/double Switch between double and single quoted strings (defaults to double)
-w/--wrap-limit N Wrap alternations with more than N sequences (default 1)
-t/--indent FORMAT How to indent long rules
FORMAT may be 'sN' for spaces or 'tN' for tabs, where N is a number (e.g.: 't1' for single tab)
Some sane literal strings are also accepted (e.g.: ' ' or '\t\t')
Default is 4 spaces
-O/--optimize OPT[,...] Comma separated list of optimizations to apply
-X/--exclude OPT[,...] Comma separated list of optimizations that should not be applied
-l/--inline-limit N Minimum inlining score needed for rule to be inlined
Number between 0.0 (inline everything) and 1.0 (most conservative)
Default is 0.2
Only applied when inlining is enabled
-N/--no-follow Do not inline imported files while optimizing
- all: All optimizations. Shorthand option for the combination of all available optimizations.
- concat-char-classes: Character class concatenation. Join adjacent character classes in alternations into one. E.g. [AB] / [CD] becomes [ABCD].
- concat-strings: String concatenation. Join adjacent string nodes into one. E.g. "A" "B" becomes "AB".
- double-negation: Removing double negations. Negation of a negation can be ignored, because it results in the original term (e.g. !(!TERM) -> TERM).
- double-quantification: Removing double quantifications. If a single term is quantified twice, it is always possible to convert this into a single postfix operator with equal meaning (e.g. (X+)? -> X*).
- empty-action: Removing empty actions. Actions that contain only whitespace are discarded.
- inline: Rule inlining. Some simple rules can be inlined directly into the rules that reference them. Reducing the number of rules improves the speed of the generated parser.
- none: No optimizations. Shorthand option for no optimizations.
- normalize-char-class: Character class optimization. Normalize character classes to avoid duplicates and use ranges where possible. E.g. [ABCDEFX0-53-9X] becomes [0-9A-FX].
- remove-group: Remove unnecessary groups. Some parentheses can be safely removed without changing the meaning of the grammar. E.g. A (B C) D becomes A B C D, or X (Y)* Z becomes X Y* Z.
- repeats: Removing unnecessary repeats. Joins repeated rules into a single quantifier. E.g. "A A*" -> "A+", "B* B*" -> "B*" etc.
- single-char-class: Convert single character classes to strings. The code generated for strings is simpler than that generated for character classes, so for example [\n] can be converted to "\n".
- unused-capture: Removing unused captures. Captures denoted in the grammar that are not used in any source block or error block, nor referenced (via $n), are discarded.
- unused-variable: Removing unused variables. Variables denoted in the grammar (e.g. e:expression) that are not used in any source or error block are discarded.
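As a rough illustration of how --optimize and --exclude combine, the sketch below writes a tiny grammar and runs every optimization except rule inlining. The file name sample.peg and the rule in it are made up for this example, and the pegof call is skipped when the tool is not installed. Going by the list above, one would expect the adjacent strings to be merged and [\n] to become a string, but the exact output is not shown here because it depends on the pegof version.

```shell
# Hypothetical sample grammar; file name and rule are illustrative only.
cat > sample.peg <<'EOF'
greeting <- "Hello" ", " "world" [\n]
EOF

# Apply all optimizations except rule inlining and print the result
# to standard output. Skipped when pegof is not on PATH.
if command -v pegof >/dev/null 2>&1; then
    pegof --optimize all --exclude inline --output - sample.peg
fi
```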
A configuration file can be specified on the command line. It can contain the same options as the command line, only without the leading dashes. Short versions are supported as well but are not recommended, because a config file should be easy to read. Keeping one option per line is encouraged, but any whitespace character can be used as a separator.
The following config file would result in the same behavior as if pegof was called without any arguments:
format
input -
output -
quotes double
inline-limit 0.2
wrap-limit 1
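A non-default configuration could look like the sketch below. It only uses long option names from the usage text above, written without dashes. The file name pegof.conf and the particular choice of options are assumptions made for illustration.

```shell
# Write a hypothetical config file (the name pegof.conf is an assumption).
cat > pegof.conf <<'EOF'
optimize all
exclude inline
quotes single
indent s2
EOF

# Then point pegof at it, for example:
#   pegof --conf pegof.conf grammar.peg
```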
Pegof makes it simple to measure how the optimizations affect the size and speed of the final code.
When the --benchmark parameter is passed, its argument is used as the benchmarking script.
The benchmark runs twice: first right after the input grammar is parsed and again after
all the optimizations are done, so the results can be easily compared.
The script is called with two parameters:
<benchmark_script> [setup|benchmark|teardown] <basename>
- setup: when called with this argument, the script should set up the environment for the benchmark (e.g. compile the code, prepare input data)
- benchmark: this phase is the only one actually measured, so it should only do the actual parsing
- teardown: is passed to clean up after the benchmark (e.g. delete the compiled files or input data)
- <basename> is always the base path to sources generated from the grammar
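The three phases map naturally onto a case statement. The following is a minimal sketch, not one of the scripts shipped with pegof. It assumes that the generated source is "<basename>.c", that the grammar provides its own main() reading from standard input, and that input.json is sample data you supply. The function name bench_phase is likewise invented for this example.

```shell
#!/bin/sh
# Hypothetical benchmark script skeleton for `pegof --benchmark`.
# Assumptions: the generated parser is "$base.c", it defines main()
# and parses standard input; input.json is your own sample data.

bench_phase() {
    phase="$1"
    base="$2"
    case "$phase" in
        setup)
            # Compile the generated parser into a throwaway binary.
            cc -O2 -o "$base.bin" "$base.c"
            ;;
        benchmark)
            # Only this phase is measured, so do nothing but parse.
            "$base.bin" < input.json > /dev/null
            ;;
        teardown)
            # Remove what setup created.
            rm -f "$base.bin"
            ;;
        *)
            echo "usage: [setup|benchmark|teardown] <basename>" >&2
            return 2
            ;;
    esac
}

# Dispatch only when the script is called with arguments.
if [ "$#" -gt 0 ]; then
    bench_phase "$@"
fi
```

Keeping compilation in setup and cleanup in teardown means the measured phase contains only the parsing run, which is what the duration and memory columns are meant to reflect.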
To get a better idea of how to implement such a script, see the examples in the benchmark/scripts directory. Typical usage looks like this:
pegof --optimize all --benchmark benchmark/scripts/json.sh --output /dev/null benchmark/grammars/json.peg
The output will look somewhat like this:
         |      lines |      bytes |      rules |      terms |   duration |     memory
---------+------------+------------+------------+------------+------------+-----------
input    |       1773 |      60271 |         10 |         56 |        341 |      29972
output   |       1819 |      62053 |          4 |         69 |        298 |      21512
output % |       102% |       102% |        40% |       123% |        87% |        71%
Here the input line shows the values before any optimization, and the output line shows the values after everything
is optimized according to the provided options. The columns in the benchmark result table have the following meaning:
- lines: number of lines in the C code generated by packcc
- bytes: length of the generated C code in bytes
- rules: number of rules in the grammar
- terms: number of terms in the grammar
- duration: how long the benchmark ran, in milliseconds
- memory: peak resident set memory in kB (only measured if GNU Time or BusyBox are installed)
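The "output %" row appears to be the output value expressed as a percentage of the input value. The figures in the sample table are consistent with the fraction being dropped rather than rounded, although that is an inference from the numbers, not documented behavior. The helper name pct below is made up for this sketch.

```shell
# pct INPUT OUTPUT: print OUTPUT as an integer percentage of INPUT.
# Assumption: the fraction is truncated, as shell integer division does.
pct() {
    echo "$(( 100 * $2 / $1 ))%"
}

pct 341 298      # duration: prints 87%
pct 29972 21512  # memory:   prints 71%
```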
Since pegof is still under development, it may sometimes contain bugs. Two options help to find out
what is happening under the hood. The first is --debug, which turns on maximum verbosity and shows in great detail
all the optimization decisions that are considered and applied. The second is --debug-script, which is aimed
at more subtle bugs, where the grammar can be processed and compiled but produces unexpected results. The script
passed to this option is run after each optimization pass, so it can be used to pinpoint which optimization causes
the problem. The only parameter passed on the script's command line is the name of a file containing the current state of the
grammar. The debug script should return 0 if the grammar works as expected. A non-zero exit code means that this
optimization iteration broke the grammar, and pegof will immediately stop its execution.
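A debug script can be as simple as "build the parser, feed it a known input, compare with a known output". The sketch below follows that idea under several assumptions that do not come from the pegof documentation: packcc and cc are on PATH, the grammar defines its own main() that parses standard input, and sample.txt and expected.txt are fixtures you provide. The function name check_grammar is invented for this example.

```shell
#!/bin/sh
# Hypothetical debug script for `pegof --debug-script`.
# $1 is the file holding the current state of the grammar.
# Exit code 0 means the grammar still behaves; non-zero means it broke.

check_grammar() {
    grammar="$1"
    work="$(mktemp -d)" || return 1
    # Generate C sources, compile them, run the parser on the fixture
    # and compare the result with the expected output.
    packcc -o "$work/parser" "$grammar" \
        && cc -o "$work/parser.bin" "$work/parser.c" \
        && "$work/parser.bin" < sample.txt > "$work/actual.txt" \
        && cmp -s "$work/actual.txt" expected.txt
    status=$?
    rm -rf "$work"
    return "$status"
}

# Dispatch only when the script is called with an argument.
if [ "$#" -gt 0 ]; then
    check_grammar "$1"
fi
```

It would be passed to pegof as, for example, pegof --optimize all --debug-script ./check.sh grammar.peg, where check.sh is whatever name you saved the script under.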
This application was written with readability and maintainability in mind. Speed of execution is not a focus. Some very large grammars (e.g. for the Kotlin language) can take a few minutes to process.
Pegof uses CMake. To build it, just run:
cmake -B ./build
cmake --build ./build --parallel 4
cmake --build ./build --target test # optional, but recommended
Building on non-Linux platforms has not been tested and might require some modifications to the process or even to the application itself.
If you just want to give pegof a quick try, without going through the hassle of compilation, there are two simple options.
There is a slightly limited version of pegof running in the browser, which you can find at https://dolik-rce.github.io/pegof. It is only intended as a playground, not for serious work.
It uses the same code as the command-line version, just compiled to WebAssembly using Emscripten. This limits how it can be used, so some features are not available or do not make sense (e.g. operating on multiple files, imports or in-place changes).
If you want to try the full version of pegof but still avoid compilation, you can use a Docker image, e.g.:
docker pull dolik/pegof:latest
# running without arguments just prints usage information
docker run dolik/pegof:latest
# to optimize a grammar you can call it like this:
docker run -i dolik/pegof:latest --verbose --optimize all < grammar.peg > optimized.peg
# to format all grammars in a directory, you'll have to mount it:
docker run -it -v "$PWD/grammars:/grammars" dolik/pegof:latest --verbose --inplace grammars/*.peg
Big thanks go to Arihiro Yoshida, author of PackCC, for maintaining this great and very useful tool.