dataset.json: a set of requirements with semantically different specifications, collected from this dataset.generate.py: script for the generation of the benchmark Alloy models from the dataset.problems: pre-generated benchmark of Alloy models.evaluation.py: script for executing the evaluation of the narrowing algorithms over the benchmarks.alloy.py: script to process Alloy 5 models.optimal.pyandnonoptimal.py: directory containing scripts to call LLM APIs and ask for valid/invalid instances given a dataset of requirements and header promptsresults.txt: pre-collected results from running the evaluation over the benchmark.AlloyMax-1.0.2.jar: snapshot of AlloyMax, used to parse and execute Alloy models and commands.paper.als: example Alloy model used as the paper's running example.
| Specifications | 4 | 8 | 12 | 16 | 20 | 24 | 28 |
|---|---|---|---|---|---|---|---|
| Completed | 27 | 27 | 26 | 25 | 22 | 19 | 12 |
| Size | 2.44 | 4.07 | 5.04 | 5.76 | 6.68 | 7.05 | 7.67 |
| Min | 2 | 3 | 4 | 4 | 5 | 5 | 6 |
| Max | 3 | 6 | 6 | 7 | 8 | 9 | 9 |
| Median | 2.00 | 4.00 | 5.00 | 6.00 | 7.00 | 7.00 | 8.00 |
| Time | 0.27 | 1.71 | 3.48 | 6.78 | 13.10 | 20.99 | 25.68 |
| Min | 0.17 | 0.29 | 0.48 | 0.74 | 1.78 | 3.97 | 7.33 |
| Max | 0.51 | 25.73 | 34.87 | 25.59 | 46.59 | 55.92 | 57.33 |
| Median | 0.24 | 0.60 | 1.66 | 4.14 | 8.09 | 19.86 | 22.61 |
| Specifications | 4 | 8 | 12 | 16 | 20 | 24 | 28 |
|---|---|---|---|---|---|---|---|
| Completed | 28 | 28 | 28 | 28 | 28 | 28 | 28 |
| Size | 2.82 | 5.93 | 7.86 | 9.82 | 11.43 | 12.96 | 14.39 |
| Min | 2 | 4 | 6 | 7 | 7 | 8 | 9 |
| Max | 3 | 7 | 10 | 13 | 15 | 18 | 22 |
| Median | 3.00 | 6.00 | 8.00 | 10.00 | 11.50 | 13.00 | 14.00 |
| Time | 0.44 | 0.96 | 1.58 | 2.21 | 2.92 | 3.82 | 4.82 |
| Min | 0.28 | 0.56 | 0.88 | 1.24 | 1.49 | 1.91 | 2.19 |
| Max | 0.60 | 1.47 | 2.79 | 3.67 | 5.56 | 7.75 | 10.60 |
| Median | 0.49 | 1.02 | 1.65 | 2.29 | 3.05 | 3.90 | 4.78 |
| Specifications | 16 |
|---|---|
| Completed | 25 |
| Size | 5.76 |
| Min | 4 |
| Max | 7 |
| Median | 6.00 |
| Time | 6.78 |
| Min | 0.74 |
| Max | 25.59 |
| Median | 4.14 |
| Specifications | 16 |
|---|---|
| Completed | 22 |
| Size | 5.68 |
| Min | 4 |
| Max | 7. |
| Median | 6.00 |
| Time | 11.20 |
| Min | 1.26 |
| Max | 49.18 |
| Median | 7.81 |
| Specifications | 16 |
|---|---|
| Completed | 18 |
| Size | 5.72 |
| Min | 4 |
| Max | 7 |
| Median | 6.00 |
| Time | 18.35 |
| Min | 2.69 |
| Max | 47.75 |
| Median | 18.55 |
| Specifications | 16 |
|---|---|
| Completed | 11 |
| Size | 5.64 |
| Min | 4 |
| Max | 7 |
| Median | 6.00 |
| Time | 20.95 |
| Min | 3.66 |
| Max | 41.91 |
| Median | 17.59 |
| Specifications | 16 |
|---|---|
| Completed | 10 |
| Size | 5.50 |
| Min | 4 |
| Max | 7 |
| Median | 5.50 |
| Time | 32.08 |
| Min | 7.95 |
| Max | 57.90 |
| Median | 35.73 |
| Specifications | 16 |
|---|---|
| Completed | 28 |
| Size | 9.82 |
| Min | 7 |
| Max | 13 |
| Median | 10.00 |
| Time | 2.21 |
| Min | 1.24 |
| Max | 3.67 |
| Median | 2.29 |
| Specifications | 16 |
|---|---|
| Completed | 28 |
| Size | 9.93 |
| Min | 7 |
| Max | 14 |
| Median | 10.00 |
| Time | 4.31 |
| Min | 1.48 |
| Max | 10.94 |
| Median | 4.60 |
| Specifications | 16 |
|---|---|
| Completed | 28 |
| Size | 10.25 |
| Min | 7 |
| Max | 14 |
| Median | 10.00 |
| Time | 6.34 |
| Min | 1.65 |
| Max | 21.13 |
| Median | 6.03 |
| Specifications | 16 |
|---|---|
| Completed | 28 |
| Size | 10.21 |
| Min | 8 |
| Max | 13 |
| Median | 10.00 |
| Time | 11.55 |
| Min | 3.99 |
| Max | 43.27 |
| Median | 9.20 |
| Specifications | 16 |
|---|---|
| Completed | 27 |
| Size | 10.04 |
| Min | 7 |
| Max | 13 |
| Median | 10.00 |
| Time | 14.91 |
| Min | 5.93 |
| Max | 46.21 |
| Median | 9.43 |
Start by installing the requirements.
pip install -r requirements.txt
Script generate.py, for each pair model/requirements in dataset.json, creates an Alloy model with alternative requirement specifications as predicates. The specifications in dataset.json are from a dataset that guarantees that the alternative specifications are non-equivalent, as required by the algorithms. The optimal narrowing algorithm relies on AlloyMax which does not support Alloy 6; although the requirements in dataset.json are purely static, temporal Alloy 6 features occasionally occur in student submissions. Those are filtered out by generate.py.
The script takes as argument a range of sizes for the set of alternative specifications. If a certain requirement does not have the required number of specifications, it is skipped.
Run the script as:
Usage: python generate.py <min_size> <max_size> <step_size> <models>
The script was run for models Social Network (x3JXgWhJ3uti5Dzxz), Production line (dyj49tEp7j6aWAQQX) and Train Station (cXPP9QBPTYgTX6WJ6), ranging from 4 to 28 specifications, with step of 4 (the minimal number of specifications among all requirements in dataset.json is 29). The pre-generated models are available in problems.
python3 generate.py 4 28 4 x3JXgWhJ3uti5Dzxz dyj49tEp7j6aWAQQX cXPP9QBPTYgTX6WJ6
Script evaluation.py calls a given narrowing algorithms (either optimal or nonoptimal) and runs it for Alloy models available in folder problems withing a given range of specification set size, and a range of scopes (the default upper-bound for top-level signatures). Results are presented in the standard output.
Note that, even though the selected alternative specifications are guaranteed to be non-equivalent, they may become equivalent within certain smaller scopes. In those cases ...
To evaluate a narrowing algorithm, the evaluation script is run as:
"Usage: python evaluation.py <algorithm> <min_scope> <max_scope> <min_size> <max_size>"
Where <algorithm> is either optimal or nonoptimal.
The script was run for some combinations of specification set size and scopes to answer the paper's RQs. The results pre-calculated results is available at results.txt.
python3 evaluation.py optimal 3 3 4 28
python3 evaluation.py nonoptimal 3 3 4 28
python3 evaluation.py optimal 3 7 16 16
python3 evaluation.py nonoptimal 3 7 16 16