Description
While refactoring the parser service to replace clang with srcML as the source code parser, we compared the functions identified by clang with those identified by srcml and observed certain differences. In this comparison, each function was uniquely identified by a string of the form <name>@<begin>:<end>, where <name> is the name of the function and <begin> and <end> are the line numbers at which the declaration or definition of the function begins and ends, respectively. We used FFmpeg at commit 20fed2f as the subject of the comparison.
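The comparison described above can be sketched as a set difference over these identifier strings. This is a minimal illustration, not the parser service's actual code: the helper names `function_id` and `missed_functions` and the sample values are hypothetical.

```python
def function_id(name: str, begin: int, end: int) -> str:
    """Build the <name>@<begin>:<end> identifier used in the comparison."""
    return f"{name}@{begin}:{end}"


def missed_functions(clang_ids: set, srcml_ids: set) -> set:
    """Return the identifiers that clang reported but srcml did not."""
    return clang_ids - srcml_ids


# Hypothetical identifiers for a single file, as each parser might report them.
clang_ids = {function_id("add", 3, 3), function_id("say_hello", 9, 12)}
srcml_ids = {function_id("add", 3, 6), function_id("say_hello", 9, 15)}

missed = missed_functions(clang_ids, srcml_ids)
print(sorted(missed))
```

Because the end line is part of the identifier, a function whose name and start line agree is still counted as missed when the two parsers disagree on where it ends.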
We manually reviewed a random sample of 25 files in which srcml failed to identify at least one function that clang identified. The review showed that srcml incorrectly treats blank lines and preprocessor lines that follow a function declaration or definition as part of that declaration or definition. The example below demonstrates the issue.
Example
1 #include<stdio.h>
2
3 int add(int a, int b);
4
5 #define MESSAGE "message"
6
7 int multiply(int a, int b);
8
9 static void say_hello()
10 {
11 printf("hello world!!!\n");
12 }
13
14 #define PI 3.141592654
15
16 int main()
17 {
18 say_hello();
19 return 0;
20 }
The example source code was parsed using srcml --position code.c > parse.srcml.txt. The parse.srcml.txt contains the parse of the source code in the SrcML format. The issues observed in the parse are as follows:
pos:end of the declaration of add is 6:0 instead of 3:0.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
...
<function_decl pos:start="3:1" pos:end="6:0">
<type pos:start="3:1" pos:end="3:3">
<name pos:start="3:1" pos:end="3:3">int</name>
</type>
...
</function_decl>
...
</unit>
pos:end of the definition of say_hello is 15:0 instead of 12:0.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
...
<function pos:start="9:1" pos:end="15:0">
...
<name pos:start="9:13" pos:end="9:21">say_hello</name>
...
</function>
...
</unit>
These issues seem to be manifestations of srcML/srcML#1697.
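To inspect the spans that srcml reports, the pos:start and pos:end attributes can be read with Python's standard XML parser. The sketch below is an assumption-laden illustration: the namespace URIs are the ones srcML output typically declares on <unit> and should be checked against your own parse, `reported_spans` is a hypothetical helper, and flagging an end column of 0 is only a heuristic suggested by the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed from typical srcML output; confirm them against the
# xmlns attributes on the <unit> element of your own parse.
SRC = "http://www.srcML.org/srcML/src"
POS = "http://www.srcML.org/srcML/position"


def reported_spans(srcml_text):
    """Yield (tag, name, start, end) for each function declaration or
    definition, where start and end are (line, column) tuples."""
    root = ET.fromstring(srcml_text)
    for tag in ("function_decl", "function"):
        for element in root.iter(f"{{{SRC}}}{tag}"):
            # Direct child <name>, not the <name> nested inside <type>.
            name = element.find(f"{{{SRC}}}name")
            start = tuple(int(p) for p in element.get(f"{{{POS}}}start").split(":"))
            end = tuple(int(p) for p in element.get(f"{{{POS}}}end").split(":"))
            yield tag, "".join(name.itertext()), start, end


# Trimmed from the parse of the declaration of add shown in this issue.
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src"
      xmlns:pos="http://www.srcML.org/srcML/position">
<function_decl pos:start="3:1" pos:end="6:0"><type pos:start="3:1" pos:end="3:3"><name pos:start="3:1" pos:end="3:3">int</name></type> <name pos:start="3:5" pos:end="3:7">add</name><parameter_list pos:start="3:8" pos:end="3:21">(int a, int b)</parameter_list>;</function_decl>
</unit>"""

for tag, name, start, end in reported_spans(SAMPLE):
    flag = " (suspicious: end column is 0)" if end[1] == 0 else ""
    print(f"{tag} {name}: {start[0]}:{start[1]} to {end[0]}:{end[1]}{flag}")
```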
Workaround
While the issue is being resolved upstream, the following workarounds can be implemented to overcome it.
- Use the line number from pos:end of <parameter_list> in <function_decl> to get the ending line of a function declaration.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
...
<function_decl pos:start="3:1" pos:end="6:0">
...
<name pos:start="3:5" pos:end="3:7">add</name>
<parameter_list pos:start="3:8" pos:end="3:21">
...
</parameter_list>
;
</function_decl>
...
</unit>
- Use the line number plus one from pos:end of <block_content> in <block> in <function> to get the ending line of a function definition.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
...
<function pos:start="9:1" pos:end="15:0">
...
<name pos:start="9:13" pos:end="9:21">say_hello</name>
...
<block pos:start="10:1" pos:end="15:0">
{
<block_content pos:start="11:9" pos:end="11:35">
...
</block_content>
}
</block>
</function>
...
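</unit>
Unfortunately, the workaround for function definitions is not perfect: adding one to the line from pos:end of <block_content> assumes that the closing } is on the line immediately after the last line of the function block's content. If this assumption is violated, for example when a blank line precedes the closing } or the } shares a line with the last statement, the workaround fails.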
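Both workarounds can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the parser service's implementation: the namespace URIs are assumed from typical srcML output, `line_of` and `end_line` are hypothetical helpers, and falling back to the element's own pos:end when the expected child is absent is a choice made here for the sketch.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed from typical srcML output; verify against your parse.
NS = {
    "src": "http://www.srcML.org/srcML/src",
    "pos": "http://www.srcML.org/srcML/position",
}
POS_END = "{" + NS["pos"] + "}end"


def line_of(position):
    """Extract the line number from a 'line:column' position string."""
    return int(position.split(":")[0])


def end_line(element):
    """Ending line of a <function_decl> or <function>, using the workarounds."""
    tag = element.tag.rsplit("}", 1)[-1]
    anchor, offset = None, 0
    if tag == "function_decl":
        anchor = element.find("src:parameter_list", NS)
    elif tag == "function":
        # Assumes the closing brace sits on the line after the block content.
        anchor = element.find("src:block/src:block_content", NS)
        offset = 1
    if anchor is None or anchor.get(POS_END) is None:
        return line_of(element.get(POS_END))  # fall back to the reported end
    return line_of(anchor.get(POS_END)) + offset


# Trimmed from the parses shown in this issue; elided parts are omitted.
SAMPLE = """<unit xmlns="http://www.srcML.org/srcML/src"
      xmlns:pos="http://www.srcML.org/srcML/position">
<function_decl pos:start="3:1" pos:end="6:0"><name pos:start="3:5" pos:end="3:7">add</name><parameter_list pos:start="3:8" pos:end="3:21">...</parameter_list>;</function_decl>
<function pos:start="9:1" pos:end="15:0"><name pos:start="9:13" pos:end="9:21">say_hello</name>
<block pos:start="10:1" pos:end="15:0">{<block_content pos:start="11:9" pos:end="11:35">...</block_content>
}</block></function>
</unit>"""

root = ET.fromstring(SAMPLE)
targets = {"{" + NS["src"] + "}function_decl", "{" + NS["src"] + "}function"}
ends = {}
for element in root.iter():
    if element.tag in targets:
        name = "".join(element.find("src:name", NS).itertext())
        ends[name] = end_line(element)
print(ends)  # {'add': 3, 'say_hello': 12}
```

For the example file, this yields line 3 for add and line 12 for say_hello, which are the ending lines expected for the declaration and the definition.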
Summary
With the workarounds implemented, the functions identified by the parser service agree more closely with those identified by clang: the total number of functions missed by srcml drops from 10,178 to 9,286, and the median drops from 1 to 0. The tables below summarize the number of functions missed by srcml (relative to the functions identified by clang) without and with the workarounds.
# Functions Missed by srcml Without Workaround
| Minimum | Mean | Median | Maximum | Total | Variance |
|---|---|---|---|---|---|
| 0.00 | 3.17 | 1.00 | 1,261.00 | 10,178.00 | 751.07 |
# Functions Missed by srcml With Workaround
| Minimum | Mean | Median | Maximum | Total | Variance |
|---|---|---|---|---|---|
| 0.00 | 2.89 | 0.00 | 1,260.00 | 9,286.00 | 754.83 |
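The relative improvement implied by the two totals, and the kind of summary shown in the tables, can be reproduced with a short sketch. The totals are taken from the tables above; the `summarize` helper, the idea that the statistics are computed over per-file counts, the use of sample variance, and the example counts are all assumptions made for illustration.

```python
import statistics

# Totals taken from the two tables above.
total_without = 10_178
total_with = 9_286

reduction = (total_without - total_with) / total_without
print(f"Fewer functions missed: {total_without - total_with} ({reduction:.1%})")


def summarize(missed_counts):
    """Summary statistics over a list of missed-function counts.

    Assumes one count per file and sample variance; the issue does not
    state which unit or variance definition was used.
    """
    return {
        "Minimum": min(missed_counts),
        "Mean": statistics.mean(missed_counts),
        "Median": statistics.median(missed_counts),
        "Maximum": max(missed_counts),
        "Total": sum(missed_counts),
        "Variance": statistics.variance(missed_counts),
    }


# Hypothetical per-file counts, not data from the FFmpeg comparison.
example = summarize([0, 1, 0, 4])
print(example)
```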