Line Range Reported by SrcML Incorrectly Includes Blank Lines and Preprocessor Lines #20

@nuthanmunaiah

Description

In refactoring the parser service to replace clang as the source code parser, we chose to use SrcML. However, in comparing the functions identified by clang with those identified by srcml, we observed certain differences. In this comparison, each function was uniquely identified by a string of the form <name>@<begin>:<end>, where <name> is the name of the function and <begin> and <end> are the line numbers at which the declaration or definition of the function begins and ends, respectively. We used FFmpeg at 20fed2f as the subject of the comparison.
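
The comparison described above can be sketched as a set difference over identifier strings. This is a minimal illustration, not the parser service's code: the helper names `identifier` and `missed` and the (name, begin, end) tuple shape are assumptions made for the example.

```python
def identifier(name, begin, end):
    """Build the <name>@<begin>:<end> string that identifies a function."""
    return f"{name}@{begin}:{end}"


def missed(clang_functions, srcml_functions):
    """Return identifiers that clang reported but srcml did not.

    Both arguments are iterables of (name, begin, end) tuples; this
    shape is hypothetical and chosen only for the illustration.
    """
    clang_ids = {identifier(*f) for f in clang_functions}
    srcml_ids = {identifier(*f) for f in srcml_functions}
    return sorted(clang_ids - srcml_ids)
```

Because the ending line is part of the identifier, a function whose name and starting line agree in both parsers still counts as missed when the reported ending lines differ.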

We manually reviewed a random sample of 25 files in which srcml missed at least one function that clang identified. In the review, we discovered that srcml incorrectly reports blank lines and preprocessor lines that follow a function declaration or definition as being part of that declaration or definition. The example below demonstrates the issue.

Example

  1 #include<stdio.h>
  2 
  3 int add(int a, int b);
  4 
  5 #define MESSAGE "message"
  6 
  7 int multiply(int a, int b);
  8 
  9 static void say_hello()
 10 {
 11     printf("hello world!!!\n");
 12 }
 13 
 14 #define PI 3.141592654
 15 
 16 int main()
 17 {
 18     say_hello();
 19     return 0;
 20 }

The example source code was parsed using srcml --position code.c > parse.srcml.txt. The file parse.srcml.txt contains the parse of the source code in the SrcML format. The issues observed in the parse are as follows:

  • pos:end of declaration of add is 6:0 instead of 3:0.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
  ...
  <function_decl pos:start="3:1" pos:end="6:0">
    <type pos:start="3:1" pos:end="3:3">
      <name pos:start="3:1" pos:end="3:3">int</name>
    </type>
    ...
  </function_decl>
  ...
</unit>
  • pos:end of definition of say_hello is 15:0 instead of 12:0.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
  ...
  <function pos:start="9:1" pos:end="15:0">
    ...
    <name pos:start="9:13" pos:end="9:21">say_hello</name>
    ...
  </function>
  ...
</unit>

These issues seem to be manifestations of srcML/srcML#1697.

Workaround

While the issue is being resolved upstream, the following workarounds may be implemented to overcome it.

  • Use the line number from pos:end of <parameter_list> in <function_decl> as the ending line of a function declaration.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
  ...
  <function_decl pos:start="3:1" pos:end="6:0">
    ...
    <name pos:start="3:5" pos:end="3:7">add</name>
    <parameter_list pos:start="3:8" pos:end="3:21">
      ...
    </parameter_list>
    ;
  </function_decl>
  ...
</unit>
  • Use the line number from pos:end of <block_content> (in <block> in <function>) plus one as the ending line of a function definition.
<?xml version="1.0" encoding="UTF-8"?>
<unit ...>
  ...
  <function pos:start="9:1" pos:end="15:0">
    ...
    <name pos:start="9:13" pos:end="9:21">say_hello</name>
    ...
    <block pos:start="10:1" pos:end="15:0">
      {
      <block_content pos:start="11:9" pos:end="11:35">
        ...
      </block_content>
      }
    </block>
  </function>
  ...
</unit>
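
The two workarounds above can be sketched with Python's standard xml.etree.ElementTree. This is a minimal sketch under stated assumptions, not the parser service's implementation: the namespace URIs are assumed to match those declared on the <unit> element of the srcml output, the name `function_ranges` is hypothetical, and falling back to the element's own pos:end when the inner element has no position is a choice made for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify them against the <unit> element of your output.
SRC = "{http://www.srcML.org/srcML/src}"
POS = "{http://www.srcML.org/srcML/position}"


def line_of(element, attribute):
    """Return the line part of a pos:start or pos:end value such as '3:21'."""
    value = element.get(POS + attribute)
    return int(value.split(":")[0]) if value else None


def function_ranges(srcml_text):
    """Return (name, begin, end) tuples with the workaround applied."""
    root = ET.fromstring(srcml_text)
    ranges = []
    for node in root.iter():
        if node.tag == SRC + "function_decl":
            # Declaration: end at the line where the parameter list ends.
            inner, offset = node.find(SRC + "parameter_list"), 0
        elif node.tag == SRC + "function":
            # Definition: end one line after the block content ends.
            inner, offset = node.find(f"{SRC}block/{SRC}block_content"), 1
        else:
            continue
        name = node.find(SRC + "name")
        begin = line_of(node, "start")
        end = line_of(inner, "end") if inner is not None else None
        # Fall back to the reported pos:end when no inner position exists.
        end = end + offset if end is not None else line_of(node, "end")
        if name is not None and begin is not None:
            ranges.append(("".join(name.itertext()), begin, end))
    return ranges
```

Passing the contents of parse.srcml.txt to `function_ranges` would yield one tuple per declaration or definition, in document order.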

Unfortunately, the workaround is not perfect because adding one to the line from pos:end in <block_content> assumes that the ending } is on the line after the end of the content of the function block. If this assumption is violated, the workaround fails.
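
Whether the assumption holds for a given function can be checked against the source text. The sketch below is an illustration only and not part of the workaround as implemented; the helper name `closing_brace_follows` is hypothetical, and it assumes the 1-based line from pos:end of <block_content> is already known.

```python
def closing_brace_follows(source, content_end_line):
    """Check whether '}' starts the line right after the block content.

    content_end_line is the 1-based line number taken from pos:end of
    <block_content>. Returns False when the next line is missing or
    does not begin with a closing brace.
    """
    lines = source.splitlines()
    next_index = content_end_line  # 0-based index of line content_end_line + 1
    if next_index >= len(lines):
        return False
    return lines[next_index].lstrip().startswith("}")
```

A blank line or comment between the last statement and the closing brace, or a body written on a single line, makes the check return False, which are the cases where adding one to the line number gives the wrong ending line.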

Summary

With the workaround implemented, the functions identified by the parser service agree more closely with those identified by clang. The tables below summarize the number of functions missed by srcml (relative to the functions identified by clang) without and with the workaround. The total number of missed functions dropped from 10,178 to 9,286, a reduction of 892.

# Functions Missed by srcml Without Workaround

Minimum  Mean  Median  Maximum   Total      Variance
0.00     3.17  1.00    1,261.00  10,178.00  751.07

# Functions Missed by srcml With Workaround

Minimum  Mean  Median  Maximum   Total      Variance
0.00     2.89  0.00    1,260.00  9,286.00   754.83

Labels

limitation: Issue is a limitation that is known.