Strengthen the Validity of Probing and Results #7
Description
Here is what needs to be addressed:
- The validity of the <d, c, u> tuples forms the foundation of this study, and while the paper validates them in RQ1, the current evaluation does not seem sufficient to fully establish their reliability. Specifically, the average similarity of similar and dissimilar pairs, as currently presented, is too coarse to accurately reflect the quality of the <d, c, u> tuples, since it depends heavily on the distribution of the data points. I suggest providing more fine-grained results that offer deeper insight into the validity of <d, c, u>. For example, the authors could construct triples <c1, c2, c3>, where c1 and c2 are similar while c1 and c3 are dissimilar, and compare the cosine similarity of <d, c, u> between the pairs within each triple. Additionally, statistical testing could be employed to validate the significance of the results more rigorously. Furthermore, the design of <d, c, u> was changed in RQ2 to probe the models for capturing abstractions of syntactic information. However, this design change was not validated in RQ1, which raises concerns about the validity of the findings in RQ2. To address this, I recommend validating the modified <d, c, u> design as part of RQ1 to strengthen the study.
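The suggested triple-based comparison could be sketched roughly as follows; this is a minimal illustration under my assumptions (not the paper's code), where `triples` is a hypothetical list of <d, c, u> vector triples and a simple sign test stands in for the statistical testing:

```python
import math

def cosine(a, b):
    """Cosine similarity between two <d, c, u> vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def triple_consistency(triples):
    """triples: list of (c1, c2, c3), where c1/c2 are similar and c1/c3
    are dissimilar. Returns the fraction of triples in which the similar
    pair scores higher, plus a one-sided sign-test p-value (null
    hypothesis: the ordering within a triple is a coin flip)."""
    n = len(triples)
    wins = sum(cosine(c1, c2) > cosine(c1, c3) for c1, c2, c3 in triples)
    p_value = sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins / n, p_value
```

A paired test on the similarity values themselves (e.g. a Wilcoxon signed-rank test) would be a stronger alternative to the sign test sketched here.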
- Finding 2 concludes that using a more abstract data representation significantly enhances the syntactic information captured by code representation models. While this finding is intriguing, it is not yet fully validated. The claim is based on abstracting the <d, c, u> tuples, which may reduce the difficulty of the <d, c, u> prediction task and thereby inflate the scores. To validate this finding, I suggest conducting additional experiments: for instance, if the finding holds, models trained on abstracted training data should yield better results, which could be tested explicitly.
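The concern that abstraction may simply make the task easier can be made concrete by comparing the entropy of the label space before and after abstraction; a minimal sketch with entirely hypothetical labels (the actual abstraction in the paper may differ):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a label distribution; lower entropy
    roughly corresponds to an easier prediction task."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical usage labels (u in <d, c, u>) before and after abstraction:
concrete = ["foo.bar()", "baz.qux()", "foo.bar()", "x.y()", "a.b()"]
abstracted = ["CALL"] * len(concrete)
```

If probe scores rise after abstraction, part of the gain may come from this collapsed label space rather than from richer syntactic information in the embeddings, which is why the suggested training-data experiment matters.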
- At first glance, the findings for probe #2 in RQ2 appear interesting. However, the motivation for transforming the <d, c, u> tuples into <c, u> is not stated, and it is unclear what this abstract structure represents. As shown in Listing 2, the new <c, u> structure is quite simplistic, which could allow the probe itself to learn the target information rather than extract it from the embeddings. Consequently, the scores reported in Table 4 may not be representative of the information actually encoded in the models' embeddings. Moreover, in contrast to <d, c, u>, the validity of the <c, u> vectors for probing was never verified. I am therefore left skeptical about the conclusions drawn for RQ2.
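One way to address this concern is a selectivity-style control (in the spirit of control tasks from the probing literature): run the same probe on random embeddings of the same shape, and if the score barely drops, the probe rather than the embeddings is doing the work. A minimal sketch under my assumptions, using a deliberately simple nearest-centroid probe; the function names and data shapes are hypothetical:

```python
import numpy as np

def centroid_probe_accuracy(train_X, train_y, test_X, test_y):
    """A deliberately weak probe: classify each test embedding by its
    nearest class centroid computed from the training embeddings."""
    classes = sorted(set(train_y))
    centroids = np.stack(
        [train_X[np.array(train_y) == c].mean(axis=0) for c in classes])
    preds = [classes[int(np.argmin(np.linalg.norm(centroids - x, axis=1)))]
             for x in test_X]
    return float(np.mean([p == t for p, t in zip(preds, test_y)]))

def selectivity(train_X, train_y, test_X, test_y, seed=0):
    """Accuracy on real embeddings minus accuracy on random embeddings
    of the same shape; a small gap would suggest the probe itself is
    learning the <c, u> targets."""
    rng = np.random.default_rng(seed)
    real = centroid_probe_accuracy(train_X, train_y, test_X, test_y)
    rand = centroid_probe_accuracy(
        rng.normal(size=train_X.shape), train_y,
        rng.normal(size=test_X.shape), test_y)
    return real, rand, real - rand
```

Reporting such a control alongside Table 4 would make it much easier to judge whether the <c, u> scores reflect information in the embeddings.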