@tonywu1999 tonywu1999 commented Apr 4, 2025

Motivation and Context

https://groups.google.com/g/msstats/c/8OrKxfxMxOo - The PD converter does not remove the QuanInfo column from the input data frame. QuanInfo is a PD-specific column that is not referenced anywhere else in the code, but it causes problems when summarizing multiple PSMs:

Workflow:

  • A common edge case when summarizing with max: if duplicate PSMs have equal max intensities, we cannot determine which PSM to use to summarize a feature.
  • In that case, we summarize the PSMs into a feature by taking the mean of the PSMs.
  • However, if QuanInfo is included, the code may mistake two duplicate PSMs for PSMs of different features, leaving duplicate PSMs in the input data.
  • proteinSummarization / dataProcess crashes when there are multiple PSMs of the same feature.
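
The failure mode in the workflow above can be illustrated with a small, self-contained sketch. This is not MSstats code (MSstats is an R package): the helper `collapse`, the column names other than QuanInfo, and all row values are hypothetical, and the tie handling is simplified to averaging every group of rows that share the same key.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical PSM rows. Only the column name "QuanInfo" comes from the PR;
# the other columns and every value are made up for illustration.
psms = [
    {"Feature": "PEPTIDEK_2", "Run": "run1", "QuanInfo": "A", "Intensity": 100.0},
    {"Feature": "PEPTIDEK_2", "Run": "run1", "QuanInfo": "B", "Intensity": 100.0},
]


def collapse(rows, key_cols):
    """Average the intensities of all rows that share the same key columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        groups[key].append(row["Intensity"])
    return {key: mean(values) for key, values in groups.items()}


# With QuanInfo in the key, the two tied PSMs look like different features,
# so both survive and the feature still has two rows in the same run.
with_quaninfo = collapse(psms, ["Feature", "Run", "QuanInfo"])

# Without QuanInfo, the tied PSMs fall into one group and collapse to one row.
without_quaninfo = collapse(psms, ["Feature", "Run"])

print(len(with_quaninfo), len(without_quaninfo))
```

In this sketch the stray column only matters because it ends up in the grouping key, which is why removing it before summarization avoids the leftover duplicates.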

I've determined the best solution is to remove QuanInfo from the input data frame during conversion.

Changes

  • Remove the QuanInfo column from the PD input table after cleanup, since it is no longer needed.
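
As a rough sketch of the idea behind this change, the snippet below drops a named column from a list of row dictionaries. It is an illustration only, not the converter's implementation: `drop_column`, the column names other than QuanInfo, and the row values are hypothetical.

```python
def drop_column(rows, column):
    """Return copies of the rows without `column`; rows lacking it are unchanged."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


# Hypothetical cleaned PD input; the second row has no QuanInfo entry at all.
pd_input = [
    {"PSM": "PEPTIDEK_2", "Run": "run1", "QuanInfo": "A", "Intensity": 100.0},
    {"PSM": "PEPTIDEK_2", "Run": "run1", "Intensity": 100.0},
]

cleaned = drop_column(pd_input, "QuanInfo")
print(cleaned)
```

The helper returns new dictionaries rather than mutating its input, so the original rows remain available for comparison in a test.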

Testing

Fixed existing unit tests

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@tonywu1999 tonywu1999 merged commit 62fbca2 into devel Apr 7, 2025
1 check passed
@tonywu1999 tonywu1999 deleted the pd-fix branch April 7, 2025 14:01