Skip to content

Comments

Rewrite of package#16

Open
jakobnissen wants to merge 9 commits intomainfrom
typed_refactor
Open

Rewrite of package#16
jakobnissen wants to merge 9 commits intomainfrom
typed_refactor

Conversation

@jakobnissen
Copy link
Member

This PR is a complete rewrite of the package. It was motivated by the old code being flaky and brittle, and hard to understand. When digging into why, the package relied on some shady foundations, which this PR reworks from the bottom up:

Use of pandas causes vague parsing

  • Pandas parsing does not validate that the columns are of the expected type, instead it just changes the type. If strict typing is enabled, the error message is not context-aware, e.g. it will say "invalid integer", instead of something useful like "on line 9, so-and-so contig has an invalid TaxID which cannot be represented as an integer".
  • The parser, being generic CSV parsers, can not uphold invariants of this code - for example, this program will break if a Metabuli clade name contains a semicolon, but obviously a generic TSV parser doesn't check for these things.

Use of dataframes removes all type information

Once parsing is done, all type information is erased. That makes it hard, when reading the code, to understand what data is being loaded, and to understand whether the data is transformed correctly. This is the underlying cause of stuff like #8

Underspecified variants of the data being loaded

For example, what's the content of data/clades.tsv? What does it assume? This was not encoded into the source code, and therefore, these assumptions were violated. For example, on main now, using the -m flag causes an internal error.

Design in this PR

  • In this PR, I've leaved heavily into using specific types to check these assumptions.
  • I've removed the Pandas and Numpy dependencies
  • I've added a test suite

To review it, I suggest you read the new CLAUDE.md, or ask a chat bot to summarize the code.

@jakobnissen jakobnissen requested a review from sgalkina January 29, 2026 15:12
@jakobnissen
Copy link
Member Author

@sgalkina I'm well aware that this is the PR from hell to review. If you want to do it, I suggest:

  • Reviewing it as if it is a whole new package, since little code has survived
  • Using an LLM to help with the review, summarizing the changes. Check out the CLAUDE.md file (I used an LLM to generate the tests, not the rest)
  • Trying the CLI out using the files you sent me to verify the overall workings is the same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant