Add Corrected SaProt VH+VL baseline #36
apathithyan wants to merge 7 commits into ginkgobioworks:main
This corrects the previous erroneous logic in model.py by splitting the .pdb files into separate chains, so that SaProt processes the VH and VL sequences individually.
- Ensured that train and heldout sequences use their respective structures for embedding
- Added the BioPython dependency to pixi.toml
- Updated the README
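The chain-splitting step can be sketched as follows. This is a minimal, dependency-free illustration and not the code in model.py: the PR adds BioPython, which presumably handles this step there. The function name `split_pdb_by_chain` and the chain IDs `H` and `L` are assumptions made for the example.

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text: str) -> dict[str, str]:
    """Group ATOM/HETATM records by chain ID (column 22 of the PDB format)."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        # The chain identifier is the 22nd character, i.e. index 21.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    # Emit one small PDB document per chain.
    return {cid: "\n".join(lines) + "\nEND\n" for cid, lines in chains.items()}


# Hypothetical fragment with a heavy chain (H) and a light chain (L).
example_pdb = "\n".join([
    "ATOM      1  CA  GLU H   1      11.104  13.207   9.650  1.00 20.00           C",
    "ATOM      2  CA  VAL H   2      12.560  13.329   9.201  1.00 20.00           C",
    "ATOM      3  CA  ASP L   1      21.344   8.115   4.872  1.00 20.00           C",
])

per_chain = split_pdb_by_chain(example_pdb)
for chain_id, text in per_chain.items():
    print(chain_id, len(text.splitlines()) - 1, "atom records")
```

Each per-chain string could then be written to its own file, so that structural tokens are derived for VH and VL separately.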
Hey, sorry for the delay. I needed to make some changes to your submission to fix some runtime errors that I was having. Could you take a look? I've merged it into main; it was branched from your repo.
… extracted MOE structures for SaProt embedding calculations.
Hey Seth, thanks for taking the time to review. Apologies for the delay on my end. I've pushed an update that should now run without issues. It implements the following:
Implementation Note: The model needs different structure directories for train vs heldout data, but the …
Adds a structure-aware baseline using the SaProt protein language model with Foldseek 3Di structural tokens.
Similar to the ESM2_Ridge philosophy:
- Extracts VH and VL embeddings separately, then concatenates them, for both train and heldout data.
- Uses Ridge Regression to predict the properties.
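The concatenate-then-regress idea can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: random arrays stand in for the SaProt embeddings, a closed-form NumPy solve stands in for whatever Ridge implementation the baseline uses, and all names (`concat_vh_vl`, `fit_ridge`, `predict`, `alpha`) are hypothetical.

```python
import numpy as np


def concat_vh_vl(vh_emb: np.ndarray, vl_emb: np.ndarray) -> np.ndarray:
    """Join per-antibody VH and VL embeddings along the feature axis."""
    return np.concatenate([vh_emb, vl_emb], axis=1)


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    return X @ w + b


# Placeholder data: 20 train and 5 heldout antibodies, 8-dim embeddings per chain.
rng = np.random.default_rng(0)
vh_train, vl_train = rng.normal(size=(20, 8)), rng.normal(size=(20, 8))
vh_held, vl_held = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
y_train = rng.normal(size=20)

X_train = concat_vh_vl(vh_train, vl_train)
X_held = concat_vh_vl(vh_held, vl_held)
w, b = fit_ridge(X_train, y_train, alpha=1.0)
preds = predict(X_held, w, b)
print(X_train.shape, preds.shape)
```

Keeping the train and heldout embeddings in separate arrays mirrors the requirement that each split is embedded from its own structures.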
Note: