
Struggling to replicate the training curve #44

@PandaCodes

Description


Hi! Thank you for the amazing paper and for open-sourcing the code!

I was trying to reproduce the training process and ran into a few difficulties.
For some reason, the provided datasets, when loaded, have all grain masses equal to zero:
[Screenshots: loaded dataset inspection showing all grain masses equal to zero]
This, obviously, breaks the program wherever the masses are divided by.
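For reference, a quick guard like the following catches the problem before any division happens. This is a hypothetical sketch (the function name and the way masses are extracted from the dataset are my own, not from the repo):

```python
import numpy as np

def check_masses(masses):
    """Return True if every bead mass is positive; report offenders otherwise.

    `masses` is assumed to be the per-bead mass array pulled out of the
    loaded dataset (hypothetical accessor, adapt to the actual dataset API).
    """
    masses = np.asarray(masses, dtype=float)
    bad = np.count_nonzero(masses <= 0)
    if bad:
        print(f"{bad} of {masses.size} beads have non-positive mass")
        return False
    return True
```

Running this right after loading makes the zero-mass issue visible immediately instead of surfacing later as a division error.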

I tried setting the masses to None for each molecule, since in that case torchmd guesses them during Parameters building. However, it sets them all to 12 (which, I guess, corresponds to C-alpha carbon atoms). This is probably not how it is supposed to work, since it implies the wrong physics, and the training loss gets stuck around 2.8:
[Screenshot: training loss plateauing around 2.8]

After that, I mapped the resnames to the known amino-acid masses. This did improve things: the training loss started from 2 and decreased to 1 (and is still slowly on its way down):
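The mapping I used looks roughly like this. It is a sketch, not the repo's code: the mass values are approximate average amino-acid residue masses in Da (assuming one CG bead per residue), and the table below is truncated and should be extended to all twenty residues:

```python
# Approximate average residue masses (Da); extend to all 20 amino acids.
AA_MASSES = {
    "ALA": 71.08, "GLY": 57.05, "LEU": 113.16,
    "SER": 87.08, "VAL": 99.13, "TRP": 186.21,
}

def masses_from_resnames(resnames, default=12.0):
    """Map residue names to bead masses.

    Falls back to `default` (12, the carbon mass torchmd guesses when
    masses are None) for names not in the table.
    """
    return [AA_MASSES.get(name.upper(), default) for name in resnames]
```

Feeding these masses into Parameters building instead of the zeros from the dataset is what produced the improved curve below.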
[Screenshot: training loss decreasing from 2 toward 1]
However, the training curve looks nothing like the one in the example notebook, where it starts around 5 and drops to almost zero.

I am using train_ff.yaml with only "log_dir" and "device" modified.

Do you have an idea of what might be wrong?
From what I can see, the input.yaml of the newly trained model differs from the one in data/models/fastfolders, particularly in fields such as max_num_neighbors and some others, so my next step would be to try using the same values, I guess.

I would much appreciate it if you could help me replicate the results. I am eager to use the trajectory reweighting method with other CG potentials and slightly extended CG systems, and I really hope your implementation will help me a lot with that.

P.S. In order to launch, I also had to resolve the environment (the provided environment.yaml does not work as-is: it is missing certain packages, and some of the listed ones conflict with each other) and add a "timestep" key to the logger. I can open a PR with the environment.yaml that worked for me.

Edit: The run with the mapped amino-acid masses eventually approached zero loss after 5k steps. Still, it would be great to make the optimisation converge faster, as in the example notebook.
