Conversation

@jlamypoirier jlamypoirier commented Dec 9, 2025

✨ Description

  • Extract all varlen preprocessing into a common util (see the sketch after this list).
  • Add a varlen implementation for Mamba (based on hybrid_dev); currently broken, see [bug] Can't compile varlen mamba with base image 25.11 #416
  • Simplify test_varlen, also test attention and mamba
  • Improve get_stage util to allow simpler usage
  • Move / rename some files in layer tests.
  • Mark MoE tests as broken (moved to Ensure compatibility between models and datasets #402)
  • Use Assert instead of torch.testing (more details on error, slightly better formula)
  • Remove global seeds in tests so we get more visibility on edge cases
  • Improve cross-entropy / reverse KL tests, add a gradient check. It's failing for reverse KL, @oleksost. Not sure about test_reverse_kl, but _torch_reverse_kl_forward_backward uses loss.backward, which is unlikely to work in a distributed setting.
  • Other minor tweaks.
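
A minimal standalone sketch of the varlen preprocessing being centralized (illustrative names, not the actual util's API): per-document lengths in a packed batch become the cumulative-offset metadata that varlen attention / Mamba kernels consume.

```python
import torch


def build_cu_seqlens(document_lengths: torch.Tensor) -> tuple[torch.Tensor, int]:
    # Per-document lengths, e.g. [3, 5, 4] for a packed batch of 12 tokens, become
    # cumulative offsets [0, 3, 8, 12] plus the maximum document length, the two
    # pieces of metadata varlen kernels typically take.
    cu_seqlens = torch.nn.functional.pad(document_lengths.cumsum(0, dtype=torch.int32), (1, 0))
    return cu_seqlens, int(document_lengths.max())


cu_seqlens, max_seqlen = build_cu_seqlens(torch.tensor([3, 5, 4]))
assert cu_seqlens.tolist() == [0, 3, 8, 12] and max_seqlen == 5
```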

Current test failures:

FAILED tests/models/test_checkpoint.py::test_huggingface_model[apriel2]@dependency_group_4 - ValueError: Unrecognized configuration class <class 'transformers_modules.apriel2_from_distributed.configuration...
FAILED tests/models/test_checkpoint.py::test_huggingface_model[apriel2_text_all_hybrid]@dependency_group_3 - ValueError: Comparison failed (1 errors)
FAILED tests/functional/test_cross_entropy.py::test_reverse_kl[logits-False] - AssertionError: Rms diff too big (4.44e-07 > 1.00e-08, scale = 2.29e-07) between tensors tensor([[ 4.0978e-07, -...
FAILED tests/functional/test_cross_entropy.py::test_reverse_kl[logits-True] - AssertionError: Rms diff too big (6.34e-07 > 1.00e-08, scale = 3.26e-07) between tensors tensor([[ 0.0000e+00,  ...
FAILED tests/functional/test_cross_entropy.py::test_distillation_losses - torch.multiprocessing.spawn.ProcessRaisedException:

I don't know about the first two; the others are gradient mismatches for reverse KL.
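
For reference, the gradient check amounts to comparing autograd against the closed-form gradient of reverse KL: with student probabilities q = softmax(z) and teacher probabilities p, the gradient of sum_i q_i (log q_i - log p_i) w.r.t. z_k is q_k * ((log q_k - log p_k) - KL). A standalone sketch of that comparison (not the test's actual code):

```python
import torch

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)

student_log_probs = torch.log_softmax(student_logits, dim=-1)
teacher_log_probs = torch.log_softmax(teacher_logits, dim=-1)
student_probs = student_log_probs.exp()

# Reverse KL per sample: KL(student || teacher) = sum_i q_i * (log q_i - log p_i).
reverse_kl = (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1)
reverse_kl.sum().backward()

# Closed-form gradient w.r.t. the student logits: q_k * ((log q_k - log p_k) - KL).
with torch.no_grad():
    expected_grad = student_probs * (student_log_probs - teacher_log_probs - reverse_kl.unsqueeze(-1))
assert torch.allclose(student_logits.grad, expected_grad, atol=1e-6)
```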


# Takes ~6s, much more if it needs to compile, reducing the hidden size doesn't help.
@pytest.mark.slow
@pytest.mark.skip("Dropless MoE is broken")
Collaborator:

btw, torch now comes with grouped_mm, which is much faster than naive looping... huggingface/transformers#42697
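
For illustration (hypothetical shapes, and the grouped op is private so its name/signature depends on the torch build), the naive looping being compared against is roughly one small matmul per expert over an expert-sorted token buffer:

```python
import torch

num_experts, tokens_per_expert, d_in, d_out = 4, 8, 16, 32
# Tokens already sorted by expert; offsets[i] marks where expert i's tokens end.
x_sorted = torch.randn(num_experts * tokens_per_expert, d_in)
expert_weights = torch.randn(num_experts, d_in, d_out)
offsets = torch.arange(1, num_experts + 1, dtype=torch.int32) * tokens_per_expert

# Naive looping: one small matmul per expert, which under-utilizes the GPU.
out = torch.empty(x_sorted.shape[0], d_out)
start = 0
for expert, end in enumerate(offsets.tolist()):
    out[start:end] = x_sorted[start:end] @ expert_weights[expert]
    start = end

# The grouped alternative fuses this loop into a single kernel call, roughly
# torch._grouped_mm(x_sorted, expert_weights, offs=offsets); availability and
# exact signature vary across torch versions.
```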

Collaborator Author:

I think it's very similar to our sparse linear kernel, though it doesn't seem to use padding, so it could be simpler. However, most of the difficulty is in the preparation and the sparse data copy, so I'm not sure it would help much by itself.
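
To make the "preparation and sparse data copy" point concrete, a generic dropless-MoE sketch (not the repo's actual sparse kernel) of the work surrounding the expert matmuls:

```python
import torch

num_tokens, num_experts, hidden = 12, 4, 16
x = torch.randn(num_tokens, hidden)
expert_index = torch.randint(num_experts, (num_tokens,))  # top-1 routing, for simplicity

# Preparation: sort tokens by expert and build per-expert offsets.
order = expert_index.argsort()
offsets = torch.bincount(expert_index, minlength=num_experts).cumsum(0)

# Sparse data copy: gather into an expert-contiguous layout, run the (looped or
# grouped) expert matmuls on x_sorted, then scatter the results back to token order.
x_sorted = x[order]
out_sorted = x_sorted  # stand-in for the expert matmul output
out = torch.empty_like(out_sorted)
out[order] = out_sorted
```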

Base automatically changed from jlp/consistent_preprocessing to main December 10, 2025 01:33
@tscholak (Collaborator):

thanks for making this better and better!
you may have missed it, but I need your feedback on this commit as well, @jlamypoirier:

68d1516

there was an issue where vlm training would crash if the model debug level > 0. this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...

@jlamypoirier jlamypoirier marked this pull request as ready for review December 12, 2025 00:53
@jlamypoirier (Collaborator Author):

> thanks for making this better and better! you may have missed it, but I need your feedback on this commit as well, @jlamypoirier:
>
> 68d1516
>
> there was an issue where vlm training would crash if the model debug level > 0. this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...

I saw this in #409 after it was merged and posted some comments. Concerning the vision dim, I think I missed a few bugs because I used the same hidden size for the vision and text models in the tests; will address.

@tscholak tscholak left a comment

LGTM!

@jlamypoirier jlamypoirier merged commit 200f43a into main Dec 12, 2025
3 of 4 checks passed
@jlamypoirier jlamypoirier deleted the jlp/varlen_tweaks branch December 12, 2025 02:04