Conversation

@jlamypoirier jlamypoirier commented Dec 9, 2025

✨ Description

  • Extract all varlen preprocessing into a common util (see the sketch after this list).
  • Add a varlen implementation for Mamba (based on hybrid_dev); currently broken, see [bug] Can't compile varlen mamba with base image 25.11 #416
  • Simplify test_varlen, also test attention and mamba
  • Improve get_stage util to allow simpler usage
  • Move / rename some files in layer tests.
  • Mark MoE tests as broken (moved to Ensure compatibility between models and datasets #402)
  • Use Assert instead of torch.testing (more details on error, slightly better formula)
  • Remove global seeds in tests so we get more visibility on edge cases
  • Improve cross-entropy / reverse KL tests, add a gradient check. It's failing for reverse KL, @oleksost. Not sure about test_reverse_kl, but _torch_reverse_kl_forward_backward uses loss.backward, which is unlikely to work in a distributed setting.
  • Other minor tweaks.
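
A minimal standalone sketch of the varlen preprocessing being centralized (illustrative names, not the actual util's API): per-document lengths in a packed batch become the cumulative-offset metadata that varlen attention / Mamba kernels consume.

```python
import torch


def build_cu_seqlens(document_lengths: torch.Tensor) -> tuple[torch.Tensor, int]:
    # Per-document lengths, e.g. [3, 5, 4] for a packed batch of 12 tokens, become
    # cumulative offsets [0, 3, 8, 12] plus the maximum document length, the two
    # pieces of metadata varlen kernels typically take.
    cu_seqlens = torch.nn.functional.pad(document_lengths.cumsum(0, dtype=torch.int32), (1, 0))
    return cu_seqlens, int(document_lengths.max())


cu_seqlens, max_seqlen = build_cu_seqlens(torch.tensor([3, 5, 4]))
assert cu_seqlens.tolist() == [0, 3, 8, 12] and max_seqlen == 5
```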

Current test failures:

FAILED tests/models/test_checkpoint.py::test_huggingface_model[apriel2]@dependency_group_4 - ValueError: Unrecognized configuration class <class 'transformers_modules.apriel2_from_distributed.configuration...
FAILED tests/models/test_checkpoint.py::test_huggingface_model[apriel2_text_all_hybrid]@dependency_group_3 - ValueError: Comparison failed (1 errors)
FAILED tests/functional/test_cross_entropy.py::test_reverse_kl[logits-False] - AssertionError: Rms diff too big (4.44e-07 > 1.00e-08, scale = 2.29e-07) between tensors tensor([[ 4.0978e-07, -...
FAILED tests/functional/test_cross_entropy.py::test_reverse_kl[logits-True] - AssertionError: Rms diff too big (6.34e-07 > 1.00e-08, scale = 3.26e-07) between tensors tensor([[ 0.0000e+00,  ...
FAILED tests/functional/test_cross_entropy.py::test_distillation_losses - torch.multiprocessing.spawn.ProcessRaisedException:

I don't know about the first two; the others are gradient mismatches for reverse KL.
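
For reference, the gradient check amounts to comparing autograd against the closed-form gradient of reverse KL: with student probabilities q = softmax(z) and teacher probabilities p, the gradient of sum_i q_i (log q_i - log p_i) w.r.t. z_k is q_k * ((log q_k - log p_k) - KL). A standalone sketch of that comparison (not the test's actual code):

```python
import torch

student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)

student_log_probs = torch.log_softmax(student_logits, dim=-1)
teacher_log_probs = torch.log_softmax(teacher_logits, dim=-1)
student_probs = student_log_probs.exp()

# Reverse KL per sample: KL(student || teacher) = sum_i q_i * (log q_i - log p_i).
reverse_kl = (student_probs * (student_log_probs - teacher_log_probs)).sum(dim=-1)
reverse_kl.sum().backward()

# Closed-form gradient w.r.t. the student logits: q_k * ((log q_k - log p_k) - KL).
with torch.no_grad():
    expected_grad = student_probs * (student_log_probs - teacher_log_probs - reverse_kl.unsqueeze(-1))
assert torch.allclose(student_logits.grad, expected_grad, atol=1e-6)
```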


# Takes ~6s, much more if it needs to compile, reducing the hidden size doesn't help.
@pytest.mark.slow
@pytest.mark.skip("Dropless MoE is broken")
Collaborator:

btw, torch now comes with grouped_mm, which is much faster than naive looping... huggingface/transformers#42697
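
For illustration (hypothetical shapes, and the grouped op is private so its name/signature depends on the torch build), the naive looping being compared against is roughly one small matmul per expert over an expert-sorted token buffer:

```python
import torch

num_experts, tokens_per_expert, d_in, d_out = 4, 8, 16, 32
# Tokens already sorted by expert; offsets[i] marks where expert i's tokens end.
x_sorted = torch.randn(num_experts * tokens_per_expert, d_in)
expert_weights = torch.randn(num_experts, d_in, d_out)
offsets = torch.arange(1, num_experts + 1, dtype=torch.int32) * tokens_per_expert

# Naive looping: one small matmul per expert, which under-utilizes the GPU.
out = torch.empty(x_sorted.shape[0], d_out)
start = 0
for expert, end in enumerate(offsets.tolist()):
    out[start:end] = x_sorted[start:end] @ expert_weights[expert]
    start = end

# The grouped alternative fuses this loop into a single kernel call, roughly
# torch._grouped_mm(x_sorted, expert_weights, offs=offsets); availability and
# exact signature vary across torch versions.
```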

Collaborator Author:

I think it's very similar to our sparse linear kernel, though it doesn't seem to use padding, so it could be simpler. However, most of the difficulty is in the preparation and the sparse data copy, so I'm not sure it would help much by itself.
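
To make the "preparation and sparse data copy" point concrete, a generic dropless-MoE sketch (not the repo's actual sparse kernel) of the work surrounding the expert matmuls:

```python
import torch

num_tokens, num_experts, hidden = 12, 4, 16
x = torch.randn(num_tokens, hidden)
expert_index = torch.randint(num_experts, (num_tokens,))  # top-1 routing, for simplicity

# Preparation: sort tokens by expert and build per-expert offsets.
order = expert_index.argsort()
offsets = torch.bincount(expert_index, minlength=num_experts).cumsum(0)

# Sparse data copy: gather into an expert-contiguous layout, run the (looped or
# grouped) expert matmuls on x_sorted, then scatter the results back to token order.
x_sorted = x[order]
out_sorted = x_sorted  # stand-in for the expert matmul output
out = torch.empty_like(out_sorted)
out[order] = out_sorted
```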

Base automatically changed from jlp/consistent_preprocessing to main December 10, 2025 01:33
@tscholak (Collaborator):

thanks for making this better and better!
you may have missed it, but I need your feedback on this commit as well, @jlamypoirier:

68d1516

there was an issue where vlm training would crash if the model debug level > 0. this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...

@jlamypoirier jlamypoirier marked this pull request as ready for review December 12, 2025 00:53
@jlamypoirier (Collaborator Author):

> thanks for making this better and better! you may have missed it, but I need your feedback on this commit as well, @jlamypoirier:
>
> 68d1516
>
> there was an issue where vlm training would crash if the model debug level > 0. this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...

I saw this in #409 after it was merged and posted some comments. Concerning the vision dim, I think I missed a few bugs because I used the same hidden size for the vision and text models in the tests; will address.

@tscholak tscholak left a comment

LGTM!

@jlamypoirier jlamypoirier merged commit 200f43a into main Dec 12, 2025
3 of 4 checks passed
@jlamypoirier jlamypoirier deleted the jlp/varlen_tweaks branch December 12, 2025 02:04