Varlen and testing tweaks #408
Conversation
```python
# Takes ~6s, much more if it needs to compile, reducing the hidden size doesn't help.
@pytest.mark.slow
@pytest.mark.skip("Dropless MoE is broken")
```
btw, torch now comes with grouped_mm, which is much faster than naive looping... huggingface/transformers#42697
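For illustration, here is a minimal sketch of the naive per-expert loop that a grouped GEMM replaces. All shapes and names are made up for this example, and the commented-out grouped call is an assumption about recent PyTorch builds, not a confirmed API:

```python
import torch

# Illustrative shapes only: 32 tokens routed across 4 experts,
# with tokens already sorted so each expert's slice is contiguous.
num_experts, d_model, d_ff = 4, 64, 128
tokens = torch.randn(32, d_model)
expert_weights = torch.randn(num_experts, d_model, d_ff)
offsets = torch.tensor([8, 16, 24, 32])  # end index of each expert's slice

# Naive looping: one matmul (and one kernel launch) per expert.
outs, start = [], 0
for e in range(num_experts):
    end = int(offsets[e])
    outs.append(tokens[start:end] @ expert_weights[e])
    start = end
out = torch.cat(outs)

# A grouped GEMM runs all expert slices in a single kernel. On recent PyTorch
# builds this is exposed as a private op along the lines of (exact name and
# signature depend on the version, so treat this as an assumption):
# out = torch._grouped_mm(tokens, expert_weights, offs=offsets.to(torch.int32))
```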
I think it's very similar to our sparse linear kernel, though it doesn't seem to use padding, so it could be simpler. However, most of the difficulty is in the preparation and sparse data copy, so I'm not sure it would help much by itself.
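For context, a rough sketch of what that preparation and sparse data copy involve, assuming top-1 routing; this is illustrative and not Fast-LLM's actual kernel code:

```python
import torch

# Top-1 routing for simplicity: one expert index per token.
num_experts, num_tokens, d_model = 4, 32, 64
expert_ids = torch.randint(0, num_experts, (num_tokens,))
tokens = torch.randn(num_tokens, d_model)

# Preparation: sort tokens by expert and build per-expert end offsets.
order = torch.argsort(expert_ids)
counts = torch.bincount(expert_ids, minlength=num_experts)
offsets = torch.cumsum(counts, dim=0)

# Sparse data copy: gather tokens into expert-contiguous order; after the
# per-expert matmuls, outputs are scattered back with the inverse permutation.
sorted_tokens = tokens[order]
inverse = torch.empty_like(order)
inverse[order] = torch.arange(num_tokens)
```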
Thanks for making this better and better! There was an issue where VLM training would crash if the model debug level was > 0, and this commit was an attempt to fix it. I think you're already discovering that there are issues with distributed training...
I saw this in #409 after it was merged and posted some comments. Concerning the vision dim, I think I missed a few bugs because I used the same hidden size for the vision and text models in the tests; I will address that.
tscholak left a comment
LGTM!
✨ Description
- Add varlen implementation for Mamba (based on hybrid_dev). Currently broken, see [bug] Can't compile varlen mamba with base image 25.11 #416.
- Extend `test_varlen`, also test attention and mamba.
- Add a `get_stage` util to allow simpler usage.
- Mark MoE tests as broken (moved to Ensure compatibility between models and datasets #402).
- Fix `test_reverse_kl`, but `_torch_reverse_kl_forward_backward` is using `loss.backward`, which is unlikely to work in a distributed setting (see the sketch below).

Current test failures:
Don't know about the first two; the other ones are gradient mismatches for reverse KL.
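For reference, a minimal single-process sketch of a reverse-KL loss of the kind discussed above; `student_logits` and `teacher_logits` are placeholder names, and this is not `_torch_reverse_kl_forward_backward` itself:

```python
import torch

# Reverse KL, KL(q || p): q is the student distribution, p the teacher.
student_logits = torch.randn(8, 1000, requires_grad=True)
teacher_logits = torch.randn(8, 1000)

log_q = student_logits.log_softmax(dim=-1)
log_p = teacher_logits.log_softmax(dim=-1)
loss = (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean()

# loss.backward() traverses the full autograd graph on this rank, which is
# the pattern questioned above: once activations and gradients are sharded
# across ranks, a plain backward() call is unlikely to produce correct grads.
loss.backward()
```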