My homemade text-to-image model: a frankensteined version of NVIDIA's Sana.
Latest specs:
- 600M-parameter diffusion transformer (Sana architecture)
- Hugging Face's SmolLM2-360M as the text encoder
- MIT HAN Lab's Deep Compression Autoencoder (dc-ae-f32c32)
- Trained on ImageNet-1k
- ...at home, on 4x RTX 3090s
This is a public log of my work in progress.
Alpha-44 samples (more)
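For concreteness, here is a minimal sketch of how the pieces bolt together with the 🧨 diffusers and transformers APIs. The SmolLM2 and DC-AE repo IDs are the public Hugging Face ones; the head split and `caption_channels` below are my assumptions for the hidden_dim=768 config, not copied from the training code.

```python
# Rough sketch of the moving parts (diffusers + transformers APIs).
import torch
from diffusers import AutoencoderDC, SanaTransformer2DModel
from transformers import AutoModel, AutoTokenizer

device = "cuda"

# Text encoder: SmolLM2-360M; last hidden state is used as the caption embedding
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
text_encoder = AutoModel.from_pretrained("HuggingFaceTB/SmolLM2-360M").to(device)

# Deep Compression Autoencoder: 32x spatial downsampling, 32 latent channels,
# so a 256px image becomes an 8x8x32 latent
ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers").to(device)

# Denoiser: hidden_dim = num_attention_heads * attention_head_dim
transformer = SanaTransformer2DModel(
    in_channels=32,
    out_channels=32,
    num_layers=12,             # 28 for the 590M Alpha-42/44/45 runs
    num_attention_heads=12,    # assumed split: 12 * 64 = 768
    attention_head_dim=64,
    num_cross_attention_heads=12,
    cross_attention_head_dim=64,
    cross_attention_dim=768,
    caption_channels=960,      # SmolLM2-360M hidden size
    sample_size=8,             # 256px / 32 (DC-AE downsampling factor)
).to(device)
```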
| Run | Text Encoder | AE | Transformer | Dataset & Training | Compute | Weights | Code | Loss | Samples |
|---|---|---|---|---|---|---|---|---|---|
| Beta-8 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116M), num_layers=12, hidden_dim=768 | IN1k_256px recaptioned with md2+qwen2-vl+smolvlm2, AR 1:1/3:4/4:3, 100 epochs, ??? steps, BS 1024, single-GPU LR 5e-4 constant, 10% label dropout | 1x RTX 6000 Pro, ??? hrs | Model | Code | ??? | |
| Beta-7 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116M), num_layers=12, hidden_dim=768 | CC12M+IN21K 256px (17M subset), ~4 epochs, 300k steps, BS 256, single-GPU LR 5e-4 constant, 10% label dropout | 1x RTX 6000 Ada, 64 hrs | Model | Code | 1.01 | |
| Beta-6 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116M), num_layers=12, hidden_dim=768 | IN1k_256px recaptioned with md2+qwen2-vl+smolvlm2, AR 1:1/3:4/4:3, 70 epochs, 280k steps, BS 320, single-GPU LR 5e-4 constant, 20% label dropout | 1x RTX 5090, 28 hrs | Model | Code | 1.03 | |
| Beta-5 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116M), num_layers=12, hidden_dim=768 | IN1k_256px recaptioned with md2+qwen2-vl+smolvlm2, AR 1:1/3:4/4:3, 40 epochs, 168k steps, BS 320, single-GPU LR 5e-4 constant, 10% label dropout | 1x RTX 5090, 16 hrs | Model | Code | 1.03 | |
| Beta-4 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116M), num_layers=12, hidden_dim=768 | IN1k_256px recaptioned with md2+qwen2-vl+smolvlm2, AR 1:1/3:4/4:3, 65 epochs, 250k steps, BS 320, single-GPU LR 5e-4 constant, no label dropout | 1x RTX 5090, 25 hrs | Model | Code | 1.02 | |
| Alpha-45 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | Alpha-44 (continued) | PD12M 256px, 200k steps, BS 256x4, grad checkpointing, LR 4e-4 with linear decay to 1e-4 over 50k steps, 10% dropout, CFG | 4x3090, 250 hrs | Model | Code | 0.89 | |
| Alpha-44 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (590M), num_layers=28, hidden_dim=1152 | IN1k_256px, captions by md2+qwen2-vl+smolvlm2, all aspect ratios, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 4e-4 with linear decay to 1e-4 over 50k steps, 10% dropout, CFG | 4x3090, 150 hrs | Model | Code | 0.992 | |
| Alpha-42 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (590M), num_layers=28, hidden_dim=1152 | IN1k_256px, captions by md2+qwen2-vl+smolvlm2, bf16, AR 1:1/3:4/4:3, 2 runs (model broke after 45 epochs, restarted at epoch 25 with lower LR), 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4 (1e-4 after restart), 10% dropout is back | 4x3090, 50 + 85 hrs | Model | Code | run1 / run2 | |
| Alpha-39 | SmolLM2-360M | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_256px, captions by md2+qwen2-vl+smolvlm2, bf16, AR 1:1/3:4/4:3, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 32 hrs | Model | Code | 0.99 | |
| Alpha-38 | SigLIP2 | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_256px, captions by md2+qwen2-vl+smolvlm2, bf16, AR 1:1/3:4/4:3, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 24 hrs | Model | Code | 0.99 | |
| Alpha-37 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_256px, captions by md2+qwen2-vl+smolvlm2, bf16, AR 1:1/3:4/4:3, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 24 hrs | Model | Code | 0.98 | |
| Alpha-35 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_256px recaptioned with md2, bf16, AR 1:1/3:4/4:3, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 34 hrs | Model | Code | 0.97 | |
| Alpha-34 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel, wide variant (205M), num_layers=12, hidden_dim=1024 | IN1k_256px, no aug., + aspect ratios 1:1/3:4/4:3, 88 epochs, 140k steps, BS 192x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 31 hrs | trashed | Code | 1.00 | |
| Alpha-33 | ModernBERT-large | KBlueLeaf/EQ-SDXL-VAE | SanaTransformer2DModel (116.0M), num_layers=12, hidden_dim=768 | imagenet1k_eqsdxlvae_latents_withShape, 4.4 epochs, 16k steps, BS 80x4, 1000 timesteps, LR 3e-4, 10% dropout, CFG | 4x3090, 3.4 hrs | trashed | Code | 0.72 | |
| Alpha-32 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_256px, no aug., + aspect ratios 1:1/3:4/4:3, 100 epochs, 125k steps, BS 256x4, 1000 timesteps, LR 5e-4, CFG | 4x3090, 24 hrs | Model | Code | 0.98 (still undertrained, I guess) | |
| Alpha-31 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_128px, no aug., + aspect ratios 1:1/3:4/4:3, 100 epochs, 38.6k steps, BS 832x4, 1000 timesteps, LR 5e-4, 10% dropout, CFG | 4x3090, 12 hrs | Model | Code | 1.14 (not 100% sure A31 > A30) | |
| Alpha-30 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_128px + 4 augmentations, 102 epochs, 32k steps, BS 1024x4, 1000 timesteps, LR 5e-4, 10% dropout, CFG | 4x3090, 16 hrs | Model | Code | 1.14 | |
| Alpha-29 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (125.3M), num_layers=12, hidden_dim=768 | IN1k_96px + 4 augmentations, 103 epochs, 32k steps, BS 1024x4, 1000 timesteps, LR 5e-4, 10% dropout, CFG | 4x3090, 12.5 hrs | Model | Code | 1.21 (less overfitting with augmentations) | |
| Alpha-27 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116.08M), num_layers=12, hidden_dim=768 | IN1k_96px, 143 epochs, BS 1024x4, 1000 timesteps, LR 5e-4 | 4x3090, 17 hrs | none (run killed, overfitting) | Code | 1.14 | |
| Alpha-15 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | CIFAR10 128px, 800 epochs, BS 512, 1000 timesteps, LR 5e-4 | 3x3090, 8.5 hrs | forgot to save | Code | 0.78 | |
| Alpha-14 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | CIFAR10 128px, 1000 epochs, BS 384, 1000 timesteps, LR 5e-4, scaling-factor fix | 1x3090, 23 hrs | Model | Code | 0.78 | |
| Alpha-13 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | CIFAR10 128px, 1000 epochs, BS 384, 1000 timesteps, LR 5e-4 | 1x3090, 23 hrs | trashed | Code | 1.9 | |
| Alpha-22 | ModernBERT-large | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (116.08M), num_layers=12, hidden_dim=768 | CIFAR10-augmented 64px, 500 epochs, BS 1024x4, 1000 timesteps, LR 5e-4 | 4x3090, 2 hrs | Model | Code | 1.07 | |
| Alpha-12 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | CIFAR10 64px, 500 epochs, BS 896, 1000 timesteps, LR 5e-4 | 1x3090, 8 hrs | Model | Code | 1.27 | ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"] |
| Alpha-11 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | CIFAR10 64px, 500 epochs, BS 896, 1000 timesteps, LR 5e-4 | 1x3090, 12 hrs | Model | Code | 1.25 | ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"] |
| Alpha-20 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (17.8M), num_layers=7, hidden_dim=384 | Fashion MNIST, 20 epochs, 1170 steps, BS 1024, 1000 timesteps (logit-normal), LR 5e-4 | 1x3090, 13 min | gone | Code | 0.61 logit-normal / 0.71 uniform / 0.61 beta-high / 0.82 beta-low | notes |
| Alpha-17 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (17.8M), num_layers=7, hidden_dim=384 | Fashion MNIST, 20 epochs, 600 steps, BS 2048, 1000 timesteps (uniform), LR 5e-4 | 1x3090, 20 min | Model | Code | 0.85 | notes |
| Alpha-8d | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | Fashion MNIST, 400 epochs, BS 896, 1000 timesteps | 1x4090, 6.2 hrs | Model | Code | 1.14 (wandb) | ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'] |
| Alpha-8c | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | Fashion MNIST, 236 epochs, BS 896, 40 timesteps | 1x4090, 2.6 hrs | trashed | Code | 1.33 (wandb) | |
| Alpha-8b | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | Fashion MNIST, 250 epochs, BS 896, 20 timesteps | 1x4090, 2.6 hrs | trashed | Code | 1.33 (wandb) | |
| Alpha-8a | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7 | Fashion MNIST, 250 epochs, BS 896, 10 timesteps | 1x4090, 2.6 hrs | trashed | Code | 1.34 (wandb) | |
| Alpha-10 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | MNIST, 3 epochs, LR 5e-4, BS 256, 10 timesteps (lognormal) | 1x4090, 10 min | Model | Code | 1.069 | |
| Alpha-7 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | MNIST, 3 epochs, LR 5e-4, BS 256, 10 timesteps | 1x4090, 10 min | Model | Code | 0.99 | |
| Alpha-9 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | ImageNet-1k, 60 epochs, BS 320, LR 5e-4 | 1x4090, 83 hrs | FAIL | Code | 2.32 | |
| Alpha-5 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | ImageNet-1k, 20 epochs, BS 128 | 1x4090, 22 hrs | FAIL | Code | 2.57 (diverging after epoch 2) | |
| Alpha-4 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | MNIST, 3 epochs, BS 128 | 1x4090, 9 min | Model | Code | 1.050 | |
| Alpha-3 | ModernBERT-base | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), num_layers=7, cross_attention_dim=1152 | MNIST, 85,660 steps (=150 epochs), BS 128 | 1x4090, 8 hrs | Model | Code | 0.833 | |
| Alpha-2 | Gemma2 2B | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), 7 layers instead of 28 | MNIST, 7,940 steps, BS 128 | 1x4090, 40 min | Model | Code | 0.933 | |
| Alpha-1 | Gemma2 2B | dc-ae-f32c32-sana-1.0 | SanaTransformer2DModel (158.18M), 7 layers instead of 28 | MNIST, 5 epochs, LR 1e-4, 300k steps, BS 1 | 1x4090, 4 hrs | Model | Code | 0.958 | |
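Most rows above share one recipe: a flow-matching objective over a 1000-timestep grid, timesteps drawn logit-normally (the Alpha-20 row compares logit-normal, uniform, and beta sampling), and 10-20% caption dropout so the model can later be sampled with CFG. A minimal training step under those assumptions, not my exact code, with `ae`, `tokenizer`, `text_encoder`, and `transformer` as sketched earlier:

```python
import torch
import torch.nn.functional as F

def training_step(images, captions, dropout_p=0.10):
    """One hedged flow-matching step: logit-normal timesteps + caption dropout."""
    with torch.no_grad():
        latents = ae.encode(images).latent   # DC-AE latents: (B, 32, H/32, W/32)
        # Label/caption dropout: blank some captions so the model also learns
        # an unconditional mode, which classifier-free guidance needs later.
        captions = ["" if torch.rand(()).item() < dropout_p else c for c in captions]
        tokens = tokenizer(captions, padding=True, return_tensors="pt").to(images.device)
        text_emb = text_encoder(**tokens).last_hidden_state

    b = latents.shape[0]
    t = torch.sigmoid(torch.randn(b, device=latents.device))  # logit-normal in (0, 1)
    noise = torch.randn_like(latents)
    t_ = t.view(b, 1, 1, 1)
    noisy = (1 - t_) * latents + t_ * noise   # straight-line path from data to noise
    target = noise - latents                  # rectified-flow velocity target

    pred = transformer(
        hidden_states=noisy,
        encoder_hidden_states=text_emb,
        timestep=t * 1000,                    # mapped onto the 1000-step grid
    ).sample
    return F.mse_loss(pred, target)
```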
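And the sampling side of CFG, again only a sketch: run the transformer on both the real caption embedding and a blank-caption embedding, extrapolate between the two velocities, and Euler-step from noise toward data. The guidance scale here is an illustrative value, not one pulled from my configs.

```python
import torch

@torch.no_grad()
def sample_cfg(transformer, text_emb, null_emb, steps=20, guidance_scale=4.5,
               shape=(1, 32, 8, 8), device="cuda"):
    """Euler sampler over the learned velocity field with classifier-free guidance.
    shape is the DC-AE latent for a 256px image: 32 channels at 8x8 (f32)."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)  # noise (t=1) -> data (t=0)
    for i in range(steps):
        t, dt = ts[i], ts[i] - ts[i + 1]
        tt = t.expand(shape[0]) * 1000
        v_cond = transformer(hidden_states=x, encoder_hidden_states=text_emb, timestep=tt).sample
        v_unc = transformer(hidden_states=x, encoder_hidden_states=null_emb, timestep=tt).sample
        v = v_unc + guidance_scale * (v_cond - v_unc)  # push toward the caption
        x = x - dt * v
    return x  # decode with ae.decode(x).sample (mind DC-AE's scaling_factor)
```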
Thanks to:
- SwayStar ⭐️
- cloneofsimo/minRF
- Hugging Face's 🧨 diffusers, transformers, SmolLM2, and SmolVLM2
- Two more great VLMs: Moondream and Qwen2.5-VL
- MIT's HAN Lab
- ModernBERT, Answer.AI, Jeremy Howard, and Jonathan Whitaker
- ostris/ai-toolkit
- bghira/SimpleTuner
- Google's PartiPrompts