- Author: Cory Janowski
- Date: 2025-08-05
This repository investigates architectural and computational trade-offs between dense Transformer decoders and Mixture-of-Experts (MoE) variants in autoregressive language modeling. Project-Chimera implements a nano-GPT style dense baseline and introduces a modular MoE interface with top-k gating and configurable expert ensembles. My goals are to: (1) establish a strong, reproducible dense baseline; (2) design an extensible MoE layer with clear interfaces for experts, gating, and routing; (3) evaluate training stability, loss convergence, and compute efficiency; and (4) provide open, testable code for further research on routing quality, auxiliary balancing objectives, and dispatch efficiency. Preliminary results are forthcoming; this document details the methodology and experimental protocol to ensure reproducibility and facilitate future extensions.
Large-scale language models achieve strong performance but often require prohibitive compute. MoE architectures reduce per-token compute by routing tokens to a small subset of experts, potentially improving throughput and scaling efficiency. However, MoE introduces new challenges: routing stability, load balancing, dispatch overhead, and expert under/over-utilization. This project builds a rigorous experimental backbone to assess when and how MoE outperforms dense models at similar quality or latency.
- Dense baseline: nano-GPT style decoder with unit tests and training loop.
- MoE interface: modular design isolating experts, gating, and layer combination, with feature-flag integration.
- Reproducible pipeline: data preprocessing for AG News, CLI-driven training, device and seed utilities, tests for core functionality.
- Evaluation plan: metrics for model quality and system efficiency, plus ablations on top-k, expert count, and routing noise.
- Dense Transformers: The standard decoder-only Transformer architecture underpins autoregressive modeling across domains.
- Mixture-of-Experts: Switch Transformer, GShard, and later works demonstrate sparse activation via routed experts, enabling larger model capacity without linearly increasing per-token FLOPs.
- Routing and Load Balancing: Top-k routing with auxiliary balancing losses addresses collapsed utilization. Variants (noisy gating, soft routing, Sinkhorn routing) explore improved token-to-expert assignments.
- System Optimizations: Expert parallelism and efficient gather/scatter implementations (e.g., in DeepSpeed or FSDP-based setups) reduce dispatch costs, enabling practical training.
- Fedus et al., “Switch Transformers”
- Lepikhin et al., “GShard”
- Shazeer et al., “Outrageously Large Neural Networks”
- Recent MoE routing and load balancing advances (e.g., DeepSpeed-MoE, Expert Choice routing)
- Token embeddings with weight tying to the output head.
- Sine/cos positional encoding.
- Pre-norm residual blocks comprising MultiheadAttention and an MLP FFN (see the sketch after this list).
- Causal attention masking and key_padding support.
- Config and model: src/project_chimera/baseline.py, GPTDecoder, TransformerBlock
- Training utilities: src/project_chimera/trainer.py, BaselineTrainer with AdamW, AMP, gradient clipping
- Data preprocessing: src/project_chimera/data/preprocess.py with GPT-2 tokenizer, lowercase optional, padding/truncation
- Device and seeding: src/project_chimera/utils/device.py, src/project_chimera/utils/repro.py
- Tests: tests/test_model.py, tests/test_data.py, tests/test_device_and_seed.py
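To make the decoder block concrete, the following is a minimal, self-contained sketch of the pre-norm block described above (token embeddings, weight tying, and the sine/cosine positional encoding sit at the model level). It is illustrative only; the actual implementation lives in src/project_chimera/baseline.py, and the argument names simply mirror the config fields mentioned in this README.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Illustrative pre-norm decoder block: LayerNorm -> self-attention -> residual,
    then LayerNorm -> MLP feed-forward -> residual."""

    def __init__(self, d_model: int, n_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_dim), nn.GELU(), nn.Linear(ff_dim, d_model), nn.Dropout(dropout)
        )

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor | None = None) -> torch.Tensor:
        # Causal mask: True above the diagonal, i.e. position i may only attend to j <= i.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal,
                                key_padding_mask=key_padding_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x
```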
- Experts: Parameterized FFN experts with configurable activation and dropout.
- src/project_chimera/moe/experts.py: ExpertConfig, FFNExpert, ExpertParallel
- Gating: Top-k gating to produce expert indices and weights per token.
- src/project_chimera/moe/gating.py: GatingConfig, TopKGating
- Layer: Combines experts and gating; returns output and routing diagnostics.
- src/project_chimera/moe/layer.py: MoEConfig, MoELayer, MoEFFNWrapper
- Feature flag in BaselineConfig enables MoE FFN in each TransformerBlock.
- When disabled, the model behaves identically to the dense baseline.
- moe_enabled: bool
- moe_n_experts: int
- moe_top_k: int
- moe_activation: str
- moe_noisy_gate: bool
See integration points in src/project_chimera/baseline.py; a simplified sketch of the gating and naive expert combination follows below.
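For orientation, here is a simplified sketch of the top-k gating computation and the naive all-expert combine that the current layer uses as a correctness baseline. It is not the project's exact API (see src/project_chimera/moe/ for that); the class and function names here are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Score tokens against experts and keep the top-k experts per token."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 1, noisy: bool = False):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)
        self.top_k, self.noisy = top_k, noisy

    def forward(self, x):                        # x: (batch, seq, d_model)
        logits = self.proj(x)                    # (batch, seq, n_experts)
        if self.noisy and self.training:         # optional exploration noise on the router
            logits = logits + torch.randn_like(logits)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        return weights, indices                  # both (batch, seq, top_k)

def naive_moe_forward(x, gate, experts):
    """Correctness-oracle combine: run every expert on every token, then keep and
    weight only the top-k outputs. Inefficient by design; see Limitations."""
    weights, indices = gate(x)
    expert_outs = torch.stack([e(x) for e in experts], dim=-2)        # (batch, seq, E, d)
    gathered = torch.gather(
        expert_outs, -2,
        indices.unsqueeze(-1).expand(*indices.shape, x.size(-1)))     # (batch, seq, k, d)
    return (weights.unsqueeze(-1) * gathered).sum(dim=-2)             # (batch, seq, d)
```

An efficient version would dispatch each token only to its selected experts; a dispatch sketch appears under future work.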
- Pre-norm blocks for stability under both dense and MoE.
- Weight tying to reduce softmax head parameters, matching GPT-style conventions.
- Top-k gating stub prioritizes API clarity; dispatch efficiency to be added.
- Optional AMP on CUDA to evaluate training throughput.
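For reference, the optional AMP path follows the standard PyTorch autocast/GradScaler pattern combined with gradient clipping. This is a generic sketch, not the trainer's exact loop (that lives in src/project_chimera/trainer.py); in particular, it assumes the model returns a scalar loss.

```python
import torch

def train_step(model, batch, optimizer, scaler, use_amp: bool, clip_norm: float = 1.0):
    """One AMP-aware optimization step with gradient clipping (illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", enabled=use_amp):
        loss = model(**batch)                      # assumption: forward returns a scalar loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # unscale before clipping gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

# Typical setup:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
```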
Primary: AG News (subset) as a compact, well-known dataset to validate end-to-end correctness and performance assumptions. Data is handled by scripts and loaders with tokenizer alignment.
- Dense Baseline:
- d_model {128, 256}, n_layers {2, 4}, n_heads {2, 4}, ff_dim {4x d_model}, dropout {0.0, 0.1}
- MoE Variants:
- n_experts {4, 8}, top_k {1, 2}, activation {"gelu"}, noisy_gate {False, True}
- Optimization: AdamW (lr 3e-4 default), gradient clipping, mixed precision on CUDA.
- Logging: periodic training and validation loss reporting, final validation loss.
- Compute: single GPU or CPU fallback.
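A small helper for enumerating this grid (a sketch; the actual sweeps are driven by the CLI scripts, and the names below are illustrative):

```python
from itertools import product

# Dense baseline grid from the list above; ff_dim is derived as 4 x d_model.
dense_grid = {
    "d_model": [128, 256],
    "n_layers": [2, 4],
    "n_heads": [2, 4],
    "dropout": [0.0, 0.1],
}
# MoE variants layer these options on top of a chosen dense configuration.
moe_grid = {"moe_n_experts": [4, 8], "moe_top_k": [1, 2], "moe_noisy_gate": [False, True]}

def enumerate_configs(grid):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if "d_model" in cfg:
            cfg["ff_dim"] = 4 * cfg["d_model"]
        yield cfg
```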
- Quality: training loss, validation perplexity (derived), stability (finite loss, no NaNs).
- Efficiency: steps/sec, memory footprint, and inference throughput (to be added).
- Routing Diagnostics: expert utilization histograms, top-k distributions (to be added).
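Validation perplexity is derived as the exponential of the mean token-level cross-entropy, and throughput is measured as optimizer steps per second; a minimal sketch of these derived metrics (names illustrative):

```python
import math
import time

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp(mean token-level cross-entropy), assuming natural-log losses."""
    return math.exp(mean_ce_loss)

class Throughput:
    """Tracks optimizer steps per second over a measurement window."""

    def __init__(self):
        self.start = time.perf_counter()
        self.steps = 0

    def step(self):
        self.steps += 1

    def rate(self) -> float:
        return self.steps / max(time.perf_counter() - self.start, 1e-9)
```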
- Effect of top-k on loss and utilization.
- Number of experts vs. compute/quality trade-off.
- Noisy gating impact on load balancing and stability.
- Dispatch efficiency via gather/scatter vs naive all-expert compute.
Placeholder: results pending. This section will include dense baseline curves, MoE comparisons at compute-matched settings, and routing diagnostics. I will report:
- Training and validation loss curves with confidence intervals (multiple seeds).
- Expert utilization histograms; load balancing metrics.
- Throughput and memory measurements across configurations.
I anticipate that MoE models can match or exceed dense baseline quality at reduced per-token compute, but only with careful routing and load balancing. The naive soft-combine implementation serves as a correctness oracle but is inefficient. The central question is identifying the regimes (dataset size, model capacity, routing k) where MoE’s sparsity yields practical benefits without sacrificing convergence stability.
- Current MoE layer computes all experts, then combines by top-k weights — not efficient and intended as a functional baseline.
- No distributed expert parallelism yet; single-process prototype.
- Load balancing / auxiliary objectives not yet implemented.
- Limited dataset scope at present; broader corpora planned.
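As noted above, auxiliary balancing objectives are not yet implemented. A commonly used candidate is the Switch Transformer load-balancing loss, which penalizes the product of each expert's routed-token fraction and mean router probability; a hedged sketch (function name and tensor shapes are assumptions, not the repo's API):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top1_indices: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss: n_experts * sum_e f_e * P_e, where f_e is the
    fraction of tokens whose top-1 expert is e and P_e is the mean router
    probability assigned to e. It is minimized when routing is uniform."""
    n_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1).reshape(-1, n_experts)   # (tokens, E)
    one_hot = F.one_hot(top1_indices.reshape(-1), n_experts).float()  # (tokens, E)
    tokens_per_expert = one_hot.mean(dim=0)                           # f_e
    mean_router_prob = probs.mean(dim=0)                              # P_e
    return n_experts * torch.sum(tokens_per_expert * mean_router_prob)
```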
- Implement efficient token dispatch with capacity constraints and batch-wise gather/scatter.
- Add auxiliary load balancing losses and scheduling.
- Integrate FSDP/DeepSpeed expert parallelism for scaling.
- Expand datasets and conduct large-scale comparisons.
- Explore advanced routing (noisy top-k variants, Sinkhorn-based, task-aware routing).
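To make the dispatch item concrete, one possible shape of a batch-wise gather/scatter dispatch is sketched below for the top-1 case, without capacity constraints or weighting for top-k > 1; an eventual implementation would need both.

```python
import torch

def dispatch_top1(x: torch.Tensor, top1_indices: torch.Tensor, experts) -> torch.Tensor:
    """Route each token only to its selected expert instead of running every expert
    on every token. x: (tokens, d_model); top1_indices: (tokens,) integer expert ids."""
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = top1_indices == e            # gather: tokens assigned to expert e
        if mask.any():
            out[mask] = expert(x[mask])     # run the expert FFN, scatter results back
    return out
```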
Environment
- Python 3.10+
- Quick setup: python scripts/quick_start.py
- Manual install: pip install -e .[dev]
- For common training problems and their fixes, see TROUBLESHOOTING.md
- For commands, scripts, and workflows, see RUNBOOK.md
- AG News download handled by:
- scripts/download_ag_news.py
- See data/README.md for dataset notes
- Tokenizer: GPT-2 via HuggingFace; lowercase optional.
- Settings: max length, padding/truncation controlled by TokenizerConfig in src/project_chimera/data/preprocess.py.
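The preprocessing follows the usual HuggingFace tokenizer pattern; roughly as below (the exact TokenizerConfig fields live in src/project_chimera/data/preprocess.py, and the helper name here is illustrative):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 ships without a pad token

def encode(texts, max_length=128, lowercase=False):
    if lowercase:
        texts = [t.lower() for t in texts]
    return tokenizer(
        texts,
        max_length=max_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
```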
- Command example:
- python -u scripts/train_baseline_smoke.py --model_name gpt2 --lowercase --max_length 128 --batch_size 16 --train_limit 2048 --val_limit 512 --d_model 256 --n_layers 4 --n_heads 4 --ff_dim 1024 --dropout 0.1 --max_steps 200 --eval_every 100 --log_every 20 --lr 3e-4
- Set flags via BaselineConfig:
- moe_enabled=True, moe_n_experts=4, moe_top_k=1, moe_activation="gelu", moe_noisy_gate=False
- Example (pseudo-code):
- cfg = BaselineConfig(vocab_size=tokenizer.vocab_size, d_model=256, n_layers=4, n_heads=4, ff_dim=1024, dropout=0.1, max_seq_len=128, tie_weights=True, moe_enabled=True, moe_n_experts=4, moe_top_k=1)
- model = GPTDecoder(cfg)
- Use BaselineTrainer as usual.
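The same example as a self-contained snippet; the import path assumes the editable install from the src/ layout above, and all config fields are the ones listed earlier in this README:

```python
from transformers import GPT2TokenizerFast

from project_chimera.baseline import BaselineConfig, GPTDecoder

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

cfg = BaselineConfig(
    vocab_size=tokenizer.vocab_size,
    d_model=256, n_layers=4, n_heads=4, ff_dim=1024,
    dropout=0.1, max_seq_len=128, tie_weights=True,
    moe_enabled=True, moe_n_experts=4, moe_top_k=1,
)
model = GPTDecoder(cfg)  # train with BaselineTrainer as in the dense case
```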
- Run unit tests:
- pytest -q
- Tests include data pipeline smoke checks and model forward shape/finite loss tests.
- Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. ICML. arXiv:2101.03961. https://arxiv.org/abs/2101.03961
- Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR. arXiv:2006.16668. https://arxiv.org/abs/2006.16668
- Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR. arXiv:1701.06538. https://arxiv.org/abs/1701.06538
- Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Ruwase, O., Rasley, J., Zhang, S., & He, Y. (2022). DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596. https://arxiv.org/abs/2201.05596
- Zhou, D., Hu, H., Pfister, T., & Soricut, R. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS. arXiv:2202.09368. https://arxiv.org/abs/2202.09368
- src/project_chimera/baseline.py: Config, decoder model, transformer block
- src/project_chimera/trainer.py: Training loop, evaluation
- src/project_chimera/data/: Dataset and preprocessing
- src/project_chimera/utils/: Device and reproducibility utilities
- src/project_chimera/moe/: Experts, gating, and MoE layer modules
- scripts/: Training and data scripts
- tests/: Unit tests
- MIT
- This work builds on the open community’s progress in dense and sparse Transformer research. Thanks to maintainers of PyTorch and HuggingFace ecosystems.