Project Chimera: Dense vs Mixture-of-Experts (MoE) Transformer Study

  • Author: Cory Janowski
  • Date: 2025-08-05

Abstract

This repository investigates architectural and computational trade-offs between dense Transformer decoders and Mixture-of-Experts (MoE) variants in autoregressive language modeling. Project-Chimera implements a nano-GPT style dense baseline and introduces a modular MoE interface with top-k gating and configurable expert ensembles. My goals are to: (1) establish a strong, reproducible dense baseline; (2) design an extensible MoE layer with clear interfaces for experts, gating, and routing; (3) evaluate training stability, loss convergence, and compute efficiency; and (4) provide open, testable code for further research on routing quality, auxiliary balancing objectives, and dispatch efficiency. Preliminary results are forthcoming; this document details the methodology and experimental protocol to ensure reproducibility and facilitate future extensions.

1. Introduction

Large-scale language models achieve strong performance but often require prohibitive compute. MoE architectures reduce per-token compute by routing tokens to a small subset of experts, potentially improving throughput and scaling efficiency. However, MoE introduces new challenges: routing stability, load balancing, dispatch overhead, and expert under/over-utilization. This project builds a rigorous experimental backbone to assess when and how MoE outperforms dense models at similar quality or latency.

Contributions:

  • Dense baseline: nano-GPT style decoder with unit tests and training loop.
  • MoE interface: modular design isolating experts, gating, and layer combination, with feature-flag integration.
  • Reproducible pipeline: data preprocessing for AG News, CLI-driven training, device and seed utilities, tests for core functionality.
  • Evaluation plan: metrics for model quality and system efficiency, plus ablations on top-k, expert count, and routing noise.

2. Related Work

  • Dense Transformers: the standard decoder-only Transformer architecture underpins autoregressive modeling across domains.
  • Mixture-of-Experts: Switch Transformer, GShard, and later work demonstrate sparse activation via routed experts, enabling larger model capacity without linearly increasing per-token FLOPs.
  • Routing and Load Balancing: top-k routing with auxiliary balancing losses addresses collapsed utilization; variants (noisy gating, soft routing, Sinkhorn routing) explore improved token-to-expert assignments.
  • System Optimizations: expert parallelism and efficient gather/scatter implementations (e.g., in DeepSpeed or FSDP-based setups) reduce dispatch costs, enabling practical training.

Reference placeholders (see Section 10 for full citations):

  • Fedus et al., “Switch Transformers”
  • Lepikhin et al., “GShard”
  • Shazeer et al., “Outrageously Large Neural Networks”
  • Recent MoE routing and load balancing advances

3. Methods

3.1 Dense Baseline

Project-Chimera implements a GPT-style Transformer decoder with:

  • Token embeddings with weight tying to the output head.
  • Sinusoidal (sine/cosine) positional encoding.
  • Pre-norm residual blocks comprising MultiheadAttention and an MLP FFN.
  • Causal attention masking and key_padding_mask support.

Core components: BaselineConfig (model hyperparameters), GPTDecoder (the decoder model), TransformerBlock (pre-norm attention + FFN), and BaselineTrainer (training loop and evaluation); see Appendix A for file locations.
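
To make the block structure concrete, here is a minimal pre-norm decoder block sketch in PyTorch. It follows the description above (pre-norm, MultiheadAttention, MLP FFN, causal and key-padding masks); the class and argument names are illustrative and may differ from the actual definitions in src/project_chimera/baseline.py.

    import torch
    import torch.nn as nn

    class PreNormBlock(nn.Module):
        """Illustrative pre-norm decoder block: LayerNorm -> attention -> residual,
        then LayerNorm -> MLP FFN -> residual."""
        def __init__(self, d_model: int, n_heads: int, ff_dim: int, dropout: float = 0.1):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, ff_dim), nn.GELU(),
                nn.Linear(ff_dim, d_model), nn.Dropout(dropout),
            )

        def forward(self, x, attn_mask=None, key_padding_mask=None):
            # Pre-norm attention sub-block with residual connection.
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, attn_mask=attn_mask,
                             key_padding_mask=key_padding_mask, need_weights=False)
            x = x + a
            # Pre-norm feed-forward sub-block with residual connection.
            return x + self.ffn(self.ln2(x))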

3.2 MoE Architecture

The MoE layer is designed as an FFN replacement with a clear separation of concerns: expert modules (independent FFNs), a gating network that produces per-token top-k weights, and a layer that combines the selected expert outputs (see src/project_chimera/moe/).

Integration:

  • Feature flag in BaselineConfig enables MoE FFN in each TransformerBlock.
  • When disabled, the model behaves identically to the dense baseline.

Config flags in BaselineConfig:

  • moe_enabled: bool
  • moe_n_experts: int
  • moe_top_k: int
  • moe_activation: str
  • moe_noisy_gate: bool

See integration points in src/project_chimera/baseline.py.
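
For orientation, the sketch below shows one way the flags above can drive a naive soft-combine MoE FFN: every expert is computed for every token, and outputs are mixed by renormalized top-k gate weights (the functional baseline described in Section 7). The class name MoEFFN and the noise scheme are illustrative assumptions, not necessarily the repository's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFFN(nn.Module):
        """Naive MoE FFN: computes all experts, combines by top-k gate weights."""
        def __init__(self, d_model, ff_dim, n_experts=4, top_k=1, noisy_gate=False):
            super().__init__()
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(), nn.Linear(ff_dim, d_model))
                for _ in range(n_experts)
            )
            self.gate = nn.Linear(d_model, n_experts)
            self.top_k = top_k
            self.noisy_gate = noisy_gate

        def forward(self, x):                       # x: (batch, seq, d_model)
            logits = self.gate(x)                   # (B, T, n_experts)
            if self.noisy_gate and self.training:
                logits = logits + torch.randn_like(logits)   # simple exploration noise
            topv, topi = logits.topk(self.top_k, dim=-1)
            weights = F.softmax(topv, dim=-1)       # renormalize over the selected experts
            dense_w = torch.zeros_like(logits).scatter(-1, topi, weights)   # (B, T, E)
            expert_out = torch.stack([e(x) for e in self.experts], dim=-2)  # (B, T, E, d)
            return (dense_w.unsqueeze(-1) * expert_out).sum(dim=-2)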

3.3 Design Choices

  • Pre-norm blocks for stability under both dense and MoE.
  • Weight tying to reduce softmax head parameters, matching GPT-style conventions.
  • Top-k gating stub prioritizes API clarity; dispatch efficiency to be added.
  • Optional automatic mixed precision (AMP) on CUDA to evaluate training throughput (see the sketch below).
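
A hedged sketch of the optional AMP path: a generic PyTorch mixed-precision training step with gradient clipping, not necessarily how BaselineTrainer implements it.

    import torch
    import torch.nn.functional as F

    def train_steps(model, loader, optimizer, use_amp: bool, max_norm: float = 1.0):
        """One pass over loader with optional CUDA AMP and gradient clipping."""
        scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
        for inputs, targets in loader:
            optimizer.zero_grad(set_to_none=True)
            with torch.autocast(device_type="cuda", enabled=use_amp):
                logits = model(inputs)                                    # (B, T, vocab)
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)                                    # clip true (unscaled) grads
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            scaler.step(optimizer)
            scaler.update()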

4. Experiments

4.1 Datasets

Primary: AG News (subset) as a compact, well-known dataset to validate end-to-end correctness and performance assumptions. Data is handled by scripts and loaders with tokenizer alignment.

4.2 Experimental Setup

  • Dense Baseline:
    • d_model {128, 256}, n_layers {2, 4}, n_heads {2, 4}, ff_dim {4x d_model}, dropout {0.0, 0.1}
  • MoE Variants:
    • n_experts {4, 8}, top_k {1, 2}, activation {"gelu"}, noisy_gate {False, True}
  • Optimization: AdamW (lr 3e-4 default), gradient clipping, mixed precision on CUDA.
  • Logging: periodic training and validation loss reporting, final validation loss.
  • Compute: single GPU or CPU fallback.
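
For concreteness, the dense sweep above can be enumerated as a plain hyperparameter grid; the snippet below is a sketch whose field names simply mirror the bullets above.

    from itertools import product

    # Enumerate the dense-baseline grid described above (ff_dim fixed at 4 * d_model).
    dense_grid = [
        dict(d_model=d, n_layers=n_layers, n_heads=n_heads, ff_dim=4 * d, dropout=p)
        for d, n_layers, n_heads, p in product((128, 256), (2, 4), (2, 4), (0.0, 0.1))
    ]
    print(len(dense_grid))  # 16 configurations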

4.3 Metrics

  • Quality: training loss, validation perplexity (derived), stability (finite loss, no NaNs).
  • Efficiency: steps/sec, memory footprint, and inference throughput (to be added).
  • Routing Diagnostics: expert utilization histograms, top-k distributions (to be added).
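
Two of these metrics can be pinned down precisely: validation perplexity is the exponential of the mean per-token cross-entropy, and expert utilization is the fraction of tokens routed to each expert. A small sketch (function names are illustrative):

    import math
    import torch

    def perplexity(mean_val_loss: float) -> float:
        """Perplexity = exp(mean per-token cross-entropy), assuming natural-log loss."""
        return math.exp(mean_val_loss)

    def expert_utilization(top_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
        """Fraction of routed tokens per expert; top_idx holds top-k expert indices."""
        counts = torch.bincount(top_idx.reshape(-1), minlength=n_experts).float()
        return counts / counts.sum()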

4.4 Ablation Studies (Planned)

  • Effect of top-k on loss and utilization.
  • Number of experts vs. compute/quality trade-off.
  • Noisy gating impact on load balancing and stability.
  • Dispatch efficiency via gather/scatter vs naive all-expert compute.
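
To make the dispatch ablation concrete: instead of running every expert on every token as in the naive soft-combine layer, an index-based dispatch runs each expert only on the tokens routed to it. A minimal top-1 sketch with illustrative names:

    import torch
    import torch.nn as nn

    def dispatch_top1(x: torch.Tensor, gate_logits: torch.Tensor, experts: nn.ModuleList) -> torch.Tensor:
        """Top-1 dispatch: each expert processes only its routed token subset.
        x: (tokens, d_model); gate_logits: (tokens, n_experts)."""
        weights, top_idx = gate_logits.softmax(dim=-1).max(dim=-1)   # (tokens,), (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * weights[mask].unsqueeze(-1)
        return out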

5. Results

Results are pending. This section will include dense baseline curves, MoE comparisons at compute-matched settings, and routing diagnostics. I will report:

  • Training and validation loss curves with confidence intervals (multiple seeds).
  • Expert utilization histograms; load balancing metrics.
  • Throughput and memory measurements across configurations.

6. Discussion

I anticipate that MoE models can match or exceed dense baseline quality at reduced per-token compute, but only with careful routing and load balancing. The naive soft-combine implementation serves as a correctness oracle but is inefficient. The central question is identifying the regimes (dataset size, model capacity, routing k) where MoE’s sparsity yields practical benefits without sacrificing convergence stability.

7. Limitations

  • Current MoE layer computes all experts, then combines by top-k weights — not efficient and intended as a functional baseline.
  • No distributed expert parallelism yet; single-process prototype.
  • Load balancing / auxiliary objectives not yet implemented.
  • Limited dataset scope at present; broader corpora planned.

8. Future Work

  • Implement efficient token dispatch with capacity constraints and batch-wise gather/scatter.
  • Add auxiliary load balancing losses and scheduling (a minimal sketch follows this list).
  • Integrate FSDP/DeepSpeed expert parallelism for scaling.
  • Expand datasets and conduct large-scale comparisons.
  • Explore advanced routing (noisy top-k variants, Sinkhorn-based, task-aware routing).
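
For the load-balancing item above, one standard auxiliary objective (in the spirit of Switch Transformers, Fedus et al., 2021) penalizes the product of each expert's routed-token fraction and its mean gate probability. A minimal sketch, assuming top-1 routing:

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(gate_logits: torch.Tensor, top_idx: torch.Tensor, n_experts: int) -> torch.Tensor:
        """Switch-style auxiliary loss: n_experts * sum_e(f_e * P_e), where f_e is the
        fraction of tokens routed to expert e and P_e is its mean gate probability."""
        probs = F.softmax(gate_logits, dim=-1)                        # (tokens, n_experts)
        f = torch.bincount(top_idx, minlength=n_experts).float() / top_idx.numel()
        p = probs.mean(dim=0)
        return n_experts * torch.sum(f * p)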

9. Reproducibility

Environment

  • Python 3.10+
  • Quick setup: python scripts/quick_start.py
  • Manual install: pip install -e .[dev]

Issues and Solutions

Data

  • AG News (subset) is prepared by the scripts in scripts/ and loaded via src/project_chimera/data/.

Preprocessing

  • Tokenization is aligned with the tokenizer selected via --model_name (e.g., gpt2); --lowercase and --max_length control normalization and sequence truncation.

Training (Dense Baseline)

  • Command example:
    • python -u scripts/train_baseline_smoke.py --model_name gpt2 --lowercase --max_length 128 --batch_size 16 --train_limit 2048 --val_limit 512 --d_model 256 --n_layers 4 --n_heads 4 --ff_dim 1024 --dropout 0.1 --max_steps 200 --eval_every 100 --log_every 20 --lr 3e-4

Training (Enable MoE)

  • Set flags via BaselineConfig:
    • moe_enabled=True, moe_n_experts=4, moe_top_k=1, moe_activation="gelu", moe_noisy_gate=False
  • Example (pseudo-code):
    • cfg = BaselineConfig(vocab_size=tokenizer.vocab_size, d_model=256, n_layers=4, n_heads=4, ff_dim=1024, dropout=0.1, max_seq_len=128, tie_weights=True, moe_enabled=True, moe_n_experts=4, moe_top_k=1)
    • model = GPTDecoder(cfg)
    • Use BaselineTrainer as usual.

Testing

  • Run unit tests:
    • pytest -q
  • Tests include data pipeline smoke checks and model forward shape/finite loss tests.
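
For illustration, a forward-shape / finite-loss test in the spirit described above might look like the following; the import path and constructor fields follow the pseudo-code earlier in this section, but the exact signatures in the repository may differ.

    import torch
    import torch.nn.functional as F
    # Assumed import path based on the package layout in Appendix A.
    from project_chimera.baseline import BaselineConfig, GPTDecoder

    def test_forward_shape_and_finite_loss():
        cfg = BaselineConfig(vocab_size=1000, d_model=128, n_layers=2, n_heads=2,
                             ff_dim=512, dropout=0.0, max_seq_len=64, tie_weights=True)
        model = GPTDecoder(cfg)
        tokens = torch.randint(0, cfg.vocab_size, (2, 16))
        logits = model(tokens)                       # assumed to return (batch, seq, vocab) logits
        assert logits.shape == (2, 16, cfg.vocab_size)
        loss = F.cross_entropy(logits.view(-1, cfg.vocab_size), tokens.view(-1))
        assert torch.isfinite(loss)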

10. References

Key references for MoE and scaling:

  • Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. ICML. arXiv:2101.03961. https://arxiv.org/abs/2101.03961
  • Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., & Chen, Z. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR. arXiv:2006.16668. https://arxiv.org/abs/2006.16668
  • Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR. arXiv:1701.06538. https://arxiv.org/abs/1701.06538
  • Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Ruwase, O., Rasley, J., Zhang, S., & He, Y. (2022). DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596. https://arxiv.org/abs/2201.05596
  • Zhou, D., Hu, H., Pfister, T., & Soricut, R. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS. arXiv:2202.09368. https://arxiv.org/abs/2202.09368

Appendix

A. Code Structure

  • src/project_chimera/baseline.py: Config, decoder model, transformer block
  • src/project_chimera/trainer.py: Training loop, evaluation
  • src/project_chimera/data/: Dataset and preprocessing
  • src/project_chimera/utils/: Device and reproducibility utilities
  • src/project_chimera/moe/: Experts, gating, and MoE layer modules
  • scripts/: Training and data scripts
  • tests/: Unit tests

B. License

  • MIT

C. Acknowledgements

  • This work builds on the open community’s progress in dense and sparse Transformer research. Thanks to maintainers of PyTorch and HuggingFace ecosystems.
