This project performs a rigorous comparative analysis of Selective State Space Models (Mamba) against Transformers and linear baselines. The primary objective was to investigate the trade-off between modeling capability and computational complexity, specifically addressing the quadratic cost of self-attention on long sequences.
Experiments were conducted on Tiny Shakespeare (character-level) and WikiText-2/PTB (word-level) to evaluate training convergence, inference throughput, and generalization capabilities.
The Mamba architecture demonstrated superior data efficiency and scaling properties compared to the Transformer baseline, achieving lower test loss with linear complexity.
| Model | Params / Config | Test Loss (Lower is Better) | Complexity | Inference Memory |
|---|---|---|---|---|
| Mamba | State Dim: 16 | 1.8006 | Linear | Fixed |
| Transformer | Layers: 6 | 1.8640 | Quadratic | Growing |
| Self-Attention | Heads: 4 | 2.2619 | Quadratic | Growing |
| Linear Baseline | Context: 128 | 2.5040 | Linear | Fixed |
- Efficiency: Mamba achieved the lowest test loss (1.8006) while maintaining linear $\mathcal{O}(L)$ training complexity and constant $\mathcal{O}(1)$ per-token inference cost.
- Compute-to-Performance Ratio: Analysis of Test-Set Log-Likelihood vs. Training FLOPs revealed that Mamba converges faster and reaches a lower final loss than the Transformer for the same computational budget (see the sketch after this list).
- Generalization: On word-level tasks (WikiText-2), Mamba consistently converged to a lower training loss than the Transformer.
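
To make the complexity gap concrete, the following back-of-the-envelope sketch (the layer width, state size, and helper names are illustrative assumptions, not values from the notebook) estimates the FLOPs needed to decode one additional token with a single attention layer versus a single SSM recurrence step:

```python
# Illustrative estimate (assumed sizes, not the notebook's configuration):
# cost of generating the next token with one attention layer vs. one SSM step.

def attention_step_flops(t: int, d_model: int = 256) -> int:
    """Decoding with a KV cache: the new query attends to all t cached
    positions, so per-token cost grows with the current sequence length."""
    scores = 2 * t * d_model        # q @ K^T
    weighted_sum = 2 * t * d_model  # softmax(scores) @ V
    return scores + weighted_sum

def ssm_step_flops(d_model: int = 256, d_state: int = 16) -> int:
    """Recurrent update of a fixed-size state (h = A*h + B*x, y = C*h):
    per-token cost is independent of how many tokens came before."""
    return 3 * 2 * d_model * d_state

for t in (128, 1_024, 8_192):
    print(f"t={t:5d}  attention ~ {attention_step_flops(t):>9,} FLOPs   "
          f"ssm ~ {ssm_step_flops():,} FLOPs")
```

The growing attention cost is the same effect the "Inference Memory" column captures: attention must revisit every cached token, while the SSM carries a constant-size summary of the context.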
*Mamba achieves lower loss with fewer FLOPs than the Transformer.*

*Mamba (purple) and the Transformer (red) show rapid feature learning compared to the baselines.*

Transformers rely on Global Self-Attention, computing pairwise interactions between all tokens, which costs $\mathcal{O}(L^2)$ time and memory in the sequence length $L$.
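
A minimal single-head sketch (arbitrary dimensions, not the notebook's configuration) makes the quadratic term visible as the $L \times L$ score matrix:

```python
import torch

# Minimal causal self-attention sketch (illustrative sizes). The score matrix
# is L x L, so time and memory grow quadratically with sequence length.
L, d = 128, 64
x = torch.randn(L, d)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = (q @ k.T) / d ** 0.5                     # (L, L) pairwise interactions
causal = torch.tril(torch.ones(L, L)).bool()      # attend only to past tokens
scores = scores.masked_fill(~causal, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v           # (L, d)
```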
Mamba utilizes a Selective State Space Model (SSM) to compress context into a fixed-size latent state, which yields linear-time training and constant-memory inference.
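
A heavily simplified recurrence sketch (assumed dimensions; real Mamba additionally uses a structured $A$, a hardware-aware scan, and gating) shows how the "selective" parameters depend on the input while the state keeps a fixed size:

```python
import torch

# Simplified selective-SSM recurrence (assumed sizes, not the notebook's
# configuration). Context is compressed into a fixed-size state h, so per-token
# compute and memory stay constant regardless of sequence length.
L, d, n = 128, 64, 16                     # sequence length, channels, state size
x = torch.randn(L, d)
W_B, W_C, W_dt = torch.randn(d, n), torch.randn(d, n), torch.randn(d, 1)
A = -torch.rand(d, n)                     # negative entries keep the state stable

h = torch.zeros(d, n)                     # fixed-size latent state
ys = []
for t in range(L):
    xt = x[t]                                         # (d,)
    dt = torch.nn.functional.softplus(xt @ W_dt)      # input-dependent step size
    B, C = xt @ W_B, xt @ W_C                         # input-dependent ("selective") params, (n,)
    A_bar = torch.exp(dt * A)                         # discretized decay, (d, n)
    h = A_bar * h + dt * B * xt.unsqueeze(-1)         # state update, (d, n)
    ys.append((h * C).sum(-1))                        # readout, (d,)
y = torch.stack(ys)                                   # (L, d)
```

Because the step size and the $B$/$C$ projections are functions of the current input, the model can decide what to write into and read out of the state, which is what "selective" refers to.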
```bash
git clone https://github.com/YOUR_USERNAME/scalable-sequence-modeling.git
cd scalable-sequence-modeling
```

It is recommended to use a virtual environment:
```bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install PyTorch (visit pytorch.org for the command specific to your OS/CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu118  # Example for CUDA 11.8

# Install the remaining requirements
pip install -r requirements.txt
```

The core logic and training loops are contained within the Jupyter notebook:
```bash
jupyter notebook code/scalable-sequence-modeling.ipynb
```

Repository layout:

```
scalable-sequence-modeling/
├── code/
│   └── scalable-sequence-modeling.ipynb   # Complete implementation from scratch
├── datasets/
│   ├── tiny_shakespeare/                  # Character-level dataset
│   ├── wikitext-2/                        # Word-level benchmark
│   └── ptb/                               # Penn Treebank benchmark
├── plots/                                 # Generated analysis figures
├── README.md
└── requirements.txt
```
This project was developed as part of the CS689 Machine Learning course at UMass Amherst.