This repository contains the code and experiments from my application to the Eastern European Machine Learning (EEML) Summer School 2025. The project involves:
- A systematic study of Linformer’s behavior under resource constraints.
- A proposed variant — Conv-Linformer — aimed at improving training stability and performance in these settings.
Linformer reduces the quadratic complexity of self-attention by leveraging low-rank projections of keys and values. While efficient at scale, its performance in resource-constrained settings (limited data and compute) remains underexplored.
This project investigates:
- How Linformer behaves in such settings.
- Whether simple architectural tweaks can improve its robustness and effectiveness.
Linformer’s core innovation is a low-rank approximation of the self-attention mechanism that reduces its time and memory complexity. Standard self-attention costs time and memory quadratic in the sequence length $n$; Linformer brings this down to linear by projecting keys and values to a fixed dimension $k \ll n$.
Key elements of Linformer's mechanism:
- Low-Rank Projections: Instead of attending over the full-length key and value matrices, Linformer compresses them along the sequence dimension with learnable projections, reducing their effective length from $n$ to a fixed smaller $k$.
- Global Context: Because each projection mixes information from the entire sequence, long-range dependencies are largely preserved, keeping the approximation efficient without significant information loss.
The compressed attention can be written as

$$\mathrm{Attention}(Q, K, V) \approx \mathrm{softmax}\!\left(\frac{Q\,(EK)^{\top}}{\sqrt{d}}\right) FV,$$

where $E, F \in \mathbb{R}^{k \times n}$ are learnable low-rank projection matrices that reduce $K$ and $V$ from length $n$ to $k$, enabling linear complexity.
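For concreteness, here is a minimal single-head PyTorch sketch of this mechanism. The module and parameter names, the single-head setup, and batch-first tensors are my own simplifications; see the linked lucidrains/linformer for a full multi-head implementation.

```python
import torch
import torch.nn as nn

class LinformerSelfAttention(nn.Module):
    """Single-head Linformer-style attention (illustrative sketch)."""
    def __init__(self, d_model: int, seq_len: int, k: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Low-rank projections E, F compress the sequence dimension n -> k.
        self.E = nn.Linear(seq_len, k, bias=False)
        self.F = nn.Linear(seq_len, k, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x):                       # x: (batch, n, d_model), n == seq_len
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Project keys/values along the sequence dimension: (batch, n, d) -> (batch, k, d)
        k = self.E(k.transpose(1, 2)).transpose(1, 2)
        v = self.F(v.transpose(1, 2)).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (batch, n, k)
        return attn @ v                          # (batch, n, d_model)
```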
- Dataset: WikiText-103, subset to 50M tokens.
- Sequence lengths: 128, 256, 512, 1024.
- Model: Encoder-only Transformer with 8 layers, hidden size 512, and 8 attention heads.
- Training: MLM objective, Hugging Face Trainer, AdamW optimizer with 10% warmup and weight decay (0.001).
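As a rough illustration of this setup, the sketch below wires the listed hyperparameters into the Hugging Face Trainer. The model is a stand-in BERT-style encoder with the same dimensions rather than the actual Linformer/Conv-Linformer implementation, and the learning rate and dataset slice are placeholders.

```python
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer
config = BertConfig(hidden_size=512, num_hidden_layers=8, num_attention_heads=8,
                    intermediate_size=2048, vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)  # stand-in for the Linformer/Conv-Linformer encoder

# Small placeholder slice; the experiments use a 50M-token subset of WikiText-103.
dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train[:1%]")
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

args = TrainingArguments(output_dir="mlm-out",
                         optim="adamw_torch",    # AdamW
                         warmup_ratio=0.1,       # 10% warmup
                         weight_decay=0.001,
                         learning_rate=1e-4)     # placeholder; swept in the experiments

trainer = Trainer(model=model, args=args, train_dataset=dataset,
                  data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True))
trainer.train()
```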
- Short sequences perform well, but longer sequences suffer from instability at high learning rates.
- Lower learning rates stabilize training, but may lead to slower convergence or cause the model to get stuck in a local minimum.
- The performance degradation seems tied to Linformer’s low-rank projections, which struggle to learn key patterns in data-scarce settings.
To overcome the limitations of Linformer, I introduce Conv-Linformer, a hybrid architecture that combines the best of both worlds:
- Linear projections in the early layers to retain global context efficiently.
- 1D convolution in the later layers to capture local patterns.
This design allows Conv-Linformer to combine the benefits of Linformer’s efficient global context modeling with convolution’s ability to capture fine-grained local structures.
Here, the key and value matrices are compressed by 1D convolutions with kernels $F_k$ and $F_v$ (kernel size and stride $n/k$), preserving linear complexity while enhancing local feature extraction.
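A minimal sketch of this convolutional compression step, as used in the later layers, is shown below (an illustrative module of my own; the early layers keep Linformer's linear projections as above):

```python
import torch
import torch.nn as nn

class ConvCompressedAttention(nn.Module):
    """Single-head attention with conv-based K/V compression (illustrative sketch)."""
    def __init__(self, d_model: int, seq_len: int, k: int):
        super().__init__()
        assert seq_len % k == 0
        window = seq_len // k                    # kernel size and stride n/k
        self.q = nn.Linear(d_model, d_model)
        self.kp = nn.Linear(d_model, d_model)
        self.vp = nn.Linear(d_model, d_model)
        # 1D convolutions compress the sequence dimension n -> k.
        self.conv_k = nn.Conv1d(d_model, d_model, kernel_size=window, stride=window)
        self.conv_v = nn.Conv1d(d_model, d_model, kernel_size=window, stride=window)
        self.scale = d_model ** -0.5

    def forward(self, x):                        # x: (batch, n, d_model), n == seq_len
        q = self.q(x)
        k = self.conv_k(self.kp(x).transpose(1, 2)).transpose(1, 2)  # (batch, k, d)
        v = self.conv_v(self.vp(x).transpose(1, 2)).transpose(1, 2)  # (batch, k, d)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v
```

In the experiments, this design showed: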
- Improved training stability across sequence lengths.
- More consistent performance than Linformer in constrained settings.
- Transformer-level performance, with linear complexity and minimal overhead from convolution.
Below are validation perplexity curves for Transformer, Linformer and Conv-Linformer across different sequence lengths:
- Conv-Linformer maintains more stable training and performs on par with the Transformer.
- Linformer's performance degrades as sequence length and learning rate increase.
We compare the average inference time per sample for each model across different sequence lengths:
| Sequence Length | Linformer (s) | Conv-Linformer (s) | Transformer (s) |
|---|---|---|---|
| 512 | 0.133 | 0.203 | 0.161 |
| 1024 | 0.249 | 0.303 | 0.346 |
| 2048 | 0.483 | 0.520 | 0.916 |
| 4096 | 0.989 | 1.082 | 2.676 |
Key observations:
- Conv-Linformer introduces minimal overhead.
- Time scales linearly with sequence length for both Linformer and Conv-Linformer.
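For reference, per-sample latency of this kind can be measured with a simple timing loop such as the sketch below; the exact benchmarking script, batch size, and hardware behind the table above are not specified here.

```python
import time
import torch

@torch.no_grad()
def avg_inference_time(model, example_input, n_warmup=3, n_runs=20):
    """Average forward-pass time in seconds for a single sample."""
    device = next(model.parameters()).device
    model.eval()
    for _ in range(n_warmup):                 # warm-up runs (CUDA kernels, caches)
        model(example_input)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(example_input)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```

For example, `avg_inference_time(LinformerSelfAttention(512, 1024, 256), torch.randn(1, 1024, 512))` would time the single-head sketch above at sequence length 1024.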
- 📄 Linformer: Self-Attention with Linear Complexity
- 📄 Conv-Linformer: Boosting Linformer's Performance with Convolution in Small-Scale Settings
- 🔗 lucidrains/linformer
- 📚 Salesforce/wikitext
- Scale experiments to larger datasets and sequence lengths.
- Investigate additional compression strategies.
- Evaluate performance on downstream NLP benchmarks.



