A complete, scalable Mixture-of-Experts language model architecture built entirely from scratch. Ready for billion-parameter pretraining.
H64LM is a modern transformer architecture implementing techniques used in leading LLMs (GPT-4, Claude, LLaMA). Every component (attention, MoE routing, tokenization, training loop) is built from first principles with no black boxes.
Current demo: 249M parameters trained on 100K tokens to validate the pipeline. The architecture scales to billions of parameters and trillion-token datasets.
Key Features:
- Mixture-of-Experts with load balancing
- Grouped-Query Attention (GQA) for efficient inference
- RoPE + ALiBi positional encoding
- Multi-GPU support (DataParallel/DDP)
- Mixed-precision (FP16) training
- Checkpoint resumption and atomic saves
- Fully documented and modular code
H64LMConfig(
vocab_size=32000, # Mistral tokenizer
hidden_size=768,
num_layers=6, # Alternating dense/sparse
num_attention_heads=12,
num_kv_heads=4, # Grouped-Query Attention
num_experts=8, # MoE routing
num_experts_per_token=2, # Top-2 routing
max_position_embeddings=1024,
sliding_window_size=2048,
attention_type="alibi", # ALiBi + RoPE hybrid
)

| Component | Implementation | Benefit |
|---|---|---|
| Attention | Grouped-Query Attention (GQA) | 3× smaller KV cache, faster inference |
| Position Encoding | RoPE + ALiBi hybrid | Length extrapolation beyond training |
| Feedforward | SwiGLU activation | Better than GELU in modern LLMs |
| Expert Routing | Top-2 sparse MoE | Scale capacity without a proportional compute cost |
| Normalization | RMSNorm (pre-norm) | Faster and more stable than LayerNorm |
| Attention Masking | Causal + Sliding Window | Efficient long-context modeling |
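To make the table concrete, here is a minimal, illustrative sketch of Grouped-Query Attention (not H64LM's exact implementation) using PyTorch's built-in `scaled_dot_product_attention`; the shapes follow the demo config (12 query heads sharing 4 KV heads, head_dim = 768 / 12 = 64):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_kv_heads):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, num_kv_heads, seq, head_dim)."""
    group = q.size(1) // num_kv_heads        # query heads per KV head (12 // 4 = 3)
    k = k.repeat_interleave(group, dim=1)    # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q = torch.randn(1, 12, 16, 64)               # 12 query heads
k = torch.randn(1, 4, 16, 64)                # only 4 KV heads need to live in the KV cache
v = torch.randn(1, 4, 16, 64)
out = grouped_query_attention(q, k, v, num_kv_heads=4)   # (1, 12, 16, 64)
```

The KV cache only stores the 4 KV heads, which is where the 3× memory saving comes from.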
Input Tokens (32K vocab)
↓
┌─────────────────────────┐
│ Layer 0: Dense SwiGLU │ ← Even layers
├─────────────────────────┤
│ Layer 1: MoE (8×2) │ ← Odd layers (sparse)
├─────────────────────────┤
│ Layer 2: Dense SwiGLU │
├─────────────────────────┤
│ Layer 3: MoE (8×2) │
├─────────────────────────┤
│ Layer 4: Dense SwiGLU │
├─────────────────────────┤
│ Layer 5: MoE (8×2) │
└─────────────────────────┘
↓
RMSNorm → LM Head
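The odd layers route each token to 2 of 8 experts. Below is a simplified, illustrative top-2 router with a Switch-style load-balancing auxiliary loss; the plain SiLU expert MLP is a stand-in (H64LM's experts use SwiGLU), and none of the names are the repository's own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size=768, num_experts=8, ffn_size=3072):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])
        self.num_experts = num_experts

    def forward(self, x):                          # x: (batch, seq, hidden)
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)
        topv, topi = probs.topk(2, dim=-1)         # top-2 experts per token
        topv = topv / topv.sum(-1, keepdim=True)   # renormalize the two gates

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(2):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(tokens[mask])

        # Load-balancing: penalize routing that concentrates tokens on a few experts
        load = F.one_hot(topi, self.num_experts).float().mean(dim=(0, 1))
        importance = probs.mean(dim=0)
        aux_loss = self.num_experts * (load * importance).sum()
        return out.view_as(x), aux_loss
```

The auxiliary loss is added to the language-modeling loss with a small coefficient so the router keeps expert utilization roughly uniform.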
# Install dependencies
pip install -r requirements.txt

requirements.txt:
# Core dependencies
numpy>=1.21.0
tqdm>=4.62.0
matplotlib>=3.4.0
# PyTorch with CUDA 12.1
--index-url https://download.pytorch.org/whl/cu121
torch>=2.0.0
# Hugging Face
datasets>=2.14.0
transformers>=4.30.0
tokenizers>=0.13.0
# Optional (recommended)
tensorboard>=2.13.0
flash-attn>=2.0.0 # GPU-specific, may require manual install
git clone https://github.com/Haiderkhan64/H64LM.git
cd H64LM

The repository includes pretrained tokenizer and checkpoint files via Git LFS.
Option 1: Run the Jupyter Notebook
jupyter notebook H64LM-v1.ipynb

Option 2: Run as Python Script

python H64LM-v1.py

The script will:
- Load tokenizer from ./mistral_tokenizer/ (see the sketch after this list)
- Download WikiText-103 dataset (or use local copy)
- Train the model with default config
- Save checkpoints to ./checkpoints_h64lm/
- Generate sample text outputs
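The first two steps can be reproduced interactively; the path and dataset name below mirror the script's defaults (adjust if your copy differs):

```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("./mistral_tokenizer")   # local tokenizer files
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
print(len(dataset), "raw training samples")
```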
Modify hyperparameters in the script:
config = H64LMConfig(
vocab_size=32000,
hidden_size=768, # Scale to 4096+ for large models
num_layers=6, # Scale to 32+ layers
num_experts=8,
max_position_embeddings=1024, # Increase for longer contexts
# ... see full config in code
)

For billion-parameter models:
config = H64LMConfig(
hidden_size=4096, # LLaMA-7B / GPT-3 6.7B width
num_layers=32,
num_attention_heads=32,
num_kv_heads=8,
num_experts=16,
max_position_embeddings=4096,
# Update dataset
max_samples=-1, # Use full dataset
)

For large datasets:
- Set max_samples=-1 to use full WikiText-103 (1.8M samples)
- Use streaming mode (streaming=True) for massive datasets, as sketched below
- Follow the 5-stage training pipeline (see references)
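A minimal sketch of streaming mode, where samples arrive lazily instead of being downloaded to disk first:

```python
from datasets import load_dataset

stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
for i, example in enumerate(stream):
    if i >= 3:               # peek at a few samples without materializing the corpus
        break
    print(example["text"][:80])
```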
This implementation supports the standard LLM training pipeline:
python H64LM-v1.py # Next-token prediction on raw text

Add instruction-following datasets (Alpaca, Dolly, etc.):
dataset = load_dataset("tatsu-lab/alpaca")
# Continue training with instruction format

Implement reward model + PPO:
# Train reward model on human preferences
# Apply PPO to optimize policy

Add chain-of-thought data:
dataset = load_dataset("gsm8k", "main")  # Math reasoning
# Fine-tune with CoT examples

Fine-tune on dialogue data:
dataset = load_dataset("OpenAssistant/oasst1")
# Multi-turn conversation training

See the Medium article for full pipeline details.
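For the instruction-tuning stage above, records usually get rendered into a fixed prompt template before training. The template below is a common, hypothetical choice, not a format the repository prescribes:

```python
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def to_training_text(example):
    # Hypothetical Alpaca-style template; keep one template consistent across training
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += "### Response:\n"
    return {"text": prompt + example["output"]}

sft_dataset = dataset.map(to_training_text)
```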
config = H64LMConfig(
resume_from="./checkpoints_h64lm/best_model.pt",
num_epochs=10 # Additional epochs
)

The checkpoint loader automatically handles:
- DataParallel module. prefix stripping
- Partial state dict loading
- Optimizer and scheduler restoration (sketched below)
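A minimal sketch of that resume logic, assuming the checkpoint format documented below (the function name and defaults are illustrative, not the repository's code):

```python
import torch

def load_resumable_checkpoint(path, model, optimizer=None, scheduler=None):
    ckpt = torch.load(path, map_location="cpu")
    # Strip DataParallel's "module." prefix if the model was saved while wrapped
    state = {k.removeprefix("module."): v for k, v in ckpt["model_state_dict"].items()}
    # strict=False tolerates partial state dicts (e.g., after a config change)
    model.load_state_dict(state, strict=False)
    if optimizer is not None and "optimizer_state_dict" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    if scheduler is not None and "scheduler_state_dict" in ckpt:
        scheduler.load_state_dict(ckpt["scheduler_state_dict"])
    return ckpt.get("epoch", 0), ckpt.get("step", 0)
```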
checkpoint = {
    'epoch': int,
    'step': int,
    'model_state_dict': OrderedDict,
    'optimizer_state_dict': dict,
    'scheduler_state_dict': dict,
}
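The "atomic saves" listed under Key Features usually come down to writing to a temporary file and renaming it into place, so a crash mid-save never corrupts an existing checkpoint. A generic sketch, not necessarily the repository's exact code:

```python
import os
import torch

def save_checkpoint_atomically(state, path):
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)      # write the full checkpoint to a temp file first
    os.replace(tmp_path, path)       # atomic rename on the same filesystem
```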
from h64lm_v1 import H64LMForCausalLM, test_generation
from transformers import AutoTokenizer
import torch

# Load model (assumes `config` is defined as in the configuration section above)
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("./mistral_tokenizer")
model = H64LMForCausalLM(config, tokenizer)
checkpoint = torch.load("checkpoints_h64lm/best_model.pt", map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device).eval()
# Generate text
test_generation(
model, tokenizer, device,
prompt="Deep learning is",
max_new_tokens=100,
temperature=0.8,
top_p=0.9
)
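test_generation is the repository's helper. Its temperature and top_p arguments imply standard temperature scaling plus nucleus (top-p) sampling; here is a generic sketch of that decoding step, not the repository's code:

```python
import torch

@torch.no_grad()
def sample_next_token(logits, temperature=0.8, top_p=0.9):
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep the smallest prefix of tokens whose cumulative probability reaches top_p
    keep = cumulative - sorted_probs < top_p
    keep[..., 0] = True                                   # never drop the top token
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)                  # map back to vocabulary ids
```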
| Parameter | Demo Value | Production |
|---|---|---|
| `hidden_size` | 768 | 4096-8192 |
| `num_layers` | 6 | 32-96 |
| `num_attention_heads` | 12 | 32-128 |
| `num_kv_heads` | 4 | 8-16 |
| `num_experts` | 8 | 8-64 |
| `max_position_embeddings` | 1024 | 4096-32768 |
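As a sanity check on the demo column, a back-of-the-envelope count under a few assumptions (untied LM head, SwiGLU intermediate size of 4× hidden, norms/biases/router weights ignored) lands close to the reported 249M:

```python
def estimate_params(vocab_size=32000, hidden=768, layers=6,
                    n_heads=12, n_kv_heads=4, n_experts=8, ffn_mult=4):
    head_dim = hidden // n_heads
    attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim   # Q/O plus GQA K/V
    ffn = 3 * hidden * (ffn_mult * hidden)                            # SwiGLU: gate, up, down
    dense_layers = (layers + 1) // 2                                  # even layers are dense
    moe_layers = layers // 2                                          # odd layers are MoE
    body = dense_layers * (attn + ffn) + moe_layers * (attn + n_experts * ffn)
    embeddings = 2 * vocab_size * hidden                              # token embeddings + LM head
    return body + embeddings

print(f"~{estimate_params() / 1e6:.0f}M parameters for the demo config")
```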
| Parameter | Value | Notes |
|---|---|---|
| `batch_size` | 16 | Per-GPU |
| `grad_accum_steps` | 8 | Effective batch: 128 |
| `learning_rate` | 3e-4 | With warmup |
| `weight_decay` | 0.01 | Excludes norms/biases |
| `use_fp16` | True | Mixed precision |
| `gradient_checkpointing` | True | Memory efficient |
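The weight-decay and effective-batch rows translate roughly into the following setup (a sketch; the parameter-grouping heuristic is illustrative, not the script's exact code):

```python
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.01):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters are biases and norm scales; they get no weight decay
        (no_decay if param.ndim < 2 else decay).append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )

# Effective batch size = batch_size * grad_accum_steps = 16 * 8 = 128
```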
- Primary: Educational reference for LLM architecture
- Secondary: Foundation for research and experimentation
- Production: Scales to billion-parameter pretraining
- Next-token prediction
- Text generation
- Multi-GPU training
- Checkpoint management
- Ready for instruction tuning and RLHF
- Demo model trained on small dataset (pipeline validation only)
- Requires additional training stages for chat/alignment
- No built-in safety filters (add in Stage 3: RLHF)
- Code: Apache 2.0 License
- Tokenizer: Apache 2.0 (Mistral AI)
- Dataset: Creative Commons (WikiText-103)
H64LM/
├── H64LM-v1.ipynb # Main notebook
├── H64LM-v1.py # Standalone script
├── requirements.txt # Dependencies
├── LICENSE # Apache 2.0
├── README.md # This file
├── .gitattributes # LFS config
├── mistral_tokenizer/ # Pretrained tokenizer (LFS)
│ ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   └── tokenizer.model
└── checkpoints_h64lm/       # Saved models (LFS)
    ├── tokenizer
    ├── best_model_state_dict.pt
    └── training_history.json
- Attention Is All You Need (the Transformer)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)
- GLU Variants Improve Transformer (SwiGLU)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (sparse MoE)
- Root Mean Square Layer Normalization (RMSNorm)
See Medium Article: LLM Training Pipeline for the complete 5-stage training guide (Pretraining → Instruction Tuning → RLHF → Reasoning → Chat).
Contributions welcome for:
- Scaling experiments (billion-parameter runs)
- Additional training stages (RLHF, instruction tuning)
- Optimization improvements
- Documentation enhancements
Guidelines: Keep the educational focus—avoid wrapping complexity in high-level abstractions.
@software{h64lm2025,
author = {Haider Khan},
title = {H64LM: MoE Transformer},
year = {2025},
url = {https://github.com/Haiderkhan64/H64LM}
}

Author: Haider Khan
GitHub: @Haiderkhan64
Medium: LLM Training Articles
For questions or collaboration, open an issue on GitHub.
Last Updated: January 2026
Status: Architecture complete | Ready for large-scale training