This repository contains curated lists of special tokens used by various Large Language Models (LLMs), collected and analyzed for security research purposes.
Special tokens—such as control sequences, formatting markers, and reserved vocabulary—can influence model behavior in subtle or undocumented ways. Understanding these tokens is critical for:
- 🔍 Auditing model behavior and prompt injection risks
- 🧪 Fuzzing and adversarial testing
- 🧰 Building robust token-level filters and sanitizers (see the sketch after this list)
- 📚 Reverse-engineering tokenizer internals
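As a minimal sketch of the filtering use case, the snippet below enumerates a tokenizer's declared special tokens and strips their literal strings from untrusted input. It assumes the Hugging Face `transformers` library and uses GPT-2's tokenizer purely as a stand-in; substitute the tokenizer of the model you are actually auditing.

```python
from transformers import AutoTokenizer

# Stand-in model for illustration; swap in the tokenizer of the model under audit.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# `all_special_tokens` covers BOS/EOS/pad/unk plus any
# `additional_special_tokens` the tokenizer declares.
special_tokens = set(tokenizer.all_special_tokens)
print(special_tokens)  # GPT-2 declares only '<|endoftext|>'

def sanitize(user_input: str) -> str:
    """Strip literal special-token strings so untrusted text cannot
    smuggle control sequences into a prompt."""
    for token in special_tokens:
        user_input = user_input.replace(token, "")
    return user_input

print(sanitize("hello <|endoftext|> world"))  # -> 'hello  world'
```

Note that stripping matches is a blunt instrument: depending on your threat model, rejecting the input or escaping the matches may be safer, since silent removal can change the meaning of otherwise benign text.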
## 📬 Contributions

Pull requests are welcome! If you've explored special tokens in other models (e.g., GPT, LLaMA, Mistral), feel free to share your findings.