A Go implementation of Byte Pair Encoding (BPE), inspired by Andrej Karpathy's tutorial. It supports OpenAI's tiktoken file format, so you can fetch pretrained encodings directly from OpenAI's GitHub.
- Tokenizes arbitrary byte sequences (not just text!)
- Special token support with whitelisting
- Regex-based input splitting
⚠️ Not a drop-in replacement for OpenAI's tokenizer
├── bpeprocessor.go              # Interface definition
├── go.mod                       # Module config
├── README.md                    # You're reading it!
├── regextiktokenproc.go         # Regex-enhanced BPE processor
├── regextiktokenproc_test.go
├── tiktokenproc.go              # Core BPE for OpenAI's .tiktoken format
├── tiktokenproc_test.go
└── testdata/
    └── cl100k_base.tiktoken     # Sample encoding data
- Fuzz testing in Go is powerful: it was used to test `decode(encode(x)) == x` across edge cases
- UTF-8 is full of surprises: beware of multi-byte characters
- Byte slice manipulation in Go can be... tricky & annoying
- Go's regex capabilities are fundamentally different from Python's (the standard `regexp` package uses RE2 syntax, with no lookarounds or backreferences), so beware of surprises!
The file testdata/cl100k_base.tiktoken is under the MIT License. © 2022 OpenAI, Shantanu Jain.
This project itself is also licensed under the MIT License. Feel free to use, fork, or contribute!