
A toy Byte Pair Encoding implementation in Go

bluuuk/bpe-go

🚀 Go-based Byte Pair Encoding (BPE)

A Go implementation of Byte Pair Encoding (BPE), inspired by Andrej Karpathy's tutorial. It supports OpenAI's tiktoken file format, so you can fetch pretrained encodings directly from OpenAI's GitHub 📦.

✨ Features

  • 🔤 Tokenizes arbitrary byte sequences (not just text!)
  • 🧩 Special token support with whitelisting
  • 🧪 Regex-based input splitting
  • ⚠️ Not a drop-in replacement for OpenAI’s tokenizer
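To make the byte-level tokenization concrete, here is a minimal sketch of the merge loop at the heart of BPE encoding: start with one part per byte, then repeatedly merge the adjacent pair whose concatenation has the lowest rank. The names `encodeChunk` and `decode` and the tiny vocabulary are assumptions for this example; the repository's actual processor may be structured differently.

```go
package main

import "fmt"

// encodeChunk applies BPE merges to one chunk of raw bytes.
// ranks maps token bytes to ids; a lower id means the merge happens earlier.
// Assumes every single byte that occurs in data has an entry in ranks.
func encodeChunk(data []byte, ranks map[string]int) []int {
	// bounds[i] and bounds[i+1] delimit part i; start with one part per byte.
	bounds := make([]int, len(data)+1)
	for i := range bounds {
		bounds[i] = i
	}
	for len(bounds) > 2 {
		// Find the adjacent pair whose merged bytes have the lowest rank.
		best, bestRank := -1, 0
		for i := 0; i+2 < len(bounds); i++ {
			pair := string(data[bounds[i]:bounds[i+2]])
			if r, ok := ranks[pair]; ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			break // no mergeable pair is left
		}
		// Dropping the boundary between the two parts merges them.
		bounds = append(bounds[:best+1], bounds[best+2:]...)
	}
	ids := make([]int, 0, len(bounds)-1)
	for i := 0; i+1 < len(bounds); i++ {
		ids = append(ids, ranks[string(data[bounds[i]:bounds[i+1]])])
	}
	return ids
}

// decode concatenates the bytes of each token id.
func decode(ids []int, vocab map[int]string) []byte {
	var out []byte
	for _, id := range ids {
		out = append(out, vocab[id]...)
	}
	return out
}

func main() {
	// Toy vocabulary, made up for this example.
	ranks := map[string]int{"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
	vocab := make(map[int]string, len(ranks))
	for tok, id := range ranks {
		vocab[id] = tok
	}
	ids := encodeChunk([]byte("abcab"), ranks)
	fmt.Println(ids, string(decode(ids, vocab))) // prints: [4 3] abcab
}
```

Because the loop works on `[]byte` rather than on runes, nothing in it requires the input to be valid UTF-8 text.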

🗂️ Project Structure

├── bpeprocessor.go          # Interface definition
├── go.mod                   # Module config
├── README.md                # You're reading it!
├── regextiktokenproc.go     # Regex-enhanced BPE processor
├── regextiktokenproc_test.go
├── tiktokenproc.go          # Core BPE for OpenAI's .tiktoken format
├── tiktokenproc_test.go
└── testdata/
    └── cl100k_base.tiktoken # Sample encoding data

💡 Key Takeaways

  • 🧪 Fuzz testing in Go is powerful: it was used to check decode(encode(x)) == x across edge cases
  • 🌐 UTF-8 is full of surprises: beware of multi-byte characters
  • 🛠️ Byte slice manipulation in Go can be... tricky & annoying 😅
  • 🔍 Go’s regex capabilities are fundamentally different from Python’s 🐍 (Go’s regexp package follows RE2 syntax and has no lookahead, for example): beware of surprises!

📄 License

The file testdata/cl100k_base.tiktoken is under the MIT License. © 2022 OpenAI, Shantanu Jain.

This project itself is also licensed under the MIT License. Feel free to use, fork, or contribute!
