Cijak

A high-density, Unicode-based encoding library that packs binary data into CJK characters. Cijak encodes up to 14 bits per character, making it significantly more character-efficient than Base64 while maintaining competitive performance through optimized C extensions.

The name "Cijak" comes from the CJK Unicode block, where each character can carry 14 bits of data: C1J4K, with the digits 1 and 4 read as "i" and "a".

Why Cijak?

Cijak is designed for scenarios where:

Character count matters more than byte size (SMS, tweets, chat apps)
You need Unicode-safe encoding that doesn't look like obvious Base64

Performance Comparison

Benchmarked on 3.7 KB of binary data:

Metric	Base64	Cijak	Cijak (Python Fallback)
Encode	3.78 μs	4.21 μs	1340 μs
Decode	8.19 μs	5.24 μs	1338 μs
Character count	4,944	2,120	---
Character compression	-33%	43%	---
UTF-8 size	4,944 B	6,360 B	---
UTF-16 size	9,890 B	4,242 B	---

TL;DR: Cijak needs 43% fewer characters than the input has bytes (57% fewer than Base64), while Base64 increases the count by 33%. Encoding speed is close to Base64's (about 11% slower in this benchmark), and decoding is about 36% faster.

Note on Byte Size (UTF-8 vs UTF-16):

Since CJK characters take 3 bytes in UTF-8 but only 2 bytes in UTF-16 (or UCS-2), Cijak is byte-efficient only in environments that use UTF-16 (like many older Windows/Java internal systems) or where limits are enforced per character rather than per byte (like SMS). In standard UTF-8 environments, Base64 is smaller in total byte size (4,944 B vs 6,360 B in the benchmark above), but Cijak still uses 57% fewer characters than Base64.
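
The figures in the table can be reproduced with simple arithmetic. The sketch below assumes the benchmark input was exactly 3,708 bytes (the text above only says 3.7 KB), 14 bits per data character plus one marker character, and 3 bytes per Cijak character in UTF-8. The function name size_estimates is illustrative and not part of the library.

```python
def size_estimates(n_bytes: int, bits_per_char: int = 14) -> dict:
    """Estimate character counts and encoded sizes for n_bytes of input."""
    # Padded Base64 emits 4 characters for every 3 input bytes (rounded up).
    base64_chars = 4 * -(-n_bytes // 3)
    # Cijak: ceil(total bits / bits per character), plus one marker character.
    cijak_chars = -(-(n_bytes * 8) // bits_per_char) + 1
    return {
        "base64_chars": base64_chars,
        "cijak_chars": cijak_chars,
        "base64_utf8": base64_chars,      # ASCII: 1 byte per character
        "cijak_utf8": cijak_chars * 3,    # CJK and marker: 3 bytes each
        "base64_utf16": base64_chars * 2, # without a byte order mark
        "cijak_utf16": cijak_chars * 2,   # without a byte order mark
    }


est = size_estimates(3708)  # assumed size of the benchmark input
print(est["base64_chars"], est["cijak_chars"])  # 4944 2120
print(est["base64_utf8"], est["cijak_utf8"])    # 4944 6360
```

The UTF-16 estimates come out 2 bytes below the table (9,888 and 4,240), which would be consistent with the table's figures including a 2-byte byte order mark.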

Installation

pip install cijak

Pre-compiled C extensions (wheels) are available for:

Linux (x86_64, aarch64)
macOS (Intel, Apple Silicon)
Windows (x86_64)
Python 3.8-3.14

If a wheel isn't available, the package automatically falls back to a pure Python implementation (~300x slower, but functionally identical).

Quick Start

from cijak import Cijak

# Initialize encoder (uses CJK Unicode block by default)
encoder = Cijak()

# Encode binary data
data = b'Hello, World!'
encoded = encoder.encode(data)
print(encoded)  # ㇈怙擆羼稠換蔦羐漀

# Decode back to bytes
decoded = encoder.decode(encoded)
print(decoded)  # b'Hello, World!'

Advanced Configuration

Custom Unicode Ranges

You can use different Unicode blocks for encoding:

# Use a different range (e.g., Hangul)
encoder = Cijak(
    unicode_range_start=0xAC00,  # Hangul Syllables start
    unicode_range_end=0xD7A3,    # Hangul Syllables end
    marker_base=0x3200           # Different marker range
)

data = b"Custom encoding!"
encoded = encoder.encode(data)

Important: The Unicode range must not contain control characters. The library automatically calculates the optimal bit-packing based on your range size.
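
To see how range size translates into bit density, here is a minimal sketch. It assumes the library uses the largest whole number of bits that fits in the range, i.e. floor(log2(size)), and that unicode_range_end is inclusive; neither detail is confirmed here, and the helper name bits_per_char is hypothetical.

```python
def bits_per_char(range_start: int, range_end: int) -> int:
    """Largest number of bits that every value in the range can represent."""
    size = range_end - range_start + 1  # assumption: end is inclusive
    # For a positive integer, bit_length() - 1 equals floor(log2(size)).
    return size.bit_length() - 1


print(bits_per_char(0x4E00, 0x9FFF))  # 14 (CJK Unified Ideographs, 20,992 code points)
print(bits_per_char(0xAC00, 0xD7A3))  # 13 (Hangul Syllables, 11,172 code points)
```

Under these assumptions the Hangul range above would carry 13 bits per character instead of 14, so the same data needs slightly more characters.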

Technical Deep Dive

Encoding Scheme

Default: CJK Unified Ideographs (U+4E00 to U+9FFF)
Bit Density: 14 bits/character (calculated from range size)
Padding Marker: A single character (base U+31C0) stores the number of padding bits required for decoding.
Efficiency: ~1.75 bytes per character (vs Base64's 0.75)

How It Works

Binary data is read as a continuous stream and packed into 14-bit chunks
Each chunk is mapped to a CJK codepoint
First character is a marker indicating padding
Leftover bits are left-aligned in the last character, with the unused low bits filled with zero padding
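
The steps above can be sketched in pure Python. This is an illustration, not the library's code: the bit order (most significant bit first), the placement of zero padding in the low bits of the last character, and the marker being U+31C0 plus the padding count are inferred from the Quick Start output, and the names sketch_encode and sketch_decode are hypothetical. It uses Python's arbitrary-precision integers for brevity rather than speed.

```python
BITS = 14             # bits per data character
RANGE_START = 0x4E00  # first CJK Unified Ideograph
MARKER_BASE = 0x31C0  # marker code point = base + number of padding bits


def sketch_encode(data: bytes) -> str:
    n_bits = len(data) * 8
    padding = (-n_bits) % BITS  # zero bits needed to fill the last chunk
    value = int.from_bytes(data, "big") << padding
    n_chunks = (n_bits + padding) // BITS
    mask = (1 << BITS) - 1
    chars = [chr(MARKER_BASE + padding)]  # marker goes first
    for i in range(n_chunks - 1, -1, -1):  # most significant chunk first
        chars.append(chr(RANGE_START + ((value >> (i * BITS)) & mask)))
    return "".join(chars)


def sketch_decode(text: str) -> bytes:
    padding = ord(text[0]) - MARKER_BASE
    value = 0
    for ch in text[1:]:
        value = (value << BITS) | (ord(ch) - RANGE_START)
    n_bits = (len(text) - 1) * BITS - padding
    # Drop the padding bits, then restore the original byte length.
    return (value >> padding).to_bytes(n_bits // 8, "big")


encoded = sketch_encode(b"Hello, World!")
assert sketch_decode(encoded) == b"Hello, World!"
print(len(encoded))          # 9: one marker plus eight data characters
print(hex(ord(encoded[0])))  # 0x31c8: marker base plus 8 padding bits
```

For the 13-byte input, 104 bits need eight 14-bit chunks (112 bits), leaving 8 padding bits, which is why the marker is U+31C8.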

Performance Notes

C extension: Direct memory manipulation, zero Python overhead
Fallback: BitReader/BitWriter abstraction in pure Python
Memory: Single-pass encoding/decoding, minimal allocations
Thread-safe: No global state, safe for concurrent use

Building from Source

If you need to build the C extension manually:

git clone https://github.com/NobreHD/Cijak.git
cd Cijak
pip install -e .

Requirements:

C compiler (GCC, Clang, MSVC)
Python development headers

Contributing

Contributions welcome! Areas of interest:

SIMD optimizations for bulk encoding
Additional Unicode range presets
Streaming API for large files
Alternative padding schemes

License

GNU General Public License v3.0 or later (GPLv3+)
