765 changes: 525 additions & 240 deletions README.md

361 changes: 313 additions & 48 deletions SPEC.md

129 changes: 57 additions & 72 deletions zon-format/CHANGELOG.md
@@ -1,10 +1,60 @@
# Changelog

All notable changes to the ZON Format project will be documented in this file.
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.4] - 2025-11-30

### Added
- **Colon-less Syntax:** Objects and arrays in nested positions now use `key{...}` and `key[...]` syntax, removing redundant colons.
- **Smart Flattening:** Top-level nested objects are automatically flattened to dot notation (e.g., `config.db{...}`).
- **Control Character Escaping:** All control characters (ASCII 0-31) are now properly escaped to prevent binary file creation.
- **Runtime Schema Validation:** New `zon` builder and `validate()` function for LLM guardrails.
- **Algorithmic Benchmark Generation**: Replaced LLM-based question generation with a deterministic algorithm for consistent benchmarks.
- **Expanded Dataset**: Added "products" and "feed" data to the unified dataset for real-world e-commerce scenarios.
- **Tricky Questions**: Introduced edge cases (non-existent fields, logic traps, case sensitivity) to stress-test LLM reasoning.
- **Robust Benchmark Runner**: Added exponential backoff and rate limiting to handle Azure OpenAI S0 tier constraints.
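
The control character escaping entry above can be illustrated with a minimal sketch. The function name `escape_control_chars` and the JSON-style escape forms (`\n`, `\r`, `\t`, `\u00XX`) are assumptions made for illustration; this changelog does not state which escape sequences the encoder actually emits.

```python
def escape_control_chars(text: str) -> str:
    """Escape ASCII control characters (code points 0-31) in text.

    Hypothetical helper: the escape forms used here are JSON-style and are
    an assumption, not the documented ZON behaviour.
    """
    # Short, readable escapes for the most common control characters.
    short_escapes = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}
    pieces = []
    for ch in text:
        if ch in short_escapes:
            pieces.append(short_escapes[ch])
        elif ord(ch) < 32:
            # Any other control character becomes a \u00XX escape.
            pieces.append(f"\\u{ord(ch):04x}")
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(escape_control_chars("first line\nsecond\x00line"))
```

Escaping every code point below 32 keeps the encoded output printable, which is the goal the entry describes.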

### Changed
- **Benchmark Formats**: Refined tested formats to ZON, TOON, JSON, JSON (Minified), and CSV for focused analysis.
- **Documentation**: Updated README and API references with the latest benchmark results (GPT-5 Nano) and accurate token counts.
- **Token Efficiency**: Recalculated efficiency scores based on the expanded dataset, confirming ZON's leadership (1430.6 score).

### Improved
- **Token Efficiency:** Achieved up to 23.8% reduction vs JSON (GPT-4o) thanks to syntax optimizations.
- **Readability:** Cleaner, block-like structure for nested data.

### Fixed
- **Critical Data Integrity**: Fixed roundtrip failures for strings containing newlines, empty strings, and escaped characters.
- **Decoder Logic**: Fixed `_split_by_delimiter` to correctly handle nested arrays and objects within table cells (e.g., `[10, 20]`).
- **Encoder Logic**: Added mandatory quoting for empty strings and strings with newlines to prevent data loss.
- **Rate Limiting**: Resolved 429 errors during benchmarking with robust retry logic.
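
As a rough illustration of the decoder fix above, the sketch below splits a table row on a delimiter while ignoring delimiters inside brackets, braces, or quoted strings. It is not the library's `_split_by_delimiter`; the name `split_top_level` and the quoting and escape handling are assumptions.

```python
from typing import List


def split_top_level(row: str, delimiter: str = ",") -> List[str]:
    """Split row on delimiter, skipping delimiters nested in [], {} or quotes.

    Hypothetical helper for illustration only.
    """
    parts: List[str] = []
    current: List[str] = []
    depth = 0          # nesting depth of [] and {}
    in_quotes = False  # True while inside a double-quoted string
    i = 0
    while i < len(row):
        ch = row[i]
        if in_quotes:
            current.append(ch)
            if ch == "\\" and i + 1 < len(row):
                # Keep the escaped character verbatim and skip past it.
                current.append(row[i + 1])
                i += 1
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            current.append(ch)
        elif ch in "[{":
            depth += 1
            current.append(ch)
        elif ch in "]}":
            depth = max(depth - 1, 0)
            current.append(ch)
        elif ch == delimiter and depth == 0:
            # Only a delimiter at depth 0 ends the current cell.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_top_level('1,[10, 20],"a,b",{x:1,y:2}'))
```

Tracking depth rather than splitting naively is what keeps a cell such as `[10, 20]` in one piece.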

## [1.0.3] - 2025-11-28

### 🎯 100% LLM Retrieval Accuracy Achieved

**Major Achievement**: ZON now achieves **100% LLM retrieval accuracy** while maintaining superior token efficiency over TOON!

### Changed
- **Explicit Sequential Columns**: Disabled automatic sequential column omission (`[id]` notation)
  - All columns are now explicitly listed in table headers for better LLM comprehension
  - Example: `users:@(5):active,id,lastLogin,name,role` (was `users:@(5)[id]:active,lastLogin,name,role`)
  - Trade-off: +1.7% token increase for 100% LLM accuracy
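
The explicit header shape in the example above can be reproduced with a short sketch. The function name `explicit_table_header` is hypothetical, and the alphabetical column ordering is an assumption inferred from the single example; the encoder's real ordering rule is not stated here.

```python
from typing import Any, Dict, List


def explicit_table_header(name: str, rows: List[Dict[str, Any]]) -> str:
    """Build a header of the form name:@(count):col1,col2,...

    Hypothetical helper; column order is assumed to be alphabetical.
    """
    # Collect every key that appears in any row, then sort for a stable order.
    columns = sorted({key for row in rows for key in row})
    return f"{name}:@({len(rows)}):{','.join(columns)}"


if __name__ == "__main__":
    users = [
        {"id": i, "name": f"user{i}", "role": "member",
         "active": True, "lastLogin": "2025-11-28"}
        for i in range(1, 6)
    ]
    print(explicit_table_header("users", users))
```

Every column, including the sequential `id`, appears in the header, which matches the change described above.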

### Performance
- **LLM Accuracy**: 100% (24/24 questions) vs TOON 100%, JSON 91.7%
- **Token Efficiency**: 19,995 tokens (5.0% fewer than TOON's 20,988)
- **Overall Savings vs TOON**: 4.6% (Claude) to 17.6% (GPT-4o)

### Quality
- ✅ All unit tests pass (28/28)
- ✅ All roundtrip tests pass (27/27 datasets)
- ✅ No data loss or corruption
- ✅ Production ready

## [1.0.2] - 2025-11-24

### Changed - "ClearText" Major Format Overhaul
@@ -14,8 +64,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **YAML-like metadata**: Changed from `M=key="val"` to clean `key:val` syntax
- **Clean @table syntax**: Replaced schema markers with readable `@tablename(count):cols`
- **Aggressive quote removal**: Only quote when absolutely necessary (commas, control chars)
  - Spaces no longer trigger quoting: `Blue Lake Trail` instead of `"Blue Lake Trail"`
  - Colons allowed in values
- **Compact array syntax**: `[item1,item2,item3]` with minimal inner quotes
- **No spaces after separators**: Removed spaces after `:` and `,` for compactness
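
The quoting rule in this list can be sketched as a small predicate. The names `needs_quotes` and `format_value` are hypothetical, and wrapping in plain double quotes is an assumption; only the triggers named in this entry (commas and control characters) are checked.

```python
def needs_quotes(value: str) -> bool:
    """Return True if value contains a comma or an ASCII control character.

    Hypothetical helper: covers only the triggers named in this entry.
    """
    if "," in value:
        return True
    return any(ord(ch) < 32 for ch in value)


def format_value(value: str) -> str:
    """Wrap value in double quotes only when needs_quotes says so."""
    if not needs_quotes(value):
        return value
    # Escape backslashes and quotes so the wrapped value stays unambiguous.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


if __name__ == "__main__":
    for sample in ["Blue Lake Trail", "12:30", "Portland, OR"]:
        print(format_value(sample))
```

Spaces and colons pass through unquoted, matching the `Blue Lake Trail` example. Version 1.0.4 additionally requires quoting for empty strings, which this sketch does not model.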

@@ -24,78 +72,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **25.6% better** than TOON (up from 20.8%)
- Tested on 318 records across 6 real-world datasets

#### New Features
- Singleton bypass: 1-item lists flatten to metadata (`items.0.id:1`)
- Pure list handling: Lists without wrapper use default `@data` table name
- Boolean hard rule: Always explicit `T`/`F`, never inferred from empty cells

#### Documentation
- Comprehensive README.md with visual comparisons
- EXAMPLES.md with detailed symbol reference
- Benchmark sample generation scripts
- `/benchmarks/encoded_samples/` with `.json`, `.zon`, and `.toon` comparisons

### Fixed
- Boolean preservation in roundtrip encoding/decoding
- Array index handling in decoder unflatten logic
- Pure list encoding/decoding (was returning empty string)

## [1.0.0] - 2025-11-23

### Added - Initial Release

#### Core Features
- ZON v7.0 format with pipe-based protocol syntax
- Compression rules: Range (R), Liquid (L), Solid (S), Pattern (P), Value (V)
- Anchor-based row references
- Global dictionary for repeated strings
- ZON v1.0 format implementation
- Full encoder/decoder with lossless round-trips
- CLI tool for encoding/decoding
- Comprehensive test suite

#### Performance
- ~27% average compression vs JSON
- ~21% better than TOON on structured data

#### Package
- Python 3.8+ support
- PyPI distribution
- Apache 2.0 license

---

## Upgrade Notes

### From 1.0.0 to 1.0.2

**⚠️ Breaking Change**: The encoded format has changed completely. Data encoded with v1.0.0 will **not** decode correctly with v1.0.2.

**Migration**: Re-encode your data with v1.0.2:

```python
import json

import zon

# Load your JSON data
with open('data.json') as f:
    data = json.load(f)

# Encode with new format
encoded = zon.encode(data)

# Decode works as before
decoded = zon.decode(encoded)
```

**Benefits**: The new format is much more readable and efficient. The migration is worth it for:
- ✅ 4.5% additional compression
- ✅ Zero protocol overhead
- ✅ Better LLM readability
- ✅ Cleaner visual appearance

---

## Links

- [PyPI](https://pypi.org/project/zon-format/)
- [GitHub](https://github.com/ZON-Format/ZON)
- [Examples](EXAMPLES.md)
- [README](README.md)
[1.0.4]: https://github.com/ZON-Format/ZON/releases/tag/v1.0.4
[1.0.3]: https://github.com/ZON-Format/ZON/releases/tag/v1.0.3
[1.0.2]: https://github.com/ZON-Format/ZON/releases/tag/v1.0.2
[1.0.0]: https://github.com/ZON-Format/ZON/releases/tag/v1.0.0