Canon: Semantic comparison for serialization formats

Table of Contents

Purpose
Installation
Quick start
Documentation
Features
Examples
Architecture
- CompareProfile architecture
Development
Contributing
Copyright and license

Purpose

Canon provides canonicalization, pretty-printing, and semantic comparison for serialization formats (XML, HTML, JSON, YAML). It produces standardized forms suitable for comparison, testing, digital signatures, and human-readable output.

Key features:

Format support: XML, HTML, JSON, YAML
Canonicalization: W3C XML C14N 1.1, sorted JSON/YAML keys
Semantic comparison: Compare meaning, not formatting
Multiple interfaces: Ruby API, CLI, RSpec matchers
Smart diff output: By-line or by-object modes with syntax highlighting

Installation

Add to your application’s Gemfile:

gem 'canon'

Then execute:

$ bundle install

Or install directly:

$ gem install canon

Quick start

Format documents

require 'canon'

# Default formatting (pretty-printed)
Canon.format('<root><b>2</b><a>1</a></root>', :xml)
# => Pretty-printed XML (default behavior)

# Compact canonical form
require 'canon/xml/c14n'
Canon::Xml::C14n.canonicalize('<root><b>2</b><a>1</a></root>', with_comments: false)
# => "<root><b>2</b><a>1</a></root>"

# Pretty-print (human-readable with custom indent)
require 'canon/pretty_printer/xml'
xml_input = '<root><b>2</b><a>1</a></root>'
Canon::PrettyPrinter::Xml.new(indent: 2).format(xml_input)

Compare documents

require 'canon/comparison'

xml1 = '<root><a>1</a><b>2</b></root>'
xml2 = '<root>  <b>2</b>  <a>1</a>  </root>'

Canon::Comparison.equivalent?(xml1, xml2)
# => true (semantically equivalent despite formatting differences)

# Use semantic tree diff for operation-level analysis
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  diff_algorithm: :semantic
)
result.operations  # => [INSERT, DELETE, UPDATE, MOVE operations]

Use in tests

require 'canon/rspec_matchers'

RSpec.describe 'XML generation' do
  it 'generates correct XML' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end

Command-line interface

# Format a file
$ canon format input.xml --mode pretty

# Compare files
$ canon diff file1.xml file2.xml --verbose

# Get help
$ canon help

Documentation

Using Canon

Ruby API - Using Canon from Ruby code
Command-line interface - CLI commands and options
RSpec matchers - Testing with Canon

Understanding Canon

Match architecture - How comparison works
Format support - XML, HTML, JSON, YAML details
Diff modes - By-line vs by-object comparison

Features

Preprocessing - Document normalization options
Match options - Match dimensions and profiles
Semantic tree diff - Operation-level tree comparison
Semantic tree diff algorithm - Comprehensive guide to semantic diff
Environment configuration - Configure via ENV variables including size limits
Diff formatting - Customizing diff output
Character visualization - Whitespace and special characters
Input validation - Error handling

Advanced topics

Verbose mode - Two-tier diff architecture
Semantic diff report - Diff report format
Normative vs informative diffs - Diff classification
Diff architecture - Technical pipeline details
CompareProfile architecture - Format-specific policies

Features

Canonicalization

XML: W3C Canonical XML Version 1.1 specification with namespace declaration ordering, attribute ordering, character encoding normalization, and proper handling of xml:base, xml:lang, xml:space, and xml:id attributes.

HTML: Consistent formatting for HTML 4/5 and XHTML with automatic detection and appropriate formatting rules.

JSON/YAML: Alphabetically sorted keys at all levels with consistent formatting.
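
The idea of sorting keys at every level can be illustrated with a minimal sketch that uses only Ruby's standard json library. This is not Canon's implementation; the helper name sort_keys is hypothetical and exists only for illustration.

```ruby
require 'json'

# Recursively rebuild hashes with their keys in alphabetical order.
# Arrays keep their element order, because order is significant in JSON arrays.
def sort_keys(value)
  case value
  when Hash
    value.keys.sort.each_with_object({}) do |key, sorted|
      sorted[key] = sort_keys(value[key])
    end
  when Array
    value.map { |item| sort_keys(item) }
  else
    value
  end
end

input = '{"b": 2, "a": {"d": [3, {"z": 1, "y": 2}], "c": 1}}'
canonical = JSON.generate(sort_keys(JSON.parse(input)))
puts canonical
# => {"a":{"c":1,"d":[3,{"y":2,"z":1}]},"b":2}
```

Two documents that differ only in key order produce the same string under this scheme, which is what makes a sorted form useful for comparison.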

Semantic comparison

Compare documents based on meaning, not formatting:

Whitespace normalization options
Attribute/key order handling
Comment handling with display control
Multiple match dimensions with behaviors
Predefined match profiles (strict, rendered, spec_friendly, content_only)

See Match options for details.

Comment display control

Control which differences are displayed in diff output:

# Show all differences (default)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :all
)

# Show only normative differences (affect equivalence)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :normative
)

# Show only informative differences
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :informative
)

CLI usage:

# Show all differences
$ canon diff file1.xml file2.xml --show-diffs all

# Show only normative differences
$ canon diff file1.xml file2.xml --show-diffs normative

# Show only informative differences
$ canon diff file1.xml file2.xml --show-diffs informative

RSpec usage:

expect(actual).to be_xml_equivalent_to(expected)
  .show_diffs(:normative)

Original input string display

When debugging test failures, it’s often helpful to see the exact strings that were passed to the comparison before any preprocessing or normalization. The verbose_diff option displays the original input strings in an RSpec-style format with line numbers.

# Enable original string display in configuration
Canon::Config.configure do |config|
  config.xml.diff.verbose_diff = true
end

# Or programmatically for a specific comparison
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  verbose_diff: true
)

Output format:

==================================================================
  ORIGINAL INPUT STRINGS
==================================================================

Expected (as string):
     1 | <root>
     2 |   <element>value1</element>
     3 | </root>

Actual (as string):
     1 | <root>
     2 |   <element>value2</element>
     3 | </root>

==================================================================

When to use this feature:

Debugging why two documents are considered different
Understanding preprocessing effects (c14n, normalization, etc.)
Verifying the exact input received by the comparison
Comparing raw vs processed content

Environment variable:

export CANON_XML_DIFF_VERBOSE_DIFF=true
export CANON_HTML_DIFF_VERBOSE_DIFF=true
export CANON_JSON_DIFF_VERBOSE_DIFF=true
export CANON_YAML_DIFF_VERBOSE_DIFF=true

Algorithm choice

Canon provides two diff algorithms:

DOM diff (default): Stable, position-based comparison for traditional line-by-line output
Semantic tree diff (experimental): Advanced operation detection (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE)

# Use DOM diff (default, stable)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :dom
)

# Use semantic tree diff (experimental, more intelligent)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :semantic
)

When to use semantic tree diff:

Need to detect high-level operations (moves, merges, splits)
Documents have significant rearrangement
Want statistical analysis of changes
Need operation-level transformation analysis

When to use DOM diff:

Need stable, well-tested comparison
Want traditional line-by-line output
Documents are similar in structure
Maximum performance for large files

See Semantic tree diff algorithm for comprehensive guide.

Size limits for large files

Canon provides configurable size limits to prevent hangs on pathologically large files:

File size limit: Default 5MB (configurable)
Node count limit: Default 10,000 nodes (configurable)
Diff output limit: Default 10,000 lines (configurable)

# Configure via environment variables
export CANON_MAX_FILE_SIZE=10485760      # 10MB
export CANON_MAX_NODE_COUNT=50000        # 50,000 nodes
export CANON_MAX_DIFF_LINES=20000        # 20,000 lines

bundle exec rspec

# Or programmatically
Canon::Config.instance.xml.diff.max_file_size = 10_485_760
Canon::Config.instance.xml.diff.max_node_count = 50_000
Canon::Config.instance.xml.diff.max_diff_lines = 20_000

See ENV_CONFIG for details on size limit configuration.
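
As a rough illustration of how environment-driven limits with defaults can work, the sketch below reads the three variables named above and falls back to the documented defaults. The helper names limit_for and within_file_size_limit? are hypothetical, and the 5MB default is assumed to mean 5 × 1,048,576 bytes; Canon's own lookup logic may differ.

```ruby
# Documented defaults, keyed by the environment variable names used above.
DEFAULT_LIMITS = {
  'CANON_MAX_FILE_SIZE' => 5 * 1024 * 1024, # assumption: 5MB = 5,242,880 bytes
  'CANON_MAX_NODE_COUNT' => 10_000,
  'CANON_MAX_DIFF_LINES' => 10_000
}.freeze

# Hypothetical helper: return the integer limit for a variable,
# using the default when the variable is unset or blank.
# Raises ArgumentError if the value is not a base-10 integer.
def limit_for(name, env = ENV)
  raw = env[name]
  return DEFAULT_LIMITS.fetch(name) if raw.nil? || raw.strip.empty?

  Integer(raw, 10)
end

# Hypothetical helper: check a document's byte size against the file size limit.
def within_file_size_limit?(content, env = ENV)
  content.bytesize <= limit_for('CANON_MAX_FILE_SIZE', env)
end

puts limit_for('CANON_MAX_NODE_COUNT', {})
puts limit_for('CANON_MAX_FILE_SIZE', { 'CANON_MAX_FILE_SIZE' => '10485760' })
```

Passing the environment as an argument keeps the helpers easy to test without mutating the real ENV.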

Smart diff output

By-line mode: Traditional line-by-line diff with:

DOM-guided semantic matching for XML
Syntax-aware token highlighting
Context lines around changes
Whitespace visualization

By-object mode: Tree-based semantic diff with:

Visual tree structure using box-drawing characters
Shows only what changed (additions, removals, modifications)
Color-coded output

See Diff modes for details.

Enhanced diff features

Three-tier diff classification: Formatting-only ([ dark gray/] light gray), informative (< blue/> cyan), and normative (- red/+ green) differences with directional colors
Directional color coding: Removals and additions use different colors within each tier (red/green for normative, blue/cyan for informative, dark gray/light gray for formatting)
Namespace declaration tracking: Separate dimension for tracking xmlns and xmlns:* attribute changes, reported independently from regular data attributes
Namespace rendering: Explicit namespace display in XML diffs using ns:[uri] or ns:[] format
Informative diff visualization: Visually distinct blue/cyan markers for differences that don’t affect equivalence
Formatting diff detection: Automatically detects and highlights purely cosmetic whitespace/line break differences
Whitespace visualization: Make invisible characters visible with CJK-safe Unicode symbols
Non-ASCII detection: Warnings for unexpected Unicode characters
Customizable: Character maps, context lines, grouping options

See Diff formatting and Character visualization for details.
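
Whitespace visualization comes down to substituting visible symbols for invisible characters. The sketch below shows the general technique with a hypothetical character map; the symbols chosen here are assumptions and are not Canon's actual defaults.

```ruby
# Hypothetical character map; Canon's real symbols may differ.
VISIBLE = {
  ' ' => '·',
  "\t" => '→',
  "\n" => "↵\n", # keep the line break so the output stays line-oriented
  "\u00A0" => '␣'
}.freeze

# Replace each invisible character with its visible stand-in.
def visualize_whitespace(text, map = VISIBLE)
  text.gsub(/[ \t\n\u00A0]/) { |char| map.fetch(char, char) }
end

puts visualize_whitespace("a b\tc\n")
```

A custom map can be passed as the second argument, which mirrors the idea of customizable character maps mentioned above.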

Input validation

Comprehensive validation with clear error messages showing exact line and column numbers for syntax errors in XML, HTML, JSON, and YAML.

See Input validation for details.

Examples

Ruby API example

require 'canon/comparison'

# Compare with custom options
Canon::Comparison.equivalent?(doc1, doc2,
  match: {
    text_content: :normalize,
    structural_whitespace: :ignore,
    comments: :ignore
  },
  verbose: true
)

See Ruby API documentation.

CLI example

# Compare with semantic diff
$ canon diff file1.xml file2.xml \
  --verbose \
  --text-content normalize \
  --structural-whitespace ignore

See CLI documentation.

RSpec example

# Configure globally
Canon::Config.configure do |config|
  config.xml.match.profile = :spec_friendly
  config.xml.diff.use_color = true
end

# Use in tests
RSpec.describe 'XML generation' do
  it 'generates correct structure' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end

See RSpec documentation.

Architecture

Canon follows an orchestrator pattern with MECE (Mutually Exclusive, Collectively Exhaustive) principles:

Comparison module (Canon::Comparison): Format detection, validation, and delegation to format-specific comparators (XML, HTML, JSON, YAML).

DiffFormatter module (Canon::DiffFormatter): Diff mode detection and delegation to mode-specific formatters (by-line, by-object).

Three-phase comparison:

Preprocessing: Optional document normalization (c14n, normalize, format)
Semantic matching: Configurable match dimensions with behaviors
Diff rendering: Formatted output with visualization

See Match architecture for details.

CompareProfile architecture

Canon uses the CompareProfile class to encapsulate policy decisions about how differences in various dimensions should be handled during comparison. This provides clean separation of concerns between policy decisions, comparison logic, and difference classification.

Separation of concerns

The comparison system is divided into four distinct components:

CompareProfile: Policy decisions (what to track, what affects equivalence)
XmlComparator/HtmlComparator: Comparison logic (detect differences)
DiffNode: Data representation (represents a difference)
DiffClassifier: Classification logic (normative vs informative vs formatting)

Each component has ONE responsibility with no overlapping concerns:

CompareProfile does NOT classify differences
XmlComparator does NOT make policy decisions
DiffClassifier does NOT compare documents

Policy methods

CompareProfile provides four key policy methods:

track_dimension?(dimension): Should DiffNodes be created for this dimension? Returns true in verbose mode to track all differences for reporting.
affects_equivalence?(dimension): Should differences affect equivalence? Determines the return value of the comparison. Returns false for dimensions with :ignore behavior.
normative_dimension?(dimension): Is this dimension normative (affects equivalence) or informative (display only)? Used by DiffClassifier to set the normative flag on DiffNodes.
supports_formatting_detection?(dimension): Can FormattingDetector apply to this dimension? Returns true only for text/content dimensions (:text_content, :structural_whitespace, :comments).

CompareProfile architecture

Canon uses a CompareProfile system to define format-specific comparison policies. This allows different formats (HTML, XML, JSON, YAML) to have their own default behaviors while maintaining a consistent architecture.

How CompareProfile works

The CompareProfile class provides the foundation for policy-based comparison:

Normative policy: Determines what differences matter for equivalence. Each dimension (:text_content, :structural_whitespace, :comments, etc.) has a behavior (:strict, :normalize, :ignore) that determines whether differences in that dimension affect equivalence.

Dimension-based classification: Each difference has a dimension and the profile determines if that dimension is:

Normative: Affects equivalence (documents not equivalent if different)
Informative: Tracked but doesn’t affect equivalence
Formatting-only: Pure whitespace differences when normalized content matches

Classification hierarchy:

Normative (highest priority): Differences that make documents non-equivalent
Informative (medium priority): Differences that are tracked but don’t affect equivalence
Formatting-only (lowest priority): Pure whitespace/formatting differences

Dimension behaviors

Each dimension can have one of three behaviors:

:strict: Differences in this dimension are normative (affect equivalence)
:normalize: Differences are normalized; only semantic changes are normative
:ignore: Differences are informative only (don’t affect equivalence)

Example 1. Whitespace handling

# Default (strict mode): whitespace differences are normative
xml1 = '<root><p>Hello world</p></root>'
xml2 = "<root><p>Hello\nworld</p></root>"  # double quotes so \n is a real line break
Canon::Comparison.equivalent?(xml1, xml2)  # => false

# Normalize mode: whitespace-only differences are formatting-only
Canon::Comparison.equivalent?(xml1, xml2,
  match: { text_content: :normalize, structural_whitespace: :normalize }
)  # => true

In normalize mode, the line break is detected as formatting-only because the normalized content ("Hello world") is the same.
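
The decision described here can be sketched in plain Ruby. The helper names normalize_whitespace and classify_text_diff are hypothetical, and the logic is a simplified reading of the three behaviors, not Canon's classifier.

```ruby
# Collapse runs of whitespace into single spaces and trim both ends.
def normalize_whitespace(text)
  text.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: classify a difference between two text values
# under one of the three dimension behaviors.
def classify_text_diff(expected, actual, behavior)
  return :equal if expected == actual

  case behavior
  when :strict then :normative
  when :ignore then :informative
  when :normalize
    same = normalize_whitespace(expected) == normalize_whitespace(actual)
    same ? :formatting_only : :normative
  else
    raise ArgumentError, "unknown behavior: #{behavior.inspect}"
  end
end

puts classify_text_diff('Hello world', "Hello\nworld", :strict)    # normative
puts classify_text_diff('Hello world', "Hello\nworld", :normalize) # formatting_only
```

Under :normalize, only a difference that survives normalization is treated as normative in this sketch.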

Format-specific profiles

Different formats can extend CompareProfile with format-specific policies:

XML (base): Strict policies for all dimensions
HTML (HtmlCompareProfile): Comments ignored by default, whitespace preserved in certain elements
JSON/YAML (future): Key order policies, type handling

See lib/canon/comparison/compare_profile.rb for the base implementation and lib/canon/comparison/html_compare_profile.rb for HTML-specific policies.

Format-specific policies for HTML

Canon provides a format-specific CompareProfile implementation called HtmlCompareProfile that encapsulates policies specific to HTML comparison. This profile is automatically used by HtmlComparator based on detected HTML version.

Comments: Default behavior is :ignore (presentational content in HTML), unless explicitly set to :strict. When comments are set to :strict, they will affect equivalence.

Whitespace preservation: HtmlCompareProfile automatically preserves whitespace in elements where it’s semantically significant (e.g., <pre>, <code>, <textarea>, <script>, <style>). In other elements, whitespace is normalized.

Case sensitivity: HTML5 is case-sensitive for element names, while HTML4 is case-insensitive. HtmlCompareProfile uses HTML5 case-sensitivity by default.

Usage example

When using match: { comments: :ignore }:

track_dimension?(:comments) returns true (track in verbose mode)
affects_equivalence?(:comments) returns false (doesn’t affect equivalence)
normative_dimension?(:comments) returns false (informative only)

This ensures that comment differences are tracked and displayed in verbose mode but don’t make documents non-equivalent.

Example 2. Comment differences with :ignore behavior

xml1 = '<root><!-- comment 1 --><data>value</data></root>'
xml2 = '<root><!-- comment 2 --><data>value</data></root>'

result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore }
)

result.differences           # => [#<DiffNode @dimension=:comments>]
result.differences[0].normative?  # => false (informative)
result.equivalent?           # => true (doesn't affect equivalence)

The comment difference is tracked and displayed, but the documents are still considered equivalent because comments are set to :ignore.

Example 3. HTML comment handling

html1 = '<div><!-- comment --><p>Text</p></div>'
html2 = '<div><p>Text</p></div>'

# HTML defaults: comments are ignored (presentational)
result = Canon::Comparison.equivalent?(html1, html2)
# => true (comments don't affect HTML equivalence by default)

# Explicit strict matching
result = Canon::Comparison.equivalent?(html1, html2,
  match: { comments: :strict }
)
# => false (comments now affect equivalence)

Comments in HTML are considered presentational content (like CSS styles) and don’t affect the semantic meaning unless explicitly configured to :strict.

Example 4. HTML whitespace preservation

# Double quotes so \n is interpreted as a real line break
html1 = "<pre>Line 1\n  Line 2</pre>"
html2 = "<pre>Line 1\nLine 2</pre>"

# Whitespace is preserved in <pre> elements
result = Canon::Comparison.equivalent?(html1, html2)
# => false (whitespace differs in pre element)

# But normalized in other elements
html3 = '<div>Text    with    spaces</div>'
html4 = '<div>Text with spaces</div>'
result = Canon::Comparison.equivalent?(html3, html4)
# => true (whitespace normalized in regular elements)

HtmlCompareProfile automatically preserves whitespace in elements where it’s semantically significant (<pre>, <code>, <textarea>, <script>, <style>), while normalizing it in other elements.
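
The per-element rule can be sketched as a lookup against the list of whitespace-sensitive elements named above. The helpers comparable_text and same_text? are hypothetical and operate on already-extracted text, so this is only an illustration of the policy, not of how HtmlCompareProfile is implemented.

```ruby
# Elements in which whitespace is kept exactly as written (from the list above).
PRESERVE_WHITESPACE = %w[pre code textarea script style].freeze

# Hypothetical helper: return the text as it should be compared
# for a given element name.
def comparable_text(element_name, text)
  return text if PRESERVE_WHITESPACE.include?(element_name.downcase)

  text.gsub(/\s+/, ' ').strip
end

# Hypothetical helper: compare two text values under the element's rule.
def same_text?(element_name, left, right)
  comparable_text(element_name, left) == comparable_text(element_name, right)
end

puts same_text?('pre', "Line 1\n  Line 2", "Line 1\nLine 2")      # false
puts same_text?('div', 'Text    with    spaces', 'Text with spaces') # true
```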

Future format profiles: The architecture supports additional format-specific profiles for JSON, YAML, and other formats as needed.

Development

After checking out the repo, run bin/setup to install dependencies. Then run rake spec to run the tests. You can also run bin/console for an interactive prompt.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/lutaml/canon.

Copyright and license

Copyright Ribose. BSD-2-Clause License.
