Canon provides canonicalization, pretty-printing, and semantic comparison for serialization formats (XML, HTML, JSON, YAML). It produces standardized forms suitable for comparison, testing, digital signatures, and human-readable output.
Key features:
-
Format support: XML, HTML, JSON, YAML
-
Canonicalization: W3C XML C14N 1.1, sorted JSON/YAML keys
-
Semantic comparison: Compare meaning, not formatting
-
Multiple interfaces: Ruby API, CLI, RSpec matchers
-
Smart diff output: By-line or by-object modes with syntax highlighting
Add to your application’s Gemfile:
gem 'canon'
Then execute:
$ bundle install
Or install directly:
$ gem install canon
require 'canon'
# Format XML (pretty-printed by default)
Canon.format('<root><b>2</b><a>1</a></root>', :xml)
# => Pretty-printed XML (default behavior)
# Compact canonical form
require 'canon/xml/c14n'
Canon::Xml::C14n.canonicalize('<root><b>2</b><a>1</a></root>', with_comments: false)
# => "<root><b>2</b><a>1</a></root>"
# Pretty-print (human-readable with custom indent)
require 'canon/pretty_printer/xml'
xml_input = '<root><b>2</b><a>1</a></root>'
Canon::PrettyPrinter::Xml.new(indent: 2).format(xml_input)
require 'canon/comparison'
xml1 = '<root><a>1</a><b>2</b></root>'
xml2 = '<root> <b>2</b> <a>1</a> </root>'
Canon::Comparison.equivalent?(xml1, xml2)
# => true (semantically equivalent despite formatting differences)
# Use semantic tree diff for operation-level analysis
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  diff_algorithm: :semantic
)
result.operations # => [INSERT, DELETE, UPDATE, MOVE operations]
require 'canon/rspec_matchers'
RSpec.describe 'XML generation' do
  it 'generates correct XML' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
-
Ruby API - Using Canon from Ruby code
-
Command-line interface - CLI commands and options
-
RSpec matchers - Testing with Canon
-
Match architecture - How comparison works
-
Format support - XML, HTML, JSON, YAML details
-
Diff modes - By-line vs by-object comparison
-
Preprocessing - Document normalization options
-
Match options - Match dimensions and profiles
-
Semantic tree diff - Operation-level tree comparison
-
Semantic tree diff algorithm - Comprehensive guide to semantic diff
-
Environment configuration - Configure via ENV variables including size limits
-
Diff formatting - Customizing diff output
-
Character visualization - Whitespace and special characters
-
Input validation - Error handling
-
Verbose mode - Two-tier diff architecture
-
Semantic diff report - Diff report format
-
Normative vs informative diffs - Diff classification
-
Diff architecture - Technical pipeline details
-
CompareProfile architecture - Format-specific policies
XML: W3C Canonical XML Version 1.1 specification with namespace declaration ordering, attribute ordering, character encoding normalization, and proper handling of xml:base, xml:lang, xml:space, and xml:id attributes.
HTML: Consistent formatting for HTML 4/5 and XHTML with automatic detection and appropriate formatting rules.
JSON/YAML: Alphabetically sorted keys at all levels with consistent formatting.
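As a rough illustration of what "sorted keys at all levels" means, the sketch below recursively sorts hash keys using only Ruby's standard `json` library. The helper name `sort_keys_deep` is hypothetical and is not part of Canon's API; Canon's own implementation may differ.

```ruby
require 'json'

# Hypothetical helper (not Canon's API): recursively sort hash keys.
# Arrays keep their element order; only hash keys are reordered.
def sort_keys_deep(value)
  case value
  when Hash
    value.keys.sort.each_with_object({}) do |key, sorted|
      sorted[key] = sort_keys_deep(value[key])
    end
  when Array
    value.map { |item| sort_keys_deep(item) }
  else
    value
  end
end

input = '{"b": 2, "a": {"d": 4, "c": [{"z": 1, "y": 2}]}}'
canonical = JSON.generate(sort_keys_deep(JSON.parse(input)))
puts canonical
# prints {"a":{"c":[{"y":2,"z":1}],"d":4},"b":2}
```

Two inputs that differ only in key order produce the same string under this scheme, which is what makes the sorted form suitable for byte-level comparison.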
Compare documents based on meaning, not formatting:
-
Whitespace normalization options
-
Attribute/key order handling
-
Comment handling with display control
-
Multiple match dimensions with behaviors
-
Predefined match profiles (strict, rendered, spec_friendly, content_only)
See Match options for details.
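To make "meaning, not formatting" concrete, here is a minimal sketch for JSON using only the standard library. The helper `json_equivalent?` is hypothetical and far simpler than Canon's match dimensions; it only shows that comparing parsed structures ignores key order and insignificant whitespace.

```ruby
require 'json'

# Hypothetical helper (not Canon's API): two JSON documents are treated as
# equivalent when their parsed structures are equal. Ruby's Hash#== ignores
# key order, so key ordering and insignificant whitespace do not matter.
# Array element order still matters.
def json_equivalent?(left, right)
  JSON.parse(left) == JSON.parse(right)
end

compact  = '{"name":"canon","formats":["xml","json"]}'
spacious = %({ "formats": ["xml", "json"],\n  "name": "canon" })

puts json_equivalent?(compact, spacious)           # prints true
puts json_equivalent?(compact, '{"name":"Canon"}') # prints false
```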
Control which differences are displayed in diff output:
# Show all differences (default)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :all
)
# Show only normative differences (affect equivalence)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :normative
)
# Show only informative differences
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :informative
)
CLI usage:
# Show all differences
$ canon diff file1.xml file2.xml --show-diffs all
# Show only normative differences
$ canon diff file1.xml file2.xml --show-diffs normative
# Show only informative differences
$ canon diff file1.xml file2.xml --show-diffs informative
RSpec usage:
expect(actual).to be_xml_equivalent_to(expected)
  .show_diffs(:normative)
When debugging test failures, it’s often helpful to see the exact strings that
were passed to the comparison before any preprocessing or normalization. The
verbose_diff option displays the original input strings in an RSpec-style
format with line numbers.
# Enable original string display in configuration
Canon::Config.configure do |config|
  config.xml.diff.verbose_diff = true
end
# Or programmatically for a specific comparison
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  verbose_diff: true
)
Output format:
==================================================================
ORIGINAL INPUT STRINGS
==================================================================
Expected (as string):
1 | <root>
2 | <element>value1</element>
3 | </root>
Actual (as string):
1 | <root>
2 | <element>value2</element>
3 | </root>
==================================================================
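The numbered layout above can be approximated in a few lines of Ruby. This is only a sketch of the display format; `number_lines` is a hypothetical helper, not the code Canon uses to render this section.

```ruby
# Hypothetical helper (not Canon's API): prefix each line with its
# right-aligned line number, mimicking the "N | text" layout shown above.
def number_lines(text)
  lines = text.lines.map(&:chomp)
  width = lines.size.to_s.length
  lines.each_with_index.map do |line, index|
    "#{(index + 1).to_s.rjust(width)} | #{line}"
  end
end

expected = "<root>\n  <element>value1</element>\n</root>"
puts 'Expected (as string):'
puts number_lines(expected)
```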
When to use this feature:
-
Debugging why two documents are considered different
-
Understanding preprocessing effects (c14n, normalization, etc.)
-
Verifying the exact input received by the comparison
-
Comparing raw vs processed content
Environment variable:
export CANON_XML_DIFF_VERBOSE_DIFF=true
export CANON_HTML_DIFF_VERBOSE_DIFF=true
export CANON_JSON_DIFF_VERBOSE_DIFF=true
export CANON_YAML_DIFF_VERBOSE_DIFF=true
Canon provides two diff algorithms:
-
DOM diff (default): Stable, position-based comparison for traditional line-by-line output
-
Semantic tree diff (experimental): Advanced operation detection (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE)
# Use DOM diff (default, stable)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :dom
)
# Use semantic tree diff (experimental, more intelligent)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :semantic
)
When to use semantic tree diff:
-
Need to detect high-level operations (moves, merges, splits)
-
Documents have significant rearrangement
-
Want statistical analysis of changes
-
Need operation-level transformation analysis
When to use DOM diff:
-
Need stable, well-tested comparison
-
Want traditional line-by-line output
-
Documents are similar in structure
-
Maximum performance for large files
See Semantic tree diff algorithm for a comprehensive guide.
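For readers new to operation-level diffs, the toy sketch below detects INSERT, DELETE, and UPDATE operations between two flat hashes. It is purely illustrative: `detect_operations` is a hypothetical helper, it works on flat key/value data rather than document trees, and it does not attempt MOVE, MERGE, or SPLIT detection as the semantic tree diff does.

```ruby
# Illustrative only: operation detection on flat hashes, not Canon's
# tree-based algorithm.
def detect_operations(before, after)
  deleted  = (before.keys - after.keys).map { |key| { op: :delete, key: key } }
  inserted = (after.keys - before.keys).map { |key| { op: :insert, key: key } }
  updated  = (before.keys & after.keys)
             .reject { |key| before[key] == after[key] }
             .map { |key| { op: :update, key: key } }
  deleted + inserted + updated
end

before = { 'title' => 'Draft', 'author' => 'A', 'year' => 2023 }
after  = { 'title' => 'Final', 'author' => 'A', 'pages' => 10 }
detect_operations(before, after).each { |operation| puts operation.inspect }
```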
Canon provides configurable size limits to prevent hangs on pathologically large files:
-
File size limit: Default 5MB (configurable)
-
Node count limit: Default 10,000 nodes (configurable)
-
Diff output limit: Default 10,000 lines (configurable)
# Configure via environment variables
export CANON_MAX_FILE_SIZE=10485760 # 10MB
export CANON_MAX_NODE_COUNT=50000 # 50,000 nodes
export CANON_MAX_DIFF_LINES=20000 # 20,000 lines
bundle exec rspec
# Or programmatically
Canon::Config.instance.xml.diff.max_file_size = 10_485_760
Canon::Config.instance.xml.diff.max_node_count = 50_000
Canon::Config.instance.xml.diff.max_diff_lines = 20_000
See ENV_CONFIG for details on size limit configuration.
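The following sketch shows one way an environment-driven limit with a default could be resolved and enforced. The variable names and default values come from the text above (with 5MB taken as 5 × 1024 × 1024 bytes, matching the 10MB example); the helpers `resolve_limit` and `check_file_size!` are hypothetical and do not reflect Canon's internals.

```ruby
# Defaults as stated above: 5 MB, 10,000 nodes, 10,000 lines.
DEFAULT_LIMITS = {
  'CANON_MAX_FILE_SIZE'  => 5 * 1024 * 1024,
  'CANON_MAX_NODE_COUNT' => 10_000,
  'CANON_MAX_DIFF_LINES' => 10_000
}.freeze

# Hypothetical helper: read a limit from an ENV-like hash, falling back to
# the default when the variable is unset or blank.
def resolve_limit(name, env = ENV)
  raw = env[name]
  return DEFAULT_LIMITS.fetch(name) if raw.nil? || raw.strip.empty?

  Integer(raw, 10)
end

# Hypothetical guard: reject input larger than the configured file size.
def check_file_size!(content, env = ENV)
  limit = resolve_limit('CANON_MAX_FILE_SIZE', env)
  return if content.bytesize <= limit

  raise ArgumentError, "input is #{content.bytesize} bytes; limit is #{limit}"
end

puts resolve_limit('CANON_MAX_NODE_COUNT', { 'CANON_MAX_NODE_COUNT' => '50000' })
```

Passing the environment as a parameter keeps the helpers easy to test without mutating the real `ENV`.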
By-line mode: Traditional line-by-line diff with:
-
DOM-guided semantic matching for XML
-
Syntax-aware token highlighting
-
Context lines around changes
-
Whitespace visualization
By-object mode: Tree-based semantic diff with:
-
Visual tree structure using box-drawing characters
-
Shows only what changed (additions, removals, modifications)
-
Color-coded output
See Diff modes for details.
-
Three-tier diff classification: Formatting-only ([dark gray/]light gray), informative (<blue/>cyan), and normative (-red/+green) differences with directional colors
-
Directional color coding: Removals and additions use different colors within each tier (red/green for normative, blue/cyan for informative, dark gray/light gray for formatting)
-
Namespace declaration tracking: Separate dimension for tracking xmlns and xmlns:* attribute changes, reported independently from regular data attributes
-
Namespace rendering: Explicit namespace display in XML diffs using ns:[uri] or ns:[] format
-
Informative diff visualization: Visually distinct blue/cyan markers for differences that don’t affect equivalence
-
Formatting diff detection: Automatically detects and highlights purely cosmetic whitespace/line break differences
-
Whitespace visualization: Make invisible characters visible with CJK-safe Unicode symbols
-
Non-ASCII detection: Warnings for unexpected Unicode characters
-
Customizable: Character maps, context lines, grouping options
See Diff formatting and Character visualization for details.
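As a sketch of the whitespace visualization idea, the snippet below replaces invisible characters with visible markers through a customizable map. The symbols chosen here and the helper `visualize_whitespace` are assumptions for illustration; the character map Canon ships with may use different symbols.

```ruby
# Hypothetical character map; the symbols Canon actually uses may differ.
VISIBLE_WHITESPACE = {
  ' '  => '░',
  "\t" => '⇥',
  "\n" => "↵\n"
}.freeze

# Replace each space, tab, or newline with its visible marker. A custom map
# can be passed to override the defaults; unmapped characters pass through.
def visualize_whitespace(text, map = VISIBLE_WHITESPACE)
  text.gsub(/[ \t\n]/) { |char| map.fetch(char, char) }
end

puts visualize_whitespace("<a> 1</a>\t\n<b/>")
```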
Comprehensive validation with clear error messages showing exact line and column numbers for syntax errors in XML, HTML, JSON, and YAML.
See Input validation for details.
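Ruby's standard YAML parser already exposes the position of a syntax error, which gives a feel for the kind of message described above. The sketch below uses only `Psych::SyntaxError` from the standard library; `yaml_syntax_error` is a hypothetical helper, and Canon's own validators and message wording may differ.

```ruby
require 'yaml'

# Hypothetical helper: return nil when the YAML parses cleanly, otherwise a
# message that includes the line and column reported by the parser.
def yaml_syntax_error(source)
  YAML.parse(source)
  nil
rescue Psych::SyntaxError => e
  "YAML syntax error at line #{e.line}, column #{e.column}: #{e.problem}"
end

puts yaml_syntax_error('key: [1, 2') # unclosed flow sequence
puts yaml_syntax_error('key: [1, 2]').inspect
```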
require 'canon/comparison'
# Compare with custom options
Canon::Comparison.equivalent?(doc1, doc2,
  match: {
    text_content: :normalize,
    structural_whitespace: :ignore,
    comments: :ignore
  },
  verbose: true
)
# Compare with semantic diff
$ canon diff file1.xml file2.xml \
  --verbose \
  --text-content normalize \
  --structural-whitespace ignore
See CLI documentation.
# Configure globally
Canon::Config.configure do |config|
  config.xml.match.profile = :spec_friendly
  config.xml.diff.use_color = true
end
# Use in tests
RSpec.describe 'XML generation' do
  it 'generates correct structure' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
See RSpec documentation.
Canon follows an orchestrator pattern with MECE (Mutually Exclusive, Collectively Exhaustive) principles:
Comparison module (Canon::Comparison): Format detection, validation, and
delegation to format-specific comparators (XML, HTML, JSON, YAML).
DiffFormatter module (Canon::DiffFormatter): Diff mode detection and
delegation to mode-specific formatters (by-line, by-object).
Three-phase comparison:
-
Preprocessing: Optional document normalization (c14n, normalize, format)
-
Semantic matching: Configurable match dimensions with behaviors
-
Diff rendering: Formatted output with visualization
See Match architecture for details.
Canon uses the CompareProfile class to encapsulate policy decisions about how differences in various dimensions should be handled during comparison. This provides clean separation of concerns between policy decisions, comparison logic, and difference classification.
The comparison system is divided into four distinct components:
-
CompareProfile: Policy decisions (what to track, what affects equivalence)
-
XmlComparator/HtmlComparator: Comparison logic (detect differences)
-
DiffNode: Data representation (represents a difference)
-
DiffClassifier: Classification logic (normative vs informative vs formatting)
Each component has ONE responsibility with no overlapping concerns:
-
CompareProfile does NOT classify differences
-
XmlComparator does NOT make policy decisions
-
DiffClassifier does NOT compare documents
CompareProfile provides four key policy methods:
-
track_dimension?(dimension): Should DiffNodes be created for this dimension? Returns true in verbose mode to track all differences for reporting.
-
affects_equivalence?(dimension): Should differences affect equivalence? Determines the return value of the comparison. Returns false for dimensions with :ignore behavior.
-
normative_dimension?(dimension): Is this dimension normative (affects equivalence) or informative (display only)? Used by DiffClassifier to set the normative flag on DiffNodes.
-
supports_formatting_detection?(dimension): Can FormattingDetector apply to this dimension? Returns true only for text/content dimensions (:text_content, :structural_whitespace, :comments).
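The four policy methods can be pictured with a small stand-in class. This is a sketch under stated assumptions, not Canon's CompareProfile: the class name `ToyCompareProfile`, its constructor, and the default of `:strict` for unspecified dimensions are all hypothetical, chosen only to mirror the behavior described above.

```ruby
# Illustrative only: a toy profile mirroring the four policy methods
# described above. Not Canon's implementation.
class ToyCompareProfile
  FORMATTING_DIMENSIONS = %i[text_content structural_whitespace comments].freeze

  def initialize(match_options = {}, verbose: false)
    @match_options = match_options
    @verbose = verbose
  end

  # Track everything in verbose mode; otherwise only what can change the result.
  def track_dimension?(dimension)
    @verbose || affects_equivalence?(dimension)
  end

  # Dimensions set to :ignore never change the comparison result.
  def affects_equivalence?(dimension)
    behavior_for(dimension) != :ignore
  end

  # In this toy model a dimension is normative exactly when it affects equivalence.
  def normative_dimension?(dimension)
    affects_equivalence?(dimension)
  end

  def supports_formatting_detection?(dimension)
    FORMATTING_DIMENSIONS.include?(dimension)
  end

  private

  # Assumption: unspecified dimensions default to :strict.
  def behavior_for(dimension)
    @match_options.fetch(dimension, :strict)
  end
end

profile = ToyCompareProfile.new({ comments: :ignore }, verbose: true)
puts profile.track_dimension?(:comments)     # prints true
puts profile.affects_equivalence?(:comments) # prints false
puts profile.normative_dimension?(:comments) # prints false
```

Keeping these answers in one object is what lets the comparator and classifier stay free of policy decisions.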
Canon uses a CompareProfile system to define format-specific comparison policies.
This allows different formats (HTML, XML, JSON, YAML) to have their own default
behaviors while maintaining a consistent architecture.
The CompareProfile class provides the foundation for policy-based comparison:
Normative policy: Determines what differences matter for equivalence. Each
dimension (:text_content, :structural_whitespace, :comments, etc.) has a
behavior (:strict, :normalize, :ignore) that determines whether differences
in that dimension affect equivalence.
Dimension-based classification: Each difference has a dimension and the profile determines if that dimension is:
-
Normative: Affects equivalence (documents not equivalent if different)
-
Informative: Tracked but doesn’t affect equivalence
-
Formatting-only: Pure whitespace differences when normalized content matches
Classification hierarchy:
-
Normative (highest priority): Differences that make documents non-equivalent
-
Informative (medium priority): Differences that are tracked but don’t affect equivalence
-
Formatting-only (lowest priority): Pure whitespace/formatting differences
Each dimension can have one of three behaviors:
-
:strict: Differences in this dimension are normative (affect equivalence)
-
:normalize: Differences are normalized; only semantic changes are normative
-
:ignore: Differences are informative only (don’t affect equivalence)
# Default (strict mode): whitespace differences are normative
xml1 = '<root><p>Hello world</p></root>'
xml2 = "<root><p>Hello\nworld</p></root>" # double quotes so \n is a real line break
Canon::Comparison.equivalent?(xml1, xml2) # => false
# Normalize mode: whitespace-only differences are formatting-only
Canon::Comparison.equivalent?(xml1, xml2,
  match: { text_content: :normalize, structural_whitespace: :normalize }
) # => true
In normalize mode, the line break is detected as formatting-only because the normalized content ("Hello world") is the same.
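The interaction between the three behaviors and the three classification tiers can be sketched for plain text as follows. Both helpers are hypothetical and operate on strings only; Canon's DiffClassifier works on DiffNodes and may apply additional rules.

```ruby
# Collapse runs of whitespace to a single space and trim the ends.
def normalize_whitespace(text)
  text.gsub(/\s+/, ' ').strip
end

# Hypothetical classifier: given two text values and the behavior configured
# for their dimension, return the tier of the difference (nil if identical).
def classify_difference(before, after, behavior)
  return nil if before == after

  case behavior
  when :strict
    :normative
  when :normalize
    same_content = normalize_whitespace(before) == normalize_whitespace(after)
    same_content ? :formatting_only : :normative
  when :ignore
    :informative
  else
    raise ArgumentError, "unknown behavior: #{behavior.inspect}"
  end
end

puts classify_difference('Hello world', "Hello\nworld", :strict)     # prints normative
puts classify_difference('Hello world', "Hello\nworld", :normalize)  # prints formatting_only
puts classify_difference('Hello world', 'Goodbye world', :normalize) # prints normative
puts classify_difference('Hello world', 'Goodbye world', :ignore)    # prints informative
```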
Different formats can extend CompareProfile with format-specific policies:
-
XML (base): Strict policies for all dimensions
-
HTML (HtmlCompareProfile): Comments ignored by default, whitespace preserved in certain elements
-
JSON/YAML (future): Key order policies, type handling
See lib/canon/comparison/compare_profile.rb for the base implementation and
lib/canon/comparison/html_compare_profile.rb for HTML-specific policies.
Canon provides a format-specific CompareProfile implementation called HtmlCompareProfile that encapsulates policies specific to HTML comparison. This profile is automatically used by HtmlComparator based on detected HTML version.
Comments: Default behavior is :ignore (presentational content in HTML),
unless explicitly set to :strict. When comments are set to :strict,
they will affect equivalence.
Whitespace preservation: HtmlCompareProfile automatically preserves
whitespace in elements where it’s semantically significant (e.g., <pre>,
<code>, <textarea>, <script>, <style>). In other elements, whitespace
is normalized.
Case sensitivity: HTML5 is case-sensitive for element names, while HTML4 is case-insensitive. HtmlCompareProfile uses HTML5 case-sensitivity by default.
When using match: { comments: :ignore }:
-
track_dimension?(:comments) returns true (track in verbose mode)
-
affects_equivalence?(:comments) returns false (doesn’t affect equivalence)
-
normative_dimension?(:comments) returns false (informative only)
This ensures that comment differences are tracked and displayed in verbose mode but don’t make documents non-equivalent.
xml1 = '<root><!-- comment 1 --><data>value</data></root>'
xml2 = '<root><!-- comment 2 --><data>value</data></root>'
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore }
)
result.differences # => [#<DiffNode @dimension=:comments>]
result.differences[0].normative? # => false (informative)
result.equivalent? # => true (doesn't affect equivalence)
The comment difference is tracked and displayed, but the documents are still
considered equivalent because comments are set to :ignore.
html1 = '<div><!-- comment --><p>Text</p></div>'
html2 = '<div><p>Text</p></div>'
# HTML defaults: comments are ignored (presentational)
result = Canon::Comparison.equivalent?(html1, html2)
# => true (comments don't affect HTML equivalence by default)
# Explicit strict matching
result = Canon::Comparison.equivalent?(html1, html2,
  match: { comments: :strict }
)
# => false (comments now affect equivalence)
Comments in HTML are considered presentational content (like CSS styles) and
don’t affect the semantic meaning unless explicitly configured to :strict.
html1 = "<pre>Line 1\n Line 2</pre>" # double quotes so \n is a real line break
html2 = "<pre>Line 1\nLine 2</pre>"
# Whitespace is preserved in <pre> elements
result = Canon::Comparison.equivalent?(html1, html2)
# => false (whitespace differs in pre element)
# But normalized in other elements
html3 = '<div>Text  with   spaces</div>' # extra spaces between words
html4 = '<div>Text with spaces</div>'
result = Canon::Comparison.equivalent?(html3, html4)
# => true (whitespace normalized in regular elements)
HtmlCompareProfile automatically preserves whitespace in elements where it’s
semantically significant (<pre>, <code>, <textarea>, <script>,
<style>), while normalizing it in other elements.
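A minimal sketch of this rule, applied to an element name and its text rather than a parsed document, might look like the following. The constant and the helper `comparable_text` are hypothetical; the element list is the one given above, and HtmlCompareProfile's actual logic may differ.

```ruby
# Elements in which whitespace is treated as significant (list from the text above).
WHITESPACE_SENSITIVE_ELEMENTS = %w[pre code textarea script style].freeze

# Hypothetical helper: return the text as it would be compared. Whitespace is
# kept verbatim inside sensitive elements and collapsed everywhere else.
def comparable_text(element_name, text)
  return text if WHITESPACE_SENSITIVE_ELEMENTS.include?(element_name.downcase)

  text.gsub(/\s+/, ' ').strip
end

puts comparable_text('pre', "Line 1\n  Line 2") == comparable_text('pre', "Line 1\nLine 2")
# prints false (whitespace preserved in <pre>)
puts comparable_text('div', 'Text  with   spaces') == comparable_text('div', 'Text with spaces')
# prints true (whitespace collapsed in <div>)
```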
Future format profiles: The architecture supports additional format-specific profiles for JSON, YAML, and other formats as needed.
After checking out the repo, run bin/setup to install dependencies. Then run
rake spec to run the tests. You can also run bin/console for an interactive
prompt.
Bug reports and pull requests are welcome on GitHub at https://github.com/lutaml/canon.
Copyright Ribose. BSD-2-Clause License.