feat(merodb): add schema inference from database metadata by xilosada · Pull Request #1865 · calimero-network/core

xilosada · 2026-02-05T07:23:37Z

Summary

Adds schema inference capability to merodb, allowing it to inspect databases without requiring an external schema file.

⚠️ Depends On

This PR depends on #1864 - must be merged first. This PR will be rebased on master after #1864 lands.

Planned Changes

Schema Inference (`abi.rs`)

Add infer_schema_from_database(db, context_id) function
Scan State column for EntityIndex entries
Extract field_name and crdt_type from metadata
Build Manifest dynamically from discovered fields

Export Types (`export.rs`)

Add EntityIndex struct with borsh deserialization
Add CrdtType enum (local copy for merodb)
Add Metadata struct with backward-compatible deserialization

CLI Updates (`export/cli.rs`)

Make --state-schema-file optional
Use infer_schema_from_database() when no schema provided

GUI Updates

Update HTML to mark schema as optional
Update JS to allow loading without schema
Update server endpoints to use schema inference

Benefits

Zero-config inspection: Inspect any database without schema files
Development workflow: Quickly debug state without generating schemas
Backward compatible: Still supports explicit schema files

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 85% | Review time: 92.3s

🔴 Critical (1)

1. Breaking change: compute_id modification invalidates existing data

File: crates/storage/src/collections.rs (line 68-74) | Consensus: 1/1 agents ✓

The compute_id function has been modified to include DOMAIN_SEPARATOR_ENTRY in the hash computation. This changes the IDs of ALL existing map entries in production databases. Any data stored with the old ID computation will no longer be accessible because lookups will use the new ID formula, effectively causing data loss.

Suggested fix:

This is a breaking change that requires a migration strategy. Either: (1) Add a new function `compute_id_v2` and use it only for new APIs while keeping `compute_id` unchanged for existing code paths, (2) Implement a database migration that re-keys all existing entries, or (3) Store a version marker to determine which ID computation to use.

🟡 Warning (1)

1. Backward-compatible deserialization may silently consume unrelated data

File: crates/storage/src/entities.rs (line 501-510) | Consensus: 1/1 agents ✓

The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing fields in old data. However, Borsh deserialization of Option<T> reads a discriminant byte first. If old data has ANY trailing bytes (e.g., from a different schema version or corruption), those bytes will be incorrectly parsed as crdt_type or field_name rather than being detected as an error. This could lead to silent data corruption.

Suggested fix:

Consider checking if the reader is at EOF before attempting to read optional fields, or use a versioned serialization format with an explicit version marker at the start of the metadata.

💡 Suggestion (2)

1. Inconsistent field name initialization for LwwRegister fields

File: apps/state-schema-conformance/src/lib.rs (line 126-129) | Consensus: 1/1 agents ✓

Most collections in the init() function have been updated to use new_with_field_name(), but status, user_id, counter, name, and active fields still use LwwRegister::new() without field names. This inconsistency means these fields won't benefit from schema inference and will have non-deterministic IDs across nodes.

Suggested fix:

Either add `new_with_field_name` support to LwwRegister and update these fields, or document why LwwRegister is intentionally excluded from deterministic ID generation.

2. Counter's internal map naming convention is fragile

File: crates/storage/src/collections/counter.rs (line 216-230) | Consensus: 1/1 agents ✓

The Counter uses a string prefix __counter_internal_ to prevent collisions with user-created collections. While documented, this relies on users not creating collections with this prefix. A malicious or careless user could still create a collection named __counter_internal_visits_positive and cause a collision.

Suggested fix:

Consider using a different domain separator mechanism (like a UUID or type-specific bytes) that cannot be expressed as a valid user field name, similar to how DOMAIN_SEPARATOR_ENTRY works.

📝 Nitpick (1)

1. New CrdtType variants lack documentation

File: crates/storage/src/collections/crdt_meta.rs (line 16-34) | Consensus: 1/1 agents ✓

The newly added UserStorage and FrozenStorage variants have only brief one-line comments, while other variants are similarly terse. As the enum grows and is used for merge dispatch during state synchronization, clearer documentation about when each type is used and its merge semantics would be valuable.

Suggested fix:

Add doc comments explaining the purpose and merge behavior of UserStorage and FrozenStorage variants.

_{🤖 Generated by AI Code Reviewer | Review ID: review-1bd71bd9}

meroreviewer · 2026-02-05T07:25:28Z

crates/storage/src/collections.rs

+const DOMAIN_SEPARATOR_COLLECTION: &[u8] = b"__calimero_collection__";
+
+/// Compute the ID for a key in a map.
+/// Uses domain separation to prevent collision with collection IDs.


🔴 Breaking change: compute_id modification invalidates existing data

The compute_id function has been modified to include DOMAIN_SEPARATOR_ENTRY in the hash computation. This changes the IDs of ALL existing map entries in production databases. Any data stored with the old ID computation will no longer be accessible because lookups will use the new ID formula, effectively causing data loss.

Suggested fix:

This is a breaking change that requires a migration strategy. Either: (1) Add a new function `compute_id_v2` and use it only for new APIs while keeping `compute_id` unchanged for existing code paths, (2) Implement a database migration that re-keys all existing entries, or (3) Store a version marker to determine which ID computation to use.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T07:25:30Z

crates/storage/src/entities.rs

+        let created_at = u64::deserialize_reader(reader)?;
+        let updated_at = UpdatedAt::deserialize_reader(reader)?;
+        let storage_type = StorageType::deserialize_reader(reader)?;
+


🟡 Backward-compatible deserialization may silently consume unrelated data

The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing fields in old data. However, Borsh deserialization of Option<T> reads a discriminant byte first. If old data has ANY trailing bytes (e.g., from a different schema version or corruption), those bytes will be incorrectly parsed as crdt_type or field_name rather than being detected as an error. This could lead to silent data corruption.

Suggested fix:

Consider checking if the reader is at EOF before attempting to read optional fields, or use a versioned serialization format with an explicit version marker at the start of the metadata.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T07:25:32Z

apps/state-schema-conformance/src/lib.rs

+            map_of_lists: UnorderedMap::new_with_field_name(None, "map_of_lists"),
+            list_of_maps: Vector::new_with_field_name(None, "list_of_maps"),
+            string_set: UnorderedSet::new_with_field_name(None, "string_set"),
+            visit_counter: Counter::new_with_field_name(None, "visit_counter"),


💡 Inconsistent field name initialization for LwwRegister fields

Most collections in the init() function have been updated to use new_with_field_name(), but status, user_id, counter, name, and active fields still use LwwRegister::new() without field names. This inconsistency means these fields won't benefit from schema inference and will have non-deterministic IDs across nodes.

Suggested fix:

Either add `new_with_field_name` support to LwwRegister and update these fields, or document why LwwRegister is intentionally excluded from deterministic ID generation.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T07:25:34Z

crates/storage/src/collections/counter.rs

+        field_name: &str,
+    ) -> Self {
+        // For Counter, we need to create deterministic IDs for both positive and negative maps
+        // Use a reserved internal prefix to prevent collisions with user-created collections.


💡 Counter's internal map naming convention is fragile

The Counter uses a string prefix __counter_internal_ to prevent collisions with user-created collections. While documented, this relies on users not creating collections with this prefix. A malicious or careless user could still create a collection named __counter_internal_visits_positive and cause a collision.

Suggested fix:

Consider using a different domain separator mechanism (like a UUID or type-specific bytes) that cannot be expressed as a valid user field name, similar to how DOMAIN_SEPARATOR_ENTRY works.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T07:25:36Z

crates/storage/src/collections/crdt_meta.rs


 /// Identifies the specific CRDT type
-#[derive(Debug, Clone, PartialEq, Eq)]
+#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, BorshSerialize, BorshDeserialize)]


📝 New CrdtType variants lack documentation

The newly added UserStorage and FrozenStorage variants have only brief one-line comments, while other variants are similarly terse. As the enum grows and is used for merge dispatch during state synchronization, clearer documentation about when each type is used and its merge semantics would be valuable.

Suggested fix:

Add doc comments explaining the purpose and merge behavior of UserStorage and FrozenStorage variants.

✅ Resolved - This issue has been addressed in the latest changes.

github-actions · 2026-02-05T08:15:02Z

Merobox Workflows Failed

The following workflow(s) failed after retries:

blobs/workflows/blobs-example.yml
blobs/workflows/blobs.yml
collaborative-editor/workflows/collaborative-editor.yml
kv-store-init/workflows/kv-store-init.yml
kv-store-with-handlers/workflows/kv-store-with-handlers.yml
kv-store-with-user-and-frozen-storage/workflows/test_frozen_storage.yml
kv-store-with-user-and-frozen-storage/workflows/test_user_storage.yml
kv-store/workflows/bundle-invitation-test.yml
nested-crdt-test/workflows/nested-crdt-test.yml
private_data/workflows/example.yml
private_data/workflows/private-data.yml
team-metrics-custom/workflows/team-metrics-custom.yml
team-metrics-macro/workflows/team-metrics-macro.yml
xcall-example/workflows/xcall-example.yml

Please check the workflow logs for more details.

- Schema inference from database metadata (field_name, crdt_type) - Improved tree view with field names instead of truncated IDs - Better visualization for Counter, Vector, Set values - Show children count and hide empty sections in sidebar

github-actions · 2026-02-05T08:50:53Z

Merobox Proposals Workflows Failed

The following proposal workflow(s) failed:

near
icp
ethereum

Please check the workflow logs for more details.

meroreviewer

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 95% | Review time: 503.6s

✅ Ready to Merge

All previously identified issues have been addressed!

✅ Fixed Issues

The following issues from previous reviews have been addressed:

~~Breaking change: compute_id modification invalidates existing data~~ (crates/storage/src/collections.rs:68)
~~Backward-compatible deserialization may silently consume unrelated data~~ (crates/storage/src/entities.rs:501)
~~Inconsistent field name initialization for LwwRegister fields~~ (apps/state-schema-conformance/src/lib.rs:126)
~~Counter's internal map naming convention is fragile~~ (crates/storage/src/collections/counter.rs:227)
~~New CrdtType variants lack documentation~~ (crates/storage/src/collections/crdt_meta.rs:16)

_{🤖 Generated by AI Code Reviewer | Review ID: review-48ee257a}

meroreviewer

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 85% | Review time: 481.6s

🔴 Critical issues require attention

✅ 5 Fixed | 🆕 5 New

✅ Fixed Issues

The following issues from previous reviews have been addressed:

~~Breaking change: compute_id modification invalidates existing data~~ (crates/storage/src/collections.rs:68)
~~Backward-compatible deserialization may silently consume unrelated data~~ (crates/storage/src/entities.rs:501)
~~Inconsistent field name initialization for LwwRegister fields~~ (apps/state-schema-conformance/src/lib.rs:126)
~~Counter's internal map naming convention is fragile~~ (crates/storage/src/collections/counter.rs:227)
~~New CrdtType variants lack documentation~~ (crates/storage/src/collections/crdt_meta.rs:16)

🆕 New Issues

These issues were found in the latest changes:

🔴 Critical (2)

1. Domain separator addition is a breaking change for existing data

File: crates/storage/src/collections.rs (line 66-75) | Consensus: 1/1 agents ✓

Adding DOMAIN_SEPARATOR_ENTRY to compute_id() changes the hash computation from SHA256(parent || key) to SHA256(parent || "__calimero_entry__" || key). This means all existing map entries stored in databases will have IDs computed with the OLD formula, while new lookups will use the NEW formula. Existing data will become inaccessible because lookups will fail to find entries stored under old IDs.

Suggested fix:

Either: (1) Provide a migration path that rehashes all existing entries, (2) Add a fallback that tries the old ID computation when the new one fails, or (3) Document this as a breaking change requiring fresh databases. Consider whether this should be gated behind a feature flag or version check.

2. Borsh backward compatibility deserialization may read garbage data

File: crates/storage/src/entities.rs (line 495-512) | Consensus: 1/1 agents ✓

The custom BorshDeserialize implementation for Metadata uses unwrap_or(None) to handle missing crdt_type and field_name fields in old data. However, when Metadata is part of a compound structure (e.g., Vec<ChildInfo> in index entries), if old-format Metadata is followed by another item, the deserializer will interpret the first bytes of the NEXT item as the Option discriminant for crdt_type. This will either corrupt deserialization or silently produce incorrect data, since Option::<CrdtType>::deserialize_reader expects a specific format (0x00 for None, 0x01 for Some followed by enum data).

Suggested fix:

Use a versioning scheme or length-prefixed format for Metadata. Alternatively, track remaining bytes before attempting to read optional fields, or use a separate migration to update all stored Metadata before deploying this change.

🟡 Warning (2)

1. Schema inference type defaults may produce incorrect type information

File: tools/merodb/src/abi.rs (line 130-180) | Consensus: 1/1 agents ✓

The schema inference defaults all UnorderedMap to Map<String, String>, Vector to List<String>, etc. These default type parameters may not match the actual data types stored in the database. When users rely on this inferred schema for inspection, they may get incorrect deserialization results for values that are not actually strings.

Suggested fix:

Document clearly that inferred schemas use default type parameters and may need manual refinement. Consider adding a comment in the generated schema indicating it was auto-inferred with default types, or attempt to sample actual entries to infer more accurate type information.

2. Schema inference silently skips non-root fields and unrecognized entries

File: tools/merodb/src/abi.rs (line 70-254) | Consensus: 1/1 agents ✓

The infer_schema_from_database function only processes entries where is_root_field is true and silently skips all other entries. If the root ID detection logic is incorrect (e.g., context_id doesn't match expected format), legitimate fields may be silently excluded from the inferred schema with no warning or indication to the user. Additionally, entries that fail borsh::from_slice::<EntityIndex> deserialization are silently skipped.

Suggested fix:

Add logging or return diagnostic information about skipped entries. Consider returning a count of total entries scanned vs. matched, or track deserialization failures separately so users can diagnose incomplete schema inference.

💡 Suggestion (1)

1. Counter's internal map naming uses reserved prefix that could conflict with future use

File: crates/storage/src/collections/counter.rs (line 218-245) | Consensus: 1/1 agents ✓

The new_with_field_name_internal method uses __counter_internal_ as a reserved prefix for internal maps. While the code documents this prevents collisions with user-created collections, this convention is implicit and not enforced. If future code or user collections accidentally use this prefix, it could cause subtle bugs. The prefix is also fairly long, increasing storage overhead for every Counter instance.

Suggested fix:

Consider using a shorter but equally unique prefix (e.g., `_ci_` or a single special character sequence), or document this reserved prefix convention in the CRDT metadata/schema so tooling can validate against it.

_{🤖 Generated by AI Code Reviewer | Review ID: review-b8594d44}

meroreviewer · 2026-02-05T08:59:01Z

tools/merodb/src/abi.rs

+                    .as_ref()
+                    .map(|id| id.as_bytes() == &root_id_bytes)
+                    .unwrap_or(false);
+


🟡 Schema inference type defaults may produce incorrect type information

The schema inference defaults all UnorderedMap to Map<String, String>, Vector to List<String>, etc. These default type parameters may not match the actual data types stored in the database. When users rely on this inferred schema for inspection, they may get incorrect deserialization results for values that are not actually strings.

Suggested fix:

Document clearly that inferred schemas use default type parameters and may need manual refinement. Consider adding a comment in the generated schema indicating it was auto-inferred with default types, or attempt to sample actual entries to infer more accurate type information.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T08:59:03Z

tools/merodb/src/abi.rs


    load_state_schema_from_json_value(&schema_value)
 }
+


🟡 Schema inference silently skips non-root fields and unrecognized entries

The infer_schema_from_database function only processes entries where is_root_field is true and silently skips all other entries. If the root ID detection logic is incorrect (e.g., context_id doesn't match expected format), legitimate fields may be silently excluded from the inferred schema with no warning or indication to the user. Additionally, entries that fail borsh::from_slice::<EntityIndex> deserialization are silently skipped.

Suggested fix:

Add logging or return diagnostic information about skipped entries. Consider returning a count of total entries scanned vs. matched, or track deserialization failures separately so users can diagnose incomplete schema inference.

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer

🤖 AI Code Reviewer

Reviewed by 1 agents | Quality score: 85% | Review time: 531.7s

🟡 Issues pending resolution

✅ 7 Fixed | 🆕 5 New

✅ Fixed Issues

The following issues from previous reviews have been addressed:

~~Breaking change: compute_id modification invalidates existing data~~ (crates/storage/src/collections.rs:68)
~~Backward-compatible deserialization may silently consume unrelated data~~ (crates/storage/src/entities.rs:501)
~~Inconsistent field name initialization for LwwRegister fields~~ (apps/state-schema-conformance/src/lib.rs:126)
~~Counter's internal map naming convention is fragile~~ (crates/storage/src/collections/counter.rs:227)
~~New CrdtType variants lack documentation~~ (crates/storage/src/collections/crdt_meta.rs:16)
~~Schema inference type defaults may produce incorrect type information~~ (tools/merodb/src/abi.rs:130)
~~Schema inference silently skips non-root fields and unrecognized entries~~ (tools/merodb/src/abi.rs:70)

🆕 New Issues

These issues were found in the latest changes:

🟡 Warning (2)

1. Fragile backward compatibility in Metadata deserialization

File: crates/storage/src/entities.rs (line 494-507) | Consensus: 1/1 agents ✓

The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing crdt_type and field_name fields in old data. This approach is fragile: if there's any leftover data in the reader (from a different serialization version or corruption), it may be incorrectly interpreted as these new fields rather than causing a proper error. The deserialization will silently consume bytes that may not be valid Option or Option data.

Suggested fix:

Consider using a version byte at the start of Metadata serialization to explicitly distinguish old vs new format, or use a more robust migration approach that validates the data being read.

2. Inconsistent CRDT type assignment for Counter internal maps

File: crates/storage/src/collections/counter.rs (line 218-248) | Consensus: 1/1 agents ✓

In new_with_field_name_internal, the positive map is assigned CrdtType::Counter but the negative map uses new_with_field_name_internal which assigns CrdtType::UnorderedMap by default. This inconsistency could cause issues during schema inference or merge operations, where the negative map would appear as a separate UnorderedMap entity rather than being recognized as part of a Counter.

Suggested fix:

Either mark both internal maps with `CrdtType::Counter` (or a new internal marker type), or document why this asymmetry is intentional and safe for merge operations.

💡 Suggestion (2)

1. Schema inference defaults all types to String which may cause deserialization failures

File: tools/merodb/src/abi.rs (line 70-185) | Consensus: 1/1 agents ✓

The infer_schema_from_database function defaults value types to String (e.g., TypeRef::string() for map values and list items) when inferring schema from CRDT types. This means the inferred schema won't match the actual data types stored, potentially causing issues when trying to deserialize values using the inferred schema. Users may get incorrect type information or deserialization errors.

Suggested fix:

Document clearly that inferred schemas provide structural information only (field names and CRDT types) but not value type information. Consider adding a warning in the output or requiring explicit type hints for accurate schema inference.

2. Domain separator collision safety relies on undocumented assumption

File: crates/storage/src/collections.rs (line 57-86) | Consensus: 1/1 agents ✓

The domain separator approach for preventing ID collisions between map entries and collections is sound, but it relies on the assumption that parent IDs are always fixed-length (32 bytes). If the hashing scheme were ever changed or if parent_id format varies, the domain separation could break. The safety invariant should be documented.

Suggested fix:

Add a comment documenting that collision safety depends on parent IDs being fixed 32-byte values, and that the domain separators must remain unique from any valid user key prefix.

📝 Nitpick (1)

1. Inconsistent field_name handling between inner map and storage element

File: crates/storage/src/collections/frozen.rs (line 44-55) | Consensus: 1/1 agents ✓

In new_with_field_name, the inner UnorderedMap gets a prefixed name (__frozen_storage_{field_name}) while the storage Element gets the original field_name. This dual naming could cause confusion during schema inference - the FrozenStorage would be discoverable by its original name, but its internal map would appear with the prefixed name.

Suggested fix:

Consider whether the internal map should be excluded from schema inference (by not setting a field_name), or document this naming convention clearly.

_{🤖 Generated by AI Code Reviewer | Review ID: review-3e6f5b94}

meroreviewer · 2026-02-05T09:00:52Z

tools/merodb/src/abi.rs


    load_state_schema_from_json_value(&schema_value)
 }
+


💡 Schema inference defaults all types to String which may cause deserialization failures

The infer_schema_from_database function defaults value types to String (e.g., TypeRef::string() for map values and list items) when inferring schema from CRDT types. This means the inferred schema won't match the actual data types stored, potentially causing issues when trying to deserialize values using the inferred schema. Users may get incorrect type information or deserialization errors.

Suggested fix:

Document clearly that inferred schemas provide structural information only (field names and CRDT types) but not value type information. Consider adding a warning in the output or requiring explicit type hints for accurate schema inference.

meroreviewer bot reviewed Feb 5, 2026

View reviewed changes

xilosada force-pushed the feat/merodb-schema-inference branch from afeb8cb to 148a530 Compare February 5, 2026 08:48

xilosada force-pushed the feat/merodb-schema-inference branch from cd77a47 to 5c1c97e Compare February 5, 2026 08:50

xilosada changed the base branch from master to feat/combined-deterministic-ids-and-metadata February 5, 2026 08:51

meroreviewer bot reviewed Feb 5, 2026

View reviewed changes

Conversation

xilosada commented Feb 5, 2026

Summary

⚠️ Depends On

Planned Changes

Schema Inference (abi.rs)

Export Types (export.rs)

CLI Updates (export/cli.rs)

GUI Updates

Benefits

Related

Uh oh!

meroreviewer bot left a comment

Choose a reason for hiding this comment

🤖 AI Code Reviewer

🔴 Critical (1)

1. Breaking change: compute_id modification invalidates existing data

🟡 Warning (1)

1. Backward-compatible deserialization may silently consume unrelated data

💡 Suggestion (2)

1. Inconsistent field name initialization for LwwRegister fields

2. Counter's internal map naming convention is fragile

📝 Nitpick (1)

1. New CrdtType variants lack documentation

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

meroreviewer bot Feb 5, 2026

Schema Inference (`abi.rs`)

Export Types (`export.rs`)

CLI Updates (`export/cli.rs`)