feat(merodb): add schema inference from database metadata#1865
feat(merodb): add schema inference from database metadata#1865xilosada wants to merge 1 commit intofeat/combined-deterministic-ids-and-metadatafrom
Conversation
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 85% | Review time: 92.3s
🔴 Critical (1)
1. Breaking change: compute_id modification invalidates existing data
File: crates/storage/src/collections.rs (line 68-74) | Consensus: 1/1 agents ✓
The compute_id function has been modified to include DOMAIN_SEPARATOR_ENTRY in the hash computation. This changes the IDs of ALL existing map entries in production databases. Any data stored with the old ID computation will no longer be accessible because lookups will use the new ID formula, effectively causing data loss.
Suggested fix:
This is a breaking change that requires a migration strategy. Either: (1) Add a new function `compute_id_v2` and use it only for new APIs while keeping `compute_id` unchanged for existing code paths, (2) Implement a database migration that re-keys all existing entries, or (3) Store a version marker to determine which ID computation to use.
Found by: cursor-agent
🟡 Warning (1)
1. Backward-compatible deserialization may silently consume unrelated data
File: crates/storage/src/entities.rs (line 501-510) | Consensus: 1/1 agents ✓
The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing fields in old data. However, Borsh deserialization of Option<T> reads a discriminant byte first. If old data has ANY trailing bytes (e.g., from a different schema version or corruption), those bytes will be incorrectly parsed as crdt_type or field_name rather than being detected as an error. This could lead to silent data corruption.
Suggested fix:
Consider checking if the reader is at EOF before attempting to read optional fields, or use a versioned serialization format with an explicit version marker at the start of the metadata.
Found by: cursor-agent
💡 Suggestion (2)
1. Inconsistent field name initialization for LwwRegister fields
File: apps/state-schema-conformance/src/lib.rs (line 126-129) | Consensus: 1/1 agents ✓
Most collections in the init() function have been updated to use new_with_field_name(), but status, user_id, counter, name, and active fields still use LwwRegister::new() without field names. This inconsistency means these fields won't benefit from schema inference and will have non-deterministic IDs across nodes.
Suggested fix:
Either add `new_with_field_name` support to LwwRegister and update these fields, or document why LwwRegister is intentionally excluded from deterministic ID generation.
Found by: cursor-agent
2. Counter's internal map naming convention is fragile
File: crates/storage/src/collections/counter.rs (line 216-230) | Consensus: 1/1 agents ✓
The Counter uses a string prefix __counter_internal_ to prevent collisions with user-created collections. While documented, this relies on users not creating collections with this prefix. A malicious or careless user could still create a collection named __counter_internal_visits_positive and cause a collision.
Suggested fix:
Consider using a different domain separator mechanism (like a UUID or type-specific bytes) that cannot be expressed as a valid user field name, similar to how DOMAIN_SEPARATOR_ENTRY works.
Found by: cursor-agent
📝 Nitpick (1)
1. New CrdtType variants lack documentation
File: crates/storage/src/collections/crdt_meta.rs (line 16-34) | Consensus: 1/1 agents ✓
The newly added UserStorage and FrozenStorage variants have only brief one-line comments, while other variants are similarly terse. As the enum grows and is used for merge dispatch during state synchronization, clearer documentation about when each type is used and its merge semantics would be valuable.
Suggested fix:
Add doc comments explaining the purpose and merge behavior of UserStorage and FrozenStorage variants.
Found by: cursor-agent
🤖 Generated by AI Code Reviewer | Review ID: review-1bd71bd9
| const DOMAIN_SEPARATOR_COLLECTION: &[u8] = b"__calimero_collection__"; | ||
|
|
||
| /// Compute the ID for a key in a map. | ||
| /// Uses domain separation to prevent collision with collection IDs. |
There was a problem hiding this comment.
🔴 Breaking change: compute_id modification invalidates existing data
The compute_id function has been modified to include DOMAIN_SEPARATOR_ENTRY in the hash computation. This changes the IDs of ALL existing map entries in production databases. Any data stored with the old ID computation will no longer be accessible because lookups will use the new ID formula, effectively causing data loss.
Suggested fix:
This is a breaking change that requires a migration strategy. Either: (1) Add a new function `compute_id_v2` and use it only for new APIs while keeping `compute_id` unchanged for existing code paths, (2) Implement a database migration that re-keys all existing entries, or (3) Store a version marker to determine which ID computation to use.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
| let created_at = u64::deserialize_reader(reader)?; | ||
| let updated_at = UpdatedAt::deserialize_reader(reader)?; | ||
| let storage_type = StorageType::deserialize_reader(reader)?; | ||
|
|
There was a problem hiding this comment.
🟡 Backward-compatible deserialization may silently consume unrelated data
The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing fields in old data. However, Borsh deserialization of Option<T> reads a discriminant byte first. If old data has ANY trailing bytes (e.g., from a different schema version or corruption), those bytes will be incorrectly parsed as crdt_type or field_name rather than being detected as an error. This could lead to silent data corruption.
Suggested fix:
Consider checking if the reader is at EOF before attempting to read optional fields, or use a versioned serialization format with an explicit version marker at the start of the metadata.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
| map_of_lists: UnorderedMap::new_with_field_name(None, "map_of_lists"), | ||
| list_of_maps: Vector::new_with_field_name(None, "list_of_maps"), | ||
| string_set: UnorderedSet::new_with_field_name(None, "string_set"), | ||
| visit_counter: Counter::new_with_field_name(None, "visit_counter"), |
There was a problem hiding this comment.
💡 Inconsistent field name initialization for LwwRegister fields
Most collections in the init() function have been updated to use new_with_field_name(), but status, user_id, counter, name, and active fields still use LwwRegister::new() without field names. This inconsistency means these fields won't benefit from schema inference and will have non-deterministic IDs across nodes.
Suggested fix:
Either add `new_with_field_name` support to LwwRegister and update these fields, or document why LwwRegister is intentionally excluded from deterministic ID generation.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
| field_name: &str, | ||
| ) -> Self { | ||
| // For Counter, we need to create deterministic IDs for both positive and negative maps | ||
| // Use a reserved internal prefix to prevent collisions with user-created collections. |
There was a problem hiding this comment.
💡 Counter's internal map naming convention is fragile
The Counter uses a string prefix __counter_internal_ to prevent collisions with user-created collections. While documented, this relies on users not creating collections with this prefix. A malicious or careless user could still create a collection named __counter_internal_visits_positive and cause a collision.
Suggested fix:
Consider using a different domain separator mechanism (like a UUID or type-specific bytes) that cannot be expressed as a valid user field name, similar to how DOMAIN_SEPARATOR_ENTRY works.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
|
|
||
| /// Identifies the specific CRDT type | ||
| #[derive(Debug, Clone, PartialEq, Eq)] | ||
| #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, BorshSerialize, BorshDeserialize)] |
There was a problem hiding this comment.
📝 New CrdtType variants lack documentation
The newly added UserStorage and FrozenStorage variants have only brief one-line comments, while other variants are similarly terse. As the enum grows and is used for merge dispatch during state synchronization, clearer documentation about when each type is used and its merge semantics would be valuable.
Suggested fix:
Add doc comments explaining the purpose and merge behavior of UserStorage and FrozenStorage variants.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
Merobox Workflows FailedThe following workflow(s) failed after retries:
Please check the workflow logs for more details. |
afeb8cb to
148a530
Compare
- Schema inference from database metadata (field_name, crdt_type) - Improved tree view with field names instead of truncated IDs - Better visualization for Counter, Vector, Set values - Show children count and hide empty sections in sidebar
cd77a47 to
5c1c97e
Compare
Merobox Proposals Workflows FailedThe following proposal workflow(s) failed:
Please check the workflow logs for more details. |
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 95% | Review time: 503.6s
✅ Ready to Merge
All previously identified issues have been addressed!
✅ Fixed Issues
The following issues from previous reviews have been addressed:
Breaking change: compute_id modification invalidates existing data(crates/storage/src/collections.rs:68)Backward-compatible deserialization may silently consume unrelated data(crates/storage/src/entities.rs:501)Inconsistent field name initialization for LwwRegister fields(apps/state-schema-conformance/src/lib.rs:126)Counter's internal map naming convention is fragile(crates/storage/src/collections/counter.rs:227)New CrdtType variants lack documentation(crates/storage/src/collections/crdt_meta.rs:16)
🤖 Generated by AI Code Reviewer | Review ID: review-48ee257a
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 85% | Review time: 481.6s
🔴 Critical issues require attention
✅ 5 Fixed | 🆕 5 New
✅ Fixed Issues
The following issues from previous reviews have been addressed:
Breaking change: compute_id modification invalidates existing data(crates/storage/src/collections.rs:68)Backward-compatible deserialization may silently consume unrelated data(crates/storage/src/entities.rs:501)Inconsistent field name initialization for LwwRegister fields(apps/state-schema-conformance/src/lib.rs:126)Counter's internal map naming convention is fragile(crates/storage/src/collections/counter.rs:227)New CrdtType variants lack documentation(crates/storage/src/collections/crdt_meta.rs:16)
🆕 New Issues
These issues were found in the latest changes:
🔴 Critical (2)
1. Domain separator addition is a breaking change for existing data
File: crates/storage/src/collections.rs (line 66-75) | Consensus: 1/1 agents ✓
Adding DOMAIN_SEPARATOR_ENTRY to compute_id() changes the hash computation from SHA256(parent || key) to SHA256(parent || "__calimero_entry__" || key). This means all existing map entries stored in databases will have IDs computed with the OLD formula, while new lookups will use the NEW formula. Existing data will become inaccessible because lookups will fail to find entries stored under old IDs.
Suggested fix:
Either: (1) Provide a migration path that rehashes all existing entries, (2) Add a fallback that tries the old ID computation when the new one fails, or (3) Document this as a breaking change requiring fresh databases. Consider whether this should be gated behind a feature flag or version check.
Found by: cursor-agent
2. Borsh backward compatibility deserialization may read garbage data
File: crates/storage/src/entities.rs (line 495-512) | Consensus: 1/1 agents ✓
The custom BorshDeserialize implementation for Metadata uses unwrap_or(None) to handle missing crdt_type and field_name fields in old data. However, when Metadata is part of a compound structure (e.g., Vec<ChildInfo> in index entries), if old-format Metadata is followed by another item, the deserializer will interpret the first bytes of the NEXT item as the Option discriminant for crdt_type. This will either corrupt deserialization or silently produce incorrect data, since Option::<CrdtType>::deserialize_reader expects a specific format (0x00 for None, 0x01 for Some followed by enum data).
Suggested fix:
Use a versioning scheme or length-prefixed format for Metadata. Alternatively, track remaining bytes before attempting to read optional fields, or use a separate migration to update all stored Metadata before deploying this change.
Found by: cursor-agent
🟡 Warning (2)
1. Schema inference type defaults may produce incorrect type information
File: tools/merodb/src/abi.rs (line 130-180) | Consensus: 1/1 agents ✓
The schema inference defaults all UnorderedMap to Map<String, String>, Vector to List<String>, etc. These default type parameters may not match the actual data types stored in the database. When users rely on this inferred schema for inspection, they may get incorrect deserialization results for values that are not actually strings.
Suggested fix:
Document clearly that inferred schemas use default type parameters and may need manual refinement. Consider adding a comment in the generated schema indicating it was auto-inferred with default types, or attempt to sample actual entries to infer more accurate type information.
Found by: cursor-agent
2. Schema inference silently skips non-root fields and unrecognized entries
File: tools/merodb/src/abi.rs (line 70-254) | Consensus: 1/1 agents ✓
The infer_schema_from_database function only processes entries where is_root_field is true and silently skips all other entries. If the root ID detection logic is incorrect (e.g., context_id doesn't match expected format), legitimate fields may be silently excluded from the inferred schema with no warning or indication to the user. Additionally, entries that fail borsh::from_slice::<EntityIndex> deserialization are silently skipped.
Suggested fix:
Add logging or return diagnostic information about skipped entries. Consider returning a count of total entries scanned vs. matched, or track deserialization failures separately so users can diagnose incomplete schema inference.
Found by: cursor-agent
💡 Suggestion (1)
1. Counter's internal map naming uses reserved prefix that could conflict with future use
File: crates/storage/src/collections/counter.rs (line 218-245) | Consensus: 1/1 agents ✓
The new_with_field_name_internal method uses __counter_internal_ as a reserved prefix for internal maps. While the code documents this prevents collisions with user-created collections, this convention is implicit and not enforced. If future code or user collections accidentally use this prefix, it could cause subtle bugs. The prefix is also fairly long, increasing storage overhead for every Counter instance.
Suggested fix:
Consider using a shorter but equally unique prefix (e.g., `_ci_` or a single special character sequence), or document this reserved prefix convention in the CRDT metadata/schema so tooling can validate against it.
Found by: cursor-agent
🤖 Generated by AI Code Reviewer | Review ID: review-b8594d44
| .as_ref() | ||
| .map(|id| id.as_bytes() == &root_id_bytes) | ||
| .unwrap_or(false); | ||
|
|
There was a problem hiding this comment.
🟡 Schema inference type defaults may produce incorrect type information
The schema inference defaults all UnorderedMap to Map<String, String>, Vector to List<String>, etc. These default type parameters may not match the actual data types stored in the database. When users rely on this inferred schema for inspection, they may get incorrect deserialization results for values that are not actually strings.
Suggested fix:
Document clearly that inferred schemas use default type parameters and may need manual refinement. Consider adding a comment in the generated schema indicating it was auto-inferred with default types, or attempt to sample actual entries to infer more accurate type information.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
|
|
||
| load_state_schema_from_json_value(&schema_value) | ||
| } | ||
|
|
There was a problem hiding this comment.
🟡 Schema inference silently skips non-root fields and unrecognized entries
The infer_schema_from_database function only processes entries where is_root_field is true and silently skips all other entries. If the root ID detection logic is incorrect (e.g., context_id doesn't match expected format), legitimate fields may be silently excluded from the inferred schema with no warning or indication to the user. Additionally, entries that fail borsh::from_slice::<EntityIndex> deserialization are silently skipped.
Suggested fix:
Add logging or return diagnostic information about skipped entries. Consider returning a count of total entries scanned vs. matched, or track deserialization failures separately so users can diagnose incomplete schema inference.
There was a problem hiding this comment.
✅ Resolved - This issue has been addressed in the latest changes.
There was a problem hiding this comment.
🤖 AI Code Reviewer
Reviewed by 1 agents | Quality score: 85% | Review time: 531.7s
🟡 Issues pending resolution
✅ 7 Fixed | 🆕 5 New
✅ Fixed Issues
The following issues from previous reviews have been addressed:
Breaking change: compute_id modification invalidates existing data(crates/storage/src/collections.rs:68)Backward-compatible deserialization may silently consume unrelated data(crates/storage/src/entities.rs:501)Inconsistent field name initialization for LwwRegister fields(apps/state-schema-conformance/src/lib.rs:126)Counter's internal map naming convention is fragile(crates/storage/src/collections/counter.rs:227)New CrdtType variants lack documentation(crates/storage/src/collections/crdt_meta.rs:16)Schema inference type defaults may produce incorrect type information(tools/merodb/src/abi.rs:130)Schema inference silently skips non-root fields and unrecognized entries(tools/merodb/src/abi.rs:70)
🆕 New Issues
These issues were found in the latest changes:
🟡 Warning (2)
1. Fragile backward compatibility in Metadata deserialization
File: crates/storage/src/entities.rs (line 494-507) | Consensus: 1/1 agents ✓
The custom BorshDeserialize implementation uses unwrap_or(None) to handle missing crdt_type and field_name fields in old data. This approach is fragile: if there's any leftover data in the reader (from a different serialization version or corruption), it may be incorrectly interpreted as these new fields rather than causing a proper error. The deserialization will silently consume bytes that may not be valid Option or Option data.
Suggested fix:
Consider using a version byte at the start of Metadata serialization to explicitly distinguish old vs new format, or use a more robust migration approach that validates the data being read.
Found by: cursor-agent
2. Inconsistent CRDT type assignment for Counter internal maps
File: crates/storage/src/collections/counter.rs (line 218-248) | Consensus: 1/1 agents ✓
In new_with_field_name_internal, the positive map is assigned CrdtType::Counter but the negative map uses new_with_field_name_internal which assigns CrdtType::UnorderedMap by default. This inconsistency could cause issues during schema inference or merge operations, where the negative map would appear as a separate UnorderedMap entity rather than being recognized as part of a Counter.
Suggested fix:
Either mark both internal maps with `CrdtType::Counter` (or a new internal marker type), or document why this asymmetry is intentional and safe for merge operations.
Found by: cursor-agent
💡 Suggestion (2)
1. Schema inference defaults all types to String which may cause deserialization failures
File: tools/merodb/src/abi.rs (line 70-185) | Consensus: 1/1 agents ✓
The infer_schema_from_database function defaults value types to String (e.g., TypeRef::string() for map values and list items) when inferring schema from CRDT types. This means the inferred schema won't match the actual data types stored, potentially causing issues when trying to deserialize values using the inferred schema. Users may get incorrect type information or deserialization errors.
Suggested fix:
Document clearly that inferred schemas provide structural information only (field names and CRDT types) but not value type information. Consider adding a warning in the output or requiring explicit type hints for accurate schema inference.
Found by: cursor-agent
2. Domain separator collision safety relies on undocumented assumption
File: crates/storage/src/collections.rs (line 57-86) | Consensus: 1/1 agents ✓
The domain separator approach for preventing ID collisions between map entries and collections is sound, but it relies on the assumption that parent IDs are always fixed-length (32 bytes). If the hashing scheme were ever changed or if parent_id format varies, the domain separation could break. The safety invariant should be documented.
Suggested fix:
Add a comment documenting that collision safety depends on parent IDs being fixed 32-byte values, and that the domain separators must remain unique from any valid user key prefix.
Found by: cursor-agent
📝 Nitpick (1)
1. Inconsistent field_name handling between inner map and storage element
File: crates/storage/src/collections/frozen.rs (line 44-55) | Consensus: 1/1 agents ✓
In new_with_field_name, the inner UnorderedMap gets a prefixed name (__frozen_storage_{field_name}) while the storage Element gets the original field_name. This dual naming could cause confusion during schema inference - the FrozenStorage would be discoverable by its original name, but its internal map would appear with the prefixed name.
Suggested fix:
Consider whether the internal map should be excluded from schema inference (by not setting a field_name), or document this naming convention clearly.
Found by: cursor-agent
🤖 Generated by AI Code Reviewer | Review ID: review-3e6f5b94
|
|
||
| load_state_schema_from_json_value(&schema_value) | ||
| } | ||
|
|
There was a problem hiding this comment.
💡 Schema inference defaults all types to String which may cause deserialization failures
The infer_schema_from_database function defaults value types to String (e.g., TypeRef::string() for map values and list items) when inferring schema from CRDT types. This means the inferred schema won't match the actual data types stored, potentially causing issues when trying to deserialize values using the inferred schema. Users may get incorrect type information or deserialization errors.
Suggested fix:
Document clearly that inferred schemas provide structural information only (field names and CRDT types) but not value type information. Consider adding a warning in the output or requiring explicit type hints for accurate schema inference.
Summary
Adds schema inference capability to merodb, allowing it to inspect databases without requiring an external schema file.
This PR depends on #1864 - must be merged first. This PR will be rebased on master after #1864 lands.
Planned Changes
Schema Inference (
abi.rs)infer_schema_from_database(db, context_id)functionfield_nameandcrdt_typefrom metadataManifestdynamically from discovered fieldsExport Types (
export.rs)EntityIndexstruct with borsh deserializationCrdtTypeenum (local copy for merodb)Metadatastruct with backward-compatible deserializationCLI Updates (
export/cli.rs)--state-schema-fileoptionalinfer_schema_from_database()when no schema providedGUI Updates
Benefits
Related