feat(merodb): add schema inference from database metadata #1865

meroreviewer · 2026-02-05T08:59:03Z

🟡 Schema inference silently skips non-root fields and unrecognized entries

The infer_schema_from_database function only processes entries where is_root_field is true and silently skips all other entries. If the root ID detection logic is incorrect (e.g., context_id doesn't match expected format), legitimate fields may be silently excluded from the inferred schema with no warning or indication to the user. Additionally, entries that fail borsh::from_slice::<EntityIndex> deserialization are silently skipped.

Suggested fix:

Add logging or return diagnostic information about skipped entries. Consider returning a count of total entries scanned vs. matched, or track deserialization failures separately so users can diagnose incomplete schema inference.
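
One way to surface the skipped entries is a small counter struct returned alongside the schema. The following is a minimal sketch, not code from this PR: `InferenceStats`, `Entry`, and `collect_stats` are hypothetical names, and `Entry` is a simplified stand-in for a decoded `EntityIndex`.

```rust
/// Hypothetical diagnostics for one inference pass (not part of the PR).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct InferenceStats {
    pub scanned: usize,
    pub decode_failures: usize,
    pub non_root: usize,
    pub missing_field_name: usize,
    pub matched: usize,
}

/// Simplified stand-in for a decoded `EntityIndex` (assumption for illustration).
pub struct Entry {
    pub is_root_field: bool,
    pub field_name: Option<String>,
}

/// Classify every scanned value. `None` models a value that failed to decode.
pub fn collect_stats(entries: &[Option<Entry>]) -> InferenceStats {
    let mut stats = InferenceStats::default();
    for entry in entries {
        stats.scanned += 1;
        match entry {
            None => stats.decode_failures += 1,
            Some(e) if !e.is_root_field => stats.non_root += 1,
            Some(e) if e.field_name.is_none() => stats.missing_field_name += 1,
            Some(_) => stats.matched += 1,
        }
    }
    stats
}
```

Because the four skip/match counters always sum to `scanned`, a caller can tell an empty database apart from one where every entry was filtered out.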

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T09:00:52Z

💡 Schema inference defaults all types to String which may cause deserialization failures

The infer_schema_from_database function defaults value types to String (e.g., TypeRef::string() for map values and list items) when inferring schema from CRDT types. This means the inferred schema won't match the actual data types stored, potentially causing issues when trying to deserialize values using the inferred schema. Users may get incorrect type information or deserialization errors.

Suggested fix:

Document clearly that inferred schemas provide structural information only (field names and CRDT types) but not value type information. Consider adding a warning in the output or requiring explicit type hints for accurate schema inference.
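
A warning in the output could be generated per field. The sketch below assumes a hypothetical `CrdtKind` enum mirroring some of the CRDT kinds handled in the diff, and a hypothetical `placeholder_caveat` helper; neither exists in the PR.

```rust
/// Hypothetical mirror of some CRDT kinds handled in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CrdtKind {
    UnorderedMap,
    Vector,
    UnorderedSet,
    Counter,
    LwwRegister,
}

/// Returns a caveat when the inferred element types are placeholders.
pub fn placeholder_caveat(field_name: &str, kind: CrdtKind) -> Option<String> {
    let shape = match kind {
        CrdtKind::UnorderedMap => "Map<String, String>",
        CrdtKind::Vector | CrdtKind::UnorderedSet => "List<String>",
        CrdtKind::LwwRegister => "LwwRegister<String>",
        // The diff gives Counter a fixed Map<String, u64> shape, so no caveat.
        CrdtKind::Counter => return None,
    };
    Some(format!(
        "field `{}`: element types in {} are placeholders, not observed types",
        field_name, shape
    ))
}
```

Printing these caveats next to the inferred schema keeps the structural information usable while making the String defaults explicit.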

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T16:04:53Z

🟡 Schema inference with no context_id uses arbitrary fallback

When context_id is None, the function uses [0u8; 32] as the root ID. This is documented with a warning, but the root ID comparison logic at lines 137-141 checks whether parent_id equals root_id_bytes. If the actual context has a different ID, legitimate root fields will be incorrectly filtered out, resulting in an empty or incomplete inferred schema.

Suggested fix:

Consider requiring `context_id` as a non-optional parameter, or scan all EntityIndex entries first to discover available context IDs and warn the user to specify one.
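
The discovery scan could look roughly like this. It is a sketch under the key layout stated in the diff's comment (a 32-byte context ID followed by a 32-byte state key); `discover_context_ids` is a hypothetical helper that takes raw keys, so it makes no assumptions about the RocksDB API.

```rust
use std::collections::BTreeSet;

/// Collect the distinct 32-byte key prefixes, treated here as context IDs.
/// Keys shorter than 32 bytes are skipped.
pub fn discover_context_ids<'a, I>(keys: I) -> BTreeSet<[u8; 32]>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut ids = BTreeSet::new();
    for key in keys {
        if let Some(prefix) = key.get(..32) {
            let mut id = [0u8; 32];
            id.copy_from_slice(prefix);
            ids.insert(id);
        }
    }
    ids
}
```

If the returned set holds more than one ID, the caller can list them and ask the user to pick one instead of falling back to all zeros.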

meroreviewer · 2026-02-05T15:05:21Z

🟡 Schema inference with no context_id uses arbitrary fallback

When context_id is None, the function uses [0u8; 32] as the root ID and continues processing. The comment acknowledges 'we can't determine root fields reliably', but the function still returns a potentially incorrect schema. This could lead to confusing results when used without a context_id.

Suggested fix:

Consider returning an error or a clearly-marked 'unreliable' schema when context_id is None, or scan for all unique context IDs first and warn the user about multiple contexts. At minimum, add a warning log when using the fallback.
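
The error-returning variant might be sketched as follows. `resolve_root_id` is a hypothetical helper, and a plain `String` error stands in for the `eyre` error type used in the PR.

```rust
/// Resolve the root ID, refusing to guess when no context ID is given.
pub fn resolve_root_id(context_id: Option<&[u8]>) -> Result<[u8; 32], String> {
    let bytes = context_id
        .ok_or_else(|| "context_id is required for schema inference".to_string())?;
    if bytes.len() != 32 {
        return Err(format!(
            "context_id must be exactly 32 bytes, got {} bytes",
            bytes.len()
        ));
    }
    let mut id = [0u8; 32];
    id.copy_from_slice(bytes);
    Ok(id)
}
```

This trades convenience for correctness: callers that really want a best-effort scan across all contexts would need a separate, clearly named entry point.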

✅ Resolved - This issue has been addressed in the latest changes.

meroreviewer · 2026-02-05T08:59:01Z

🟡 Schema inference type defaults may produce incorrect type information

The schema inference defaults every UnorderedMap to Map<String, String>, every Vector to List<String>, and so on. These default type parameters may not match the actual data types stored in the database. When users rely on this inferred schema for inspection, they may get incorrect deserialization results for values that are not actually strings.

Suggested fix:

Document clearly that inferred schemas use default type parameters and may need manual refinement. Consider adding a comment in the generated schema indicating it was auto-inferred with default types, or attempt to sample actual entries to infer more accurate type information.
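
Sampling could start with a cheap plausibility check on raw value bytes. The sketch below relies on the Borsh encoding of a string (a little-endian u32 length followed by that many UTF-8 bytes). It assumes the sampled bytes are a bare Borsh-encoded value, which this PR does not confirm, and `looks_like_borsh_string` is a hypothetical name.

```rust
/// Heuristic: do these bytes parse as exactly one Borsh-encoded string?
/// A `true` result is only a hint; other types can produce the same bytes.
pub fn looks_like_borsh_string(bytes: &[u8]) -> bool {
    if bytes.len() < 4 {
        return false;
    }
    // Borsh strings start with a little-endian u32 byte length.
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let payload = &bytes[4..];
    payload.len() == len && std::str::from_utf8(payload).is_ok()
}
```

False positives are possible (for example, the u64 value 4 encodes to bytes that also pass this check), so a sampler would need to examine several entries and should still label the result as a guess.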

✅ Resolved - This issue has been addressed in the latest changes.

@@ -37,7 +37,7 @@ impl KvStore {
         #[app::init]
         pub fn init() -> KvStore {
             KvStore {
-                items: UnorderedMap::new(),
+                items: UnorderedMap::new_with_field_name("items"),
             }
         }

@@ -68,11 +68,11 @@ impl NestedCrdtTest {
         #[app::init]
         pub fn init() -> NestedCrdtTest {
             NestedCrdtTest {
-                counters: UnorderedMap::new(),
-                registers: UnorderedMap::new(),
-                metadata: UnorderedMap::new(),
-                metrics: Vector::new(),
-                tags: UnorderedMap::new(),
+                counters: UnorderedMap::new_with_field_name("counters"),
+                registers: UnorderedMap::new_with_field_name("registers"),
+                metadata: UnorderedMap::new_with_field_name("metadata"),
+                metrics: Vector::new_with_field_name("metrics"),
+                tags: UnorderedMap::new_with_field_name("tags"),
             }
         }

@@ -1,6 +1,7 @@
     use std::fs;
     use std::path::Path;
+    use calimero_storage::collections::CrdtType;
     use calimero_wasm_abi::schema::Manifest;
     use eyre::Result;
@@ ... @@
         load_state_schema_from_json_value(&schema_value)
     }
+    /// Infer state schema from database by reading field names and CRDT types from metadata
+    ///
+    /// This function scans the State column for EntityIndex entries and builds a schema
+    /// based on field_name and crdt_type found in metadata. This enables schema-free
+    /// database inspection when field names are stored in metadata.
+    ///
+    /// # Arguments
+    /// * `db` - The database to scan
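+    /// * `context_id` - Optional 32-byte context ID to filter by. If None, entries from all contexts
+    ///   are scanned and `[0u8; 32]` is used as the fallback root ID, so the result may be incomplete
+    ///   or mix fields from multiple contexts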
+    pub fn infer_schema_from_database(
+        db: &rocksdb::DBWithThreadMode<rocksdb::SingleThreaded>,
+        context_id: Option<&[u8]>,
+    ) -> Result<Manifest> {
+        use calimero_wasm_abi::schema::{
+            CollectionType, CrdtCollectionType, Field, ScalarType, TypeDef, TypeRef,
+        };
+        use std::collections::BTreeMap;
+        let state_cf = db
+            .cf_handle("State")
+            .ok_or_else(|| eyre::eyre!("State column family not found"))?;
+        let mut fields = Vec::new();
+        let mut seen_field_names = std::collections::HashSet::new();
+        // Root ID depends on context:
+        // - If context_id is provided, root ID is that context_id (Id::root() returns context_id())
+        // - If no context_id, we can't determine root fields reliably, so use all zeros as fallback
+        let root_id_bytes: [u8; 32] = match context_id {
+            Some(ctx_id) => ctx_id.try_into().map_err(|_| {
+                eyre::eyre!(
+                    "context_id must be exactly 32 bytes, got {} bytes",
+                    ctx_id.len()
+                )
+            })?,
+            None => {
+                eprintln!(
+                    "[WARNING] No context_id provided for schema inference. \
+                    Using [0; 32] as fallback root ID. This may produce incorrect or incomplete \
+                    schema if the database contains multiple contexts. Consider providing a \
+                    specific context_id for accurate schema inference."
+                );
+                [0u8; 32]
+            }
+        };
+        // Scan State column for EntityIndex entries
+        let iter = db.iterator_cf(&state_cf, rocksdb::IteratorMode::Start);
+        for item in iter {
+            let (key, value) = item?;
+            // Filter by context_id if provided (key format: context_id (32 bytes) + state_key (32 bytes))
+            if let Some(expected_context_id) = context_id {
+                if key.len() < 32 || &key[..32] != expected_context_id {
+                    continue;
+                }
+            }
+            // Try to deserialize as EntityIndex
+            if let Ok(index) = borsh::from_slice::<crate::export::EntityIndex>(&value) {
+                // Check if this is a root-level field (parent_id is None or equals root/context_id)
+                let is_root_field = index.parent_id.is_none()
+                    || index
+                        .parent_id
+                        .as_ref()
+                        .map(|id| id.as_bytes() == &root_id_bytes)
+                        .unwrap_or(false);
+                if is_root_field {
+                    // Check if we have field_name in metadata
+                    if let Some(ref field_name) = index.metadata.field_name {
+                        if !seen_field_names.contains(field_name) {
+                            seen_field_names.insert(field_name.clone());
+                            // Infer type from crdt_type
+                            let type_ref = if let Some(crdt_type) = index.metadata.crdt_type {
+                                match crdt_type {
+                                    CrdtType::UnorderedMap => {
+                                        // Default to Map<String, String> - can be refined later
+                                        TypeRef::Collection {
+                                            collection: CollectionType::Map {
+                                                key: Box::new(TypeRef::string()),
+                                                value: Box::new(TypeRef::string()),
+                                            },
+                                            crdt_type: Some(CrdtCollectionType::UnorderedMap),
+                                            inner_type: None,
+                                        }
+                                    }
+                                    CrdtType::Vector => TypeRef::Collection {
+                                        collection: CollectionType::List {
+                                            items: Box::new(TypeRef::string()),
+                                        },
+                                        crdt_type: Some(CrdtCollectionType::Vector),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::UnorderedSet => TypeRef::Collection {
+                                        collection: CollectionType::List {
+                                            items: Box::new(TypeRef::string()),
+                                        },
+                                        crdt_type: Some(CrdtCollectionType::UnorderedSet),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::Counter => TypeRef::Collection {
+                                        // Counter is stored as Map<String, u64> internally
+                                        collection: CollectionType::Map {
+                                            key: Box::new(TypeRef::string()),
+                                            value: Box::new(TypeRef::Scalar(ScalarType::U64)),
+                                        },
+                                        crdt_type: Some(CrdtCollectionType::Counter),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::Rga => TypeRef::Collection {
+                                        collection: CollectionType::Record { fields: Vec::new() },
+                                        crdt_type: Some(CrdtCollectionType::ReplicatedGrowableArray),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::LwwRegister => TypeRef::Collection {
+                                        collection: CollectionType::Record { fields: Vec::new() },
+                                        crdt_type: Some(CrdtCollectionType::LwwRegister),
+                                        inner_type: Some(Box::new(TypeRef::string())),
+                                    },
+                                    CrdtType::UserStorage => TypeRef::Collection {
+                                        collection: CollectionType::Map {
+                                            key: Box::new(TypeRef::string()),
+                                            value: Box::new(TypeRef::string()),
+                                        },
+                                        crdt_type: Some(CrdtCollectionType::UnorderedMap),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::FrozenStorage => TypeRef::Collection {
+                                        collection: CollectionType::Map {
+                                            key: Box::new(TypeRef::string()),
+                                            value: Box::new(TypeRef::string()),
+                                        },
+                                        crdt_type: Some(CrdtCollectionType::UnorderedMap),
+                                        inner_type: None,
+                                    },
+                                    CrdtType::Record => {
+                                        // Record type - would need to inspect children to infer fields
+                                        TypeRef::Collection {
+                                            collection: CollectionType::Record { fields: Vec::new() },
+                                            crdt_type: None,
+                                            inner_type: None,
+                                        }
+                                    }
+                                    CrdtType::Custom(_) => {
+                                        // Custom type - can't infer without schema
+                                        TypeRef::Collection {
+                                            collection: CollectionType::Record { fields: Vec::new() },
+                                            crdt_type: None,
+                                            inner_type: None,
+                                        }
+                                    }
+                                }
+                            } else {
+                                // No CRDT type - default to LWW register
+                                TypeRef::Collection {
+                                    collection: CollectionType::Record { fields: Vec::new() },
+                                    crdt_type: Some(CrdtCollectionType::LwwRegister),
+                                    inner_type: Some(Box::new(TypeRef::string())),
+                                }
+                            };
+                            fields.push(Field {
+                                name: field_name.clone(),
+                                type_: type_ref,
+                                nullable: None,
+                            });
+                        }
+                    }
+                }
+            }
+        }
+        // Create a record type with all inferred fields
+        let state_root_type = "InferredStateRoot".to_string();
+        let mut types = BTreeMap::new();
+        types.insert(
+            state_root_type.clone(),
+            TypeDef::Record {
+                fields: fields.clone(),
+            },
+        );
+        Ok(Manifest {
+            schema_version: "wasm-abi/1".to_string(),
+            types,
+            methods: Vec::new(),
+            events: Vec::new(),
+            state_root: Some(state_root_type),
+        })
+    }
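
The root-field check in the diff above reduces to a small predicate, which makes the fallback problem raised in the review easy to see. This is an illustrative sketch: `is_root_field` as a free function over plain byte arrays is an assumption, not the PR's code, which works on `EntityIndex` and its ID type.

```rust
/// An entry counts as a root field when it has no parent,
/// or when its parent is the root (context) ID.
pub fn is_root_field(parent_id: Option<&[u8; 32]>, root_id: &[u8; 32]) -> bool {
    parent_id.map_or(true, |id| id == root_id)
}
```

With a real context ID of `[9; 32]` and the `[0; 32]` fallback as `root_id`, an entry whose parent is `[9; 32]` evaluates to `false` and is dropped from the inferred schema.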
