RFC: Native Vector Search Support via FalkorDB HNSW Indexes
Summary
Implement native vector similarity search in GraFlo by leveraging FalkorDB's built-in HNSW vector indexes. This enables hybrid queries combining semantic similarity with graph traversal in a single database, eliminating the need for external vector databases (Qdrant, Milvus, Pinecone).
Motivation
Current State: The Two-Database Problem
Most knowledge graph applications requiring semantic search today must maintain two separate systems:
┌─────────────────┐ ┌─────────────────┐
│ Graph DB │ │ Vector DB │
│ (Neo4j/ │ ←→ │ (Qdrant/ │
│ FalkorDB) │sync │ Milvus) │
└─────────────────┘ └─────────────────┘
↓ ↓
Graph Queries Similarity Search
(relationships) (embeddings)
Problems with this architecture:
| Issue | Impact |
|---|---|
| Data synchronization | Embeddings can drift out of sync with graph nodes |
| Operational complexity | Two databases to maintain, monitor, and back up |
| Query latency | Network hop between DBs adds ~5-20ms per query |
| Transactional integrity | No atomic operations across both systems |
| Query expressiveness | Cannot combine vector similarity with graph patterns in one query |
Proposed State: Unified Hybrid Search
┌─────────────────────────────────────┐
│ FalkorDB │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ Graph Store │ │ Vector Index │ │
│ │ (Cypher) │←→│ (HNSW) │ │
│ └─────────────┘ └──────────────┘ │
└─────────────────────────────────────┘
↓
Hybrid Queries (single roundtrip)
Technical Approach
FalkorDB Vector Capabilities
FalkorDB (v4.0+) supports HNSW vector indexes natively:
// Create vector index
CALL db.idx.vector.createNodeIndex(
    'Document', // Label
    'embedding', // Property containing vector
    'cosine', // Distance metric: cosine | L2 | IP
    1536 // Vector dimension (e.g., OpenAI text-embedding-ada-002)
)
// Query similar nodes
CALL db.idx.vector.queryNodes(
    'Document', // Label
    'embedding', // Property
    10, // Top K results
    $query_vector // Query embedding
) YIELD node, score
Hybrid Query Examples
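The examples in this section share one shape: the index returns its top K nodes first, and graph patterns or property predicates then filter that set. The sketch below imitates that behavior in plain Python, with a brute-force cosine ranking standing in for the HNSW index. This is an assumption made for illustration: `top_k_by_cosine`, `post_filter`, and the sample data are hypothetical and not part of FalkorDB or GraFlo. It shows why the queries request more candidates than the final `LIMIT`: filtering happens after the top-K cut, so a small K can leave fewer results than requested.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(nodes: list[dict], query: list[float], k: int) -> list[tuple[dict, float]]:
    """Brute-force stand-in for the vector index: rank all nodes, keep the best k."""
    scored = [(node, cosine_similarity(node["embedding"], query)) for node in nodes]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def post_filter(hits, predicate, limit: int):
    """Apply a predicate after the top-k cut, as a WHERE/MATCH after YIELD would."""
    return [(node, score) for node, score in hits if predicate(node)][:limit]


# Hypothetical sample data: four documents with 2-d embeddings.
DOCS = [
    {"title": "A", "domain": "biology", "embedding": [1.0, 0.0]},
    {"title": "B", "domain": "biology", "embedding": [0.9, 0.1]},
    {"title": "C", "domain": "artificial_intelligence", "embedding": [0.7, 0.3]},
    {"title": "D", "domain": "artificial_intelligence", "embedding": [0.5, 0.5]},
]
QUERY = [1.0, 0.0]


def is_ai(node: dict) -> bool:
    return node["domain"] == "artificial_intelligence"


# K equal to the limit: both top hits are filtered out, so nothing survives.
narrow = post_filter(top_k_by_cosine(DOCS, QUERY, k=2), is_ai, limit=2)
# Over-fetching (K larger than the limit) leaves room for the filter.
wide = post_filter(top_k_by_cosine(DOCS, QUERY, k=4), is_ai, limit=2)
```

With this data, `narrow` is empty while `wide` contains documents C and D, which is the reason Example 1 fetches 20 candidates for a `LIMIT` of 10.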
Example 1: Semantic search + relationship filtering
Find documents similar to a query, but only from authors in AI organizations:
CALL db.idx.vector.queryNodes('Document', 'embedding', 20, $query_vector)
YIELD node AS doc, score
MATCH (doc)-[:AUTHORED_BY]->(author:Person)-[:WORKS_AT]->(org:Organization)
WHERE org.domain = 'artificial_intelligence'
RETURN doc.title, doc.abstract, author.name, org.name, score
ORDER BY score DESC
LIMIT 10
Example 2: Graph-constrained semantic search
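Find the most relevant research papers within two citation hops of a known paper. As written, the index returns its global top 5, so only candidates that also appear in that set are returned: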
MATCH (seed:Paper {doi: $seed_doi})-[:CITES*1..2]->(candidate:Paper)
WITH candidate
CALL db.idx.vector.queryNodes('Paper', 'embedding', 5, $query_vector)
YIELD node, score
WHERE node = candidate
RETURN node.title, score
Example 3: Multi-modal retrieval
Combine text embeddings with structured filters:
CALL db.idx.vector.queryNodes('Product', 'description_embedding', 50, $query_vector)
YIELD node AS product, score
WHERE product.price < 100 AND product.in_stock = true
MATCH (product)-[:IN_CATEGORY]->(cat:Category)
WHERE cat.name IN ['Electronics', 'Gadgets']
RETURN product.name, product.price, score
ORDER BY score DESC
LIMIT 10
Proposed API
New Methods for FalkordbConnection
from typing import Literal

class FalkordbConnection(Connection):
    def create_vector_index(
        self,
        label: str,
        property_name: str,
        dimension: int,
        metric: Literal["cosine", "L2", "IP"] = "cosine"
    ) -> None:
        """Create HNSW vector index on a node label."""

    def drop_vector_index(
        self,
        label: str,
        property_name: str
    ) -> None:
        """Drop existing vector index."""

    def vector_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        top_k: int = 10,
        filter_query: str | None = None,
        filter_params: dict | None = None
    ) -> list[dict]:
        """
        Perform vector similarity search with optional Cypher filter.

        Parameters
        ----------
        label : str
            Node label to search
        property_name : str
            Property containing embeddings
        query_vector : list[float]
            Query embedding vector
        top_k : int
            Number of results to return
        filter_query : str, optional
            Additional Cypher WHERE clause (e.g., "node.year > 2020")
        filter_params : dict, optional
            Parameters for filter_query

        Returns
        -------
        list[dict]
            List of {node: dict, score: float} results
        """

    def hybrid_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        traversal_pattern: str,
        top_k: int = 10,
        **params
    ) -> list[dict]:
        """
        Combine vector search with graph traversal.

        Parameters
        ----------
        traversal_pattern : str
            Cypher pattern starting from vector results, e.g.:
            "(node)-[:AUTHORED_BY]->(author:Person)"
        """
Configuration Extension
class FalkordbConfig(DBConfig):
    # Existing fields...
    # New vector search defaults
    vector_index_metric: Literal["cosine", "L2", "IP"] = "cosine"
    vector_search_default_k: int = 10
Implementation Plan
| Phase | Task | Complexity |
|---|---|---|
| 1 | Add `create_vector_index` / `drop_vector_index` | Low |
| 2 | Add `vector_search` with basic filtering | Medium |
| 3 | Add `hybrid_search` with traversal patterns | Medium |
| 4 | Add batch embedding upsert optimization | Low |
| 5 | Documentation and examples | Low |
Benchmark Considerations
Before production adoption, benchmarks should compare:
| Metric | FalkorDB Native | External Vector DB |
|---|---|---|
| Query latency (p50, p99) | TBD | TBD |
| Indexing throughput | TBD | TBD |
| Memory usage | TBD | TBD |
| Max vector dimension | TBD | TBD |
| Max vectors per index | TBD | TBD |
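The latency row could be filled in with a small harness like the one below, which times a callable repeatedly and reports nearest-rank percentiles. It is a sketch under assumptions: `percentile`, `measure_latencies_ms`, and `summarize` are hypothetical helpers, and `run_query` stands for whatever issues one search against the system under test.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def measure_latencies_ms(run_query: Callable[[], object],
                         repeats: int = 100, warmup: int = 10) -> list[float]:
    """Call run_query `warmup` times untimed, then `repeats` times timed (milliseconds)."""
    for _ in range(warmup):
        run_query()
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies: list[float]) -> dict[str, float]:
    """Reduce raw latencies to the two figures named in the benchmark table."""
    return {"p50": percentile(latencies, 50), "p99": percentile(latencies, 99)}
```

Running the same harness against both setups, with identical query vectors and K, keeps the two columns comparable. For the external vector database, the timed callable should include the follow-up graph lookup so that the extra round trip is counted.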
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| External Qdrant/Milvus | Mature, highly optimized | Sync complexity, no hybrid queries |
| Neo4j + GDS | Native integration | Requires GDS license, different API |
| pgvector (PostgreSQL) | SQL familiarity | No native graph traversal |
Open Questions
- Should `hybrid_search` support arbitrary Cypher templates or use a structured API?
- How to handle embedding updates (re-index vs incremental)?
- Should we expose HNSW tuning parameters (M, efConstruction)?
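For the first question, one possible shape of a structured API is sketched below. Callers describe each hop as data and the library renders the Cypher pattern, so labels and relationship types can be validated instead of accepting free-form text. `Hop`, `render_pattern`, and their fields are hypothetical and only illustrate the trade-off; arbitrary templates remain more expressive.

```python
import re
from dataclasses import dataclass
from typing import Literal

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass(frozen=True)
class Hop:
    """One traversal step away from the previous node in the pattern."""
    rel_type: str                            # relationship type, e.g. "AUTHORED_BY"
    alias: str                               # variable bound to the target node
    label: str                               # label of the target node
    direction: Literal["out", "in"] = "out"  # arrow direction relative to the previous node


def _ident(name: str) -> str:
    """Allow only plain identifiers, since names are interpolated into Cypher."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def render_pattern(hops: list[Hop], start_alias: str = "node") -> str:
    """Render hops into a Cypher path pattern starting at the vector-search result."""
    parts = [f"({_ident(start_alias)})"]
    for hop in hops:
        rel = f"[:{_ident(hop.rel_type)}]"
        target = f"({_ident(hop.alias)}:{_ident(hop.label)})"
        if hop.direction == "out":
            parts.append(f"-{rel}->{target}")
        elif hop.direction == "in":
            parts.append(f"<-{rel}-{target}")
        else:
            raise ValueError(f"unknown direction: {hop.direction!r}")
    return "".join(parts)


# Reproduces the pattern from the hybrid_search docstring.
pattern = render_pattern([Hop("AUTHORED_BY", "author", "Person")])
```

The rendered string could be passed wherever `traversal_pattern` is expected, which would let both styles coexist: a structured builder for common cases and raw Cypher for the rest.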
Feedback welcome! Please comment with use cases, concerns, or implementation suggestions.