RFC: Native Vector Search Support via FalkorDB HNSW Indexes #124

@JulienDbrt

Summary

Implement native vector similarity search in GraFlo by leveraging FalkorDB's built-in HNSW vector indexes. This enables hybrid queries combining semantic similarity with graph traversal in a single database, eliminating the need for external vector databases (Qdrant, Milvus, Pinecone).

Motivation

Current State: The Two-Database Problem

Most knowledge graph applications requiring semantic search today must maintain two separate systems:

┌─────────────────┐     ┌─────────────────┐
│   Graph DB      │     │   Vector DB     │
│   (Neo4j/       │ ←→  │   (Qdrant/      │
│    FalkorDB)    │sync │    Milvus)      │
└─────────────────┘     └─────────────────┘
        ↓                       ↓
   Graph Queries          Similarity Search
   (relationships)          (embeddings)

Problems with this architecture:

| Issue | Impact |
|---|---|
| Data synchronization | Embeddings can drift out of sync with graph nodes |
| Operational complexity | Two databases to maintain, monitor, and back up |
| Query latency | Network hop between DBs adds ~5-20ms per query |
| Transactional integrity | No atomic operations across both systems |
| Query expressiveness | Cannot combine vector similarity with graph patterns in one query |

Proposed State: Unified Hybrid Search

┌─────────────────────────────────────┐
│           FalkorDB                  │
│  ┌─────────────┐  ┌──────────────┐  │
│  │ Graph Store │  │ Vector Index │  │
│  │  (Cypher)   │←→│   (HNSW)     │  │
│  └─────────────┘  └──────────────┘  │
└─────────────────────────────────────┘
                 ↓
    Hybrid Queries (single roundtrip)

Technical Approach

FalkorDB Vector Capabilities

FalkorDB (v4.0+) supports HNSW vector indexes natively:

// Create vector index
CALL db.idx.vector.createNodeIndex(
  'Document',           // Label
  'embedding',          // Property containing the vector
  'cosine',             // Distance metric: cosine | L2 | IP
  1536                  // Vector dimension (e.g., OpenAI text-embedding-ada-002)
)

// Query similar nodes
CALL db.idx.vector.queryNodes(
  'Document',           // Label
  'embedding',          // Property
  10,                   // Top K results
  $query_vector         // Query embedding
) YIELD node, score
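
As a rough sketch of how a client might assemble the two calls above, the helpers below build the Cypher strings in Python. The helper names (`build_create_index_call`, `build_knn_call`) are hypothetical, and the procedure names and argument order are copied from the snippet above rather than verified against a particular FalkorDB release. Labels and property names are inlined into the query text, so they are validated first; `k` and the vector stay as query parameters.

```python
import re

# Plain identifiers only: labels/properties are inlined into the query text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_METRICS = {"cosine", "L2", "IP"}  # metric names as listed in this RFC


def _ident(name: str) -> str:
    """Reject anything that is not a plain identifier."""
    if not _IDENT.match(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def build_create_index_call(label: str, prop: str, metric: str, dimension: int) -> str:
    """Build the index-creation call, using the argument order shown in the RFC."""
    if metric not in _METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return (
        "CALL db.idx.vector.createNodeIndex("
        f"'{_ident(label)}', '{_ident(prop)}', '{metric}', {dimension})"
    )


def build_knn_call(label: str, prop: str) -> str:
    """Build the top-K query; $k and $query_vector are bound at execution time."""
    return (
        "CALL db.idx.vector.queryNodes("
        f"'{_ident(label)}', '{_ident(prop)}', $k, $query_vector) "
        "YIELD node, score"
    )


print(build_create_index_call("Document", "embedding", "cosine", 1536))
print(build_knn_call("Document", "embedding"))
```

Keeping the vector and `k` as parameters avoids re-parsing the query for every search and keeps large embeddings out of the query text.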

Hybrid Query Examples

Example 1: Semantic search + relationship filtering

Find documents similar to a query, but only from authors in AI organizations:

CALL db.idx.vector.queryNodes('Document', 'embedding', 20, $query_vector)
YIELD node AS doc, score
MATCH (doc)-[:AUTHORED_BY]->(author:Person)-[:WORKS_AT]->(org:Organization)
WHERE org.domain = 'artificial_intelligence'
RETURN doc.title, doc.abstract, author.name, org.name, score
ORDER BY score DESC
LIMIT 10

Example 2: Graph-constrained semantic search

Find the most relevant research papers within 2 hops of a known paper:

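MATCH (seed:Paper {doi: $seed_doi})-[:CITES*1..2]->(candidate:Paper)
WITH collect(DISTINCT candidate) AS candidates
// The index search is global, so over-fetch (K = 100 here) and then keep only traversal candidates
CALL db.idx.vector.queryNodes('Paper', 'embedding', 100, $query_vector)
YIELD node, score
WHERE node IN candidates
RETURN node.title, score
ORDER BY score DESC
LIMIT 5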
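
One subtlety of Example 2 is that the index search is global: the top K is computed over all `Paper` nodes and only then intersected with the traversal result. The toy Python below uses made-up 2-D vectors and brute-force cosine similarity as a stand-in for the HNSW index (it is not how FalkorDB works internally). It illustrates how a small K can drop every candidate while a larger K recovers them.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_then_filter(vectors, query, k, allowed):
    """Global top-K by cosine similarity, then keep only ids in `allowed`."""
    ranked = sorted(vectors, key=lambda pid: cosine(vectors[pid], query), reverse=True)
    return [pid for pid in ranked[:k] if pid in allowed]


# Made-up embeddings: p1 and p2 are closest to the query but are not
# reachable from the seed paper; p3 and p4 are the traversal candidates.
vectors = {
    "p1": [1.0, 0.0],
    "p2": [0.9, 0.1],
    "p3": [0.5, 0.5],
    "p4": [0.0, 1.0],
}
query = [1.0, 0.0]
candidates = {"p3", "p4"}

print(top_k_then_filter(vectors, query, k=2, allowed=candidates))  # []
print(top_k_then_filter(vectors, query, k=4, allowed=candidates))  # ['p3', 'p4']
```

In practice this means K has to be chosen with the selectivity of the graph filter in mind, or the candidates have to be scored directly instead of through the index.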

Example 3: Semantic search + structured filters

Combine text embeddings with structured filters:

CALL db.idx.vector.queryNodes('Product', 'description_embedding', 50, $query_vector)
YIELD node AS product, score
WHERE product.price < 100 AND product.in_stock = true
MATCH (product)-[:IN_CATEGORY]->(cat:Category)
WHERE cat.name IN ['Electronics', 'Gadgets']
RETURN product.name, product.price, score
ORDER BY score DESC
LIMIT 10

Proposed API

New Methods for FalkordbConnection

from typing import Literal

class FalkordbConnection(Connection):

    def create_vector_index(
        self,
        label: str,
        property_name: str,
        dimension: int,
        metric: Literal["cosine", "L2", "IP"] = "cosine"
    ) -> None:
        """Create HNSW vector index on a node label."""
        
    def drop_vector_index(
        self,
        label: str,
        property_name: str
    ) -> None:
        """Drop existing vector index."""
        
    def vector_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        top_k: int = 10,
        filter_query: str | None = None,
        filter_params: dict | None = None
    ) -> list[dict]:
        """
        Perform vector similarity search with optional Cypher filter.
        
        Parameters
        ----------
        label : str
            Node label to search
        property_name : str
            Property containing embeddings
        query_vector : list[float]
            Query embedding vector
        top_k : int
            Number of results to return
        filter_query : str, optional
            Additional Cypher WHERE clause (e.g., "node.year > 2020")
        filter_params : dict, optional
            Parameters for filter_query
            
        Returns
        -------
        list[dict]
            List of {node: dict, score: float} results
        """
        
    def hybrid_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        traversal_pattern: str,
        top_k: int = 10,
        **params
    ) -> list[dict]:
        """
        Combine vector search with graph traversal.
        
        Parameters
        ----------
        traversal_pattern : str
            Cypher pattern starting from vector results, e.g.:
            "(node)-[:AUTHORED_BY]->(author:Person)"
        """

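To make the intended behavior of `vector_search` and `hybrid_search` more concrete, here is one possible way the methods could assemble their Cypher. It is only a sketch: `compose_vector_search` and `compose_hybrid_search` are hypothetical helper names, returning a `(query, params)` pair is an assumed convention, and the `ORDER BY score DESC` ordering simply follows the examples earlier in this RFC. The filter runs after the top-K lookup, so fewer than `top_k` rows may come back.

```python
def compose_vector_search(label, property_name, query_vector, top_k=10,
                          filter_query=None, filter_params=None):
    """Return (cypher, params) for a top-K search with an optional WHERE filter.

    `label` and `property_name` are inlined as-is and assumed to be trusted.
    """
    cypher = (
        f"CALL db.idx.vector.queryNodes('{label}', '{property_name}', "
        "$top_k, $query_vector) YIELD node, score"
    )
    if filter_query:
        cypher += f" WHERE {filter_query}"
    cypher += " RETURN node, score ORDER BY score DESC"
    params = {"top_k": top_k, "query_vector": query_vector}
    params.update(filter_params or {})
    return cypher, params


def compose_hybrid_search(label, property_name, query_vector,
                          traversal_pattern, top_k=10, **params):
    """Return (cypher, params) that expands each vector hit along a pattern."""
    cypher = (
        f"CALL db.idx.vector.queryNodes('{label}', '{property_name}', "
        "$top_k, $query_vector) YIELD node, score "
        f"MATCH {traversal_pattern} "
        "RETURN node, score ORDER BY score DESC"
    )
    return cypher, {"top_k": top_k, "query_vector": query_vector, **params}


query, params = compose_vector_search(
    "Document", "embedding", [0.1, 0.2], top_k=5,
    filter_query="node.year > $min_year", filter_params={"min_year": 2020},
)
print(query)
print(params)
```

Because `filter_query` and `traversal_pattern` are spliced in as raw strings, this design leaves injection safety to the caller, which is part of what the first open question is about.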
Configuration Extension

class FalkordbConfig(DBConfig):
    # Existing fields...
    
    # New vector search defaults
    vector_index_metric: Literal["cosine", "L2", "IP"] = "cosine"
    vector_search_default_k: int = 10

Implementation Plan

| Phase | Task | Complexity |
|---|---|---|
| 1 | Add create_vector_index / drop_vector_index | Low |
| 2 | Add vector_search with basic filtering | Medium |
| 3 | Add hybrid_search with traversal patterns | Medium |
| 4 | Add batch embedding upsert optimization | Low |
| 5 | Documentation and examples | Low |

Benchmark Considerations

Before production adoption, benchmarks should compare:

| Metric | FalkorDB Native | External Vector DB |
|---|---|---|
| Query latency (p50, p99) | TBD | TBD |
| Indexing throughput | TBD | TBD |
| Memory usage | TBD | TBD |
| Max vector dimension | TBD | TBD |
| Max vectors per index | TBD | TBD |
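
The query latency row could be filled in with a small harness along these lines. This is a sketch under stated assumptions: `measure_latency` and `percentile` are hypothetical helpers, `run_query` stands for whatever callable issues one search against the system under test, and nearest-rank is only one of several reasonable percentile definitions.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of `samples` (pct in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(run_query: Callable[[], object], runs: int = 100, warmup: int = 10) -> dict:
    """Time `run_query` and report p50/p99 in milliseconds."""
    for _ in range(warmup):
        run_query()  # warm caches and connections before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"p50_ms": percentile(samples, 50), "p99_ms": percentile(samples, 99)}


# Stand-in workload; replace with a real FalkorDB or external vector DB call.
stats = measure_latency(lambda: sum(range(1000)), runs=50, warmup=5)
print(stats)
```

Running the same harness with identical query vectors and K against both setups keeps the comparison like-for-like; for the external vector DB, the timed callable should include the follow-up graph lookup.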

Alternatives Considered

| Alternative | Pros | Cons |
|---|---|---|
| External Qdrant/Milvus | Mature, highly optimized | Sync complexity, no hybrid queries |
| Neo4j + GDS | Native integration | Requires GDS license, different API |
| pgvector (PostgreSQL) | SQL familiarity | No native graph traversal |

Open Questions

  1. Should hybrid_search support arbitrary Cypher templates or use a structured API?
  2. How to handle embedding updates (re-index vs incremental)?
  3. Should we expose HNSW tuning parameters (M, efConstruction)?
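
For question 1, a structured alternative to raw Cypher templates might look like the sketch below. The `Hop` dataclass and `render_pattern` function are hypothetical names invented for illustration, and only outgoing single-hop relationships are covered. The point is that relationship types, labels, and aliases can be validated one by one before being rendered into a pattern.

```python
import re
from dataclasses import dataclass

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


@dataclass(frozen=True)
class Hop:
    rel_type: str  # relationship type, e.g. "AUTHORED_BY"
    label: str     # label of the node reached by this hop
    alias: str     # variable name bound to that node


def render_pattern(hops: list[Hop], start: str = "node") -> str:
    """Render validated hops into an outgoing Cypher path pattern."""
    parts = [f"({start})"]
    for hop in hops:
        for value in (hop.rel_type, hop.label, hop.alias):
            if not _IDENT.match(value):
                raise ValueError(f"invalid identifier: {value!r}")
        parts.append(f"-[:{hop.rel_type}]->({hop.alias}:{hop.label})")
    return "".join(parts)


pattern = render_pattern([
    Hop("AUTHORED_BY", "Person", "author"),
    Hop("WORKS_AT", "Organization", "org"),
])
print(pattern)
# (node)-[:AUTHORED_BY]->(author:Person)-[:WORKS_AT]->(org:Organization)
```

The trade-off is expressiveness: a structured form like this is easier to validate, while arbitrary templates cover variable-length paths and filters that this sketch does not model.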

Feedback welcome! Please comment with use cases, concerns, or implementation suggestions.
