# RFC: Native Vector Search Support via FalkorDB HNSW Indexes

## Summary

Implement native vector similarity search in GraFlo by leveraging FalkorDB's built-in HNSW vector indexes. This enables **hybrid queries** combining semantic similarity with graph traversal in a single database, eliminating the need for external vector databases (Qdrant, Milvus, Pinecone).

## Motivation

### Current State: The Two-Database Problem

Most knowledge graph applications requiring semantic search today must maintain two separate systems:

```
┌─────────────────┐     ┌─────────────────┐
│   Graph DB      │     │   Vector DB     │
│   (Neo4j/       │ ←→  │   (Qdrant/      │
│    FalkorDB)    │sync │    Milvus)      │
└─────────────────┘     └─────────────────┘
        ↓                       ↓
   Graph Queries          Similarity Search
   (relationships)          (embeddings)
```

**Problems with this architecture:**

| Issue | Impact |
|-------|--------|
| Data synchronization | Embeddings can drift out of sync with graph nodes |
| Operational complexity | Two databases to maintain, monitor, and back up |
| Query latency | Network hop between DBs adds ~5-20 ms per query |
| Transactional integrity | No atomic operations across both systems |
| Query expressiveness | Cannot combine vector similarity with graph patterns in one query |

### Proposed State: Unified Hybrid Search

```
┌─────────────────────────────────────┐
│           FalkorDB                  │
│  ┌─────────────┐  ┌──────────────┐  │
│  │ Graph Store │  │ Vector Index │  │
│  │  (Cypher)   │←→│   (HNSW)     │  │
│  └─────────────┘  └──────────────┘  │
└─────────────────────────────────────┘
                 ↓
    Hybrid Queries (single roundtrip)
```

## Technical Approach

### FalkorDB Vector Capabilities

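FalkorDB (v4.0+) supports HNSW vector indexes natively. Per the vector index documentation linked in References, an index is created with a `CREATE VECTOR INDEX` statement and queried through the `db.idx.vector.queryNodes` procedure, with the query vector wrapped in `vecf32()`:

```cypher
// Create a vector index
CREATE VECTOR INDEX FOR (d:Document) ON (d.embedding)
OPTIONS {
  dimension: 1536,               // Vector dimension (e.g., OpenAI text-embedding-ada-002)
  similarityFunction: 'cosine'   // Distance metric: cosine | euclidean
}

// Query similar nodes
CALL db.idx.vector.queryNodes(
  'Document',              // Label
  'embedding',             // Property
  10,                      // Top K results
  vecf32($query_vector)    // Query embedding
) YIELD node, score
```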
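HNSW is an approximate index, so `db.idx.vector.queryNodes` may occasionally miss a true nearest neighbour. As a mental model (and as ground truth for recall checks on small samples), the exact result it approximates can be sketched in plain Python. This is an illustrative brute-force search, not FalkorDB code; `cosine_similarity` and `exact_top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(
    vectors: dict[str, list[float]], query: list[float], k: int
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(key, cosine_similarity(vec, query)) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(exact_top_k(docs, [1.0, 0.0], k=2))
```

Brute force is linear in the number of vectors, which is the cost the HNSW index avoids by trading a small amount of recall.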

### Hybrid Query Examples

**Example 1: Semantic search + relationship filtering**

Find documents similar to a query, but only from authors in AI organizations:

```cypher
CALL db.idx.vector.queryNodes('Document', 'embedding', 20, vecf32($query_vector))
YIELD node AS doc, score
MATCH (doc)-[:AUTHORED_BY]->(author:Person)-[:WORKS_AT]->(org:Organization)
WHERE org.domain = 'artificial_intelligence'
RETURN doc.title, doc.abstract, author.name, org.name, score
ORDER BY score DESC
LIMIT 10
```

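**Example 2: Graph-constrained semantic search**

Find the most relevant research papers within 2 hops of a known paper. The vector index always returns the global top K, so the query collects the citation neighborhood first, over-fetches from the index once, and keeps only the hits that fall inside that neighborhood:

```cypher
MATCH (seed:Paper {doi: $seed_doi})-[:CITES*1..2]->(candidate:Paper)
WITH collect(DISTINCT candidate) AS candidates
// Over-fetch: K must be large enough to reach into the neighborhood
CALL db.idx.vector.queryNodes('Paper', 'embedding', 100, vecf32($query_vector))
YIELD node, score
WHERE node IN candidates
RETURN node.title, score
```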

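**Example 3: Vector search with structured filters**

Combine text embeddings with structured filters. Because the filters run after the index has returned its top K, the query over-fetches (K = 50) to leave enough candidates for the final `LIMIT 10`: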

```cypher
CALL db.idx.vector.queryNodes('Product', 'description_embedding', 50, vecf32($query_vector))
YIELD node AS product, score
WHERE product.price < 100 AND product.in_stock = true
MATCH (product)-[:IN_CATEGORY]->(cat:Category)
WHERE cat.name IN ['Electronics', 'Gadgets']
RETURN product.name, product.price, score
ORDER BY score DESC
LIMIT 10
```
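Filtering after the index lookup means a query can return fewer rows than requested when many of the top K hits fail the filter. One way a client might handle this is to retry with a larger K until enough hits survive. The sketch below assumes the search is injected as a callable taking K; `search_with_post_filter` and its parameters are hypothetical names, not part of GraFlo or FalkorDB.

```python
from typing import Callable, Optional


def search_with_post_filter(
    search: Callable[[int], list[dict]],
    keep: Callable[[dict], bool],
    want: int,
    initial_k: Optional[int] = None,
    max_k: int = 1000,
) -> list[dict]:
    """Double K until `want` hits pass `keep`, the index runs dry, or K hits max_k."""
    k = initial_k or want
    while True:
        hits = search(k)
        kept = [hit for hit in hits if keep(hit)]
        index_exhausted = len(hits) < k
        if len(kept) >= want or index_exhausted or k >= max_k:
            return kept[:want]
        k = min(k * 2, max_k)


if __name__ == "__main__":
    # Fake index: 100 products already ranked by similarity.
    catalog = [{"name": f"p{i}", "in_stock": i % 2 == 0} for i in range(100)]
    top = search_with_post_filter(
        search=lambda k: catalog[:k],
        keep=lambda hit: hit["in_stock"],
        want=10,
    )
    print([hit["name"] for hit in top])
```

Each retry is a fresh index query, so a generous `initial_k` (as in Example 3) usually costs less than several round trips.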

## Proposed API

### New Methods for `FalkordbConnection`

```python
from typing import Literal


class FalkordbConnection(Connection):
    
    def create_vector_index(
        self,
        label: str,
        property_name: str,
        dimension: int,
        metric: Literal["cosine", "euclidean"] = "cosine"  # FalkorDB similarityFunction values
    ) -> None:
        """Create HNSW vector index on a node label."""
        
    def drop_vector_index(
        self,
        label: str,
        property_name: str
    ) -> None:
        """Drop existing vector index."""
        
    def vector_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        top_k: int = 10,
        filter_query: str | None = None,
        filter_params: dict | None = None
    ) -> list[dict]:
        """
        Perform vector similarity search with optional Cypher filter.
        
        Parameters
        ----------
        label : str
            Node label to search
        property_name : str
            Property containing embeddings
        query_vector : list[float]
            Query embedding vector
        top_k : int
            Number of results to return
        filter_query : str, optional
            Cypher predicate applied to the yielded rows, written without
            the WHERE keyword (e.g., "node.year > $min_year")
        filter_params : dict, optional
            Parameters referenced by filter_query (e.g., {"min_year": 2020})
            
        Returns
        -------
        list[dict]
            List of {node: dict, score: float} results
        """
        
    def hybrid_search(
        self,
        label: str,
        property_name: str,
        query_vector: list[float],
        traversal_pattern: str,
        top_k: int = 10,
        **params
    ) -> list[dict]:
        """
        Combine vector search with graph traversal.
        
        Parameters
        ----------
        traversal_pattern : str
            Cypher pattern starting from vector results, e.g.:
            "(node)-[:AUTHORED_BY]->(author:Person)"
        """
```
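To make the intended behaviour of `vector_search` concrete, here is a minimal sketch of how the method might assemble its Cypher text and parameters. It is an assumption-laden illustration rather than the GraFlo implementation: `build_vector_search_query` is a hypothetical helper, the identifier check is one possible guard against injection through `label` and `property_name`, and wrapping the parameter in `vecf32()` follows the FalkorDB documentation linked in References.

```python
import re
from typing import Optional

# Labels and property names are spliced into the query text, so restrict them.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_vector_search_query(
    label: str,
    property_name: str,
    query_vector: list[float],
    top_k: int = 10,
    filter_query: Optional[str] = None,
    filter_params: Optional[dict] = None,
) -> tuple[str, dict]:
    """Return (cypher, params) for a top-k vector search with an optional predicate."""
    for name in (label, property_name):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")

    lines = [
        f"CALL db.idx.vector.queryNodes('{label}', '{property_name}', "
        f"{int(top_k)}, vecf32($query_vector))",
        "YIELD node, score",
    ]
    if filter_query:
        # The predicate is caller-supplied Cypher; values belong in filter_params.
        lines.append(f"WHERE {filter_query}")
    lines.append("RETURN node, score")

    params = {**(filter_params or {}), "query_vector": query_vector}
    return "\n".join(lines), params


if __name__ == "__main__":
    cypher, params = build_vector_search_query(
        "Document", "embedding", [0.1, 0.2], top_k=5,
        filter_query="node.year > $min_year", filter_params={"min_year": 2020},
    )
    print(cypher)
    print(params)
```

Note that `filter_query` remains raw Cypher in this sketch, so it should only ever come from trusted code, which is part of what the first open question below is about.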

### Configuration Extension

```python
class FalkordbConfig(DBConfig):
    # Existing fields...
    
    # New vector search defaults
    vector_index_metric: Literal["cosine", "euclidean"] = "cosine"
    vector_search_default_k: int = 10
```

## Implementation Plan

| Phase | Task | Complexity |
|-------|------|------------|
| 1 | Add `create_vector_index` / `drop_vector_index` | Low |
| 2 | Add `vector_search` with basic filtering | Medium |
| 3 | Add `hybrid_search` with traversal patterns | Medium |
| 4 | Add batch embedding upsert optimization | Low |
| 5 | Documentation and examples | Low |
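
For Phase 4, one plausible shape for the batch upsert is to send embeddings in fixed-size chunks through a single `UNWIND` statement per chunk instead of one query per node. The sketch below only builds the statements; `chunked`, `build_upsert_statements`, the `id` key property, and the batch size are all illustrative assumptions, and `label`, `key`, and `prop` are expected to be trusted identifiers.

```python
from typing import Iterable, Iterator


def chunked(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    batch: list[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_upsert_statements(
    label: str, key: str, prop: str, rows: Iterable[dict], batch_size: int = 500
) -> list[tuple[str, dict]]:
    """Return one (cypher, params) pair per batch of {key, 'embedding'} rows."""
    query = (
        "UNWIND $rows AS row\n"
        f"MERGE (n:{label} {{{key}: row.{key}}})\n"
        f"SET n.{prop} = vecf32(row.embedding)"
    )
    return [(query, {"rows": batch}) for batch in chunked(rows, batch_size)]


if __name__ == "__main__":
    rows = [{"id": i, "embedding": [float(i), 0.0]} for i in range(5)]
    for cypher, params in build_upsert_statements("Document", "id", "embedding", rows, 2):
        print(len(params["rows"]), "rows ->", cypher.splitlines()[1])
```

Using `MERGE` creates missing nodes as well as updating existing ones; swapping it for `MATCH` would restrict the operation to nodes already in the graph.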

## Benchmark Considerations

Before production adoption, benchmarks should compare:

| Metric | FalkorDB Native | External Vector DB |
|--------|-----------------|-------------------|
| Query latency (p50, p99) | TBD | TBD |
| Indexing throughput | TBD | TBD |
| Memory usage | TBD | TBD |
| Max vector dimension | TBD | TBD |
| Max vectors per index | TBD | TBD |
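
The latency row could be filled in with a small harness along these lines. It is a sketch under stated assumptions: `run_query` stands in for whatever callable issues one search against the system under test, the nearest-rank percentile is one of several common definitions, and the repeat and warm-up counts are arbitrary.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(
    run_query: Callable[[], object], repeats: int = 100, warmup: int = 5
) -> dict[str, float]:
    """Time `run_query` repeatedly and report p50 and p99 in milliseconds."""
    for _ in range(warmup):
        run_query()  # Warm caches and connections before timing.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"p50": percentile(samples, 50), "p99": percentile(samples, 99)}


if __name__ == "__main__":
    print(measure_latency_ms(lambda: sum(range(10_000)), repeats=50))
```

Running the same harness against both setups, with identical data and query vectors, keeps the two columns of the table comparable.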

## Alternatives Considered

| Alternative | Pros | Cons |
|-------------|------|------|
| External Qdrant/Milvus | Mature, highly optimized | Sync complexity, no hybrid queries |
| Neo4j + GDS | Native integration | Requires GDS license, different API |
| pgvector (PostgreSQL) | SQL familiarity | No native graph traversal |

## Open Questions

1. Should `hybrid_search` support arbitrary Cypher templates or use a structured API?
2. How should embedding updates be handled (full re-index vs. incremental update)?
3. Should we expose HNSW tuning parameters (M, efConstruction)?
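
To ground the first question, a structured alternative to free-form Cypher templates could describe each traversal step as data and render the pattern from it. The sketch below is purely illustrative: `Hop` and `render_pattern` are hypothetical names, and the identifier check is one way such an API could rule out injected Cypher.

```python
import re
from dataclasses import dataclass

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


@dataclass(frozen=True)
class Hop:
    """One relationship step away from the previous node in the pattern."""
    rel_type: str
    target_alias: str
    target_label: str
    outgoing: bool = True


def render_pattern(hops: list[Hop], start_alias: str = "node") -> str:
    """Render hops as a Cypher pattern that starts at the vector-search result."""
    parts = [f"({start_alias})"]
    for hop in hops:
        for name in (hop.rel_type, hop.target_alias, hop.target_label):
            if not _NAME.fullmatch(name):
                raise ValueError(f"unsafe identifier: {name!r}")
        if hop.outgoing:
            rel = f"-[:{hop.rel_type}]->"
        else:
            rel = f"<-[:{hop.rel_type}]-"
        parts.append(f"{rel}({hop.target_alias}:{hop.target_label})")
    return "".join(parts)


if __name__ == "__main__":
    print(render_pattern([
        Hop("AUTHORED_BY", "author", "Person"),
        Hop("WORKS_AT", "org", "Organization"),
    ]))
```

A structured form is less expressive than raw templates (no variable-length paths or inline predicates in this sketch), which is the trade-off the question asks reviewers to weigh.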

## References

- [FalkorDB Vector Similarity Documentation](https://docs.falkordb.com/cypher/indexing/vector-index.html)
- [HNSW Algorithm Paper](https://arxiv.org/abs/1603.09320)
- [GraFlo FalkorDB Connector PR](https://github.com/growgraph/graflo/pull/123)

---

**Feedback welcome!** Please comment with use cases, concerns, or implementation suggestions.

