HetaRAG is a hybrid, deep-retrieval RAG framework that unifies multiple heterogeneous data stores—vector indices, knowledge graphs, full-text search engines, and relational databases. The knowledge base built on this heterogeneous database enables deep-search question answering within RAG and supports the generation of in-depth research reports. The code currently open-sourced comprises early-stage integrations of exploratory RAG components from our preliminary research; we will continue refining the system design and releasing further code in the future.
- 2025-09-29: Our paper is available on arXiv 📄!
- 2025-09-03: The code is now released!
- 2025-09-03: The project quick guide is now live here 🔗!
- Document Parsing: Supports multiple document parsing backends, including MinerU and Docling, for handling complex layouts and multi-modal content.
- Knowledge Graph Integration: Automatically extracts entities and relations to build a knowledge graph (HiRAG or LeanRAG).
- Flexible Database Support: Integrates with various databases for different needs:
  - Vector Stores: Milvus
  - Search Engines: Elasticsearch
  - Graph Databases: Neo4j
  - Relational Databases: MySQL
- DeepRetrieval: Supports multiple retrieval paradigms, including Hybrid Retrieval (combining vector search and keyword search), Query Rewrite, Rerank, and DeepSearch modules.
- DeepWriter: A multimodal report generation module that produces fact-grounded, query-driven reports with fine-grained citations from unstructured documents.
- Heads-up: Deep-fusion retrieval across heterogeneous stores is pending merge.
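To make the Hybrid Retrieval item above concrete, here is a minimal sketch of one common way to merge a vector-search ranking with a keyword-search ranking: reciprocal rank fusion (RRF). This is an illustrative assumption, not necessarily the fusion strategy HetaRAG implements; the function name, the document IDs, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is a conventional default, not a HetaRAG setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results: IDs returned by a vector search and a keyword search.
vector_hits = ["doc3", "doc1", "doc7"]
keyword_hits = ["doc1", "doc9", "doc3"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept, which is the main appeal of rank-based fusion when the two retrievers produce scores on incomparable scales.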
Configure the following four database services using Docker:
- Elasticsearch: For full-text search and document indexing.
- Milvus: For vector similarity search.
- Neo4j: For graph storage and queries.
- MySQL: For relational data storage.
These databases can be installed with a single command via Docker. For detailed installation instructions, please refer to the README.
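Once the containers are running, a quick reachability check can save debugging time later. The sketch below uses only the Python standard library and assumes each service listens on its usual default port on `localhost`; the port mapping in your Docker setup may differ, and the helper names are hypothetical rather than part of HetaRAG.

```python
import socket

# Usual default ports; adjust if your Docker setup maps them differently.
DEFAULT_PORTS = {
    "Elasticsearch": 9200,
    "Milvus": 19530,
    "Neo4j": 7687,  # Bolt protocol port
    "MySQL": 3306,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepts TCP connections."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'unreachable'}")
```

An open port only shows that something is listening; it does not confirm that the service has finished starting or that credentials are correct.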
- Python 3.10+
- Conda for environment management
- Clone the repository:

      git clone https://github.com/your-github-username/hrag.git
      cd hrag

- Create a virtual environment:

      # Upgrade pip and install uv
      pip install --upgrade pip
      pip install uv

      # Create and activate a virtual environment using uv
      uv venv h-rag --python=3.10
      source h-rag/bin/activate  # For Unix/macOS
      h-rag\Scripts\activate     # For Windows

      # Alternatively, you can use conda to create and activate the environment
      conda create -n h-rag python=3.10
      conda activate h-rag

- Install the required dependencies:

      uv pip install -e .
For usage details, please refer to the documentation on Read the Docs.
This project is licensed under the MIT License. See the LICENSE file for details.
We utilized the following repos during development:
If you find our paper and code useful, please cite us using the following BibTeX entry:
@misc{yan2025hetaraghybriddeepretrievalaugmented,
title={HetaRAG: Hybrid Deep Retrieval-Augmented Generation across Heterogeneous Data Stores},
author={Guohang Yan and Yue Zhang and Pinlong Cai and Ding Wang and Song Mao and Hongwei Zhang and Yaoze Zhang and Hairong Zhang and Xinyu Cai and Botian Shi},
year={2025},
eprint={2509.21336},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2509.21336},
}