RAGGen - RAG Dataset Generator

A universal tool for converting files into high-quality RAG datasets.

Features

Supports a variety of commonly used formats for storing textual data.
Robust PDF parsing with OCR, using the marker library.
Preserves headers and can fix header levels using an LLM (OpenAI API).
Keeps tables intact rather than splitting them across chunks.
Supports embedding metadata directly into chunk text.
Supports adding custom metadata for each input.
Multiple output formats, including pandas DataFrames and Langchain documents.
Checksum-based result caching.
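
To illustrate two of the features above (keeping tables whole and embedding metadata into the chunk text), here is a minimal sketch. It is not RAGGen's actual implementation: the function names `split_blocks`, `is_table` and `chunk_markdown`, the `max_chars` parameter, and the `key: value` metadata prefix are all assumptions made for this example.

```python
def split_blocks(markdown):
    """Split Markdown text into blocks separated by blank lines."""
    blocks, current = [], []
    for line in markdown.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def is_table(block):
    """Treat a block as a table if every line starts with a pipe."""
    return all(line.lstrip().startswith("|") for line in block.splitlines())


def chunk_markdown(markdown, max_chars=500, metadata=None):
    """Group blocks into chunks of roughly max_chars characters.

    Tables are never cut, even if they exceed max_chars (hypothetical policy).
    Metadata, if given, is prepended to every chunk as "key: value" lines.
    """
    header = ""
    if metadata:
        header = "\n".join(f"{k}: {v}" for k, v in metadata.items()) + "\n\n"

    chunks, current = [], ""
    for block in split_blocks(markdown):
        if not is_table(block) and len(block) > max_chars:
            # Oversized plain text is cut into fixed-size pieces.
            pieces = [block[i:i + max_chars] for i in range(0, len(block), max_chars)]
        else:
            # Tables and short blocks stay in one piece.
            pieces = [block]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if current and len(candidate) > max_chars:
                chunks.append(header + current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(header + current)
    return chunks
```

Calling `chunk_markdown(text, max_chars=30, metadata={"title": "Doc title"})` on a document containing a four-row table yields chunks that each start with `title: Doc title`, with the table kept in a single chunk even though it is longer than 30 characters.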

Supported formats

PDF (via marker).
Word (via mammoth).
HTML (via html2text).
Markdown.

Installation

pip install raggen

Usage

from raggen import RAGGen, RAGInput

# Initialize RAGGen
gen = RAGGen(cache_dir="cache")

# Define inputs
inputs = ["sample1.pdf", "sample2.html", "sample3.md"]

# Input with custom metadata
inputs.append(RAGInput(
    path = "sample4.docx",
    metadata = {"title": "Doc title"}
))

# Generate RAG dataset as a pandas DataFrame
data = gen(inputs, output_format="df", flatten=True)
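
The `cache_dir` argument above relates to the checksum-based result caching listed under Features. The sketch below shows one way such a cache could work; it is an illustration under stated assumptions, not RAGGen's internal code. The helpers `file_checksum` and `cached_convert`, the SHA-256 choice, and the JSON cache layout are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, algo="sha256"):
    """Hash the file contents in blocks so large files are not loaded at once."""
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def cached_convert(path, convert, cache_dir="cache"):
    """Return convert(path), reusing a stored result when the file is unchanged.

    The result must be JSON-serializable (an assumption of this sketch).
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    entry = cache / f"{file_checksum(path)}.json"
    if entry.exists():
        # Same checksum means same content, so the stored result is still valid.
        return json.loads(entry.read_text(encoding="utf-8"))
    result = convert(path)
    entry.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the cache key is derived from the file contents rather than the file name, renaming a file does not invalidate its entry, while any edit to the contents does.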

TODO

Contribution

Feel free to fork this repo and make pull requests.

If you like my work, please support me:

BTC: 32F3zAnQQGwZzsG7R35rPUS269Xz11cZ8B

License

Free to use under Apache-2.0. See LICENSE for more information.
