File to LLM

Go library that converts files in multiple formats into text that LLMs can understand.

File2LLM is specifically designed to work with LLMs. Unlike other Golang solutions, it preserves text location, padding, and formatting, adding structural boundaries that are understandable by LLMs. It also performs additional processing to ensure that the extracted text is properly interpretable by LLMs.

File2LLM can handle nested file formats (such as archives) by recursively reading them and creating structured file information suitable for LLM input.

It is optimized with custom cgo code and assembly.

Example

Install the main file2llm library:

go get -u github.com/opengs/file2llm

Optionally, install the system dependencies needed for PDF parsing and image OCR:

sudo apt install -y libpoppler-glib-dev libcairo2 libcairo2-dev libtesseract-dev

This example extracts text from a PDF, including text found in embedded images:

package main

import (
	"context"
	"os"

	"github.com/opengs/file2llm/ocr"
	"github.com/opengs/file2llm/parser"
)

func main() {
	fp, err := os.Open("file.pdf")
	if err != nil {
		panic(err.Error())
	}
	defer fp.Close()

	// Initialize OCR to be able to extract text from images
	ocrProvider := ocr.NewTesseractProvider(ocr.DefaultTesseractConfig())
	if err := ocrProvider.Init(); err != nil {
		panic(err.Error())
	}
	defer ocrProvider.Destroy()

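	// Create a parser that uses the OCR provider, then parse the opened file
	p := parser.New(ocrProvider)
	result := p.Parse(context.Background(), fp)
	// The builtin println writes to standard error
	println(result.String())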
}

Run the code with build tags to enable the corresponding file2llm features:

go run -tags=file2llm_feature_tesseract,file2llm_feature_pdf main.go

Features

Format	CGO	Build tags	Requires OCR	Required libraries	Notes
png	NO		YES
jpeg	NO		YES
webp	NO		YES
gif	NO		YES		Extracts first frame
bmp	NO		YES
tiff	NO		YES
pdf	YES	file2llm_feature_pdf	optional	poppler-utils libpoppler-dev libpoppler-glib-dev libcairo2 libcairo2-dev	Extracts text from embedded images using OCR if available

OCR Provider	CGO	Required tags	Required libraries
Tesseract	YES	file2llm_feature_tesseract	tesseract libtesseract-dev
Tesseract Server	NO
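The gif row in the format table notes that only the first frame is extracted. The standard library's `gif.Decode` behaves the same way, which makes it a convenient way to illustrate the idea. This is a sketch, not necessarily how file2llm decodes GIFs; `firstFrame` and `twoFrameGIF` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/gif"
)

// firstFrame returns only the first frame of a GIF; gif.Decode stops
// after the first embedded image, so later frames are never decoded.
func firstFrame(data []byte) (image.Image, error) {
	return gif.Decode(bytes.NewReader(data))
}

// twoFrameGIF builds a small animated GIF in memory:
// frame 0 is solid red, frame 1 is solid blue.
func twoFrameGIF() ([]byte, error) {
	pal := color.Palette{
		color.RGBA{R: 255, A: 255}, // index 0: red
		color.RGBA{B: 255, A: 255}, // index 1: blue
	}
	rect := image.Rect(0, 0, 4, 4)
	red := image.NewPaletted(rect, pal) // pixels default to index 0
	blue := image.NewPaletted(rect, pal)
	for i := range blue.Pix {
		blue.Pix[i] = 1
	}
	var buf bytes.Buffer
	err := gif.EncodeAll(&buf, &gif.GIF{
		Image: []*image.Paletted{red, blue},
		Delay: []int{10, 10}, // one delay per frame is required
	})
	return buf.Bytes(), err
}

func main() {
	data, err := twoFrameGIF()
	if err != nil {
		panic(err)
	}
	img, err := firstFrame(data)
	if err != nil {
		panic(err)
	}
	r, g, b, _ := img.At(0, 0).RGBA()
	fmt.Println("bounds:", img.Bounds(), "top-left is red:", r == 0xffff && g == 0 && b == 0)
}
```

The resulting `image.Image` is what would then be handed to an OCR provider; text that appears only in later frames of an animation would be missed under this approach.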

License

AGPL-3.0. A commercial license is in progress.
