File2LLM is designed specifically for use with LLMs. Unlike other Go solutions, it preserves text location, padding, and formatting, and adds structural boundaries that LLMs can understand. It also performs additional processing so that LLMs can interpret the extracted text correctly.
File2LLM can handle nested file formats (such as archives) by recursively reading them and creating structured file information suitable for LLM input.
File2LLM is optimized with custom cgo and assembly code.
Get the main file2llm library:

```bash
go get -u github.com/opengs/file2llm
```

Install the dependencies needed to work with PDFs and images (OCR). This step is optional.

```bash
sudo apt install -y libpoppler-glib-dev libcairo2 libcairo2-dev libtesseract-dev
```

The following example extracts text from a PDF, including text in images:
```go
package main

import (
	"context"
	"os"

	"github.com/opengs/file2llm/ocr"
	"github.com/opengs/file2llm/parser"
)

func main() {
	fp, err := os.Open("file.pdf")
	if err != nil {
		panic(err.Error())
	}
	defer fp.Close()

	// Initialize OCR to be able to extract text from images
	ocrProvider := ocr.NewTesseractProvider(ocr.DefaultTesseractConfig())
	if err := ocrProvider.Init(); err != nil {
		panic(err.Error())
	}
	defer ocrProvider.Destroy()

	// Parse the file and print the extracted text
	p := parser.New(ocrProvider)
	result := p.Parse(context.Background(), fp)
	println(result.String())
}
```

Run the code with build tags to enable features from the file2llm library:
```bash
go run -tags=file2llm_feature_tesseract,file2llm_feature_pdf main.go
```

| Format | CGO | Build tags | Requires OCR | Required libraries | Notes |
|---|---|---|---|---|---|
| png | NO | | YES | | |
| jpeg | NO | | YES | | |
| webp | NO | | YES | | |
| gif | NO | | YES | | Extracts first frame |
| bmp | NO | | YES | | |
| tiff | NO | | YES | | |
| pdf | YES | file2llm_feature_pdf | optional | poppler-utils libpoppler-dev libpoppler-glib-dev libcairo2 libcairo2-dev | Extracts text from embedded images using OCR if available |

| OCR Provider | CGO | Required tags | Required libraries |
|---|---|---|---|
| Tesseract | YES | file2llm_feature_tesseract | tesseract libtesseract-dev |
| Tesseract Server | NO | | |

Licensed under AGPL-3.0. A commercial license is in progress.