file2llm

Version: v0.0.18 (latest) | Published: Jun 15, 2025 | License: AGPL-3.0

Repository

github.com/opengs/file2llm

README

File to LLM

A Go library that converts files in multiple formats into text that LLMs can understand.

File2LLM is specifically designed to work with LLMs. Unlike other Golang solutions, it preserves text location, padding, and formatting, adding structural boundaries that are understandable by LLMs. It also performs additional processing to ensure that the extracted text is properly interpretable by LLMs.

File2LLM can handle nested file formats (such as archives) by recursively reading them and creating structured file information suitable for LLM input.
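The recursive handling of nested formats can be illustrated with a minimal sketch that uses only the Go standard library. It is not file2llm's implementation: the `render` and `makeZip` helpers, the `<archive>`/`<file>` boundary markers, and the ZIP-only detection by magic bytes are all assumptions made for illustration, since the README does not show the library's actual output format.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// entry is one file to place in an in-memory ZIP archive (hypothetical helper type).
type entry struct {
	name string
	data []byte
}

// zipMagic is the signature that starts a non-empty ZIP archive.
var zipMagic = []byte("PK\x03\x04")

// makeZip builds a ZIP archive in memory from the given entries, in order.
func makeZip(entries []entry) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, e := range entries {
		w, err := zw.Create(e.name)
		if err != nil {
			return nil, err
		}
		if _, err := w.Write(e.data); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// render writes data to out wrapped in boundary markers. ZIP archives are
// opened and every entry is rendered recursively, one indent level deeper.
// The marker format is an assumption, not file2llm's real output.
func render(out *strings.Builder, name string, data []byte, depth int) error {
	pad := strings.Repeat("  ", depth)
	if !bytes.HasPrefix(data, zipMagic) {
		// Leaf: anything that is not a ZIP archive is treated as plain text here.
		fmt.Fprintf(out, "%s<file name=%q>%s</file>\n", pad, name, data)
		return nil
	}
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return err
	}
	fmt.Fprintf(out, "%s<archive name=%q>\n", pad, name)
	for _, f := range zr.File {
		rc, err := f.Open()
		if err != nil {
			return err
		}
		inner, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return err
		}
		if err := render(out, f.Name, inner, depth+1); err != nil {
			return err
		}
	}
	fmt.Fprintf(out, "%s</archive>\n", pad)
	return nil
}

// demo renders an archive that contains a text file and a nested archive.
func demo() (string, error) {
	inner, err := makeZip([]entry{{"b.txt", []byte("beta")}})
	if err != nil {
		return "", err
	}
	outer, err := makeZip([]entry{{"a.txt", []byte("alpha")}, {"inner.zip", inner}})
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := render(&out, "outer.zip", outer, 0); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	s, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Print(s)
}
```

Running the sketch prints an `<archive name="outer.zip">` block that contains `a.txt` and a nested `<archive name="inner.zip">` block with `b.txt`, so the nesting of the input is visible in the text handed to the model.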

It is optimized with custom cgo and assembly code.

Example

Get the main file2llm library

go get -u github.com/opengs/file2llm

Optionally, install the dependencies needed for PDF and image (OCR) support:

sudo apt install -y libpoppler-glib-dev libcairo2 libcairo2-dev libtesseract-dev

The following program extracts text from a PDF, including text inside embedded images:

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/opengs/file2llm/ocr"
	"github.com/opengs/file2llm/parser"
)

func main() {
	// Open the input document.
	fp, err := os.Open("file.pdf")
	if err != nil {
		panic(err.Error())
	}
	defer fp.Close()

	// Initialize OCR to be able to extract text from images.
	ocrProvider := ocr.NewTesseractProvider(ocr.DefaultTesseractConfig())
	if err := ocrProvider.Init(); err != nil {
		panic(err.Error())
	}
	defer ocrProvider.Destroy()

	// Parse the file and print the extracted text to standard output.
	p := parser.New(ocrProvider)
	result := p.Parse(context.Background(), fp)
	fmt.Println(result.String())
}

Run the code with build tags to enable the optional file2llm features:

go run -tags=file2llm_feature_tesseract,file2llm_feature_pdf main.go

Features

Format	CGO	Build tags	Requires OCR	Required libraries	Notes
png	NO		YES
jpeg	NO		YES
webp	NO		YES
gif	NO		YES		Extracts the first frame
bmp	NO		YES
tiff	NO		YES
pdf	YES	file2llm_feature_pdf	optional	poppler-utils libpoppler-dev libpoppler-glib-dev libcairo2 libcairo2-dev	Extracts text from embedded images using OCR if available
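The "first frame" note for GIF files can be illustrated with the standard library, where `gif.Decode` returns only the first embedded image of an animation. Whether file2llm itself relies on `image/gif` is an assumption; the helper names `twoFrameGIF` and `firstFrame` are hypothetical and exist only for this sketch.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/gif"
)

// twoFrameGIF builds a 4x3 animated GIF in memory: the first frame is
// solid red, the second frame is solid blue.
func twoFrameGIF() ([]byte, error) {
	palette := color.Palette{
		color.Black,
		color.RGBA{R: 255, A: 255}, // index 1: red
		color.RGBA{B: 255, A: 255}, // index 2: blue
	}
	anim := &gif.GIF{}
	for _, idx := range []uint8{1, 2} {
		frame := image.NewPaletted(image.Rect(0, 0, 4, 3), palette)
		for i := range frame.Pix {
			frame.Pix[i] = idx
		}
		anim.Image = append(anim.Image, frame)
		anim.Delay = append(anim.Delay, 10) // delay in 100ths of a second
	}
	var buf bytes.Buffer
	if err := gif.EncodeAll(&buf, anim); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// firstFrame decodes only the first image of a GIF, which is the frame
// an OCR step would then receive in this sketch.
func firstFrame(data []byte) (image.Image, error) {
	return gif.Decode(bytes.NewReader(data))
}

func main() {
	data, err := twoFrameGIF()
	if err != nil {
		panic(err)
	}
	img, err := firstFrame(data)
	if err != nil {
		panic(err)
	}
	r, g, b, _ := img.At(0, 0).RGBA()
	fmt.Println("size:", img.Bounds().Dx(), "x", img.Bounds().Dy())
	fmt.Println("first frame is red:", r == 0xffff && g == 0 && b == 0)
}
```

Text that appears only in later frames of an animation would not be seen by OCR under this approach.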

OCR Provider	CGO	Required tags	Required libraries
Tesseract	YES	file2llm_feature_tesseract	tesseract libtesseract-dev
Tesseract Server	NO

License

AGPL-3.0. A commercial license is in progress.

Documentation

Index

type Config
type Engine
- func (e *Engine) Process(ctx context.Context) error

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// Number of sources processed simultaneously
	Parallelism uint32
}

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

func (*Engine) Process

func (e *Engine) Process(ctx context.Context) error

Source Files

engine.go

Directories

Path	Synopsis
chunker
    slide_chunk
embedder
    lib
    ollama
    openai
    testlib
ocr
    gosseract
parser
    bgra
source
    fs
storage
    pgvector
    pgvector/migrations
    testlib