file2llm

v0.0.18
Published: Jun 15, 2025 License: AGPL-3.0 Imports: 11 Imported by: 0

README

File to LLM

A Go library that converts files in multiple formats into text that LLMs can understand.

application/pdf application/msword application/vnd.openxmlformats-officedocument.wordprocessingml.document application/vnd.ms-powerpoint application/vnd.openxmlformats-officedocument.presentationml.presentation application/vnd.oasis.opendocument.text application/vnd.apple.pages application/rtf message/rfc822
image/png image/jpeg image/webp image/bmp image/gif image/tiff
application/zip application/vnd.rar application/x-7z-compressed application/gzip application/tar application/x-bzip2

File2LLM is specifically designed to work with LLMs. Unlike other Golang solutions, it preserves text location, padding, and formatting, adding structural boundaries that are understandable by LLMs. It also performs additional processing to ensure that the extracted text is properly interpretable by LLMs.

File2LLM can handle nested file formats (such as archives) by recursively reading them and creating structured file information suitable for LLM input.
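
The recursive idea can be illustrated with the standard library alone. The following is a minimal sketch, not file2llm's implementation: `makeZip`, `walkZip`, and `maxDepth` are hypothetical names, only zip archives are handled, and nesting is detected by the `.zip` file extension.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"path"
	"sort"
	"strings"
)

// maxDepth guards against deeply nested (or malicious) archives.
const maxDepth = 8

// makeZip builds an in-memory zip archive from name -> content pairs.
func makeZip(files map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, data := range files {
		f, err := w.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := f.Write(data); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// walkZip returns the sorted paths of all regular files in the archive,
// descending into entries whose name ends in ".zip".
func walkZip(data []byte, prefix string, depth int) ([]string, error) {
	if depth > maxDepth {
		return nil, fmt.Errorf("archive nesting exceeds %d levels", maxDepth)
	}
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var out []string
	for _, f := range r.File {
		if f.FileInfo().IsDir() {
			continue
		}
		full := path.Join(prefix, f.Name)
		if !strings.HasSuffix(strings.ToLower(f.Name), ".zip") {
			out = append(out, full)
			continue
		}
		// Nested archive: read it into memory and recurse.
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		content, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		nested, err := walkZip(content, full, depth+1)
		if err != nil {
			return nil, err
		}
		out = append(out, nested...)
	}
	sort.Strings(out)
	return out, nil
}

func main() {
	inner, err := makeZip(map[string][]byte{"notes.txt": []byte("hello")})
	if err != nil {
		panic(err)
	}
	outer, err := makeZip(map[string][]byte{"readme.txt": []byte("top"), "inner.zip": inner})
	if err != nil {
		panic(err)
	}
	names, err := walkZip(outer, "", 0)
	if err != nil {
		panic(err)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Each nested file is reported under the path of the archive that contains it, which is one way to keep the structure visible in text handed to an LLM.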

It is optimized with custom cgo code and assembly.

Example

Get the main file2llm library

go get -u github.com/opengs/file2llm

Optionally, install the system dependencies needed to work with PDFs and images (OCR):

sudo apt install -y libpoppler-glib-dev libcairo2 libcairo2-dev libtesseract-dev

The following program extracts text from a PDF, including text inside embedded images:

package main

import (
	"context"
	"os"

	"github.com/opengs/file2llm/ocr"
	"github.com/opengs/file2llm/parser"
)

func main() {
	fp, err := os.Open("file.pdf")
	if err != nil {
		panic(err.Error())
	}
	defer fp.Close()

	// Initialize OCR to be able to extract text from images
	ocrProvider := ocr.NewTesseractProvider(ocr.DefaultTesseractConfig())
	if err := ocrProvider.Init(); err != nil {
		panic(err.Error())
	}
	defer ocrProvider.Destroy()

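	// Create a parser that uses the OCR provider for embedded images
	p := parser.New(ocrProvider)
	// Parse the file and print the extracted text
	result := p.Parse(context.Background(), fp)
	println(result.String())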
}

Run the code with build tags to enable the optional features of the file2llm library.

go run -tags=file2llm_feature_tesseract,file2llm_feature_pdf main.go

Features

Format  CGO  Build tags            Requires OCR  Required libraries                                                        Notes
png     NO   -                     YES           -                                                                         -
jpeg    NO   -                     YES           -                                                                         -
webp    NO   -                     YES           -                                                                         -
gif     NO   -                     YES           -                                                                         Extracts first frame
bmp     NO   -                     YES           -                                                                         -
tiff    NO   -                     YES           -                                                                         -
pdf     YES  file2llm_feature_pdf  optional      poppler-utils libpoppler-dev libpoppler-glib-dev libcairo2 libcairo2-dev  Extracts text from embedded images using OCR if available

OCR provider      CGO  Required tags               Required libraries
Tesseract         YES  file2llm_feature_tesseract  tesseract libtesseract-dev
Tesseract Server  NO   -                           -
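
A caller may want to check up front whether a file needs an OCR provider. The following is a minimal sketch that encodes the "Requires OCR" column and sniffs the type with `http.DetectContentType` from the standard library. The names `ocrNeed`, `ocrByMIME`, and `ocrNeedFor` are hypothetical and not part of file2llm, and the standard sniffer does not recognize every format listed (TIFF, for example).

```go
package main

import (
	"fmt"
	"net/http"
)

// ocrNeed describes whether a format depends on an OCR provider.
type ocrNeed int

const (
	ocrUnknown  ocrNeed = iota // format is not covered by the table
	ocrRequired                // no text can be extracted without OCR
	ocrOptional                // OCR only adds text from embedded images
)

// ocrByMIME mirrors the "Requires OCR" column of the feature table.
var ocrByMIME = map[string]ocrNeed{
	"image/png":       ocrRequired,
	"image/jpeg":      ocrRequired,
	"image/webp":      ocrRequired,
	"image/gif":       ocrRequired,
	"image/bmp":       ocrRequired,
	"image/tiff":      ocrRequired,
	"application/pdf": ocrOptional,
}

// ocrNeedFor sniffs the MIME type from the first bytes of a file and
// looks up its OCR requirement. Unlisted types yield ocrUnknown.
func ocrNeedFor(head []byte) (string, ocrNeed) {
	mime := http.DetectContentType(head)
	return mime, ocrByMIME[mime]
}

func main() {
	samples := [][]byte{
		[]byte("\x89PNG\r\n\x1a\n"), // PNG signature
		[]byte("%PDF-1.7"),          // PDF header
	}
	for _, s := range samples {
		mime, need := ocrNeedFor(s)
		fmt.Println(mime, need == ocrRequired, need == ocrOptional)
	}
}
```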

License

AGPL-3.0. A commercial license is in progress.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// Number of sources processed simultaneously
	Parallelism uint32
}
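
How Engine applies Parallelism is not documented on this page. As an illustration only, the following sketch shows one common way to bound concurrency in Go, using a buffered channel as a counting semaphore. `processAll` and `measurePeak` are hypothetical helpers, not part of file2llm, and treating a zero value as sequential processing is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// processAll calls work for every source while keeping at most
// `parallelism` calls in flight. It returns the first error seen.
func processAll(ctx context.Context, sources []string, parallelism uint32,
	work func(context.Context, string) error) error {
	if parallelism == 0 {
		parallelism = 1 // assumption: zero means sequential processing
	}
	sem := make(chan struct{}, parallelism) // counting semaphore
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for _, src := range sources {
		select {
		case <-ctx.Done():
			wg.Wait()
			return ctx.Err()
		case sem <- struct{}{}: // blocks while all slots are busy
		}
		wg.Add(1)
		go func(src string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot
			if err := work(ctx, src); err != nil {
				once.Do(func() { firstErr = err })
			}
		}(src)
	}
	wg.Wait()
	return firstErr
}

// measurePeak runs n dummy sources and reports the highest number of
// work calls that were active at the same time.
func measurePeak(n int, parallelism uint32) (int32, error) {
	var active, peak int32
	sources := make([]string, n)
	work := func(ctx context.Context, _ string) error {
		cur := atomic.AddInt32(&active, 1)
		defer atomic.AddInt32(&active, -1)
		for {
			old := atomic.LoadInt32(&peak)
			if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
				break
			}
		}
		time.Sleep(2 * time.Millisecond) // simulate parsing work
		return nil
	}
	err := processAll(context.Background(), sources, parallelism, work)
	return atomic.LoadInt32(&peak), err
}

func main() {
	peak, err := measurePeak(8, 2)
	fmt.Println("peak concurrency:", peak, "error:", err)
}
```

The observed peak never exceeds the configured limit, because a goroutine is started only after a semaphore slot has been acquired.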

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

func (*Engine) Process

func (e *Engine) Process(ctx context.Context) error
