extraction

package
v1.3.6
Published: Apr 21, 2026 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Extensions

func Extensions() []string

Extensions returns the supported document extensions (e.g. ".pdf", ".docx").

func IsDocument

func IsDocument(ext string) bool

IsDocument returns true if the extension is a supported document format.

Types

type CSVText added in v1.2.1

type CSVText struct {
	// contains filtered or unexported fields
}

CSVText converts CSV content into an indexable text representation.

The output contains three sections joined by newlines:

  1. All column header names (if a header row is present).
  2. Deduplicated string values from every non-numeric column.
  3. The first few data rows written back as CSV.

A column is considered numeric when every non-empty value in that column can be parsed as a float64. Columns with at least one non-numeric value are treated as string columns.
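
The numeric-column rule above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: isNumericColumn is a hypothetical helper, and it assumes values are tested exactly as they appear in the CSV, with no whitespace trimming.

```go
package main

import (
	"fmt"
	"strconv"
)

// isNumericColumn reports whether every non-empty value parses as a float64.
// Hypothetical helper for illustration; it is not part of the package API.
func isNumericColumn(values []string) bool {
	for _, v := range values {
		if v == "" {
			continue // empty cells do not affect the classification
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			return false // one non-numeric value makes it a string column
		}
	}
	return true
}

func main() {
	fmt.Println(isNumericColumn([]string{"1", "2.5", "", "-3e2"})) // true
	fmt.Println(isNumericColumn([]string{"1", "n/a", "3"}))        // false
}
```

In this sketch a column with only empty values counts as numeric, because no value fails to parse. The documentation does not say how CSVText treats that case.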

func NewCSVText added in v1.2.1

func NewCSVText() *CSVText

NewCSVText creates a CSVText with default settings.

func (*CSVText) Text added in v1.2.1

func (c *CSVText) Text(content []byte) (string, error)

Text converts CSV bytes into a searchable string.

type DocumentText

type DocumentText struct{}

DocumentText extracts plain text from binary document files using tabula.

func NewDocumentText

func NewDocumentText() *DocumentText

NewDocumentText creates a new DocumentText.

func (*DocumentText) Text

func (d *DocumentText) Text(path string) (string, error)

Text extracts readable text from the file at the given path. It validates that the file exists, has a supported extension, and is within the maximum size limit before passing it to tabula.

type Extractors added in v1.2.1

type Extractors struct {
	// contains filtered or unexported fields
}

Extractors maps file extensions to text extractors.

func NewExtractors added in v1.2.1

func NewExtractors() *Extractors

NewExtractors creates an Extractors with CSV and plain-text extractors.

func (*Extractors) For added in v1.2.1

func (e *Extractors) For(ext string) TextExtractor

For returns the text extractor for the given file extension.
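
One way to picture an extension-to-extractor lookup is a map with a fallback. The sketch below is self-contained and does not use the package. The extractorTable type, the extractorFunc adapter, the lowercase normalization, and the plain-text fallback for unknown extensions are all assumptions made for illustration; the documentation does not state how Extractors handles case or unregistered extensions.

```go
package main

import (
	"fmt"
	"strings"
)

// TextExtractor mirrors the documented interface so the sketch compiles alone.
type TextExtractor interface {
	Text(content []byte) (string, error)
}

// extractorFunc adapts a plain function to TextExtractor (hypothetical).
type extractorFunc func(content []byte) (string, error)

func (f extractorFunc) Text(content []byte) (string, error) { return f(content) }

// extractorTable is a hypothetical lookup table with a fallback extractor.
type extractorTable struct {
	byExt    map[string]TextExtractor
	fallback TextExtractor
}

// For returns the extractor registered for ext, or the fallback.
// Lowercasing the extension is an assumption of this sketch.
func (t extractorTable) For(ext string) TextExtractor {
	if x, ok := t.byExt[strings.ToLower(ext)]; ok {
		return x
	}
	return t.fallback
}

// newDemoTable wires a stand-in CSV extractor and a plain-text fallback.
func newDemoTable() extractorTable {
	csvLike := extractorFunc(func(content []byte) (string, error) {
		return "csv:" + string(content), nil
	})
	plain := extractorFunc(func(content []byte) (string, error) {
		return string(content), nil
	})
	return extractorTable{
		byExt:    map[string]TextExtractor{".csv": csvLike},
		fallback: plain,
	}
}

func main() {
	table := newDemoTable()
	for _, ext := range []string{".csv", ".CSV", ".go"} {
		text, err := table.For(ext).Text([]byte("a,b"))
		fmt.Println(ext, text, err)
	}
}
```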

type PDFTextRenderer added in v1.3.1

type PDFTextRenderer struct{}

PDFTextRenderer extracts text from individual PDF pages using tabula.

func NewPDFTextRenderer added in v1.3.1

func NewPDFTextRenderer() *PDFTextRenderer

NewPDFTextRenderer creates a new PDFTextRenderer.

func (*PDFTextRenderer) Close added in v1.3.1

func (r *PDFTextRenderer) Close() error

Close is a no-op; PDFTextRenderer holds no persistent resources.

func (*PDFTextRenderer) PageCount added in v1.3.1

func (r *PDFTextRenderer) PageCount(path string) (int, error)

PageCount returns the number of pages in the PDF.

func (*PDFTextRenderer) Render added in v1.3.1

func (r *PDFTextRenderer) Render(path string, page int) (string, error)

Render returns the text content of the given 1-based page.

type PPTXTextRenderer added in v1.3.1

type PPTXTextRenderer struct{}

PPTXTextRenderer extracts text from individual presentation slides using tabula.

func NewPPTXTextRenderer added in v1.3.1

func NewPPTXTextRenderer() *PPTXTextRenderer

NewPPTXTextRenderer creates a new PPTXTextRenderer.

func (*PPTXTextRenderer) Close added in v1.3.1

func (r *PPTXTextRenderer) Close() error

Close is a no-op; PPTXTextRenderer holds no persistent resources.

func (*PPTXTextRenderer) PageCount added in v1.3.1

func (r *PPTXTextRenderer) PageCount(path string) (int, error)

PageCount returns the number of slides in the presentation.

func (*PPTXTextRenderer) Render added in v1.3.1

func (r *PPTXTextRenderer) Render(path string, page int) (string, error)

Render returns the text content of the given 1-based slide.

type PageBoundary added in v1.3.1

type PageBoundary struct {
	Page       int // 1-based page number
	ByteOffset int // byte offset in the concatenated text
}

PageBoundary records where a page's text begins in a concatenated string.
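
A short sketch of how such boundaries might be produced and used. The PageBoundary type is redeclared locally so the example compiles on its own. joinPages and pageAt are hypothetical helpers, and the single newline placed between pages is an assumption; the documentation does not say how the package concatenates page text.

```go
package main

import (
	"fmt"
	"strings"
)

// PageBoundary mirrors the documented struct.
type PageBoundary struct {
	Page       int // 1-based page number
	ByteOffset int // byte offset in the concatenated text
}

// joinPages concatenates page texts and records where each page begins.
// The newline separator is an assumption of this sketch.
func joinPages(pages []string) (string, []PageBoundary) {
	var b strings.Builder
	bounds := make([]PageBoundary, 0, len(pages))
	for i, p := range pages {
		if i > 0 {
			b.WriteByte('\n')
		}
		bounds = append(bounds, PageBoundary{Page: i + 1, ByteOffset: b.Len()})
		b.WriteString(p)
	}
	return b.String(), bounds
}

// pageAt returns the page whose text contains the given byte offset,
// or 0 when the offset precedes every boundary.
// It expects bounds sorted by ByteOffset, as joinPages produces them.
func pageAt(bounds []PageBoundary, offset int) int {
	page := 0
	for _, pb := range bounds {
		if pb.ByteOffset > offset {
			break
		}
		page = pb.Page
	}
	return page
}

func main() {
	text, bounds := joinPages([]string{"alpha", "beta", "gamma"})
	fmt.Printf("%q\n", text)       // "alpha\nbeta\ngamma"
	fmt.Println(bounds)            // [{1 0} {2 6} {3 11}]
	fmt.Println(pageAt(bounds, 7)) // 2
}
```

Mapping a match offset back to a page number is the typical reason to keep boundaries alongside the concatenated text. For long documents a binary search (sort.Search) could replace the linear scan.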

type SinglePageTextRenderer added in v1.3.1

type SinglePageTextRenderer struct{}

SinglePageTextRenderer extracts text from document formats that are treated as a single page (DOCX, ODT, EPUB). PageCount always returns 1 and only page 1 can be rendered.

func NewSinglePageTextRenderer added in v1.3.1

func NewSinglePageTextRenderer() *SinglePageTextRenderer

NewSinglePageTextRenderer creates a new SinglePageTextRenderer.

func (*SinglePageTextRenderer) Close added in v1.3.1

func (r *SinglePageTextRenderer) Close() error

Close is a no-op; SinglePageTextRenderer holds no persistent resources.

func (*SinglePageTextRenderer) PageCount added in v1.3.1

func (r *SinglePageTextRenderer) PageCount(path string) (int, error)

PageCount always returns 1 for single-page document formats.

func (*SinglePageTextRenderer) Render added in v1.3.1

func (r *SinglePageTextRenderer) Render(path string, page int) (string, error)

Render returns the full document text. Only page 1 is valid.

type SourceText added in v1.2.1

type SourceText struct{}

SourceText treats file content as plain text. Binary files (containing null bytes in the first 8 KB) produce empty text.
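
The binary check described above can be approximated as follows. This is a sketch under stated assumptions, not the package's code: looksBinary and plainText are hypothetical names, 8 KB is taken to mean 8192 bytes, and a single null byte in that prefix is treated as enough to classify the content as binary.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffLen is the prefix length inspected for null bytes (assumed 8192).
const sniffLen = 8 * 1024

// looksBinary reports whether the first sniffLen bytes contain a null byte.
func looksBinary(content []byte) bool {
	head := content
	if len(head) > sniffLen {
		head = head[:sniffLen]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// plainText returns content as a string, or "" when it looks binary.
func plainText(content []byte) string {
	if looksBinary(content) {
		return ""
	}
	return string(content)
}

func main() {
	fmt.Printf("%q\n", plainText([]byte("package main\n"))) // "package main\n"
	fmt.Printf("%q\n", plainText([]byte{'E', 'L', 'F', 0})) // ""
}
```

Because only the prefix is inspected, a null byte that first appears after the 8 KB mark does not trigger the check in this sketch.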

func NewSourceText added in v1.2.1

func NewSourceText() *SourceText

NewSourceText creates a SourceText.

func (*SourceText) Text added in v1.2.1

func (s *SourceText) Text(content []byte) (string, error)

Text returns the content as a string, or empty if the content appears binary.

type TextExtractor added in v1.2.1

type TextExtractor interface {
	Text(content []byte) (string, error)
}

TextExtractor converts raw file bytes into indexable plain text. Text returns an empty string when the content should be skipped.

type TextRenderer added in v1.3.1

type TextRenderer interface {
	io.Closer

	// PageCount returns the number of extractable pages in the document.
	PageCount(path string) (int, error)

	// Render returns the text content of the given 1-based page.
	Render(path string, page int) (string, error)
}

TextRenderer extracts text from individual document pages. For PDFs this means pages; for spreadsheets, sheets; for presentations, slides.
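
A typical consumer of this interface calls PageCount once and then Render for pages 1 through the count. The sketch below redeclares the interface locally so it compiles without the package. memRenderer and renderAll are hypothetical names used only for the demonstration, and the out-of-range error is an assumption about how a renderer might behave.

```go
package main

import (
	"fmt"
	"io"
)

// TextRenderer mirrors the documented interface.
type TextRenderer interface {
	io.Closer
	PageCount(path string) (int, error)
	Render(path string, page int) (string, error)
}

// memRenderer is a hypothetical in-memory renderer; it ignores path.
type memRenderer struct{ pages []string }

func (m memRenderer) Close() error { return nil }

func (m memRenderer) PageCount(string) (int, error) { return len(m.pages), nil }

func (m memRenderer) Render(_ string, page int) (string, error) {
	if page < 1 || page > len(m.pages) {
		return "", fmt.Errorf("page %d out of range 1..%d", page, len(m.pages))
	}
	return m.pages[page-1], nil
}

// renderAll renders every page in order, using 1-based page numbers.
func renderAll(r TextRenderer, path string) ([]string, error) {
	n, err := r.PageCount(path)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, n)
	for page := 1; page <= n; page++ {
		text, err := r.Render(path, page)
		if err != nil {
			return nil, fmt.Errorf("page %d: %w", page, err)
		}
		out = append(out, text)
	}
	return out, nil
}

func main() {
	r := memRenderer{pages: []string{"first", "second"}}
	defer r.Close()
	pages, err := renderAll(r, "demo.pdf")
	fmt.Println(pages, err) // [first second] <nil>
}
```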

type TextRendererRegistry added in v1.3.1

type TextRendererRegistry struct {
	// contains filtered or unexported fields
}

TextRendererRegistry maps file extensions to TextRenderer implementations.

func NewTextRendererRegistry added in v1.3.1

func NewTextRendererRegistry() *TextRendererRegistry

NewTextRendererRegistry creates an empty TextRendererRegistry.

func (*TextRendererRegistry) Close added in v1.3.1

func (r *TextRendererRegistry) Close() error

Close closes all registered text renderers, deduplicating shared instances.
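
Deduplication matters when one renderer instance is registered under several extensions, since closing it once per extension would call Close repeatedly on the same value. A minimal sketch of the idea follows. The closerTable type, closeAll, and countingCloser are hypothetical, the sketch assumes renderers are pointer values (so they are comparable as map keys), and errors.Join requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// closerTable is a hypothetical stand-in for an extension registry.
type closerTable map[string]io.Closer

// countingCloser records how many times Close was called.
type countingCloser struct{ closed int }

func (c *countingCloser) Close() error {
	c.closed++
	return nil
}

// closeAll closes each distinct closer exactly once and joins any errors.
// Using io.Closer as a map key assumes the dynamic types are comparable,
// which holds for pointers.
func closeAll(table closerTable) error {
	seen := make(map[io.Closer]struct{}, len(table))
	var errs []error
	for _, c := range table {
		if _, dup := seen[c]; dup {
			continue // already closed under another extension
		}
		seen[c] = struct{}{}
		if err := c.Close(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	shared := &countingCloser{}
	other := &countingCloser{}
	err := closeAll(closerTable{".docx": shared, ".odt": shared, ".pdf": other})
	fmt.Println(err, shared.closed, other.closed) // <nil> 1 1
}
```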

func (*TextRendererRegistry) For added in v1.3.1

func (r *TextRendererRegistry) For(ext string) (TextRenderer, bool)

For returns the TextRenderer for the given extension, or nil and false if none is registered.

func (*TextRendererRegistry) Register added in v1.3.1

func (r *TextRendererRegistry) Register(ext string, renderer TextRenderer)

Register associates a file extension (e.g. ".pdf") with a TextRenderer.

func (*TextRendererRegistry) Supports added in v1.3.1

func (r *TextRendererRegistry) Supports(ext string) bool

Supports returns true if a TextRenderer is registered for the given extension.

type XLSXTextRenderer added in v1.3.1

type XLSXTextRenderer struct{}

XLSXTextRenderer extracts text from individual spreadsheet sheets using tabula.

func NewXLSXTextRenderer added in v1.3.1

func NewXLSXTextRenderer() *XLSXTextRenderer

NewXLSXTextRenderer creates a new XLSXTextRenderer.

func (*XLSXTextRenderer) Close added in v1.3.1

func (r *XLSXTextRenderer) Close() error

Close is a no-op; XLSXTextRenderer holds no persistent resources.

func (*XLSXTextRenderer) PageCount added in v1.3.1

func (r *XLSXTextRenderer) PageCount(path string) (int, error)

PageCount returns the number of sheets in the workbook.

func (*XLSXTextRenderer) Render added in v1.3.1

func (r *XLSXTextRenderer) Render(path string, page int) (string, error)

Render returns the text content of the given 1-based sheet.
