extraction

package
v1.2.3 Latest
Published: Mar 26, 2026 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func Extensions

func Extensions() []string

Extensions returns the supported document extensions (e.g. ".pdf", ".docx").

func IsDocument

func IsDocument(ext string) bool

IsDocument reports whether ext is a supported document format.
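
As a rough illustration of how an extension list and a membership check can fit together, the sketch below keeps a set of extensions and answers both questions from it. The names supported, extensions, and isDocument are hypothetical, the set holds only the two extensions mentioned above, and exact case-sensitive matching is an assumption; the package's real list and matching rules are not stated here.

```go
package main

import (
	"fmt"
	"sort"
)

// supported is a hypothetical set holding only the two extensions the
// documentation names as examples; the real package may support more.
var supported = map[string]struct{}{
	".pdf":  {},
	".docx": {},
}

// extensions returns the supported extensions in sorted order.
func extensions() []string {
	out := make([]string, 0, len(supported))
	for ext := range supported {
		out = append(out, ext)
	}
	sort.Strings(out)
	return out
}

// isDocument reports whether ext (including the leading dot) is in the set.
// Exact, case-sensitive matching is an assumption of this sketch.
func isDocument(ext string) bool {
	_, ok := supported[ext]
	return ok
}

func main() {
	fmt.Println(extensions())
	fmt.Println(isDocument(".pdf"), isDocument(".txt"))
}
```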

Types

type CSVText added in v1.2.1

type CSVText struct {
	// contains filtered or unexported fields
}

CSVText converts CSV content into an indexable text representation.

The output contains three sections joined by newlines:

  1. All column header names (if a header row is present).
  2. Deduplicated string values from every non-numeric column.
  3. The first few data rows written back as CSV.

A column is considered numeric when every non-empty value in that column can be parsed as a float64. Columns with at least one non-numeric value are treated as string columns.
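
The numeric-column rule can be sketched with the standard library alone. This is not the package's implementation: numericColumns is a hypothetical helper, and treating the first record as a header row is an assumption made for the example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// numericColumns reports, per column, whether every non-empty value parses
// as a float64. The first record is treated as a header row and skipped,
// which is an assumption of this sketch.
func numericColumns(data string) ([]bool, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(records) == 0 {
		return nil, nil
	}
	numeric := make([]bool, len(records[0]))
	for i := range numeric {
		numeric[i] = true
	}
	for _, row := range records[1:] {
		for i, v := range row {
			if v == "" {
				continue // empty cells do not affect the classification
			}
			if _, err := strconv.ParseFloat(v, 64); err != nil {
				numeric[i] = false // one non-numeric value makes it a string column
			}
		}
	}
	return numeric, nil
}

func main() {
	data := "name,age,score\nalice,30,1.5\nbob,,n/a\n"
	cols, err := numericColumns(data)
	fmt.Println(cols, err)
}
```

In this sample, age stays numeric despite the empty cell, while score becomes a string column because of the single value n/a. Under this rule a column with no non-empty values also counts as numeric.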

func NewCSVText added in v1.2.1

func NewCSVText() *CSVText

NewCSVText creates a CSVText with default settings.

func (*CSVText) Text added in v1.2.1

func (c *CSVText) Text(content []byte) (string, error)

Text converts CSV bytes into a searchable string.

type DocumentText

type DocumentText struct{}

DocumentText extracts plain text from binary document files using tabula.

func NewDocumentText

func NewDocumentText() *DocumentText

NewDocumentText creates a new DocumentText.

func (*DocumentText) Text

func (d *DocumentText) Text(path string) (string, error)

Text extracts readable text from the file at the given path. It validates that the file exists, has a supported extension, and is within the maximum size limit before passing it to tabula.
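
The three pre-flight checks could look roughly like the sketch below. The helper validate, the 50 MiB value of maxSize, the order of the checks, and the error messages are all assumptions; the documentation does not state the real limit or error behavior.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// maxSize is a hypothetical limit; the real maximum is not documented here.
const maxSize = 50 << 20 // 50 MiB

// validate checks that path exists, has a supported extension, and is no
// larger than maxSize. The supported set is passed in by the caller.
func validate(path string, supported map[string]bool) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if info.IsDir() {
		return errors.New("path is a directory")
	}
	ext := filepath.Ext(path)
	if !supported[ext] {
		return fmt.Errorf("unsupported extension %q", ext)
	}
	if info.Size() > maxSize {
		return fmt.Errorf("file is %d bytes, limit is %d", info.Size(), maxSize)
	}
	return nil
}

func main() {
	supported := map[string]bool{".pdf": true, ".docx": true}
	// A missing file fails the first check.
	fmt.Println(validate("missing.pdf", supported))
}
```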

type Extractors added in v1.2.1

type Extractors struct {
	// contains filtered or unexported fields
}

Extractors maps file extensions to text extractors.

func NewExtractors added in v1.2.1

func NewExtractors() *Extractors

NewExtractors creates an Extractors with CSV and plain-text extractors.

func (*Extractors) For added in v1.2.1

func (e *Extractors) For(ext string) TextExtractor

For returns the text extractor for the given file extension.
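
One way to picture an extension-to-extractor lookup is a map with a default. In the sketch below, registry, forExt, plainText, and csvStub are hypothetical names; lowercasing the extension and falling back to plain text for unknown extensions are assumptions, not documented behavior of For.

```go
package main

import (
	"fmt"
	"strings"
)

// TextExtractor mirrors the interface shown in the documentation.
type TextExtractor interface {
	Text(content []byte) (string, error)
}

// plainText is a hypothetical stand-in for a plain-text extractor.
type plainText struct{}

func (plainText) Text(content []byte) (string, error) {
	return string(content), nil
}

// csvStub is a hypothetical stand-in for a CSV extractor; it only replaces
// commas with spaces so the two extractors are distinguishable.
type csvStub struct{}

func (csvStub) Text(content []byte) (string, error) {
	return strings.ReplaceAll(string(content), ",", " "), nil
}

// registry maps file extensions to extractors.
type registry struct {
	byExt    map[string]TextExtractor
	fallback TextExtractor
}

func newRegistry() *registry {
	return &registry{
		byExt:    map[string]TextExtractor{".csv": csvStub{}},
		fallback: plainText{},
	}
}

// forExt returns the extractor registered for ext. Lowercasing ext and
// using a fallback for unknown extensions are assumptions of this sketch.
func (r *registry) forExt(ext string) TextExtractor {
	if e, ok := r.byExt[strings.ToLower(ext)]; ok {
		return e
	}
	return r.fallback
}

func main() {
	r := newRegistry()
	for _, ext := range []string{".csv", ".go"} {
		text, err := r.forExt(ext).Text([]byte("a,b,c"))
		fmt.Println(ext, text, err)
	}
}
```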

type SourceText added in v1.2.1

type SourceText struct{}

SourceText treats file content as plain text. Binary files (containing null bytes in the first 8 KB) produce empty text.

func NewSourceText added in v1.2.1

func NewSourceText() *SourceText

NewSourceText creates a SourceText.

func (*SourceText) Text added in v1.2.1

func (s *SourceText) Text(content []byte) (string, error)

Text returns the content as a string, or an empty string if the content appears to be binary.
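
The null-byte heuristic is easy to sketch. The helpers looksBinary and sourceText are hypothetical, and reading "8 KB" as 8192 bytes is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffLen is the inspected prefix length; 8192 bytes is assumed for "8 KB".
const sniffLen = 8 * 1024

// looksBinary reports whether a null byte occurs within the first
// sniffLen bytes of content.
func looksBinary(content []byte) bool {
	head := content
	if len(head) > sniffLen {
		head = head[:sniffLen]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// sourceText returns content as a string, or "" when it looks binary.
func sourceText(content []byte) string {
	if looksBinary(content) {
		return ""
	}
	return string(content)
}

func main() {
	fmt.Printf("%q\n", sourceText([]byte("package main\n")))
	fmt.Printf("%q\n", sourceText([]byte{'E', 'L', 'F', 0, 1}))
}
```

With this approach, a null byte that first appears after the inspected prefix is not detected, so such content is still returned as text.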

type TextExtractor added in v1.2.1

type TextExtractor interface {
	Text(content []byte) (string, error)
}

TextExtractor converts raw file bytes into indexable plain text. Text returns an empty string when the content should be skipped.
