extraction

package

v1.2.3 (latest) · Published: Mar 26, 2026 · License: Apache-2.0 · Imports: 8 · Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/helixml/kodit

Documentation ¶

Index ¶

func Extensions() []string
func IsDocument(ext string) bool
type CSVText
- func NewCSVText() *CSVText
- func (c *CSVText) Text(content []byte) (string, error)
type DocumentText
- func NewDocumentText() *DocumentText
- func (d *DocumentText) Text(path string) (string, error)
type Extractors
- func NewExtractors() *Extractors
- func (e *Extractors) For(ext string) TextExtractor
type SourceText
- func NewSourceText() *SourceText
- func (s *SourceText) Text(content []byte) (string, error)
type TextExtractor

Functions ¶

func Extensions ¶

func Extensions() []string

Extensions returns the supported document extensions (e.g. ".pdf", ".docx").

func IsDocument ¶

func IsDocument(ext string) bool

IsDocument reports whether ext is a supported document extension.
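
The relationship between Extensions and IsDocument can be pictured as a set lookup. The following is a minimal sketch, not the package's implementation: the lowercase names extensions, isDocument, and supported are hypothetical, the set holds only the two extensions named above, and the case-insensitive match is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// supported holds only the two extensions mentioned in the documentation;
// the real package may support more.
var supported = map[string]struct{}{
	".pdf":  {},
	".docx": {},
}

// extensions returns the supported extensions in sorted order.
func extensions() []string {
	out := make([]string, 0, len(supported))
	for ext := range supported {
		out = append(out, ext)
	}
	sort.Strings(out)
	return out
}

// isDocument reports whether ext is in the supported set.
// Lowercasing the input is an assumption of this sketch.
func isDocument(ext string) bool {
	_, ok := supported[strings.ToLower(ext)]
	return ok
}

func main() {
	fmt.Println(extensions())
	fmt.Println(isDocument(".PDF"), isDocument(".txt"))
}
```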

Types ¶

type CSVText ¶ added in v1.2.1

type CSVText struct {
	// contains filtered or unexported fields
}

CSVText converts CSV content into an indexable text representation.

The output contains three sections joined by newlines:

1. All column header names (if a header row is present).
2. Deduplicated string values from every non-numeric column.
3. The first few data rows written back as CSV.

A column is considered numeric when every non-empty value in that column can be parsed as a float64. Columns with at least one non-numeric value are treated as string columns.
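
The numeric-column rule can be sketched with strconv.ParseFloat. This is an illustration under stated assumptions rather than the package's code: the helper name numericColumns is hypothetical, and the first record is assumed to be a header row.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// numericColumns reports, per column, whether every non-empty value parses
// as a float64. The first record is assumed to be a header and is skipped.
// A column whose values are all empty stays marked numeric in this sketch.
func numericColumns(records [][]string) []bool {
	if len(records) == 0 {
		return nil
	}
	numeric := make([]bool, len(records[0]))
	for i := range numeric {
		numeric[i] = true
	}
	for _, row := range records[1:] {
		for i, v := range row {
			if i >= len(numeric) {
				break // ignore fields beyond the header width
			}
			if v == "" {
				continue // empty values do not affect the classification
			}
			if _, err := strconv.ParseFloat(v, 64); err != nil {
				numeric[i] = false
			}
		}
	}
	return numeric
}

func main() {
	data := "name,price,qty\nwidget,1.50,3\ngadget,,n/a\n"
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		panic(err)
	}
	fmt.Println(numericColumns(records)) // [false true false]
}
```

Here price stays numeric despite the empty cell, while the single "n/a" makes qty a string column.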

func NewCSVText ¶ added in v1.2.1

func NewCSVText() *CSVText

NewCSVText creates a CSVText with default settings.

func (*CSVText) Text ¶ added in v1.2.1

func (c *CSVText) Text(content []byte) (string, error)

Text converts CSV bytes into a searchable string.
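
One way the three-section output described for CSVText might be assembled is sketched below. Everything here is an assumption made for illustration: the function name summarize, the previewRows limit of 2, the space-separated joins, and the presence of a header row. The numeric flags are passed in by the caller.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// previewRows is a hypothetical limit on how many data rows are echoed back.
const previewRows = 2

// summarize joins three sections with newlines: the header names, the
// deduplicated values of non-numeric columns, and a short CSV preview.
// numeric[i] marks column i as numeric; such columns contribute no values.
func summarize(records [][]string, numeric []bool) (string, error) {
	if len(records) == 0 {
		return "", nil
	}
	header, rows := records[0], records[1:]

	seen := make(map[string]bool)
	var values []string
	for _, row := range rows {
		for i, v := range row {
			if v == "" || (i < len(numeric) && numeric[i]) || seen[v] {
				continue
			}
			seen[v] = true
			values = append(values, v)
		}
	}

	if len(rows) > previewRows {
		rows = rows[:previewRows]
	}
	var preview bytes.Buffer
	if err := csv.NewWriter(&preview).WriteAll(rows); err != nil {
		return "", err
	}

	sections := []string{
		strings.Join(header, " "),
		strings.Join(values, " "),
		strings.TrimRight(preview.String(), "\n"),
	}
	return strings.Join(sections, "\n"), nil
}

func main() {
	records := [][]string{
		{"city", "population"},
		{"Oslo", "700000"},
		{"Bergen", "290000"},
		{"Oslo", "710000"},
	}
	out, err := summarize(records, []bool{false, true})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Deduplication keeps "Oslo" once in the values section even though it appears in two rows, which keeps the indexable text short for repetitive data.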

type DocumentText ¶

type DocumentText struct{}

DocumentText extracts plain text from binary document files using tabula.

func NewDocumentText ¶

func NewDocumentText() *DocumentText

NewDocumentText creates a new DocumentText.

func (*DocumentText) Text ¶

func (d *DocumentText) Text(path string) (string, error)

Text extracts readable text from the file at the given path. It validates that the file exists, has a supported extension, and is within the maximum size limit before passing it to tabula.
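
The pre-flight checks described above (existence, extension, size) might look roughly like the sketch below. The names validate and checkMeta, the 10 MiB limit, and the two-entry extension set are all hypothetical; the package's actual limit and supported formats are not stated here, and the hand-off to tabula is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// maxSize is a hypothetical limit chosen for this sketch.
const maxSize = 10 << 20 // 10 MiB

// supported holds only the extensions named in the documentation.
var supported = map[string]bool{".pdf": true, ".docx": true}

// checkMeta validates the extension and size without touching the disk.
func checkMeta(ext string, size int64) error {
	if !supported[strings.ToLower(ext)] {
		return fmt.Errorf("unsupported extension %q", ext)
	}
	if size > maxSize {
		return fmt.Errorf("file is %d bytes, limit is %d", size, maxSize)
	}
	return nil
}

// validate checks that path exists and is a regular, supported,
// reasonably sized file before any extraction is attempted.
func validate(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("stat %s: %w", path, err)
	}
	if info.IsDir() {
		return errors.New("path is a directory")
	}
	return checkMeta(filepath.Ext(path), info.Size())
}

func main() {
	fmt.Println(validate("does-not-exist.pdf"))
	fmt.Println(checkMeta(".txt", 10))
	fmt.Println(checkMeta(".pdf", maxSize+1))
	fmt.Println(checkMeta(".pdf", 1024))
}
```

Running the cheap checks first avoids handing a missing, oversized, or unsupported file to the extraction library.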

type Extractors ¶ added in v1.2.1

type Extractors struct {
	// contains filtered or unexported fields
}

Extractors maps file extensions to text extractors.

func NewExtractors ¶ added in v1.2.1

func NewExtractors() *Extractors

NewExtractors creates an Extractors with CSV and plain-text extractors.

func (*Extractors) For ¶ added in v1.2.1

func (e *Extractors) For(ext string) TextExtractor

For returns the text extractor for the given file extension.
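
An extension-to-extractor mapping of this kind can be sketched as a map with a fallback. The types registry and stub are hypothetical, and both the case-insensitive lookup and the plain-text fallback for unknown extensions are assumptions; the documentation does not say what For returns for an unregistered extension.

```go
package main

import (
	"fmt"
	"strings"
)

// textExtractor mirrors the shape of the TextExtractor interface.
type textExtractor interface {
	Text(content []byte) (string, error)
}

// stub is a stand-in extractor that tags its output with a name.
type stub struct{ name string }

func (s stub) Text(content []byte) (string, error) {
	return s.name + ": " + string(content), nil
}

// registry maps extensions to extractors, with a fallback for the rest.
type registry struct {
	byExt    map[string]textExtractor
	fallback textExtractor
}

func newRegistry() *registry {
	return &registry{
		byExt:    map[string]textExtractor{".csv": stub{name: "csv"}},
		fallback: stub{name: "plain"},
	}
}

// lookup returns the extractor registered for ext, or the fallback.
func (r *registry) lookup(ext string) textExtractor {
	if e, ok := r.byExt[strings.ToLower(ext)]; ok {
		return e
	}
	return r.fallback
}

func main() {
	r := newRegistry()
	for _, ext := range []string{".csv", ".CSV", ".go"} {
		text, _ := r.lookup(ext).Text([]byte("a,b"))
		fmt.Println(ext, "->", text)
	}
}
```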

type SourceText ¶ added in v1.2.1

type SourceText struct{}

SourceText treats file content as plain text. Binary files (containing null bytes in the first 8 KB) produce empty text.

func NewSourceText ¶ added in v1.2.1

func NewSourceText() *SourceText

NewSourceText creates a SourceText.

func (*SourceText) Text ¶ added in v1.2.1

func (s *SourceText) Text(content []byte) (string, error)

Text returns the content as a string, or an empty string if the content appears to be binary.
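
The null-byte heuristic can be sketched in a few lines. The names looksBinary and sourceText are hypothetical, and reading "8 KB" as 8192 bytes is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffLen is the number of leading bytes inspected; 8 KB is taken
// to mean 8192 bytes in this sketch.
const sniffLen = 8 * 1024

// looksBinary reports whether a null byte occurs in the first sniffLen bytes.
func looksBinary(content []byte) bool {
	head := content
	if len(head) > sniffLen {
		head = head[:sniffLen]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// sourceText returns content as a string, or "" when it looks binary.
func sourceText(content []byte) (string, error) {
	if looksBinary(content) {
		return "", nil
	}
	return string(content), nil
}

func main() {
	text, _ := sourceText([]byte("package main\n"))
	fmt.Printf("%q\n", text)

	bin, _ := sourceText([]byte{0x7f, 'E', 'L', 'F', 0x00})
	fmt.Printf("%q\n", bin)
}
```

Limiting the scan to a fixed prefix bounds the cost on large files, at the price of missing null bytes that appear only later in the content.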

type TextExtractor ¶ added in v1.2.1

type TextExtractor interface {
	Text(content []byte) (string, error)
}

TextExtractor converts raw file bytes into indexable plain text. Text returns an empty string when the content should be skipped.
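
From a caller's side, the "empty string means skip" contract might be honored as in the sketch below. The interface is redeclared locally so the example is self-contained; the trimmed extractor and the collect helper are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// textExtractor mirrors the shape of the TextExtractor interface.
type textExtractor interface {
	Text(content []byte) (string, error)
}

// trimmed is a hypothetical extractor that yields "" for
// whitespace-only content, signalling that it should be skipped.
type trimmed struct{}

func (trimmed) Text(content []byte) (string, error) {
	return strings.TrimSpace(string(content)), nil
}

// collect runs ex over each file and keeps only non-empty results.
func collect(ex textExtractor, files map[string][]byte) (map[string]string, error) {
	out := make(map[string]string)
	for name, content := range files {
		text, err := ex.Text(content)
		if err != nil {
			return nil, fmt.Errorf("extract %s: %w", name, err)
		}
		if text == "" {
			continue // nothing to index for this file
		}
		out[name] = text
	}
	return out, nil
}

func main() {
	files := map[string][]byte{
		"notes.txt": []byte("  hello  "),
		"blank.txt": []byte(" \n\t"),
	}
	indexed, err := collect(trimmed{}, files)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(indexed), indexed["notes.txt"])
}
```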
