Documentation
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Extensions ¶
func Extensions() []string
Extensions returns the supported document extensions (e.g. ".pdf", ".docx").
func IsDocument ¶
IsDocument returns true if the extension is a supported document format.
Types ¶
type CSVText ¶ added in v1.2.1
type CSVText struct {
// contains filtered or unexported fields
}
CSVText converts CSV content into an indexable text representation.
The output contains three sections joined by newlines:
- All column header names (if a header row is present).
- Deduplicated string values from every non-numeric column.
- The first few data rows written back as CSV.
A column is considered numeric when every non-empty value in that column can be parsed as a float64. Columns with at least one non-numeric value are treated as string columns.
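The column rule and the three-section layout described above can be sketched with the standard library alone. This is not the package's implementation: `isNumericColumn` and `sketchCSVText` are hypothetical names, and the sketch assumes that a header row is always present, that headers and values are joined with spaces, and that the number of rows kept is passed in as `maxRows`.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// isNumericColumn reports whether every non-empty value parses as a float64.
// A column with no non-empty values therefore counts as numeric.
func isNumericColumn(values []string) bool {
	for _, v := range values {
		if v == "" {
			continue
		}
		if _, err := strconv.ParseFloat(v, 64); err != nil {
			return false
		}
	}
	return true
}

// sketchCSVText builds the three sections: headers, deduplicated string
// values, and the first maxRows data rows written back as CSV.
func sketchCSVText(input string, maxRows int) (string, error) {
	records, err := csv.NewReader(strings.NewReader(input)).ReadAll()
	if err != nil {
		return "", err
	}
	if len(records) == 0 {
		return "", nil
	}
	header, rows := records[0], records[1:]

	// Collect values from non-numeric columns in first-seen order.
	seen := map[string]bool{}
	var values []string
	for col := range header {
		column := make([]string, 0, len(rows))
		for _, row := range rows {
			column = append(column, row[col])
		}
		if isNumericColumn(column) {
			continue
		}
		for _, v := range column {
			if v != "" && !seen[v] {
				seen[v] = true
				values = append(values, v)
			}
		}
	}

	// Write the leading rows back out as CSV.
	if len(rows) > maxRows {
		rows = rows[:maxRows]
	}
	var buf bytes.Buffer
	if err := csv.NewWriter(&buf).WriteAll(rows); err != nil {
		return "", err
	}

	sections := []string{
		strings.Join(header, " "),
		strings.Join(values, " "),
		strings.TrimRight(buf.String(), "\n"),
	}
	return strings.Join(sections, "\n"), nil
}

func main() {
	out, err := sketchCSVText("name,age\nalice,30\nbob,41\nalice,25\n", 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

In this sketch the `age` column is skipped in the second section because all of its values parse as floats, while `name` contributes `alice` and `bob` once each.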
func NewCSVText ¶ added in v1.2.1
func NewCSVText() *CSVText
NewCSVText creates a CSVText with default settings.
type DocumentText ¶
type DocumentText struct{}
DocumentText extracts plain text from binary document files using tabula.
func NewDocumentText ¶
func NewDocumentText() *DocumentText
NewDocumentText creates a new DocumentText.
type Extractors ¶ added in v1.2.1
type Extractors struct {
// contains filtered or unexported fields
}
Extractors maps file extensions to text extractors.
func NewExtractors ¶ added in v1.2.1
func NewExtractors() *Extractors
NewExtractors creates an Extractors with CSV and plain-text extractors.
func (*Extractors) For ¶ added in v1.2.1
func (e *Extractors) For(ext string) TextExtractor
For returns the text extractor for the given file extension.
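One plausible shape for such a mapping is a small registry keyed by extension. The sketch below is illustrative only: `textExtractor`, `registry`, `newRegistry`, and `lookup` are hypothetical names, the function type stands in for TextExtractor (whose definition is not shown here), and both the fallback to plain text for unknown extensions and the case-insensitive lookup are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// textExtractor is a hypothetical stand-in for TextExtractor.
type textExtractor func(content []byte) string

// registry is a hypothetical stand-in for Extractors.
type registry struct {
	byExt    map[string]textExtractor
	fallback textExtractor
}

// newRegistry registers a CSV extractor and a plain-text fallback.
func newRegistry() *registry {
	plain := func(content []byte) string { return string(content) }
	csvLike := func(content []byte) string {
		// Placeholder for a CSV-aware conversion.
		return "csv:" + string(content)
	}
	return &registry{
		byExt:    map[string]textExtractor{".csv": csvLike},
		fallback: plain,
	}
}

// lookup returns the extractor registered for ext, or the fallback.
// Falling back to plain text is an assumption, not documented behavior.
func (r *registry) lookup(ext string) textExtractor {
	if ex, ok := r.byExt[strings.ToLower(ext)]; ok {
		return ex
	}
	return r.fallback
}

func main() {
	r := newRegistry()
	fmt.Println(r.lookup(".csv")([]byte("a,b")))
	fmt.Println(r.lookup(".go")([]byte("package main")))
}
```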
type SourceText ¶ added in v1.2.1
type SourceText struct{}
SourceText treats file content as plain text. Binary files (containing null bytes in the first 8 KB) produce empty text.
func NewSourceText ¶ added in v1.2.1
func NewSourceText() *SourceText
NewSourceText creates a SourceText.
type TextExtractor ¶ added in v1.2.1
TextExtractor converts raw file bytes into indexable plain text. It returns an empty string when the content should be skipped.