Documentation ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Extensions ¶
func Extensions() []string
Extensions returns the supported document extensions (e.g. ".pdf", ".docx").
func IsDocument ¶
IsDocument reports whether the given extension is a supported document format.
Types ¶
type CSVText ¶ added in v1.2.1
type CSVText struct {
// contains filtered or unexported fields
}
CSVText converts CSV content into an indexable text representation.
The output contains three sections joined by newlines:
- All column header names (if a header row is present).
- Deduplicated string values from every non-numeric column.
- The first few data rows written back as CSV.
A column is considered numeric when every non-empty value in that column can be parsed as a float64. Columns with at least one non-numeric value are treated as string columns.
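The numeric-column rule above can be sketched in a few lines. This is only an illustration of the documented rule, not the package's implementation: numericColumns and classify are hypothetical helpers, and treating the first record as the header row is an assumption made for the example.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// numericColumns reports, for each of the first width columns, whether every
// non-empty value parses as a float64. Hypothetical helper for illustration.
func numericColumns(rows [][]string, width int) []bool {
	numeric := make([]bool, width)
	for i := range numeric {
		numeric[i] = true
	}
	for _, row := range rows {
		for col := 0; col < width && col < len(row); col++ {
			value := row[col]
			if value == "" {
				continue // empty cells do not affect the classification
			}
			if _, err := strconv.ParseFloat(value, 64); err != nil {
				numeric[col] = false // one non-numeric value makes it a string column
			}
		}
	}
	return numeric
}

// classify parses CSV text, treats the first record as the header row
// (an assumption of this sketch), and classifies the remaining rows.
func classify(data string) ([]string, []bool, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, nil, err
	}
	if len(records) == 0 {
		return nil, nil, nil
	}
	header := records[0]
	return header, numericColumns(records[1:], len(header)), nil
}

func main() {
	header, numeric, err := classify("name,price,qty\nwidget,9.99,3\ngadget,,x\n")
	if err != nil {
		panic(err)
	}
	for i, name := range header {
		fmt.Printf("%s numeric=%v\n", name, numeric[i])
	}
}
```

In this input, price stays numeric despite its empty cell, while qty becomes a string column because of the single value "x".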
func NewCSVText ¶ added in v1.2.1
func NewCSVText() *CSVText
NewCSVText creates a CSVText with default settings.
type DocumentText ¶
type DocumentText struct{}
DocumentText extracts plain text from binary document files using tabula.
func NewDocumentText ¶
func NewDocumentText() *DocumentText
NewDocumentText creates a new DocumentText.
type Extractors ¶ added in v1.2.1
type Extractors struct {
// contains filtered or unexported fields
}
Extractors maps file extensions to text extractors.
func NewExtractors ¶ added in v1.2.1
func NewExtractors() *Extractors
NewExtractors creates an Extractors with CSV and plain-text extractors.
func (*Extractors) For ¶ added in v1.2.1
func (e *Extractors) For(ext string) TextExtractor
For returns the text extractor for the given file extension.
type PDFTextRenderer ¶ added in v1.3.1
type PDFTextRenderer struct{}
PDFTextRenderer extracts text from individual PDF pages using tabula.
func NewPDFTextRenderer ¶ added in v1.3.1
func NewPDFTextRenderer() *PDFTextRenderer
NewPDFTextRenderer creates a new PDFTextRenderer.
func (*PDFTextRenderer) Close ¶ added in v1.3.1
func (r *PDFTextRenderer) Close() error
Close is a no-op; PDFTextRenderer holds no persistent resources.
type PPTXTextRenderer ¶ added in v1.3.1
type PPTXTextRenderer struct{}
PPTXTextRenderer extracts text from individual presentation slides using tabula.
func NewPPTXTextRenderer ¶ added in v1.3.1
func NewPPTXTextRenderer() *PPTXTextRenderer
NewPPTXTextRenderer creates a new PPTXTextRenderer.
func (*PPTXTextRenderer) Close ¶ added in v1.3.1
func (r *PPTXTextRenderer) Close() error
Close is a no-op; PPTXTextRenderer holds no persistent resources.
type PageBoundary ¶ added in v1.3.1
type PageBoundary struct {
Page int // 1-based page number
ByteOffset int // byte offset in the concatenated text
}
PageBoundary records where a page's text begins in a concatenated string.
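As a rough sketch of how such boundaries might be produced and consumed, the example below concatenates page texts and maps a byte offset back to its page. The PageBoundary type is redeclared locally so the snippet compiles on its own; joinPages, pageAt, and the newline separator between pages are assumptions for illustration and are not part of the documented API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// PageBoundary mirrors the documented struct so this sketch is self-contained.
type PageBoundary struct {
	Page       int // 1-based page number
	ByteOffset int // byte offset in the concatenated text
}

// joinPages concatenates page texts with a newline between pages (an assumed
// separator) and records the offset at which each page's text begins.
func joinPages(pages []string) (string, []PageBoundary) {
	var sb strings.Builder
	boundaries := make([]PageBoundary, 0, len(pages))
	for i, text := range pages {
		if i > 0 {
			sb.WriteByte('\n')
		}
		boundaries = append(boundaries, PageBoundary{Page: i + 1, ByteOffset: sb.Len()})
		sb.WriteString(text)
	}
	return sb.String(), boundaries
}

// pageAt returns the page whose text contains the given byte offset, or 0 if
// the offset precedes the first boundary. Boundaries must be sorted by offset.
func pageAt(boundaries []PageBoundary, offset int) int {
	// Find the first boundary that starts after offset; the one before it wins.
	i := sort.Search(len(boundaries), func(i int) bool {
		return boundaries[i].ByteOffset > offset
	})
	if i == 0 {
		return 0
	}
	return boundaries[i-1].Page
}

func main() {
	text, boundaries := joinPages([]string{"alpha", "beta", "gamma"})
	hit := strings.Index(text, "gamma")
	fmt.Printf("match at byte %d is on page %d\n", hit, pageAt(boundaries, hit))
}
```

Because the boundaries are sorted by offset, the lookup is a binary search rather than a scan, which matters when a document has many pages.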
type SinglePageTextRenderer ¶ added in v1.3.1
type SinglePageTextRenderer struct{}
SinglePageTextRenderer extracts text from document formats that are treated as a single page (DOCX, ODT, EPUB). PageCount always returns 1 and only page 1 can be rendered.
func NewSinglePageTextRenderer ¶ added in v1.3.1
func NewSinglePageTextRenderer() *SinglePageTextRenderer
NewSinglePageTextRenderer creates a new SinglePageTextRenderer.
func (*SinglePageTextRenderer) Close ¶ added in v1.3.1
func (r *SinglePageTextRenderer) Close() error
Close is a no-op; SinglePageTextRenderer holds no persistent resources.
type SourceText ¶ added in v1.2.1
type SourceText struct{}
SourceText treats file content as plain text. Content with a null byte in its first 8 KB is considered binary and produces empty text.
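The null-byte heuristic described above can be sketched as follows. looksBinary and plainText are hypothetical names used only for this example; the actual method set of SourceText is not shown in this documentation.

```go
package main

import (
	"bytes"
	"fmt"
)

// sniffLen is the number of leading bytes inspected (8 KB, per the description).
const sniffLen = 8 * 1024

// looksBinary reports whether a null byte occurs in the first sniffLen bytes.
func looksBinary(content []byte) bool {
	head := content
	if len(head) > sniffLen {
		head = head[:sniffLen]
	}
	return bytes.IndexByte(head, 0) >= 0
}

// plainText returns the content as a string, or "" when it looks binary.
func plainText(content []byte) string {
	if looksBinary(content) {
		return ""
	}
	return string(content)
}

func main() {
	fmt.Printf("%q\n", plainText([]byte("package main\n")))
	fmt.Printf("%q\n", plainText([]byte{0x89, 'P', 'N', 'G', 0x00}))
}
```

Under this heuristic, a null byte that first appears beyond the 8 KB window does not mark the content as binary.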
func NewSourceText ¶ added in v1.2.1
func NewSourceText() *SourceText
NewSourceText creates a SourceText.
type TextExtractor ¶ added in v1.2.1
TextExtractor converts raw file bytes into indexable plain text. It returns an empty string when the content should be skipped.
type TextRenderer ¶ added in v1.3.1
type TextRenderer interface {
io.Closer
// PageCount returns the number of extractable pages in the document.
PageCount(path string) (int, error)
// Render returns the text content of the given 1-based page.
Render(path string, page int) (string, error)
}
TextRenderer extracts text from individual document pages. For PDFs this means pages; for spreadsheets, sheets; for presentations, slides.
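A minimal sketch of the interface contract, assuming a caller that wants every page in order. The interface is redeclared locally so the snippet compiles on its own; memoryRenderer and renderAll are hypothetical and exist only to show the 1-based page loop.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// TextRenderer mirrors the documented interface so this sketch is self-contained.
type TextRenderer interface {
	io.Closer
	PageCount(path string) (int, error)
	Render(path string, page int) (string, error)
}

// memoryRenderer is a hypothetical renderer backed by an in-memory map of
// path to page texts. It stands in for a real format-specific renderer.
type memoryRenderer struct {
	docs map[string][]string
}

func (m *memoryRenderer) PageCount(path string) (int, error) {
	pages, ok := m.docs[path]
	if !ok {
		return 0, fmt.Errorf("unknown document %q", path)
	}
	return len(pages), nil
}

func (m *memoryRenderer) Render(path string, page int) (string, error) {
	pages, ok := m.docs[path]
	if !ok {
		return "", fmt.Errorf("unknown document %q", path)
	}
	if page < 1 || page > len(pages) {
		return "", fmt.Errorf("page %d out of range 1..%d", page, len(pages))
	}
	return pages[page-1], nil // pages are 1-based in the interface
}

func (m *memoryRenderer) Close() error { return nil }

// renderAll renders pages 1 through PageCount in order.
func renderAll(r TextRenderer, path string) ([]string, error) {
	n, err := r.PageCount(path)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, n)
	for page := 1; page <= n; page++ {
		text, err := r.Render(path, page)
		if err != nil {
			return nil, fmt.Errorf("page %d: %w", page, err)
		}
		out = append(out, text)
	}
	return out, nil
}

func main() {
	r := &memoryRenderer{docs: map[string][]string{"deck.pptx": {"Title", "Agenda"}}}
	defer r.Close()
	pages, err := renderAll(r, "deck.pptx")
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(pages, " | "))
}
```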
type TextRendererRegistry ¶ added in v1.3.1
type TextRendererRegistry struct {
// contains filtered or unexported fields
}
TextRendererRegistry maps file extensions to TextRenderer implementations.
func NewTextRendererRegistry ¶ added in v1.3.1
func NewTextRendererRegistry() *TextRendererRegistry
NewTextRendererRegistry creates an empty TextRendererRegistry.
func (*TextRendererRegistry) Close ¶ added in v1.3.1
func (r *TextRendererRegistry) Close() error
Close closes all registered text renderers, deduplicating shared instances.
func (*TextRendererRegistry) For ¶ added in v1.3.1
func (r *TextRendererRegistry) For(ext string) (TextRenderer, bool)
For returns the TextRenderer for the given extension, or nil and false if none is registered.
func (*TextRendererRegistry) Register ¶ added in v1.3.1
func (r *TextRendererRegistry) Register(ext string, renderer TextRenderer)
Register associates a file extension (e.g. ".pdf") with a TextRenderer.
func (*TextRendererRegistry) Supports ¶ added in v1.3.1
func (r *TextRendererRegistry) Supports(ext string) bool
Supports reports whether a TextRenderer is registered for the given extension.
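The registry behaviour described above, in particular closing a shared renderer only once, could look roughly like the sketch below. It is an independent illustration, not the package's code: registry and countingRenderer are hypothetical types, extensions are stored exactly as given, and deduplication relies on renderers being comparable values such as pointers.

```go
package main

import (
	"fmt"
	"io"
)

// TextRenderer mirrors the documented interface so this sketch is self-contained.
type TextRenderer interface {
	io.Closer
	PageCount(path string) (int, error)
	Render(path string, page int) (string, error)
}

// registry is a hypothetical stand-in for TextRendererRegistry.
type registry struct {
	renderers map[string]TextRenderer
}

func newRegistry() *registry {
	return &registry{renderers: make(map[string]TextRenderer)}
}

// Register associates an extension with a renderer, replacing any earlier entry.
func (r *registry) Register(ext string, tr TextRenderer) { r.renderers[ext] = tr }

// For returns the renderer for ext, or nil and false if none is registered.
func (r *registry) For(ext string) (TextRenderer, bool) {
	tr, ok := r.renderers[ext]
	return tr, ok
}

// Supports reports whether a renderer is registered for ext.
func (r *registry) Supports(ext string) bool {
	_, ok := r.renderers[ext]
	return ok
}

// Close closes each distinct renderer once and returns the first error seen.
func (r *registry) Close() error {
	seen := make(map[TextRenderer]struct{})
	var first error
	for _, tr := range r.renderers {
		if _, done := seen[tr]; done {
			continue // same instance registered under another extension
		}
		seen[tr] = struct{}{}
		if err := tr.Close(); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// countingRenderer is a hypothetical renderer that counts Close calls.
type countingRenderer struct{ closed int }

func (c *countingRenderer) PageCount(string) (int, error)      { return 1, nil }
func (c *countingRenderer) Render(string, int) (string, error) { return "", nil }
func (c *countingRenderer) Close() error                       { c.closed++; return nil }

func main() {
	shared := &countingRenderer{}
	reg := newRegistry()
	reg.Register(".docx", shared)
	reg.Register(".odt", shared)

	fmt.Println("supports .odt:", reg.Supports(".odt"))
	fmt.Println("supports .pdf:", reg.Supports(".pdf"))
	if err := reg.Close(); err != nil {
		panic(err)
	}
	fmt.Println("close calls on shared renderer:", shared.closed)
}
```

Registering one instance under several extensions is natural for renderers that handle a family of formats, which is why Close needs to deduplicate.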
type XLSXTextRenderer ¶ added in v1.3.1
type XLSXTextRenderer struct{}
XLSXTextRenderer extracts text from individual spreadsheet sheets using tabula.
func NewXLSXTextRenderer ¶ added in v1.3.1
func NewXLSXTextRenderer() *XLSXTextRenderer
NewXLSXTextRenderer creates a new XLSXTextRenderer.
func (*XLSXTextRenderer) Close ¶ added in v1.3.1
func (r *XLSXTextRenderer) Close() error
Close is a no-op; XLSXTextRenderer holds no persistent resources.