Documentation ¶
Overview ¶
Package textextract provides full-text extraction using Apache Tika. It extracts text from PDF, Office, and other document formats.
Index ¶
- Variables
- func DetectDocumentType(contentType string) string
- func GetSupportedMimeTypes() []string
- type Client
- func (c *Client) ExtractText(ctx context.Context, data []byte, contentType string) (*Result, error)
- func (c *Client) ExtractTextFromFile(ctx context.Context, filePath string) (*Result, error)
- func (c *Client) IsAvailable(ctx context.Context) bool
- func (c *Client) IsSupported(contentType string) bool
- type Config
- type Result
Constants ¶
This section is empty.
Variables ¶
var SupportedMimeTypes = []string{
	"application/pdf",
	"application/msword",
	"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	"application/vnd.ms-excel",
	"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	"application/vnd.ms-powerpoint",
	"application/vnd.openxmlformats-officedocument.presentationml.presentation",
	"application/rtf",
	"text/plain",
	"text/rtf",
}
SupportedMimeTypes lists the MIME types supported for text extraction.
Functions ¶
func DetectDocumentType ¶
func DetectDocumentType(contentType string) string
DetectDocumentType returns the document type for the given content type.
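The mapping from content type to document type is not spelled out here. As a minimal sketch, assuming the function returns a coarse label per format family, it could look like the following. The helper name `documentKind` and the labels (`"pdf"`, `"word"`, and so on) are hypothetical and may differ from what the package actually returns.

```go
package main

import (
	"fmt"
	"strings"
)

// documentKind maps a content type to a coarse label. The labels are
// hypothetical; they are not taken from the package itself.
func documentKind(contentType string) string {
	// Drop parameters such as "; charset=utf-8" and normalize case.
	base, _, _ := strings.Cut(contentType, ";")
	base = strings.ToLower(strings.TrimSpace(base))

	switch base {
	case "application/pdf":
		return "pdf"
	case "application/msword",
		"application/vnd.openxmlformats-officedocument.wordprocessingml.document":
		return "word"
	case "application/vnd.ms-excel",
		"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":
		return "excel"
	case "application/vnd.ms-powerpoint",
		"application/vnd.openxmlformats-officedocument.presentationml.presentation":
		return "powerpoint"
	case "application/rtf", "text/rtf":
		return "rtf"
	case "text/plain":
		return "text"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(documentKind("application/pdf"))
	fmt.Println(documentKind("Text/Plain; charset=utf-8"))
	fmt.Println(documentKind("image/png"))
}
```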
func GetSupportedMimeTypes ¶
func GetSupportedMimeTypes() []string
GetSupportedMimeTypes returns the list of supported MIME types.
Types ¶
type Client ¶
type Client struct {
	// contains filtered or unexported fields
}
Client provides text extraction functionality.
func (*Client) ExtractText ¶
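func (c *Client) ExtractText(ctx context.Context, data []byte, contentType string) (*Result, error)
ExtractText extracts text from a document supplied as raw bytes together with its content type.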
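How the client talks to Tika is not documented here. As a rough sketch of server mode, assuming the client uses Tika Server's `PUT /tika` endpoint with `Accept: text/plain`, the request could be built as below. The function `extractViaServer` and the `demo` helper are hypothetical, and `demo` uses a local stand-in server so the example runs without a real Tika instance.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// extractViaServer PUTs data to <baseURL>/tika and returns the plain-text
// response body. It is a hypothetical helper, not the package's code.
func extractViaServer(ctx context.Context, hc *http.Client, baseURL string, data []byte, contentType string) (string, error) {
	url := strings.TrimRight(baseURL, "/") + "/tika"
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	if contentType != "" {
		req.Header.Set("Content-Type", contentType)
	}
	req.Header.Set("Accept", "text/plain")

	resp, err := hc.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika returned %s", resp.Status)
	}
	return strings.TrimSpace(string(body)), nil
}

// demo runs extractViaServer against a fake server that only reports how
// many bytes it received.
func demo() (string, error) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut || r.URL.Path != "/tika" {
			http.NotFound(w, r)
			return
		}
		body, _ := io.ReadAll(r.Body)
		fmt.Fprintf(w, "extracted %d bytes\n", len(body))
	}))
	defer fake.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return extractViaServer(ctx, fake.Client(), fake.URL, []byte("%PDF-1.4"), "application/pdf")
}

func main() {
	text, err := demo()
	fmt.Println(text, err)
}
```

Passing the context into the request lets the caller's deadline or cancellation abort a slow extraction, which is presumably why ExtractText accepts a `context.Context`.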
func (*Client) ExtractTextFromFile ¶
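func (c *Client) ExtractTextFromFile(ctx context.Context, filePath string) (*Result, error)
ExtractTextFromFile extracts text from the file at filePath.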
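Since this method takes only a path, the content type has to come from somewhere. One plausible approach, shown here purely as an assumption, is to read the file and sniff its leading bytes with `http.DetectContentType` from the standard library. The helpers `readAndSniff` and `demo` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"path/filepath"
)

// readAndSniff reads the whole file and guesses its content type from the
// leading bytes. ZIP-based Office formats (docx, xlsx, pptx) sniff as
// "application/zip", so a real implementation would need extra handling.
func readAndSniff(path string) ([]byte, string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, "", fmt.Errorf("read %s: %w", path, err)
	}
	return data, http.DetectContentType(data), nil
}

// demo writes a tiny file with a PDF header and returns the sniffed type.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "textextract-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "sample.pdf")
	if err := os.WriteFile(path, []byte("%PDF-1.4\n"), 0o600); err != nil {
		return "", err
	}
	_, contentType, err := readAndSniff(path)
	return contentType, err
}

func main() {
	contentType, err := demo()
	fmt.Println(contentType, err)
}
```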
func (*Client) IsAvailable ¶
func (c *Client) IsAvailable(ctx context.Context) bool
IsAvailable reports whether Tika is available.
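What "available" means is not specified. For server mode, a minimal sketch might probe Tika Server's `GET /version` endpoint and treat any 200 response as healthy. The helper `tikaAvailable`, the choice of endpoint, and the `demo` function are assumptions; `demo` probes a local stand-in server before and after shutting it down.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// tikaAvailable reports whether GET <baseURL>/version answers with 200 OK.
// Any transport error or other status counts as unavailable.
func tikaAvailable(ctx context.Context, hc *http.Client, baseURL string) bool {
	url := strings.TrimRight(baseURL, "/") + "/version"
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := hc.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// demo checks a fake server while it is running and again after it stops.
func demo() (up, down bool) {
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Apache Tika (stand-in)")
	}))

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	up = tikaAvailable(ctx, fake.Client(), fake.URL)
	fake.Close()
	down = tikaAvailable(ctx, fake.Client(), fake.URL)
	return up, down
}

func main() {
	up, down := demo()
	fmt.Println("while running:", up, "after shutdown:", down)
}
```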
func (*Client) IsSupported ¶
func (c *Client) IsSupported(contentType string) bool
IsSupported reports whether the given MIME type is supported.
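Content types in the wild often carry parameters (`text/plain; charset=utf-8`) or mixed case. The sketch below shows one way such a check could normalize its input with `mime.ParseMediaType` before comparing against the documented list. Whether the package normalizes at all is an assumption, and `isSupported` and `supported` are hypothetical names.

```go
package main

import (
	"fmt"
	"mime"
)

// supported copies the documented SupportedMimeTypes list.
var supported = []string{
	"application/pdf",
	"application/msword",
	"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	"application/vnd.ms-excel",
	"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	"application/vnd.ms-powerpoint",
	"application/vnd.openxmlformats-officedocument.presentationml.presentation",
	"application/rtf",
	"text/plain",
	"text/rtf",
}

// isSupported strips parameters, lowercases the media type, and looks it up
// in the list. Unparseable input is treated as unsupported.
func isSupported(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	for _, s := range supported {
		if s == mediaType {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSupported("application/pdf"))
	fmt.Println(isSupported("TEXT/PLAIN; charset=utf-8"))
	fmt.Println(isSupported("image/png"))
}
```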
type Config ¶
type Config struct {
	// TikaServerURL is the URL of the Tika server (e.g., http://localhost:9998)
	TikaServerURL string
	// TikaJarPath is the path to tika-app.jar (for embedded mode)
	TikaJarPath string
	// JavaPath is the path to the java executable
	JavaPath string
	// Timeout is the HTTP timeout for Tika server requests
	Timeout time.Duration
	// UseEmbedded determines whether to use embedded Tika (java -jar tika-app.jar)
	UseEmbedded bool
}
Config holds the text extraction configuration.
func ConfigFromEnv ¶
func ConfigFromEnv() *Config
ConfigFromEnv builds a Config from environment variables.
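The environment variable names and default values are not listed here. The sketch below illustrates the usual pattern of starting from defaults and overriding each field only when its variable is set and parses cleanly. Every variable name (`TIKA_SERVER_URL`, `TIKA_TIMEOUT`, and so on) and every default value is an assumption, and `Config` is a local mirror of the documented fields.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config mirrors the documented fields so the sketch is self-contained.
type Config struct {
	TikaServerURL string
	TikaJarPath   string
	JavaPath      string
	Timeout       time.Duration
	UseEmbedded   bool
}

// defaultConfig returns assumed defaults; the real values may differ.
func defaultConfig() *Config {
	return &Config{
		TikaServerURL: "http://localhost:9998",
		JavaPath:      "java",
		Timeout:       30 * time.Second,
	}
}

// configFromEnv overrides defaults from hypothetical environment variables.
// Values that fail to parse are ignored and the default is kept.
func configFromEnv() *Config {
	cfg := defaultConfig()
	if v := os.Getenv("TIKA_SERVER_URL"); v != "" {
		cfg.TikaServerURL = v
	}
	if v := os.Getenv("TIKA_JAR_PATH"); v != "" {
		cfg.TikaJarPath = v
	}
	if v := os.Getenv("JAVA_PATH"); v != "" {
		cfg.JavaPath = v
	}
	if v := os.Getenv("TIKA_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			cfg.Timeout = d
		}
	}
	if v := os.Getenv("TIKA_USE_EMBEDDED"); v != "" {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.UseEmbedded = b
		}
	}
	return cfg
}

// demo sets a few variables, builds the config, and cleans up afterwards.
func demo() *Config {
	os.Setenv("TIKA_SERVER_URL", "http://tika.internal:9998")
	os.Setenv("TIKA_TIMEOUT", "45s")
	os.Setenv("TIKA_USE_EMBEDDED", "not-a-bool")
	defer os.Unsetenv("TIKA_SERVER_URL")
	defer os.Unsetenv("TIKA_TIMEOUT")
	defer os.Unsetenv("TIKA_USE_EMBEDDED")
	return configFromEnv()
}

func main() {
	fmt.Printf("%+v\n", *demo())
}
```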
func DefaultConfig ¶
func DefaultConfig() *Config
DefaultConfig returns the default text extraction configuration.
type Result ¶
type Result struct {
	Text        string            `json:"text"`
	Metadata    map[string]string `json:"metadata,omitempty"`
	ContentType string            `json:"content_type"`
	Author      string            `json:"author,omitempty"`
	Title       string            `json:"title,omitempty"`
	Created     string            `json:"created,omitempty"`
	Modified    string            `json:"modified,omitempty"`
	PageCount   int               `json:"page_count,omitempty"`
	WordCount   int               `json:"word_count,omitempty"`
	CharCount   int               `json:"char_count,omitempty"`
}
Result holds the extracted text together with document metadata.
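The documentation does not say how WordCount and CharCount are derived. A minimal sketch, assuming words are whitespace-separated fields and characters are Unicode code points rather than bytes, is shown below. The helper `countStats` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// countStats returns the number of whitespace-separated words and the number
// of runes (code points) in text. Counting runes instead of bytes is an
// assumption about what CharCount means.
func countStats(text string) (words, chars int) {
	return len(strings.Fields(text)), utf8.RuneCountInString(text)
}

func main() {
	words, chars := countStats("naïve café")
	fmt.Println(words, chars, len("naïve café")) // words, runes, bytes
}
```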
func (*Result) GetSummary ¶
GetSummary returns a summary of the extracted text.
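The method's signature and summary format are not shown here. One common way to produce such a summary, offered only as an assumption, is to collapse whitespace and truncate on a rune boundary so multi-byte characters are never split. The function `summarize` and its `maxRunes` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// summarize collapses runs of whitespace to single spaces and keeps at most
// maxRunes runes, appending "..." when the text was cut.
func summarize(text string, maxRunes int) string {
	text = strings.Join(strings.Fields(text), " ")
	runes := []rune(text)
	if maxRunes < 0 {
		maxRunes = 0
	}
	if len(runes) <= maxRunes {
		return text
	}
	return string(runes[:maxRunes]) + "..."
}

func main() {
	fmt.Println(summarize("  Apache   Tika\nextracts text ", 11))
	fmt.Println(summarize("short", 10))
}
```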
func (*Result) MarshalJSON ¶
MarshalJSON implements custom JSON marshaling.
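What the custom marshaling changes is not documented here. The sketch below shows the standard alias-type pattern for writing a `MarshalJSON` method that adds a field without recursing into itself. The `result` type is a trimmed-down stand-in for Result, and the derived `text_length` field is an invented example, not something the package is known to emit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result is a hypothetical, reduced stand-in for the documented Result.
type result struct {
	Text        string            `json:"text"`
	Metadata    map[string]string `json:"metadata,omitempty"`
	ContentType string            `json:"content_type"`
}

// MarshalJSON emits the regular fields plus a derived "text_length" field.
func (r *result) MarshalJSON() ([]byte, error) {
	// alias has result's fields but none of its methods, so json.Marshal
	// uses the default struct encoding and does not call MarshalJSON again.
	type alias result
	return json.Marshal(struct {
		alias
		TextLength int `json:"text_length"`
	}{
		alias:      alias(*r),
		TextLength: len(r.Text),
	})
}

// demo marshals a value and decodes it again to inspect the output fields.
func demo() (string, float64, error) {
	raw, err := json.Marshal(&result{Text: "hello", ContentType: "text/plain"})
	if err != nil {
		return "", 0, err
	}
	var decoded map[string]any
	if err := json.Unmarshal(raw, &decoded); err != nil {
		return "", 0, err
	}
	text, _ := decoded["text"].(string)
	length, _ := decoded["text_length"].(float64)
	return text, length, nil
}

func main() {
	text, length, err := demo()
	fmt.Println(text, length, err)
}
```

Because the method has a pointer receiver, it is only invoked when a pointer (or an addressable value) is passed to `json.Marshal`.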