textextract

package
v0.101.0 Latest
Warning: This package is not in the latest version of its module.
Published: Feb 20, 2026 | License: MIT | Imports: 14 | Imported by: 0

Documentation

Overview

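Package textextract provides full-text extraction backed by Apache Tika. It extracts text from PDF, Microsoft Office, RTF, and plain-text documents (see SupportedMimeTypes), either through a Tika server over HTTP or through an embedded tika-app.jar run with java (see Config).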

Constants

This section is empty.

Variables

var SupportedMimeTypes = []string{
	"application/pdf",
	"application/msword",
	"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	"application/vnd.ms-excel",
	"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	"application/vnd.ms-powerpoint",
	"application/vnd.openxmlformats-officedocument.presentationml.presentation",
	"application/rtf",
	"text/plain",
	"text/rtf",
}

SupportedMimeTypes lists the MIME types accepted for text extraction.

Functions

func DetectDocumentType

func DetectDocumentType(contentType string) string

DetectDocumentType returns the document type that corresponds to the given content type (MIME type).
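The values that DetectDocumentType returns are not listed on this page. As a rough illustration of the idea only, the sketch below maps the MIME types from SupportedMimeTypes to coarse labels. The function name classify and every label ("pdf", "word", and so on) are invented for this example and should not be read as the package's actual behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// classify is a hypothetical stand-in for DetectDocumentType.
// The labels it returns are assumptions made for this sketch.
func classify(contentType string) string {
	ct := strings.ToLower(strings.TrimSpace(contentType))
	switch {
	case ct == "application/pdf":
		return "pdf"
	case ct == "application/msword", strings.Contains(ct, "wordprocessingml"):
		return "word"
	case ct == "application/vnd.ms-excel", strings.Contains(ct, "spreadsheetml"):
		return "spreadsheet"
	case ct == "application/vnd.ms-powerpoint", strings.Contains(ct, "presentationml"):
		return "presentation"
	case ct == "application/rtf", ct == "text/rtf":
		return "rtf"
	case strings.HasPrefix(ct, "text/"):
		return "text"
	default:
		return "unknown"
	}
}

func main() {
	for _, ct := range []string{"application/pdf", "text/rtf", "text/plain", "image/png"} {
		fmt.Printf("%s -> %s\n", ct, classify(ct))
	}
}
```

The RTF case is checked before the generic "text/" prefix so that "text/rtf" is not classified as plain text.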

func GetSupportedMimeTypes

func GetSupportedMimeTypes() []string

GetSupportedMimeTypes returns the list of supported MIME types.

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client provides text extraction functionality.

func NewClient

func NewClient(config *Config) *Client

NewClient creates a new text extraction client.

func (*Client) ExtractText

func (c *Client) ExtractText(ctx context.Context, data []byte, contentType string) (*Result, error)

ExtractText extracts text from the document bytes in data, which are interpreted according to contentType.
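How the client talks to Tika internally is not shown on this page. For orientation, a Tika server accepts a document via PUT on its /tika endpoint and returns plain text when asked with an Accept: text/plain header. The sketch below performs such a request with net/http. The helper names extractViaServer, newStubServer, and runDemo are hypothetical, and the stub server only stands in for a real Tika instance so that the example runs without one.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// extractViaServer uploads data to a Tika-server-style /tika endpoint
// and returns the plain-text response body. It is a sketch, not the
// package's implementation.
func extractViaServer(ctx context.Context, baseURL string, data []byte, contentType string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPut, baseURL+"/tika", bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", contentType)
	req.Header.Set("Accept", "text/plain")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("tika returned status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newStubServer returns a local stand-in for Tika that reports the
// size and content type of the upload instead of extracting text.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut || r.URL.Path != "/tika" {
			http.NotFound(w, r)
			return
		}
		n, _ := io.Copy(io.Discard, r.Body)
		fmt.Fprintf(w, "extracted %d bytes of %s", n, r.Header.Get("Content-Type"))
	}))
}

// runDemo wires the stub server and the request helper together.
func runDemo() (string, error) {
	srv := newStubServer()
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return extractViaServer(ctx, srv.URL, []byte("%PDF-1.7"), "application/pdf")
}

func main() {
	text, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

Passing the context into the request means a caller's deadline or cancellation also aborts the HTTP call, which is presumably why ExtractText takes a ctx argument.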

func (*Client) ExtractTextFromFile

func (c *Client) ExtractTextFromFile(ctx context.Context, filePath string) (*Result, error)

ExtractTextFromFile extracts text from the file at filePath.

func (*Client) IsAvailable

func (c *Client) IsAvailable(ctx context.Context) bool

IsAvailable reports whether Tika is available.

func (*Client) IsSupported

func (c *Client) IsSupported(contentType string) bool

IsSupported reports whether contentType is a supported MIME type.
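A membership check against SupportedMimeTypes could look like the sketch below. It is an illustration rather than the package's code: isSupported is a hypothetical helper, and stripping parameters such as "; charset=utf-8" with mime.ParseMediaType is an assumption about how a caller might want to normalise input, not documented behaviour of Client.IsSupported.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// supportedMimeTypes is a local copy of the documented SupportedMimeTypes list.
var supportedMimeTypes = []string{
	"application/pdf",
	"application/msword",
	"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	"application/vnd.ms-excel",
	"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	"application/vnd.ms-powerpoint",
	"application/vnd.openxmlformats-officedocument.presentationml.presentation",
	"application/rtf",
	"text/plain",
	"text/rtf",
}

// isSupported is a hypothetical stand-in for Client.IsSupported.
// It drops media type parameters before comparing against the list.
func isSupported(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		// Fall back to a trimmed, lower-cased comparison on malformed input.
		mediaType = strings.ToLower(strings.TrimSpace(contentType))
	}
	for _, t := range supportedMimeTypes {
		if t == mediaType {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSupported("application/pdf"))
	fmt.Println(isSupported("text/plain; charset=utf-8"))
	fmt.Println(isSupported("image/png"))
}
```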

type Config

type Config struct {
	// TikaServerURL is the URL of the Tika server (e.g., http://localhost:9998)
	TikaServerURL string
	// TikaJarPath is the path to tika-app.jar (for embedded mode)
	TikaJarPath string
	// JavaPath is the path to the java executable
	JavaPath string
	// Timeout is the HTTP timeout for Tika server requests
	Timeout time.Duration
	// UseEmbedded determines whether to use embedded Tika (java -jar tika-app.jar)
	UseEmbedded bool
}

Config holds the text extraction configuration.
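The field comments suggest two operating modes: server mode, which uses TikaServerURL and Timeout, and embedded mode, which uses TikaJarPath and JavaPath when UseEmbedded is set. The sketch below mirrors the struct locally and adds a validate method to make that split explicit. The lower-case config type, the validate method, and the rules it enforces are assumptions for illustration; the package itself may apply defaults instead of rejecting empty fields.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// config is a local mirror of the documented Config fields.
type config struct {
	TikaServerURL string
	TikaJarPath   string
	JavaPath      string
	Timeout       time.Duration
	UseEmbedded   bool
}

// validate is a hypothetical check that is not part of the package.
func (c config) validate() error {
	if c.UseEmbedded {
		// Embedded mode runs "java -jar tika-app.jar", so the jar must be known.
		if c.TikaJarPath == "" {
			return errors.New("embedded mode needs TikaJarPath")
		}
		return nil
	}
	// Server mode sends HTTP requests to the Tika server.
	if c.TikaServerURL == "" {
		return errors.New("server mode needs TikaServerURL")
	}
	if c.Timeout <= 0 {
		return errors.New("server mode needs a positive Timeout")
	}
	return nil
}

func main() {
	server := config{TikaServerURL: "http://localhost:9998", Timeout: 30 * time.Second}
	embedded := config{UseEmbedded: true, TikaJarPath: "/opt/tika/tika-app.jar", JavaPath: "java"}
	incomplete := config{UseEmbedded: true}

	fmt.Println("server:", server.validate())
	fmt.Println("embedded:", embedded.validate())
	fmt.Println("incomplete:", incomplete.validate())
}
```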

func ConfigFromEnv

func ConfigFromEnv() *Config

ConfigFromEnv creates an extraction Config from environment variables.
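This page does not name the environment variables that ConfigFromEnv reads. The sketch below shows the general pattern with deliberately made-up variable names (EXAMPLE_TIKA_URL, EXAMPLE_TIKA_TIMEOUT, EXAMPLE_TIKA_EMBEDDED) and assumed defaults, so it should not be used as a reference for the real ones. The lookup function is injected so that the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// envConfig holds the subset of settings used in this sketch.
type envConfig struct {
	TikaServerURL string
	Timeout       time.Duration
	UseEmbedded   bool
}

// configFromLookup builds a config from a lookup function such as os.LookupEnv.
// The variable names and the default values are assumptions for illustration.
func configFromLookup(lookup func(string) (string, bool)) envConfig {
	cfg := envConfig{TikaServerURL: "http://localhost:9998", Timeout: 30 * time.Second}
	if v, ok := lookup("EXAMPLE_TIKA_URL"); ok && v != "" {
		cfg.TikaServerURL = v
	}
	if v, ok := lookup("EXAMPLE_TIKA_TIMEOUT"); ok {
		// Accept values such as "5s" or "2m"; ignore anything unparsable.
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			cfg.Timeout = d
		}
	}
	if v, ok := lookup("EXAMPLE_TIKA_EMBEDDED"); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.UseEmbedded = b
		}
	}
	return cfg
}

func main() {
	cfg := configFromLookup(os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```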

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns the default text extraction configuration.

type Result

type Result struct {
	Text        string            `json:"text"`
	Metadata    map[string]string `json:"metadata,omitempty"`
	ContentType string            `json:"content_type"`
	Author      string            `json:"author,omitempty"`
	Title       string            `json:"title,omitempty"`
	Created     string            `json:"created,omitempty"`
	Modified    string            `json:"modified,omitempty"`
	PageCount   int               `json:"page_count,omitempty"`
	WordCount   int               `json:"word_count,omitempty"`
	CharCount   int               `json:"char_count,omitempty"`
}

Result represents the extraction result with metadata.

func Merge

func Merge(results []*Result) *Result

Merge combines multiple extraction results into a single Result.
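The merge rules are not spelled out here. One plausible reading, sketched below under that assumption, is to concatenate the texts, sum the page counts, and recompute word and character counts from the combined text. The local result type carries only the fields the sketch needs, and merge is a hypothetical helper rather than the package's Merge.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// result mirrors a few of the documented Result fields.
type result struct {
	Text      string
	PageCount int
	WordCount int
	CharCount int
}

// merge is a hypothetical stand-in for Merge. It skips nil entries,
// joins non-empty texts with a blank line, and sums page counts.
func merge(results []*result) *result {
	merged := &result{}
	var texts []string
	for _, r := range results {
		if r == nil {
			continue
		}
		if r.Text != "" {
			texts = append(texts, r.Text)
		}
		merged.PageCount += r.PageCount
	}
	merged.Text = strings.Join(texts, "\n\n")
	// Recompute the counts so they describe the combined text.
	merged.WordCount = len(strings.Fields(merged.Text))
	merged.CharCount = utf8.RuneCountInString(merged.Text)
	return merged
}

func main() {
	m := merge([]*result{
		{Text: "Hello world", PageCount: 1},
		nil,
		{Text: "Second part", PageCount: 2},
	})
	fmt.Printf("pages=%d words=%d chars=%d\n", m.PageCount, m.WordCount, m.CharCount)
}
```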

func (*Result) GetSummary

func (r *Result) GetSummary(maxLength int) string

GetSummary returns a summary of the extracted text; maxLength bounds the length of the summary.
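How the summary is built, and whether maxLength counts bytes, characters, or words, is not stated. The sketch below shows one simple approach under the assumption that maxLength counts characters: collapse whitespace, cut at a rune boundary, and append an ellipsis when text was dropped. The summarize function is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// summarize is a hypothetical stand-in for Result.GetSummary.
// It treats maxLength as a number of runes, which is an assumption.
func summarize(text string, maxLength int) string {
	if maxLength <= 0 {
		return ""
	}
	// Collapse runs of whitespace, including newlines, into single spaces.
	text = strings.Join(strings.Fields(text), " ")
	runes := []rune(text)
	if len(runes) <= maxLength {
		return text
	}
	// Cut on a rune boundary so multi-byte characters stay intact.
	return string(runes[:maxLength]) + "..."
}

func main() {
	fmt.Println(summarize("Hello,   world", 5))
	fmt.Println(summarize("héllo wörld", 2))
	fmt.Println(summarize("short", 80))
}
```

Converting to []rune before slicing avoids splitting a multi-byte UTF-8 sequence, which matters for extracted text that is not pure ASCII.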

func (*Result) MarshalJSON

func (r *Result) MarshalJSON() ([]byte, error)

MarshalJSON implements json.Marshaler, providing custom JSON marshaling for Result.
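What the custom marshaler changes is not documented on this page. As a baseline for comparison, the sketch below shows the JSON that the struct tags alone would produce with the standard encoding/json rules, using a local copy of the struct without any custom method. Output from the real Result.MarshalJSON may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result is a local copy of the documented Result struct, without
// the package's custom MarshalJSON method.
type result struct {
	Text        string            `json:"text"`
	Metadata    map[string]string `json:"metadata,omitempty"`
	ContentType string            `json:"content_type"`
	Author      string            `json:"author,omitempty"`
	Title       string            `json:"title,omitempty"`
	Created     string            `json:"created,omitempty"`
	Modified    string            `json:"modified,omitempty"`
	PageCount   int               `json:"page_count,omitempty"`
	WordCount   int               `json:"word_count,omitempty"`
	CharCount   int               `json:"char_count,omitempty"`
}

// encode marshals r using the default encoding/json behaviour.
func encode(r result) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	out, err := encode(result{Text: "hello", ContentType: "text/plain", WordCount: 1, CharCount: 5})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

With the default rules, fields tagged omitempty disappear when they hold a zero value, while text and content_type are always present.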
