docprocessing

package
v0.16.11
Published: Jul 15, 2025 License: Apache-2.0 Imports: 21 Imported by: 0

README

Document Processing Tool

The Document Processing tool converts PDF, DOCX, XLSX, PPTX, HTML, CSV, PNG, and JPG files to structured Markdown using the Docling library. It preserves formatting and extracts tables, images, and metadata.

Experimental! This tool is in active development and has more than a few rough edges.

Features

  • Multi-format Support: PDF, DOCX, XLSX, PPTX, HTML, CSV, PNG, JPG document processing
  • Processing Profiles: Simplified interface with preset configurations for common use cases
  • Intelligent Conversion: Preserves document structure and formatting
  • OCR Support: Extract text from scanned documents
  • Hardware Acceleration: Supports MPS (macOS), CUDA, and CPU processing
  • Caching System: Intelligent caching to avoid reprocessing identical documents
  • Metadata Extraction: Extracts document metadata (title, author, page count, etc.)
  • Table & Image Extraction: Preserves tables and images in markdown format
  • Diagram Analysis: Advanced diagram detection and description using vision models
  • Mermaid Generation: Convert diagrams to editable Mermaid syntax by using an external LLM provider
  • Auto-Save: Automatically saves processed content to files by default

Installation

Prerequisites

Note: mcp-devtools will attempt to install the docling package if it's unavailable.

  1. Python 3.10+ (ideally 3.13+) with Docling installed:

    pip install docling
    
  2. Optional: Hardware Acceleration

    • macOS: Install PyTorch with MPS support
    • NVIDIA GPUs: Install PyTorch with CUDA support
    • CPU: Works out of the box
Configuration

The tool can be configured via environment variables:

# Python Configuration
DOCLING_PYTHON_PATH="/path/to/python"  # Auto-detected if not set

# Cache Configuration
DOCLING_CACHE_DIR="~/.mcp-devtools/docling-cache"
DOCLING_CACHE_ENABLED="true"

# Hardware Acceleration
DOCLING_HARDWARE_ACCELERATION="auto"  # auto, mps, cuda, cpu

# Processing Configuration
DOCLING_TIMEOUT="300"        # 5 minutes
DOCLING_MAX_FILE_SIZE="100"  # 100 MB

# OCR Configuration
DOCLING_OCR_LANGUAGES="en,fr,de"

# Vision Model Configuration
DOCLING_VISION_MODEL="SmolDocling"

# Certificate Configuration (for MITM proxies)
DOCLING_EXTRA_CA_CERTS="/path/to/mitm-ca-bundle.pem"

# LLM Configuration (for advanced diagram processing)
DOCLING_VLM_API_URL="http://localhost:11434/v1"
DOCLING_VLM_MODEL="qwen2.5vl:7b-q8_0"
DOCLING_VLM_API_KEY="your-api-key-here"
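
As a rough illustration of how settings like these can be resolved, the sketch below reads a variable and falls back to a default when it is unset or malformed. The helper names (lookupFunc, stringSetting, intSetting) are hypothetical and not part of this package; the real logic lives in LoadConfig and may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches the signature of os.LookupEnv so the helpers can be
// exercised without touching the real environment. (Hypothetical name.)
type lookupFunc func(key string) (string, bool)

// stringSetting returns the variable's value, or def when it is unset or empty.
func stringSetting(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// intSetting parses the variable as a positive integer, falling back to def
// when it is unset, malformed, or not positive.
func intSetting(lookup lookupFunc, key string, def int) int {
	v, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil || n <= 0 {
		return def
	}
	return n
}

func main() {
	// Defaults mirror the values shown in the configuration listing above.
	accel := stringSetting(os.LookupEnv, "DOCLING_HARDWARE_ACCELERATION", "auto")
	timeout := intSetting(os.LookupEnv, "DOCLING_TIMEOUT", 300)
	maxSize := intSetting(os.LookupEnv, "DOCLING_MAX_FILE_SIZE", 100)
	fmt.Printf("acceleration=%s timeout=%ds max_file_size=%dMB\n", accel, timeout, maxSize)
}
```

Passing the lookup function as a parameter keeps the helpers easy to test with a fake environment.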

Usage

The tool now features a simplified interface using processing profiles that automatically configure all necessary parameters:

{
  "source": "/path/to/document.pdf"
}

This uses the default text-and-image profile and automatically saves the processed content to /path/to/document.md.

Processing Profiles

Choose from preset profiles that configure multiple parameters automatically:

basic - Fast Text Extraction
{
  "source": "/path/to/document.pdf",
  "profile": "basic"
}
  • Text extraction only
  • Fastest processing
  • No image or diagram analysis
text-and-image - Balanced Processing (Default)
{
  "source": "/path/to/document.pdf",
  "profile": "text-and-image"
}
  • Text and image extraction
  • Table processing
  • Good balance of speed and features
scanned - OCR Processing
{
  "source": "/path/to/scanned-document.pdf",
  "profile": "scanned"
}
  • Optimised for scanned documents
  • OCR enabled by default
  • Best for image-based PDFs
llm-smoldocling - Vision Enhancement
{
  "source": "/path/to/document.pdf",
  "profile": "llm-smoldocling"
}
  • Enhanced with SmolDocling vision model
  • Diagram detection and description
  • Chart data extraction
  • No external LLM required
  • Slower than text-and-image
llm-external - Advanced Diagram Processing
{
  "source": "/path/to/document.pdf",
  "profile": "llm-external"
}
  • Full diagram-to-Mermaid conversion
  • Requires LLM environment variables
  • Most advanced processing capabilities
  • Slower processing time
  • Best for documents with diagrams and charts
  • Only available when the DOCLING_VLM_API_URL, DOCLING_VLM_MODEL, and DOCLING_VLM_API_KEY environment variables are configured
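
For callers that assemble requests programmatically, the following sketch builds the JSON payloads shown above and rejects profile names not listed in this README. The request struct and buildRequest function are hypothetical and mirror only the source and profile fields; the package's own DocumentProcessingRequest type has many more.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request mirrors only the two fields used in the profile examples above.
type request struct {
	Source  string `json:"source"`
	Profile string `json:"profile,omitempty"`
}

// knownProfiles lists the profile names documented in this README.
var knownProfiles = map[string]bool{
	"basic":           true,
	"text-and-image":  true,
	"scanned":         true,
	"llm-smoldocling": true,
	"llm-external":    true,
}

// buildRequest returns the JSON payload for a source and profile.
// An empty profile is omitted so that the tool's default applies.
func buildRequest(source, profile string) (string, error) {
	if profile != "" && !knownProfiles[profile] {
		return "", fmt.Errorf("unknown profile %q", profile)
	}
	b, err := json.Marshal(request{Source: source, Profile: profile})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	payload, err := buildRequest("/path/to/scanned-document.pdf", "scanned")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(payload)
}
```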
Output Control
Save to File (Default)
{
  "source": "/path/to/document.pdf"
}
  • Automatically saves to /path/to/document.md; any extracted images are saved in the same directory
  • Returns success message with file path
Custom Save Location
{
  "source": "/path/to/document.pdf",
  "save_to": "/custom/path/output.md"
}
  • Saves to specified location
  • Must be an absolute path
Return Content Inline
{
  "source": "/path/to/document.pdf",
  "inline": true
}
  • Returns content in the response
  • No file is saved
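
The output rules above can be sketched as a small path resolver: with no save_to, the source extension is swapped for .md, and an explicit save_to must be absolute. The resolveSavePath function is a hypothetical illustration that assumes a local file path as the source; it is not the tool's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveSavePath returns where the Markdown output would be written.
// With no saveTo, the source extension is swapped for ".md"; an explicit
// saveTo must be an absolute path.
func resolveSavePath(source, saveTo string) (string, error) {
	if saveTo == "" {
		return strings.TrimSuffix(source, filepath.Ext(source)) + ".md", nil
	}
	if !filepath.IsAbs(saveTo) {
		return "", errors.New("save_to must be an absolute path")
	}
	return filepath.Clean(saveTo), nil
}

func main() {
	p, err := resolveSavePath("/path/to/document.pdf", "")
	fmt.Println(p, err)

	_, err = resolveSavePath("/path/to/document.pdf", "output.md")
	fmt.Println(err)
}
```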

OCR (Optical Character Recognition)

The tool supports OCR processing for extracting text from scanned documents and images. Understanding when to use OCR is important for optimal results:

OCR Disabled (Default for most profiles)
  • Best for: Digital documents (native PDFs, Word documents, Excel files)
  • How it works: Extracts text directly from the document's digital structure
  • Advantages:
    • Faster processing
    • Perfect text accuracy (no recognition errors)
    • Preserves original formatting and fonts
    • Lower resource usage
  • Limitations: Cannot process scanned documents or image-based PDFs
OCR Enabled (Default for scanned profile)
  • Best for: Scanned documents, image-based PDFs, photos of documents
  • How it works: Uses computer vision to recognise text from images
  • Advantages:
    • Can process any document type, including scanned pages
    • Handles handwritten text (with varying accuracy)
    • Works with photos and screenshots of documents
  • Limitations:
    • Slower processing
    • May introduce text recognition errors
    • Formatting may not be perfectly preserved
    • Higher resource usage
OCR Language Support

When using the scanned profile or enabling OCR manually, you can specify languages:

{
  "profile": "scanned",
  "ocr_languages": ["en", "fr", "de", "es"]
}

Supported languages include: English (en), French (fr), German (de), Spanish (es), Italian (it), Portuguese (pt), Dutch (nl), Russian (ru), Chinese (zh), Japanese (ja), Korean (ko), and many others.
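
The DOCLING_OCR_LANGUAGES variable holds a comma-separated list, whereas the request takes an array. A minimal sketch of converting one to the other follows; parseLanguages is a hypothetical helper, and the normalisation it applies (trimming, lower-casing, de-duplication) is an assumption rather than documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLanguages splits a comma-separated list such as "en,fr,de" into
// lower-cased codes, dropping blanks and duplicates while keeping order.
func parseLanguages(raw string) []string {
	seen := make(map[string]bool)
	var codes []string
	for _, part := range strings.Split(raw, ",") {
		code := strings.ToLower(strings.TrimSpace(part))
		if code == "" || seen[code] {
			continue
		}
		seen[code] = true
		codes = append(codes, code)
	}
	return codes
}

func main() {
	fmt.Println(parseLanguages("en, fr,de,,EN")) // prints [en fr de]
}
```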

Diagram Analysis and Mermaid Generation

Basic Diagram Analysis (llm-smoldocling profile)

Uses the built-in SmolDocling vision model for diagram detection and description:

{
  "source": "/path/to/document.pdf",
  "profile": "llm-smoldocling"
}
Advanced Mermaid Generation (llm-external profile)

For diagram-to-Mermaid conversion, first configure external LLM integration:

# Required environment variables
export DOCLING_VLM_API_URL="http://localhost:11434/v1"   # Any OpenAI-compatible endpoint
export DOCLING_VLM_MODEL="qwen2.5vl:7b-q8_0"                     # Vision-capable model
export DOCLING_VLM_API_KEY="your-api-key-here"            # API key

# Optional configuration
export DOCLING_LLM_MAX_TOKENS="16384"        # Maximum tokens for LLM response
export DOCLING_LLM_TEMPERATURE="0.1"         # Temperature for LLM inference
export DOCLING_LLM_TIMEOUT="240"             # Timeout for LLM requests in seconds

Then use the llm-external profile:

{
  "source": "/path/to/document.pdf",
  "profile": "llm-external"
}
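
Before requesting the llm-external profile, a caller may want to confirm that the required variables are present. The sketch below reports which of the three required variables are missing; missingLLMVars is a hypothetical helper, and the package's own IsLLMConfigured function may apply different rules.

```go
package main

import (
	"fmt"
	"os"
)

// requiredLLMVars are the variables this README lists as required for the
// llm-external profile.
var requiredLLMVars = []string{
	"DOCLING_VLM_API_URL",
	"DOCLING_VLM_MODEL",
	"DOCLING_VLM_API_KEY",
}

// missingLLMVars returns the required variables that are unset or empty.
// The lookup parameter has the same signature as os.LookupEnv.
func missingLLMVars(lookup func(string) (string, bool)) []string {
	var missing []string
	for _, key := range requiredLLMVars {
		if v, ok := lookup(key); !ok || v == "" {
			missing = append(missing, key)
		}
	}
	return missing
}

func main() {
	if missing := missingLLMVars(os.LookupEnv); len(missing) > 0 {
		fmt.Println("llm-external unavailable; missing:", missing)
		return
	}
	fmt.Println("llm-external can be used")
}
```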
Supported LLM Providers

The tool supports any OpenAI-compatible API endpoint, for example:

  • Ollama (local): http://localhost:11434/v1
  • LM Studio (local): http://localhost:1234/v1
  • OpenAI: https://api.openai.com/v1
  • OpenRouter: https://openrouter.ai/api/v1

Ensure you select a model that supports vision input (e.g., qwen2.5vl:7b-q8_0, gpt-4-vision-preview, claude-3-sonnet).

Diagram Analysis Features
  • Automatic Detection: Identifies diagrams, flowcharts, architecture diagrams, and charts
  • Type Classification: Classifies diagram types with confidence scoring
  • Mermaid Conversion: Generates valid Mermaid syntax for diagrams
  • Element Extraction: Extracts text elements and structural components
  • AWS Colour Coding: Applies consistent colour schemes for architecture diagrams
  • Validation: Validates generated Mermaid syntax for correctness
  • Fallback Handling: Gracefully falls back to basic analysis if LLM is unavailable

Response Format

File Save Response (Default)
{
  "success": true,
  "message": "Content successfully exported to file",
  "save_path": "/path/to/document.md",
  "source": "/path/to/document.pdf",
  "cache_hit": false,
  "metadata": {
    "file_size": 15420,
    "document_title": "Document Title",
    "document_author": "Author Name",
    "page_count": 10,
    "word_count": 1500
  },
  "processing_info": {
    "processing_mode": "advanced",
    "processing_method": "advanced+vision:standard",
    "hardware_acceleration": "mps",
    "ocr_enabled": false,
    "processing_time": 2.5,
    "timestamp": "2025-07-09T22:12:15+10:00"
  }
}
Inline Content Response
{
  "source": "/path/to/document.pdf",
  "content": "# Document Title\n\nDocument content in markdown...",
  "cache_hit": false,
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "subject": "Document Subject",
    "page_count": 10,
    "word_count": 1500
  },
  "images": [
    {
      "id": "image_1",
      "type": "picture",
      "caption": "Figure 1",
      "file_path": "/path/to/extracted/image_1.png",
      "width": 800,
      "height": 600
    }
  ],
  "diagrams": [
    {
      "id": "diagram_1",
      "type": "flowchart",
      "description": "Process flow diagram showing...",
      "mermaid_code": "flowchart TD\n    A[Start] --> B[Process]\n    B --> C[End]",
      "confidence": 0.95
    }
  ],
  "processing_info": {
    "processing_mode": "advanced",
    "processing_method": "advanced+vision:smoldocling+llm:enhanced",
    "hardware_acceleration": "mps",
    "ocr_enabled": false,
    "processing_time": 8.2,
    "timestamp": "2025-07-09T22:12:15+10:00"
  }
}
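
A client consuming the inline response might decode only the fields it needs. The sketch below pulls out diagrams whose confidence meets a threshold; the struct definitions and the confidentDiagrams function are hypothetical and cover just a subset of the fields shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// diagram holds the subset of diagram fields this sketch uses.
type diagram struct {
	ID          string  `json:"id"`
	Type        string  `json:"type"`
	MermaidCode string  `json:"mermaid_code"`
	Confidence  float64 `json:"confidence"`
}

// inlineResponse holds the subset of response fields this sketch uses.
type inlineResponse struct {
	Source   string    `json:"source"`
	Content  string    `json:"content"`
	CacheHit bool      `json:"cache_hit"`
	Diagrams []diagram `json:"diagrams"`
}

// confidentDiagrams decodes a response and returns the diagrams whose
// confidence is at or above threshold.
func confidentDiagrams(raw []byte, threshold float64) ([]diagram, error) {
	var resp inlineResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	var out []diagram
	for _, d := range resp.Diagrams {
		if d.Confidence >= threshold {
			out = append(out, d)
		}
	}
	return out, nil
}

func main() {
	raw := []byte(`{"source":"/path/to/document.pdf","content":"# Title","cache_hit":false,
		"diagrams":[{"id":"diagram_1","type":"flowchart","mermaid_code":"flowchart TD\n    A[Start] --> B[End]","confidence":0.95}]}`)
	diagrams, err := confidentDiagrams(raw, 0.8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range diagrams {
		fmt.Printf("%s (%s):\n%s\n", d.ID, d.Type, d.MermaidCode)
	}
}
```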

Error Handling

The tool provides detailed error information:

{
  "source": "/path/to/document.pdf",
  "error": "Processing failed: File not found",
  "system_info": {
    "platform": "darwin",
    "python_available": true,
    "docling_available": false,
    "hardware_acceleration": ["cpu", "mps"]
  }
}
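
The system_info block can be used to turn a failure into a next step. The following sketch maps an error response to a short hint drawn from the troubleshooting advice in this README; the struct definitions and the hint function are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// systemInfo mirrors the system_info object in the error example above.
type systemInfo struct {
	Platform             string   `json:"platform"`
	PythonAvailable      bool     `json:"python_available"`
	DoclingAvailable     bool     `json:"docling_available"`
	HardwareAcceleration []string `json:"hardware_acceleration"`
}

// errorResponse mirrors the error example above.
type errorResponse struct {
	Source     string      `json:"source"`
	Error      string      `json:"error"`
	SystemInfo *systemInfo `json:"system_info"`
}

// hint suggests a next step for a failed response. It returns "" when the
// response carries no error, and the raw error text when no specific
// suggestion applies.
func hint(raw []byte) (string, error) {
	var resp errorResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	switch {
	case resp.Error == "":
		return "", nil
	case resp.SystemInfo != nil && !resp.SystemInfo.PythonAvailable:
		return "install Python or set DOCLING_PYTHON_PATH", nil
	case resp.SystemInfo != nil && !resp.SystemInfo.DoclingAvailable:
		return "install Docling with: pip install docling", nil
	default:
		return resp.Error, nil
	}
}

func main() {
	raw := []byte(`{"source":"/path/to/document.pdf","error":"Processing failed: File not found",
		"system_info":{"platform":"darwin","python_available":true,"docling_available":false,
		"hardware_acceleration":["cpu","mps"]}}`)
	h, err := hint(raw)
	fmt.Println(h, err)
}
```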

Architecture

Components
  1. DocumentProcessorTool: Main MCP tool interface with simplified profile system
  2. Config: Configuration management with environment variable support
  3. CacheManager: Intelligent caching system with TTL support
  4. LLMClient: External LLM integration for advanced diagram processing
  5. Python Wrapper: Subprocess interface to Docling Python library
File Structure
internal/tools/docprocessing/
├── README.md                    # This file
├── document_processor.go        # Main tool implementation
├── types.go                     # Type definitions including profiles
├── config.go                    # Configuration management
├── cache.go                     # Caching system
├── llm_client.go               # LLM integration for diagram processing
└── python/
    ├── docling_processor.py     # Main Python wrapper script
    ├── image_processing.py      # Image extraction and processing
    └── table_processing.py      # Table extraction and formatting
Processing Profiles Implementation

The profile system automatically configures multiple parameters:

  • Profile Selection: Chooses appropriate processing mode, vision settings, and features
  • Dependency Resolution: Automatically enables required services and parameters
  • Environment Awareness: llm-external profile only available when LLM is configured
  • Backward Compatibility: Individual parameters still work for advanced users

Performance

Caching

The tool implements intelligent caching based on:

  • Document source (file path/URL)
  • Processing parameters (profile, mode, OCR settings, etc.)
  • File modification time (for local files)

Cache entries have a 24-hour TTL by default and are stored as JSON files.
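
To illustrate how a key can depend on all three inputs listed above, the sketch below hashes the source, a few processing parameters, and the file modification time with SHA-256. The cacheInputs type and cacheKey function are hypothetical; the actual GenerateCacheKey implementation is not shown in this document and may use different fields or a different hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheInputs gathers the values that should invalidate a cached result
// when they change. (Hypothetical type for illustration.)
type cacheInputs struct {
	Source      string // file path or URL
	Profile     string // processing profile name
	EnableOCR   bool   // whether OCR is enabled
	ModTimeUnix int64  // file modification time; 0 for non-local sources
}

// cacheKey returns a hex-encoded SHA-256 digest of the inputs.
func cacheKey(in cacheInputs) string {
	h := sha256.New()
	// %q quotes the strings so that field boundaries stay unambiguous.
	fmt.Fprintf(h, "%q|%q|%t|%d", in.Source, in.Profile, in.EnableOCR, in.ModTimeUnix)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := cacheInputs{Source: "/path/to/document.pdf", Profile: "basic", ModTimeUnix: 1752000000}
	b := a
	b.Profile = "scanned"
	fmt.Println(cacheKey(a) == cacheKey(a)) // same inputs, same key
	fmt.Println(cacheKey(a) == cacheKey(b)) // different profile, different key
}
```

Including the modification time means an edited file yields a new key without any explicit invalidation.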

Hardware Acceleration

Processing performance varies by hardware:

  • CPU: Baseline performance, works everywhere
  • MPS (macOS): 2-5x faster on Apple Silicon
  • CUDA: 3-10x faster on NVIDIA GPUs
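
The "auto" setting implies some detection step. One plausible way to resolve it is sketched below; the precedence rules (CUDA when reported, MPS on Apple Silicon, otherwise CPU) are an assumption for illustration, and resolveAcceleration is a hypothetical function, not the package's ResolveHardwareAcceleration method.

```go
package main

import (
	"fmt"
	"runtime"
)

// resolveAcceleration maps a DOCLING_HARDWARE_ACCELERATION value to a
// concrete backend. Explicit values are returned as-is; "auto" (and any
// unrecognised value) falls through to detection. The detection order is
// an assumption made for this sketch.
func resolveAcceleration(setting, goos, goarch string, cudaAvailable bool) string {
	switch setting {
	case "mps", "cuda", "cpu":
		return setting
	}
	switch {
	case cudaAvailable:
		return "cuda"
	case goos == "darwin" && goarch == "arm64":
		return "mps"
	default:
		return "cpu"
	}
}

func main() {
	// CUDA detection is out of scope here, so false is passed.
	fmt.Println(resolveAcceleration("auto", runtime.GOOS, runtime.GOARCH, false))
}
```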
Profile Performance Comparison
  • basic: Fastest
  • text-and-image: Moderate
  • scanned: Slower
  • llm-smoldocling: Moderate (slower than text-and-image)
  • llm-external: Slowest

Use With Custom MITM Certs

The document processing tool performs a pip install docling (if it's not found) and downloads models, so corporate environments with MITM proxies may need additional certs. Set the DOCLING_EXTRA_CA_CERTS environment variable:

export DOCLING_EXTRA_CA_CERTS="/path/to/mitm-ca-bundle.pem"
Supported Certificate Formats
  • .pem - PEM encoded certificates
  • .crt - Certificate files
  • .cer - Certificate files
  • .ca-bundle - Certificate bundles
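
A quick pre-flight check on the file extension can catch a mistyped path early. The sketch below tests a path against the four extensions listed above; isSupportedCertFile is a hypothetical helper, and it does not cover the case where DOCLING_EXTRA_CA_CERTS points at a directory.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSupportedCertFile reports whether the path ends in one of the
// certificate extensions listed in this README (case-insensitive).
func isSupportedCertFile(path string) bool {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".pem", ".crt", ".cer", ".ca-bundle":
		return true
	}
	return false
}

func main() {
	for _, p := range []string{"/path/to/mitm-ca-bundle.pem", "/etc/ssl/corp.ca-bundle", "/tmp/notes.txt"} {
		fmt.Printf("%s supported=%t\n", p, isSupportedCertFile(p))
	}
}
```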

Troubleshooting

Common Issues
  1. "Python path is required but not found"

    • Install Python 3.10+ (ideally 3.13+) and ensure it's in PATH
    • Or set DOCLING_PYTHON_PATH environment variable
  2. "Docling not available"

    • Install Docling: pip install docling
    • Verify installation: python -c "import docling; print('OK')"
  3. "Processing timeout"

    • Increase timeout with DOCLING_TIMEOUT environment variable
    • Or pass timeout parameter in request
  4. "Hardware acceleration not working"

    • Install appropriate PyTorch version for your hardware
    • Check system compatibility with python -c "import torch; print(torch.backends.mps.is_available())" (MPS) or python -c "import torch; print(torch.cuda.is_available())" (CUDA)
  5. "LLM external profile not available"

    • Ensure DOCLING_VLM_API_URL, DOCLING_VLM_MODEL, and DOCLING_VLM_API_KEY are set
    • Verify LLM endpoint is accessible
    • Check model supports vision input
  6. "Certificate path does not exist"

    • Verify the path specified in DOCLING_EXTRA_CA_CERTS exists
    • Ensure the certificate file or directory is readable
Debug Mode

Enable debug mode to see detailed processing information:

{
  "source": "/path/to/document.pdf",
  "debug": true
}

Potential Future Enhancements

Document Structure Enhancement
  • Reading Order Detection: Improve paragraph and section ordering algorithms
  • Metadata Extraction: Enhanced title, author, reference detection using NLP
  • Language Detection: Automatic document language identification with confidence scores
  • Figure-Caption Matching: Automatic association of figures with their captions using proximity and semantic analysis
Processing Pipeline Options
  • Batch Processing: Support for processing multiple documents efficiently with shared model loading
  • Resource Limits: Configurable page limits, file size limits, CPU thread limits for enterprise deployment
  • Remote Services: Optional integration with cloud-based OCR or vision services (Azure, AWS, GCP)
  • Custom Model Pipelines: Extensible architecture for adding new models via plugin system
Advanced Output Formats
  • Custom Chunking: Integration with HybridChunker for RAG applications
  • Semantic Markup: Add semantic tags for better downstream processing
Diagram/Chart Processing (External Integration)
  • External Service Integration: Use services like "Diagram to Mermaid Converter" APIs
  • Vision Model Integration: Potentially add support for using an external LLM API for diagram processing
  • OCR + Pattern Recognition: Extract text from diagrams and attempt to reconstruct logical structure
  • Flowchart Recognition: Specific support for flowchart-to-Mermaid conversion
Performance and Scalability
  • Streaming Processing: Support for processing large documents in chunks
  • Distributed Processing: Support for processing across multiple nodes
Quality and Accuracy Improvements
  • Confidence Scoring: Add confidence scores for all extracted elements
  • Quality Metrics: Implement quality assessment for extracted content
  • Error Recovery: Better handling of corrupted or unusual document formats
Smart Defaults and Auto-Detection
  • Language Detection: Auto-detect instead of requiring ocr_languages
  • Processing Mode: Auto-select based on document analysis
  • Table Processing: Always use optimal settings

License

This tool is part of the mcp-devtools project and follows the same license terms.

Documentation

Constants

const (
	EnvOpenAIAPIBase  = "DOCLING_VLM_API_URL"     // e.g., "https://api.openai.com/v1"
	EnvOpenAIModel    = "DOCLING_VLM_MODEL"       // e.g., "gpt-4-vision-preview"
	EnvOpenAIAPIKey   = "DOCLING_VLM_API_KEY"     // API key for the provider (consistent with VLM naming)
	EnvLLMMaxTokens   = "DOCLING_LLM_MAX_TOKENS"  // Maximum tokens for LLM response (default: 16384)
	EnvLLMTemperature = "DOCLING_LLM_TEMPERATURE" // Temperature for LLM inference (default: 0.1)
	EnvLLMTimeout     = "DOCLING_LLM_TIMEOUT"     // Timeout for LLM requests in seconds (default: 240)

	// Prompt configuration environment variables
	EnvPromptBase         = "DOCLING_LLM_PROMPT_BASE"         // Base prompt for diagram analysis
	EnvPromptFlowchart    = "DOCLING_LLM_PROMPT_FLOWCHART"    // Flowchart-specific prompt
	EnvPromptArchitecture = "DOCLING_LLM_PROMPT_ARCHITECTURE" // Architecture diagram prompt
	EnvPromptChart        = "DOCLING_LLM_PROMPT_CHART"        // Chart analysis prompt
	EnvPromptGeneric      = "DOCLING_LLM_PROMPT_GENERIC"      // Generic diagram prompt
)

Environment variable constants for LLM integration

const (
	DefaultMaxTokens   = 16384
	DefaultTemperature = 0.1
	DefaultTimeout     = 240
)

Default LLM configuration values

const (
	// VLM Pipeline Configuration
	EnvVLMAPIURL        = "DOCLING_VLM_API_URL"        // User-provided API endpoint URL (e.g., "http://localhost:1234/v1")
	EnvVLMModel         = "DOCLING_VLM_MODEL"          // Model name/ID (e.g., "gpt-4-vision-preview", "SmolVLM-Instruct")
	EnvVLMAPIKey        = "DOCLING_VLM_API_KEY"        // Authentication key for external APIs
	EnvVLMTimeout       = "DOCLING_VLM_TIMEOUT"        // Request timeout in seconds (default: 240)
	EnvVLMFallbackLocal = "DOCLING_VLM_FALLBACK_LOCAL" // Enable local model fallback (default: true)

	// Image Processing Configuration
	EnvImageScale = "DOCLING_IMAGE_SCALE" // Image resolution scale factor (default: 3.0, range: 1.0-4.0)

	// Performance Optimisation Configuration
	EnvDisablePictureClassification = "DOCLING_DISABLE_PICTURE_CLASSIFICATION" // Disable picture classification to speed up processing (default: false)
	EnvDisablePictureDescription    = "DOCLING_DISABLE_PICTURE_DESCRIPTION"    // Disable picture description to speed up processing (default: false)
	EnvAcceleratorProcesses         = "DOCLING_ACCELERATOR_PROCESSES"          // Number of accelerator processes (default: CPU cores - 1)
)

Environment variable constants for VLM Pipeline integration and image processing

const (
	DefaultDiagramPrompt = `` /* 514-byte string literal not displayed */

)

Default prompts

Variables

This section is empty.

Functions

func CleanupEmbeddedScripts

func CleanupEmbeddedScripts() error

CleanupEmbeddedScripts removes the temporary directory containing extracted scripts. This should be called during graceful shutdown, but the OS will clean up temp files anyway.

func GetEmbeddedScriptPath

func GetEmbeddedScriptPath() (string, error)

GetEmbeddedScriptPath extracts the embedded Python scripts to a temporary directory and returns the path to the main docling_processor.py script. This is thread-safe and only extracts once per process.
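
The "extract once per process" behaviour described here is commonly built on sync.Once. The sketch below shows that pattern with a placeholder script body written to a temporary directory; all names (scriptPathOnce, cleanupScripts, scriptBody) are hypothetical, and the real function reads from embedded files rather than a string constant.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// scriptBody stands in for the embedded Python script (placeholder content).
const scriptBody = "print('placeholder')\n"

var (
	extractOnce  sync.Once
	scriptPath   string
	extractErr   error
	extractCount int // counts how many times extraction actually ran
)

// scriptPathOnce writes the script to a temporary directory on first use
// and returns the same path (or the same error) on every later call.
func scriptPathOnce() (string, error) {
	extractOnce.Do(func() {
		extractCount++
		dir, err := os.MkdirTemp("", "docling-scripts-")
		if err != nil {
			extractErr = err
			return
		}
		p := filepath.Join(dir, "docling_processor.py")
		if err := os.WriteFile(p, []byte(scriptBody), 0o600); err != nil {
			extractErr = err
			return
		}
		scriptPath = p
	})
	return scriptPath, extractErr
}

// cleanupScripts removes the temporary directory, if one was created.
func cleanupScripts() error {
	if scriptPath == "" {
		return nil
	}
	return os.RemoveAll(filepath.Dir(scriptPath))
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = scriptPathOnce() // concurrent callers share one extraction
		}()
	}
	wg.Wait()
	p, err := scriptPathOnce()
	fmt.Println(p, err, "extractions:", extractCount)
	_ = cleanupScripts()
}
```

Because sync.Once also records a failed attempt as done, an extraction error is returned to every subsequent caller instead of being retried.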

func IsEmbeddedScriptsAvailable

func IsEmbeddedScriptsAvailable() bool

IsEmbeddedScriptsAvailable checks if the embedded Python scripts are available

func IsLLMConfigured

func IsLLMConfigured() bool

IsLLMConfigured checks if the required environment variables are set

func ReadEmbeddedFile

func ReadEmbeddedFile(path string) ([]byte, error)

ReadEmbeddedFile reads an embedded file and returns its content

Types

type BatchProcessingRequest

type BatchProcessingRequest struct {
	Sources        []string       `json:"sources"`                   // Multiple document sources
	ProcessingMode ProcessingMode `json:"processing_mode,omitempty"` // Processing mode for all documents
	OutputFormat   OutputFormat   `json:"output_format,omitempty"`   // Output format for all documents
	EnableOCR      bool           `json:"enable_ocr,omitempty"`      // Enable OCR for all documents
	OCRLanguages   []string       `json:"ocr_languages,omitempty"`   // OCR languages for all documents
	PreserveImages bool           `json:"preserve_images,omitempty"` // Extract images from all documents
	CacheEnabled   *bool          `json:"cache_enabled,omitempty"`   // Cache setting for all documents
	Timeout        *int           `json:"timeout,omitempty"`         // Timeout for each document
	MaxConcurrency int            `json:"max_concurrency,omitempty"` // Maximum concurrent processing
}

BatchProcessingRequest represents a request to process multiple documents

type BatchProcessingResponse

type BatchProcessingResponse struct {
	Results   []DocumentProcessingResponse `json:"results"`    // Individual processing results
	Summary   BatchSummary                 `json:"summary"`    // Batch processing summary
	TotalTime time.Duration                `json:"total_time"` // Total processing time
	Timestamp time.Time                    `json:"timestamp"`  // Batch processing timestamp
}

BatchProcessingResponse represents the response from batch processing

type BatchSummary

type BatchSummary struct {
	TotalDocuments  int `json:"total_documents"`  // Total number of documents
	SuccessfulCount int `json:"successful_count"` // Number of successfully processed documents
	FailedCount     int `json:"failed_count"`     // Number of failed documents
	CacheHitCount   int `json:"cache_hit_count"`  // Number of cache hits
	TotalPages      int `json:"total_pages"`      // Total pages processed
	TotalWords      int `json:"total_words"`      // Total words processed
	TotalImages     int `json:"total_images"`     // Total images extracted
	TotalTables     int `json:"total_tables"`     // Total tables extracted
}

BatchSummary provides summary statistics for batch processing

type BoundingBox

type BoundingBox struct {
	X      float64 `json:"x"`      // X coordinate (left)
	Y      float64 `json:"y"`      // Y coordinate (top)
	Width  float64 `json:"width"`  // Width
	Height float64 `json:"height"` // Height
}

BoundingBox represents the position and size of an element on a page

type CacheManager

type CacheManager struct {
	// contains filtered or unexported fields
}

CacheManager handles caching of document processing results

func NewCacheManager

func NewCacheManager(config *Config) *CacheManager

NewCacheManager creates a new cache manager

func (*CacheManager) CleanExpired

func (cm *CacheManager) CleanExpired() error

CleanExpired removes expired cache entries

func (*CacheManager) CleanOldFiles

func (cm *CacheManager) CleanOldFiles(maxAge time.Duration) error

CleanOldFiles removes cache files older than the specified duration. This is useful for cleaning up files that may not have proper TTL metadata.

func (*CacheManager) Clear

func (cm *CacheManager) Clear() error

Clear removes all cached results

func (*CacheManager) ClearFileCache

func (cm *CacheManager) ClearFileCache(source string) error

ClearFileCache removes all cache entries for a specific source file

func (*CacheManager) Delete

func (cm *CacheManager) Delete(cacheKey string) error

Delete removes a cached result

func (*CacheManager) GenerateCacheKey

func (cm *CacheManager) GenerateCacheKey(req *DocumentProcessingRequest) string

GenerateCacheKey generates a cache key for the given request

func (*CacheManager) Get

func (cm *CacheManager) Get(cacheKey string) (*DocumentProcessingResponse, bool)

Get retrieves a cached result if it exists and is valid

func (*CacheManager) GetCacheFilePath

func (cm *CacheManager) GetCacheFilePath(cacheKey string) string

GetCacheFilePath returns the file path for a cache key

func (*CacheManager) GetStats

func (cm *CacheManager) GetStats() (*CacheStats, error)

GetStats returns cache statistics

func (*CacheManager) PerformMaintenance

func (cm *CacheManager) PerformMaintenance(maxAge time.Duration) error

PerformMaintenance performs routine cache maintenance, including:

  • Removing expired entries
  • Removing old files (older than maxAge)

func (*CacheManager) Set

func (cm *CacheManager) Set(cacheKey string, response *DocumentProcessingResponse) error

Set stores a result in the cache

type CacheStats

type CacheStats struct {
	Enabled      bool   `json:"enabled"`
	Directory    string `json:"directory"`
	TotalFiles   int    `json:"total_files"`
	TotalSize    int64  `json:"total_size"`    // Size in bytes
	ExpiredFiles int    `json:"expired_files"` // Number of expired files
}

CacheStats provides statistics about the cache

type CachedResponse

type CachedResponse struct {
	Response  DocumentProcessingResponse `json:"response"`
	CacheKey  string                     `json:"cache_key"`
	Timestamp time.Time                  `json:"timestamp"`
	TTL       time.Duration              `json:"ttl"` // Time to live
}

CachedResponse represents a cached document processing response

type Config

type Config struct {
	// Python Configuration
	PythonPath string // Path to Python executable with Docling installed

	// Cache Configuration
	CacheDir     string // Directory for caching processed documents
	CacheEnabled bool   // Enable/disable caching

	// Hardware Configuration
	HardwareAcceleration HardwareAcceleration // Hardware acceleration mode

	// Processing Configuration
	Timeout     int // Processing timeout in seconds
	MaxFileSize int // Maximum file size in MB

	// OCR Configuration
	OCRLanguages []string // Default OCR languages

	// Vision Model Configuration
	VisionModel string // Vision model to use

	// Certificate Configuration
	ExtraCACerts string // Path to additional CA certificates file or directory
}

Config holds the configuration for document processing

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns the default configuration

func LoadConfig

func LoadConfig() *Config

LoadConfig loads configuration from environment variables

func (*Config) EnsureCacheDir

func (c *Config) EnsureCacheDir() error

EnsureCacheDir creates the cache directory if it doesn't exist

func (*Config) GetCertificateEnvironment

func (c *Config) GetCertificateEnvironment() []string

GetCertificateEnvironment returns environment variables for certificate configuration

func (*Config) GetScriptPath

func (c *Config) GetScriptPath() string

GetScriptPath returns the path to the Python wrapper script

func (*Config) GetSystemInfo

func (c *Config) GetSystemInfo() *SystemInfo

GetSystemInfo returns system information for diagnostics

func (*Config) ResolveHardwareAcceleration

func (c *Config) ResolveHardwareAcceleration() HardwareAcceleration

ResolveHardwareAcceleration resolves the hardware acceleration setting

func (*Config) Validate

func (c *Config) Validate() error

Validate validates the configuration

func (*Config) ValidateCertificates

func (c *Config) ValidateCertificates() error

ValidateCertificates validates the certificate configuration

type DiagramAnalysis

type DiagramAnalysis struct {
	Description    string                 `json:"description"`
	DiagramType    string                 `json:"diagram_type"`
	MermaidCode    string                 `json:"mermaid_code"`
	Elements       []DiagramElement       `json:"elements"`
	Confidence     float64                `json:"confidence"`
	Properties     map[string]interface{} `json:"properties"`
	ProcessingTime time.Duration          `json:"processing_time"`
	TokenUsage     *TokenUsage            `json:"token_usage,omitempty"` // Token usage from LLM provider (if available)
}

DiagramAnalysis represents the result of LLM-based diagram analysis

type DiagramElement

type DiagramElement struct {
	Type        string       `json:"type"`                   // Element type (text, shape, connector, etc.)
	Content     string       `json:"content,omitempty"`      // Text content of the element
	Position    string       `json:"position,omitempty"`     // Position description within diagram
	BoundingBox *BoundingBox `json:"bounding_box,omitempty"` // Position within the diagram
}

DiagramElement represents a text or structural element within a diagram

type DiagramLLMClient

type DiagramLLMClient struct {
	// contains filtered or unexported fields
}

DiagramLLMClient handles LLM-based diagram analysis using OpenAI API

func NewDiagramLLMClient

func NewDiagramLLMClient() (*DiagramLLMClient, error)

NewDiagramLLMClient creates a new LLM client for diagram analysis using OpenAI API

func (*DiagramLLMClient) AnalyseDiagram

func (c *DiagramLLMClient) AnalyseDiagram(diagram *ExtractedDiagram) (*DiagramAnalysis, error)

AnalyseDiagram performs LLM-based analysis of a diagram

type DocumentMetadata

type DocumentMetadata struct {
	Title        string            `json:"title,omitempty"`         // Document title
	Author       string            `json:"author,omitempty"`        // Document author
	Subject      string            `json:"subject,omitempty"`       // Document subject
	Creator      string            `json:"creator,omitempty"`       // Document creator
	Producer     string            `json:"producer,omitempty"`      // Document producer
	CreationDate *time.Time        `json:"creation_date,omitempty"` // Creation date
	ModifiedDate *time.Time        `json:"modified_date,omitempty"` // Last modified date
	PageCount    int               `json:"page_count,omitempty"`    // Number of pages
	WordCount    int               `json:"word_count,omitempty"`    // Estimated word count
	Language     string            `json:"language,omitempty"`      // Detected language
	Format       string            `json:"format"`                  // Original document format
	FileSize     int64             `json:"file_size,omitempty"`     // File size in bytes
	Properties   map[string]string `json:"properties,omitempty"`    // Additional properties
}

DocumentMetadata contains metadata about the processed document

type DocumentProcessingRequest

type DocumentProcessingRequest struct {
	Source                   string               `json:"source"`                                // File path, URL, or base64 content
	Profile                  ProcessingProfile    `json:"profile,omitempty"`                     // Processing profile (replaces multiple parameters)
	ProcessingMode           ProcessingMode       `json:"processing_mode,omitempty"`             // Processing mode (default: basic)
	OutputFormat             OutputFormat         `json:"output_format,omitempty"`               // Output format (default: markdown)
	EnableOCR                bool                 `json:"enable_ocr,omitempty"`                  // Enable OCR processing
	OCRLanguages             []string             `json:"ocr_languages,omitempty"`               // OCR language codes
	PreserveImages           bool                 `json:"preserve_images,omitempty"`             // Extract and preserve images
	Timeout                  *int                 `json:"timeout,omitempty"`                     // Processing timeout in seconds
	MaxFileSize              *int                 `json:"max_file_size,omitempty"`               // Maximum file size in MB
	ReturnInlineOnly         *bool                `json:"return_inline_only,omitempty"`          // Return content inline in the response only. When false (default), the tool will save the processed content to a file in the same directory as the source file, and also return the content inline.
	SaveTo                   string               `json:"save_to,omitempty"`                     // File path to save content when return_inline_only=false
	ClearFileCache           bool                 `json:"clear_file_cache,omitempty"`            // Force clear all cache entries for this source file before processing
	TableFormerMode          TableFormerMode      `json:"table_former_mode,omitempty"`           // TableFormer processing mode for table structure recognition
	CellMatching             *bool                `json:"cell_matching,omitempty"`               // Control table cell matching (true: use PDF cells, false: use predicted cells)
	VisionMode               VisionProcessingMode `json:"vision_mode,omitempty"`                 // Vision processing mode for enhanced document understanding
	DiagramDescription       bool                 `json:"diagram_description,omitempty"`         // Enable diagram and chart description using vision models
	ChartDataExtraction      bool                 `json:"chart_data_extraction,omitempty"`       // Enable data extraction from charts and graphs
	EnableRemoteServices     bool                 `json:"enable_remote_services,omitempty"`      // Allow communication with external vision model services
	ConvertDiagramsToMermaid bool                 `json:"convert_diagrams_to_mermaid,omitempty"` // Convert detected diagrams to Mermaid syntax using AI vision models
	GenerateDiagrams         bool                 `json:"generate_diagrams,omitempty"`           // Generate enhanced diagram analysis using external LLM (requires DOCLING_VLM_API_URL, DOCLING_VLM_MODEL, DOCLING_VLM_API_KEY environment variables)
	ExtractImages            bool                 `json:"extract_images,omitempty"`              // Extract individual images, charts, and diagrams as base64-encoded data with AI recreation prompts
	Debug                    bool                 `json:"debug,omitempty"`                       // Return debug information including environment variables (secrets masked)
}

DocumentProcessingRequest represents the input parameters for document processing
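As a rough illustration of how the optional pointer fields interact with `omitempty`, the sketch below mirrors a few fields of DocumentProcessingRequest in a local type and marshals it. The `request` type, the `encode` helper, and the file path are hypothetical stand-ins, not part of the package. A pointer field such as `ReturnInlineOnly` lets a caller send an explicit `false`, whereas a plain `bool` with `omitempty` would be dropped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request mirrors a subset of DocumentProcessingRequest for illustration only.
type request struct {
	Source           string `json:"source"`
	Profile          string `json:"profile,omitempty"`
	EnableOCR        bool   `json:"enable_ocr,omitempty"`
	Timeout          *int   `json:"timeout,omitempty"`
	ReturnInlineOnly *bool  `json:"return_inline_only,omitempty"`
}

// encode marshals a request to its JSON wire form.
func encode(r request) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	inline := false
	timeout := 120
	out, err := encode(request{
		Source:           "/tmp/report.pdf", // hypothetical path
		Profile:          "scanned",
		Timeout:          &timeout,
		ReturnInlineOnly: &inline, // non-nil pointer, so the explicit false is kept
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Fields left at their zero value (here `EnableOCR`) are omitted from the JSON, so the receiving side can apply its own defaults.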

type DocumentProcessingResponse

type DocumentProcessingResponse struct {
	Source         string             `json:"source"`             // Original source
	Content        string             `json:"content"`            // Processed content (markdown)
	Metadata       *DocumentMetadata  `json:"metadata,omitempty"` // Document metadata
	Images         []ExtractedImage   `json:"images,omitempty"`   // Extracted images
	Tables         []ExtractedTable   `json:"tables,omitempty"`   // Extracted tables
	Diagrams       []ExtractedDiagram `json:"diagrams,omitempty"` // Extracted diagrams
	ProcessingInfo ProcessingInfo     `json:"processing_info"`    // Processing information
	CacheHit       bool               `json:"cache_hit"`          // Whether result came from cache
	Error          string             `json:"error,omitempty"`    // Error message if processing failed
}

DocumentProcessingResponse represents the output from document processing
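A caller that receives this structure as JSON will typically want to check the `error` field before using `content`. The following is a minimal sketch of that check, using a local `response` type that mirrors only a few fields; the type and the `decode` helper are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// response mirrors a subset of DocumentProcessingResponse for illustration only.
type response struct {
	Source   string `json:"source"`
	Content  string `json:"content"`
	CacheHit bool   `json:"cache_hit"`
	Error    string `json:"error,omitempty"`
}

// decode parses a JSON payload and turns a non-empty Error field into a Go error.
func decode(payload []byte) (response, error) {
	var r response
	if err := json.Unmarshal(payload, &r); err != nil {
		return r, fmt.Errorf("decode response: %w", err)
	}
	if r.Error != "" {
		return r, errors.New(r.Error)
	}
	return r, nil
}

func main() {
	r, err := decode([]byte(`{"source":"a.pdf","content":"# Title","cache_hit":true}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(r.CacheHit, r.Content)
}
```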

type DocumentProcessorTool

type DocumentProcessorTool struct {
	// contains filtered or unexported fields
}

DocumentProcessorTool implements document processing by running Docling in a Python subprocess.

func (*DocumentProcessorTool) Definition

func (t *DocumentProcessorTool) Definition() mcp.Tool

Definition returns the MCP tool definition

func (*DocumentProcessorTool) Execute

func (t *DocumentProcessorTool) Execute(ctx context.Context, logger *logrus.Logger, cache *sync.Map, args map[string]interface{}) (*mcp.CallToolResult, error)

Execute processes the document using the Python wrapper and returns the outcome as an MCP tool result.

type ErrorInfo

type ErrorInfo struct {
	Code        string            `json:"code"`              // Error code
	Message     string            `json:"message"`           // Error message
	Details     string            `json:"details,omitempty"` // Additional error details
	Source      string            `json:"source,omitempty"`  // Source that caused the error
	Timestamp   time.Time         `json:"timestamp"`         // When the error occurred
	Context     map[string]string `json:"context,omitempty"` // Additional context
	Recoverable bool              `json:"recoverable"`       // Whether the error is recoverable
}

ErrorInfo represents detailed error information
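The `Recoverable` flag lends itself to a simple retry decision. The sketch below is one possible policy, not something the package documents: `errorInfo` mirrors two fields of ErrorInfo, and `shouldRetry`, the attempt limit, and the error code are hypothetical.

```go
package main

import "fmt"

// errorInfo mirrors a subset of ErrorInfo for illustration only.
type errorInfo struct {
	Code        string
	Recoverable bool
}

// shouldRetry reports whether another attempt is worthwhile: the error must be
// flagged recoverable and the attempt budget must not be exhausted.
func shouldRetry(e errorInfo, attempt, maxAttempts int) bool {
	return e.Recoverable && attempt < maxAttempts
}

func main() {
	e := errorInfo{Code: "TIMEOUT", Recoverable: true} // hypothetical error code
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Printf("attempt %d: retry=%v\n", attempt, shouldRetry(e, attempt, 3))
	}
}
```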

type ExtractedDiagram

type ExtractedDiagram struct {
	ID          string                 `json:"id"`                     // Unique diagram identifier
	Type        string                 `json:"type"`                   // Type of diagram (flowchart, chart, diagram, etc.)
	Caption     string                 `json:"caption,omitempty"`      // Diagram caption if available
	Description string                 `json:"description,omitempty"`  // Generated description of the diagram
	DiagramType string                 `json:"diagram_type,omitempty"` // Classified diagram type (flowchart, chart, etc.)
	MermaidCode string                 `json:"mermaid_code,omitempty"` // Generated Mermaid syntax for the diagram
	Base64Data  string                 `json:"base64_data,omitempty"`  // Base64-encoded image data for LLM vision processing
	Elements    []DiagramElement       `json:"elements,omitempty"`     // Text elements within the diagram
	PageNumber  int                    `json:"page_number,omitempty"`  // Page number where diagram appears
	BoundingBox *BoundingBox           `json:"bounding_box,omitempty"` // Position on page
	Confidence  float64                `json:"confidence,omitempty"`   // Confidence score for diagram analysis
	Properties  map[string]interface{} `json:"properties,omitempty"`   // Additional diagram-specific properties
}

ExtractedDiagram represents a diagram extracted from the document
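One way a consumer might use the `MermaidCode` and `Confidence` fields is to keep only diagrams above a chosen confidence and wrap their Mermaid source in fenced Markdown blocks. This is a sketch under assumptions: the `diagram` type mirrors three fields of ExtractedDiagram, and the `mermaidBlocks` helper and its threshold are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// diagram mirrors a subset of ExtractedDiagram for illustration only.
type diagram struct {
	ID          string
	MermaidCode string
	Confidence  float64
}

// mermaidBlocks returns one fenced Markdown block per diagram that has Mermaid
// code and meets the caller-chosen confidence threshold.
func mermaidBlocks(ds []diagram, minConfidence float64) []string {
	fence := strings.Repeat("`", 3) // built at runtime to avoid a literal fence here
	var out []string
	for _, d := range ds {
		if d.MermaidCode == "" || d.Confidence < minConfidence {
			continue
		}
		out = append(out, fence+"mermaid\n"+strings.TrimSpace(d.MermaidCode)+"\n"+fence)
	}
	return out
}

func main() {
	blocks := mermaidBlocks([]diagram{
		{ID: "d1", MermaidCode: "graph TD; A-->B", Confidence: 0.9},
		{ID: "d2", MermaidCode: "graph TD; C-->D", Confidence: 0.2},
	}, 0.5)
	fmt.Println(strings.Join(blocks, "\n\n"))
}
```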

type ExtractedImage

type ExtractedImage struct {
	ID            string       `json:"id"`                       // Unique image identifier
	Type          string       `json:"type"`                     // Type of image (picture, table, chart, diagram)
	Caption       string       `json:"caption,omitempty"`        // Image caption if available
	AltText       string       `json:"alt_text,omitempty"`       // Alternative text
	Format        string       `json:"format"`                   // Image format (PNG, JPEG, etc.)
	Width         int          `json:"width,omitempty"`          // Image width in pixels
	Height        int          `json:"height,omitempty"`         // Image height in pixels
	Size          int64        `json:"size,omitempty"`           // Image size in bytes
	FilePath      string       `json:"file_path,omitempty"`      // Path to saved image file
	PageNumber    int          `json:"page_number,omitempty"`    // Page number where image appears
	BoundingBox   *BoundingBox `json:"bounding_box,omitempty"`   // Position on page
	ExtractedText []string     `json:"extracted_text,omitempty"` // Text elements extracted from the image
}

ExtractedImage represents an image extracted from the document

type ExtractedTable

type ExtractedTable struct {
	ID          string       `json:"id"`                     // Unique table identifier
	Caption     string       `json:"caption,omitempty"`      // Table caption if available
	Headers     []string     `json:"headers,omitempty"`      // Column headers
	Rows        [][]string   `json:"rows"`                   // Table data rows
	PageNumber  int          `json:"page_number,omitempty"`  // Page number where table appears
	BoundingBox *BoundingBox `json:"bounding_box,omitempty"` // Position on page
	Markdown    string       `json:"markdown,omitempty"`     // Markdown representation
	CSV         string       `json:"csv,omitempty"`          // CSV representation
}

ExtractedTable represents a table extracted from the document
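If only `Headers` and `Rows` are populated, a CSV form can be rebuilt on the caller's side with the standard `encoding/csv` package. The sketch below assumes a local `table` type mirroring those two fields; `toCSV` is a hypothetical helper and is not claimed to match how the package fills its own `CSV` field.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// table mirrors a subset of ExtractedTable for illustration only.
type table struct {
	Headers []string
	Rows    [][]string
}

// toCSV writes the header row (if any) followed by the data rows,
// letting encoding/csv handle quoting of commas and quotes.
func toCSV(t table) (string, error) {
	var b strings.Builder
	w := csv.NewWriter(&b)
	if len(t.Headers) > 0 {
		if err := w.Write(t.Headers); err != nil {
			return "", err
		}
	}
	// WriteAll flushes the writer and reports any buffered error.
	if err := w.WriteAll(t.Rows); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := toCSV(table{
		Headers: []string{"name", "note"},
		Rows:    [][]string{{"a", "x, y"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```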

type HardwareAcceleration

type HardwareAcceleration string

HardwareAcceleration defines the hardware acceleration mode

const (
	HardwareAccelerationAuto HardwareAcceleration = "auto" // Auto-detect best option
	HardwareAccelerationMPS  HardwareAcceleration = "mps"  // Metal Performance Shaders (macOS)
	HardwareAccelerationCUDA HardwareAcceleration = "cuda" // CUDA (NVIDIA GPUs)
	HardwareAccelerationCPU  HardwareAcceleration = "cpu"  // CPU-only processing
)

type LLMConfig

type LLMConfig struct {
	Provider string
	Model    string
	APIKey   string
	BaseURL  string
}

LLMConfig contains configuration for the LLM client

type OutputFormat

type OutputFormat string

OutputFormat defines the output format for processed documents

const (
	OutputFormatMarkdown OutputFormat = "markdown" // Markdown output (default)
	OutputFormatJSON     OutputFormat = "json"     // JSON metadata
	OutputFormatBoth     OutputFormat = "both"     // Both markdown and JSON
)

type ProcessingInfo

type ProcessingInfo struct {
	ProcessingMode       ProcessingMode       `json:"processing_mode"`           // Mode used for processing
	ProcessingMethod     string               `json:"processing_method"`         // Concise description of processing method used
	HardwareAcceleration HardwareAcceleration `json:"hardware_acceleration"`     // Hardware acceleration used
	VisionModel          string               `json:"vision_model,omitempty"`    // Vision model used (if any)
	OCREnabled           bool                 `json:"ocr_enabled"`               // Whether OCR was enabled
	OCRLanguages         []string             `json:"ocr_languages,omitempty"`   // OCR languages used
	ProcessingTime       float64              `json:"processing_time"`           // Time taken to process in seconds
	PythonVersion        string               `json:"python_version,omitempty"`  // Python version used
	DoclingVersion       string               `json:"docling_version,omitempty"` // Docling version used
	CacheKey             string               `json:"cache_key,omitempty"`       // Cache key used
	Timestamp            time.Time            `json:"timestamp"`                 // Processing timestamp
	TokenUsage           *TokenUsage          `json:"token_usage,omitempty"`     // Token usage from external LLM (if available)
}

ProcessingInfo contains information about the processing operation

type ProcessingMode

type ProcessingMode string

ProcessingMode defines the type of document processing to perform

const (
	ProcessingModeBasic    ProcessingMode = "basic"    // Fast, code-only processing
	ProcessingModeAdvanced ProcessingMode = "advanced" // Vision model with layout preservation
	ProcessingModeOCR      ProcessingMode = "ocr"      // OCR for scanned documents
	ProcessingModeTables   ProcessingMode = "tables"   // Table extraction focus
	ProcessingModeImages   ProcessingMode = "images"   // Image extraction focus
)

type ProcessingProfile

type ProcessingProfile string

ProcessingProfile defines preset configurations for common document processing scenarios

const (
	ProfileBasic          ProcessingProfile = "basic"           // Text extraction only (fast processing)
	ProfileTextAndImage   ProcessingProfile = "text-and-image"  // Text and image extraction with tables
	ProfileScanned        ProcessingProfile = "scanned"         // OCR-focused processing for scanned documents
	ProfileLLMSmolDocling ProcessingProfile = "llm-smoldocling" // Text and image extraction enhanced with SmolDocling vision model
	ProfileLLMExternal    ProcessingProfile = "llm-external"    // Text and image extraction enhanced with external vision LLM for diagram conversion to Mermaid
)
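Because a profile is a plain string type, any string can be converted to one, so callers accepting user input may want to validate it against the known constants. A minimal sketch, assuming locally redeclared constants with the same values as above; `parseProfile` is a hypothetical helper, and the case-insensitive matching is a choice made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// processingProfile and its constants mirror ProcessingProfile for illustration only.
type processingProfile string

const (
	profileBasic          processingProfile = "basic"
	profileTextAndImage   processingProfile = "text-and-image"
	profileScanned        processingProfile = "scanned"
	profileLLMSmolDocling processingProfile = "llm-smoldocling"
	profileLLMExternal    processingProfile = "llm-external"
)

// parseProfile normalises the input and rejects values outside the known set.
func parseProfile(s string) (processingProfile, error) {
	switch p := processingProfile(strings.ToLower(strings.TrimSpace(s))); p {
	case profileBasic, profileTextAndImage, profileScanned,
		profileLLMSmolDocling, profileLLMExternal:
		return p, nil
	default:
		return "", fmt.Errorf("unknown processing profile %q", s)
	}
}

func main() {
	for _, in := range []string{"scanned", " LLM-External ", "fancy"} {
		p, err := parseProfile(in)
		fmt.Printf("%q -> %q, err=%v\n", in, p, err)
	}
}
```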

type SystemInfo

type SystemInfo struct {
	Platform             string                 `json:"platform"`                        // Operating system
	Architecture         string                 `json:"architecture"`                    // CPU architecture
	PythonPath           string                 `json:"python_path,omitempty"`           // Path to Python executable
	PythonVersion        string                 `json:"python_version,omitempty"`        // Python version
	DoclingVersion       string                 `json:"docling_version,omitempty"`       // Docling version
	DoclingAvailable     bool                   `json:"docling_available"`               // Whether Docling is available
	HardwareAcceleration []HardwareAcceleration `json:"hardware_acceleration_available"` // Available acceleration options
	CacheDirectory       string                 `json:"cache_directory,omitempty"`       // Cache directory path
	CacheEnabled         bool                   `json:"cache_enabled"`                   // Whether caching is enabled
	MaxFileSize          int                    `json:"max_file_size"`                   // Maximum file size in MB
	DefaultTimeout       int                    `json:"default_timeout"`                 // Default timeout in seconds
}

SystemInfo represents system information for diagnostics

type TableFormerMode

type TableFormerMode string

TableFormerMode defines the TableFormer processing mode for table structure recognition

const (
	TableFormerModeFast     TableFormerMode = "fast"     // Faster but less accurate table processing
	TableFormerModeAccurate TableFormerMode = "accurate" // More accurate but slower table processing (default)
)

type TokenUsage

type TokenUsage struct {
	PromptTokens     int `json:"prompt_tokens,omitempty"`     // Tokens used in prompts
	CompletionTokens int `json:"completion_tokens,omitempty"` // Tokens used in completions
	TotalTokens      int `json:"total_tokens,omitempty"`      // Total tokens used
}

TokenUsage represents token consumption from external LLM providers
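Since `ProcessingInfo.TokenUsage` is a pointer and may be absent, code that aggregates usage across several documents needs to tolerate `nil`. The sketch below shows one way to do that; `tokenUsage` mirrors the struct above and `sumUsage` is a hypothetical helper.

```go
package main

import "fmt"

// tokenUsage mirrors TokenUsage for illustration only.
type tokenUsage struct {
	PromptTokens     int
	CompletionTokens int
	TotalTokens      int
}

// sumUsage adds up usage records, skipping nil entries
// (for example, results that never called an external LLM).
func sumUsage(us ...*tokenUsage) tokenUsage {
	var total tokenUsage
	for _, u := range us {
		if u == nil {
			continue
		}
		total.PromptTokens += u.PromptTokens
		total.CompletionTokens += u.CompletionTokens
		total.TotalTokens += u.TotalTokens
	}
	return total
}

func main() {
	total := sumUsage(
		&tokenUsage{PromptTokens: 10, CompletionTokens: 5, TotalTokens: 15},
		nil,
		&tokenUsage{PromptTokens: 1, CompletionTokens: 2, TotalTokens: 3},
	)
	fmt.Printf("%+v\n", total)
}
```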

type VisionProcessingMode

type VisionProcessingMode string

VisionProcessingMode defines the vision model processing mode for enhanced document understanding

const (
	VisionModeStandard    VisionProcessingMode = "standard"    // Standard vision processing
	VisionModeSmolDocling VisionProcessingMode = "smoldocling" // Compact vision-language model (256M parameters)
	VisionModeAdvanced    VisionProcessingMode = "advanced"    // Advanced vision processing with remote services
)
