pdf

package

v0.16.7 (latest) Published: Jul 14, 2025 License: Apache-2.0 Imports: 14 Imported by: 0

Repository

github.com/sammcj/mcp-devtools

README

PDF Processing Tool

This tool extracts text and images from PDF files using the pdfcpu library. It attempts to retain the source formatting and layout, and writes a markdown file with references to the extracted images.

Features

Text Extraction: Extracts text content from PDF pages using pdfcpu's content extraction
Image Extraction: Extracts embedded images from PDFs with proper naming and organisation
Markdown Output: Generates well-formatted markdown with preserved structure
Multi-page Support: Process all pages or specific page ranges
Automatic Linking: Links extracted images in the correct locations within the markdown
Flexible Output: Choose output directory or use the same directory as the source PDF

Usage

Basic Usage

{
  "name": "pdf",
  "arguments": {
    "file_path": "/absolute/path/to/document.pdf"
  }
}

Advanced Usage

{
  "name": "pdf",
  "arguments": {
    "file_path": "/absolute/path/to/document.pdf",
    "output_dir": "/absolute/path/to/output",
    "extract_images": true,
    "pages": "1-5"
  }
}

Parameters

- file_path (required): Absolute file path to the PDF document to process
- output_dir (optional): Output directory for markdown and images (defaults to same directory as PDF)
- extract_images (optional): Whether to extract images from the PDF (default: true)
- pages (optional): Page range to process. Options:
  - "all" - Process all pages (default)
  - "1-5" - Process pages 1 through 5
  - "1,3,5" - Process pages 1, 3, and 5
  - "1-3,7,10-12" - Process pages 1-3, 7, and 10-12
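
The page range syntax above can be expanded into concrete page numbers along these lines. This is a minimal sketch rather than the package's own parser: `parsePages` is a hypothetical helper, and rejecting out-of-bounds or reversed ranges is an assumption about how invalid input should be treated.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parsePages expands a page specification such as "1-3,7,10-12" into a
// sorted, de-duplicated list of 1-based page numbers. "all" (or an empty
// string) selects every page from 1 to total.
// Hypothetical helper: not taken from the package source.
func parsePages(spec string, total int) ([]int, error) {
	spec = strings.TrimSpace(spec)
	if spec == "" || strings.EqualFold(spec, "all") {
		pages := make([]int, 0, total)
		for p := 1; p <= total; p++ {
			pages = append(pages, p)
		}
		return pages, nil
	}

	seen := make(map[int]bool)
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		first, last, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(strings.TrimSpace(first))
		if err != nil {
			return nil, fmt.Errorf("invalid page %q", part)
		}
		end := start
		if isRange {
			end, err = strconv.Atoi(strings.TrimSpace(last))
			if err != nil {
				return nil, fmt.Errorf("invalid page range %q", part)
			}
		}
		// Assumption: reversed or out-of-bounds ranges are errors.
		if start < 1 || end < start || end > total {
			return nil, fmt.Errorf("page range %q is outside 1-%d", part, total)
		}
		for p := start; p <= end; p++ {
			seen[p] = true
		}
	}

	pages := make([]int, 0, len(seen))
	for p := range seen {
		pages = append(pages, p)
	}
	sort.Ints(pages)
	return pages, nil
}

func main() {
	pages, err := parsePages("1-3,7,10-12", 12)
	fmt.Println(pages, err) // prints: [1 2 3 7 10 11 12] <nil>
}
```

Collecting pages in a set before sorting means overlapping entries such as "1-3,2" yield each page once.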

Output

The tool creates:

- Markdown File: {basename}.md in the output directory containing:
  - Document title and metadata
  - Page-by-page content with headers
  - Embedded image references where appropriate
- Image Directory: {basename}_images/ containing:
  - All extracted images with descriptive names
  - Images organised by page number
- JSON Response: Contains:
  - Path to generated markdown file
  - List of extracted image paths
  - Processing statistics

Example Output Structure

/output/directory/
├── document.md                 # Generated markdown
└── document_images/           # Extracted images
    ├── document_page_1_img_1.jpg
    ├── document_page_1_img_2.png
    ├── document_page_2_img_1.jpg
    └── ...
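
The layout above can be derived from the PDF path with a few standard library calls. The sketch below is illustrative only: `outputPaths` and `imageName` are hypothetical helpers, and the naming pattern is inferred from the example tree rather than taken from the package source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// outputPaths derives the markdown file and image directory for a PDF,
// following the {basename}.md and {basename}_images/ layout. An empty
// outputDir falls back to the directory that contains the PDF.
// Hypothetical helper: not taken from the package source.
func outputPaths(pdfPath, outputDir string) (markdownFile, imageDir string) {
	if outputDir == "" {
		outputDir = filepath.Dir(pdfPath)
	}
	base := strings.TrimSuffix(filepath.Base(pdfPath), filepath.Ext(pdfPath))
	return filepath.Join(outputDir, base+".md"), filepath.Join(outputDir, base+"_images")
}

// imageName builds a file name such as document_page_1_img_2.png.
// The pattern is an assumption based on the example tree.
func imageName(base string, page, index int, ext string) string {
	return fmt.Sprintf("%s_page_%d_img_%d.%s", base, page, index, ext)
}

func main() {
	md, images := outputPaths("/docs/document.pdf", "")
	fmt.Println(md)
	fmt.Println(filepath.Join(images, imageName("document", 1, 2, "png")))
}
```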

Limitations

Text Quality: Text extraction quality depends on the PDF structure. Complex layouts may not be perfectly preserved.
Font Dependencies: Some text rendering may be affected by missing font information.
Table Extraction: Tables are extracted as raw text without structure preservation.
Scanned PDFs: This tool does not perform OCR on scanned documents.

Technical Details

The tool uses pdfcpu's content extraction capabilities:

api.ExtractContentFile() for raw page content
api.ExtractImagesFile() for image extraction
api.PageCountFile() for page counting

The extracted content is processed to:

Remove PDF-specific commands and directives
Extract readable text from PDF text operations
Format content as markdown with appropriate headers
Link extracted images in context
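
To illustrate what "extract readable text from PDF text operations" can involve, here is a simplified sketch that pulls literal strings out of the `Tj` and `TJ` text-showing operators in a raw content stream. It is an assumption about the general approach, not the package's implementation: `extractText` is a hypothetical helper, and it ignores hex strings, font encodings, octal escapes, and arrays whose strings contain `]`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// showOp matches the operand of a Tj operator (one literal string) or a
	// TJ operator (an array of strings and kerning numbers).
	showOp = regexp.MustCompile(`(\[[^\]]*\]|\((?:\\.|[^\\()])*\))\s*(?:Tj|TJ)`)
	// literal matches one PDF literal string, allowing backslash escapes.
	literal = regexp.MustCompile(`\(((?:\\.|[^\\()])*)\)`)
	// unescape resolves the escapes for parentheses and backslashes only.
	unescape = strings.NewReplacer(`\(`, "(", `\)`, ")", `\\`, `\`)
)

// sample is an illustrative content stream, not output from a real PDF.
const sample = `BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
0 -14 Td
[(PDF) -20 ( text \(sample\))] TJ
ET`

// extractText returns one line of text per text-showing operator and drops
// every other operator (BT, Tf, Td, ET, ...).
// Hypothetical helper: not taken from the package source.
func extractText(content string) string {
	var lines []string
	for _, op := range showOp.FindAllStringSubmatch(content, -1) {
		var sb strings.Builder
		for _, lit := range literal.FindAllStringSubmatch(op[1], -1) {
			sb.WriteString(unescape.Replace(lit[1]))
		}
		lines = append(lines, sb.String())
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(extractText(sample))
}
```

Real content streams may be compressed or use font-specific encodings, which is one reason extraction quality varies with PDF structure.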

Error Handling

The tool provides detailed error messages for:

Invalid file paths or missing files
Unsupported file formats
Invalid page ranges
Extraction failures

Pages that fail to extract are noted in the output with an error message, and processing continues with the remaining pages.
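
The continue-on-failure behaviour can be sketched as a loop that records an error note for a failed page and moves on. Everything here is illustrative: `renderPages`, the `extract` callback, `errDemo`, and the exact markdown wording are assumptions, not the package's actual output format.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errDemo is a stand-in error used only for this example.
var errDemo = errors.New("stream could not be decoded")

// renderPages builds page-by-page markdown. extract stands in for whatever
// pulls text from a single page. A failing page gets an error note under its
// header and is added to failed; the loop then carries on.
// Hypothetical helper: not taken from the package source.
func renderPages(pages []int, extract func(page int) (string, error)) (markdown string, failed []int) {
	var sb strings.Builder
	for _, p := range pages {
		fmt.Fprintf(&sb, "## Page %d\n\n", p)
		text, err := extract(p)
		if err != nil {
			failed = append(failed, p)
			fmt.Fprintf(&sb, "*Error extracting page %d: %v*\n\n", p, err)
			continue
		}
		sb.WriteString(text + "\n\n")
	}
	return sb.String(), failed
}

func main() {
	md, failed := renderPages([]int{1, 2, 3}, func(page int) (string, error) {
		if page == 2 {
			return "", errDemo
		}
		return "ok", nil
	})
	fmt.Print(md)
	fmt.Println("failed pages:", failed)
}
```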

Dependencies

pdfcpu v0.11.0+
Go standard library packages for file operations and text processing

Documentation

Index

type PDFRequest
type PDFResponse
type PDFTool
- func (t *PDFTool) Definition() mcp.Tool
- func (t *PDFTool) Execute(ctx context.Context, logger *logrus.Logger, cache *sync.Map, ...) (*mcp.CallToolResult, error)

Types

type PDFRequest

type PDFRequest struct {
	// FilePath is the absolute path to the PDF file to process
	FilePath string `json:"file_path"`

	// OutputDir is the directory where markdown and images will be saved
	OutputDir string `json:"output_dir"`

	// ExtractImages indicates whether to extract images from the PDF
	ExtractImages bool `json:"extract_images"`

	// Pages specifies which pages to process (e.g., "1-5", "1,3,5", "all")
	Pages string `json:"pages"`
}

PDFRequest represents a request to process a PDF file
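
Because `ExtractImages` is a plain `bool`, a missing `extract_images` key cannot be told apart from `false` after decoding into a zero-valued struct. One way to honour the documented defaults is to pre-populate the struct before decoding. The sketch below assumes this approach: `parseRequest` is a hypothetical helper, and the absolute-path check is an assumption based on the parameter description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"path/filepath"
)

// PDFRequest mirrors the documented struct.
type PDFRequest struct {
	FilePath      string `json:"file_path"`
	OutputDir     string `json:"output_dir"`
	ExtractImages bool   `json:"extract_images"`
	Pages         string `json:"pages"`
}

// parseRequest converts raw tool arguments into a PDFRequest and applies the
// documented defaults.
// Hypothetical helper: not taken from the package source.
func parseRequest(args map[string]interface{}) (*PDFRequest, error) {
	// Start from the defaults; json.Unmarshal leaves absent keys untouched.
	req := &PDFRequest{ExtractImages: true, Pages: "all"}

	raw, err := json.Marshal(args)
	if err != nil {
		return nil, err
	}
	if err := json.Unmarshal(raw, req); err != nil {
		return nil, err
	}

	if req.FilePath == "" {
		return nil, errors.New("file_path is required")
	}
	if !filepath.IsAbs(req.FilePath) {
		return nil, fmt.Errorf("file_path must be absolute: %q", req.FilePath)
	}
	if req.OutputDir == "" {
		req.OutputDir = filepath.Dir(req.FilePath)
	}
	if req.Pages == "" {
		req.Pages = "all"
	}
	return req, nil
}

func main() {
	req, err := parseRequest(map[string]interface{}{
		"file_path": "/absolute/path/to/document.pdf",
		"pages":     "1-5",
	})
	fmt.Printf("%+v %v\n", req, err)
}
```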

type PDFResponse

type PDFResponse struct {
	// FilePath is the original PDF file that was processed
	FilePath string `json:"file_path"`

	// MarkdownFile is the path to the generated markdown file
	MarkdownFile string `json:"markdown_file"`

	// ExtractedImages is a list of extracted image file paths
	ExtractedImages []string `json:"extracted_images"`

	// PagesProcessed is the number of pages that were processed
	PagesProcessed int `json:"pages_processed"`

	// TotalPages is the total number of pages in the PDF
	TotalPages int `json:"total_pages"`

	// OutputDir is the directory where files were saved
	OutputDir string `json:"output_dir"`
}

PDFResponse represents the result of PDF processing
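
For reference, the struct tags above determine the shape of the JSON returned to the caller. The sketch below serialises a response with illustrative values. The `encode` helper is hypothetical and the field values are made up for the example. How the package wraps this JSON in an `mcp.CallToolResult` is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PDFResponse mirrors the documented struct.
type PDFResponse struct {
	FilePath        string   `json:"file_path"`
	MarkdownFile    string   `json:"markdown_file"`
	ExtractedImages []string `json:"extracted_images"`
	PagesProcessed  int      `json:"pages_processed"`
	TotalPages      int      `json:"total_pages"`
	OutputDir       string   `json:"output_dir"`
}

// encode serialises a response as compact JSON.
// Hypothetical helper: not taken from the package source.
func encode(resp PDFResponse) (string, error) {
	out, err := json.Marshal(resp)
	return string(out), err
}

func main() {
	// Illustrative values only.
	resp := PDFResponse{
		FilePath:        "/docs/document.pdf",
		MarkdownFile:    "/docs/document.md",
		ExtractedImages: []string{"/docs/document_images/document_page_1_img_1.jpg"},
		PagesProcessed:  5,
		TotalPages:      12,
		OutputDir:       "/docs",
	}
	out, err := encode(resp)
	fmt.Println(out, err)
}
```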

type PDFTool

type PDFTool struct{}

PDFTool implements PDF processing with pdfcpu

func (*PDFTool) Definition

func (t *PDFTool) Definition() mcp.Tool

Definition returns the tool's definition for MCP registration

func (*PDFTool) Execute

func (t *PDFTool) Execute(ctx context.Context, logger *logrus.Logger, cache *sync.Map, args map[string]interface{}) (*mcp.CallToolResult, error)

Execute processes the PDF file
