classify-docs

command module

Version: v0.0.0-...-349ac01 | Published: Dec 2, 2025 | License: MIT

Repository

github.com/JaimeStill/go-agents

README

Document Classification Tool

A proof-of-concept tool for processing PDF documents and analyzing security classification markings using go-agents vision capabilities. This tool validates document processing patterns for the future go-agents-document-context library.

Overview

The classify-docs tool demonstrates a document processing → agent analysis architecture by:

Processing PDFs - Extracting pages and converting to images for vision API consumption
Validating Patterns - Testing sequential processing with context accumulation for both classification and prompt generation
Informing Design - Prototyping interfaces and patterns for future document processing libraries

Why This Tool?

The Azure OpenAI Vision API does not accept PDF input natively, so each document must be converted to images before analysis. This tool provides the infrastructure for that preprocessing while validating reusable architecture patterns.
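The preprocessing step can be illustrated with a minimal sketch that shells out to the ImageMagick 7 magick CLI to render one PDF page as an image. The helper names renderArgs and renderPage, the 300 DPI density, and the PNG output are assumptions made for illustration; they are not taken from the tool's pkg/document package.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// renderArgs builds the magick argument list for rendering a single PDF page.
// The page index is zero-based, matching ImageMagick's "file.pdf[N]" selector.
// (Hypothetical helper, not the tool's actual API.)
func renderArgs(pdfPath string, page, dpi int, outPath string) []string {
	return []string{
		"-density", strconv.Itoa(dpi), // rasterization resolution, set before reading the PDF
		fmt.Sprintf("%s[%d]", pdfPath, page),
		outPath,
	}
}

// renderPage runs magick and includes its combined output in any error.
func renderPage(pdfPath string, page, dpi int, outPath string) error {
	cmd := exec.Command("magick", renderArgs(pdfPath, page, dpi, outPath)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("magick failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: render <input.pdf> <output.png>")
		return
	}
	// Render the first page at an assumed 300 DPI.
	if err := renderPage(os.Args[1], 0, 300, os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Keeping argument construction separate from execution makes the command line testable without ImageMagick installed.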

Project Documentation

PROJECT.md - Complete project roadmap, architecture layers, and phase development plan
_context/.archive/ - Historical development summaries and implementation decisions

Prerequisites

Go 1.25.2+ - This tool uses a nested Go module

ImageMagick v7+ - Required for PDF page rendering

# Arch Linux
sudo pacman -S imagemagick

# Ubuntu/Debian
sudo apt install imagemagick

# macOS
brew install imagemagick

# Verify installation
magick -version

Commands

The classify-docs tool provides two main commands integrated into main.go:

generate-prompt - System Prompt Generation

Generate classification system prompts from reference policy documents using sequential processing with context accumulation.

Usage:

go run . generate-prompt [options]

Flags:

--config (default: "config.classify-gpt4o-key.json") - Path to agent configuration file
--token - API token (overrides token in config file)
--references (default: "_context") - Directory containing reference PDF documents
--no-cache - Disable cache usage (force regeneration)
--timeout (default: "30m") - Operation timeout
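As a rough sketch of how the flags listed above could be declared with Go's standard flag package: the flag names and defaults come from the list, while the promptOptions struct and parsePromptFlags function are hypothetical names used only for this example.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// promptOptions mirrors the documented generate-prompt flags.
// (Hypothetical type, not the tool's actual configuration struct.)
type promptOptions struct {
	Config     string
	Token      string
	References string
	NoCache    bool
	Timeout    time.Duration
}

// parsePromptFlags parses args using the defaults documented in this README.
func parsePromptFlags(args []string) (promptOptions, error) {
	var o promptOptions
	fs := flag.NewFlagSet("generate-prompt", flag.ContinueOnError)
	fs.StringVar(&o.Config, "config", "config.classify-gpt4o-key.json", "path to agent configuration file")
	fs.StringVar(&o.Token, "token", "", "API token (overrides token in config file)")
	fs.StringVar(&o.References, "references", "_context", "directory containing reference PDF documents")
	fs.BoolVar(&o.NoCache, "no-cache", false, "disable cache usage (force regeneration)")
	fs.DurationVar(&o.Timeout, "timeout", 30*time.Minute, "operation timeout")
	err := fs.Parse(args)
	return o, err
}

func main() {
	opts, err := parsePromptFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

The flag package accepts both -timeout and --timeout, and a duration flag takes values such as 30m or 1h15m.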

Example:

export AZURE_API_KEY="your-api-key"
go run . generate-prompt --token "$AZURE_API_KEY"

Sample Output:

Discovering reference documents...
  Found: dodm-5200.01-enc4.pdf (9 pages)
  Found: security-classification-markings.pdf (2 pages)

Generating system prompt from 11 pages...
  Page 1/11 processed
  Page 2/11 processed
  ...
  Page 11/11 processed

System prompt generated successfully!
  Cached: .cache/system-prompt.json

---
[Generated system prompt content]
---

The generated system prompt is saved to .cache/system-prompt.json with metadata including timestamp and reference document list.

classify - Document Classification

Classify security markings in PDF documents using the generated system prompt and vision capabilities.

Usage:

go run . classify [options]

Flags:

--config (default: "config.classify-o4-mini.json") - Path to agent configuration file
--token - API token (overrides token in config file)
--input - Directory containing PDF documents to classify
--output (default: "classification-results.json") - Output JSON file path
--system-prompt (default: ".cache/system-prompt.json") - Path to cached system prompt
--timeout (default: "15m") - Operation timeout

Example:

go run . classify --token "$AZURE_API_KEY" --input _context/marked-documents

Sample Output:

[
  {
    "file": "marked-documents_6.pdf",
    "classification": "SECRET//NOFORN",
    "confidence": "HIGH",
    "markings_found": ["SECRET//NOFORN"],
    "classification_rationale": "Banner marking 'SECRET//NOFORN' clearly visible..."
  },
  {
    "file": "marked-documents_19.pdf",
    "classification": "SECRET",
    "confidence": "MEDIUM",
    "markings_found": ["SECRET"],
    "classification_rationale": "A clear classification banner 'SECRET' appears..."
  }
]

Accuracy Note: The tool currently achieves 96.3% accuracy (26/27 documents) on the test set. Known limitation: for Document 19 (marked-documents_19.pdf), the model cannot consistently detect the faded NOFORN caveat stamp, so it classifies the document as 'SECRET' instead of 'SECRET//NOFORN'. The result is reported with MEDIUM confidence, which correctly flags it for human review.
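A downstream consumer could use the confidence field to route results to a reviewer. The sketch below decodes the sample output shown above and selects every entry that is not HIGH confidence. The JSON keys come from the sample; the result type and the needsReview and parseResults functions are hypothetical names for illustration, and treating anything other than HIGH as review-worthy is an assumed policy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result mirrors one entry of the sample classification output.
type result struct {
	File           string   `json:"file"`
	Classification string   `json:"classification"`
	Confidence     string   `json:"confidence"`
	MarkingsFound  []string `json:"markings_found"`
	Rationale      string   `json:"classification_rationale"`
}

// parseResults decodes a JSON array of classification results.
func parseResults(data []byte) ([]result, error) {
	var rs []result
	err := json.Unmarshal(data, &rs)
	return rs, err
}

// needsReview returns the results whose confidence is anything other than HIGH.
func needsReview(rs []result) []result {
	var flagged []result
	for _, r := range rs {
		if r.Confidence != "HIGH" {
			flagged = append(flagged, r)
		}
	}
	return flagged
}

const sample = `[
  {"file": "marked-documents_6.pdf", "classification": "SECRET//NOFORN", "confidence": "HIGH",
   "markings_found": ["SECRET//NOFORN"], "classification_rationale": "Banner marking 'SECRET//NOFORN' clearly visible..."},
  {"file": "marked-documents_19.pdf", "classification": "SECRET", "confidence": "MEDIUM",
   "markings_found": ["SECRET"], "classification_rationale": "A clear classification banner 'SECRET' appears..."}
]`

func main() {
	rs, err := parseResults([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, r := range needsReview(rs) {
		fmt.Printf("review %s: %s (%s)\n", r.File, r.Classification, r.Confidence)
	}
}
```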

Utility Tools

test-config - Configuration verification utility:

go run ./cmd/test-config/main.go <config-file>

test-render - PDF rendering verification utility:

cd tools/classify-docs
go build -o test-render ./cmd/test-render
./test-render --input <pdf-path> [options]

See source files for detailed usage information.

Testing

Run the complete test suite:

cd tools/classify-docs
go test ./tests/... -v

Test Packages:

tests/document/ - PDF processing and image conversion (7 tests)
tests/config/ - Configuration loading and merging (4 tests)
tests/cache/ - System prompt caching (4 tests)
tests/retry/ - Retry logic and exponential backoff (6 tests)
tests/processing/ - Parallel and sequential processors (12 tests)
tests/encoding/ - Base64 data URI image encoding (5 tests)
tests/prompt/ - System prompt generation validation (3 tests)

Total: 45 tests, all passing

The suite automatically skips image conversion tests when ImageMagick is not available.
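One common way to implement this kind of skip is to probe PATH for the magick binary before running a test. The sketch below shows that pattern; the hasExecutable helper is a hypothetical name, and the tool's tests may detect ImageMagick differently.

```go
package main

import (
	"fmt"
	"os/exec"
)

// hasExecutable reports whether name resolves to an executable on PATH.
// (Hypothetical helper for illustration.)
func hasExecutable(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

func main() {
	if !hasExecutable("magick") {
		fmt.Println("magick not found: image conversion tests would be skipped")
		return
	}
	fmt.Println("magick found: image conversion tests can run")
}
```

Inside a test function, the same check would typically read: if !hasExecutable("magick") { t.Skip("ImageMagick not available") }.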

Run with coverage:

go test ./tests/... -coverprofile=coverage.out
go tool cover -func=coverage.out
go tool cover -html=coverage.out -o coverage.html

Development Status

Phase 1: Complete ✅

Document Processing Primitives - PDF extraction and image conversion infrastructure

See: 01-document-processing-primitives.md

Phase 2: Complete ✅

Processing Infrastructure - Generic parallel and sequential processors with retry logic

Generic processing functions with type parameters
Parallel processor with worker pools
Sequential processor with context accumulation
Retry infrastructure with exponential backoff
Unified configuration management

See: 02-processing-infrastructure.md

Phase 3: Complete ✅

Caching Infrastructure - System prompt caching (merged into Phase 2)

JSON-based cache persistence
Metadata tracking (timestamp, reference documents)
Configuration integration

See: 02-processing-infrastructure.md

Phase 4: Complete ✅

System Prompt Generation - Sequential processing with context accumulation

Base64 data URI encoding for vision API
Sequential processor integration with vision agent
Progressive system prompt refinement across document pages
Retry infrastructure tuned for Azure rate limiting (13s initial backoff, 1.2 multiplier)
Generic progress reporting with result visibility
CLI tool for prompt generation

See: 03-system-prompt-generation.md

Phase 5: Complete ✅

Document Classification - Sequential processing with conservative confidence scoring

Per-page classification with context accumulation
Comprehensive classification prompt with self-check verification
Conservative confidence scoring (HIGH/MEDIUM/LOW)
Suspicion-based confidence for documents with missing caveats
Optimized for o4-mini visual reasoning model
Achieved 96.3% accuracy (26/27 documents)

See: 04-document-classification.md

Phase 6: Complete ✅

Testing & Validation - Validation was completed against a 27-document test set, achieving 96.3% accuracy (26/27). Suspicion-based confidence scoring successfully flags edge cases for human review. Prototype validation is complete and the components are ready for extraction.

See PROJECT.md for complete roadmap and architecture details.

Troubleshooting

ImageMagick Not Found

Error: exec: "magick": executable file not found in $PATH

Solution: Install ImageMagick (see Prerequisites section)

PDF Not Authorized

Error: not authorized by security policy

Solution: Edit /etc/ImageMagick-7/policy.xml and ensure the PDF coder policy grants read and write rights:

<policy domain="coder" rights="read|write" pattern="PDF" />

Permission Denied

Error: failed to create output directory: permission denied

Solution: Ensure write permissions to the output directory

Architecture

The tool is organized into three layers:

Document Processing Primitives - Low-level PDF operations and image conversion (Phase 1 - ✅ Complete)
Processing Patterns - Sequential document workflows with context accumulation and retry logic (Phase 2 - ✅ Complete)
Use-Case Implementations - System prompt generation and document classification using sequential processing (Phases 4 & 5 - ✅ Complete)

Supporting infrastructure includes:

System prompt caching (Phase 3 - ✅ Complete)
Unified configuration management (Phase 2 - ✅ Complete)
Retry infrastructure with exponential backoff (Phase 2 - ✅ Complete)
Base64 data URI encoding for vision API (Phase 4 - ✅ Complete)
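Base64 data URI encoding, the last item above, follows the standard data:<mime type>;base64,<payload> form. A minimal sketch using only the Go standard library is shown here; the dataURI function name and the image/png MIME type are assumptions, not the interface of pkg/encoding.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"
)

// dataURI encodes raw bytes as a base64 data URI with the given MIME type.
// (Hypothetical helper for illustration.)
func dataURI(mimeType string, data []byte) string {
	return "data:" + mimeType + ";base64," + base64.StdEncoding.EncodeToString(data)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: encode <image.png>")
		return
	}
	img, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(dataURI("image/png", img))
}
```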

Package Structure:

pkg/
├── config/        # Configuration types and loading
├── retry/         # Retry logic with exponential backoff
├── cache/         # System prompt caching
├── document/      # PDF processing and image conversion
├── encoding/      # Base64 data URI encoding for images
├── prompt/        # System prompt generation
├── classify/      # Document classification logic
└── processing/    # Sequential processor with context accumulation

See PROJECT.md for detailed architecture documentation.

License

Part of the go-agents project. See root LICENSE file for details.
