package ocr

v1.3.1 · Published: Jan 16, 2026 · License: AGPL-3.0 · Imports: 15 · Imported by: 0

Note: this is not the latest version of the module.

README

OCR Package

Package ocr provides Optical Character Recognition (OCR) functionality using Ollama's vision models for handwriting recognition.

Overview

The OCR package processes images to extract text with positional information (bounding boxes). It's specifically designed for handwritten notes from reMarkable tablets, providing significantly better accuracy than traditional OCR engines like Tesseract for handwriting recognition.

Features

Ollama Vision Models

  • Uses Ollama's local API with vision-capable models (llava, mistral-small, etc.)
  • Superior handwriting recognition compared to Tesseract
  • Structured JSON output with text and bounding boxes
  • Confidence scores for each extracted word

Structured Output

  • Word-level text extraction with bounding boxes
  • Confidence scores for each word (0-100 scale)
  • Page-level text and confidence aggregation
  • Document-level statistics

Flexible Configuration

  • Configurable Ollama endpoint
  • Custom model selection
  • Adjustable retry logic
  • Custom prompt templates

Error Handling

  • Graceful handling of Ollama connection failures
  • Detailed logging of processing steps
  • Automatic retry with exponential backoff
  • Model availability checking

System Requirements

Ollama Installation

Ollama must be installed and running on your system or accessible via network.

All Platforms
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service (runs in background)
ollama serve

Visit ollama.ai for platform-specific installation instructions.

Vision Model

Download a vision-capable model for OCR:

# Default model (llava)
ollama pull llava

# Or use mistral-small for better handwriting recognition
ollama pull mistral-small

# Check available models
ollama list

Recommended models for handwriting:

  • llava - Good general-purpose vision model (default)
  • mistral-small - Better handwriting recognition
  • llava:13b - Larger, more accurate (but slower)

Usage

Basic OCR Processing
import (
    "fmt"
    "log"
    "os"

    "github.com/platinummonkey/legible/internal/ocr"
)

// Create processor with default configuration
processor, err := ocr.New(&ocr.Config{})
if err != nil {
    log.Fatal(err)
}

// Process image
imageData, _ := os.ReadFile("page.png")
pageOCR, err := processor.ProcessImage(imageData, 1)
if err != nil {
    log.Fatal(err)
}

// Access results
fmt.Printf("Page text: %s\n", pageOCR.Text)
fmt.Printf("Confidence: %.2f%%\n", pageOCR.Confidence)
fmt.Printf("Words found: %d\n", len(pageOCR.Words))

// Iterate over words with positions
for _, word := range pageOCR.Words {
    fmt.Printf("Word: %s at (%d, %d) size: %dx%d confidence: %.2f\n",
        word.Text,
        word.BoundingBox.X,
        word.BoundingBox.Y,
        word.BoundingBox.Width,
        word.BoundingBox.Height,
        word.Confidence)
}
Custom Configuration
processor, err := ocr.New(&ocr.Config{
    Logger:         myLogger,
    OllamaEndpoint: "http://localhost:11434", // default
    Model:          "mistral-small",          // custom model
    Temperature:    0.0,                      // deterministic output
    MaxRetries:     3,                        // retry on failures
})
if err != nil {
    log.Fatal(err)
}

pageOCR, err := processor.ProcessImage(imageData, 1)
Remote Ollama Instance
// Use Ollama running on another machine
processor, err := ocr.New(&ocr.Config{
    OllamaEndpoint: "http://192.168.1.100:11434",
    Model:          "llava",
})
if err != nil {
    log.Fatal(err)
}
Custom Prompt Template
customPrompt := `Extract all text from this image.
Return JSON array: [{"text": "word", "bbox": [x,y,w,h], "confidence": 0.95}]`

pageOCR, err := processor.ProcessImageWithCustomPrompt(imageData, 1, customPrompt)
Health Check
// Verify Ollama is accessible and model is available
processor, err := ocr.New(&ocr.Config{
    Model: "llava",
})
if err != nil {
    log.Fatal(err)
}

if err := processor.HealthCheck(); err != nil {
    log.Fatalf("Ollama health check failed: %v", err)
}

fmt.Println("Ollama is ready for OCR processing")
Document Processing
doc := ocr.NewDocumentOCR("doc-123", "llava")

for pageNum, imageData := range pageImages {
    pageOCR, err := processor.ProcessImage(imageData, pageNum+1)
    if err != nil {
        log.Printf("Failed to process page %d: %v", pageNum+1, err)
        continue
    }

    doc.AddPage(*pageOCR)
}

doc.Finalize()

fmt.Printf("Total pages: %d\n", doc.TotalPages)
fmt.Printf("Total words: %d\n", doc.TotalWords)
fmt.Printf("Average confidence: %.2f%%\n", doc.AverageConfidence)

Implementation Details

OCR Prompt Template

The processor uses a carefully designed prompt to extract text with bounding boxes:

You are analyzing a handwritten note from a reMarkable tablet.

Extract ALL visible handwritten text from this image.
Return ONLY valid JSON with no markdown formatting, no code blocks, no explanation.

Format:
{
  "words": [
    {"text": "word", "bbox": [x, y, width, height], "confidence": 0.95}
  ]
}

Rules:
- Include ALL text, even if partially visible
- bbox coordinates are pixels from top-left (0,0)
- confidence is 0.0-1.0, use 0.8 if uncertain
- Return {"words": []} if no text found
Response Format

Ollama returns JSON with extracted words:

[
  {"text": "Hello", "bbox": [50, 50, 100, 30], "confidence": 0.95},
  {"text": "World", "bbox": [160, 50, 100, 30], "confidence": 0.92}
]
Coordinate System

OCR Coordinate System:

  • Origin: Top-left corner (0, 0)
  • X-axis: Increases rightward
  • Y-axis: Increases downward

Bounding Box Format:

  • X: Left edge position (pixels from left)
  • Y: Top edge position (pixels from top)
  • Width: Box width in pixels
  • Height: Box height in pixels
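
The Rectangle helpers listed in the API reference (Right, Bottom, Contains) follow directly from this convention. The stand-in type below keeps the sketch self-contained; treating the right/bottom edges as exclusive is an assumption, not the package's documented behavior:

```go
package main

import "fmt"

// rect mirrors the package's Rectangle type for illustration only.
type rect struct{ X, Y, Width, Height int }

func (r rect) Right() int  { return r.X + r.Width }
func (r rect) Bottom() int { return r.Y + r.Height }

// Contains treats the left/top edges as inclusive and the right/bottom
// edges as exclusive (an assumption for this sketch).
func (r rect) Contains(x, y int) bool {
	return x >= r.X && x < r.Right() && y >= r.Y && y < r.Bottom()
}

func main() {
	word := rect{X: 50, Y: 50, Width: 100, Height: 30}
	fmt.Println(word.Right(), word.Bottom()) // 150 80
	fmt.Println(word.Contains(60, 60))       // true
	fmt.Println(word.Contains(150, 60))      // false (right edge exclusive)
}
```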
Processing Pipeline
  1. Image Input - Accept image data as bytes
  2. Image Decode - Decode to determine dimensions
  3. Base64 Encoding - Encode image for Ollama API
  4. Ollama API Call - Send to vision model with prompt
  5. JSON Parsing - Parse response to extract words
  6. Structure Building - Create PageOCR with words, text, confidence
  7. Result Return - Return structured OCR results

Testing

Running Tests

Note: Tests use mock HTTP servers and don't require Ollama to be running.

# Run all tests
go test ./internal/ocr

# Run with verbose output
go test -v ./internal/ocr

# Run with coverage
go test -cover ./internal/ocr

# Run specific test
go test -v ./internal/ocr -run TestProcessImage_Success

# Run benchmarks
go test -bench=. ./internal/ocr
Test Coverage

The package includes tests for:

  • ✅ Processor initialization with various configs
  • ✅ Successful OCR processing
  • ✅ Empty results handling
  • ✅ Error handling (Ollama down, model errors)
  • ✅ Invalid bounding box handling
  • ✅ Default confidence values
  • ✅ Custom prompt templates
  • ✅ Health check functionality
  • ✅ JSON response parsing
Integration Testing

For integration tests with real Ollama:

# Ensure Ollama is running
ollama serve &

# Pull required model
ollama pull llava

# Run integration tests
go test -v ./internal/ocr -tags=integration

Performance Considerations

Optimization Tips
  1. Image Preprocessing

    • Resize large images to reasonable resolution (1404x1872 for reMarkable)
    • Convert to appropriate format (PNG or JPEG)
    • Maintain aspect ratio
  2. Model Selection

    • llava - Fast, good general purpose (~1-3s per page)
    • mistral-small - Better accuracy, slower (~3-5s per page)
    • llava:13b - Best accuracy, slowest (~5-10s per page)
  3. Parallel Processing

    • Process pages in parallel using goroutines
    • Ollama can handle multiple concurrent requests
    • Be mindful of memory usage
  4. Caching

    • Cache processed pages to avoid reprocessing
    • Use image dimensions cache for repeated pages
Typical Performance
  • Simple page (50-100 words): ~1-2s
  • Complex page (200-300 words): ~2-4s
  • Dense handwritten notes: ~3-6s

Performance depends on:

  • Image size and resolution
  • Text density and handwriting clarity
  • Model size and type
  • Hardware (CPU, GPU availability, RAM)
  • Network latency (for remote Ollama)
Handwriting Recognition Accuracy

Ollama vision models significantly outperform Tesseract for handwriting:

  • Tesseract: ~40-60% accuracy on handwriting
  • Ollama (llava): ~85-90% accuracy on handwriting
  • Ollama (mistral-small): ~90-95% accuracy on handwriting

CI/CD Considerations

Docker Integration
FROM golang:1.21-alpine AS builder

# Build application
WORKDIR /app
COPY . .
RUN go build -o legible ./cmd/legible

FROM alpine:latest

# Install Ollama
RUN apk add --no-cache curl
RUN curl -fsSL https://ollama.ai/install.sh | sh

# Copy application
COPY --from=builder /app/legible /usr/local/bin/

# Pull model during build (optional - large image size)
# RUN ollama serve & sleep 5 && ollama pull llava && pkill ollama

# entrypoint.sh (provided by the project) should start "ollama serve"
# in the background before exec'ing the CMD below
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]
CMD ["legible", "daemon"]
GitHub Actions
- name: Install Ollama
  run: |
    curl -fsSL https://ollama.ai/install.sh | sh
    ollama serve &
    sleep 5
    ollama pull llava

- name: Run OCR tests
  run: go test -v ./internal/ocr

Alternative: Use mocks in CI (tests don't require real Ollama)

Troubleshooting

Common Issues

1. "Ollama is not accessible"

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

2. "Model not found"

# Pull the required model
ollama pull llava

# List available models
ollama list

3. "Connection refused"

# Check Ollama endpoint
curl http://localhost:11434

# Use correct endpoint in config
processor, err := ocr.New(&ocr.Config{
    OllamaEndpoint: "http://localhost:11434",
})
if err != nil {
    log.Fatal(err)
}

4. "Poor handwriting recognition"

# Try a better model
ollama pull mistral-small

# Use in config
processor, err := ocr.New(&ocr.Config{
    Model: "mistral-small",
})
if err != nil {
    log.Fatal(err)
}

Migration from Tesseract

Key Differences
Aspect                  Tesseract               Ollama
----------------------  ----------------------  ----------------------
Accuracy (handwriting)  40-60%                  85-95%
Speed                   ~0.5-1s/page            ~2-5s/page
Setup                   System dependencies     Docker/binary install
Languages               100+ language packs     Multilingual models
Output                  HOCR XML                JSON
Dependencies            CGO, system libraries   HTTP API
Code Changes

Before (Tesseract):

processor := ocr.New(&ocr.Config{
    Languages: []string{"eng", "fra"},
})

After (Ollama):

processor, err := ocr.New(&ocr.Config{
    Model:          "llava",
    OllamaEndpoint: "http://localhost:11434",
})
if err != nil {
    log.Fatal(err)
}

The ProcessImage() interface remains the same!

License

Part of legible project. See project LICENSE for details.

Documentation

Overview

Package ocr provides optical character recognition capabilities for document processing.

Index

Constants

const (
	// DefaultModel is the default Ollama model for OCR
	DefaultModel = "llava"

	// DefaultTemperature for OCR (0.0 for deterministic output)
	DefaultTemperature = 0.0
)

Variables

This section is empty.

Functions

func GetDefaultModelForProvider added in v1.1.0

func GetDefaultModelForProvider(provider ProviderType) string

GetDefaultModelForProvider returns a recommended default model for the given provider

func ValidateProviderConfig added in v1.1.0

func ValidateProviderConfig(cfg *VisionClientConfig) error

ValidateProviderConfig validates that the provider configuration is complete and correct

Types

type AnthropicVisionClient added in v1.1.0

type AnthropicVisionClient struct {
	// contains filtered or unexported fields
}

AnthropicVisionClient implements VisionClient for Anthropic's Claude API

func NewAnthropicVisionClient added in v1.1.0

func NewAnthropicVisionClient(apiKey string, temperature float64, maxRetries int, log *logger.Logger) *AnthropicVisionClient

NewAnthropicVisionClient creates a new Anthropic Claude vision client

func (*AnthropicVisionClient) GenerateOCR added in v1.1.0

func (a *AnthropicVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using Anthropic's Claude vision API

func (*AnthropicVisionClient) HealthCheck added in v1.1.0

func (a *AnthropicVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the Anthropic API is accessible

func (*AnthropicVisionClient) Name added in v1.1.0

func (a *AnthropicVisionClient) Name() string

Name returns the provider name

func (*AnthropicVisionClient) SupportedModels added in v1.1.0

func (a *AnthropicVisionClient) SupportedModels() []string

SupportedModels returns a list of Anthropic Claude models with vision capabilities

type Config

type Config struct {
	Logger       *logger.Logger
	VisionClient VisionClient // Pre-configured vision client (optional, for advanced usage)
	// Legacy Ollama-specific fields (deprecated, use VisionConfig instead)
	OllamaEndpoint string  // default: "http://localhost:11434"
	Model          string  // default: "llava"
	Temperature    float64 // default: 0.0 for deterministic output
	MaxRetries     int     // default: 3
	// New unified configuration
	VisionConfig *VisionClientConfig // Vision client configuration (preferred)
}

Config holds configuration for the OCR processor

type DocumentOCR

type DocumentOCR struct {
	// DocumentID is the unique identifier for the document
	DocumentID string

	// Pages contains OCR results for each page
	Pages []PageOCR

	// TotalPages is the total number of pages processed
	TotalPages int

	// TotalWords is the total number of words recognized
	TotalWords int

	// AverageConfidence is the average confidence across all pages
	AverageConfidence float64

	// ProcessingTime is the time taken to process the document (in seconds)
	ProcessingTime float64

	// Language is the OCR language(s) used
	Language string
}

DocumentOCR represents OCR results for an entire document

func NewDocumentOCR

func NewDocumentOCR(documentID string, language string) *DocumentOCR

NewDocumentOCR creates a new DocumentOCR

func (*DocumentOCR) AddPage

func (d *DocumentOCR) AddPage(page PageOCR)

AddPage adds a page to the document OCR results

func (*DocumentOCR) Finalize

func (d *DocumentOCR) Finalize()

Finalize calculates summary statistics after all pages are processed
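
The aggregation Finalize performs can be illustrated with a simple page-mean. Whether the real implementation weights the average by word count is not documented; this is a sketch of the unweighted version:

```go
package main

import "fmt"

// page is a stand-in for PageOCR, carrying only the fields
// the aggregation needs.
type page struct {
	words      int
	confidence float64
}

// finalize computes totals plus a simple mean of page confidences,
// mirroring the statistics DocumentOCR exposes after Finalize.
func finalize(pages []page) (totalPages, totalWords int, avgConf float64) {
	totalPages = len(pages)
	for _, p := range pages {
		totalWords += p.words
		avgConf += p.confidence
	}
	if totalPages > 0 {
		avgConf /= float64(totalPages)
	}
	return
}

func main() {
	tp, tw, ac := finalize([]page{
		{words: 100, confidence: 90},
		{words: 50, confidence: 80},
	})
	fmt.Println(tp, tw, ac) // 2 150 85
}
```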

type GoogleVisionClient added in v1.1.0

type GoogleVisionClient struct {
	// contains filtered or unexported fields
}

GoogleVisionClient implements VisionClient for Google's Gemini API

func NewGoogleVisionClient added in v1.1.0

func NewGoogleVisionClient(ctx context.Context, apiKey string, temperature float64, maxRetries int, log *logger.Logger) (*GoogleVisionClient, error)

NewGoogleVisionClient creates a new Google Gemini vision client

func (*GoogleVisionClient) Close added in v1.1.0

func (g *GoogleVisionClient) Close() error

Close closes the Google client

func (*GoogleVisionClient) GenerateOCR added in v1.1.0

func (g *GoogleVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using Google's Gemini vision API

func (*GoogleVisionClient) HealthCheck added in v1.1.0

func (g *GoogleVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the Gemini API is accessible

func (*GoogleVisionClient) Name added in v1.1.0

func (g *GoogleVisionClient) Name() string

Name returns the provider name

func (*GoogleVisionClient) SupportedModels added in v1.1.0

func (g *GoogleVisionClient) SupportedModels() []string

SupportedModels returns a list of Google Gemini vision models

type Line

type Line struct {
	// Words contains the words in this line
	Words []Word

	// BoundingBox is the bounding box for the entire line
	BoundingBox Rectangle

	// Text is the concatenated text of all words in the line
	Text string

	// Confidence is the average confidence of all words in the line
	Confidence float64
}

Line represents a line of text (multiple words)

type OllamaVisionClient added in v1.1.0

type OllamaVisionClient struct {
	// contains filtered or unexported fields
}

OllamaVisionClient is an adapter that implements VisionClient for Ollama

func NewOllamaVisionClient added in v1.1.0

func NewOllamaVisionClient(endpoint string, maxRetries int, log *logger.Logger) *OllamaVisionClient

NewOllamaVisionClient creates a new Ollama vision client

func (*OllamaVisionClient) GenerateOCR added in v1.1.0

func (o *OllamaVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollama.OCRWord, error)

GenerateOCR performs OCR on a base64-encoded image and returns structured word data

func (*OllamaVisionClient) HealthCheck added in v1.1.0

func (o *OllamaVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that Ollama is accessible and the model is available

func (*OllamaVisionClient) Name added in v1.1.0

func (o *OllamaVisionClient) Name() string

Name returns the provider name

func (*OllamaVisionClient) SupportedModels added in v1.1.0

func (o *OllamaVisionClient) SupportedModels() []string

SupportedModels returns a list of commonly used Ollama vision models

type OpenAIVisionClient added in v1.1.0

type OpenAIVisionClient struct {
	// contains filtered or unexported fields
}

OpenAIVisionClient implements VisionClient for OpenAI's GPT-4 Vision API

func NewOpenAIVisionClient added in v1.1.0

func NewOpenAIVisionClient(apiKey string, temperature float64, maxRetries int, log *logger.Logger) *OpenAIVisionClient

NewOpenAIVisionClient creates a new OpenAI vision client

func (*OpenAIVisionClient) GenerateOCR added in v1.1.0

func (o *OpenAIVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using OpenAI's vision API

func (*OpenAIVisionClient) HealthCheck added in v1.1.0

func (o *OpenAIVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the OpenAI API is accessible

func (*OpenAIVisionClient) Name added in v1.1.0

func (o *OpenAIVisionClient) Name() string

Name returns the provider name

func (*OpenAIVisionClient) SupportedModels added in v1.1.0

func (o *OpenAIVisionClient) SupportedModels() []string

SupportedModels returns a list of OpenAI vision models

type PageOCR

type PageOCR struct {
	// PageNumber is the page number (1-indexed)
	PageNumber int

	// Words contains all recognized words on the page with their positions
	Words []Word

	// Text is the full text content of the page (for convenience)
	Text string

	// Confidence is the overall confidence score for the page (0-100)
	Confidence float64

	// Width is the page width in pixels
	Width int

	// Height is the page height in pixels
	Height int

	// Language is the detected or configured language
	Language string
}

PageOCR represents OCR results for a single page

func NewPageOCR

func NewPageOCR(pageNumber, width, height int, language string) *PageOCR

NewPageOCR creates a new PageOCR result

func (*PageOCR) AddWord

func (p *PageOCR) AddWord(word Word)

AddWord adds a word to the page OCR result

func (*PageOCR) BuildText

func (p *PageOCR) BuildText()

BuildText concatenates all word text to build the full page text

func (*PageOCR) CalculateConfidence

func (p *PageOCR) CalculateConfidence()

CalculateConfidence calculates the average confidence for the page

type Paragraph

type Paragraph struct {
	// Lines contains the lines in this paragraph
	Lines []Line

	// BoundingBox is the bounding box for the entire paragraph
	BoundingBox Rectangle

	// Text is the concatenated text of all lines in the paragraph
	Text string

	// Confidence is the average confidence of all lines in the paragraph
	Confidence float64
}

Paragraph represents a paragraph (multiple lines)

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor handles OCR processing using a vision client

func New

func New(cfg *Config) (*Processor, error)

New creates a new OCR processor with a vision client

func (*Processor) HealthCheck

func (p *Processor) HealthCheck() error

HealthCheck verifies that the vision client is accessible and the model is available

func (*Processor) Model

func (p *Processor) Model() string

Model returns the configured model name

func (*Processor) ProcessImage

func (p *Processor) ProcessImage(imageData []byte, pageNumber int) (*PageOCR, error)

ProcessImage performs OCR on an image and returns structured results

func (*Processor) ProcessImageWithCustomPrompt

func (p *Processor) ProcessImageWithCustomPrompt(imageData []byte, pageNumber int, customPrompt string) (*PageOCR, error)

ProcessImageWithCustomPrompt allows using a custom prompt template

type ProviderType added in v1.1.0

type ProviderType string

ProviderType represents the type of LLM provider

const (
	// ProviderOllama represents a local Ollama instance
	ProviderOllama ProviderType = "ollama"

	// ProviderOpenAI represents OpenAI's GPT-4 Vision API
	ProviderOpenAI ProviderType = "openai"

	// ProviderAnthropic represents Anthropic's Claude API with vision
	ProviderAnthropic ProviderType = "anthropic"

	// ProviderGoogle represents Google's Gemini API
	ProviderGoogle ProviderType = "google"
)

type Rectangle

type Rectangle struct {
	// X is the left coordinate (pixels from left edge)
	X int

	// Y is the top coordinate (pixels from top edge)
	Y int

	// Width is the width of the rectangle in pixels
	Width int

	// Height is the height of the rectangle in pixels
	Height int
}

Rectangle represents a rectangular bounding box

func NewRectangle

func NewRectangle(x, y, width, height int) Rectangle

NewRectangle creates a new Rectangle

func (Rectangle) Area

func (r Rectangle) Area() int

Area returns the area of the rectangle

func (Rectangle) Bottom

func (r Rectangle) Bottom() int

Bottom returns the bottom edge coordinate

func (Rectangle) Contains

func (r Rectangle) Contains(x, y int) bool

Contains returns true if the rectangle contains the point (x, y)

func (Rectangle) Intersects

func (r Rectangle) Intersects(other Rectangle) bool

Intersects returns true if this rectangle intersects with another

func (Rectangle) Right

func (r Rectangle) Right() int

Right returns the right edge coordinate

type Result

type Result struct {
	// DocumentOCR contains the OCR results
	DocumentOCR *DocumentOCR

	// Success indicates if OCR completed successfully
	Success bool

	// Error contains any error message if Success is false
	Error string
}

Result represents the result of an OCR operation

type VisionClient added in v1.1.0

type VisionClient interface {
	// GenerateOCR performs OCR on a base64-encoded image and returns structured word data
	GenerateOCR(ctx context.Context, model string, imageData string) ([]ollama.OCRWord, error)

	// HealthCheck verifies that the provider is accessible and the model is available
	HealthCheck(ctx context.Context, model string) error

	// Name returns the name of the provider (e.g., "ollama", "openai", "anthropic", "google")
	Name() string

	// SupportedModels returns a list of supported model names for this provider
	SupportedModels() []string
}

VisionClient is an interface for vision-capable LLM providers that can perform OCR

func NewVisionClient added in v1.1.0

func NewVisionClient(ctx context.Context, cfg *VisionClientConfig, log *logger.Logger) (VisionClient, error)

NewVisionClient creates a vision client based on the provider configuration

type VisionClientConfig added in v1.1.0

type VisionClientConfig struct {
	// Provider is the LLM provider type (ollama, openai, anthropic, google)
	Provider ProviderType

	// Model is the specific model to use (e.g., "llava", "gpt-4-vision-preview", "claude-3-5-sonnet-20241022", "gemini-1.5-pro")
	Model string

	// Endpoint is the API endpoint (required for Ollama, optional for cloud providers)
	Endpoint string

	// APIKey is the API key for cloud providers (read from env vars)
	APIKey string

	// MaxRetries is the maximum number of retry attempts
	MaxRetries int

	// Temperature controls randomness (0.0 = deterministic, recommended for OCR)
	Temperature float64
}

VisionClientConfig holds common configuration for all vision clients

type Word

type Word struct {
	// Text is the recognized text content
	Text string

	// BoundingBox is the position and size of the word on the page
	BoundingBox Rectangle

	// Confidence is the recognition confidence score (0-100)
	Confidence float64

	// FontSize is the estimated font size in points
	FontSize float64

	// Bold indicates if the word appears to be bold
	Bold bool

	// Italic indicates if the word appears to be italic
	Italic bool
}

Word represents a single recognized word with its bounding box

func NewWord

func NewWord(text string, bbox Rectangle, confidence float64) Word

NewWord creates a new Word
