package ocr

v1.3.1 · Published: Jan 16, 2026 · License: AGPL-3.0 · Imports: 15 · Imported by: 0

Note: this is not the latest version of the module.

README

OCR Package

Package ocr provides Optical Character Recognition (OCR) functionality using Ollama's vision models for handwriting recognition.

Overview

The OCR package processes images to extract text with positional information (bounding boxes). It's specifically designed for handwritten notes from reMarkable tablets, providing significantly better accuracy than traditional OCR engines like Tesseract for handwriting recognition.

Features

Ollama Vision Models

  • Uses Ollama's local API with vision-capable models (llava, mistral-small, etc.)
  • Superior handwriting recognition compared to Tesseract
  • Structured JSON output with text and bounding boxes
  • Confidence scores for each extracted word

Structured Output

  • Word-level text extraction with bounding boxes
  • Confidence scores for each word (0-100 scale)
  • Page-level text and confidence aggregation
  • Document-level statistics

Flexible Configuration

  • Configurable Ollama endpoint
  • Custom model selection
  • Adjustable retry logic
  • Custom prompt templates

Error Handling

  • Graceful handling of Ollama connection failures
  • Detailed logging of processing steps
  • Automatic retry with exponential backoff
  • Model availability checking

System Requirements

Ollama Installation

Ollama must be installed and running on your system or accessible via network.

All Platforms
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service (runs in background)
ollama serve

Visit ollama.ai for platform-specific installation instructions.

Vision Model

Download a vision-capable model for OCR:

# Default model (llava)
ollama pull llava

# Or use mistral-small for better handwriting recognition
ollama pull mistral-small

# Check available models
ollama list

Recommended models for handwriting:

  • llava - Good general-purpose vision model (default)
  • mistral-small - Better handwriting recognition
  • llava:13b - Larger, more accurate (but slower)

Usage

Basic OCR Processing
import (
    "fmt"
    "log"
    "os"

    "github.com/platinummonkey/legible/internal/ocr"
)

// Create processor with default configuration
processor, err := ocr.New(&ocr.Config{})
if err != nil {
    log.Fatal(err)
}

// Process image
imageData, _ := os.ReadFile("page.png")
pageOCR, err := processor.ProcessImage(imageData, 1)
if err != nil {
    log.Fatal(err)
}

// Access results
fmt.Printf("Page text: %s\n", pageOCR.Text)
fmt.Printf("Confidence: %.2f%%\n", pageOCR.Confidence)
fmt.Printf("Words found: %d\n", len(pageOCR.Words))

// Iterate over words with positions
for _, word := range pageOCR.Words {
    fmt.Printf("Word: %s at (%d, %d) size: %dx%d confidence: %.2f\n",
        word.Text,
        word.BoundingBox.X,
        word.BoundingBox.Y,
        word.BoundingBox.Width,
        word.BoundingBox.Height,
        word.Confidence)
}
Custom Configuration
processor, err := ocr.New(&ocr.Config{
    Logger:         myLogger,
    OllamaEndpoint: "http://localhost:11434", // default
    Model:          "mistral-small",          // custom model
    Temperature:    0.0,                      // deterministic output
    MaxRetries:     3,                        // retry on failures
})
if err != nil {
    log.Fatal(err)
}

pageOCR, err := processor.ProcessImage(imageData, 1)
Remote Ollama Instance
// Use Ollama running on another machine
processor, err := ocr.New(&ocr.Config{
    OllamaEndpoint: "http://192.168.1.100:11434",
    Model:          "llava",
})
if err != nil {
    log.Fatal(err)
}
Custom Prompt Template
customPrompt := `Extract all text from this image.
Return JSON array: [{"text": "word", "bbox": [x,y,w,h], "confidence": 0.95}]`

pageOCR, err := processor.ProcessImageWithCustomPrompt(imageData, 1, customPrompt)
Health Check
// Verify Ollama is accessible and model is available
processor, err := ocr.New(&ocr.Config{
    Model: "llava",
})
if err != nil {
    log.Fatal(err)
}

if err := processor.HealthCheck(); err != nil {
    log.Fatalf("Ollama health check failed: %v", err)
}

fmt.Println("Ollama is ready for OCR processing")
Document Processing
doc := ocr.NewDocumentOCR("doc-123", "llava")

for pageNum, imageData := range pageImages {
    pageOCR, err := processor.ProcessImage(imageData, pageNum+1)
    if err != nil {
        log.Printf("Failed to process page %d: %v", pageNum+1, err)
        continue
    }

    doc.AddPage(*pageOCR)
}

doc.Finalize()

fmt.Printf("Total pages: %d\n", doc.TotalPages)
fmt.Printf("Total words: %d\n", doc.TotalWords)
fmt.Printf("Average confidence: %.2f%%\n", doc.AverageConfidence)

Implementation Details

OCR Prompt Template

The processor uses a carefully designed prompt to extract text with bounding boxes:

You are analyzing a handwritten note from a reMarkable tablet.

Extract ALL visible handwritten text from this image.
Return ONLY valid JSON with no markdown formatting, no code blocks, no explanation.

Format:
{
  "words": [
    {"text": "word", "bbox": [x, y, width, height], "confidence": 0.95}
  ]
}

Rules:
- Include ALL text, even if partially visible
- bbox coordinates are pixels from top-left (0,0)
- confidence is 0.0-1.0, use 0.8 if uncertain
- Return {"words": []} if no text found
Response Format

Ollama returns JSON with extracted words:

[
  {"text": "Hello", "bbox": [50, 50, 100, 30], "confidence": 0.95},
  {"text": "World", "bbox": [160, 50, 100, 30], "confidence": 0.92}
]
Coordinate System

OCR Coordinate System:

  • Origin: Top-left corner (0, 0)
  • X-axis: Increases rightward
  • Y-axis: Increases downward

Bounding Box Format:

  • X: Left edge position (pixels from left)
  • Y: Top edge position (pixels from top)
  • Width: Box width in pixels
  • Height: Box height in pixels
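
The Rectangle helpers listed in the API reference (Right, Bottom, Contains) follow directly from this convention. The stand-in type below keeps the sketch self-contained; treating the right/bottom edges as exclusive is an assumption, not the package's documented behavior:

```go
package main

import "fmt"

// rect mirrors the package's Rectangle type for illustration only.
type rect struct{ X, Y, Width, Height int }

func (r rect) Right() int  { return r.X + r.Width }
func (r rect) Bottom() int { return r.Y + r.Height }

// Contains treats the left/top edges as inclusive and the right/bottom
// edges as exclusive (an assumption for this sketch).
func (r rect) Contains(x, y int) bool {
	return x >= r.X && x < r.Right() && y >= r.Y && y < r.Bottom()
}

func main() {
	word := rect{X: 50, Y: 50, Width: 100, Height: 30}
	fmt.Println(word.Right(), word.Bottom()) // 150 80
	fmt.Println(word.Contains(60, 60))       // true
	fmt.Println(word.Contains(150, 60))      // false (right edge exclusive)
}
```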
Processing Pipeline
  1. Image Input - Accept image data as bytes
  2. Image Decode - Decode to determine dimensions
  3. Base64 Encoding - Encode image for Ollama API
  4. Ollama API Call - Send to vision model with prompt
  5. JSON Parsing - Parse response to extract words
  6. Structure Building - Create PageOCR with words, text, confidence
  7. Result Return - Return structured OCR results

Testing

Running Tests

Note: Tests use mock HTTP servers and don't require Ollama to be running.

# Run all tests
go test ./internal/ocr

# Run with verbose output
go test -v ./internal/ocr

# Run with coverage
go test -cover ./internal/ocr

# Run specific test
go test -v ./internal/ocr -run TestProcessImage_Success

# Run benchmarks
go test -bench=. ./internal/ocr
Test Coverage

The package includes tests for:

  • ✅ Processor initialization with various configs
  • ✅ Successful OCR processing
  • ✅ Empty results handling
  • ✅ Error handling (Ollama down, model errors)
  • ✅ Invalid bounding box handling
  • ✅ Default confidence values
  • ✅ Custom prompt templates
  • ✅ Health check functionality
  • ✅ JSON response parsing
Integration Testing

For integration tests with real Ollama:

# Ensure Ollama is running
ollama serve &

# Pull required model
ollama pull llava

# Run integration tests
go test -v ./internal/ocr -tags=integration

Performance Considerations

Optimization Tips
  1. Image Preprocessing

    • Resize large images to reasonable resolution (1404x1872 for reMarkable)
    • Convert to appropriate format (PNG or JPEG)
    • Maintain aspect ratio
  2. Model Selection

    • llava - Fast, good general purpose (~1-3s per page)
    • mistral-small - Better accuracy, slower (~3-5s per page)
    • llava:13b - Best accuracy, slowest (~5-10s per page)
  3. Parallel Processing

    • Process pages in parallel using goroutines
    • Ollama can handle multiple concurrent requests
    • Be mindful of memory usage
  4. Caching

    • Cache processed pages to avoid reprocessing
    • Use image dimensions cache for repeated pages
Typical Performance
  • Simple page (50-100 words): ~1-2s
  • Complex page (200-300 words): ~2-4s
  • Dense handwritten notes: ~3-6s

Performance depends on:

  • Image size and resolution
  • Text density and handwriting clarity
  • Model size and type
  • Hardware (CPU, GPU availability, RAM)
  • Network latency (for remote Ollama)
Handwriting Recognition Accuracy

Ollama vision models significantly outperform Tesseract for handwriting:

  • Tesseract: ~40-60% accuracy on handwriting
  • Ollama (llava): ~85-90% accuracy on handwriting
  • Ollama (mistral-small): ~90-95% accuracy on handwriting

CI/CD Considerations

Docker Integration
FROM golang:1.21-alpine AS builder

# Build application
WORKDIR /app
COPY . .
RUN go build -o legible ./cmd/legible

FROM alpine:latest

# Install Ollama
RUN apk add --no-cache curl
RUN curl -fsSL https://ollama.ai/install.sh | sh

# Copy application
COPY --from=builder /app/legible /usr/local/bin/

# Pull model during build (optional - large image size)
# RUN ollama serve & sleep 5 && ollama pull llava && pkill ollama

# entrypoint.sh (provided by the project) should start "ollama serve"
# in the background before exec'ing the CMD below
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]
CMD ["legible", "daemon"]
GitHub Actions
- name: Install Ollama
  run: |
    curl -fsSL https://ollama.ai/install.sh | sh
    ollama serve &
    sleep 5
    ollama pull llava

- name: Run OCR tests
  run: go test -v ./internal/ocr

Alternative: Use mocks in CI (tests don't require real Ollama)

Troubleshooting

Common Issues

1. "Ollama is not accessible"

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

2. "Model not found"

# Pull the required model
ollama pull llava

# List available models
ollama list

3. "Connection refused"

# Check Ollama endpoint
curl http://localhost:11434

# Use correct endpoint in config
processor, err := ocr.New(&ocr.Config{
    OllamaEndpoint: "http://localhost:11434",
})
if err != nil {
    log.Fatal(err)
}

4. "Poor handwriting recognition"

# Try a better model
ollama pull mistral-small

# Use in config
processor, err := ocr.New(&ocr.Config{
    Model: "mistral-small",
})
if err != nil {
    log.Fatal(err)
}

Migration from Tesseract

Key Differences
Aspect                  Tesseract               Ollama
----------------------  ----------------------  ----------------------
Accuracy (handwriting)  40-60%                  85-95%
Speed                   ~0.5-1s/page            ~2-5s/page
Setup                   System dependencies     Docker/binary install
Languages               100+ language packs     Multilingual models
Output                  HOCR XML                JSON
Dependencies            CGO, system libraries   HTTP API
Code Changes

Before (Tesseract):

processor := ocr.New(&ocr.Config{
    Languages: []string{"eng", "fra"},
})

After (Ollama):

processor, err := ocr.New(&ocr.Config{
    Model:          "llava",
    OllamaEndpoint: "http://localhost:11434",
})
if err != nil {
    log.Fatal(err)
}

The ProcessImage() interface remains the same!

License

Part of legible project. See project LICENSE for details.

Documentation

Overview

Package ocr provides optical character recognition capabilities for document processing.

Index

Constants

const (
	// DefaultModel is the default Ollama model for OCR
	DefaultModel = "llava"

	// DefaultTemperature for OCR (0.0 for deterministic output)
	DefaultTemperature = 0.0
)

Variables

This section is empty.

Functions

func GetDefaultModelForProvider added in v1.1.0

func GetDefaultModelForProvider(provider ProviderType) string

GetDefaultModelForProvider returns a recommended default model for the given provider

func ValidateProviderConfig added in v1.1.0

func ValidateProviderConfig(cfg *VisionClientConfig) error

ValidateProviderConfig validates that the provider configuration is complete and correct

Types

type AnthropicVisionClient added in v1.1.0

type AnthropicVisionClient struct {
	// contains filtered or unexported fields
}

AnthropicVisionClient implements VisionClient for Anthropic's Claude API

func NewAnthropicVisionClient added in v1.1.0

func NewAnthropicVisionClient(apiKey string, temperature float64, maxRetries int, log *logger.Logger) *AnthropicVisionClient

NewAnthropicVisionClient creates a new Anthropic Claude vision client

func (*AnthropicVisionClient) GenerateOCR added in v1.1.0

func (a *AnthropicVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using Anthropic's Claude vision API

func (*AnthropicVisionClient) HealthCheck added in v1.1.0

func (a *AnthropicVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the Anthropic API is accessible

func (*AnthropicVisionClient) Name added in v1.1.0

func (a *AnthropicVisionClient) Name() string

Name returns the provider name

func (*AnthropicVisionClient) SupportedModels added in v1.1.0

func (a *AnthropicVisionClient) SupportedModels() []string

SupportedModels returns a list of Anthropic Claude models with vision capabilities

type Config

type Config struct {
	Logger       *logger.Logger
	VisionClient VisionClient // Pre-configured vision client (optional, for advanced usage)
	// Legacy Ollama-specific fields (deprecated, use VisionConfig instead)
	OllamaEndpoint string  // default: "http://localhost:11434"
	Model          string  // default: "llava"
	Temperature    float64 // default: 0.0 for deterministic output
	MaxRetries     int     // default: 3
	// New unified configuration
	VisionConfig *VisionClientConfig // Vision client configuration (preferred)
}

Config holds configuration for the OCR processor

type DocumentOCR

type DocumentOCR struct {
	// DocumentID is the unique identifier for the document
	DocumentID string

	// Pages contains OCR results for each page
	Pages []PageOCR

	// TotalPages is the total number of pages processed
	TotalPages int

	// TotalWords is the total number of words recognized
	TotalWords int

	// AverageConfidence is the average confidence across all pages
	AverageConfidence float64

	// ProcessingTime is the time taken to process the document (in seconds)
	ProcessingTime float64

	// Language is the OCR language(s) used
	Language string
}

DocumentOCR represents OCR results for an entire document

func NewDocumentOCR

func NewDocumentOCR(documentID string, language string) *DocumentOCR

NewDocumentOCR creates a new DocumentOCR

func (*DocumentOCR) AddPage

func (d *DocumentOCR) AddPage(page PageOCR)

AddPage adds a page to the document OCR results

func (*DocumentOCR) Finalize

func (d *DocumentOCR) Finalize()

Finalize calculates summary statistics after all pages are processed
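
The aggregation Finalize performs can be illustrated with a simple page-mean. Whether the real implementation weights the average by word count is not documented; this is a sketch of the unweighted version:

```go
package main

import "fmt"

// page is a stand-in for PageOCR, carrying only the fields
// the aggregation needs.
type page struct {
	words      int
	confidence float64
}

// finalize computes totals plus a simple mean of page confidences,
// mirroring the statistics DocumentOCR exposes after Finalize.
func finalize(pages []page) (totalPages, totalWords int, avgConf float64) {
	totalPages = len(pages)
	for _, p := range pages {
		totalWords += p.words
		avgConf += p.confidence
	}
	if totalPages > 0 {
		avgConf /= float64(totalPages)
	}
	return
}

func main() {
	tp, tw, ac := finalize([]page{
		{words: 100, confidence: 90},
		{words: 50, confidence: 80},
	})
	fmt.Println(tp, tw, ac) // 2 150 85
}
```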

type GoogleVisionClient added in v1.1.0

type GoogleVisionClient struct {
	// contains filtered or unexported fields
}

GoogleVisionClient implements VisionClient for Google's Gemini API

func NewGoogleVisionClient added in v1.1.0

func NewGoogleVisionClient(ctx context.Context, apiKey string, temperature float64, maxRetries int, log *logger.Logger) (*GoogleVisionClient, error)

NewGoogleVisionClient creates a new Google Gemini vision client

func (*GoogleVisionClient) Close added in v1.1.0

func (g *GoogleVisionClient) Close() error

Close closes the Google client

func (*GoogleVisionClient) GenerateOCR added in v1.1.0

func (g *GoogleVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using Google's Gemini vision API

func (*GoogleVisionClient) HealthCheck added in v1.1.0

func (g *GoogleVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the Gemini API is accessible

func (*GoogleVisionClient) Name added in v1.1.0

func (g *GoogleVisionClient) Name() string

Name returns the provider name

func (*GoogleVisionClient) SupportedModels added in v1.1.0

func (g *GoogleVisionClient) SupportedModels() []string

SupportedModels returns a list of Google Gemini vision models

type Line

type Line struct {
	// Words contains the words in this line
	Words []Word

	// BoundingBox is the bounding box for the entire line
	BoundingBox Rectangle

	// Text is the concatenated text of all words in the line
	Text string

	// Confidence is the average confidence of all words in the line
	Confidence float64
}

Line represents a line of text (multiple words)

type OllamaVisionClient added in v1.1.0

type OllamaVisionClient struct {
	// contains filtered or unexported fields
}

OllamaVisionClient is an adapter that implements VisionClient for Ollama

func NewOllamaVisionClient added in v1.1.0

func NewOllamaVisionClient(endpoint string, maxRetries int, log *logger.Logger) *OllamaVisionClient

NewOllamaVisionClient creates a new Ollama vision client

func (*OllamaVisionClient) GenerateOCR added in v1.1.0

func (o *OllamaVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollama.OCRWord, error)

GenerateOCR performs OCR on a base64-encoded image and returns structured word data

func (*OllamaVisionClient) HealthCheck added in v1.1.0

func (o *OllamaVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that Ollama is accessible and the model is available

func (*OllamaVisionClient) Name added in v1.1.0

func (o *OllamaVisionClient) Name() string

Name returns the provider name

func (*OllamaVisionClient) SupportedModels added in v1.1.0

func (o *OllamaVisionClient) SupportedModels() []string

SupportedModels returns a list of commonly used Ollama vision models

type OpenAIVisionClient added in v1.1.0

type OpenAIVisionClient struct {
	// contains filtered or unexported fields
}

OpenAIVisionClient implements VisionClient for OpenAI's GPT-4 Vision API

func NewOpenAIVisionClient added in v1.1.0

func NewOpenAIVisionClient(apiKey string, temperature float64, maxRetries int, log *logger.Logger) *OpenAIVisionClient

NewOpenAIVisionClient creates a new OpenAI vision client

func (*OpenAIVisionClient) GenerateOCR added in v1.1.0

func (o *OpenAIVisionClient) GenerateOCR(ctx context.Context, model string, imageData string) ([]ollamaTypes.OCRWord, error)

GenerateOCR performs OCR using OpenAI's vision API

func (*OpenAIVisionClient) HealthCheck added in v1.1.0

func (o *OpenAIVisionClient) HealthCheck(ctx context.Context, model string) error

HealthCheck verifies that the OpenAI API is accessible

func (*OpenAIVisionClient) Name added in v1.1.0

func (o *OpenAIVisionClient) Name() string

Name returns the provider name

func (*OpenAIVisionClient) SupportedModels added in v1.1.0

func (o *OpenAIVisionClient) SupportedModels() []string

SupportedModels returns a list of OpenAI vision models

type PageOCR

type PageOCR struct {
	// PageNumber is the page number (1-indexed)
	PageNumber int

	// Words contains all recognized words on the page with their positions
	Words []Word

	// Text is the full text content of the page (for convenience)
	Text string

	// Confidence is the overall confidence score for the page (0-100)
	Confidence float64

	// Width is the page width in pixels
	Width int

	// Height is the page height in pixels
	Height int

	// Language is the detected or configured language
	Language string
}

PageOCR represents OCR results for a single page

func NewPageOCR

func NewPageOCR(pageNumber, width, height int, language string) *PageOCR

NewPageOCR creates a new PageOCR result

func (*PageOCR) AddWord

func (p *PageOCR) AddWord(word Word)

AddWord adds a word to the page OCR result

func (*PageOCR) BuildText

func (p *PageOCR) BuildText()

BuildText concatenates all word text to build the full page text

func (*PageOCR) CalculateConfidence

func (p *PageOCR) CalculateConfidence()

CalculateConfidence calculates the average confidence for the page

type Paragraph

type Paragraph struct {
	// Lines contains the lines in this paragraph
	Lines []Line

	// BoundingBox is the bounding box for the entire paragraph
	BoundingBox Rectangle

	// Text is the concatenated text of all lines in the paragraph
	Text string

	// Confidence is the average confidence of all lines in the paragraph
	Confidence float64
}

Paragraph represents a paragraph (multiple lines)

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

Processor handles OCR processing using a vision client

func New

func New(cfg *Config) (*Processor, error)

New creates a new OCR processor with a vision client

func (*Processor) HealthCheck

func (p *Processor) HealthCheck() error

HealthCheck verifies that the vision client is accessible and the model is available

func (*Processor) Model

func (p *Processor) Model() string

Model returns the configured model name

func (*Processor) ProcessImage

func (p *Processor) ProcessImage(imageData []byte, pageNumber int) (*PageOCR, error)

ProcessImage performs OCR on an image and returns structured results

func (*Processor) ProcessImageWithCustomPrompt

func (p *Processor) ProcessImageWithCustomPrompt(imageData []byte, pageNumber int, customPrompt string) (*PageOCR, error)

ProcessImageWithCustomPrompt allows using a custom prompt template

type ProviderType added in v1.1.0

type ProviderType string

ProviderType represents the type of LLM provider

const (
	// ProviderOllama represents a local Ollama instance
	ProviderOllama ProviderType = "ollama"

	// ProviderOpenAI represents OpenAI's GPT-4 Vision API
	ProviderOpenAI ProviderType = "openai"

	// ProviderAnthropic represents Anthropic's Claude API with vision
	ProviderAnthropic ProviderType = "anthropic"

	// ProviderGoogle represents Google's Gemini API
	ProviderGoogle ProviderType = "google"
)

type Rectangle

type Rectangle struct {
	// X is the left coordinate (pixels from left edge)
	X int

	// Y is the top coordinate (pixels from top edge)
	Y int

	// Width is the width of the rectangle in pixels
	Width int

	// Height is the height of the rectangle in pixels
	Height int
}

Rectangle represents a rectangular bounding box

func NewRectangle

func NewRectangle(x, y, width, height int) Rectangle

NewRectangle creates a new Rectangle

func (Rectangle) Area

func (r Rectangle) Area() int

Area returns the area of the rectangle

func (Rectangle) Bottom

func (r Rectangle) Bottom() int

Bottom returns the bottom edge coordinate

func (Rectangle) Contains

func (r Rectangle) Contains(x, y int) bool

Contains returns true if the rectangle contains the point (x, y)

func (Rectangle) Intersects

func (r Rectangle) Intersects(other Rectangle) bool

Intersects returns true if this rectangle intersects with another

func (Rectangle) Right

func (r Rectangle) Right() int

Right returns the right edge coordinate

type Result

type Result struct {
	// DocumentOCR contains the OCR results
	DocumentOCR *DocumentOCR

	// Success indicates if OCR completed successfully
	Success bool

	// Error contains any error message if Success is false
	Error string
}

Result represents the result of an OCR operation

type VisionClient added in v1.1.0

type VisionClient interface {
	// GenerateOCR performs OCR on a base64-encoded image and returns structured word data
	GenerateOCR(ctx context.Context, model string, imageData string) ([]ollama.OCRWord, error)

	// HealthCheck verifies that the provider is accessible and the model is available
	HealthCheck(ctx context.Context, model string) error

	// Name returns the name of the provider (e.g., "ollama", "openai", "anthropic", "google")
	Name() string

	// SupportedModels returns a list of supported model names for this provider
	SupportedModels() []string
}

VisionClient is an interface for vision-capable LLM providers that can perform OCR

func NewVisionClient added in v1.1.0

func NewVisionClient(ctx context.Context, cfg *VisionClientConfig, log *logger.Logger) (VisionClient, error)

NewVisionClient creates a vision client based on the provider configuration

type VisionClientConfig added in v1.1.0

type VisionClientConfig struct {
	// Provider is the LLM provider type (ollama, openai, anthropic, google)
	Provider ProviderType

	// Model is the specific model to use (e.g., "llava", "gpt-4-vision-preview", "claude-3-5-sonnet-20241022", "gemini-1.5-pro")
	Model string

	// Endpoint is the API endpoint (required for Ollama, optional for cloud providers)
	Endpoint string

	// APIKey is the API key for cloud providers (read from env vars)
	APIKey string

	// MaxRetries is the maximum number of retry attempts
	MaxRetries int

	// Temperature controls randomness (0.0 = deterministic, recommended for OCR)
	Temperature float64
}

VisionClientConfig holds common configuration for all vision clients

type Word

type Word struct {
	// Text is the recognized text content
	Text string

	// BoundingBox is the position and size of the word on the page
	BoundingBox Rectangle

	// Confidence is the recognition confidence score (0-100)
	Confidence float64

	// FontSize is the estimated font size in points
	FontSize float64

	// Bold indicates if the word appears to be bold
	Bold bool

	// Italic indicates if the word appears to be italic
	Italic bool
}

Word represents a single recognized word with its bounding box

func NewWord

func NewWord(text string, bbox Rectangle, confidence float64) Word

NewWord creates a new Word
