pdf_text_extractor

v0.4.0
Published: Apr 23, 2026 License: MIT Imports: 18 Imported by: 0

README

PDF Text Extractor

Extract text from PDF files in PocketBase file fields and store the result in another field.

The plugin supports multi-file fields, ignores non-PDF uploads, and joins the text extracted from multiple PDFs with --- separators. Text extraction is powered by go-pdfium running on the wazero WebAssembly backend, so there is no CGO requirement and no separate PDFium binary to install.

Installation

Build PocketBase with the plugin:

xpb build --with github.com/mjadobson/pb-plugin-pdf@latest

Setup

During app initialization, the plugin creates a shared _plugins collection if it doesn't already exist. If the plugin is loaded before PocketBase has finished bootstrapping, this setup is deferred until the bootstrap step completes.

The collection includes these fields:

  • plugin_name (text)
  • config (json)
  • enabled (bool)

To configure this plugin, add a row to _plugins with:

  • plugin_name = pdf_text_extractor
  • enabled = true
  • config containing a JSON array of extraction rules

Example:

[
  {
    "collection_name": "docs",
    "input_field": "files",
    "output_field": "files_text",
    "recalculate": true
  },
  {
    "collection_name": "invoices",
    "input_field": "pdf",
    "output_field": "pdf_text"
  }
]
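The config array maps onto the documented ExtractPdfTextConfig struct. The following is a minimal sketch of decoding it with the standard library. The Rule type and parseRules helper are hypothetical names, and the check for required fields is an assumption rather than confirmed plugin behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule mirrors the JSON tags of the documented ExtractPdfTextConfig struct.
// The type name is hypothetical and used only for this sketch.
type Rule struct {
	CollectionName string `json:"collection_name"`
	InputField     string `json:"input_field"`
	OutputField    string `json:"output_field"`
	Recalculate    bool   `json:"recalculate,omitempty"`
	Recalculating  bool   `json:"recalculating,omitempty"`
}

// parseRules decodes a config JSON array. Rejecting rules with missing
// names is an assumption of this sketch, not documented plugin behaviour.
func parseRules(raw []byte) ([]Rule, error) {
	var rules []Rule
	if err := json.Unmarshal(raw, &rules); err != nil {
		return nil, fmt.Errorf("decode config: %w", err)
	}
	for i, r := range rules {
		if r.CollectionName == "" || r.InputField == "" || r.OutputField == "" {
			return nil, fmt.Errorf("rule %d: collection_name, input_field and output_field are required", i)
		}
	}
	return rules, nil
}

func main() {
	raw := []byte(`[
	  {"collection_name": "docs", "input_field": "files", "output_field": "files_text", "recalculate": true},
	  {"collection_name": "invoices", "input_field": "pdf", "output_field": "pdf_text"}
	]`)
	rules, err := parseRules(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, r := range rules {
		fmt.Printf("%s: %s -> %s (recalculate=%v)\n", r.CollectionName, r.InputField, r.OutputField, r.Recalculate)
	}
}
```

Because recalculate is optional, a rule that omits it decodes with the flag set to false.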

For each configured rule, make sure the target collection has:

  • a file field matching input_field
  • a text or editor field matching output_field

The plugin runs after successful record creation, and after updates only when the configured file field value changes.
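The trigger rule above can be sketched as a small predicate. This is an illustration only: shouldExtract and its event labels are hypothetical, and it assumes the file field value can be compared as an ordered list of stored file names.

```go
package main

import (
	"fmt"
	"slices"
)

// shouldExtract reports whether extraction should run for a record event.
// event is a hypothetical label ("create" or "update"); before and after
// hold the file names stored in the configured input field.
func shouldExtract(event string, before, after []string) bool {
	switch event {
	case "create":
		// Every successful create triggers extraction.
		return true
	case "update":
		// Updates trigger extraction only when the file list differs.
		return !slices.Equal(before, after)
	default:
		return false
	}
}

func main() {
	unchanged := shouldExtract("update", []string{"a.pdf"}, []string{"a.pdf"})
	changed := shouldExtract("update", []string{"a.pdf"}, []string{"a.pdf", "b.pdf"})
	fmt.Println("unchanged files:", unchanged)
	fmt.Println("changed files:", changed)
}
```

Comparing the whole list means that removing the last file also counts as a change, which lets the output field be cleared.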

Plugin Config

config[].collection_name

The PocketBase collection name to watch.

config[].input_field

The file field containing one or more uploads.

config[].output_field

The text or editor field where extracted content should be stored.

config[].recalculate

Optional one-off trigger. When set to true, the plugin will backfill all rows in the configured collection in batches of 100.
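A backfill in batches of 100 could look roughly like the sketch below. It assumes the record IDs are already available as an in-memory slice, which stands in for however the plugin actually pages through the collection; forEachBatch is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// batchSize matches the batch size documented for recalculate.
const batchSize = 100

// forEachBatch calls fn with consecutive chunks of at most size IDs and
// stops at the first error. The in-memory slice is a stand-in for paging
// through a collection.
func forEachBatch(ids []string, size int, fn func(batch []string) error) error {
	if size <= 0 {
		return errors.New("batch size must be positive")
	}
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		if err := fn(ids[start:end]); err != nil {
			return fmt.Errorf("batch starting at %d: %w", start, err)
		}
	}
	return nil
}

func main() {
	ids := make([]string, 250)
	for i := range ids {
		ids[i] = fmt.Sprintf("rec%03d", i)
	}
	err := forEachBatch(ids, batchSize, func(batch []string) error {
		fmt.Printf("re-extracting %d records\n", len(batch))
		return nil
	})
	if err != nil {
		fmt.Println("backfill failed:", err)
	}
}
```

Working in fixed-size chunks keeps memory use bounded however many rows the collection holds.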

As soon as the job starts, the plugin removes recalculate and sets recalculating: true for that config entry. When the job finishes, it removes recalculating too.

config[].recalculating

Transient status flag managed by the plugin while a one-off recalculation is running. You should not set this manually.
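The flag lifecycle described above can be sketched as two small state transitions. The Rule type and the beginRecalculation and endRecalculation helpers are hypothetical. The plugin presumably also writes the updated config back to the _plugins row, which this sketch leaves out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Rule mirrors the JSON tags of the documented ExtractPdfTextConfig struct.
type Rule struct {
	CollectionName string `json:"collection_name"`
	InputField     string `json:"input_field"`
	OutputField    string `json:"output_field"`
	Recalculate    bool   `json:"recalculate,omitempty"`
	Recalculating  bool   `json:"recalculating,omitempty"`
}

// beginRecalculation swaps recalculate for recalculating and reports
// whether a backfill was requested.
func beginRecalculation(r *Rule) bool {
	if !r.Recalculate {
		return false
	}
	r.Recalculate = false
	r.Recalculating = true
	return true
}

// endRecalculation clears the transient status flag once the job is done.
func endRecalculation(r *Rule) {
	r.Recalculating = false
}

// show prints the rule as it would appear in the config JSON.
func show(label string, r Rule) {
	out, err := json.Marshal(r)
	if err != nil {
		fmt.Println(label, "marshal error:", err)
		return
	}
	fmt.Println(label, string(out))
}

func main() {
	rule := Rule{CollectionName: "docs", InputField: "files", OutputField: "files_text", Recalculate: true}
	show("requested:", rule)
	if beginRecalculation(&rule) {
		show("running:  ", rule)
		endRecalculation(&rule)
	}
	show("finished: ", rule)
}
```

With omitempty on both flags, a false value disappears from the serialized config, which matches the description of the plugin removing recalculate and later recalculating.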

Behaviour

  • Empty input clears the output field.
  • Only .pdf files are processed; all other uploads are ignored.
  • Multiple PDFs are joined with --- on its own line.
  • Extraction failures are logged and skipped so other files can still be processed.
  • Unrelated record updates do not re-parse PDFs; text is refreshed only when the configured file field changes.
  • Changing _plugins rows or relevant collection schemas takes effect for future create/update events without backfilling existing records.
  • Setting recalculate: true on a config entry triggers a one-off backfill for existing rows in that collection.
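The filtering, merging, and error-skipping rules above can be sketched as a single function. Several details are assumptions: the extension check is case-insensitive, the separator is exactly "\n---\n", and extract is a hypothetical callback standing in for the go-pdfium extraction step.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"path"
	"strings"
)

// errUnreadable is a hypothetical error used to simulate a failed extraction.
var errUnreadable = errors.New("unreadable PDF")

// mergePDFText extracts text from every PDF in names and joins the results
// with a --- line. Non-PDF names are ignored and failed extractions are
// logged and skipped. Empty input yields an empty string.
func mergePDFText(names []string, extract func(name string) (string, error)) string {
	var parts []string
	for _, name := range names {
		// Assumption: the extension is matched case-insensitively.
		if !strings.EqualFold(path.Ext(name), ".pdf") {
			continue
		}
		text, err := extract(name)
		if err != nil {
			log.Printf("skipping %s: %v", name, err)
			continue
		}
		parts = append(parts, text)
	}
	// Assumption: the separator is --- on its own line with no blank lines.
	return strings.Join(parts, "\n---\n")
}

func main() {
	// fakeExtract stands in for real PDF text extraction.
	fakeExtract := func(name string) (string, error) {
		if name == "broken.pdf" {
			return "", errUnreadable
		}
		return "text of " + name, nil
	}
	files := []string{"report.pdf", "photo.png", "broken.pdf", "notes.pdf"}
	fmt.Println(mergePDFText(files, fakeExtract))
}
```

Passing the extraction step as a callback keeps the merge logic testable without a real PDF engine.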

Development

go mod tidy
GOCACHE=/tmp/go-build go test ./...
GOCACHE=/tmp/go-build go build ./...

License

Licensed under the MIT License.

This plugin uses github.com/klippa-app/go-pdfium with the wazero WebAssembly runtime. go-pdfium is MIT-licensed and wazero is Apache-licensed. See LICENSE.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ExtractPdfTextConfig

type ExtractPdfTextConfig struct {
	CollectionName string `json:"collection_name"`
	InputField     string `json:"input_field"`
	OutputField    string `json:"output_field"`
	Recalculate    bool   `json:"recalculate,omitempty"`
	Recalculating  bool   `json:"recalculating,omitempty"`
}

type Plugin

type Plugin struct {
	// contains filtered or unexported fields
}

func (*Plugin) Description

func (p *Plugin) Description() string

func (*Plugin) Init

func (p *Plugin) Init(app core.App) error

func (*Plugin) Name

func (p *Plugin) Name() string

func (*Plugin) Version

func (p *Plugin) Version() string
