docloader

package
v0.0.0-...-50d0747 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 11, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

README

DocLoader

Loads and parses document files (PDF, TXT, MD, HTML, EPUB, webarchive).

Type

ProcessPlugin

Version

1.0

Name

docloader

Parameters

Parameter Required Type Description
file_path Yes string Path to document file
title No string Override document title
url No string Document source URL
site_name No string Site name (for web content)
site_url No string Site URL (for web content)

Supported Formats

Extension Format
.pdf PDF Document
.txt Plain Text
.md, .markdown Markdown
.html, .htm HTML
.webarchive Web Archive
.epub EPUB

Output

Returns a map with file_path and document object containing:

{
  "file_path": "document.pdf",
  "document": {
    "content": "<extracted-text>",
    "properties": {
      "title": "<document-title>",
      "author": "<author>",
      "year": "2024",
      "source": "<source>",
      "abstract": "<summary>",
      "keywords": ["tag1", "tag2"],
      "url": "<source-url>",
      "site_name": "<site-name>",
      "site_url": "<site-url>",
      "header_image": "<url>",
      "publish_at": 1704067200,
      "unread": false,
      "marked": false
    }
  }
}
Document Properties
Field Type Description
content string Document text content
properties.title string Document title
properties.author string Author name
properties.year string Publication year (from filename or metadata)
properties.source string Source/publisher
properties.abstract string Document abstract/summary
properties.notes string Personal notes
properties.keywords []string Keywords (array of strings)
properties.url string Source URL
properties.site_name string Site name (for web content)
properties.site_url string Site URL (for web content)
properties.header_image string Header image URL (HTML only)
properties.unread bool Marked as unread
properties.marked bool Marked as starred
properties.publish_at int64 Publish timestamp (Unix)

Architecture

docloader.go
├── DocLoader (main plugin)
├── Parser interface (Load returns types.Document)
│
├── filename.go
│   └── extractFileNameMetadata() // Parse filename patterns for author/title/year
│
├── pdf.go
│   └── PDF parser (extracts PDF metadata, supports password)
│
├── html.go
│   ├── HTML parser
│   └── extractHTMLMetadata() // Meta tags, OG tags, Dublin Core
│
├── epub.go
│   └── EPUB parser (extracts Dublin Core from OPF)
│
└── plaintext.go
    ├── Text parser (TXT/MD/Markdown)
    └── extractTextContentMetadata() // Title from # heading, abstract from paragraphs

Metadata Extraction by Format

PDF
  • Extracts info dict metadata (author, title, subject, creator, producer)
  • Supports password-protected PDFs
  • Falls back to file modification time for publish_at
Text (TXT, MD, Markdown)
  • Extracts title from first # heading or first non-empty line
  • Extracts first paragraph as abstract
  • Parses filename patterns for author/title/year:
    • Author_Title_2024.txt
    • Author - Title (2024).md
HTML
  • Extracts meta tags: author, description, keywords
  • Extracts Open Graph tags: og:title, og:description, og:image, og:site_name
  • Extracts Dublin Core tags: dc.title, dc.creator, dc.description, etc.
  • Falls back to HTML <title> tag
EPUB
  • Extracts Dublin Core metadata from OPF container
  • Supports: title, creator, description, subject, publisher, date

Usage Example

# Load a PDF document
- name: docloader
  parameters:
    file_path: "/path/to/document.pdf"

# Load an HTML file
- name: docloader
  parameters:
    file_path: "/path/to/page.html"

# Load a markdown file
- name: docloader
  parameters:
    file_path: "/path/to/readme.md"

Notes

  • The file_path is relative to the working path
  • If no title is found, filename (without extension) is used
  • Fields not found in document will be empty/default values
  • header_image only available for HTML with OG meta tags
  • year is extracted from filename patterns or document metadata
  • keywords is returned as an array, not comma-separated string
  • publish_at is Unix timestamp (int64), not string

Documentation

Index

Constants

View Source
const (
	PluginName    = "docloader"
	PluginVersion = "1.0"
)

Variables

View Source
var PluginSpec = types.PluginSpec{
	Name:    PluginName,
	Version: PluginVersion,
	Type:    types.TypeProcess,
}

Functions

func HTMLContentTrim

func HTMLContentTrim(content string) string

func NewDocLoader

func NewDocLoader(ps types.PluginCall) types.Plugin

func ReadableHTMLContent

func ReadableHTMLContent(content string) string

Types

type CSV

type CSV struct {
	// contains filtered or unexported fields
}

func NewCSV

func NewCSV(docPath string, option map[string]string) CSV

func (CSV) Load

func (c CSV) Load(_ context.Context) (types.Document, error)

type DocLoader

type DocLoader struct {
	// contains filtered or unexported fields
}

func (*DocLoader) Name

func (d *DocLoader) Name() string

func (*DocLoader) Run

func (d *DocLoader) Run(ctx context.Context, request *api.Request) (*api.Response, error)

func (*DocLoader) Type

func (d *DocLoader) Type() types.PluginType

func (*DocLoader) Version

func (d *DocLoader) Version() string

type EPUB

type EPUB struct {
	// contains filtered or unexported fields
}

func (EPUB) Load

func (e EPUB) Load(_ context.Context) (types.Document, error)

type HTML

type HTML struct {
	// contains filtered or unexported fields
}

func (HTML) Load

func (h HTML) Load(ctx context.Context) (types.Document, error)

type PDF

type PDF struct {
	// contains filtered or unexported fields
}

func (*PDF) Load

func (p *PDF) Load(_ context.Context) (types.Document, error)

type Parser

type Parser interface {
	Load(ctx context.Context) (doc types.Document, err error)
}

func NewEPUB

func NewEPUB(docPath string, option map[string]string) Parser

func NewHTML

func NewHTML(docPath string, option map[string]string) Parser

func NewPDF

func NewPDF(docPath string, option map[string]string) Parser

func NewText

func NewText(docPath string, option map[string]string) Parser

type Text

type Text struct {
	// contains filtered or unexported fields
}

func (Text) Load

func (l Text) Load(_ context.Context) (types.Document, error)

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL