docloader

package

Version: v0.0.0-...-5b41359 (latest). Published: Feb 18, 2026. License: Apache-2.0. Imports: 23. Imported by: 0.

Repository

github.com/basenana/plugin

README ¶

DocLoader

Loads and parses document files (PDF, TXT, MD, HTML, EPUB, webarchive).

Type

ProcessPlugin

Version

1.0

Name

docloader

Parameters

Parameter	Required	Type	Description
`file_path`	Yes	string	Path to document file
`updated_at`	No	string	Publish time in RFC3339 format (e.g., "2024-01-01T00:00:00Z")
`title`	No	string	Override document title
`url`	No	string	Document source URL
`site_name`	No	string	Site name (for web content)
`site_url`	No	string	Site URL (for web content)
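
The `updated_at` parameter is an RFC3339 string, while the output field `publish_at` is a Unix timestamp. A minimal sketch of that conversion using only the standard library follows; `parsePublishTime` is a hypothetical helper name and is not part of this package's API.

```go
package main

import (
	"fmt"
	"time"
)

// parsePublishTime is a hypothetical helper (not part of the plugin).
// It converts an RFC3339 string, the format accepted by updated_at,
// into Unix seconds, the format used by publish_at.
func parsePublishTime(updatedAt string) (int64, error) {
	t, err := time.Parse(time.RFC3339, updatedAt)
	if err != nil {
		return 0, fmt.Errorf("updated_at is not RFC3339: %w", err)
	}
	return t.Unix(), nil
}

func main() {
	ts, err := parsePublishTime("2024-01-01T00:00:00Z")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ts) // 1704067200
}
```

A date-only value such as "2024-01-01" is not valid RFC3339 and is rejected by this sketch.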

Supported Formats

Extension	Format
`.pdf`	PDF Document
`.txt`	Plain Text
`.md`, `.markdown`	Markdown
`.html`, `.htm`	HTML
`.webarchive`	Web Archive
`.epub`	EPUB
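
The table above amounts to a lookup keyed by file extension. The sketch below shows one way such a lookup could be written; `formatByExt` and `detectFormat` are illustrative names, and the case-insensitive matching is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// formatByExt mirrors the Supported Formats table. The map itself is an
// illustration; the plugin's internal dispatch may look different.
var formatByExt = map[string]string{
	".pdf":        "PDF Document",
	".txt":        "Plain Text",
	".md":         "Markdown",
	".markdown":   "Markdown",
	".html":       "HTML",
	".htm":        "HTML",
	".webarchive": "Web Archive",
	".epub":       "EPUB",
}

// detectFormat returns the format name for a path and whether the
// extension is listed. Lower-casing the extension is an assumption.
func detectFormat(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	format, ok := formatByExt[ext]
	return format, ok
}

func main() {
	for _, p := range []string{"/path/to/document.pdf", "notes.markdown", "data.docx"} {
		format, ok := detectFormat(p)
		fmt.Println(p, format, ok)
	}
}
```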

Output

Returns a map with `file_path` and a `document` object containing:

{
  "file_path": "document.pdf",
  "document": {
    "content": "<extracted-text>",
    "properties": {
      "title": "<document-title>",
      "author": "<author>",
      "year": "2024",
      "source": "<source>",
      "abstract": "<summary>",
      "keywords": ["tag1", "tag2"],
      "url": "<source-url>",
      "site_name": "<site-name>",
      "site_url": "<site-url>",
      "header_image": "<url>",
      "publish_at": 1704067200,
      "unread": false,
      "marked": false
    }
  }
}

Document Properties

Field	Type	Description
`content`	string	Document text content
`properties.title`	string	Document title
`properties.author`	string	Author name
`properties.year`	string	Publication year (from filename or metadata)
`properties.source`	string	Source/publisher
`properties.abstract`	string	Document abstract/summary
`properties.notes`	string	Personal notes
`properties.keywords`	[]string	Keywords (array of strings)
`properties.url`	string	Source URL
`properties.site_name`	string	Site name (for web content)
`properties.site_url`	string	Site URL (for web content)
`properties.header_image`	string	Header image URL (HTML only)
`properties.unread`	bool	Marked as unread
`properties.marked`	bool	Marked as starred
`properties.publish_at`	int64	Publish timestamp (Unix)
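
For callers that receive this output serialized as JSON, the fields above map onto Go structs in a straightforward way. The sketch below assumes the JSON shape shown in the Output section; the struct names are local to the example and are not the package's `types.Document`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Properties, Document and Output are local illustration types that
// follow the field table above. They are not the plugin's own types.
type Properties struct {
	Title       string   `json:"title"`
	Author      string   `json:"author"`
	Year        string   `json:"year"`
	Source      string   `json:"source"`
	Abstract    string   `json:"abstract"`
	Notes       string   `json:"notes"`
	Keywords    []string `json:"keywords"`
	URL         string   `json:"url"`
	SiteName    string   `json:"site_name"`
	SiteURL     string   `json:"site_url"`
	HeaderImage string   `json:"header_image"`
	Unread      bool     `json:"unread"`
	Marked      bool     `json:"marked"`
	PublishAt   int64    `json:"publish_at"`
}

type Document struct {
	Content    string     `json:"content"`
	Properties Properties `json:"properties"`
}

type Output struct {
	FilePath string   `json:"file_path"`
	Document Document `json:"document"`
}

// sampleOutput is made-up data in the documented shape.
const sampleOutput = `{
  "file_path": "document.pdf",
  "document": {
    "content": "hello",
    "properties": {"title": "T", "year": "2024", "keywords": ["tag1", "tag2"], "publish_at": 1704067200}
  }
}`

// decodeOutput parses a JSON payload in the documented shape.
func decodeOutput(data []byte) (Output, error) {
	var out Output
	err := json.Unmarshal(data, &out)
	return out, err
}

func main() {
	out, err := decodeOutput([]byte(sampleOutput))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out.Document.Properties.Title, out.Document.Properties.Keywords)
}
```

Fields missing from the payload decode to Go zero values, which matches the README's note that absent fields are empty/default.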

Architecture

docloader.go
├── DocLoader (main plugin)
├── Parser interface (Load returns types.Document)
│
├── filename.go
│   └── extractFileNameMetadata() // Parse filename patterns for author/title/year
│
├── pdf.go
│   └── PDF parser (extracts PDF metadata, supports password)
│
├── html.go
│   ├── HTML parser
│   └── extractHTMLMetadata() // Meta tags, OG tags, Dublin Core
│
├── epub.go
│   └── EPUB parser (extracts Dublin Core from OPF)
│
└── plaintext.go
    ├── Text parser (TXT/MD/Markdown)
    └── extractTextContentMetadata() // Title from # heading, abstract from paragraphs

Metadata Extraction by Format

PDF

- Extracts info dict metadata (author, title, subject, creator, producer)
- Supports password-protected PDFs
- Falls back to file modification time for publish_at
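
The modification-time fallback can be sketched with `os.Stat`. The helper name `publishAt` and the convention that 0 means "no timestamp recovered from the document" are assumptions made for this example only.

```go
package main

import (
	"fmt"
	"os"
)

// publishAt is a hypothetical helper, not the plugin's code. metaUnix is
// whatever timestamp the parser recovered from the document, in Unix
// seconds, with 0 meaning that nothing usable was found.
func publishAt(metaUnix int64, path string) (int64, error) {
	if metaUnix != 0 {
		return metaUnix, nil
	}
	// No usable metadata: fall back to the file's modification time.
	info, err := os.Stat(path)
	if err != nil {
		return 0, fmt.Errorf("stat %s: %w", path, err)
	}
	return info.ModTime().Unix(), nil
}

func main() {
	// "." always exists, so this demonstrates the fallback branch.
	ts, err := publishAt(0, ".")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("publish_at:", ts)
}
```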

Text (TXT, MD, Markdown)

- Extracts the title from the first # heading or the first non-empty line
- Extracts the first paragraph as the abstract
- Parses filename patterns for author/title/year:
  - Author_Title_2024.txt
  - Author - Title (2024).md
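
The two filename patterns above can be matched with regular expressions. This is a sketch under stated assumptions: the exact expressions, the four-digit year, and the names `fileMeta` and `parseFileName` are illustrative and may differ from what `extractFileNameMetadata` actually does.

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
	"strings"
)

// Assumed patterns for the two documented filename shapes.
var (
	dashPattern       = regexp.MustCompile(`^(.+?) - (.+) \((\d{4})\)$`)
	underscorePattern = regexp.MustCompile(`^([^_]+)_(.+)_(\d{4})$`)
)

// fileMeta holds what a filename can provide. Year is a string to match
// the documented type of properties.year.
type fileMeta struct {
	Author string
	Title  string
	Year   string
}

// parseFileName strips the directory and extension, then tries each
// pattern in turn. It reports false when neither pattern matches.
func parseFileName(path string) (fileMeta, bool) {
	base := filepath.Base(path)
	name := strings.TrimSuffix(base, filepath.Ext(base))
	for _, re := range []*regexp.Regexp{dashPattern, underscorePattern} {
		if m := re.FindStringSubmatch(name); m != nil {
			return fileMeta{Author: m[1], Title: m[2], Year: m[3]}, true
		}
	}
	return fileMeta{}, false
}

func main() {
	for _, p := range []string{"Author_Title_2024.txt", "Author - Title (2024).md", "notes.txt"} {
		meta, ok := parseFileName(p)
		fmt.Printf("%q -> %+v (matched: %v)\n", p, meta, ok)
	}
}
```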

HTML

- Extracts meta tags: author, description, keywords
- Extracts Open Graph tags: og:title, og:description, og:image, og:site_name
- Extracts Dublin Core tags: dc.title, dc.creator, dc.description, etc.
- Falls back to HTML <title> tag
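
Once the tags have been collected, choosing a title is a matter of precedence. The sketch below assumes the meta values are already gathered into a map and that Open Graph is preferred over Dublin Core; the README only states that the `<title>` tag is the fallback, so the ordering between the first two sources is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// pickTitle is a hypothetical helper. meta maps tag names (for example
// "og:title" or "dc.title") to their content values; titleTag is the
// text of the HTML <title> element.
func pickTitle(meta map[string]string, titleTag string) string {
	// Assumed order: Open Graph first, then Dublin Core.
	for _, key := range []string{"og:title", "dc.title"} {
		if v := strings.TrimSpace(meta[key]); v != "" {
			return v
		}
	}
	// Documented fallback: the <title> tag.
	return strings.TrimSpace(titleTag)
}

func main() {
	meta := map[string]string{"dc.title": "Dublin Core Title"}
	fmt.Println(pickTitle(meta, "Page Title"))
	fmt.Println(pickTitle(nil, "  Page Title  "))
}
```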

EPUB

- Extracts Dublin Core metadata from OPF container
- Supports: title, creator, description, subject, publisher, date
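
Dublin Core elements in an OPF package document can be read with `encoding/xml`. The sketch below works on an in-memory OPF string and leaves out unzipping the EPUB; the struct and function names are illustrative, and the sample document is made up.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// opfPackage captures only the Dublin Core fields listed above. Field
// tags without a namespace match on the element's local name, so
// <dc:title> is matched by "title".
type opfPackage struct {
	Metadata struct {
		Title       string   `xml:"title"`
		Creator     string   `xml:"creator"`
		Description string   `xml:"description"`
		Subjects    []string `xml:"subject"`
		Publisher   string   `xml:"publisher"`
		Date        string   `xml:"date"`
	} `xml:"metadata"`
}

// sampleOPF is a made-up package document used for illustration.
const sampleOPF = `<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:subject>go</dc:subject>
    <dc:subject>parsing</dc:subject>
    <dc:publisher>Example Press</dc:publisher>
    <dc:date>2024-01-01</dc:date>
  </metadata>
</package>`

// parseOPF decodes the metadata block of an OPF document.
func parseOPF(data []byte) (opfPackage, error) {
	var pkg opfPackage
	err := xml.Unmarshal(data, &pkg)
	return pkg, err
}

func main() {
	pkg, err := parseOPF([]byte(sampleOPF))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(pkg.Metadata.Title, pkg.Metadata.Creator, pkg.Metadata.Subjects)
}
```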

Usage Example

# Load a PDF document
- name: docloader
  parameters:
    file_path: "/path/to/document.pdf"

# Load an HTML file
- name: docloader
  parameters:
    file_path: "/path/to/page.html"

# Load a markdown file
- name: docloader
  parameters:
    file_path: "/path/to/readme.md"

Notes

- `file_path` is relative to the working path
- If no title is found, the filename (without extension) is used
- Fields not found in the document are returned as empty/default values
- `header_image` is only available for HTML with OG meta tags
- `year` is extracted from filename patterns or document metadata
- `keywords` is returned as an array, not a comma-separated string
- `publish_at` is a Unix timestamp (int64), not a string
- Use the `updated_at` parameter to preserve the original publish time when the document lacks internal metadata (e.g., webarchive from RSS feeds)
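
The filename fallback for the title takes one line with the standard library. `fallbackTitle` is a hypothetical name used only in this sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// fallbackTitle returns the file name without its directory or final
// extension, which is what the notes describe as the default title.
func fallbackTitle(path string) string {
	base := filepath.Base(path)
	return strings.TrimSuffix(base, filepath.Ext(base))
}

func main() {
	fmt.Println(fallbackTitle("/path/to/readme.md")) // readme
}
```

Only the last extension is removed, so a name like `archive.tar.gz` would yield `archive.tar` in this sketch.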

Documentation ¶

Index ¶

Constants
Variables
func HTMLContentTrim(content string) string
func NewDocLoader(ps types.PluginCall) types.Plugin
func ReadableHTMLContent(content string) string
type CSV
- func NewCSV(docPath string, option map[string]string) CSV
- func (c CSV) Load(_ context.Context) (types.Document, error)
type DocLoader
- func (d *DocLoader) Name() string
- func (d *DocLoader) Run(ctx context.Context, request *api.Request) (*api.Response, error)
- func (d *DocLoader) Type() types.PluginType
- func (d *DocLoader) Version() string
type EPUB
- func (e EPUB) Load(_ context.Context) (types.Document, error)
type HTML
- func (h HTML) Load(ctx context.Context) (types.Document, error)
type PDF
- func (p *PDF) Load(_ context.Context) (types.Document, error)
type Parser
- func NewEPUB(docPath string, option map[string]string) Parser
- func NewHTML(docPath string, option map[string]string) Parser
- func NewPDF(docPath string, option map[string]string) Parser
- func NewText(docPath string, option map[string]string) Parser
type Text
- func (l Text) Load(_ context.Context) (types.Document, error)

Constants ¶

const (
	PluginName    = "docloader"
	PluginVersion = "1.0"
)

Variables ¶

var PluginSpec = types.PluginSpec{
	Name:    PluginName,
	Version: PluginVersion,
	Type:    types.TypeProcess,
	Parameters: []types.ParameterSpec{
		{
			Name:        "file_path",
			Required:    true,
			Description: "Path to document file",
		},
		{
			Name:        "title",
			Required:    false,
			Description: "Document title",
		},
		{
			Name:        "url",
			Required:    false,
			Description: "Document URL",
		},
		{
			Name:        "site_name",
			Required:    false,
			Description: "Site name",
		},
		{
			Name:        "site_url",
			Required:    false,
			Description: "Site URL",
		},
	},
}
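
A spec of this shape makes it easy to check a request before running the plugin. The sketch below uses a local stand-in for `types.ParameterSpec` that copies only the three fields visible above; `missingRequired` is a hypothetical helper, and the real plugin may validate parameters differently.

```go
package main

import (
	"fmt"
	"strings"
)

// ParameterSpec is a local stand-in with the three fields shown in
// PluginSpec. The real types.ParameterSpec may define more.
type ParameterSpec struct {
	Name        string
	Required    bool
	Description string
}

// specParams repeats two entries from PluginSpec for the demonstration.
var specParams = []ParameterSpec{
	{Name: "file_path", Required: true, Description: "Path to document file"},
	{Name: "title", Required: false, Description: "Document title"},
}

// missingRequired lists required parameters that are absent or blank.
func missingRequired(specs []ParameterSpec, got map[string]string) []string {
	var missing []string
	for _, s := range specs {
		if s.Required && strings.TrimSpace(got[s.Name]) == "" {
			missing = append(missing, s.Name)
		}
	}
	return missing
}

func main() {
	fmt.Println(missingRequired(specParams, map[string]string{"title": "Override"}))
	fmt.Println(missingRequired(specParams, map[string]string{"file_path": "/path/to/document.pdf"}))
}
```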

Functions ¶

func HTMLContentTrim ¶

func HTMLContentTrim(content string) string

func NewDocLoader ¶

func NewDocLoader(ps types.PluginCall) types.Plugin

func ReadableHTMLContent ¶

func ReadableHTMLContent(content string) string

Types ¶

type CSV ¶

type CSV struct {
	// contains filtered or unexported fields
}

func NewCSV ¶

func NewCSV(docPath string, option map[string]string) CSV

func (CSV) Load ¶

func (c CSV) Load(_ context.Context) (types.Document, error)

type DocLoader ¶

type DocLoader struct {
	// contains filtered or unexported fields
}

func (*DocLoader) Name ¶

func (d *DocLoader) Name() string

func (*DocLoader) Run ¶

func (d *DocLoader) Run(ctx context.Context, request *api.Request) (*api.Response, error)

func (*DocLoader) Type ¶

func (d *DocLoader) Type() types.PluginType

func (*DocLoader) Version ¶

func (d *DocLoader) Version() string

type EPUB ¶

type EPUB struct {
	// contains filtered or unexported fields
}

func (EPUB) Load ¶

func (e EPUB) Load(_ context.Context) (types.Document, error)

type HTML ¶

type HTML struct {
	// contains filtered or unexported fields
}

func (HTML) Load ¶

func (h HTML) Load(ctx context.Context) (types.Document, error)

type PDF ¶

type PDF struct {
	// contains filtered or unexported fields
}

func (*PDF) Load ¶

func (p *PDF) Load(_ context.Context) (types.Document, error)

type Parser ¶

type Parser interface {
	Load(ctx context.Context) (doc types.Document, err error)
}
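
Any type with a matching `Load` method satisfies this interface. The sketch below implements it for in-memory text, using a local `Document` stand-in because the fields of `types.Document` are not shown on this page. The title rule follows the README's Text section (first `#` heading, else first non-empty line); `memText` and `titleOf` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Document is a local stand-in for types.Document; its fields are assumed.
type Document struct {
	Content string
	Title   string
}

// Parser repeats the interface above with the stand-in Document type.
type Parser interface {
	Load(ctx context.Context) (Document, error)
}

// memText is a toy parser over an in-memory string.
type memText struct {
	name string
	body string
}

// Compile-time check that memText implements Parser.
var _ Parser = memText{}

func (m memText) Load(ctx context.Context) (Document, error) {
	if err := ctx.Err(); err != nil {
		return Document{}, err // honor cancellation before doing any work
	}
	return Document{Content: m.body, Title: titleOf(m.body, m.name)}, nil
}

// titleOf returns the first "# " heading, else the first non-empty line,
// else the fallback.
func titleOf(body, fallback string) string {
	firstLine := ""
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "# ") {
			return strings.TrimSpace(strings.TrimPrefix(line, "# "))
		}
		if firstLine == "" {
			firstLine = line
		}
	}
	if firstLine != "" {
		return firstLine
	}
	return fallback
}

func main() {
	var p Parser = memText{name: "readme", body: "intro\n\n# Real Title\nbody text"}
	doc, err := p.Load(context.Background())
	fmt.Println(doc.Title, err)
}
```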

func NewEPUB ¶

func NewEPUB(docPath string, option map[string]string) Parser

func NewHTML ¶

func NewHTML(docPath string, option map[string]string) Parser

func NewPDF ¶

func NewPDF(docPath string, option map[string]string) Parser

func NewText ¶

func NewText(docPath string, option map[string]string) Parser

type Text ¶

type Text struct {
	// contains filtered or unexported fields
}

func (Text) Load ¶

func (l Text) Load(_ context.Context) (types.Document, error)
