pdf

package module
v0.0.0-...-8fe9471 Latest
Published: Jun 10, 2026 License: Apache-2.0 Imports: 7 Imported by: 5

README

PDF Parser for Eino

Introduction

This is a PDF parser component for Eino. It implements the Parser interface and can be seamlessly integrated into Eino's document processing workflow to parse PDF files into plain text documents.

Features

  • Implements github.com/cloudwego/eino/components/document/parser.Parser interface
  • Parse PDF content to plain text
  • Support for page-by-page parsing or full document parsing
  • Font caching for improved performance
  • Easy integration into Eino workflows

Installation

go get github.com/cloudwego/eino-ext/components/document/parser/pdf

Quick Start

Parse entire PDF as a single document

package main

import (
	"context"
	"log"
	"os"

	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
)

func main() {
	ctx := context.Background()

	parser, err := pdf.NewPDFParser(ctx, &pdf.Config{})
	if err != nil {
		log.Fatalf("pdf.NewPDFParser failed, err=%v", err)
	}

	file, err := os.Open("document.pdf")
	if err != nil {
		log.Fatalf("os.Open failed, err=%v", err)
	}
	defer file.Close()

	docs, err := parser.Parse(ctx, file)
	if err != nil {
		log.Fatalf("parser.Parse failed, err=%v", err)
	}

	log.Printf("Parsed %d document(s)", len(docs))
	log.Printf("Content: %s", docs[0].Content)
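}

Parse PDF page-by-page

parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
	ToPages: true,
})
if err != nil {
	log.Fatalf("pdf.NewPDFParser failed, err=%v", err)
}

// With ToPages set, Parse returns one document per page.
docs, err := parser.Parse(ctx, file)
if err != nil {
	log.Fatalf("parser.Parse failed, err=%v", err)
}

for i, doc := range docs {
	log.Printf("Page %d: %s", i+1, doc.Content)
}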
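
In page-by-page mode the slice index doubles as the page position, which makes it easy to flag pages that produced no text. The following is a minimal sketch of that idea, not part of this package: `Document` is a hypothetical stand-in for `schema.Document` (only `Content` is used), `blankPages` is an illustrative helper, and it assumes the returned documents are in page order, as the loop above implies.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for schema.Document; only Content is used.
type Document struct {
	Content string
}

// blankPages returns the 1-based numbers of pages whose text is empty
// after trimming whitespace. Nil entries are counted as blank.
func blankPages(docs []*Document) []int {
	var pages []int
	for i, doc := range docs {
		if doc == nil || strings.TrimSpace(doc.Content) == "" {
			pages = append(pages, i+1)
		}
	}
	return pages
}

func main() {
	docs := []*Document{
		{Content: "Intro"},
		{Content: "  \n"},
		{Content: "Summary"},
	}
	fmt.Println(blankPages(docs)) // prints [2]
}
```

A caller could use the result to log or skip pages before passing the documents further down a workflow.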

Configuration

The parser can be configured through the Config structure:

type Config struct {
    // ToPages determines whether to parse PDF page-by-page (Optional)
    // If true, each page becomes a separate document
    // If false, entire PDF is parsed as a single document
    // Default: false
    ToPages bool
}

Parser Options

The same setting can also be passed to Parse as a parser option:

docs, err := parser.Parse(ctx, file,
	pdf.WithToPages(true),
)
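
The README does not say how the Config value and a per-call option interact. The sketch below shows one common arrangement with functional options, under the assumption that the Config value acts as a default and an option passed to a call overrides it. All names here (`options`, `Option`, `resolveToPages`, and this local `WithToPages`) are illustrative stand-ins, not the package's real types.

```go
package main

import "fmt"

// options holds per-call settings; a hypothetical stand-in for whatever
// the package uses internally.
type options struct {
	toPages bool
}

// Option mutates options. It only loosely mirrors the shape of parser.Option.
type Option func(*options)

// WithToPages is a local illustration, not the real pdf.WithToPages.
func WithToPages(toPages bool) Option {
	return func(o *options) { o.toPages = toPages }
}

// resolveToPages starts from the Config default and applies the call's
// options in order. Assumption: a per-call option wins over the default.
func resolveToPages(configDefault bool, opts ...Option) bool {
	o := options{toPages: configDefault}
	for _, opt := range opts {
		opt(&o)
	}
	return o.toPages
}

func main() {
	fmt.Println(resolveToPages(false))                    // prints false
	fmt.Println(resolveToPages(false, WithToPages(true))) // prints true
}
```

If the real package behaves this way, a parser built with the default Config could still be switched to page-by-page output for a single call.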

Using in Chain

import (
	"github.com/cloudwego/eino/compose"
	"github.com/cloudwego/eino/components/document"
	"github.com/cloudwego/eino/schema"

	fileLoader "github.com/cloudwego/eino-ext/components/document/loader/file"
	pdfParser "github.com/cloudwego/eino-ext/components/document/parser/pdf"
)

// Errors are discarded here for brevity; check them in real code.
parser, _ := pdfParser.NewPDFParser(ctx, &pdfParser.Config{})
loader, _ := fileLoader.NewFileLoader(ctx, &fileLoader.FileLoaderConfig{
	Parser: parser,
})

chain := compose.NewChain[document.Source, []*schema.Document]()
chain.AppendLoader(loader)

run, _ := chain.Compile(ctx)
docs, _ := run.Invoke(ctx, document.Source{URI: "document.pdf"})

Important Notes

⚠️ Alpha Stage: This parser is in alpha stage and may not support all PDF use cases perfectly.

Current Limitations:

  • May not preserve whitespace and newlines in all cases
  • Complex PDF layouts may not be parsed optimally
  • Some PDF features may not be fully supported
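
Since whitespace and newlines may not survive parsing, downstream code that compares or indexes the text may want to normalize it first. A minimal sketch using only the standard library follows; `normalizeWhitespace` is a hypothetical helper, not something this package provides, and it assumes the exact layout of the text does not matter to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWhitespace collapses every run of whitespace (spaces, tabs,
// newlines) into a single space and trims both ends.
func normalizeWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	raw := "Total   due:\n\n  42 USD \t"
	fmt.Printf("%q\n", normalizeWhitespace(raw)) // prints "Total due: 42 USD"
}
```

This discards paragraph breaks as well, so it suits search or matching more than display.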

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Functions

func WithToPages

func WithToPages(toPages bool) parser.Option

WithToPages is a parser option that specifies whether to parse the PDF into pages.

Types

type Config

type Config struct {
	ToPages bool // whether to parse the PDF page-by-page (default: false)
}

Config is the configuration for PDF parser.

type PDFParser

type PDFParser struct {
	ToPages bool
}

PDFParser reads from an io.Reader and parses its content as plain text. Attention: this parser is in alpha stage and may not handle all PDF use cases well. For example, it does not currently preserve whitespace and newlines.

func NewPDFParser

func NewPDFParser(ctx context.Context, config *Config) (*PDFParser, error)

NewPDFParser creates a new PDF parser.

func (*PDFParser) Parse

func (pp *PDFParser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) (docs []*schema.Document, err error)

Parse parses the PDF content from io.Reader.
