pdf

Version: v0.0.0-...-c9a5dc9 (latest)
Published: Jul 24, 2026
License: Apache-2.0
Imports: 7
Imported by: 5

Repository

github.com/cloudwego/eino-ext

README ¶

PDF Parser for Eino

Introduction

This is a PDF parser component for Eino. It implements the Parser interface and can be seamlessly integrated into Eino's document processing workflow to parse PDF files into plain text documents.

Features

Implements the github.com/cloudwego/eino/components/document/parser.Parser interface
Parses PDF content to plain text
Supports page-by-page or whole-document parsing
Caches fonts for improved performance
Integrates easily into Eino workflows

Installation

go get github.com/cloudwego/eino-ext/components/document/parser/pdf

Quick Start

Parse entire PDF as single document

package main

import (
	"context"
	"log"
	"os"

	"github.com/cloudwego/eino-ext/components/document/parser/pdf"
)

func main() {
	ctx := context.Background()

	parser, err := pdf.NewPDFParser(ctx, &pdf.Config{})
	if err != nil {
		log.Fatalf("pdf.NewPDFParser failed, err=%v", err)
	}

	file, err := os.Open("document.pdf")
	if err != nil {
		log.Fatalf("os.Open failed, err=%v", err)
	}
	defer file.Close()

	docs, err := parser.Parse(ctx, file)
	if err != nil {
		log.Fatalf("parser.Parse failed, err=%v", err)
	}

	log.Printf("Parsed %d document(s)", len(docs))
	log.Printf("Content: %s", docs[0].Content)
}

Parse PDF page-by-page

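// ctx and file are set up as in the previous example.
parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
	ToPages: true, // return one document per page
})
if err != nil {
	log.Fatalf("pdf.NewPDFParser failed, err=%v", err)
}

docs, err := parser.Parse(ctx, file)
if err != nil {
	log.Fatalf("parser.Parse failed, err=%v", err)
}

for i, doc := range docs {
	log.Printf("Page %d: %s", i+1, doc.Content)
}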
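
Scanned or blank pages can yield documents with empty content. The following is a minimal sketch of filtering them out while keeping the original page numbers, assuming the returned slice has one entry per page, in order. Document and Page are hypothetical stand-ins defined here so the example runs without the Eino modules; the real result type is schema.Document, of which only the Content field is mirrored.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in that mirrors only the Content field
// of the documents returned by Parse.
type Document struct {
	Content string
}

// Page pairs a 1-based page number with its text.
type Page struct {
	Number int
	Text   string
}

// nonEmptyPages keeps pages whose content is not blank, remembering the
// original 1-based page number of each kept page.
func nonEmptyPages(docs []*Document) []Page {
	var pages []Page
	for i, doc := range docs {
		if doc == nil || strings.TrimSpace(doc.Content) == "" {
			continue
		}
		pages = append(pages, Page{Number: i + 1, Text: doc.Content})
	}
	return pages
}

func main() {
	docs := []*Document{{Content: "Title page"}, {Content: "  \n"}, {Content: "Chapter 1"}}
	for _, p := range nonEmptyPages(docs) {
		fmt.Printf("Page %d: %s\n", p.Number, p.Text)
	}
}
```

Keeping the page number next to the text avoids losing track of positions once blank pages are dropped.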

Configuration

The parser can be configured through the Config structure:

type Config struct {
    // ToPages determines whether to parse PDF page-by-page (Optional)
    // If true, each page becomes a separate document
    // If false, entire PDF is parsed as a single document
    // Default: false
    ToPages bool
}
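
To make the two modes concrete, here is a small model of the documented ToPages semantics applied to page texts that have already been extracted. It is a sketch, not the library's implementation: Document and toDocuments are hypothetical names, and joining pages with a newline is an assumption made only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for the parser's output type.
type Document struct {
	Content string
}

// toDocuments models the documented ToPages semantics: one Document per page
// when toPages is true, otherwise a single Document holding all pages.
// Joining with "\n" is an assumption made for this illustration.
func toDocuments(pageTexts []string, toPages bool) []Document {
	if !toPages {
		return []Document{{Content: strings.Join(pageTexts, "\n")}}
	}
	docs := make([]Document, 0, len(pageTexts))
	for _, text := range pageTexts {
		docs = append(docs, Document{Content: text})
	}
	return docs
}

func main() {
	pages := []string{"first page", "second page"}
	fmt.Println(len(toDocuments(pages, false))) // 1
	fmt.Println(len(toDocuments(pages, true)))  // 2
}
```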

Parser Options

You can also configure the parser behavior using parser options:

docs, err := parser.Parse(ctx, file, 
    pdf.WithToPages(true),
)
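
Per-call options of this kind usually follow Go's functional options pattern. The sketch below shows how such an option could be resolved against a constructor-time default. Every name in it is hypothetical, and it assumes that a per-call option takes precedence over the Config value, which the text above does not state explicitly.

```go
package main

import "fmt"

// options holds per-call settings. All names in this block are hypothetical.
type options struct {
	toPages bool
}

// Option mutates the per-call settings.
type Option func(*options)

// WithToPages mirrors the shape of pdf.WithToPages for illustration only.
func WithToPages(toPages bool) Option {
	return func(o *options) { o.toPages = toPages }
}

// resolve starts from the constructor-time default and applies the per-call
// options in order, so the last option wins.
func resolve(defaultToPages bool, opts ...Option) options {
	o := options{toPages: defaultToPages}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	fmt.Println(resolve(false).toPages)                    // false
	fmt.Println(resolve(false, WithToPages(true)).toPages) // true
}
```

Under that assumption, one parser instance can serve both whole-document and page-by-page calls without being rebuilt.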

Using in Chain

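import (
	"log"

	fileLoader "github.com/cloudwego/eino-ext/components/document/loader/file" // path assumed from the repository layout
	pdfParser "github.com/cloudwego/eino-ext/components/document/parser/pdf"
	"github.com/cloudwego/eino/components/document"
	"github.com/cloudwego/eino/compose"
	"github.com/cloudwego/eino/schema"
)

parser, err := pdfParser.NewPDFParser(ctx, &pdfParser.Config{})
if err != nil {
	log.Fatalf("pdfParser.NewPDFParser failed, err=%v", err)
}
loader, err := fileLoader.NewFileLoader(ctx, &fileLoader.FileLoaderConfig{
	Parser: parser,
})
if err != nil {
	log.Fatalf("fileLoader.NewFileLoader failed, err=%v", err)
}

chain := compose.NewChain[document.Source, []*schema.Document]()
chain.AppendLoader(loader)

run, err := chain.Compile(ctx)
if err != nil {
	log.Fatalf("chain.Compile failed, err=%v", err)
}
docs, err := run.Invoke(ctx, document.Source{URI: "document.pdf"})
if err != nil {
	log.Fatalf("run.Invoke failed, err=%v", err)
}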

Important Notes

⚠️ Alpha Stage: This parser is in the alpha stage and may not handle every PDF correctly.

Current Limitations:

May not preserve whitespace and newlines in all cases
Complex PDF layouts may not be parsed optimally
Some PDF features may not be fully supported
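
Because whitespace and newlines may not survive extraction, downstream code that compares or indexes the text may want to normalize it first. A minimal sketch using only the standard library follows; normalizeWhitespace is a hypothetical helper and not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeWhitespace collapses every run of whitespace (spaces, tabs,
// newlines) into a single space and trims both ends.
func normalizeWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Printf("%q\n", normalizeWhitespace("  Hello\t\tPDF \n world  ")) // "Hello PDF world"
}
```

This discards any layout that did survive, so it suits search and comparison better than display.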

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation ¶

Index ¶

func WithToPages(toPages bool) parser.Option
type Config
type PDFParser
- func NewPDFParser(ctx context.Context, config *Config) (*PDFParser, error)
- func (pp *PDFParser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) (docs []*schema.Document, err error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func WithToPages ¶

func WithToPages(toPages bool) parser.Option

WithToPages is a parser option that specifies whether to parse the PDF into pages.

Types ¶

type Config ¶

type Config struct {
	ToPages bool // whether to parse the PDF page by page
}

Config is the configuration for PDF parser.

type PDFParser ¶

type PDFParser struct {
	ToPages bool
}

PDFParser reads from an io.Reader and parses its content as plain text. Attention: this parser is in the alpha stage and may not support all PDF use cases well enough. For example, it does not yet preserve whitespace and newlines.

func NewPDFParser ¶

func NewPDFParser(ctx context.Context, config *Config) (*PDFParser, error)

NewPDFParser creates a new PDF parser.

func (*PDFParser) Parse ¶

func (pp *PDFParser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) (docs []*schema.Document, err error)

Parse parses the PDF content from io.Reader.
