knowledge

package
v1.4.0
Published: Mar 12, 2026 License: Apache-2.0 Imports: 14 Imported by: 0

README

Knowledge Package - Document Loaders & Chunkers

A complete document loading and processing system for RAG (Retrieval-Augmented Generation) in AgentGo.

📚 Available Loaders

1. TextLoader - Text files
loader := knowledge.NewTextLoader("./docs/readme.md")
docs, err := loader.Load()

Supports: .txt, .md, .log, and any plain-text file


2. DirectoryLoader - Entire directories
loader := knowledge.NewDirectoryLoader(
    "./docs",
    "*.md",    // Pattern: *.txt, *.md, etc.
    true,      // Recursive
)
docs, err := loader.Load()

Features:

  • Glob pattern support
  • Recursive mode
  • Filtering by file extension

3. PDFLoader - PDF documents
loader := knowledge.NewPDFLoader("./paper.pdf")
docs, err := loader.Load()

Features:

  • Text extraction from every page
  • Configurable separator between pages
  • Page-count metadata
  • Supports text-based PDFs (no OCR)

Dependency: github.com/ledongthuc/pdf


4. CSVLoader - CSV tables
loader := knowledge.NewCSVLoader("./data.csv")
loader.HasHeader = true
loader.ContentFormat = "text" // or "json"
loader.RowsPerDoc = 0 // 0 = all rows in one doc
docs, err := loader.Load()

Features:

  • Automatic header detection
  • Multiple output formats (text/JSON)
  • Splitting by row count
  • Column filtering
  • Configurable delimiter

Native Go: uses encoding/csv


5. JSONLoader - JSON documents
loader := knowledge.NewJSONLoader("./data.json")
loader.ContentFields = []string{"title", "content"}
loader.MetadataFields = []string{"author", "date"}
docs, err := loader.Load()

Features:

  • Supports objects and arrays
  • Selective field extraction
  • Customizable metadata
  • Automatic structure detection

Supports:

  • Single JSON objects
  • Arrays of objects
  • Nested JSON

Native Go: uses encoding/json


6. HTMLLoader - HTML pages
loader := knowledge.NewHTMLLoader("./page.html")
loader.RemoveScripts = true
loader.RemoveStyles = true
loader.ExtractMetaTags = true
loader.Selectors = []string{"article", ".content"} // Optional
docs, err := loader.Load()

Features:

  • Script/style removal
  • Meta tag extraction
  • Custom CSS selectors
  • Link preservation (optional)
  • Automatic whitespace cleanup

Dependency: github.com/PuerkitoBio/goquery


7. URLLoader - Web content
loader := knowledge.NewURLLoader("https://example.com/article")
loader.Timeout = 30 * time.Second
loader.Headers = map[string]string{"Authorization": "Bearer token"}
docs, err := loader.Load()

Features:

  • Automatic content-type detection
  • Supports HTML, JSON, PDF, and plain text
  • Customizable headers
  • Configurable timeout
  • Follows redirects

Automatic routing:

  • HTML → HTMLLoader
  • JSON → JSONLoader
  • PDF → PDFLoader
  • Text → TextLoader

8. MultiURLLoader - Multiple URLs
urls := []string{
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
}
loader := knowledge.NewMultiURLLoader(urls)
loader.MaxConcurrent = 5
loader.ContinueOnErr = true
docs, err := loader.Load()

Features:

  • Concurrent loading
  • Rate limiting
  • Shared metadata
  • Per-URL error handling

9. ReaderLoader - Streams (io.Reader)
loader := knowledge.NewReaderLoader(reader, "doc-id", metadata)
docs, err := loader.Load()

Use cases:

  • HTTP response bodies
  • Stdin
  • In-memory buffers
  • Pipes

✂️ Chunkers (Document Splitting)

1. CharacterChunker - By characters
chunker := knowledge.NewCharacterChunker(
    1000,  // ChunkSize (characters)
    100,   // ChunkOverlap
)
chunks, err := chunker.Chunk(document)

Features:

  • Smart breaking at separators
  • Overlap for context
  • Preserves whole words
  • Automatic metadata (start_char, end_char)

Ideal for: Text without a clear structure


2. SentenceChunker - By sentences
chunker := knowledge.NewSentenceChunker(
    1000,  // MaxChunkSize
    250,   // MinChunkSize
)
chunks, err := chunker.Chunk(document)

Features:

  • Preserves complete sentences
  • Automatic sentence detection (., !, ?)
  • Respects min/max size limits
  • Keeps semantic integrity

Ideal for: Articles, narrative documents


3. ParagraphChunker - By paragraphs
chunker := knowledge.NewParagraphChunker(2000) // MaxChunkSize
chunks, err := chunker.Chunk(document)

Features:

  • Splits at \n\n
  • Falls back to CharacterChunker for oversized paragraphs
  • Preserves document structure

Ideal for: Documentation, books, structured articles


🔄 Complete RAG Pipeline

package main

import (
    "context"
    "os"

    "github.com/jholhewres/agent-go/pkg/agentgo/embeddings/openai"
    "github.com/jholhewres/agent-go/pkg/agentgo/knowledge"
    "github.com/jholhewres/agent-go/pkg/agentgo/vectordb" // assumed import path for the vectordb.Document type used below
    "github.com/jholhewres/agent-go/pkg/agentgo/vectordb/pgvector"
)

func main() {
    ctx := context.Background()
    apiKey := os.Getenv("OPENAI_API_KEY")   // example: credentials via environment
    connString := os.Getenv("DATABASE_URL")

    // 1. Load documents (multiple types); errors ignored for brevity
    var allDocs []knowledge.Document

    // PDFs
    pdfLoader := knowledge.NewPDFDirectoryLoader("./pdfs", true)
    pdfDocs, _ := pdfLoader.Load()
    allDocs = append(allDocs, pdfDocs...)

    // Markdown
    mdLoader := knowledge.NewDirectoryLoader("./docs", "*.md", true)
    mdDocs, _ := mdLoader.Load()
    allDocs = append(allDocs, mdDocs...)

    // URLs
    urlLoader := knowledge.NewMultiURLLoader([]string{
        "https://example.com/article1",
        "https://example.com/article2",
    })
    urlDocs, _ := urlLoader.Load()
    allDocs = append(allDocs, urlDocs...)

    // 2. Chunk the documents
    chunker := knowledge.NewCharacterChunker(1000, 100)
    var allChunks []knowledge.Chunk

    for _, doc := range allDocs {
        chunks, _ := chunker.Chunk(doc)
        allChunks = append(allChunks, chunks...)
    }

    // 3. Create embeddings
    embedder := openai.NewEmbedding("text-embedding-3-small", apiKey)

    // 4. Store in the vector database
    vectorDB := pgvector.New(connString, "knowledge_base")

    for _, chunk := range allChunks {
        embedding, _ := embedder.Embed(ctx, chunk.Content)

        vectorDB.Add(ctx, []vectordb.Document{{
            ID:        chunk.ID,
            Content:   chunk.Content,
            Embedding: embedding,
            Metadata:  chunk.Metadata,
        }})
    }

    // 5. Query
    queryEmbedding, _ := embedder.Embed(ctx, "How does the system work?")
    results, _ := vectorDB.Query(ctx, queryEmbedding, 5, nil)

    for _, result := range results {
        println(result.Content)
    }
}

📊 Data Structures

Document
type Document struct {
    ID       string                 // Unique identifier
    Content  string                 // Text content
    Metadata map[string]interface{} // Metadata (filename, path, etc.)
    Source   string                 // Origin (file path, URL)
}

Chunk
type Chunk struct {
    ID       string                 // Unique identifier
    Content  string                 // Chunk content
    Metadata map[string]interface{} // Inherited metadata + chunk info
    Index    int                    // Position in the original document
}

🚀 Performance

Loader         | Speed           | Memory Usage | Notes
TextLoader     | ⚡⚡⚡ Very fast | Low          | Direct file read
PDFLoader      | ⚡⚡ Fast        | Medium       | Depends on PDF size
CSVLoader      | ⚡⚡⚡ Very fast | Low          | Native Go parser
JSONLoader     | ⚡⚡⚡ Very fast | Low          | Native Go parser
HTMLLoader     | ⚡⚡ Fast        | Medium       | Parsing + cleanup
URLLoader      | ⚡ Medium        | Medium       | Network-dependent
MultiURLLoader | ⚡⚡ Fast        | Medium-High  | Parallelized

🔒 Security

  • Path traversal: filepath.Walk does not follow symlinks
  • URL validation: configurable timeout and headers
  • Memory limits: chunkers keep huge documents from being held whole in memory
  • Error handling: every loader returns descriptive errors

📦 Dependencies

Loader          | Dependency                     | License
PDFLoader       | github.com/ledongthuc/pdf      | Apache 2.0
HTMLLoader      | github.com/PuerkitoBio/goquery | BSD 3-Clause
CSV, JSON, Text | Native Go                      | BSD 3-Clause

🛠️ Advanced Examples

CSV with Column Filtering
loader := knowledge.NewCSVLoader("./users.csv")
loader.TextColumns = []int{0, 2, 4} // Only columns 0, 2, 4
loader.ContentFormat = "json"
docs, _ := loader.Load()

HTML with Specific Selectors
loader := knowledge.NewHTMLLoader("./article.html")
loader.Selectors = []string{"article", ".post-content", "#main"}
loader.PreserveLinks = true
docs, _ := loader.Load()

JSON with Custom Fields
loader := knowledge.NewJSONLoader("./posts.json")
loader.ContentFields = []string{"title", "body", "tags"}
loader.MetadataFields = []string{"author", "date", "category"}
docs, _ := loader.Load()

URL with Custom Headers
loader := knowledge.NewURLLoader("https://api.example.com/data")
loader.Headers = map[string]string{
    "Authorization": "Bearer " + token,
    "Accept": "application/json",
}
loader.Timeout = 60 * time.Second
docs, _ := loader.Load()

🎯 Best Practices

  1. Choose the right chunker:

    • Technical documents → ParagraphChunker
    • Articles/narratives → SentenceChunker
    • Unstructured text → CharacterChunker
  2. Tune the chunk size:

    • Embeddings: 500-1000 characters
    • LLM context: 1000-2000 characters
    • Overlap: 10-20% of the chunk size
  3. Use metadata:

    • Filter by document type
    • Sort by date/relevance
    • Track the source for citations
  4. Handle errors:

    • Use ContinueOnErr for batch processing
    • Log failures for later analysis
    • Validate documents before chunking

📝 TODO / Roadmap

  • OCR support - Text extraction from scanned PDFs
  • DocxLoader - Microsoft Word documents
  • PPTXLoader - PowerPoint presentations
  • ExcelLoader - Excel spreadsheets
  • XMLLoader - XML documents
  • EpubLoader - E-books
  • AudioLoader - Audio transcription
  • VideoLoader - Video transcription
  • DatabaseLoader - SQL/NoSQL queries
  • S3Loader - AWS S3 objects
  • GCSLoader - Google Cloud Storage
  • GitLoader - Git repositories
  • SlackLoader - Slack messages
  • NotionLoader - Notion pages
  • ConfluenceLoader - Wiki pages
  • JiraLoader - Issues and documentation

🤝 Contributing

To add a new loader:

  1. Implement the Loader interface:
type Loader interface {
    Load() ([]Document, error)
}
  2. Follow the naming convention: *Loader, *ReaderLoader
  3. Add relevant metadata
  4. Write unit tests
  5. Update this README

Author: Jhol Hewres (@jholhewres)
License: Apache 2.0
Version: 1.0.0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CSVLoader added in v1.1.0

type CSVLoader struct {
	FilePath      string
	Delimiter     rune   // Default: ','
	HasHeader     bool   // Whether first row is header
	TextColumns   []int  // Indices of columns to include (nil = all)
	RowsPerDoc    int    // Number of rows per document (0 = all in one doc)
	ContentFormat string // "json" or "text" (default: "text")
}

CSVLoader loads documents from CSV files

func NewCSVLoader added in v1.1.0

func NewCSVLoader(filePath string) *CSVLoader

NewCSVLoader creates a new CSV loader

func (*CSVLoader) Load added in v1.1.0

func (l *CSVLoader) Load() ([]Document, error)

Load loads a CSV file

type CSVReaderLoader added in v1.1.0

type CSVReaderLoader struct {
	Reader        io.Reader
	ID            string
	Delimiter     rune
	HasHeader     bool
	RowsPerDoc    int
	ContentFormat string
	Metadata      map[string]interface{}
}

CSVReaderLoader loads CSV from an io.Reader

func NewCSVReaderLoader added in v1.1.0

func NewCSVReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *CSVReaderLoader

NewCSVReaderLoader creates a new CSV reader loader

func (*CSVReaderLoader) Load added in v1.1.0

func (l *CSVReaderLoader) Load() ([]Document, error)

Load loads CSV content from a reader

type CharacterChunker

type CharacterChunker struct {
	ChunkSize    int // Number of characters per chunk
	ChunkOverlap int // Number of characters to overlap between chunks
	Separator    string
}

CharacterChunker splits documents by character count

func NewCharacterChunker

func NewCharacterChunker(chunkSize, chunkOverlap int) *CharacterChunker

NewCharacterChunker creates a new character-based chunker

func (*CharacterChunker) Chunk

func (c *CharacterChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into character-based chunks

type Chunk

type Chunk struct {
	ID       string                 `json:"id"`
	Content  string                 `json:"content"`
	Metadata map[string]interface{} `json:"metadata,omitempty"`
	Index    int                    `json:"index"` // Position in original document
}

Chunk represents a chunk of a document

type Chunker

type Chunker interface {
	Chunk(doc Document) ([]Chunk, error)
}

Chunker interface for splitting documents into chunks

type DirectoryLoader

type DirectoryLoader struct {
	DirPath   string
	Pattern   string // File pattern to match (e.g., "*.txt", "*.md")
	Recursive bool   // Whether to search subdirectories
	// contains filtered or unexported fields
}

DirectoryLoader loads documents from a directory

func NewDirectoryLoader

func NewDirectoryLoader(dirPath string, pattern string, recursive bool) *DirectoryLoader

NewDirectoryLoader creates a new directory loader

func (*DirectoryLoader) Load

func (l *DirectoryLoader) Load() ([]Document, error)

Load loads all matching files from a directory

type Document

type Document struct {
	ID       string                 `json:"id"`
	Content  string                 `json:"content"`
	Metadata map[string]interface{} `json:"metadata,omitempty"`
	Source   string                 `json:"source,omitempty"` // File path or URL
}

Document represents a document with metadata

type HTMLLoader added in v1.1.0

type HTMLLoader struct {
	FilePath        string
	RemoveScripts   bool     // Remove <script> tags
	RemoveStyles    bool     // Remove <style> tags
	ExtractMetaTags bool     // Extract meta tags as metadata
	Selectors       []string // CSS selectors to extract specific content (nil = extract all)
	PreserveLinks   bool     // Keep links in content
}

HTMLLoader loads documents from HTML files

func NewHTMLLoader added in v1.1.0

func NewHTMLLoader(filePath string) *HTMLLoader

NewHTMLLoader creates a new HTML loader

func (*HTMLLoader) Load added in v1.1.0

func (l *HTMLLoader) Load() ([]Document, error)

Load loads an HTML file

type HTMLReaderLoader added in v1.1.0

type HTMLReaderLoader struct {
	Reader          io.Reader
	ID              string
	RemoveScripts   bool
	RemoveStyles    bool
	ExtractMetaTags bool
	Selectors       []string
	PreserveLinks   bool
	Metadata        map[string]interface{}
}

HTMLReaderLoader loads HTML from an io.Reader

func NewHTMLReaderLoader added in v1.1.0

func NewHTMLReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *HTMLReaderLoader

NewHTMLReaderLoader creates a new HTML reader loader

func (*HTMLReaderLoader) Load added in v1.1.0

func (l *HTMLReaderLoader) Load() ([]Document, error)

Load loads HTML content from a reader

type JSONLoader added in v1.1.0

type JSONLoader struct {
	FilePath       string
	JSONPath       string   // JSONPath expression to extract content (optional)
	ContentFields  []string // Fields to use as content (if JSON is object/array of objects)
	MetadataFields []string // Fields to include in metadata
	TextTemplate   string   // Template for formatting content (e.g., "{title}: {body}")
}

JSONLoader loads documents from JSON files

func NewJSONLoader added in v1.1.0

func NewJSONLoader(filePath string) *JSONLoader

NewJSONLoader creates a new JSON loader

func (*JSONLoader) Load added in v1.1.0

func (l *JSONLoader) Load() ([]Document, error)

Load loads a JSON file

type JSONReaderLoader added in v1.1.0

type JSONReaderLoader struct {
	Reader         io.Reader
	ID             string
	ContentFields  []string
	MetadataFields []string
	Metadata       map[string]interface{}
}

JSONReaderLoader loads JSON from an io.Reader

func NewJSONReaderLoader added in v1.1.0

func NewJSONReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *JSONReaderLoader

NewJSONReaderLoader creates a new JSON reader loader

func (*JSONReaderLoader) Load added in v1.1.0

func (l *JSONReaderLoader) Load() ([]Document, error)

Load loads JSON content from a reader

type Loader

type Loader interface {
	Load() ([]Document, error)
}

Loader interface for loading documents from different sources

type MultiURLLoader added in v1.1.0

type MultiURLLoader struct {
	URLs           []string
	Timeout        time.Duration
	MaxConcurrent  int // Maximum concurrent requests (default: 5)
	ContinueOnErr  bool
	CommonHeaders  map[string]string
	CommonMetadata map[string]interface{}
}

MultiURLLoader loads documents from multiple URLs

func NewMultiURLLoader added in v1.1.0

func NewMultiURLLoader(urls []string) *MultiURLLoader

NewMultiURLLoader creates a new multi-URL loader

func (*MultiURLLoader) Load added in v1.1.0

func (l *MultiURLLoader) Load() ([]Document, error)

Load fetches content from multiple URLs concurrently

type PDFDirectoryLoader added in v1.1.0

type PDFDirectoryLoader struct {
	DirPath   string
	Recursive bool
}

PDFDirectoryLoader loads all PDF files from a directory

func NewPDFDirectoryLoader added in v1.1.0

func NewPDFDirectoryLoader(dirPath string, recursive bool) *PDFDirectoryLoader

NewPDFDirectoryLoader creates a new PDF directory loader

func (*PDFDirectoryLoader) Load added in v1.1.0

func (l *PDFDirectoryLoader) Load() ([]Document, error)

Load loads all PDF files from a directory

type PDFLoader added in v1.1.0

type PDFLoader struct {
	FilePath       string
	ExtractImages  bool // Future: extract images from PDF
	PageSeparator  string
	PreserveLayout bool // Try to preserve text layout
}

PDFLoader loads documents from PDF files

func NewPDFLoader added in v1.1.0

func NewPDFLoader(filePath string) *PDFLoader

NewPDFLoader creates a new PDF loader

func (*PDFLoader) Load added in v1.1.0

func (l *PDFLoader) Load() ([]Document, error)

Load loads a PDF file and extracts text content

type PDFReaderLoader added in v1.1.0

type PDFReaderLoader struct {
	Data     []byte
	ID       string
	Metadata map[string]interface{}
}

PDFReaderLoader loads PDF from a byte slice (in-memory PDF)

func NewPDFReaderLoader added in v1.1.0

func NewPDFReaderLoader(data []byte, id string, metadata map[string]interface{}) *PDFReaderLoader

NewPDFReaderLoader creates a new PDF reader loader from bytes

func (*PDFReaderLoader) Load added in v1.1.0

func (l *PDFReaderLoader) Load() ([]Document, error)

Load loads PDF content from bytes

type ParagraphChunker

type ParagraphChunker struct {
	MaxChunkSize int // Maximum characters per chunk
}

ParagraphChunker splits documents by paragraphs

func NewParagraphChunker

func NewParagraphChunker(maxChunkSize int) *ParagraphChunker

NewParagraphChunker creates a new paragraph-based chunker

func (*ParagraphChunker) Chunk

func (c *ParagraphChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into paragraph-based chunks

type ReaderLoader

type ReaderLoader struct {
	Reader   io.Reader
	ID       string
	Metadata map[string]interface{}
}

ReaderLoader loads documents from an io.Reader

func NewReaderLoader

func NewReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *ReaderLoader

NewReaderLoader creates a new reader loader

func (*ReaderLoader) Load

func (l *ReaderLoader) Load() ([]Document, error)

Load loads content from a reader

type SentenceChunker

type SentenceChunker struct {
	MaxChunkSize int // Maximum characters per chunk
	MinChunkSize int // Minimum characters per chunk
}

SentenceChunker splits documents by sentences

func NewSentenceChunker

func NewSentenceChunker(maxChunkSize, minChunkSize int) *SentenceChunker

NewSentenceChunker creates a new sentence-based chunker

func (*SentenceChunker) Chunk

func (c *SentenceChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into sentence-based chunks

type TextLoader

type TextLoader struct {
	FilePath string
}

TextLoader loads documents from text files

func NewTextLoader

func NewTextLoader(filePath string) *TextLoader

NewTextLoader creates a new text file loader

func (*TextLoader) Load

func (l *TextLoader) Load() ([]Document, error)

Load loads a text file

type URLLoader added in v1.1.0

type URLLoader struct {
	URL            string
	Method         string            // HTTP method (default: GET)
	Headers        map[string]string // Custom headers
	Timeout        time.Duration     // Request timeout (default: 30s)
	FollowRedirect bool              // Follow redirects (default: true)
	UserAgent      string            // User agent string
	ContentType    string            // Expected content type (html, json, pdf, text)
	AutoDetect     bool              // Auto-detect content type from response
}

URLLoader loads documents from URLs (web pages, APIs, etc.)

func NewURLLoader added in v1.1.0

func NewURLLoader(url string) *URLLoader

NewURLLoader creates a new URL loader

func (*URLLoader) Load added in v1.1.0

func (l *URLLoader) Load() ([]Document, error)

Load fetches content from URL and loads it as document

type WebCrawler added in v1.1.0

type WebCrawler struct {
	StartURL      string
	MaxDepth      int               // Maximum crawl depth (default: 2)
	MaxPages      int               // Maximum pages to crawl (default: 10)
	SameDomain    bool              // Only crawl same domain (default: true)
	IncludeFilter []string          // URL patterns to include
	ExcludeFilter []string          // URL patterns to exclude
	Timeout       time.Duration     // Request timeout per page
	Headers       map[string]string // Custom headers
}

WebCrawler crawls web pages starting from a URL

func NewWebCrawler added in v1.1.0

func NewWebCrawler(startURL string) *WebCrawler

NewWebCrawler creates a new web crawler

func (*WebCrawler) Load added in v1.1.0

func (c *WebCrawler) Load() ([]Document, error)

Load crawls web pages and loads them as documents. Note: this is a basic implementation; for production, consider using a dedicated crawler library.
