knowledge

package
v1.4.0
Published: Mar 12, 2026 License: Apache-2.0 Imports: 14 Imported by: 0

README

Knowledge Package - Document Loaders & Chunkers

A complete document loading and processing system for RAG (Retrieval-Augmented Generation) in AgentGo.

📚 Available Loaders

1. TextLoader - Text files
loader := knowledge.NewTextLoader("./docs/readme.md")
docs, err := loader.Load()

Supports: .txt, .md, .log, and any plain-text file


2. DirectoryLoader - Entire directories
loader := knowledge.NewDirectoryLoader(
    "./docs",
    "*.md",    // Pattern: *.txt, *.md, etc.
    true,      // Recursive
)
docs, err := loader.Load()

Features:

  • Glob pattern support
  • Recursive mode
  • Filtering by file extension

3. PDFLoader - PDF documents
loader := knowledge.NewPDFLoader("./paper.pdf")
docs, err := loader.Load()

Features:

  • Text extraction from every page
  • Configurable separator between pages
  • Page-count metadata
  • Supports text-based PDFs (no OCR)

Dependency: github.com/ledongthuc/pdf


4. CSVLoader - CSV tables
loader := knowledge.NewCSVLoader("./data.csv")
loader.HasHeader = true
loader.ContentFormat = "text" // or "json"
loader.RowsPerDoc = 0 // 0 = all rows in one doc
docs, err := loader.Load()

Features:

  • Automatic header detection
  • Multiple output formats (text/JSON)
  • Splitting by row count
  • Column filtering
  • Configurable delimiter

Native Go: uses encoding/csv


5. JSONLoader - JSON documents
loader := knowledge.NewJSONLoader("./data.json")
loader.ContentFields = []string{"title", "content"}
loader.MetadataFields = []string{"author", "date"}
docs, err := loader.Load()

Features:

  • Supports objects and arrays
  • Selective field extraction
  • Customizable metadata
  • Automatic structure detection

Supports:

  • Single JSON objects
  • Arrays of objects
  • Nested JSON

Native Go: uses encoding/json


6. HTMLLoader - HTML pages
loader := knowledge.NewHTMLLoader("./page.html")
loader.RemoveScripts = true
loader.RemoveStyles = true
loader.ExtractMetaTags = true
loader.Selectors = []string{"article", ".content"} // Optional
docs, err := loader.Load()

Features:

  • Script/style removal
  • Meta tag extraction
  • Custom CSS selectors
  • Link preservation (optional)
  • Automatic whitespace cleanup

Dependency: github.com/PuerkitoBio/goquery


7. URLLoader - Web content
loader := knowledge.NewURLLoader("https://example.com/article")
loader.Timeout = 30 * time.Second
loader.Headers = map[string]string{"Authorization": "Bearer token"}
docs, err := loader.Load()

Features:

  • Automatic content-type detection
  • Supports HTML, JSON, PDF, and plain text
  • Customizable headers
  • Configurable timeout
  • Follows redirects

Automatic routing:

  • HTML → HTMLLoader
  • JSON → JSONLoader
  • PDF → PDFLoader
  • Text → TextLoader

8. MultiURLLoader - Multiple URLs
urls := []string{
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
}
loader := knowledge.NewMultiURLLoader(urls)
loader.MaxConcurrent = 5
loader.ContinueOnErr = true
docs, err := loader.Load()

Features:

  • Concurrent loading
  • Rate limiting
  • Shared metadata
  • Per-URL error handling

9. ReaderLoader - Streams (io.Reader)
loader := knowledge.NewReaderLoader(reader, "doc-id", metadata)
docs, err := loader.Load()

Use cases:

  • HTTP response bodies
  • Stdin
  • In-memory buffers
  • Pipes

✂️ Chunkers (Document Splitting)

1. CharacterChunker - By characters
chunker := knowledge.NewCharacterChunker(
    1000,  // ChunkSize (characters)
    100,   // ChunkOverlap
)
chunks, err := chunker.Chunk(document)

Features:

  • Smart breaking at separators
  • Overlap for context
  • Preserves whole words
  • Automatic metadata (start_char, end_char)

Ideal for: Text without a clear structure


2. SentenceChunker - By sentences
chunker := knowledge.NewSentenceChunker(
    1000,  // MaxChunkSize
    250,   // MinChunkSize
)
chunks, err := chunker.Chunk(document)

Features:

  • Preserves complete sentences
  • Automatic sentence detection (., !, ?)
  • Respects min/max size limits
  • Keeps semantic integrity

Ideal for: Articles, narrative documents


3. ParagraphChunker - By paragraphs
chunker := knowledge.NewParagraphChunker(2000) // MaxChunkSize
chunks, err := chunker.Chunk(document)

Features:

  • Splits at \n\n
  • Falls back to CharacterChunker for oversized paragraphs
  • Preserves document structure

Ideal for: Documentation, books, structured articles


🔄 Complete RAG Pipeline

package main

import (
    "context"
    "os"

    "github.com/jholhewres/agent-go/pkg/agentgo/embeddings/openai"
    "github.com/jholhewres/agent-go/pkg/agentgo/knowledge"
    "github.com/jholhewres/agent-go/pkg/agentgo/vectordb" // assumed import path for the vectordb.Document type used below
    "github.com/jholhewres/agent-go/pkg/agentgo/vectordb/pgvector"
)

func main() {
    ctx := context.Background()
    apiKey := os.Getenv("OPENAI_API_KEY")   // example: credentials via environment
    connString := os.Getenv("DATABASE_URL")

    // 1. Load documents (multiple types); errors ignored for brevity
    var allDocs []knowledge.Document

    // PDFs
    pdfLoader := knowledge.NewPDFDirectoryLoader("./pdfs", true)
    pdfDocs, _ := pdfLoader.Load()
    allDocs = append(allDocs, pdfDocs...)

    // Markdown
    mdLoader := knowledge.NewDirectoryLoader("./docs", "*.md", true)
    mdDocs, _ := mdLoader.Load()
    allDocs = append(allDocs, mdDocs...)

    // URLs
    urlLoader := knowledge.NewMultiURLLoader([]string{
        "https://example.com/article1",
        "https://example.com/article2",
    })
    urlDocs, _ := urlLoader.Load()
    allDocs = append(allDocs, urlDocs...)

    // 2. Chunk the documents
    chunker := knowledge.NewCharacterChunker(1000, 100)
    var allChunks []knowledge.Chunk

    for _, doc := range allDocs {
        chunks, _ := chunker.Chunk(doc)
        allChunks = append(allChunks, chunks...)
    }

    // 3. Create embeddings
    embedder := openai.NewEmbedding("text-embedding-3-small", apiKey)

    // 4. Store in the vector database
    vectorDB := pgvector.New(connString, "knowledge_base")

    for _, chunk := range allChunks {
        embedding, _ := embedder.Embed(ctx, chunk.Content)

        vectorDB.Add(ctx, []vectordb.Document{{
            ID:        chunk.ID,
            Content:   chunk.Content,
            Embedding: embedding,
            Metadata:  chunk.Metadata,
        }})
    }

    // 5. Query
    queryEmbedding, _ := embedder.Embed(ctx, "How does the system work?")
    results, _ := vectorDB.Query(ctx, queryEmbedding, 5, nil)

    for _, result := range results {
        println(result.Content)
    }
}

📊 Data Structures

Document
type Document struct {
    ID       string                 // Unique identifier
    Content  string                 // Text content
    Metadata map[string]interface{} // Metadata (filename, path, etc.)
    Source   string                 // Origin (file path, URL)
}

Chunk
type Chunk struct {
    ID       string                 // Unique identifier
    Content  string                 // Chunk content
    Metadata map[string]interface{} // Inherited metadata + chunk info
    Index    int                    // Position in the original document
}

🚀 Performance

Loader         | Speed           | Memory Usage | Notes
TextLoader     | ⚡⚡⚡ Very fast | Low          | Direct file read
PDFLoader      | ⚡⚡ Fast        | Medium       | Depends on PDF size
CSVLoader      | ⚡⚡⚡ Very fast | Low          | Native Go parser
JSONLoader     | ⚡⚡⚡ Very fast | Low          | Native Go parser
HTMLLoader     | ⚡⚡ Fast        | Medium       | Parsing + cleanup
URLLoader      | ⚡ Medium        | Medium       | Network-dependent
MultiURLLoader | ⚡⚡ Fast        | Medium-High  | Parallelized

🔒 Security

  • Path traversal: filepath.Walk does not follow symlinks
  • URL validation: configurable timeout and headers
  • Memory limits: chunkers keep huge documents from being held whole in memory
  • Error handling: every loader returns descriptive errors

📦 Dependencies

Loader          | Dependency                     | License
PDFLoader       | github.com/ledongthuc/pdf      | Apache 2.0
HTMLLoader      | github.com/PuerkitoBio/goquery | BSD 3-Clause
CSV, JSON, Text | Native Go                      | BSD 3-Clause

🛠️ Advanced Examples

CSV with Column Filtering
loader := knowledge.NewCSVLoader("./users.csv")
loader.TextColumns = []int{0, 2, 4} // Only columns 0, 2, 4
loader.ContentFormat = "json"
docs, _ := loader.Load()

HTML with Specific Selectors
loader := knowledge.NewHTMLLoader("./article.html")
loader.Selectors = []string{"article", ".post-content", "#main"}
loader.PreserveLinks = true
docs, _ := loader.Load()

JSON with Custom Fields
loader := knowledge.NewJSONLoader("./posts.json")
loader.ContentFields = []string{"title", "body", "tags"}
loader.MetadataFields = []string{"author", "date", "category"}
docs, _ := loader.Load()

URL with Custom Headers
loader := knowledge.NewURLLoader("https://api.example.com/data")
loader.Headers = map[string]string{
    "Authorization": "Bearer " + token,
    "Accept": "application/json",
}
loader.Timeout = 60 * time.Second
docs, _ := loader.Load()

🎯 Best Practices

  1. Choose the right chunker:

    • Technical documents → ParagraphChunker
    • Articles/narratives → SentenceChunker
    • Unstructured text → CharacterChunker
  2. Tune the chunk size:

    • Embeddings: 500-1000 characters
    • LLM context: 1000-2000 characters
    • Overlap: 10-20% of the chunk size
  3. Use metadata:

    • Filter by document type
    • Sort by date/relevance
    • Track the source for citations
  4. Handle errors:

    • Use ContinueOnErr for batch processing
    • Log failures for later analysis
    • Validate documents before chunking

📝 TODO / Roadmap

  • OCR support - Text extraction from scanned PDFs
  • DocxLoader - Microsoft Word documents
  • PPTXLoader - PowerPoint presentations
  • ExcelLoader - Excel spreadsheets
  • XMLLoader - XML documents
  • EpubLoader - E-books
  • AudioLoader - Audio transcription
  • VideoLoader - Video transcription
  • DatabaseLoader - SQL/NoSQL queries
  • S3Loader - AWS S3 objects
  • GCSLoader - Google Cloud Storage
  • GitLoader - Git repositories
  • SlackLoader - Slack messages
  • NotionLoader - Notion pages
  • ConfluenceLoader - Wiki pages
  • JiraLoader - Issues and documentation

🤝 Contributing

To add a new loader:

  1. Implement the Loader interface:
type Loader interface {
    Load() ([]Document, error)
}
  2. Follow the naming convention: *Loader, *ReaderLoader
  3. Add relevant metadata
  4. Write unit tests
  5. Update this README

Author: Jhol Hewres (@jholhewres)
License: Apache 2.0
Version: 1.0.0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CSVLoader added in v1.1.0

type CSVLoader struct {
	FilePath      string
	Delimiter     rune   // Default: ','
	HasHeader     bool   // Whether first row is header
	TextColumns   []int  // Indices of columns to include (nil = all)
	RowsPerDoc    int    // Number of rows per document (0 = all in one doc)
	ContentFormat string // "json" or "text" (default: "text")
}

CSVLoader loads documents from CSV files

func NewCSVLoader added in v1.1.0

func NewCSVLoader(filePath string) *CSVLoader

NewCSVLoader creates a new CSV loader

func (*CSVLoader) Load added in v1.1.0

func (l *CSVLoader) Load() ([]Document, error)

Load loads a CSV file

type CSVReaderLoader added in v1.1.0

type CSVReaderLoader struct {
	Reader        io.Reader
	ID            string
	Delimiter     rune
	HasHeader     bool
	RowsPerDoc    int
	ContentFormat string
	Metadata      map[string]interface{}
}

CSVReaderLoader loads CSV from an io.Reader

func NewCSVReaderLoader added in v1.1.0

func NewCSVReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *CSVReaderLoader

NewCSVReaderLoader creates a new CSV reader loader

func (*CSVReaderLoader) Load added in v1.1.0

func (l *CSVReaderLoader) Load() ([]Document, error)

Load loads CSV content from a reader

type CharacterChunker

type CharacterChunker struct {
	ChunkSize    int // Number of characters per chunk
	ChunkOverlap int // Number of characters to overlap between chunks
	Separator    string
}

CharacterChunker splits documents by character count

func NewCharacterChunker

func NewCharacterChunker(chunkSize, chunkOverlap int) *CharacterChunker

NewCharacterChunker creates a new character-based chunker

func (*CharacterChunker) Chunk

func (c *CharacterChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into character-based chunks

type Chunk

type Chunk struct {
	ID       string                 `json:"id"`
	Content  string                 `json:"content"`
	Metadata map[string]interface{} `json:"metadata,omitempty"`
	Index    int                    `json:"index"` // Position in original document
}

Chunk represents a chunk of a document

type Chunker

type Chunker interface {
	Chunk(doc Document) ([]Chunk, error)
}

Chunker interface for splitting documents into chunks

type DirectoryLoader

type DirectoryLoader struct {
	DirPath   string
	Pattern   string // File pattern to match (e.g., "*.txt", "*.md")
	Recursive bool   // Whether to search subdirectories
	// contains filtered or unexported fields
}

DirectoryLoader loads documents from a directory

func NewDirectoryLoader

func NewDirectoryLoader(dirPath string, pattern string, recursive bool) *DirectoryLoader

NewDirectoryLoader creates a new directory loader

func (*DirectoryLoader) Load

func (l *DirectoryLoader) Load() ([]Document, error)

Load loads all matching files from a directory

type Document

type Document struct {
	ID       string                 `json:"id"`
	Content  string                 `json:"content"`
	Metadata map[string]interface{} `json:"metadata,omitempty"`
	Source   string                 `json:"source,omitempty"` // File path or URL
}

Document represents a document with metadata

type HTMLLoader added in v1.1.0

type HTMLLoader struct {
	FilePath        string
	RemoveScripts   bool     // Remove <script> tags
	RemoveStyles    bool     // Remove <style> tags
	ExtractMetaTags bool     // Extract meta tags as metadata
	Selectors       []string // CSS selectors to extract specific content (nil = extract all)
	PreserveLinks   bool     // Keep links in content
}

HTMLLoader loads documents from HTML files

func NewHTMLLoader added in v1.1.0

func NewHTMLLoader(filePath string) *HTMLLoader

NewHTMLLoader creates a new HTML loader

func (*HTMLLoader) Load added in v1.1.0

func (l *HTMLLoader) Load() ([]Document, error)

Load loads an HTML file

type HTMLReaderLoader added in v1.1.0

type HTMLReaderLoader struct {
	Reader          io.Reader
	ID              string
	RemoveScripts   bool
	RemoveStyles    bool
	ExtractMetaTags bool
	Selectors       []string
	PreserveLinks   bool
	Metadata        map[string]interface{}
}

HTMLReaderLoader loads HTML from an io.Reader

func NewHTMLReaderLoader added in v1.1.0

func NewHTMLReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *HTMLReaderLoader

NewHTMLReaderLoader creates a new HTML reader loader

func (*HTMLReaderLoader) Load added in v1.1.0

func (l *HTMLReaderLoader) Load() ([]Document, error)

Load loads HTML content from a reader

type JSONLoader added in v1.1.0

type JSONLoader struct {
	FilePath       string
	JSONPath       string   // JSONPath expression to extract content (optional)
	ContentFields  []string // Fields to use as content (if JSON is object/array of objects)
	MetadataFields []string // Fields to include in metadata
	TextTemplate   string   // Template for formatting content (e.g., "{title}: {body}")
}

JSONLoader loads documents from JSON files

func NewJSONLoader added in v1.1.0

func NewJSONLoader(filePath string) *JSONLoader

NewJSONLoader creates a new JSON loader

func (*JSONLoader) Load added in v1.1.0

func (l *JSONLoader) Load() ([]Document, error)

Load loads a JSON file

type JSONReaderLoader added in v1.1.0

type JSONReaderLoader struct {
	Reader         io.Reader
	ID             string
	ContentFields  []string
	MetadataFields []string
	Metadata       map[string]interface{}
}

JSONReaderLoader loads JSON from an io.Reader

func NewJSONReaderLoader added in v1.1.0

func NewJSONReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *JSONReaderLoader

NewJSONReaderLoader creates a new JSON reader loader

func (*JSONReaderLoader) Load added in v1.1.0

func (l *JSONReaderLoader) Load() ([]Document, error)

Load loads JSON content from a reader

type Loader

type Loader interface {
	Load() ([]Document, error)
}

Loader interface for loading documents from different sources

type MultiURLLoader added in v1.1.0

type MultiURLLoader struct {
	URLs           []string
	Timeout        time.Duration
	MaxConcurrent  int // Maximum concurrent requests (default: 5)
	ContinueOnErr  bool
	CommonHeaders  map[string]string
	CommonMetadata map[string]interface{}
}

MultiURLLoader loads documents from multiple URLs

func NewMultiURLLoader added in v1.1.0

func NewMultiURLLoader(urls []string) *MultiURLLoader

NewMultiURLLoader creates a new multi-URL loader

func (*MultiURLLoader) Load added in v1.1.0

func (l *MultiURLLoader) Load() ([]Document, error)

Load fetches content from multiple URLs concurrently

type PDFDirectoryLoader added in v1.1.0

type PDFDirectoryLoader struct {
	DirPath   string
	Recursive bool
}

PDFDirectoryLoader loads all PDF files from a directory

func NewPDFDirectoryLoader added in v1.1.0

func NewPDFDirectoryLoader(dirPath string, recursive bool) *PDFDirectoryLoader

NewPDFDirectoryLoader creates a new PDF directory loader

func (*PDFDirectoryLoader) Load added in v1.1.0

func (l *PDFDirectoryLoader) Load() ([]Document, error)

Load loads all PDF files from a directory

type PDFLoader added in v1.1.0

type PDFLoader struct {
	FilePath       string
	ExtractImages  bool // Future: extract images from PDF
	PageSeparator  string
	PreserveLayout bool // Try to preserve text layout
}

PDFLoader loads documents from PDF files

func NewPDFLoader added in v1.1.0

func NewPDFLoader(filePath string) *PDFLoader

NewPDFLoader creates a new PDF loader

func (*PDFLoader) Load added in v1.1.0

func (l *PDFLoader) Load() ([]Document, error)

Load loads a PDF file and extracts text content

type PDFReaderLoader added in v1.1.0

type PDFReaderLoader struct {
	Data     []byte
	ID       string
	Metadata map[string]interface{}
}

PDFReaderLoader loads PDF from a byte slice (in-memory PDF)

func NewPDFReaderLoader added in v1.1.0

func NewPDFReaderLoader(data []byte, id string, metadata map[string]interface{}) *PDFReaderLoader

NewPDFReaderLoader creates a new PDF reader loader from bytes

func (*PDFReaderLoader) Load added in v1.1.0

func (l *PDFReaderLoader) Load() ([]Document, error)

Load loads PDF content from bytes

type ParagraphChunker

type ParagraphChunker struct {
	MaxChunkSize int // Maximum characters per chunk
}

ParagraphChunker splits documents by paragraphs

func NewParagraphChunker

func NewParagraphChunker(maxChunkSize int) *ParagraphChunker

NewParagraphChunker creates a new paragraph-based chunker

func (*ParagraphChunker) Chunk

func (c *ParagraphChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into paragraph-based chunks

type ReaderLoader

type ReaderLoader struct {
	Reader   io.Reader
	ID       string
	Metadata map[string]interface{}
}

ReaderLoader loads documents from an io.Reader

func NewReaderLoader

func NewReaderLoader(reader io.Reader, id string, metadata map[string]interface{}) *ReaderLoader

NewReaderLoader creates a new reader loader

func (*ReaderLoader) Load

func (l *ReaderLoader) Load() ([]Document, error)

Load loads content from a reader

type SentenceChunker

type SentenceChunker struct {
	MaxChunkSize int // Maximum characters per chunk
	MinChunkSize int // Minimum characters per chunk
}

SentenceChunker splits documents by sentences

func NewSentenceChunker

func NewSentenceChunker(maxChunkSize, minChunkSize int) *SentenceChunker

NewSentenceChunker creates a new sentence-based chunker

func (*SentenceChunker) Chunk

func (c *SentenceChunker) Chunk(doc Document) ([]Chunk, error)

Chunk splits a document into sentence-based chunks

type TextLoader

type TextLoader struct {
	FilePath string
}

TextLoader loads documents from text files

func NewTextLoader

func NewTextLoader(filePath string) *TextLoader

NewTextLoader creates a new text file loader

func (*TextLoader) Load

func (l *TextLoader) Load() ([]Document, error)

Load loads a text file

type URLLoader added in v1.1.0

type URLLoader struct {
	URL            string
	Method         string            // HTTP method (default: GET)
	Headers        map[string]string // Custom headers
	Timeout        time.Duration     // Request timeout (default: 30s)
	FollowRedirect bool              // Follow redirects (default: true)
	UserAgent      string            // User agent string
	ContentType    string            // Expected content type (html, json, pdf, text)
	AutoDetect     bool              // Auto-detect content type from response
}

URLLoader loads documents from URLs (web pages, APIs, etc.)

func NewURLLoader added in v1.1.0

func NewURLLoader(url string) *URLLoader

NewURLLoader creates a new URL loader

func (*URLLoader) Load added in v1.1.0

func (l *URLLoader) Load() ([]Document, error)

Load fetches content from URL and loads it as document

type WebCrawler added in v1.1.0

type WebCrawler struct {
	StartURL      string
	MaxDepth      int               // Maximum crawl depth (default: 2)
	MaxPages      int               // Maximum pages to crawl (default: 10)
	SameDomain    bool              // Only crawl same domain (default: true)
	IncludeFilter []string          // URL patterns to include
	ExcludeFilter []string          // URL patterns to exclude
	Timeout       time.Duration     // Request timeout per page
	Headers       map[string]string // Custom headers
}

WebCrawler crawls web pages starting from a URL

func NewWebCrawler added in v1.1.0

func NewWebCrawler(startURL string) *WebCrawler

NewWebCrawler creates a new web crawler

func (*WebCrawler) Load added in v1.1.0

func (c *WebCrawler) Load() ([]Document, error)

Load crawls web pages and loads them as documents. Note: this is a basic implementation; for production, consider using a dedicated crawler library.
