core

package
v1.1.11 Latest
Warning

This package is not in the latest version of its module.

Published: May 23, 2026 License: MIT Imports: 6 Imported by: 0

Documentation

Overview

Package core defines the core entities for the goRAG framework.

Constants

const (
	// Basic text formats
	MimeTypeTextPlain       = "text/plain"
	MimeTypeTextMarkdown    = "text/markdown"
	MimeTypeTextHTML        = "text/html"
	MimeTypeApplicationJSON = "application/json"
	MimeTypeTextCSS         = "text/css"
	MimeTypeTextJavaScript  = "text/javascript"
	MimeTypeApplicationXML  = "application/xml"
	MimeTypeTextXML         = "text/xml"
	MimeTypeTextYAML        = "text/yaml"
	MimeTypeTextTOML        = "text/toml"
	MimeTypeTextCSV         = "text/csv"
	MimeTypeTextTSV         = "text/tab-separated-values"
	MimeTypeTextSQL         = "text/sql"
	// Programming languages
	MimeTypeTextPython     = "text/x-python"
	MimeTypeTextGo         = "text/x-go"
	MimeTypeTextJava       = "text/x-java"
	MimeTypeTextC          = "text/x-c"
	MimeTypeTextCPP        = "text/x-c++"
	MimeTypeTextCsharp     = "text/x-csharp"
	MimeTypeTextPHP        = "text/x-php"
	MimeTypeTextRuby       = "text/x-ruby"
	MimeTypeTextPerl       = "text/x-perl"
	MimeTypeTextBash       = "text/x-sh"
	MimeTypeTextPowerShell = "text/x-powershell"
	MimeTypeTextRust       = "text/x-rust"
	MimeTypeTextSwift      = "text/x-swift"
	MimeTypeTextKotlin     = "text/x-kotlin"
	MimeTypeTextTypeScript = "text/typescript"
	MimeTypeTextVue        = "text/vue"
	MimeTypeTextSvelte     = "text/svelte"
	MimeTypeTextGraphQL    = "application/graphql"
	// Image formats
	MimeTypeImageJPEG = "image/jpeg"
	MimeTypeImagePNG  = "image/png"
	MimeTypeImageGIF  = "image/gif"
	MimeTypeImageWebP = "image/webp"
	MimeTypeImageBMP  = "image/bmp"
	MimeTypeImageSVG  = "image/svg+xml"
	// Office documents
	MimeTypeApplicationMsWord            = "application/msword"
	MimeTypeApplicationWordOpenXML       = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
	MimeTypeApplicationMsExcel           = "application/vnd.ms-excel"
	MimeTypeApplicationExcelOpenXML      = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
	MimeTypeApplicationMsPowerpoint      = "application/vnd.ms-powerpoint"
	MimeTypeApplicationPowerpointOpenXML = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
	MimeTypeApplicationPDF               = "application/pdf"
	// Other
	MimeTypeApplicationXYAML = "application/x-yaml"
	MimeTypeApplicationToml  = "application/toml"
)

MIME type constants.

Variables

ExtMimeTypes is the reverse mapping from MIME types to file extensions.

var MimeTypes = map[string]string{

	".txt":  MimeTypeTextPlain,
	".md":   MimeTypeTextMarkdown,
	".html": MimeTypeTextHTML,
	".htm":  MimeTypeTextHTML,
	".json": MimeTypeApplicationJSON,
	".css":  MimeTypeTextCSS,
	".js":   MimeTypeTextJavaScript,
	".xml":  MimeTypeTextXML,
	".yaml": MimeTypeTextYAML,
	".yml":  MimeTypeTextYAML,
	".toml": MimeTypeTextTOML,
	".csv":  MimeTypeTextCSV,
	".tsv":  MimeTypeTextTSV,
	".sql":  MimeTypeTextSQL,

	".py":      MimeTypeTextPython,
	".go":      MimeTypeTextGo,
	".java":    MimeTypeTextJava,
	".c":       MimeTypeTextC,
	".cpp":     MimeTypeTextCPP,
	".h":       MimeTypeTextC,
	".hpp":     MimeTypeTextCPP,
	".cs":      MimeTypeTextCsharp,
	".php":     MimeTypeTextPHP,
	".rb":      MimeTypeTextRuby,
	".pl":      MimeTypeTextPerl,
	".sh":      MimeTypeTextBash,
	".ps1":     MimeTypeTextPowerShell,
	".rs":      MimeTypeTextRust,
	".swift":   MimeTypeTextSwift,
	".kt":      MimeTypeTextKotlin,
	".ts":      MimeTypeTextTypeScript,
	".vue":     MimeTypeTextVue,
	".svelte":  MimeTypeTextSvelte,
	".graphql": MimeTypeTextGraphQL,
	".gql":     MimeTypeTextGraphQL,

	".ini":  MimeTypeTextPlain,
	".conf": MimeTypeTextPlain,
	".cfg":  MimeTypeTextPlain,
	".env":  MimeTypeTextPlain,

	".jpg":  MimeTypeImageJPEG,
	".jpeg": MimeTypeImageJPEG,
	".png":  MimeTypeImagePNG,
	".gif":  MimeTypeImageGIF,
	".webp": MimeTypeImageWebP,
	".bmp":  MimeTypeImageBMP,
	".svg":  MimeTypeImageSVG,

	".doc":  MimeTypeApplicationMsWord,
	".docx": MimeTypeApplicationWordOpenXML,
	".xls":  MimeTypeApplicationMsExcel,
	".xlsx": MimeTypeApplicationExcelOpenXML,
	".ppt":  MimeTypeApplicationMsPowerpoint,
	".pptx": MimeTypeApplicationPowerpointOpenXML,
	".pdf":  MimeTypeApplicationPDF,
}

MimeTypes maps file extensions to MIME types.
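A lookup against this map would typically start from a file path. The sketch below is an illustration only: `mimeTypes` is a small local copy of a few entries, `mimeTypeForPath` is a hypothetical helper, and the `application/octet-stream` fallback is an assumption, not documented package behavior.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeTypes is a local stand-in holding a few entries of the package's
// MimeTypes map, so the example compiles on its own.
var mimeTypes = map[string]string{
	".md":   "text/markdown",
	".go":   "text/x-go",
	".yaml": "text/yaml",
	".yml":  "text/yaml",
}

// mimeTypeForPath is a hypothetical helper. It lower-cases the extension
// because the map keys are lower-case, and falls back to a generic binary
// type (an assumption) when the extension is unknown.
func mimeTypeForPath(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if mt, ok := mimeTypes[ext]; ok {
		return mt
	}
	return "application/octet-stream"
}

func main() {
	fmt.Println(mimeTypeForPath("docs/README.MD"))
	fmt.Println(mimeTypeForPath("archive.bin"))
}
```

Note that several extensions (for example `.yaml` and `.yml`) share one MIME type, so the map cannot be inverted without choosing a preferred extension.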

Functions

func CleanText added in v1.1.10

func CleanText(text string) string

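CleanText applies all cleaning functions in the default order. Cleaning order: full-width to half-width conversion → noise characters → links → line numbers → watermarks → Traditional/Simplified Chinese conversion → privacy redaction → paragraph normalization → stop words → basic cleaning.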
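An ordered cleaning pipeline of this kind can be sketched as a list of string-to-string steps applied one after another. The code below is not the package's implementation: `cleaner`, `toHalfWidth`, `collapseSpaces`, and `applyAll` are hypothetical names, and only two of the ten documented stages are approximated.

```go
package main

import (
	"fmt"
	"strings"
)

// cleaner is one text-cleaning step (hypothetical type for this sketch).
type cleaner func(string) string

// toHalfWidth converts full-width ASCII variants (U+FF01..U+FF5E) and the
// ideographic space (U+3000) to their half-width equivalents.
func toHalfWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x3000:
			return ' '
		case r >= 0xFF01 && r <= 0xFF5E:
			return r - 0xFEE0
		}
		return r
	}, s)
}

// collapseSpaces trims the text and squeezes whitespace runs to one space.
func collapseSpaces(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// applyAll runs the steps in order, feeding each output into the next step.
func applyAll(text string, steps ...cleaner) string {
	for _, step := range steps {
		text = step(text)
	}
	return text
}

func main() {
	// Full-width "ABC", two ideographic spaces, then "123" and trailing spaces.
	in := "\uff21\uff22\uff23\u3000\u3000123  "
	fmt.Printf("%q\n", applyAll(in, toHalfWidth, collapseSpaces))
}
```

Order matters here: width conversion runs first so that the ideographic spaces become ordinary spaces before whitespace is collapsed.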

func ParseMimeTypeFromText added in v1.1.10

func ParseMimeTypeFromText(text string) string

ParseMimeTypeFromText infers the MIME type from text content. Only plain-text formats are detected; it returns the most likely MIME type.
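The documentation does not describe the detection rules, so the following is only a sketch of how content-based detection for text formats might look. `sniffTextMime` and every heuristic in it are assumptions made for illustration, not the behavior of ParseMimeTypeFromText.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sniffTextMime is a hypothetical content sniffer. Cases are checked in
// order, from the most specific signal to the plain-text fallback.
func sniffTextMime(text string) string {
	trimmed := strings.TrimSpace(text)
	lower := strings.ToLower(trimmed)
	switch {
	case trimmed == "":
		return "text/plain"
	case (trimmed[0] == '{' || trimmed[0] == '[') && json.Valid([]byte(trimmed)):
		return "application/json"
	case strings.HasPrefix(lower, "<?xml"):
		return "application/xml"
	case strings.HasPrefix(lower, "<!doctype html"), strings.HasPrefix(lower, "<html"):
		return "text/html"
	case strings.HasPrefix(trimmed, "# "):
		return "text/markdown"
	default:
		return "text/plain"
	}
}

func main() {
	fmt.Println(sniffTextMime(`{"name": "goRAG"}`))
	fmt.Println(sniffTextMime("# Title\n\nSome body text."))
	fmt.Println(sniffTextMime("just a sentence"))
}
```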

Types

type BaseFormatter added in v1.1.10

type BaseFormatter struct{}

BaseFormatter provides common formatting methods.

func (*BaseFormatter) Format added in v1.1.10

func (f *BaseFormatter) Format(hit *Hit) string

func (*BaseFormatter) FormatAll added in v1.1.10

func (f *BaseFormatter) FormatAll(hits []Hit) string

func (*BaseFormatter) Write added in v1.1.10

func (f *BaseFormatter) Write(w io.Writer, hits []Hit) error

type CacheStore added in v1.1.10

type CacheStore interface {
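	// Get looks up the cached value for key and deserializes it into value.
	// It returns nil when the key does not exist.
	Get(key string, value any) error

	// Set writes to the cache; value is serialized as JSON.
	Set(key string, value any) error

	// Delete removes the given key.
	Delete(key string) error

	// Len returns the number of cache entries.
	Len() int

	// Flush forces dirty in-memory data to be written to disk.
	Flush() error

	// Close closes the cache and releases its resources.
	Close() error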
}

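CacheStore is a general-purpose persistent cache interface. It provides key-value cache reads and writes, where the value is any JSON-serializable data. Implementations may be backed by bbolt, Redis, BadgerDB, or any other persistent store.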
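For tests, an in-memory implementation of this interface can be handy. The sketch below redeclares the interface locally so it compiles on its own; `memCache` is a hypothetical type, nothing is persisted, and `Flush` and `Close` are therefore no-ops.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// CacheStore is redeclared locally, copied from the documented interface.
type CacheStore interface {
	Get(key string, value any) error
	Set(key string, value any) error
	Delete(key string) error
	Len() int
	Flush() error
	Close() error
}

// memCache is a hypothetical in-memory CacheStore keeping JSON-encoded values.
type memCache struct {
	mu   sync.RWMutex
	data map[string][]byte
}

var _ CacheStore = (*memCache)(nil) // compile-time interface check

func newMemCache() *memCache { return &memCache{data: make(map[string][]byte)} }

// Get decodes the stored JSON into value; a missing key yields a nil error
// and leaves value untouched.
func (c *memCache) Get(key string, value any) error {
	c.mu.RLock()
	raw, ok := c.data[key]
	c.mu.RUnlock()
	if !ok {
		return nil
	}
	return json.Unmarshal(raw, value)
}

// Set JSON-encodes value and stores it under key.
func (c *memCache) Set(key string, value any) error {
	raw, err := json.Marshal(value)
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.data[key] = raw
	c.mu.Unlock()
	return nil
}

func (c *memCache) Delete(key string) error {
	c.mu.Lock()
	delete(c.data, key)
	c.mu.Unlock()
	return nil
}

func (c *memCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.data)
}

func (c *memCache) Flush() error { return nil } // nothing to persist
func (c *memCache) Close() error { return nil } // nothing to release

func main() {
	c := newMemCache()
	if err := c.Set("answer", map[string]int{"n": 42}); err != nil {
		fmt.Println("set failed:", err)
		return
	}
	var out map[string]int
	if err := c.Get("answer", &out); err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Println(out["n"], c.Len())
}
```

Because a missing key and a stored zero value both return a nil error, callers that need to tell them apart have to inspect the target value themselves.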

type Chunk

type Chunk struct {
	ID        string         `json:"id"`         // Unique chunk ID
	ParentID  string         `json:"parent_id"`  // Parent chunk / parent document ID (from RawDocument.Source)
	DocID     string         `json:"doc_id"`     // Original document ID (from RawDocument.ID)
	MIMEType  string         `json:"mime_type"`  // Content type
	Content   string         `json:"content"`    // Chunk content (cleaned plain text)
	Metadata  map[string]any `json:"metadata"`   // Extended metadata (from RawDocument.Metadata)
	ChunkMeta ChunkMeta      `json:"chunk_meta"` // Fixed chunk metadata
}

Chunk is the final indexable unit. It is produced by a Chunker and carries all the information from the parsing layer.

type ChunkIndexer added in v1.1.10

type ChunkIndexer interface {
	Indexer
	// IndexChunks indexes multiple pre-generated chunks in batch.
	// This method is used by HybridIndexer to ensure all indexers use the same Chunk IDs.
	//
	// Parameters:
	//   - ctx: Context for cancellation
	//   - chunks: The chunks to index
	//
	// Returns:
	//   - error: An error if the operation fails
	IndexChunks(ctx context.Context, chunks []*Chunk) error
}

ChunkIndexer is an optional interface for indexers that support batch chunk indexing. This interface is used by HybridIndexer to ensure data consistency across multiple indexers.

type ChunkInfo added in v1.1.10

type ChunkInfo struct {
	ChunkID     string   `json:"chunk_id"`     // Unique chunk ID
	ParentID    string   `json:"parent_id"`    // Parent chunk / parent document ID
	Index       int      `json:"index"`        // Chunk sequence number (0, 1, 2, ...)
	Content     string   `json:"content"`      // Chunk content
	StartPos    int      `json:"start_pos"`    // Start position in the original text
	EndPos      int      `json:"end_pos"`      // End position in the original text
	Heading     string   `json:"heading"`      // Heading (innermost)
	HeadingPath []string `json:"heading_path"` // Heading path
}

ChunkInfo describes a single chunk within a reconstructed document.

type ChunkMeta added in v1.1.10

type ChunkMeta struct {
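	Index        int      `json:"index"`         // Sequence number of the chunk within the document (0, 1, 2, ...)
	StartPos     int      `json:"start_pos"`     // Start position of the chunk in the original cleaned text
	EndPos       int      `json:"end_pos"`       // End position of the chunk in the original cleaned text
	HeadingLevel int      `json:"heading_level"` // Heading level of the chunk (from StructureNode)
	HeadingPath  []string `json:"heading_path"`  // Heading path of the chunk (e.g. ["Chapter 1", "Section 1.1"])
}

ChunkMeta holds the fixed metadata of a Chunk (position and heading-level information).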

type ChunkStrategy added in v1.1.10

type ChunkStrategy string

ChunkStrategy is the chunking strategy type.

type Chunker added in v1.1.10

type Chunker interface {
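	// Chunk takes a structured document and produces a set of Chunks, respecting structural boundaries.
	// GraphRAG flow: Document → Chunk → LLM Extractor → Node/Edge → GraphDB
	Chunk(doc *StructuredDocument) ([]*Chunk, error)

	// GetStrategy returns the chunking strategy type.
	GetStrategy() ChunkStrategy
}

Chunker is the chunking interface. It takes the output of the parsing layer and produces the final indexable Chunks.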
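The simplest chunking strategy is fixed-size splitting, shown here on flat text instead of a StructuredDocument to keep the sketch short. `chunk`, `chunkMeta`, and `splitFixed` are hypothetical local types and names, and measuring positions in runes is an assumption; the documentation does not say whether StartPos and EndPos count bytes or runes.

```go
package main

import "fmt"

// chunkMeta mirrors a subset of the documented ChunkMeta fields.
type chunkMeta struct {
	Index    int // sequence number of the chunk
	StartPos int // inclusive start offset, in runes (assumption)
	EndPos   int // exclusive end offset, in runes (assumption)
}

// chunk is a minimal stand-in for the documented Chunk type.
type chunk struct {
	Content string
	Meta    chunkMeta
}

// splitFixed cuts text into consecutive pieces of at most size runes and
// records each piece's index and position. It returns nil for size <= 0.
func splitFixed(text string, size int) []chunk {
	if size <= 0 {
		return nil
	}
	runes := []rune(text)
	var out []chunk
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		out = append(out, chunk{
			Content: string(runes[start:end]),
			Meta:    chunkMeta{Index: len(out), StartPos: start, EndPos: end},
		})
	}
	return out
}

func main() {
	for _, c := range splitFixed("abcdefgh", 3) {
		fmt.Printf("%d [%d,%d) %q\n", c.Meta.Index, c.Meta.StartPos, c.Meta.EndPos, c.Content)
	}
}
```

A structure-aware Chunker would additionally avoid cutting across StructureNode boundaries and would fill in HeadingLevel and HeadingPath.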

type Community added in v1.1.10

type Community struct {
	ID       string   `json:"id"`                  // Unique identifier for the community
	Level    int      `json:"level"`               // Hierarchy level (0 = finest granularity)
	NodeIDs  []string `json:"node_ids"`            // Node IDs in this community
	EdgeIDs  []string `json:"edge_ids"`            // Edge IDs in this community
	ParentID string   `json:"parent_id,omitempty"` // Parent community ID (for hierarchy)

	// LLM-generated summary
	Summary  string   `json:"summary,omitempty"`  // Community summary
	Keywords []string `json:"keywords,omitempty"` // Key topics/concepts

	// Source binding
	SourceChunkIDs []string `json:"source_chunk_ids,omitempty"`
}

Community represents a detected community in the knowledge graph. Communities are hierarchical groups of related nodes, enabling global search.

type CommunityDetector added in v1.1.10

type CommunityDetector interface {
	// Detect identifies communities in the graph and returns them hierarchically.
	Detect(ctx context.Context, graphStore GraphStore) ([]*Community, error)
}

CommunityDetector defines the interface for community detection algorithms.

type CommunityMatch added in v1.1.10

type CommunityMatch struct {
	CommunityID string   `json:"community_id"`
	Score       float32  `json:"score"`
	Summary     string   `json:"summary"`
	Keywords    []string `json:"keywords"`
}

CommunityMatch represents a matched community during global search.

type Document added in v1.1.10

type Document interface {
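	GetID() string           // Unique ID of the raw document, used to link it to StructuredDocument, Entity, and Relation
	GetContent() string      // Plain-text content of the file (core)
	GetMimeType() string     // Type of the file content
	GetMeta() map[string]any // Basic file metadata (file name, size, modification time, owner, etc.)
	GetImages() []Image      // Attached content (e.g. other attachments embedded in the file, such as images, video, or audio)
	GetSource() string       // File source (path/URL/URI)
	GetExt() string          // File extension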
}

type Edge added in v1.1.10

type Edge struct {
	ID        string `json:"id"`                  // Unique identifier for the edge
	Type      string `json:"type"`                // Type of the edge (e.g., WORKS_FOR, LOCATED_IN, BELONGS_TO)
	Source    string `json:"source"`              // Source node ID (subject entity)
	Target    string `json:"target"`              // Target node ID (object entity)
	Predicate string `json:"predicate,omitempty"` // Relationship type alias (e.g., "就职于", "属于")

	// Properties stores extended features with standardized keys:
	// - "confidence": float32 - extraction confidence (0~1 from LLM/rules)
	// - "score": float32 - relationship strength score
	// - "evidence": string - text evidence for the relationship
	// - custom fields as needed
	Properties map[string]any `json:"properties,omitempty"`

	// Source binding - following Microsoft GraphRAG design
	SourceChunkIDs []string `json:"source_chunk_ids,omitempty"` // IDs of source chunks
	SourceDocIDs   []string `json:"source_doc_ids,omitempty"`   // IDs of source documents
}

Edge represents a graph edge entity in the RAG system. Edges represent relationships between entities and are also bound to source text. Unified relationship structure combining advantages from Relation design.

type Embedder added in v1.1.10

type Embedder interface {
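	// Calc computes the vector representation of a Chunk.
	Calc(chunk *Chunk) (*Vector, error)

	// CalcText computes the vector representation of a text directly (used for queries).
	CalcText(text string) (*Vector, error)

	// CalcImage computes the vector representation of an image directly (used for queries).
	CalcImage(data []byte) (*Vector, error)

	// Bulk computes the vector representations of multiple Chunks in one batch.
	Bulk(chunks []*Chunk) ([]*Vector, error)

	Dim() int

	// Multimoding reports whether multimodal input is supported.
	Multimoding() bool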
}

type Formatter added in v1.1.10

type Formatter interface {
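	// Format formats a single Hit.
	Format(hit *Hit) string

	// FormatAll formats multiple Hits.
	FormatAll(hits []Hit) string

	// Write formats the Hits and writes them to the output stream.
	Write(w io.Writer, hits []Hit) error
}

Formatter defines the interface for formatting search results.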
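A custom Formatter only needs these three methods. The sketch below uses a trimmed-down local `Hit` (a subset of the documented fields) and a hypothetical `lineFormatter` that renders one hit per line; the output layout is an arbitrary choice for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Hit is a local stand-in with a subset of the documented Hit fields.
type Hit struct {
	ID      string
	Score   float32
	Content string
}

// Formatter is redeclared locally, copied from the documented interface.
type Formatter interface {
	Format(hit *Hit) string
	FormatAll(hits []Hit) string
	Write(w io.Writer, hits []Hit) error
}

// lineFormatter is a hypothetical Formatter that prints one hit per line.
type lineFormatter struct{}

var _ Formatter = lineFormatter{} // compile-time interface check

// Format renders a hit as "[score] id: content".
func (lineFormatter) Format(hit *Hit) string {
	return fmt.Sprintf("[%.2f] %s: %s", hit.Score, hit.ID, hit.Content)
}

// FormatAll formats every hit and terminates each with a newline.
func (f lineFormatter) FormatAll(hits []Hit) string {
	var sb strings.Builder
	for i := range hits {
		sb.WriteString(f.Format(&hits[i]))
		sb.WriteByte('\n')
	}
	return sb.String()
}

// Write sends the formatted hits to w.
func (f lineFormatter) Write(w io.Writer, hits []Hit) error {
	_, err := io.WriteString(w, f.FormatAll(hits))
	return err
}

func main() {
	var f Formatter = lineFormatter{}
	hits := []Hit{
		{ID: "c1", Score: 0.91, Content: "first result"},
		{ID: "c2", Score: 0.5, Content: "second result"},
	}
	if err := f.Write(os.Stdout, hits); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```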

type FullTextSearchResult added in v1.1.10

type FullTextSearchResult struct {
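	ID        string         // chunk ID
	Score     float64        // Relevance score (computed by the search engine)
	DocID     string         // ID of the owning document
	Content   string         // Matched text fragment
	Metadata  map[string]any // Extended metadata (from the original Chunk.Metadata)
	ChunkMeta ChunkMeta      // Fixed chunk metadata (from the original Chunk.ChunkMeta)
}

FullTextSearchResult is a full-text search result.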

type FullTextStore added in v1.1.10

type FullTextStore interface {
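	// Index writes the chunk to the full-text index.
	Index(chunk *Chunk) error

	// Search runs a full-text search and returns the list of matching results.
	Search(query string, topK int) ([]FullTextSearchResult, error)

	// Delete removes the given chunk from the index.
	Delete(chunkID string) error
}

FullTextStore is the full-text storage interface (backed by a search engine such as bleve).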
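To exercise code that depends on this interface without a real search engine, a naive in-memory store can stand in. In the sketch below, `memFullText` is hypothetical, `Chunk` and `FullTextSearchResult` are local subsets of the documented types, and scoring by raw term-occurrence count is a deliberate simplification, not BM25 or anything bleve does.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// Chunk and FullTextSearchResult are local subsets of the documented types.
type Chunk struct {
	ID, DocID, Content string
}

type FullTextSearchResult struct {
	ID      string
	Score   float64
	DocID   string
	Content string
}

// memFullText is a hypothetical in-memory store keyed by chunk ID.
type memFullText struct {
	chunks map[string]*Chunk
}

func newMemFullText() *memFullText {
	return &memFullText{chunks: make(map[string]*Chunk)}
}

// Index stores the chunk, replacing any chunk with the same ID.
func (m *memFullText) Index(chunk *Chunk) error {
	if chunk == nil || chunk.ID == "" {
		return errors.New("chunk with a non-empty ID is required")
	}
	m.chunks[chunk.ID] = chunk
	return nil
}

// Search scores each chunk by how often the query terms occur in it
// (case-insensitive) and returns the topK best, ties broken by ID.
func (m *memFullText) Search(query string, topK int) ([]FullTextSearchResult, error) {
	terms := strings.Fields(strings.ToLower(query))
	var hits []FullTextSearchResult
	for _, c := range m.chunks {
		content := strings.ToLower(c.Content)
		score := 0
		for _, t := range terms {
			score += strings.Count(content, t)
		}
		if score > 0 {
			hits = append(hits, FullTextSearchResult{ID: c.ID, Score: float64(score), DocID: c.DocID, Content: c.Content})
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].ID < hits[j].ID
	})
	if topK >= 0 && len(hits) > topK {
		hits = hits[:topK]
	}
	return hits, nil
}

// Delete removes the chunk; deleting an unknown ID is not an error.
func (m *memFullText) Delete(chunkID string) error {
	delete(m.chunks, chunkID)
	return nil
}

func main() {
	s := newMemFullText()
	_ = s.Index(&Chunk{ID: "c1", DocID: "d1", Content: "Go makes Go tooling simple"})
	_ = s.Index(&Chunk{ID: "c2", DocID: "d1", Content: "Rust and Go"})
	res, _ := s.Search("go", 5)
	for _, r := range res {
		fmt.Println(r.ID, r.Score)
	}
}
```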

type GraphStore added in v1.1.10

type GraphStore interface {
	// UpsertNodes inserts or updates entities (e.g., PERSON, ORGANIZATION)
	UpsertNodes(ctx context.Context, nodes []*Node) error

	// UpsertEdges inserts or updates relationships between entities
	UpsertEdges(ctx context.Context, edges []*Edge) error

	// GetNode retrieves a single node/entity by ID
	GetNode(ctx context.Context, id string) (*Node, error)

	// GetNeighbors fetches up to 'limit' connected nodes and edges within 'depth' hops of 'nodeID'
	GetNeighbors(ctx context.Context, nodeID string, depth int, limit int) ([]*Node, []*Edge, error)

	// DeleteNode removes a node by ID
	DeleteNode(ctx context.Context, id string) error

	// DeleteEdge removes an edge by ID
	DeleteEdge(ctx context.Context, id string) error

	// Query semantic graph structure. Implementations (Neo4j, Nebula) usually take Cypher/GQL.
	Query(ctx context.Context, query string, params map[string]any) ([]map[string]any, error)

	// GetNodesByChunkIDs retrieves all nodes associated with the given chunk IDs
	// This is used in hybrid search to find entities related to semantic search results
	GetNodesByChunkIDs(ctx context.Context, chunkIDs []string) ([]*Node, error)

	// GetEdgesByChunkIDs retrieves all edges associated with the given chunk IDs
	// This is used in hybrid search to find relationships related to semantic search results
	GetEdgesByChunkIDs(ctx context.Context, chunkIDs []string) ([]*Edge, error)

	// GetCommunitySummaries fetches hierarchical community abstracts, which are core to Microsoft's GraphRAG paper.
	GetCommunitySummaries(ctx context.Context, level int) ([]map[string]any, error)

	// GetMultiHopPaths performs multi-hop traversal from starting node IDs,
	// optionally filtering by edge types. Returns discovered nodes and edges.
	// depth controls how many hops to traverse (1 = direct neighbors only).
	// limit caps the total number of results returned.
	GetMultiHopPaths(ctx context.Context, nodeIDs []string, edgeTypes []string, depth int, limit int) ([]*Node, []*Edge, error)

	// GetAllEdgeTypes returns all distinct edge types present in the graph.
	// Useful for introspection and building UI filters.
	GetAllEdgeTypes(ctx context.Context) ([]string, error)

	// Close cleanly tears down Graph Store connections
	Close(ctx context.Context) error
}

GraphStore defines the storage foundation for GraphRAG. It tracks Nodes (Entities), Edges (Relationships), and supports semantic property queries.
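The depth and limit parameters of GetNeighbors and GetMultiHopPaths suggest a bounded breadth-first traversal. The sketch below shows one way such a traversal could work over an in-memory edge list. `edge` and `neighbors` are hypothetical, edges are treated as undirected (an assumption), and only node IDs are returned.

```go
package main

import "fmt"

// edge is a minimal stand-in for the documented Edge (Source and Target only).
type edge struct {
	Source, Target string
}

// neighbors returns the IDs of nodes reachable from start within depth hops,
// in breadth-first order, stopping once limit IDs have been collected.
func neighbors(edges []edge, start string, depth, limit int) []string {
	// Build an undirected adjacency list.
	adj := make(map[string][]string)
	for _, e := range edges {
		adj[e.Source] = append(adj[e.Source], e.Target)
		adj[e.Target] = append(adj[e.Target], e.Source)
	}

	seen := map[string]bool{start: true}
	frontier := []string{start}
	var found []string
	for hop := 0; hop < depth && len(frontier) > 0; hop++ {
		var next []string
		for _, id := range frontier {
			for _, nb := range adj[id] {
				if seen[nb] {
					continue
				}
				if len(found) >= limit {
					return found
				}
				seen[nb] = true
				found = append(found, nb)
				next = append(next, nb)
			}
		}
		frontier = next
	}
	return found
}

func main() {
	edges := []edge{{"a", "b"}, {"b", "c"}, {"c", "d"}}
	fmt.Println(neighbors(edges, "a", 1, 10)) // direct neighbors only
	fmt.Println(neighbors(edges, "a", 2, 10)) // two hops
}
```

A real backend would normally push this traversal into the database query rather than loading edges into memory.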

type Hit added in v1.1.10

type Hit struct {
	ID        string         `json:"id"`         // Result ID
	Score     float32        `json:"score"`      // Similarity score
	Content   string         `json:"content"`    // Result content
	DocID     string         `json:"doc_id"`     // Document ID
	Metadata  map[string]any `json:"metadata"`   // Extended metadata (from the original Chunk.Metadata)
	ChunkMeta ChunkMeta      `json:"chunk_meta"` // Fixed chunk metadata (from the original Chunk.ChunkMeta)
}

Hit is the search result structure.

type Image added in v1.1.10

type Image struct {
	// contains filtered or unexported fields
}

func NewImage added in v1.1.10

func NewImage(data []byte) *Image

func (*Image) Data added in v1.1.10

func (i *Image) Data() []byte

type Indexer added in v1.1.10

type Indexer interface {
	// Name returns the name of the indexer.
	//
	// Returns:
	//   - string: The name of the indexer
	Name() string

	// Type returns the type of the indexer.
	//
	// Returns:
	//   - string: The type of the indexer
	Type() string

	// Add adds content to the index.
	//
	// Parameters:
	//   - ctx: Context for cancellation
	//   - content: The content to add to the index
	//
	// Returns:
	//   - []*Chunk: The chunks created from the content
	//   - error: An error if the operation fails
	Add(ctx context.Context, content string) ([]*Chunk, error)

	AddFile(ctx context.Context, filePath string) ([]*Chunk, error)

	NewQuery(terms string) Query

	// Search searches the index for the given query.
	//
	// Parameters:
	//   - ctx: Context for cancellation
	//   - query: The query to search for
	//
	// Returns:
	//   - []Hit: The search results
	//   - error: An error if the operation fails
	Search(ctx context.Context, query Query) ([]Hit, error)

	// Remove removes a chunk from the index.
	//
	// Parameters:
	//   - ctx: Context for cancellation
	//   - chunkID: The ID of the chunk to remove
	//
	// Returns:
	//   - error: An error if the operation fails
	Remove(ctx context.Context, chunkID string) error

	// IndexChunk indexes a pre-generated chunk.
	// This method is used by HybridIndexer to ensure all indexers use the same Chunk IDs.
	//
	// Parameters:
	//   - ctx: Context for cancellation
	//   - chunk: The chunk to index
	//
	// Returns:
	//   - error: An error if the operation fails
	IndexChunk(ctx context.Context, chunk *Chunk) error
}

Indexer defines the interface for indexers in the RAG system. Indexers are responsible for adding content to an index, searching the index, and removing content from the index.

type Loader added in v1.1.10

type Loader interface {
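	// Load reads the file at the given path/URL and returns the raw document.
	Load(path string) (Document, error)
	// SupportTypes returns the list of supported file types.
	SupportTypes() []string
}

Loader is the loader interface. It reads all kinds of files in a uniform way and outputs a raw binary document.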

type Node added in v1.1.10

type Node struct {
	ID   string `json:"id"`   // Unique identifier for the node
	Type string `json:"type"` // Type of the node (e.g., PERSON, ORGANIZATION, LOCATION, TECHNOLOGY)
	Name string `json:"name"` // Entity name (cleaned text, e.g., "张三", "阿里巴巴")

	// Properties stores extended features with standardized keys:
	// - "confidence": float32 - extraction confidence (0~1 from LLM/rules)
	// - "frequency": int - occurrence count across documents
	// - "vectors": []float32 - semantic embedding vectors
	// - "aliases": []string - alternative names
	// - custom fields as needed
	Properties map[string]any `json:"properties,omitempty"`

	// Source binding - following Microsoft GraphRAG design: graph as index layer
	SourceChunkIDs []string `json:"source_chunk_ids,omitempty"` // IDs of source chunks
	SourceDocIDs   []string `json:"source_doc_ids,omitempty"`   // IDs of source documents
}

Node represents a graph node entity in the RAG system. In GraphRAG, nodes are derived from text chunks and serve as an index layer. Unified entity structure combining advantages from Entity design.

type Query added in v1.1.10

type Query interface {

	// Raw returns the raw, unprocessed query string.
	//
	// Returns:
	//   - string: The raw query string
	Raw() string

	// Keywords returns the extracted keywords from the query.
	// The extracted keywords can be used for exact-match BM25 lookups.
	//
	// Returns:
	//   - []string: The extracted keywords
	Keywords() []string

	// Filters returns the filters to apply to the search.
	// Filter conditions extracted from the query can improve the precision of semantic search.
	//
	// Returns:
	//   - map[string]any: The filters
	Filters() map[string]any

	// AddFilter adds a filter to the query.
	AddFilter(key string, value any) Query
}

Query defines the interface for queries in the RAG system. Queries are used to search the index for relevant content.
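Since AddFilter returns a Query, filters can be chained. The sketch below redeclares the interface locally and implements it with a hypothetical `simpleQuery`; splitting the raw string on whitespace is a placeholder for real keyword extraction.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is redeclared locally, copied from the documented interface.
type Query interface {
	Raw() string
	Keywords() []string
	Filters() map[string]any
	AddFilter(key string, value any) Query
}

// simpleQuery is a hypothetical Query backed by a raw string and a filter map.
type simpleQuery struct {
	raw     string
	filters map[string]any
}

func newSimpleQuery(raw string) *simpleQuery {
	return &simpleQuery{raw: raw, filters: make(map[string]any)}
}

func (q *simpleQuery) Raw() string { return q.raw }

// Keywords lower-cases the raw query and splits it on whitespace.
func (q *simpleQuery) Keywords() []string {
	return strings.Fields(strings.ToLower(q.raw))
}

func (q *simpleQuery) Filters() map[string]any { return q.filters }

// AddFilter stores the filter and returns the same query to allow chaining.
func (q *simpleQuery) AddFilter(key string, value any) Query {
	q.filters[key] = value
	return q
}

func main() {
	q := newSimpleQuery("Graph RAG communities").
		AddFilter("lang", "go").
		AddFilter("year", 2026)
	fmt.Println(q.Raw())
	fmt.Println(q.Keywords())
	fmt.Println(len(q.Filters()))
}
```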

type ReconstructedDocument added in v1.1.10

type ReconstructedDocument struct {
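	DocID   string      `json:"doc_id"`  // Original document ID
	Title   string      `json:"title"`   // Document title (inferred from Metadata)
	Chunks  []ChunkInfo `json:"chunks"`  // All chunks (sorted by index)
	Content string      `json:"content"` // Fully reconstructed document content
}

ReconstructedDocument is a complete document rebuilt from fragments stored in a vector database.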

func ReconstructDocument added in v1.1.10

func ReconstructDocument(vectors []*Vector) *ReconstructedDocument

ReconstructDocument rebuilds a complete document from vector fragments. The metadata of each vector must contain the doc_id, content, and chunk_meta fields.
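At its core, reconstruction amounts to ordering the fragments by chunk index and concatenating their content. The sketch below shows only that idea: `piece` is a hypothetical flattened form of what would be read from each vector's metadata, and joining with a newline is an assumption about the separator.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// piece is a hypothetical flattened view of one vector's metadata:
// the chunk index (from chunk_meta) and the chunk content.
type piece struct {
	Index   int
	Content string
}

// reconstruct orders the pieces by index and joins their content.
// The input slice is copied first so the caller's ordering is preserved.
func reconstruct(pieces []piece) string {
	sorted := make([]piece, len(pieces))
	copy(sorted, pieces)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Index < sorted[j].Index })

	parts := make([]string, len(sorted))
	for i, p := range sorted {
		parts[i] = p.Content
	}
	return strings.Join(parts, "\n")
}

func main() {
	fmt.Println(reconstruct([]piece{
		{Index: 2, Content: "third"},
		{Index: 0, Content: "first"},
		{Index: 1, Content: "second"},
	}))
}
```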

type SearchMode added in v1.1.10

type SearchMode string

SearchMode defines the search strategy for GraphRAG retrieval.

const (
	// SearchModeLocal uses graph traversal from extracted entities.
	// Best for: specific questions about entities and their relationships.
	SearchModeLocal SearchMode = "local"

	// SearchModeGlobal uses community summaries for macro-level queries.
	// Best for: "What are the main themes?" type questions.
	SearchModeGlobal SearchMode = "global"

	// SearchModeHybrid combines local and global search with vector search.
	// Best for: complex queries needing both specific facts and context.
	SearchModeHybrid SearchMode = "hybrid"
)
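Because SearchMode is a plain string type, values arriving from configuration or request parameters need validating before use. The sketch below redeclares the type and constants locally; `parseSearchMode` is a hypothetical helper, and its trimming and case-folding are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// SearchMode and its constants are redeclared locally, copied from the docs.
type SearchMode string

const (
	SearchModeLocal  SearchMode = "local"
	SearchModeGlobal SearchMode = "global"
	SearchModeHybrid SearchMode = "hybrid"
)

// parseSearchMode is a hypothetical helper that normalizes user input and
// rejects anything that is not one of the three documented modes.
func parseSearchMode(s string) (SearchMode, error) {
	switch m := SearchMode(strings.ToLower(strings.TrimSpace(s))); m {
	case SearchModeLocal, SearchModeGlobal, SearchModeHybrid:
		return m, nil
	default:
		return "", fmt.Errorf("unknown search mode %q", s)
	}
}

func main() {
	for _, in := range []string{"local", " Hybrid ", "fuzzy"} {
		mode, err := parseSearchMode(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("mode:", mode)
	}
}
```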

type StructureNode added in v1.1.10

type StructureNode struct {
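	NodeType string           `json:"node_type"` // Node type (heading/paragraph/table/list, etc.)
	Title    string           `json:"title"`     // Node title (only valid for heading nodes)
	Level    int              `json:"level"`     // Heading level (only valid for heading nodes; H1=1, H2=2, ...)
	Text     string           `json:"text"`      // Cleaned plain-text content (core; free of formatting debris)
	StartPos int              `json:"start_pos"` // Start position of the text in the original cleaned content (used to locate chunks)
	EndPos   int              `json:"end_pos"`   // End position of the text in the original cleaned content (used to locate chunks)
	Children []*StructureNode `json:"children"`  // Child nodes (e.g. H2 under H1, lists under a paragraph)
}

StructureNode is a document structure node. It corresponds to a unit in the document such as a heading, paragraph, list, or table.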
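A tree like this is usually consumed with a depth-first walk that tracks the enclosing headings, which is how a value such as ChunkMeta.HeadingPath could be derived. The sketch below uses a local `StructureNode` with a subset of the documented fields; `walk`, `paragraphPaths`, and `sampleTree` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// StructureNode is a local stand-in with a subset of the documented fields.
type StructureNode struct {
	NodeType string
	Title    string
	Text     string
	Children []*StructureNode
}

// walk visits n and its descendants depth-first. path holds the titles of
// the heading nodes from the root down to and including n.
func walk(n *StructureNode, path []string, visit func(n *StructureNode, path []string)) {
	if n == nil {
		return
	}
	if n.NodeType == "heading" {
		// Copy before appending so sibling branches never share a backing array.
		path = append(append([]string(nil), path...), n.Title)
	}
	visit(n, path)
	for _, c := range n.Children {
		walk(c, path, visit)
	}
}

// paragraphPaths lists every paragraph as "Heading > Subheading: text".
func paragraphPaths(root *StructureNode) []string {
	var out []string
	walk(root, nil, func(n *StructureNode, path []string) {
		if n.NodeType == "paragraph" {
			out = append(out, strings.Join(path, " > ")+": "+n.Text)
		}
	})
	return out
}

// sampleTree builds a tiny two-level document for the demo.
func sampleTree() *StructureNode {
	return &StructureNode{NodeType: "heading", Title: "Guide", Children: []*StructureNode{
		{NodeType: "paragraph", Text: "Welcome."},
		{NodeType: "heading", Title: "Install", Children: []*StructureNode{
			{NodeType: "paragraph", Text: "Run the installer."},
		}},
	}}
}

func main() {
	for _, line := range paragraphPaths(sampleTree()) {
		fmt.Println(line)
	}
}
```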

func (*StructureNode) Clean added in v1.1.10

func (n *StructureNode) Clean()

Clean cleans the Text and Title fields of the current node and recursively cleans all child nodes.

func (*StructureNode) ID added in v1.1.10

func (n *StructureNode) ID() string

type StructuredDocument added in v1.1.10

type StructuredDocument struct {
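	RawDoc Document       `json:"raw_doc"` // Raw document object
	Title  string         `json:"title"`   // Overall document title (cleaned)
	Root   *StructureNode `json:"root"`    // Root node of the document structure (top-level node)
}

StructuredDocument is a structured document that presents the hierarchy of the whole document as a tree.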

func (*StructuredDocument) ID added in v1.1.10

func (s *StructuredDocument) ID() string

func (*StructuredDocument) Meta added in v1.1.10

func (s *StructuredDocument) Meta() map[string]any

func (*StructuredDocument) SetValue added in v1.1.10

func (s *StructuredDocument) SetValue(key string, value any) *StructuredDocument

type Structurizer added in v1.1.10

type Structurizer interface {
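	// Parse takes a raw document, cleans it first, then structures it, and outputs a structured document.
	// Internal flow: RawDocument (dirty) → data cleaning → structural parsing → StructuredDocument (clean)
	Parse(raw Document) (*StructuredDocument, error)
}

Structurizer is the structuring interface. It is responsible for data cleaning and for parsing the document structure.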

type Vector added in v1.1.10

type Vector struct {
	ID       string         `json:"id"`       // Unique identifier for the vector
	Values   []float32      `json:"values"`   // The vector values
	ChunkID  string         `json:"chunk_id"` // ID of the corresponding chunk
	Metadata map[string]any `json:"metadata"` // Additional metadata about the vector
}

Vector represents a vector entity in the RAG system. It contains the vector representation of a document chunk.

func NewVector added in v1.1.10

func NewVector(values []float32, metadata map[string]any) *Vector

NewVector creates a new Vector instance with the specified parameters.

Parameters:

  • values: the embedding vector values (float32 slice)
  • metadata: additional metadata for filtering and tracking

type VectorStore added in v1.1.10

type VectorStore interface {
	// Upsert inserts or updates vectors in the store.
	// If a vector with the same ID exists, it will be updated; otherwise, it will be inserted.
	//
	// Parameters:
	//   - ctx: Context for cancellation and timeout
	//   - vectors: Slice of vectors to insert or update
	//
	// Returns:
	//   - error: Any error that occurred during the operation
	Upsert(ctx context.Context, vectors []*Vector) error

	// Search performs a similarity search to find the most similar vectors.
	// It returns the topK most similar vectors along with their similarity scores.
	//
	// Parameters:
	//   - ctx: Context for cancellation and timeout
	//   - query: The query vector to search for
	//   - topK: Maximum number of results to return
	//   - filters: Optional metadata filters to apply
	//
	// Returns:
	//   - []*Vector: The most similar vectors found
	//   - []float32: Similarity scores for each result
	//   - error: Any error that occurred during search
	Search(ctx context.Context, query []float32, topK int, filters map[string]any) ([]*Vector, []float32, error)

	// Delete removes a vector from the store by its ID.
	//
	// Parameters:
	//   - ctx: Context for cancellation and timeout
	//   - id: The unique identifier of the vector to delete
	//
	// Returns:
	//   - error: Any error that occurred during deletion
	Delete(ctx context.Context, id string) error

	// GetByDocID retrieves all vectors belonging to the same document by doc_id.
	// This enables "knowledge traceability" — reconstructing the original document
	// from individual chunks stored in the vector store.
	//
	// Parameters:
	//   - ctx: Context for cancellation and timeout
	//   - docID: The document ID to search for (from Chunk.DocID)
	//
	// Returns:
	//   - []*Vector: All vectors belonging to the document (sorted by chunk index)
	//   - error: Any error that occurred during retrieval
	//
	// Example usage:
	//
	//	vectors, err := store.GetByDocID(ctx, docID)
	//	if err != nil { ... }
	//	doc := ReconstructDocument(vectors)
	GetByDocID(ctx context.Context, docID string) ([]*Vector, error)

	// Close gracefully shuts down the vector store connection.
	// It should release all resources and close any open connections.
	//
	// Parameters:
	//   - ctx: Context for cancellation and timeout
	//
	// Returns:
	//   - error: Any error that occurred during shutdown
	Close(ctx context.Context) error
}

VectorStore defines the interface for vector storage and similarity search. It provides methods for storing embedding vectors and performing efficient nearest neighbor searches. Implementations can use various vector databases like Milvus, Pinecone, Qdrant, Weaviate, or in-memory stores.

Key responsibilities:

  • Store and update vector embeddings with associated metadata
  • Perform similarity searches using cosine distance or other metrics
  • Support metadata filtering during searches
  • Manage the lifecycle of stored vectors

Example usage:

store := NewMilvusVectorStore()
err := store.Upsert(ctx, vectors)
if err != nil {
    log.Fatal(err)
}
results, scores, err := store.Search(ctx, queryVector, 10, filters)
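For intuition about what Search computes, a brute-force in-memory version ranks every stored vector by cosine similarity to the query and keeps the best topK. The sketch below is not tied to any of the databases named above; `vector`, `scored`, `cosine`, and `topK` are hypothetical names, and metadata filtering is left out.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// vector is a minimal stand-in for the documented Vector (ID and Values only).
type vector struct {
	ID     string
	Values []float32
}

// scored pairs a vector ID with its similarity to the query.
type scored struct {
	ID    string
	Score float32
}

// cosine returns the cosine similarity of a and b, or 0 when the lengths
// differ or either vector has zero magnitude.
func cosine(a, b []float32) float32 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// topK scores every stored vector against the query and returns the k most
// similar, best first. Ties keep their original store order.
func topK(store []vector, query []float32, k int) []scored {
	if k < 0 {
		k = 0
	}
	out := make([]scored, 0, len(store))
	for _, v := range store {
		out = append(out, scored{ID: v.ID, Score: cosine(v.Values, query)})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if k < len(out) {
		out = out[:k]
	}
	return out
}

func main() {
	store := []vector{
		{ID: "a", Values: []float32{1, 0}},
		{ID: "b", Values: []float32{0, 1}},
		{ID: "c", Values: []float32{1, 1}},
	}
	for _, s := range topK(store, []float32{1, 0}, 2) {
		fmt.Printf("%s %.3f\n", s.ID, s.Score)
	}
}
```

This linear scan costs time proportional to the number of stored vectors per query, which is why production stores rely on approximate nearest-neighbor indexes instead.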
