Documentation
Overview ¶
Package entity defines the core entities for the goRAG framework.
Index ¶
- Constants
- Variables
- func CleanText(text string) string
- func ParseMimeTypeFromText(text string) string
- type BaseFormatter
- type CacheStore
- type Chunk
- type ChunkIndexer
- type ChunkInfo
- type ChunkMeta
- type ChunkStrategy
- type Chunker
- type Community
- type CommunityDetector
- type CommunityMatch
- type Document
- type Edge
- type Embedder
- type Formatter
- type FullTextSearchResult
- type FullTextStore
- type GraphStore
- type Hit
- type Image
- type Indexer
- type Loader
- type Node
- type Query
- type ReconstructedDocument
- type SearchMode
- type StructureNode
- type StructuredDocument
- type Structurizer
- type Vector
- type VectorStore
Constants ¶
const (
// Basic text formats
MimeTypeTextPlain = "text/plain"
MimeTypeTextMarkdown = "text/markdown"
MimeTypeTextHTML = "text/html"
MimeTypeApplicationJSON = "application/json"
MimeTypeTextCSS = "text/css"
MimeTypeTextJavaScript = "text/javascript"
MimeTypeApplicationXML = "application/xml"
MimeTypeTextXML = "text/xml"
MimeTypeTextYAML = "text/yaml"
MimeTypeTextTOML = "text/toml"
MimeTypeTextCSV = "text/csv"
MimeTypeTextTSV = "text/tab-separated-values"
MimeTypeTextSQL = "text/sql"
// Programming languages
MimeTypeTextPython = "text/x-python"
MimeTypeTextGo = "text/x-go"
MimeTypeTextJava = "text/x-java"
MimeTypeTextC = "text/x-c"
MimeTypeTextCPP = "text/x-c++"
MimeTypeTextCsharp = "text/x-csharp"
MimeTypeTextPHP = "text/x-php"
MimeTypeTextRuby = "text/x-ruby"
MimeTypeTextPerl = "text/x-perl"
MimeTypeTextBash = "text/x-sh"
MimeTypeTextPowerShell = "text/x-powershell"
MimeTypeTextRust = "text/x-rust"
MimeTypeTextSwift = "text/x-swift"
MimeTypeTextKotlin = "text/x-kotlin"
MimeTypeTextTypeScript = "text/typescript"
MimeTypeTextVue = "text/vue"
MimeTypeTextSvelte = "text/svelte"
MimeTypeTextGraphQL = "application/graphql"
// Image formats
MimeTypeImageJPEG = "image/jpeg"
MimeTypeImagePNG = "image/png"
MimeTypeImageGIF = "image/gif"
MimeTypeImageWebP = "image/webp"
MimeTypeImageBMP = "image/bmp"
MimeTypeImageSVG = "image/svg+xml"
// Office documents
MimeTypeApplicationMsWord = "application/msword"
MimeTypeApplicationWordOpenXML = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
MimeTypeApplicationMsExcel = "application/vnd.ms-excel"
MimeTypeApplicationExcelOpenXML = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
MimeTypeApplicationMsPowerpoint = "application/vnd.ms-powerpoint"
MimeTypeApplicationPowerpointOpenXML = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
MimeTypeApplicationPDF = "application/pdf"
// Other
MimeTypeApplicationXYAML = "application/x-yaml"
MimeTypeApplicationToml = "application/toml"
)
MIME type constants.
Variables ¶
var ExtMimeTypes = map[string]string{
MimeTypeTextPlain: ".txt",
MimeTypeTextMarkdown: ".md",
MimeTypeTextHTML: ".html",
MimeTypeApplicationJSON: ".json",
MimeTypeTextCSS: ".css",
MimeTypeTextJavaScript: ".js",
MimeTypeTextXML: ".xml",
MimeTypeTextYAML: ".yaml",
MimeTypeTextTOML: ".toml",
MimeTypeTextCSV: ".csv",
MimeTypeApplicationXYAML: ".yaml",
MimeTypeApplicationToml: ".toml",
MimeTypeTextPython: ".py",
MimeTypeTextGo: ".go",
MimeTypeTextJava: ".java",
MimeTypeTextC: ".c",
MimeTypeTextCPP: ".cpp",
MimeTypeTextCsharp: ".cs",
MimeTypeTextPHP: ".php",
MimeTypeTextRuby: ".rb",
MimeTypeTextPerl: ".pl",
MimeTypeTextBash: ".sh",
MimeTypeTextPowerShell: ".ps1",
MimeTypeTextRust: ".rs",
MimeTypeTextSwift: ".swift",
MimeTypeTextKotlin: ".kt",
MimeTypeTextTypeScript: ".ts",
MimeTypeTextVue: ".vue",
MimeTypeTextSvelte: ".svelte",
MimeTypeTextGraphQL: ".graphql",
MimeTypeImageJPEG: ".jpg",
MimeTypeImagePNG: ".png",
MimeTypeImageGIF: ".gif",
MimeTypeImageWebP: ".webp",
MimeTypeImageBMP: ".bmp",
MimeTypeImageSVG: ".svg",
MimeTypeApplicationMsWord: ".doc",
MimeTypeApplicationWordOpenXML: ".docx",
MimeTypeApplicationMsExcel: ".xls",
MimeTypeApplicationExcelOpenXML: ".xlsx",
MimeTypeApplicationMsPowerpoint: ".ppt",
MimeTypeApplicationPowerpointOpenXML: ".pptx",
MimeTypeApplicationPDF: ".pdf",
}
ExtMimeTypes is the reverse mapping, from MIME type to file extension.
var MimeTypes = map[string]string{
".txt": MimeTypeTextPlain,
".md": MimeTypeTextMarkdown,
".html": MimeTypeTextHTML,
".htm": MimeTypeTextHTML,
".json": MimeTypeApplicationJSON,
".css": MimeTypeTextCSS,
".js": MimeTypeTextJavaScript,
".xml": MimeTypeTextXML,
".yaml": MimeTypeTextYAML,
".yml": MimeTypeTextYAML,
".toml": MimeTypeTextTOML,
".csv": MimeTypeTextCSV,
".tsv": MimeTypeTextTSV,
".sql": MimeTypeTextSQL,
".py": MimeTypeTextPython,
".go": MimeTypeTextGo,
".java": MimeTypeTextJava,
".c": MimeTypeTextC,
".cpp": MimeTypeTextCPP,
".h": MimeTypeTextC,
".hpp": MimeTypeTextCPP,
".cs": MimeTypeTextCsharp,
".php": MimeTypeTextPHP,
".rb": MimeTypeTextRuby,
".pl": MimeTypeTextPerl,
".sh": MimeTypeTextBash,
".ps1": MimeTypeTextPowerShell,
".rs": MimeTypeTextRust,
".swift": MimeTypeTextSwift,
".kt": MimeTypeTextKotlin,
".ts": MimeTypeTextTypeScript,
".vue": MimeTypeTextVue,
".svelte": MimeTypeTextSvelte,
".graphql": MimeTypeTextGraphQL,
".gql": MimeTypeTextGraphQL,
".ini": MimeTypeTextPlain,
".conf": MimeTypeTextPlain,
".cfg": MimeTypeTextPlain,
".env": MimeTypeTextPlain,
".jpg": MimeTypeImageJPEG,
".jpeg": MimeTypeImageJPEG,
".png": MimeTypeImagePNG,
".gif": MimeTypeImageGIF,
".webp": MimeTypeImageWebP,
".bmp": MimeTypeImageBMP,
".svg": MimeTypeImageSVG,
".doc": MimeTypeApplicationMsWord,
".docx": MimeTypeApplicationWordOpenXML,
".xls": MimeTypeApplicationMsExcel,
".xlsx": MimeTypeApplicationExcelOpenXML,
".ppt": MimeTypeApplicationMsPowerpoint,
".pptx": MimeTypeApplicationPowerpointOpenXML,
".pdf": MimeTypeApplicationPDF,
}
MimeTypes maps file extensions to MIME types.
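A lookup against an extension-to-MIME map of this shape might look like the sketch below. It is illustrative only: `mimeTypes` is a small local stand-in for the package-level map, and `mimeTypeForPath` with its lowercase normalization and `text/plain` fallback is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeTypes is a local stand-in holding a few entries of the documented map.
var mimeTypes = map[string]string{
	".md":   "text/markdown",
	".go":   "text/x-go",
	".yaml": "text/yaml",
	".yml":  "text/yaml",
	".pdf":  "application/pdf",
}

// mimeTypeForPath is a hypothetical helper. It lowercases the file extension
// before the lookup and falls back to text/plain (an assumption) when the
// extension is unknown.
func mimeTypeForPath(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if mt, ok := mimeTypes[ext]; ok {
		return mt
	}
	return "text/plain"
}

func main() {
	fmt.Println(mimeTypeForPath("notes/README.MD"))
	fmt.Println(mimeTypeForPath("config.yml"))
	fmt.Println(mimeTypeForPath("archive.unknown"))
}
```

Because several extensions share one MIME type, a round trip through both maps need not return the original extension: ".yml" maps to text/yaml, which ExtMimeTypes maps back to ".yaml".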
Functions ¶
func CleanText ¶ added in v1.1.10
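CleanText applies all cleaning functions in the default order. Cleaning order: full-width to half-width conversion → noise characters → links → line numbers → watermarks → Traditional-to-Simplified Chinese conversion → privacy redaction → paragraph normalization → stop words → basic cleaning.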
func ParseMimeTypeFromText ¶ added in v1.1.10
ParseMimeTypeFromText infers the MIME type from the text content. Only plain-text formats are detected; it returns the most likely MIME type.
Types ¶
type BaseFormatter ¶ added in v1.1.10
type BaseFormatter struct{}
BaseFormatter provides common formatting methods.
func (*BaseFormatter) Format ¶ added in v1.1.10
func (f *BaseFormatter) Format(hit *Hit) string
func (*BaseFormatter) FormatAll ¶ added in v1.1.10
func (f *BaseFormatter) FormatAll(hits []Hit) string
type CacheStore ¶ added in v1.1.10
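type CacheStore interface {
// Get looks up the cached value for key and deserializes it into value.
// If the key does not exist, Get returns a nil error.
Get(key string, value any) error
// Set writes a cache entry; value is serialized as JSON.
Set(key string, value any) error
// Delete removes the given key.
Delete(key string) error
// Len returns the number of cache entries.
Len() int
// Flush forces dirty in-memory data to be written to disk.
Flush() error
// Close closes the cache and releases its resources.
Close() error
}
CacheStore is a general-purpose persistent cache interface. It provides key-value cache reads and writes, where a value is any JSON-serializable data. An implementation can be backed by any persistent store, such as bbolt, Redis, or BadgerDB.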
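To make the contract concrete, here is a minimal in-memory sketch with the same method set as CacheStore. The type `memCache` and the constructor `newMemCache` are hypothetical names; a real implementation would persist to disk, whereas this one keeps JSON bytes in a map, so Flush and Close are no-ops.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// memCache is a hypothetical in-memory store shaped like CacheStore.
type memCache struct {
	mu   sync.RWMutex
	data map[string][]byte // JSON-encoded values keyed by cache key
}

func newMemCache() *memCache {
	return &memCache{data: make(map[string][]byte)}
}

// Get decodes the stored JSON into value. A missing key leaves value
// untouched and returns a nil error.
func (c *memCache) Get(key string, value any) error {
	c.mu.RLock()
	raw, ok := c.data[key]
	c.mu.RUnlock()
	if !ok {
		return nil
	}
	return json.Unmarshal(raw, value)
}

// Set JSON-encodes value and stores it under key.
func (c *memCache) Set(key string, value any) error {
	raw, err := json.Marshal(value)
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.data[key] = raw
	c.mu.Unlock()
	return nil
}

// Delete removes key; deleting a missing key is not an error.
func (c *memCache) Delete(key string) error {
	c.mu.Lock()
	delete(c.data, key)
	c.mu.Unlock()
	return nil
}

// Len returns the number of stored entries.
func (c *memCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.data)
}

// Flush and Close are no-ops here because nothing is buffered for disk.
func (c *memCache) Flush() error { return nil }
func (c *memCache) Close() error { return nil }

func main() {
	c := newMemCache()
	_ = c.Set("answer", map[string]int{"n": 42})
	var out map[string]int
	_ = c.Get("answer", &out)
	fmt.Println(out["n"], c.Len())
}
```

Since a missing key and a stored zero value both yield a nil error in this sketch, callers that need to tell them apart would have to inspect the destination value themselves.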
type Chunk ¶
type Chunk struct {
ID string `json:"id"` // Unique chunk ID
ParentID string `json:"parent_id"` // Parent chunk or parent document ID (from RawDocument.Source)
DocID string `json:"doc_id"` // Original document ID (from RawDocument.ID)
MIMEType string `json:"mime_type"` // Content type
Content string `json:"content"` // Chunk content (cleaned plain text)
Metadata map[string]any `json:"metadata"` // Extended metadata (from RawDocument.Metadata)
ChunkMeta ChunkMeta `json:"chunk_meta"` // Fixed chunk metadata
}
Chunk is the final indexable unit. It is produced by a Chunker and carries all information from the parsing layer.
type ChunkIndexer ¶ added in v1.1.10
type ChunkIndexer interface {
Indexer
// IndexChunks indexes multiple pre-generated chunks in batch.
// This method is used by HybridIndexer to ensure all indexers use the same Chunk IDs.
//
// Parameters:
// - ctx: Context for cancellation
// - chunks: The chunks to index
//
// Returns:
// - error: An error if the operation fails
IndexChunks(ctx context.Context, chunks []*Chunk) error
}
ChunkIndexer is an optional interface for indexers that support batch chunk indexing. This interface is used by HybridIndexer to ensure data consistency across multiple indexers.
type ChunkInfo ¶ added in v1.1.10
type ChunkInfo struct {
ChunkID string `json:"chunk_id"` // Unique chunk ID
ParentID string `json:"parent_id"` // Parent chunk or parent document ID
Index int `json:"index"` // Chunk sequence number (0, 1, 2, ...)
Content string `json:"content"` // Chunk content
StartPos int `json:"start_pos"` // Start position in the original text
EndPos int `json:"end_pos"` // End position in the original text
Heading string `json:"heading"` // Heading (innermost level)
HeadingPath []string `json:"heading_path"` // Heading path
}
ChunkInfo describes a single chunk within a reconstructed document.
type ChunkMeta ¶ added in v1.1.10
type ChunkMeta struct {
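Index int `json:"index"` // Sequence number of the chunk within the document (0, 1, 2, ...)
StartPos int `json:"start_pos"` // Start position of the chunk in the cleaned original text
EndPos int `json:"end_pos"` // End position of the chunk in the cleaned original text
HeadingLevel int `json:"heading_level"` // Heading level of the chunk (from StructureNode)
HeadingPath []string `json:"heading_path"` // Heading path of the chunk (e.g. ["Chapter 1", "Section 1.1"])
}
ChunkMeta holds the fixed metadata of a Chunk: its position and heading-level information.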
type Chunker ¶ added in v1.1.10
type Chunker interface {
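// Chunk takes a structured document and uses its structural boundaries to produce a set of Chunks.
// GraphRAG flow: Document → Chunk → LLM Extractor → Node/Edge → GraphDB
Chunk(doc *StructuredDocument) ([]*Chunk, error)
// GetStrategy returns the chunking strategy type.
GetStrategy() ChunkStrategy
}
Chunker is the chunking interface. It takes the output of the parsing layer and produces the final indexable Chunks.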
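As an illustration of how Index, StartPos, and EndPos relate to the text, the sketch below splits a string into fixed-size pieces. It is not the package's chunking strategy: the local types `chunk` and `chunkMeta` and the function `splitFixed` are hypothetical, and treating positions as rune offsets with an exclusive end is an assumption.

```go
package main

import "fmt"

// chunkMeta mirrors only the positional fields of ChunkMeta.
type chunkMeta struct {
	Index    int
	StartPos int
	EndPos   int
}

// chunk is a reduced stand-in for Chunk.
type chunk struct {
	Content string
	Meta    chunkMeta
}

// splitFixed cuts text into pieces of at most size runes. Positions are rune
// offsets and EndPos is exclusive; both conventions are assumptions.
func splitFixed(text string, size int) []chunk {
	if size <= 0 {
		return nil
	}
	runes := []rune(text)
	var out []chunk
	for start := 0; start < len(runes); start += size {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		out = append(out, chunk{
			Content: string(runes[start:end]),
			Meta:    chunkMeta{Index: len(out), StartPos: start, EndPos: end},
		})
	}
	return out
}

func main() {
	for _, c := range splitFixed("abcdefgh", 3) {
		fmt.Printf("%d [%d,%d) %q\n", c.Meta.Index, c.Meta.StartPos, c.Meta.EndPos, c.Content)
	}
}
```

A structure-aware Chunker would additionally cut at heading and paragraph boundaries taken from the StructureNode tree instead of at fixed offsets.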
type Community ¶ added in v1.1.10
type Community struct {
ID string `json:"id"` // Unique identifier for the community
Level int `json:"level"` // Hierarchy level (0 = finest granularity)
NodeIDs []string `json:"node_ids"` // Node IDs in this community
EdgeIDs []string `json:"edge_ids"` // Edge IDs in this community
ParentID string `json:"parent_id,omitempty"` // Parent community ID (for hierarchy)
// LLM-generated summary
Summary string `json:"summary,omitempty"` // Community summary
Keywords []string `json:"keywords,omitempty"` // Key topics/concepts
// Source binding
SourceChunkIDs []string `json:"source_chunk_ids,omitempty"`
}
Community represents a detected community in the knowledge graph. Communities are hierarchical groups of related nodes, enabling global search.
type CommunityDetector ¶ added in v1.1.10
type CommunityDetector interface {
// Detect identifies communities in the graph and returns them hierarchically.
Detect(ctx context.Context, graphStore GraphStore) ([]*Community, error)
}
CommunityDetector defines the interface for community detection algorithms.
type CommunityMatch ¶ added in v1.1.10
type CommunityMatch struct {
CommunityID string `json:"community_id"`
Score float32 `json:"score"`
Summary string `json:"summary"`
Keywords []string `json:"keywords"`
}
CommunityMatch represents a matched community during global search.
type Document ¶ added in v1.1.10
type Document interface {
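GetID() string // Unique ID of the original document, used to link it with StructuredDocument, Entity, and Relation
GetContent() string // Plain-text content of the file (the core payload)
GetMimeType() string // Type of the file content
GetMeta() map[string]any // Basic file metadata (file name, size, modification time, owner, etc.)
GetImages() []Image // Attached content (for example, other attachments embedded in the file, such as images, video, or audio)
GetSource() string // File source (path/URL/URI)
GetExt() string // File extension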
}
type Edge ¶ added in v1.1.10
type Edge struct {
ID string `json:"id"` // Unique identifier for the edge
Type string `json:"type"` // Type of the edge (e.g., WORKS_FOR, LOCATED_IN, BELONGS_TO)
Source string `json:"source"` // Source node ID (subject entity)
Target string `json:"target"` // Target node ID (object entity)
Predicate string `json:"predicate,omitempty"` // Relationship type alias (e.g., "就职于", "属于")
// Properties stores extended features with standardized keys:
// - "confidence": float32 - extraction confidence (0~1 from LLM/rules)
// - "score": float32 - relationship strength score
// - "evidence": string - text evidence for the relationship
// - custom fields as needed
Properties map[string]any `json:"properties,omitempty"`
// Source binding - following Microsoft GraphRAG design
SourceChunkIDs []string `json:"source_chunk_ids,omitempty"` // IDs of source chunks
SourceDocIDs []string `json:"source_doc_ids,omitempty"` // IDs of source documents
}
Edge represents a graph edge entity in the RAG system. Edges represent relationships between entities and are also bound to source text. Unified relationship structure combining advantages from Relation design.
type Embedder ¶ added in v1.1.10
type Embedder interface {
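// Calc computes the vector representation of a Chunk.
Calc(chunk *Chunk) (*Vector, error)
// CalcText computes the vector representation of a text directly (used for queries).
CalcText(text string) (*Vector, error)
// CalcImage computes the vector representation of an image directly (used for queries).
CalcImage(data []byte) (*Vector, error)
// Bulk computes the vector representations of multiple Chunks in batch.
Bulk(chunks []*Chunk) ([]*Vector, error)
Dim() int
// Multimoding reports whether multimodal input is supported.
Multimoding() bool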
}
type Formatter ¶ added in v1.1.10
type Formatter interface {
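// Format formats a single Hit.
Format(hit *Hit) string
// FormatAll formats multiple Hits.
FormatAll(hits []Hit) string
// Write formats the Hits and writes them to the output stream.
Write(w io.Writer, hits []Hit) error
}
Formatter defines the interface for formatting search results.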
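BaseFormatter as listed exposes Format and FormatAll, so a complete Formatter also has to supply Write. The sketch below shows one way the three methods could fit together, with Write built on FormatAll. The `hit` struct is a reduced local stand-in, and `lineFormatter` with its output layout is purely hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// hit is a reduced stand-in for Hit.
type hit struct {
	ID      string
	Score   float32
	Content string
}

// lineFormatter is a hypothetical formatter that renders one hit per line.
type lineFormatter struct{}

// Format renders a single hit as "[id] score content".
func (lineFormatter) Format(h *hit) string {
	return fmt.Sprintf("[%s] %.2f %s", h.ID, h.Score, h.Content)
}

// FormatAll joins the formatted hits with newlines.
func (f lineFormatter) FormatAll(hits []hit) string {
	lines := make([]string, 0, len(hits))
	for i := range hits {
		lines = append(lines, f.Format(&hits[i]))
	}
	return strings.Join(lines, "\n")
}

// Write sends the formatted hits, followed by a newline, to w.
func (f lineFormatter) Write(w io.Writer, hits []hit) error {
	_, err := io.WriteString(w, f.FormatAll(hits)+"\n")
	return err
}

func main() {
	hits := []hit{{"c1", 0.5, "first"}, {"c2", 0.25, "second"}}
	f := lineFormatter{}
	if err := f.Write(os.Stdout, hits); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```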
type FullTextSearchResult ¶ added in v1.1.10
type FullTextSearchResult struct {
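ID string // Chunk ID
Score float64 // Relevance score (computed by the search engine)
DocID string // ID of the owning document
Content string // Matched text fragment
Metadata map[string]any // Extended metadata (from the original Chunk.Metadata)
ChunkMeta ChunkMeta // Fixed chunk metadata (from the original Chunk.ChunkMeta)
}
FullTextSearchResult is a full-text search result.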
type FullTextStore ¶ added in v1.1.10
type FullTextStore interface {
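// Index writes a chunk to the full-text index.
Index(chunk *Chunk) error
// Search runs a full-text search and returns the list of matching results.
Search(query string, topK int) ([]FullTextSearchResult, error)
// Delete removes the given chunk from the index.
Delete(chunkID string) error
}
FullTextStore is the full-text storage interface (backed by a search engine such as bleve).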
type GraphStore ¶ added in v1.1.10
type GraphStore interface {
// UpsertNodes inserts or updates entities (e.g., PERSON, ORGANIZATION)
UpsertNodes(ctx context.Context, nodes []*Node) error
// UpsertEdges inserts or updates relationships between entities
UpsertEdges(ctx context.Context, edges []*Edge) error
// GetNode retrieves a single node/entity by ID
GetNode(ctx context.Context, id string) (*Node, error)
// GetNeighbors fetches up to 'limit' connected edges and nodes starting from 'nodeID'
GetNeighbors(ctx context.Context, nodeID string, depth int, limit int) ([]*Node, []*Edge, error)
// DeleteNode removes a node by ID
DeleteNode(ctx context.Context, id string) error
// DeleteEdge removes an edge by ID
DeleteEdge(ctx context.Context, id string) error
// Query semantic graph structure. Implementations (Neo4j, Nebula) usually take Cypher/GQL.
Query(ctx context.Context, query string, params map[string]any) ([]map[string]any, error)
// GetNodesByChunkIDs retrieves all nodes associated with the given chunk IDs
// This is used in hybrid search to find entities related to semantic search results
GetNodesByChunkIDs(ctx context.Context, chunkIDs []string) ([]*Node, error)
// GetEdgesByChunkIDs retrieves all edges associated with the given chunk IDs
// This is used in hybrid search to find relationships related to semantic search results
GetEdgesByChunkIDs(ctx context.Context, chunkIDs []string) ([]*Edge, error)
// GetCommunitySummaries fetches hierarchical community abstracts, which are core to Microsoft's GraphRAG paper.
GetCommunitySummaries(ctx context.Context, level int) ([]map[string]any, error)
// GetMultiHopPaths performs multi-hop traversal from starting node IDs,
// optionally filtering by edge types. Returns discovered nodes and edges.
// depth controls how many hops to traverse (1 = direct neighbors only).
// limit caps the total number of results returned.
GetMultiHopPaths(ctx context.Context, nodeIDs []string, edgeTypes []string, depth int, limit int) ([]*Node, []*Edge, error)
// GetAllEdgeTypes returns all distinct edge types present in the graph.
// Useful for introspection and building UI filters.
GetAllEdgeTypes(ctx context.Context) ([]string, error)
// Close cleanly tears down Graph Store connections
Close(ctx context.Context) error
}
GraphStore defines the storage foundation for GraphRAG. It tracks Nodes (Entities), Edges (Relationships), and supports semantic property queries.
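The interplay of `depth` and `limit` in GetNeighbors can be sketched as a breadth-first traversal over an in-memory edge list. This is an assumption-laden illustration, not the behavior of any particular backend: the `edge` type and `neighbors` function are hypothetical, edges are followed in both directions, and `limit` is applied to the number of edges collected.

```go
package main

import "fmt"

// edge is a reduced stand-in for Edge.
type edge struct {
	ID     string
	Source string
	Target string
}

// neighbors walks breadth-first from start for at most depth hops and stops
// once limit edges have been collected. Following edges in both directions
// and capping by edge count are assumptions of this sketch.
func neighbors(edges []edge, start string, depth, limit int) ([]string, []edge) {
	seenNode := map[string]bool{start: true}
	seenEdge := map[string]bool{}
	var nodeIDs []string
	var found []edge
	frontier := []string{start}
	for hop := 0; hop < depth && len(frontier) > 0; hop++ {
		var next []string
		for _, id := range frontier {
			for _, e := range edges {
				if seenEdge[e.ID] || (e.Source != id && e.Target != id) {
					continue
				}
				if len(found) >= limit {
					return nodeIDs, found
				}
				seenEdge[e.ID] = true
				found = append(found, e)
				other := e.Target
				if other == id {
					other = e.Source
				}
				if !seenNode[other] {
					seenNode[other] = true
					nodeIDs = append(nodeIDs, other)
					next = append(next, other)
				}
			}
		}
		frontier = next
	}
	return nodeIDs, found
}

func main() {
	edges := []edge{{"e1", "a", "b"}, {"e2", "b", "c"}, {"e3", "c", "d"}}
	nodes, es := neighbors(edges, "a", 2, 10)
	fmt.Println(nodes, len(es))
}
```

A database-backed GraphStore would normally push this traversal into the query language (for example Cypher) instead of scanning edges in application code.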
type Hit ¶ added in v1.1.10
type Hit struct {
ID string `json:"id"` // Result ID
Score float32 `json:"score"` // Similarity score
Content string `json:"content"` // Result content
DocID string `json:"doc_id"` // Document ID
Metadata map[string]any `json:"metadata"` // Extended metadata (from the original Chunk.Metadata)
ChunkMeta ChunkMeta `json:"chunk_meta"` // Fixed chunk metadata (from the original Chunk.ChunkMeta)
}
Hit is the search result structure.
type Indexer ¶ added in v1.1.10
type Indexer interface {
// Name returns the name of the indexer.
//
// Returns:
// - string: The name of the indexer
Name() string
// Type returns the type of the indexer.
//
// Returns:
// - string: The type of the indexer
Type() string
// Add adds content to the index.
//
// Parameters:
// - ctx: Context for cancellation
// - content: The content to add to the index
//
// Returns:
// - []*Chunk: The chunks created from the content
// - error: An error if the operation fails
Add(ctx context.Context, content string) ([]*Chunk, error)
AddFile(ctx context.Context, filePath string) ([]*Chunk, error)
NewQuery(terms string) Query
// Search searches the index for the given query.
//
// Parameters:
// - ctx: Context for cancellation
// - query: The query to search for
//
// Returns:
// - []Hit: The search results
// - error: An error if the operation fails
Search(ctx context.Context, query Query) ([]Hit, error)
// Remove removes a chunk from the index.
//
// Parameters:
// - ctx: Context for cancellation
// - chunkID: The ID of the chunk to remove
//
// Returns:
// - error: An error if the operation fails
Remove(ctx context.Context, chunkID string) error
// IndexChunk indexes a pre-generated chunk.
// This method is used by HybridIndexer to ensure all indexers use the same Chunk IDs.
//
// Parameters:
// - ctx: Context for cancellation
// - chunk: The chunk to index
//
// Returns:
// - error: An error if the operation fails
IndexChunk(ctx context.Context, chunk *Chunk) error
}
Indexer defines the interface for indexers in the RAG system. Indexers are responsible for adding content to an index, searching the index, and removing content from the index.
type Loader ¶ added in v1.1.10
type Loader interface {
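// Load reads the file at the given path or URL and returns the raw document.
Load(path string) (Document, error)
// SupportTypes returns the list of supported file types.
SupportTypes() []string
}
Loader is the loader interface. It reads all kinds of files in a uniform way and outputs raw binary documents.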
type Node ¶ added in v1.1.10
type Node struct {
ID string `json:"id"` // Unique identifier for the node
Type string `json:"type"` // Type of the node (e.g., PERSON, ORGANIZATION, LOCATION, TECHNOLOGY)
Name string `json:"name"` // Entity name (cleaned text, e.g., "张三", "阿里巴巴")
// Properties stores extended features with standardized keys:
// - "confidence": float32 - extraction confidence (0~1 from LLM/rules)
// - "frequency": int - occurrence count across documents
// - "vectors": []float32 - semantic embedding vectors
// - "aliases": []string - alternative names
// - custom fields as needed
Properties map[string]any `json:"properties,omitempty"`
// Source binding - following Microsoft GraphRAG design: graph as index layer
SourceChunkIDs []string `json:"source_chunk_ids,omitempty"` // IDs of source chunks
SourceDocIDs []string `json:"source_doc_ids,omitempty"` // IDs of source documents
}
Node represents a graph node entity in the RAG system. In GraphRAG, nodes are derived from text chunks and serve as an index layer. Unified entity structure combining advantages from Entity design.
type Query ¶ added in v1.1.10
type Query interface {
// Raw returns the raw, unprocessed query string.
//
// Returns:
// - string: The raw query string
Raw() string
// Keywords returns the extracted keywords from the query.
// Extracted keywords can be used for exact BM25 lookups.
// Returns:
// - []string: The extracted keywords
Keywords() []string
// Filters returns the filters to apply to the search.
// Filter conditions extracted from the query can help improve the precision of semantic search.
// Returns:
// - map[string]any: The filters
Filters() map[string]any
// AddFilter adds a filter to the query.
AddFilter(key string, value any) Query
}
Query defines the interface for queries in the RAG system. Queries are used to search the index for relevant content.
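Because AddFilter returns a Query, filters can be chained fluently. The sketch below shows one possible implementation. The local `query` interface mirrors the documented method set, while `simpleQuery`, `newQuery`, and the whitespace-based keyword extraction are hypothetical simplifications.

```go
package main

import (
	"fmt"
	"strings"
)

// query mirrors the documented Query method set.
type query interface {
	Raw() string
	Keywords() []string
	Filters() map[string]any
	AddFilter(key string, value any) query
}

// simpleQuery is a hypothetical implementation backed by a string and a map.
type simpleQuery struct {
	raw     string
	filters map[string]any
}

func newQuery(terms string) query {
	return &simpleQuery{raw: terms, filters: map[string]any{}}
}

// Raw returns the unprocessed query string.
func (q *simpleQuery) Raw() string { return q.raw }

// Keywords lowercases the query and splits it on whitespace. A real
// implementation would likely use a tokenizer; this is a simplification.
func (q *simpleQuery) Keywords() []string {
	return strings.Fields(strings.ToLower(q.raw))
}

// Filters returns the accumulated filter map.
func (q *simpleQuery) Filters() map[string]any { return q.filters }

// AddFilter records the filter and returns the same query for chaining.
func (q *simpleQuery) AddFilter(key string, value any) query {
	q.filters[key] = value
	return q
}

func main() {
	q := newQuery("Vector Search").AddFilter("doc_id", "d1").AddFilter("lang", "en")
	fmt.Println(q.Raw(), q.Keywords(), len(q.Filters()))
}
```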
type ReconstructedDocument ¶ added in v1.1.10
type ReconstructedDocument struct {
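DocID string `json:"doc_id"` // Original document ID
Title string `json:"title"` // Document title (inferred from Metadata)
Chunks []ChunkInfo `json:"chunks"` // All chunks (sorted by index)
Content string `json:"content"` // Fully reconstructed document content
}
ReconstructedDocument is a complete document reassembled from fragments stored in the vector database.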
func ReconstructDocument ¶ added in v1.1.10
func ReconstructDocument(vectors []*Vector) *ReconstructedDocument
ReconstructDocument reassembles a complete document from vector fragments. The metadata of each vector must contain the doc_id, content, and chunk_meta fields.
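The core of such a reconstruction is ordering the fragments by chunk index and concatenating their content. The sketch below shows only that step on a reduced `chunkInfo` stand-in. The function `reconstruct` is hypothetical, and joining with a newline is an assumption, since the source does not state which separator the real function uses.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// chunkInfo is a reduced stand-in for ChunkInfo.
type chunkInfo struct {
	ChunkID string
	Index   int
	Content string
}

// reconstruct sorts a copy of the chunks by Index and joins their content.
// The newline separator is an assumption of this sketch.
func reconstruct(chunks []chunkInfo) string {
	ordered := append([]chunkInfo(nil), chunks...) // leave the input untouched
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].Index < ordered[j].Index
	})
	parts := make([]string, len(ordered))
	for i, c := range ordered {
		parts[i] = c.Content
	}
	return strings.Join(parts, "\n")
}

func main() {
	chunks := []chunkInfo{{"c3", 2, "third"}, {"c1", 0, "first"}, {"c2", 1, "second"}}
	fmt.Println(reconstruct(chunks))
}
```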
type SearchMode ¶ added in v1.1.10
type SearchMode string
SearchMode defines the search strategy for GraphRAG retrieval.
const (
// SearchModeLocal uses graph traversal from extracted entities.
// Best for: specific questions about entities and their relationships.
SearchModeLocal SearchMode = "local"
// SearchModeGlobal uses community summaries for macro-level queries.
// Best for: "What are the main themes?" type questions.
SearchModeGlobal SearchMode = "global"
// SearchModeHybrid combines local and global search with vector search.
// Best for: complex queries needing both specific facts and context.
SearchModeHybrid SearchMode = "hybrid"
)
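Since SearchMode is a string type, user input usually has to be validated before it is used to pick a retrieval path. A minimal sketch follows, with a local copy of the type and constants. The function `parseSearchMode` and its case and whitespace normalization are hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// searchMode and its constants are local copies of the documented values.
type searchMode string

const (
	searchModeLocal  searchMode = "local"
	searchModeGlobal searchMode = "global"
	searchModeHybrid searchMode = "hybrid"
)

// parseSearchMode is a hypothetical helper that trims and lowercases the
// input and rejects anything that is not one of the three known modes.
func parseSearchMode(s string) (searchMode, error) {
	m := searchMode(strings.ToLower(strings.TrimSpace(s)))
	switch m {
	case searchModeLocal, searchModeGlobal, searchModeHybrid:
		return m, nil
	default:
		return "", fmt.Errorf("unknown search mode %q", s)
	}
}

func main() {
	for _, in := range []string{"local", " Hybrid ", "fuzzy"} {
		m, err := parseSearchMode(in)
		fmt.Println(m, err)
	}
}
```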
type StructureNode ¶ added in v1.1.10
type StructureNode struct {
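NodeType string `json:"node_type"` // Node type (heading/paragraph/table/list, etc.)
Title string `json:"title"` // Node title (valid only for heading nodes)
Level int `json:"level"` // Heading level (valid only for heading nodes; H1=1, H2=2, ...)
Text string `json:"text"` // Cleaned plain-text content (the core payload, free of formatting debris)
StartPos int `json:"start_pos"` // Start position of the text in the cleaned original content (used to locate chunks)
EndPos int `json:"end_pos"` // End position of the text in the cleaned original content (used to locate chunks)
Children []*StructureNode `json:"children"` // Child nodes (e.g. H2 under H1, lists under a paragraph)
}
StructureNode is a node in the document structure. It corresponds to a unit of the document such as a heading, paragraph, list, or table.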
func (*StructureNode) Clean ¶ added in v1.1.10
func (n *StructureNode) Clean()
Clean cleans the Text and Title fields of the current node and recursively cleans all child nodes.
func (*StructureNode) ID ¶ added in v1.1.10
func (n *StructureNode) ID() string
type StructuredDocument ¶ added in v1.1.10
type StructuredDocument struct {
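RawDoc Document `json:"raw_doc"` // Original document object
Title string `json:"title"` // Overall document title (cleaned)
Root *StructureNode `json:"root"` // Root node of the document structure (top-level node)
}
StructuredDocument is a structured document that presents the hierarchy of the whole document as a tree.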
func (*StructuredDocument) ID ¶ added in v1.1.10
func (s *StructuredDocument) ID() string
func (*StructuredDocument) Meta ¶ added in v1.1.10
func (s *StructuredDocument) Meta() map[string]any
func (*StructuredDocument) SetValue ¶ added in v1.1.10
func (s *StructuredDocument) SetValue(key string, value any) *StructuredDocument
type Structurizer ¶ added in v1.1.10
type Structurizer interface {
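// Parse takes a raw document, cleans it, then structures it, and outputs a structured document.
// Internal flow: RawDocument (dirty) → data cleaning → structural parsing → StructuredDocument (clean)
Parse(raw Document) (*StructuredDocument, error)
}
Structurizer is the structuring interface. It is responsible for data cleaning and for parsing the document structure.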
type Vector ¶ added in v1.1.10
type Vector struct {
ID string `json:"id"` // Unique identifier for the vector
Values []float32 `json:"values"` // The vector values
ChunkID string `json:"chunk_id"` // ID of the corresponding chunk
Metadata map[string]any `json:"metadata"` // Additional metadata about the vector
}
Vector represents a vector entity in the RAG system. It contains the vector representation of a document chunk.
type VectorStore ¶ added in v1.1.10
type VectorStore interface {
// Upsert inserts or updates vectors in the store.
// If a vector with the same ID exists, it will be updated; otherwise, it will be inserted.
//
// Parameters:
// - ctx: Context for cancellation and timeout
// - vectors: Slice of vectors to insert or update
//
// Returns:
// - error: Any error that occurred during the operation
Upsert(ctx context.Context, vectors []*Vector) error
// Search performs a similarity search to find the most similar vectors.
// It returns the topK most similar vectors along with their similarity scores.
//
// Parameters:
// - ctx: Context for cancellation and timeout
// - query: The query vector to search for
// - topK: Maximum number of results to return
// - filters: Optional metadata filters to apply
//
// Returns:
// - []*Vector: The most similar vectors found
// - []float32: Similarity scores for each result
// - error: Any error that occurred during search
Search(ctx context.Context, query []float32, topK int, filters map[string]any) ([]*Vector, []float32, error)
// Delete removes a vector from the store by its ID.
//
// Parameters:
// - ctx: Context for cancellation and timeout
// - id: The unique identifier of the vector to delete
//
// Returns:
// - error: Any error that occurred during deletion
Delete(ctx context.Context, id string) error
// GetByDocID retrieves all vectors belonging to the same document by doc_id.
// This enables "knowledge traceability" — reconstructing the original document
// from individual chunks stored in the vector store.
//
// Parameters:
// - ctx: Context for cancellation and timeout
// - docID: The document ID to search for (from Chunk.DocID)
//
// Returns:
// - []*Vector: All vectors belonging to the document (sorted by chunk index)
// - error: Any error that occurred during retrieval
//
// Example usage:
//
// vectors, err := store.GetByDocID(ctx, docID)
// if err != nil { ... }
// doc := ReconstructDocument(vectors)
GetByDocID(ctx context.Context, docID string) ([]*Vector, error)
// Close gracefully shuts down the vector store connection.
// It should release all resources and close any open connections.
//
// Parameters:
// - ctx: Context for cancellation and timeout
//
// Returns:
// - error: Any error that occurred during shutdown
Close(ctx context.Context) error
}
VectorStore defines the interface for vector storage and similarity search. It provides methods for storing embedding vectors and performing efficient nearest neighbor searches. Implementations can use various vector databases like Milvus, Pinecone, Qdrant, Weaviate, or in-memory stores.
Key responsibilities:
- Store and update vector embeddings with associated metadata
- Perform similarity searches using cosine distance or other metrics
- Support metadata filtering during searches
- Manage the lifecycle of stored vectors
Example usage:
store := NewMilvusVectorStore()
err := store.Upsert(ctx, vectors)
if err != nil {
log.Fatal(err)
}
results, scores, err := store.Search(ctx, queryVector, 10, filters)
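For intuition about what Search returns, the sketch below ranks an in-memory slice of vectors by cosine similarity and keeps the top K. It is a brute-force illustration under stated assumptions: `vector` is a reduced stand-in for Vector, `cosine` and `searchTopK` are hypothetical helpers, and metadata filtering is left out.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// vector is a reduced stand-in for Vector.
type vector struct {
	ID     string
	Values []float32
}

// cosine returns the cosine similarity of a and b, or 0 when the lengths
// differ or either vector has zero norm.
func cosine(a, b []float32) float32 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// searchTopK scores every stored vector against the query and returns the
// topK best matches with their scores, highest score first.
func searchTopK(store []*vector, query []float32, topK int) ([]*vector, []float32) {
	type scored struct {
		v *vector
		s float32
	}
	all := make([]scored, 0, len(store))
	for _, v := range store {
		all = append(all, scored{v, cosine(query, v.Values)})
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].s > all[j].s })
	if topK < 0 {
		topK = 0
	}
	if topK > len(all) {
		topK = len(all)
	}
	vs := make([]*vector, topK)
	ss := make([]float32, topK)
	for i := 0; i < topK; i++ {
		vs[i], ss[i] = all[i].v, all[i].s
	}
	return vs, ss
}

func main() {
	store := []*vector{
		{"a", []float32{1, 0}},
		{"b", []float32{0, 1}},
		{"c", []float32{1, 1}},
	}
	vs, ss := searchTopK(store, []float32{1, 0}, 2)
	for i := range vs {
		fmt.Printf("%s %.3f\n", vs[i].ID, ss[i])
	}
}
```

Production vector databases replace the full scan with an approximate nearest-neighbor index, but the shape of the result, parallel slices of vectors and scores, is the same as in the interface above.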