Documentation ¶
Overview ¶
Package chunker splits source files into semantically coherent chunks using tree-sitter ASTs.
The cAST algorithm recursively traverses a file's AST. Nodes whose byte span exceeds [Config.MaxChunkSize] are split; adjacent small nodes are merged until [Config.MinChunkSize] is reached. The result is a []Chunk where each element is a coherent source unit ready for embedding.
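The split-then-merge idea described above can be sketched without tree-sitter. This is a minimal illustration, not the package's implementation: `Node`, `Span`, `split`, `merge`, and `chunkSpans` are hypothetical names, and the greedy merge policy (join a neighbor while the current span is below the minimum and the result stays within the maximum) is an assumption.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a tree-sitter node: a byte span plus children.
type Node struct {
	Start, End int // byte offsets, half-open [Start, End)
	Children   []*Node
}

// Span is a half-open byte range in the source file.
type Span struct{ Start, End int }

// Len returns the number of bytes covered by the span.
func (s Span) Len() int { return s.End - s.Start }

// split descends into nodes larger than maxSize. Nodes that fit, and
// oversized leaves that cannot be divided further, are emitted whole.
func split(n *Node, maxSize int) []Span {
	if n.End-n.Start <= maxSize || len(n.Children) == 0 {
		return []Span{{n.Start, n.End}}
	}
	var out []Span
	for _, c := range n.Children {
		out = append(out, split(c, maxSize)...)
	}
	return out
}

// merge joins a span onto its predecessor while the predecessor is still
// below minSize and the joined span would not exceed maxSize (assumed policy).
func merge(spans []Span, minSize, maxSize int) []Span {
	var out []Span
	for _, s := range spans {
		if n := len(out); n > 0 && out[n-1].Len() < minSize && s.End-out[n-1].Start <= maxSize {
			out[n-1].End = s.End
			continue
		}
		out = append(out, s)
	}
	return out
}

// chunkSpans runs both phases: split oversized nodes, then merge small neighbors.
func chunkSpans(root *Node, minSize, maxSize int) []Span {
	return merge(split(root, maxSize), minSize, maxSize)
}

func main() {
	root := &Node{Start: 0, End: 100, Children: []*Node{
		{Start: 0, End: 10},
		{Start: 10, End: 20},
		{Start: 20, End: 90, Children: []*Node{{Start: 20, End: 50}, {Start: 50, End: 90}}},
		{Start: 90, End: 100},
	}}
	fmt.Println(chunkSpans(root, 15, 40)) // [{0 20} {20 50} {50 90} {90 100}]
}
```

In this toy tree the two 10-byte siblings are merged into one span, the 70-byte node is split into its children, and the trailing 10-byte node stays on its own because it has no later neighbor to absorb.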
Chunker is constructed once via NewChunker and is safe to reuse across files. Language configurations are injected at construction time; see the langs package for the full set of built-in languages.
Constants ¶
This section is empty.
Variables ¶
var (
	// ErrUnsupportedLanguage is returned by ChunkFile when no LanguageConfig
	// matches the file's extension.
	ErrUnsupportedLanguage = errors.New("unsupported language")

	// ErrParseFailed is returned by ChunkFile when tree-sitter cannot produce
	// a valid syntax tree for the given source.
	ErrParseFailed = errors.New("tree-sitter parse failed")
)
Functions ¶
This section is empty.
Types ¶
type Chunk ¶
type Chunk struct {
	Content  string   // raw source text of this chunk
	FilePath string   // absolute path to the source file
	Language string   // language name as reported by LanguageConfig.Name
	NodeKind string   // tree-sitter node kind, e.g. "function_declaration", "class_definition"
	Name     string   // symbol name from the NameField, e.g. "MyFunc"; empty if not extractable
	Parent   string   // containing symbol for nested nodes (e.g. class name for a method); empty for top-level
	Start    Position // start of the chunk in the source file (inclusive)
	End      Position // end of the chunk in the source file (inclusive)
}
Chunk is one semantically coherent unit of source code, ready for embedding.
type Chunker ¶
type Chunker struct {
	// contains filtered or unexported fields
}
Chunker parses and chunks source files. Construct once via NewChunker, reuse across files.
func NewChunker ¶
func NewChunker(langs []LanguageConfig, cfg Config) (*Chunker, error)
NewChunker constructs a Chunker from the provided language configs and size thresholds. Returns an error if langs is empty, any Grammar pointer is nil, or two configs claim the same file extension. The returned Chunker is safe to reuse across files and goroutines.
Example ¶
package main

import (
	"fmt"

	"github.com/ieshan/codamigo/chunker"
	"github.com/ieshan/codamigo/langs"
)

func main() {
	c, err := chunker.NewChunker(langs.AllLanguages(), chunker.DefaultConfig())
	if err != nil {
		panic(err)
	}
	_ = c
	fmt.Println("chunker ready")
}
Output: chunker ready
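The three documented failure cases of NewChunker (empty langs, a nil Grammar, two configs claiming the same extension) can be pictured as a single validation pass that also builds an extension lookup table. The following is a sketch under stated assumptions: `langSpec`, `validate`, the `HasGrammar` flag, and the error messages are all hypothetical and are not taken from the package.

```go
package main

import (
	"errors"
	"fmt"
)

// langSpec is a hypothetical, simplified stand-in for chunker.LanguageConfig.
type langSpec struct {
	Name       string
	Extensions []string
	HasGrammar bool // stands in for a non-nil Grammar pointer
}

// validate checks the three documented error conditions and, on success,
// returns an extension -> language name table.
func validate(langs []langSpec) (map[string]string, error) {
	if len(langs) == 0 {
		return nil, errors.New("no language configs provided")
	}
	byExt := make(map[string]string)
	for _, l := range langs {
		if !l.HasGrammar {
			return nil, fmt.Errorf("language %q: nil grammar", l.Name)
		}
		for _, ext := range l.Extensions {
			if prev, ok := byExt[ext]; ok {
				return nil, fmt.Errorf("extension %q claimed by both %q and %q", ext, prev, l.Name)
			}
			byExt[ext] = l.Name
		}
	}
	return byExt, nil
}

func main() {
	table, err := validate([]langSpec{
		{Name: "go", Extensions: []string{".go"}, HasGrammar: true},
		{Name: "python", Extensions: []string{".py", ".pyw"}, HasGrammar: true},
	})
	fmt.Println(len(table), err)

	_, err = validate([]langSpec{
		{Name: "c", Extensions: []string{".h"}, HasGrammar: true},
		{Name: "cpp", Extensions: []string{".h"}, HasGrammar: true},
	})
	fmt.Println(err != nil)
}
```

Rejecting duplicate extensions at construction time means a later lookup by file extension can never be ambiguous.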
func (*Chunker) ChunkFile ¶
ChunkFile parses src as the language inferred from filePath's extension and returns semantically coherent chunks ready for embedding. Returns ErrUnsupportedLanguage when no LanguageConfig matches the extension, and ErrParseFailed when tree-sitter cannot parse the source. Empty src returns nil, nil.
Example ¶
package main

import (
	"fmt"

	"github.com/ieshan/codamigo/chunker"
	"github.com/ieshan/codamigo/langs"
)

func main() {
	c, err := chunker.NewChunker(
		[]chunker.LanguageConfig{langs.GoLanguage()},
		chunker.DefaultConfig(),
	)
	if err != nil {
		panic(err)
	}

	src := `package main
func Hello() string {
return "hello"
}
`
	chunks, err := c.ChunkFile("main.go", []byte(src))
	if err != nil {
		panic(err)
	}
	for _, ch := range chunks {
		fmt.Printf("kind=%s name=%s\n", ch.NodeKind, ch.Name)
	}
}
Output: kind=function_declaration name=Hello
type Config ¶
type Config struct {
	MaxChunkSize int // split when a node exceeds this; default 1500
	MinChunkSize int // merge siblings until this is reached; default 50
}
Config controls the cAST algorithm size thresholds (in bytes).
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns production-tuned defaults: a MaxChunkSize of 1500 bytes (~500 tokens at ~3 chars/token) and a MinChunkSize of 50 bytes.
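The token figure follows from simple division of the byte budget by the assumed characters per token. A small worked version, where `approxTokens` is a hypothetical helper and the 3 chars/token ratio is the heuristic quoted above rather than a tokenizer measurement (it also treats bytes and characters as equal, which holds for ASCII source):

```go
package main

import "fmt"

// approxTokens estimates a token count from a byte budget using integer division.
// charsPerToken is a heuristic, not a property of any particular tokenizer.
func approxTokens(sizeBytes, charsPerToken int) int {
	if charsPerToken <= 0 {
		return 0
	}
	return sizeBytes / charsPerToken
}

func main() {
	fmt.Println(approxTokens(1500, 3)) // 500
	fmt.Println(approxTokens(50, 3))   // 16
}
```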
type InjectionRule ¶
type InjectionRule struct {
	ContainerKind string                    // node kind wrapping embedded content, e.g. "script_element"
	ContentKind   string                    // node kind of the raw content child, e.g. "raw_text"
	LangAttr      string                    // start_tag attribute that selects the grammar, e.g. "lang"
	DefaultLang   string                    // grammar key when LangAttr is absent, e.g. "javascript"
	Grammars      map[string]LanguageConfig // lang attribute value -> LanguageConfig
}
InjectionRule instructs the Chunker to re-parse a container node's embedded content using a secondary grammar, replacing the section-level atom with semantically richer atoms from the injected parse.
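The grammar choice implied by LangAttr, DefaultLang, and Grammars can be sketched as a map lookup with a fallback key. This is an illustration only: `rule` and `pickGrammar` are hypothetical, grammars are represented by plain names instead of LanguageConfig values, and returning false for an unknown lang value (so the caller could keep the container as a single chunk) is an assumption the source does not state.

```go
package main

import "fmt"

// rule is a hypothetical, simplified stand-in for chunker.InjectionRule.
type rule struct {
	LangAttr    string
	DefaultLang string
	Grammars    map[string]string // lang attribute value -> grammar name
}

// pickGrammar selects the grammar for a container node given the attributes
// of its start tag. A missing or empty LangAttr falls back to DefaultLang.
func pickGrammar(r rule, attrs map[string]string) (string, bool) {
	key := r.DefaultLang
	if v, ok := attrs[r.LangAttr]; ok && v != "" {
		key = v
	}
	g, ok := r.Grammars[key]
	return g, ok
}

func main() {
	script := rule{
		LangAttr:    "lang",
		DefaultLang: "javascript",
		Grammars:    map[string]string{"javascript": "javascript", "ts": "typescript"},
	}
	fmt.Println(pickGrammar(script, map[string]string{}))               // javascript true
	fmt.Println(pickGrammar(script, map[string]string{"lang": "ts"}))   // typescript true
	fmt.Println(pickGrammar(script, map[string]string{"lang": "elm"})) //  false
}
```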
type LanguageConfig ¶
type LanguageConfig struct {
	Name       string           // canonical language name, e.g. "go", "python"
	Extensions []string         // file extensions including dot, e.g. [".go"], [".py", ".pyw"]
	Grammar    *sitter.Language // tree-sitter grammar; obtain via sitter.NewLanguage(binding.Language())
	NodeKinds  NodeKindSet
	Injections []InjectionRule // optional: re-parse embedded content with secondary grammars; nil means no injection
}
LanguageConfig fully describes one language the Chunker can handle. Construct these in the langs/ package and pass them to NewChunker.
type NodeKindSet ¶
type NodeKindSet struct {
	TopLevel  []string // node types that become top-level chunk boundaries
	Nested    []string // node types nested inside top-level nodes (e.g. methods in a class)
	NameField string   // tree-sitter field name for the symbol identifier, e.g. "name"; empty means skip name extraction
}
NodeKindSet describes which AST node types are semantic boundaries and how to extract symbol names from them.
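One way to read the TopLevel and Nested lists is as two membership tests, chosen by whether the traversal is already inside a top-level boundary. The sketch below assumes exactly that interpretation. `nodeKindSet` mirrors the documented fields, while `isBoundary`, `contains`, and the Python-flavored configuration are hypothetical and are not the values shipped in the langs package.

```go
package main

import "fmt"

// nodeKindSet mirrors the documented fields of chunker.NodeKindSet.
type nodeKindSet struct {
	TopLevel  []string
	Nested    []string
	NameField string
}

// contains reports whether list holds s.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// isBoundary reports whether a node of the given kind starts a new chunk.
// insideTopLevel says whether the node sits inside a top-level boundary node.
func (s nodeKindSet) isBoundary(kind string, insideTopLevel bool) bool {
	if insideTopLevel {
		return contains(s.Nested, kind)
	}
	return contains(s.TopLevel, kind)
}

func main() {
	// Illustrative configuration using tree-sitter-python node kinds.
	py := nodeKindSet{
		TopLevel:  []string{"function_definition", "class_definition"},
		Nested:    []string{"function_definition"},
		NameField: "name",
	}
	fmt.Println(py.isBoundary("class_definition", false))   // true
	fmt.Println(py.isBoundary("function_definition", true)) // true
	fmt.Println(py.isBoundary("class_definition", true))    // false
	fmt.Println(py.isBoundary("import_statement", false))   // false
}
```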