chunker

package
v1.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: May 14, 2026 License: MPL-2.0 Imports: 6 Imported by: 0

Documentation

Overview

Package chunker splits source files into semantically coherent chunks using tree-sitter ASTs.

The cAST algorithm recursively traverses a file's AST. A node whose byte span exceeds Config.MaxChunkSize is split into its children; adjacent small nodes are merged until their combined span reaches Config.MinChunkSize. The result is a []Chunk in which each element is a coherent source unit ready for embedding.

Chunker is constructed once via NewChunker and is safe to reuse across files. Language configurations are injected at construction time; see the langs package for the full set of built-in languages.
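
The split-and-merge pass described above can be sketched on a toy tree. This is a minimal illustration, not the package's implementation: the node and span types, the chunkNodes function, and the exact flush rules are assumptions made for the example, and no tree-sitter grammar is involved.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a tree-sitter node: a byte span plus children.
type node struct {
	start, end int
	children   []*node
}

func (n *node) size() int { return n.end - n.start }

// span is a half-open byte range [start, end) emitted as one chunk.
type span struct{ start, end int }

// chunkNodes walks siblings left to right. A node larger than maxSize that has
// children is split by recursing into them. Other nodes are accumulated into a
// pending span, which is emitted once it reaches minSize or when adding the
// next node would push it past maxSize. An oversized leaf is emitted as-is.
func chunkNodes(nodes []*node, maxSize, minSize int) []span {
	var out []span
	pending := span{-1, -1}
	flush := func() {
		if pending.start >= 0 {
			out = append(out, pending)
			pending = span{-1, -1}
		}
	}
	for _, n := range nodes {
		if n.size() > maxSize && len(n.children) > 0 {
			flush()
			out = append(out, chunkNodes(n.children, maxSize, minSize)...)
			continue
		}
		if pending.start >= 0 && n.end-pending.start > maxSize {
			flush()
		}
		if pending.start < 0 {
			pending.start = n.start
		}
		pending.end = n.end
		if pending.end-pending.start >= minSize {
			flush()
		}
	}
	flush()
	return out
}

func main() {
	tree := []*node{
		{start: 0, end: 20},  // small: merged with the next sibling
		{start: 20, end: 40}, // small
		{start: 40, end: 300, children: []*node{ // too large: split into children
			{start: 40, end: 160},
			{start: 160, end: 300},
		}},
	}
	fmt.Println(chunkNodes(tree, 200, 50)) // [{0 40} {40 160} {160 300}]
}
```

In this sketch the two 20-byte nodes never reach the minimum on their own, so they are emitted together when the oversized sibling forces a flush.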

Constants

This section is empty.

Variables

var (
	// ErrUnsupportedLanguage is returned by ChunkFile when no LanguageConfig
	// matches the file's extension.
	ErrUnsupportedLanguage = errors.New("unsupported language")

	// ErrParseFailed is returned by ChunkFile when tree-sitter cannot produce
	// a valid syntax tree for the given source.
	ErrParseFailed = errors.New("tree-sitter parse failed")
)

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	Content  string   // raw source text of this chunk
	FilePath string   // absolute path to the source file
	Language string   // language name as reported by LanguageConfig.Name
	NodeKind string   // tree-sitter node kind, e.g. "function_declaration", "class_definition"
	Name     string   // symbol name from the NameField, e.g. "MyFunc"; empty if not extractable
	Parent   string   // containing symbol for nested nodes (e.g. class name for a method); empty for top-level
	Start    Position // start of the chunk in the source file (inclusive)
	End      Position // end of the chunk in the source file (inclusive)
}

Chunk is one semantically coherent unit of source code, ready for embedding.

type Chunker

type Chunker struct {
	// contains filtered or unexported fields
}

Chunker parses and chunks source files. Construct once via NewChunker, reuse across files.

func NewChunker

func NewChunker(langs []LanguageConfig, cfg Config) (*Chunker, error)

NewChunker constructs a Chunker from the provided language configs and size thresholds. It returns an error if langs is empty, if any Grammar pointer is nil, or if two configs claim the same file extension. The returned Chunker is safe to reuse across files and goroutines.

Example
package main

import (
	"fmt"

	"github.com/ieshan/codamigo/chunker"
	"github.com/ieshan/codamigo/langs"
)

func main() {
	c, err := chunker.NewChunker(langs.AllLanguages(), chunker.DefaultConfig())
	if err != nil {
		panic(err)
	}
	_ = c
	fmt.Println("chunker ready")
}
Output:
chunker ready
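
The three validation rules stated for NewChunker (non-empty langs, non-nil Grammar, no extension claimed twice) could be checked along the following lines. This is a sketch under assumptions: langConfig and grammar are local stand-ins for LanguageConfig and *sitter.Language, and the error messages are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// grammar is a hypothetical placeholder for a tree-sitter grammar.
type grammar struct{}

// langConfig mirrors only the LanguageConfig fields the checks need.
type langConfig struct {
	Name       string
	Extensions []string
	Grammar    *grammar
}

// validate applies the three documented rules and reports the first violation.
func validate(langs []langConfig) error {
	if len(langs) == 0 {
		return errors.New("no language configs provided")
	}
	owner := make(map[string]string) // extension -> language that claimed it
	for _, l := range langs {
		if l.Grammar == nil {
			return fmt.Errorf("language %q: nil grammar", l.Name)
		}
		for _, ext := range l.Extensions {
			if prev, ok := owner[ext]; ok {
				return fmt.Errorf("extension %q claimed by both %q and %q", ext, prev, l.Name)
			}
			owner[ext] = l.Name
		}
	}
	return nil
}

func main() {
	g := &grammar{}
	ok := []langConfig{
		{Name: "go", Extensions: []string{".go"}, Grammar: g},
		{Name: "python", Extensions: []string{".py", ".pyw"}, Grammar: g},
	}
	fmt.Println(validate(ok) == nil) // true

	dup := append(ok, langConfig{Name: "golang", Extensions: []string{".go"}, Grammar: g})
	fmt.Println(validate(dup) != nil) // true: ".go" is claimed twice
}
```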

func (*Chunker) ChunkFile

func (c *Chunker) ChunkFile(filePath string, src []byte) ([]Chunk, error)

ChunkFile parses src as the language inferred from filePath's extension and returns semantically coherent chunks ready for embedding. It returns ErrUnsupportedLanguage when no LanguageConfig matches the extension, and ErrParseFailed when tree-sitter cannot parse the source. An empty src yields a nil slice and a nil error.

Example
package main

import (
	"fmt"

	"github.com/ieshan/codamigo/chunker"
	"github.com/ieshan/codamigo/langs"
)

func main() {
	c, err := chunker.NewChunker(
		[]chunker.LanguageConfig{langs.GoLanguage()},
		chunker.DefaultConfig(),
	)
	if err != nil {
		panic(err)
	}

	src := `package main

func Hello() string {
	return "hello"
}
`
	chunks, err := c.ChunkFile("main.go", []byte(src))
	if err != nil {
		panic(err)
	}
	for _, ch := range chunks {
		fmt.Printf("kind=%s name=%s\n", ch.NodeKind, ch.Name)
	}
}
Output:
kind=function_declaration name=Hello

type Config

type Config struct {
	MaxChunkSize int // split when a node exceeds this; default 1500
	MinChunkSize int // merge siblings until this is reached; default 50
}

Config controls the cAST algorithm size thresholds (in bytes).

func DefaultConfig

func DefaultConfig() Config

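DefaultConfig returns the default thresholds: a MaxChunkSize of 1500 bytes and a MinChunkSize of 50 bytes. The maximum corresponds to roughly 500 tokens at about 3 characters per token.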
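
The byte-to-token estimate behind the defaults is simple integer arithmetic. The sketch below assumes mostly ASCII source, so that bytes and characters are roughly equal; approxTokens and charsPerToken are illustrative names, not part of the package.

```go
package main

import "fmt"

// charsPerToken is the rough ratio quoted in the DefaultConfig documentation.
const charsPerToken = 3

// approxTokens estimates a token count from a byte count, rounding down.
func approxTokens(bytes int) int {
	return bytes / charsPerToken
}

func main() {
	fmt.Println(approxTokens(1500)) // 500: the default MaxChunkSize
	fmt.Println(approxTokens(50))   // 16: the default MinChunkSize
}
```

When targeting an embedding model with a different context budget, the same ratio can be inverted: a budget of N tokens suggests a MaxChunkSize of about 3*N bytes.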

type InjectionRule

type InjectionRule struct {
	ContainerKind string                    // node kind wrapping embedded content, e.g. "script_element"
	ContentKind   string                    // node kind of the raw content child, e.g. "raw_text"
	LangAttr      string                    // start_tag attribute that selects the grammar, e.g. "lang"
	DefaultLang   string                    // grammar key when LangAttr is absent, e.g. "javascript"
	Grammars      map[string]LanguageConfig // lang attribute value -> LanguageConfig
}

InjectionRule instructs the Chunker to re-parse a container node's embedded content using a secondary grammar, replacing the section-level atom with semantically richer atoms from the injected parse.
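
How LangAttr, DefaultLang, and Grammars might interact when choosing a secondary grammar can be sketched as follows. The rule type is a reduced stand-in that maps to language names rather than LanguageConfig values, pickGrammar is a hypothetical helper, and treating an unknown attribute value as "no injection" is an assumption rather than documented behaviour.

```go
package main

import "fmt"

// rule mirrors the selection-related fields of InjectionRule.
type rule struct {
	LangAttr    string            // attribute that selects the grammar, e.g. "lang"
	DefaultLang string            // key used when the attribute is absent
	Grammars    map[string]string // attribute value -> language name (stand-in)
}

// pickGrammar returns the grammar for a container's start-tag attributes.
// The boolean is false when no grammar is registered for the resolved key.
func pickGrammar(r rule, attrs map[string]string) (string, bool) {
	key := r.DefaultLang
	if v, ok := attrs[r.LangAttr]; ok && v != "" {
		key = v
	}
	g, ok := r.Grammars[key]
	return g, ok
}

func main() {
	script := rule{
		LangAttr:    "lang",
		DefaultLang: "javascript",
		Grammars:    map[string]string{"javascript": "javascript", "ts": "typescript"},
	}
	fmt.Println(pickGrammar(script, map[string]string{"lang": "ts"}))     // typescript true
	fmt.Println(pickGrammar(script, map[string]string{}))                 // javascript true
	fmt.Println(pickGrammar(script, map[string]string{"lang": "coffee"})) //  false
}
```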

type LanguageConfig

type LanguageConfig struct {
	Name       string           // canonical language name, e.g. "go", "python"
	Extensions []string         // file extensions including dot, e.g. [".go"], [".py", ".pyw"]
	Grammar    *sitter.Language // tree-sitter grammar; obtain via sitter.NewLanguage(binding.Language())
	NodeKinds  NodeKindSet
	Injections []InjectionRule // optional: re-parse embedded content with secondary grammars; nil means no injection
}

LanguageConfig fully describes one language the Chunker can handle. Construct these in the langs/ package and pass them to NewChunker.

type NodeKindSet

type NodeKindSet struct {
	TopLevel  []string // node types that become top-level chunk boundaries
	Nested    []string // node types nested inside top-level nodes (e.g. methods in a class)
	NameField string   // tree-sitter field name for the symbol identifier, e.g. "name"; empty means skip name extraction
}

NodeKindSet describes which AST node types are semantic boundaries and how to extract symbol names from them.
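
As an illustration, a NodeKindSet for Python might look like the value below. The node kinds and the "name" field come from the tree-sitter Python grammar, but this particular selection is an assumption and not necessarily what the langs package ships; nodeKindSet and isBoundary are local stand-ins so the sketch compiles by itself.

```go
package main

import "fmt"

// nodeKindSet mirrors the fields of NodeKindSet.
type nodeKindSet struct {
	TopLevel  []string
	Nested    []string
	NameField string
}

// isBoundary reports whether kind is listed as a top-level or nested boundary.
func isBoundary(s nodeKindSet, kind string) bool {
	for _, list := range [][]string{s.TopLevel, s.Nested} {
		for _, k := range list {
			if k == kind {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical Python configuration: classes and functions are chunk
	// boundaries, and methods are functions nested inside a class.
	python := nodeKindSet{
		TopLevel:  []string{"class_definition", "function_definition"},
		Nested:    []string{"function_definition"},
		NameField: "name",
	}
	fmt.Println(isBoundary(python, "class_definition")) // true
	fmt.Println(isBoundary(python, "import_statement")) // false
}
```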

type Position

type Position struct {
	Line       int // 1-based
	Column     int // 0-based
	ByteOffset int
}

Position is a point in source text.
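
The mixed conventions (1-based Line, 0-based Column) are easy to get wrong, so here is a small sketch that derives a Position from a byte offset. It assumes Column counts bytes rather than runes, as tree-sitter columns do; the page does not state this explicitly, and positionAt is a hypothetical helper.

```go
package main

import "fmt"

// Position mirrors the package's Position type.
type Position struct {
	Line       int // 1-based
	Column     int // 0-based
	ByteOffset int
}

// positionAt converts a byte offset into src to a Position.
// Offsets outside [0, len(src)] are clamped.
func positionAt(src []byte, offset int) Position {
	if offset < 0 {
		offset = 0
	}
	if offset > len(src) {
		offset = len(src)
	}
	line, col := 1, 0
	for i := 0; i < offset; i++ {
		if src[i] == '\n' {
			line++
			col = 0
		} else {
			col++
		}
	}
	return Position{Line: line, Column: col, ByteOffset: offset}
}

func main() {
	src := []byte("package main\n\nfunc Hello() {}\n")
	fmt.Println(positionAt(src, 0))  // {1 0 0}
	fmt.Println(positionAt(src, 19)) // {3 5 19}: the "H" of Hello
}
```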
