Documentation ¶
Overview ¶
Package chunkx provides AST-based code chunking using the CAST algorithm.
ChunkX implements the CAST (Chunking via Abstract Syntax Trees) method for semantically aware code chunking. Unlike line-based chunking, CAST respects code structure: it parses the source into an AST and creates chunks that align with syntactic boundaries (functions, classes, methods).
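To make the idea concrete, the sketch below shows one common reading of AST-based chunking: walk sibling nodes in order, merge small neighbours until a size budget is reached, and descend into any node that is too large on its own. It is an illustration under stated assumptions, not the package's implementation; the `node` type, `splitThenMerge`, and the example sizes are all hypothetical.

```go
package main

import "fmt"

// node is a hypothetical, simplified AST node: a label, a size in tokens,
// and child nodes. Real tree-sitter nodes carry far more information.
type node struct {
	name     string
	size     int
	children []*node
}

// splitThenMerge walks sibling nodes in source order. Nodes that fit are
// merged greedily into the current chunk; a node larger than maxSize is
// split by recursing into its children. An oversized node without children
// is emitted as its own chunk, since it cannot be split further here.
func splitThenMerge(nodes []*node, maxSize int) [][]string {
	var chunks [][]string
	var cur []string
	curSize := 0
	flush := func() {
		if len(cur) > 0 {
			chunks = append(chunks, cur)
			cur, curSize = nil, 0
		}
	}
	for _, n := range nodes {
		if n.size > maxSize && len(n.children) > 0 {
			flush()
			chunks = append(chunks, splitThenMerge(n.children, maxSize)...)
			continue
		}
		if curSize+n.size > maxSize {
			flush()
		}
		cur = append(cur, n.name)
		curSize += n.size
	}
	flush()
	return chunks
}

// exampleTree returns the top-level nodes of a made-up source file.
func exampleTree() []*node {
	return []*node{
		{name: "imports", size: 10},
		{name: "funcA", size: 40},
		{name: "funcB", size: 40},
		{name: "class", size: 120, children: []*node{
			{name: "method1", size: 50},
			{name: "method2", size: 50},
			{name: "method3", size: 20},
		}},
	}
}

func main() {
	for i, c := range splitThenMerge(exampleTree(), 100) {
		fmt.Println(i, c)
	}
}
```

With a budget of 100 tokens, the three small top-level declarations end up in one chunk, while the oversized class is split along its method boundaries instead of at an arbitrary line.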
Basic usage:
chunker := chunkx.NewChunker()
chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Go))
The package supports more than 30 languages, including Bash, C, C++, C#, CSS, Cue, Dockerfile, Elixir, Elm, Go, Groovy, HCL, HTML, Java, JavaScript, Kotlin, Lua, Markdown, OCaml, PHP, Protobuf, Python, Ruby, Rust, Scala, SQL, Svelte, Swift, TOML, TypeScript, and YAML.
For unsupported file types, the chunker automatically falls back to a generic line-based chunking algorithm.
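For intuition, a generic line-based fallback could look roughly like the sketch below. The package does not document how its fallback works, so `chunkByLines`, `lineChunk`, and the fixed line budget are hypothetical stand-ins rather than the real behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// lineChunk is a hypothetical stand-in for a chunk produced without an AST.
type lineChunk struct {
	Content   string
	StartLine int // 1-based
	EndLine   int // 1-based, inclusive
}

// chunkByLines groups consecutive lines into chunks of at most maxLines
// lines each. Values of maxLines below 1 are treated as 1.
func chunkByLines(code string, maxLines int) []lineChunk {
	if maxLines < 1 {
		maxLines = 1
	}
	lines := strings.Split(code, "\n")
	var chunks []lineChunk
	for start := 0; start < len(lines); start += maxLines {
		end := start + maxLines
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, lineChunk{
			Content:   strings.Join(lines[start:end], "\n"),
			StartLine: start + 1,
			EndLine:   end,
		})
	}
	return chunks
}

func main() {
	for _, c := range chunkByLines("a\nb\nc\nd\ne", 2) {
		fmt.Printf("%d-%d %q\n", c.StartLine, c.EndLine, c.Content)
	}
}
```

A splitter like this ignores syntax entirely, which is why it only makes sense as a fallback when no grammar is available for the file.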
Index ¶
- Constants
- Variables
- func GetLineNumbers(node *sitter.Node) (int, int)
- func GetNodeSize(node *sitter.Node, source []byte, counter TokenCounter) (int, error)
- func GetNodeText(node *sitter.Node, source []byte) string
- type ByteCounter
- type Chunk
- type Chunker
- type LanguageError
- type LineCounter
- type Option
- type ParseResult
- type Parser
- type SimpleTokenCounter
- type TokenCounter
Constants ¶
const (
	// DefaultMaxSize is the default maximum chunk size in tokens.
	DefaultMaxSize = 1500
	// DefaultOverlap is the default overlap percentage between chunks.
	DefaultOverlap = 0
	// MaxOverlap is the maximum allowed overlap percentage.
	MaxOverlap = 50
)
Default configuration values.
Variables ¶
var (
	// ErrLanguageNotSpecified is returned when no language is specified for chunking.
	ErrLanguageNotSpecified = errors.New("language must be specified")
	// ErrUnsupportedLanguage is returned when the specified language is not supported.
	ErrUnsupportedLanguage = errors.New("unsupported language")
	// ErrNoASTSupport is returned when a language doesn't support AST parsing.
	ErrNoASTSupport = errors.New("language does not support AST parsing")
	// ErrParseFailed is returned when parsing fails.
	ErrParseFailed = errors.New("failed to parse code")
	// ErrNodeSize is returned when node size calculation fails.
	ErrNodeSize = errors.New("failed to calculate node size")
)
Sentinel errors that can be checked with errors.Is().
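The sketch below illustrates why errors.Is is preferred over a direct comparison once an error has been wrapped with extra context. It uses a local stand-in, `errUnsupportedLanguage`, in place of chunkx.ErrUnsupportedLanguage so that it runs without the package; `chunkStub` and `isUnsupported` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsupportedLanguage is a local stand-in for chunkx.ErrUnsupportedLanguage.
var errUnsupportedLanguage = errors.New("unsupported language")

// chunkStub is a hypothetical function that fails and wraps the sentinel
// with additional context using the %w verb.
func chunkStub(lang string) error {
	return fmt.Errorf("chunking %q: %w", lang, errUnsupportedLanguage)
}

// isUnsupported reports whether err is, or wraps, the sentinel.
func isUnsupported(err error) bool {
	return errors.Is(err, errUnsupportedLanguage)
}

func main() {
	err := chunkStub("cobol")
	fmt.Println(err == errUnsupportedLanguage) // false: the error is wrapped
	fmt.Println(isUnsupported(err))            // true: errors.Is unwraps
}
```

In real code the same check would be written against the exported sentinel, for example errors.Is(err, chunkx.ErrUnsupportedLanguage).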
Functions ¶
func GetLineNumbers ¶
GetLineNumbers returns the start and end line numbers for a node (1-based).
func GetNodeSize ¶
GetNodeSize calculates the size of a node using the provided token counter.
Types ¶
type ByteCounter ¶
type ByteCounter struct{}
ByteCounter counts bytes instead of tokens.
func (*ByteCounter) CountTokens ¶
func (b *ByteCounter) CountTokens(text string) (int, error)
CountTokens returns the number of bytes in the text.
type Chunk ¶
type Chunk struct {
Content string // The actual code content
StartLine int // Starting line number (1-based)
EndLine int // Ending line number (1-based)
StartByte int // Starting byte offset
EndByte int // Ending byte offset
NodeTypes []string // AST node types included in this chunk
Language languages.LanguageName // Programming language of the chunk
}
Chunk represents a semantically coherent unit of code extracted via AST-based chunking.
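Because a Chunk carries both byte offsets and 1-based line numbers, it can help to see how the two relate. The sketch below derives a 1-based line number from a byte offset by counting preceding newlines. `lineOf` is a hypothetical helper, and whether EndByte is inclusive or exclusive is not stated in the documentation, so the example only looks up offsets of bytes that are known to exist.

```go
package main

import (
	"fmt"
	"strings"
)

// lineOf returns the 1-based line number of the byte at offset.
// Offsets outside the source are clamped to its bounds.
func lineOf(source string, offset int) int {
	if offset < 0 {
		offset = 0
	}
	if offset > len(source) {
		offset = len(source)
	}
	return strings.Count(source[:offset], "\n") + 1
}

func main() {
	src := "package main\n\nfunc main() {\n}\n"
	start := strings.Index(src, "func") // first byte of the function
	end := strings.LastIndex(src, "}")  // last byte of the function
	fmt.Println(lineOf(src, start), lineOf(src, end)) // 3 4
}
```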
type Chunker ¶
type Chunker interface {
Chunk(code string, opts ...Option) ([]Chunk, error)
ChunkFile(path string, opts ...Option) ([]Chunk, error)
}
Chunker provides AST-based code chunking capabilities.
type LanguageError ¶
type LanguageError struct {
Language languages.LanguageName
Err error
}
LanguageError wraps language-specific errors with the language name.
func (*LanguageError) Error ¶
func (e *LanguageError) Error() string
func (*LanguageError) Unwrap ¶
func (e *LanguageError) Unwrap() error
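Since LanguageError exposes Unwrap, callers can recover both the language and the underlying cause. The sketch below mirrors the documented fields with a local `languageError` type (Language is a plain string here, and the Error message format is an assumption, as the documentation does not specify it); `failingLanguage` is a hypothetical helper.

```go
package main

import (
	"errors"
	"fmt"
)

// errParseFailed is a local stand-in for chunkx.ErrParseFailed.
var errParseFailed = errors.New("failed to parse code")

// languageError mirrors the documented LanguageError fields.
type languageError struct {
	Language string
	Err      error
}

// Error formats the message; this exact format is an assumption.
func (e *languageError) Error() string { return fmt.Sprintf("%s: %v", e.Language, e.Err) }

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *languageError) Unwrap() error { return e.Err }

// failingLanguage returns the language name when err carries a
// languageError whose cause is errParseFailed.
func failingLanguage(err error) (string, bool) {
	var le *languageError
	if errors.As(err, &le) && errors.Is(err, errParseFailed) {
		return le.Language, true
	}
	return "", false
}

func main() {
	var err error = &languageError{Language: "go", Err: errParseFailed}
	lang, ok := failingLanguage(err)
	fmt.Println(lang, ok) // go true
}
```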
type LineCounter ¶
type LineCounter struct{}
LineCounter counts lines instead of tokens.
func (*LineCounter) CountTokens ¶
func (l *LineCounter) CountTokens(text string) (int, error)
CountTokens returns the number of lines in the text.
type Option ¶
type Option func(*config)
Option configures the chunker.
func WithLanguage ¶
func WithLanguage(lang languages.LanguageName) Option
WithLanguage sets the language for parsing. Use the exported constants: languages.Go, languages.Python, etc.
func WithMaxSize ¶
WithMaxSize sets the maximum chunk size in tokens.
func WithOverlap ¶
WithOverlap sets the overlap percentage (0-MaxOverlap).
func WithTokenCounter ¶
func WithTokenCounter(counter TokenCounter) Option
WithTokenCounter sets a custom token counter.
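The Option type follows Go's functional options pattern. The sketch below shows how such options are typically applied on top of defaults. The lowercase names are local stand-ins, the `config` fields are guesses (the real struct is unexported), and clamping the overlap to the 0-MaxOverlap range is an assumption; the package may instead reject or ignore out-of-range values.

```go
package main

import "fmt"

// Local copies of the documented defaults.
const (
	defaultMaxSize = 1500
	defaultOverlap = 0
	maxOverlap     = 50
)

// config is a hypothetical version of the unexported chunker config.
type config struct {
	maxSize int
	overlap int
}

// option mirrors the documented shape: type Option func(*config).
type option func(*config)

// withMaxSize sets the maximum chunk size in tokens.
func withMaxSize(n int) option {
	return func(c *config) { c.maxSize = n }
}

// withOverlap sets the overlap percentage. Clamping to [0, maxOverlap]
// is an assumption made for this sketch.
func withOverlap(pct int) option {
	return func(c *config) {
		if pct < 0 {
			pct = 0
		}
		if pct > maxOverlap {
			pct = maxOverlap
		}
		c.overlap = pct
	}
}

// newConfig starts from the defaults and applies each option in order.
func newConfig(opts ...option) config {
	c := config{maxSize: defaultMaxSize, overlap: defaultOverlap}
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())                                  // {maxSize:1500 overlap:0}
	fmt.Printf("%+v\n", newConfig(withMaxSize(800), withOverlap(75))) // {maxSize:800 overlap:50}
}
```

The appeal of this pattern is that callers only mention the settings they want to change, and new options can be added later without breaking existing call sites.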
type ParseResult ¶
type ParseResult struct {
Tree *sitter.Tree
Language languages.LanguageName
Source []byte
}
ParseResult contains the parsed AST and metadata.
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser provides language-agnostic parsing capabilities using tree-sitter.
func (*Parser) Parse ¶
func (p *Parser) Parse(code string, language languages.LanguageName) (*ParseResult, error)
Parse parses the given code using the specified language.
type SimpleTokenCounter ¶
type SimpleTokenCounter struct{}
SimpleTokenCounter provides a basic whitespace-based token counting implementation.
func (*SimpleTokenCounter) CountTokens ¶
func (s *SimpleTokenCounter) CountTokens(text string) (int, error)
CountTokens returns the number of whitespace-separated words in the text.
type TokenCounter ¶
TokenCounter defines the interface for counting tokens in text.
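The interface body is not shown here, but ByteCounter, LineCounter, and SimpleTokenCounter all expose CountTokens(text string) (int, error), so a custom counter presumably needs the same method. The sketch below assumes that single-method shape and uses local stand-in types; `runeCounter` and `measure` are hypothetical and exist only to show how different counters size the same text.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// tokenCounter is the assumed shape of the TokenCounter interface.
type tokenCounter interface {
	CountTokens(text string) (int, error)
}

// wordCounter counts whitespace-separated words, like SimpleTokenCounter.
type wordCounter struct{}

func (wordCounter) CountTokens(text string) (int, error) {
	return len(strings.Fields(text)), nil
}

// byteCounter counts bytes, like ByteCounter.
type byteCounter struct{}

func (byteCounter) CountTokens(text string) (int, error) {
	return len(text), nil
}

// runeCounter is a hypothetical custom counter that counts Unicode code points.
type runeCounter struct{}

func (runeCounter) CountTokens(text string) (int, error) {
	return utf8.RuneCountInString(text), nil
}

// measure returns the size reported by c, or -1 if counting fails.
func measure(c tokenCounter, text string) int {
	n, err := c.CountTokens(text)
	if err != nil {
		return -1
	}
	return n
}

func main() {
	text := "func héllo() {}"
	fmt.Println(measure(wordCounter{}, text)) // 3
	fmt.Println(measure(byteCounter{}, text)) // 16
	fmt.Println(measure(runeCounter{}, text)) // 15
}
```

A counter of this shape would be passed to the chunker through WithTokenCounter, for instance to match the tokenizer of a specific embedding model.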