chunkx

package module
v0.0.3
Published: Nov 9, 2025 License: MIT Imports: 8 Imported by: 0

README


A Go library for AST-based code chunking implementing the CAST (Chunking via Abstract Syntax Trees) algorithm from the paper "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree".

Features

  • Syntax-aware chunking: Respects code structure (functions, classes, methods) instead of arbitrarily splitting at line boundaries
  • Multi-language support: Works with 30+ languages via tree-sitter parsers
  • Generic fallback: Automatically falls back to line-based chunking for unsupported file types
  • Configurable chunk sizes: Set maximum chunk size in tokens, bytes, or lines
  • Custom token counters: Pluggable interface for custom tokenization strategies
  • Overlap support: Optional chunk overlapping for better context preservation

Installation

go get github.com/gomantics/chunkx

Quick Start

package main

import (
    "fmt"
    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

func main() {
    chunker := chunkx.NewChunker()

    code := `package main

func hello() {
    fmt.Println("Hello, World!")
}

func goodbye() {
    fmt.Println("Goodbye!")
}`

    chunks, err := chunker.Chunk(code,
        chunkx.WithLanguage(languages.Go),
        chunkx.WithMaxSize(50))
    if err != nil {
        panic(err)
    }

    for i, chunk := range chunks {
        fmt.Printf("Chunk %d (lines %d-%d):\n%s\n\n",
            i+1, chunk.StartLine, chunk.EndLine, chunk.Content)
    }
}

Usage

Basic Chunking
chunker := chunkx.NewChunker()

// Chunk code with language specified
chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Python))

File-based Chunking
// Auto-detects language from file extension
chunks, err := chunker.ChunkFile("main.go")

Custom Configuration
chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithMaxSize(1500),      // Max 1500 tokens per chunk
    chunkx.WithOverlap(10),         // 10% overlap between chunks
)

Custom Token Counter
import "strings"

type MyTokenCounter struct{}

func (m *MyTokenCounter) CountTokens(text string) (int, error) {
    // Count whitespace-separated words as tokens.
    return len(strings.Fields(text)), nil
}

chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithTokenCounter(&MyTokenCounter{}))

OpenAI-Compatible Token Counting

For production use with OpenAI models, you can integrate tiktoken-go for accurate token counting:

import (
    "github.com/pkoukk/tiktoken-go"

    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

// TikTokenCounter uses OpenAI's tiktoken for accurate token counting
type TikTokenCounter struct {
    encoding *tiktoken.Tiktoken
}

// NewTikTokenCounter creates a counter for a specific OpenAI model
func NewTikTokenCounter(model string) (*TikTokenCounter, error) {
    encoding, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return nil, err
    }
    return &TikTokenCounter{encoding: encoding}, nil
}

func (t *TikTokenCounter) CountTokens(text string) (int, error) {
    tokens := t.encoding.Encode(text, nil, nil)
    return len(tokens), nil
}

// Usage example
func main() {
    tokenCounter, err := NewTikTokenCounter("gpt-4")
    if err != nil {
        panic(err)
    }

    chunker := chunkx.NewChunker()
    chunks, err := chunker.Chunk(code,
        chunkx.WithLanguage(languages.Python),
        chunkx.WithMaxSize(8000),
        chunkx.WithTokenCounter(tokenCounter))
    if err != nil {
        panic(err)
    }

    // Your chunks are now sized according to GPT-4's tokenization
    for _, chunk := range chunks {
        // Process chunks...
    }
}

This ensures your chunks respect the exact token limits of OpenAI models like GPT-3.5, GPT-4, and GPT-4o.

Built-in Token Counters
  • SimpleTokenCounter: Whitespace-based word counting (default)
  • ByteCounter: Counts bytes instead of tokens
  • LineCounter: Counts lines instead of tokens

// Use byte-based chunking
chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Python),
    chunkx.WithMaxSize(4096),
    chunkx.WithTokenCounter(&chunkx.ByteCounter{}))
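The documented behaviors of these counters are easy to approximate with the standard library. A minimal sketch with local stand-ins, not the actual chunkx implementations:

```go
package main

import (
	"fmt"
	"strings"
)

// countWords mirrors SimpleTokenCounter's documented behavior:
// the number of whitespace-separated words.
func countWords(text string) int { return len(strings.Fields(text)) }

// countBytes mirrors ByteCounter: the byte length of the text.
func countBytes(text string) int { return len(text) }

// countLines mirrors LineCounter: newline-delimited lines.
func countLines(text string) int {
	if text == "" {
		return 0
	}
	return strings.Count(text, "\n") + 1
}

func main() {
	code := "func hello() {\n\tfmt.Println(\"hi\")\n}"
	fmt.Println(countWords(code), countBytes(code), countLines(code)) // 5 35 3
}
```

The choice of counter changes what WithMaxSize means: 4096 with ByteCounter caps chunks at 4 KiB, while the same number with the default counter caps them at 4096 words.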

Supported Languages

ChunkX supports 30+ programming languages via tree-sitter. Use the exported language constants from the languages package (e.g., languages.Go, languages.Python, languages.JavaScript, etc.). See the languages package for the complete list of supported languages and file extensions.

For files with unrecognized extensions or explicitly using languages.Generic, ChunkX automatically falls back to a line-based chunking algorithm that maintains the chunking semantics without requiring AST parsing.
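As a rough picture of what that fallback does, here is a dependency-free sketch that packs whole lines into fixed-size chunks. This is an illustration only; chunkx's actual fallback also tracks byte offsets and honors the configured token counter and overlap:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkLines greedily packs consecutive whole lines into chunks of at
// most maxLines lines, never splitting a line -- the basic shape of a
// line-based fallback.
func chunkLines(code string, maxLines int) []string {
	lines := strings.Split(code, "\n")
	var chunks []string
	for start := 0; start < len(lines); start += maxLines {
		end := start + maxLines
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, strings.Join(lines[start:end], "\n"))
	}
	return chunks
}

func main() {
	chunks := chunkLines("a\nb\nc\nd\ne", 2)
	fmt.Println(len(chunks)) // 3
}
```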

How It Works

ChunkX implements the CAST algorithm which:

  1. Parses source code into an Abstract Syntax Tree (AST)
  2. Recursively traverses the AST to identify semantic units
  3. Groups nodes while respecting the maximum chunk size
  4. Splits large nodes that exceed the size limit
  5. Merges smaller sibling nodes to maximize chunk density
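The grouping and merging steps (3–5) can be sketched as a greedy pass over pre-sized sibling nodes. This toy illustrates the idea only, not chunkx's implementation; real CAST recursively splits oversized nodes rather than emitting them whole:

```go
package main

import "fmt"

// groupUnits greedily merges adjacent units (think: sibling AST nodes
// with known token sizes) into chunks whose total stays within maxSize.
// A unit larger than maxSize becomes a chunk of its own here.
func groupUnits(sizes []int, maxSize int) [][]int {
	var chunks [][]int
	var current []int
	total := 0
	for _, s := range sizes {
		// Flush the current chunk when adding this unit would overflow it.
		if total+s > maxSize && len(current) > 0 {
			chunks = append(chunks, current)
			current, total = nil, 0
		}
		current = append(current, s)
		total += s
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	// Sibling nodes of sizes 40, 30, 90, 10, 20 with a 100-token budget:
	// small siblings are packed together to maximize chunk density.
	fmt.Println(groupUnits([]int{40, 30, 90, 10, 20}, 100)) // [[40 30] [90 10] [20]]
}
```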

This approach ensures that chunks:

  • Preserve syntactic integrity (no mid-function splits)
  • Maintain semantic coherence
  • Are self-contained and meaningful
  • Respect language-specific structures

Chunk Structure

type Chunk struct {
    Content    string                // The actual code content
    StartLine  int                   // Starting line number (1-based)
    EndLine    int                   // Ending line number (1-based)
    StartByte  int                   // Starting byte offset
    EndByte    int                   // Ending byte offset
    NodeTypes  []string              // AST node types included
    Language   languages.LanguageName // Programming language
}

Performance

Benchmarks on Apple M4 Max (3s run):

BenchmarkASTChunking-14                              41301     85932 ns/op   19520 B/op     170 allocs/op
BenchmarkLineBasedChunking-14                      4392780       831.6 ns/op   1904 B/op      10 allocs/op
BenchmarkASTChunkingLarge-14                         4681    769800 ns/op  110464 B/op     794 allocs/op
BenchmarkLineBasedChunkingLarge-14                 437184      8273 ns/op   16880 B/op      27 allocs/op
BenchmarkASTChunkingMultipleLanguages-14            22951    156257 ns/op   42336 B/op     336 allocs/op
BenchmarkTokenCounters/SimpleTokenCounter-14        51332     70434 ns/op    4760 B/op      20 allocs/op
BenchmarkTokenCounters/ByteCounter-14               40485     88952 ns/op   21504 B/op     227 allocs/op
BenchmarkTokenCounters/LineCounter-14               51607     70349 ns/op    3224 B/op      19 allocs/op
BenchmarkOverlapChunking/Overlap0-14                42333     85163 ns/op   19544 B/op     172 allocs/op
BenchmarkOverlapChunking/Overlap10-14               41676     85761 ns/op   21832 B/op     187 allocs/op
BenchmarkOverlapChunking/Overlap25-14               42122     85715 ns/op   22032 B/op     187 allocs/op
BenchmarkOverlapChunking/Overlap50-14               41696     85976 ns/op   22360 B/op     187 allocs/op

AST-based chunking is ~100x slower than naive line-based chunking but produces semantically superior chunks that improve RAG performance. The SimpleTokenCounter and LineCounter provide the best performance, while ByteCounter has slightly higher overhead due to more allocations. Chunk overlap has minimal performance impact (~0.5% overhead).

Examples

The testdata/ directory contains real-world code examples in multiple languages, along with their chunked outputs in JSON format. These examples serve as both documentation and regression tests:

  • testdata/sources/: Example source files in Go, Python, JavaScript, TypeScript, Java, Rust, and C++
  • testdata/*.approved.json: Snapshot test outputs showing how each example is chunked

To see how chunkx handles different languages and chunk sizes, browse the approved JSON files. They show:

  • Complete chunk content
  • Line and byte ranges
  • AST node types included in each chunk
  • How semantic boundaries are preserved

The snapshots are automatically verified using go-approval-tests to ensure chunking behavior remains consistent across changes.

Testing

# Run tests
go test ./...

# Run benchmarks
go test -bench=. -benchtime=10s

# Run with coverage
go test -cover ./...

# Run approval tests (regenerate snapshots on first failure)
go test -run TestChunkingExamples

Use Cases

  • RAG Systems: Improve retrieval quality by providing semantically coherent code chunks
  • Code Search: Index code at meaningful boundaries
  • Documentation: Generate documentation from logical code units
  • Code Analysis: Process code in structured segments
  • LLM Context Windows: Fit code into token limits while preserving structure

Design Principles

  1. Minimalist: Clean, focused codebase with no unnecessary abstractions
  2. Well-tested: Comprehensive unit, integration, and benchmark tests
  3. Pluggable: Interface-based design for extensibility
  4. Language-agnostic: Works consistently across programming languages

References

  • "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree"

License

MIT

Documentation

Overview

Package chunkx provides AST-based code chunking using the CAST algorithm.

ChunkX implements the CAST (Chunking via Abstract Syntax Trees) method for semantically-aware code chunking. Unlike line-based chunking, CAST respects code structure by parsing source into an AST and creating chunks that align with syntactic boundaries (functions, classes, methods).

Basic usage:

chunker := chunkx.NewChunker()
chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Go))

Supports 30+ languages including Bash, C, C++, C#, CSS, Cue, Dockerfile, Elixir, Elm, Go, Groovy, HCL, HTML, Java, JavaScript, Kotlin, Lua, Markdown, OCaml, PHP, Protobuf, Python, Ruby, Rust, Scala, SQL, Svelte, Swift, TOML, TypeScript, and YAML.

For unsupported file types, the chunker automatically falls back to a generic line-based chunking algorithm.


Constants

View Source
const (
	// DefaultMaxSize is the default maximum chunk size in tokens.
	DefaultMaxSize = 1500

	// DefaultOverlap is the default overlap percentage between chunks.
	DefaultOverlap = 0

	// MaxOverlap is the maximum allowed overlap percentage.
	MaxOverlap = 50
)

Default configuration values.

Variables

View Source
var (
	// ErrLanguageNotSpecified is returned when no language is specified for chunking.
	ErrLanguageNotSpecified = errors.New("language must be specified")

	// ErrUnsupportedLanguage is returned when the specified language is not supported.
	ErrUnsupportedLanguage = errors.New("unsupported language")

	// ErrNoASTSupport is returned when a language doesn't support AST parsing.
	ErrNoASTSupport = errors.New("language does not support AST parsing")

	// ErrParseFailed is returned when parsing fails.
	ErrParseFailed = errors.New("failed to parse code")

	// ErrNodeSize is returned when node size calculation fails.
	ErrNodeSize = errors.New("failed to calculate node size")
)

Sentinel errors that can be checked with errors.Is().

Functions

func GetLineNumbers

func GetLineNumbers(node *sitter.Node) (int, int)

GetLineNumbers returns the start and end line numbers for a node (1-based).

func GetNodeSize

func GetNodeSize(node *sitter.Node, source []byte, counter TokenCounter) (int, error)

GetNodeSize calculates the size of a node using the provided token counter.

func GetNodeText

func GetNodeText(node *sitter.Node, source []byte) string

GetNodeText returns the text content of a node.

Types

type ByteCounter

type ByteCounter struct{}

ByteCounter counts bytes instead of tokens.

func (*ByteCounter) CountTokens

func (b *ByteCounter) CountTokens(text string) (int, error)

CountTokens returns the number of bytes in the text.

type Chunk

type Chunk struct {
	Content   string                 // The actual code content
	StartLine int                    // Starting line number (1-based)
	EndLine   int                    // Ending line number (1-based)
	StartByte int                    // Starting byte offset
	EndByte   int                    // Ending byte offset
	NodeTypes []string               // AST node types included in this chunk
	Language  languages.LanguageName // Programming language of the chunk
}

Chunk represents a semantically coherent unit of code extracted via AST-based chunking.

type Chunker

type Chunker interface {
	Chunk(code string, opts ...Option) ([]Chunk, error)
	ChunkFile(path string, opts ...Option) ([]Chunk, error)
}

Chunker provides AST-based code chunking capabilities.

func NewChunker

func NewChunker() Chunker

NewChunker creates a new CAST chunker instance.

type LanguageError

type LanguageError struct {
	Language languages.LanguageName
	Err      error
}

LanguageError wraps language-specific errors with the language name.

func (*LanguageError) Error

func (e *LanguageError) Error() string

func (*LanguageError) Unwrap

func (e *LanguageError) Unwrap() error

type LineCounter

type LineCounter struct{}

LineCounter counts lines instead of tokens.

func (*LineCounter) CountTokens

func (l *LineCounter) CountTokens(text string) (int, error)

CountTokens returns the number of lines in the text.

type Option

type Option func(*config)

Option configures the chunker.

func WithLanguage

func WithLanguage(lang languages.LanguageName) Option

WithLanguage sets the language for parsing. Use the exported constants: languages.Go, languages.Python, etc.

func WithMaxSize

func WithMaxSize(tokens int) Option

WithMaxSize sets the maximum chunk size in tokens.

func WithOverlap

func WithOverlap(percent float64) Option

WithOverlap sets the overlap percentage (0-MaxOverlap).

func WithTokenCounter

func WithTokenCounter(counter TokenCounter) Option

WithTokenCounter sets a custom token counter.

type ParseResult

type ParseResult struct {
	Tree     *sitter.Tree
	Language languages.LanguageName
	Source   []byte
}

ParseResult contains the parsed AST and metadata.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser provides language-agnostic parsing capabilities using tree-sitter.

func NewParser

func NewParser() *Parser

NewParser creates a new parser instance.

func (*Parser) Parse

func (p *Parser) Parse(code string, language languages.LanguageName) (*ParseResult, error)

Parse parses the given code using the specified language.

func (*Parser) ParseFile

func (p *Parser) ParseFile(filepath string, code string) (*ParseResult, error)

ParseFile parses code from a file, auto-detecting the language.

type SimpleTokenCounter

type SimpleTokenCounter struct{}

SimpleTokenCounter provides a basic whitespace-based token counting implementation.

func (*SimpleTokenCounter) CountTokens

func (s *SimpleTokenCounter) CountTokens(text string) (int, error)

CountTokens returns the number of whitespace-separated words in the text.

type TokenCounter

type TokenCounter interface {
	CountTokens(text string) (int, error)
}

TokenCounter defines the interface for counting tokens in text.
