chunkx

package module
v0.0.3
Published: Nov 9, 2025 License: MIT Imports: 8 Imported by: 0

README


A Go library for AST-based code chunking implementing the CAST (Chunking via Abstract Syntax Trees) algorithm from the paper "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree".

Features

  • Syntax-aware chunking: Respects code structure (functions, classes, methods) instead of arbitrarily splitting at line boundaries
  • Multi-language support: Works with 30+ languages via tree-sitter parsers
  • Generic fallback: Automatically falls back to line-based chunking for unsupported file types
  • Configurable chunk sizes: Set maximum chunk size in tokens, bytes, or lines
  • Custom token counters: Pluggable interface for custom tokenization strategies
  • Overlap support: Optional chunk overlapping for better context preservation

Installation

go get github.com/gomantics/chunkx

Quick Start

package main

import (
    "fmt"
    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

func main() {
    chunker := chunkx.NewChunker()

    code := `package main

func hello() {
    fmt.Println("Hello, World!")
}

func goodbye() {
    fmt.Println("Goodbye!")
}`

    chunks, err := chunker.Chunk(code,
        chunkx.WithLanguage(languages.Go),
        chunkx.WithMaxSize(50))
    if err != nil {
        panic(err)
    }

    for i, chunk := range chunks {
        fmt.Printf("Chunk %d (lines %d-%d):\n%s\n\n",
            i+1, chunk.StartLine, chunk.EndLine, chunk.Content)
    }
}

Usage

Basic Chunking
chunker := chunkx.NewChunker()

// Chunk code with language specified
chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Python))

File-based Chunking
// Auto-detects language from file extension
chunks, err := chunker.ChunkFile("main.go")

Custom Configuration
chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithMaxSize(1500),      // Max 1500 tokens per chunk
    chunkx.WithOverlap(10),         // 10% overlap between chunks
)

Custom Token Counter
import "strings"

type MyTokenCounter struct{}

func (m *MyTokenCounter) CountTokens(text string) (int, error) {
    // Count whitespace-separated words as tokens.
    return len(strings.Fields(text)), nil
}

chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Go),
    chunkx.WithTokenCounter(&MyTokenCounter{}))

OpenAI-Compatible Token Counting

For production use with OpenAI models, you can integrate tiktoken-go for accurate token counting:

import (
    "github.com/pkoukk/tiktoken-go"

    "github.com/gomantics/chunkx"
    "github.com/gomantics/chunkx/languages"
)

// TikTokenCounter uses OpenAI's tiktoken for accurate token counting
type TikTokenCounter struct {
    encoding *tiktoken.Tiktoken
}

// NewTikTokenCounter creates a counter for a specific OpenAI model
func NewTikTokenCounter(model string) (*TikTokenCounter, error) {
    encoding, err := tiktoken.EncodingForModel(model)
    if err != nil {
        return nil, err
    }
    return &TikTokenCounter{encoding: encoding}, nil
}

func (t *TikTokenCounter) CountTokens(text string) (int, error) {
    tokens := t.encoding.Encode(text, nil, nil)
    return len(tokens), nil
}

// Usage example
func main() {
    tokenCounter, err := NewTikTokenCounter("gpt-4")
    if err != nil {
        panic(err)
    }

    chunker := chunkx.NewChunker()
    chunks, err := chunker.Chunk(code,
        chunkx.WithLanguage(languages.Python),
        chunkx.WithMaxSize(8000),
        chunkx.WithTokenCounter(tokenCounter))
    if err != nil {
        panic(err)
    }

    // Your chunks are now sized according to GPT-4's tokenization
    for _, chunk := range chunks {
        // Process chunks...
    }
}

This ensures your chunks respect the exact token limits of OpenAI models like GPT-3.5, GPT-4, and GPT-4o.

Built-in Token Counters
  • SimpleTokenCounter: Whitespace-based word counting (default)
  • ByteCounter: Counts bytes instead of tokens
  • LineCounter: Counts lines instead of tokens

// Use byte-based chunking
chunks, err := chunker.Chunk(code,
    chunkx.WithLanguage(languages.Python),
    chunkx.WithMaxSize(4096),
    chunkx.WithTokenCounter(&chunkx.ByteCounter{}))
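The documented behaviors of these counters are easy to approximate with the standard library. A minimal sketch with local stand-ins, not the actual chunkx implementations:

```go
package main

import (
	"fmt"
	"strings"
)

// countWords mirrors SimpleTokenCounter's documented behavior:
// the number of whitespace-separated words.
func countWords(text string) int { return len(strings.Fields(text)) }

// countBytes mirrors ByteCounter: the byte length of the text.
func countBytes(text string) int { return len(text) }

// countLines mirrors LineCounter: newline-delimited lines.
func countLines(text string) int {
	if text == "" {
		return 0
	}
	return strings.Count(text, "\n") + 1
}

func main() {
	code := "func hello() {\n\tfmt.Println(\"hi\")\n}"
	fmt.Println(countWords(code), countBytes(code), countLines(code)) // 5 35 3
}
```

The choice of counter changes what WithMaxSize means: 4096 with ByteCounter caps chunks at 4 KiB, while the same number with the default counter caps them at 4096 words.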

Supported Languages

ChunkX supports 30+ programming languages via tree-sitter. Use the exported language constants from the languages package (e.g., languages.Go, languages.Python, languages.JavaScript, etc.). See the languages package for the complete list of supported languages and file extensions.

For files with unrecognized extensions or explicitly using languages.Generic, ChunkX automatically falls back to a line-based chunking algorithm that maintains the chunking semantics without requiring AST parsing.
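As a rough picture of what that fallback does, here is a dependency-free sketch that packs whole lines into fixed-size chunks. This is an illustration only; chunkx's actual fallback also tracks byte offsets and honors the configured token counter and overlap:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkLines greedily packs consecutive whole lines into chunks of at
// most maxLines lines, never splitting a line -- the basic shape of a
// line-based fallback.
func chunkLines(code string, maxLines int) []string {
	lines := strings.Split(code, "\n")
	var chunks []string
	for start := 0; start < len(lines); start += maxLines {
		end := start + maxLines
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, strings.Join(lines[start:end], "\n"))
	}
	return chunks
}

func main() {
	chunks := chunkLines("a\nb\nc\nd\ne", 2)
	fmt.Println(len(chunks)) // 3
}
```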

How It Works

ChunkX implements the CAST algorithm which:

  1. Parses source code into an Abstract Syntax Tree (AST)
  2. Recursively traverses the AST to identify semantic units
  3. Groups nodes while respecting the maximum chunk size
  4. Splits large nodes that exceed the size limit
  5. Merges smaller sibling nodes to maximize chunk density
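The grouping and merging steps (3–5) can be sketched as a greedy pass over pre-sized sibling nodes. This toy illustrates the idea only, not chunkx's implementation; real CAST recursively splits oversized nodes rather than emitting them whole:

```go
package main

import "fmt"

// groupUnits greedily merges adjacent units (think: sibling AST nodes
// with known token sizes) into chunks whose total stays within maxSize.
// A unit larger than maxSize becomes a chunk of its own here.
func groupUnits(sizes []int, maxSize int) [][]int {
	var chunks [][]int
	var current []int
	total := 0
	for _, s := range sizes {
		// Flush the current chunk when adding this unit would overflow it.
		if total+s > maxSize && len(current) > 0 {
			chunks = append(chunks, current)
			current, total = nil, 0
		}
		current = append(current, s)
		total += s
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	// Sibling nodes of sizes 40, 30, 90, 10, 20 with a 100-token budget:
	// small siblings are packed together to maximize chunk density.
	fmt.Println(groupUnits([]int{40, 30, 90, 10, 20}, 100)) // [[40 30] [90 10] [20]]
}
```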

This approach ensures that chunks:

  • Preserve syntactic integrity (no mid-function splits)
  • Maintain semantic coherence
  • Are self-contained and meaningful
  • Respect language-specific structures

Chunk Structure

type Chunk struct {
    Content    string                // The actual code content
    StartLine  int                   // Starting line number (1-based)
    EndLine    int                   // Ending line number (1-based)
    StartByte  int                   // Starting byte offset
    EndByte    int                   // Ending byte offset
    NodeTypes  []string              // AST node types included
    Language   languages.LanguageName // Programming language
}

Performance

Benchmarks on Apple M4 Max (3s run):

BenchmarkASTChunking-14                              41301     85932 ns/op   19520 B/op     170 allocs/op
BenchmarkLineBasedChunking-14                      4392780       831.6 ns/op   1904 B/op      10 allocs/op
BenchmarkASTChunkingLarge-14                         4681    769800 ns/op  110464 B/op     794 allocs/op
BenchmarkLineBasedChunkingLarge-14                 437184      8273 ns/op   16880 B/op      27 allocs/op
BenchmarkASTChunkingMultipleLanguages-14            22951    156257 ns/op   42336 B/op     336 allocs/op
BenchmarkTokenCounters/SimpleTokenCounter-14        51332     70434 ns/op    4760 B/op      20 allocs/op
BenchmarkTokenCounters/ByteCounter-14               40485     88952 ns/op   21504 B/op     227 allocs/op
BenchmarkTokenCounters/LineCounter-14               51607     70349 ns/op    3224 B/op      19 allocs/op
BenchmarkOverlapChunking/Overlap0-14                42333     85163 ns/op   19544 B/op     172 allocs/op
BenchmarkOverlapChunking/Overlap10-14               41676     85761 ns/op   21832 B/op     187 allocs/op
BenchmarkOverlapChunking/Overlap25-14               42122     85715 ns/op   22032 B/op     187 allocs/op
BenchmarkOverlapChunking/Overlap50-14               41696     85976 ns/op   22360 B/op     187 allocs/op

AST-based chunking is ~100x slower than naive line-based chunking but produces semantically superior chunks that improve RAG performance. The SimpleTokenCounter and LineCounter provide the best performance, while ByteCounter has slightly higher overhead due to more allocations. Chunk overlap has minimal performance impact (~0.5% overhead).

Examples

The testdata/ directory contains real-world code examples in multiple languages, along with their chunked outputs in JSON format. These examples serve as both documentation and regression tests:

  • testdata/sources/: Example source files in Go, Python, JavaScript, TypeScript, Java, Rust, and C++
  • testdata/*.approved.json: Snapshot test outputs showing how each example is chunked

To see how chunkx handles different languages and chunk sizes, browse the approved JSON files. They show:

  • Complete chunk content
  • Line and byte ranges
  • AST node types included in each chunk
  • How semantic boundaries are preserved

The snapshots are automatically verified using go-approval-tests to ensure chunking behavior remains consistent across changes.

Testing

# Run tests
go test ./...

# Run benchmarks
go test -bench=. -benchtime=10s

# Run with coverage
go test -cover ./...

# Run approval tests (regenerate snapshots on first failure)
go test -run TestChunkingExamples

Use Cases

  • RAG Systems: Improve retrieval quality by providing semantically coherent code chunks
  • Code Search: Index code at meaningful boundaries
  • Documentation: Generate documentation from logical code units
  • Code Analysis: Process code in structured segments
  • LLM Context Windows: Fit code into token limits while preserving structure

Design Principles

  1. Minimalist: Clean, focused codebase with no unnecessary abstractions
  2. Well-tested: Comprehensive unit, integration, and benchmark tests
  3. Pluggable: Interface-based design for extensibility
  4. Language-agnostic: Works consistently across programming languages

References

  • "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree"

License

MIT

Documentation

Overview

Package chunkx provides AST-based code chunking using the CAST algorithm.

ChunkX implements the CAST (Chunking via Abstract Syntax Trees) method for semantically-aware code chunking. Unlike line-based chunking, CAST respects code structure by parsing source into an AST and creating chunks that align with syntactic boundaries (functions, classes, methods).

Basic usage:

chunker := chunkx.NewChunker()
chunks, err := chunker.Chunk(code, chunkx.WithLanguage(languages.Go))

Supports 30+ languages including Bash, C, C++, C#, CSS, Cue, Dockerfile, Elixir, Elm, Go, Groovy, HCL, HTML, Java, JavaScript, Kotlin, Lua, Markdown, OCaml, PHP, Protobuf, Python, Ruby, Rust, Scala, SQL, Svelte, Swift, TOML, TypeScript, and YAML.

For unsupported file types, the chunker automatically falls back to a generic line-based chunking algorithm.


Constants

View Source
const (
	// DefaultMaxSize is the default maximum chunk size in tokens.
	DefaultMaxSize = 1500

	// DefaultOverlap is the default overlap percentage between chunks.
	DefaultOverlap = 0

	// MaxOverlap is the maximum allowed overlap percentage.
	MaxOverlap = 50
)

Default configuration values.

Variables

View Source
var (
	// ErrLanguageNotSpecified is returned when no language is specified for chunking.
	ErrLanguageNotSpecified = errors.New("language must be specified")

	// ErrUnsupportedLanguage is returned when the specified language is not supported.
	ErrUnsupportedLanguage = errors.New("unsupported language")

	// ErrNoASTSupport is returned when a language doesn't support AST parsing.
	ErrNoASTSupport = errors.New("language does not support AST parsing")

	// ErrParseFailed is returned when parsing fails.
	ErrParseFailed = errors.New("failed to parse code")

	// ErrNodeSize is returned when node size calculation fails.
	ErrNodeSize = errors.New("failed to calculate node size")
)

Sentinel errors that can be checked with errors.Is().

Functions

func GetLineNumbers

func GetLineNumbers(node *sitter.Node) (int, int)

GetLineNumbers returns the start and end line numbers for a node (1-based).

func GetNodeSize

func GetNodeSize(node *sitter.Node, source []byte, counter TokenCounter) (int, error)

GetNodeSize calculates the size of a node using the provided token counter.

func GetNodeText

func GetNodeText(node *sitter.Node, source []byte) string

GetNodeText returns the text content of a node.

Types

type ByteCounter

type ByteCounter struct{}

ByteCounter counts bytes instead of tokens.

func (*ByteCounter) CountTokens

func (b *ByteCounter) CountTokens(text string) (int, error)

CountTokens returns the number of bytes in the text.

type Chunk

type Chunk struct {
	Content   string                 // The actual code content
	StartLine int                    // Starting line number (1-based)
	EndLine   int                    // Ending line number (1-based)
	StartByte int                    // Starting byte offset
	EndByte   int                    // Ending byte offset
	NodeTypes []string               // AST node types included in this chunk
	Language  languages.LanguageName // Programming language of the chunk
}

Chunk represents a semantically coherent unit of code extracted via AST-based chunking.

type Chunker

type Chunker interface {
	Chunk(code string, opts ...Option) ([]Chunk, error)
	ChunkFile(path string, opts ...Option) ([]Chunk, error)
}

Chunker provides AST-based code chunking capabilities.

func NewChunker

func NewChunker() Chunker

NewChunker creates a new CAST chunker instance.

type LanguageError

type LanguageError struct {
	Language languages.LanguageName
	Err      error
}

LanguageError wraps language-specific errors with the language name.

func (*LanguageError) Error

func (e *LanguageError) Error() string

func (*LanguageError) Unwrap

func (e *LanguageError) Unwrap() error

type LineCounter

type LineCounter struct{}

LineCounter counts lines instead of tokens.

func (*LineCounter) CountTokens

func (l *LineCounter) CountTokens(text string) (int, error)

CountTokens returns the number of lines in the text.

type Option

type Option func(*config)

Option configures the chunker.

func WithLanguage

func WithLanguage(lang languages.LanguageName) Option

WithLanguage sets the language for parsing. Use the exported constants: languages.Go, languages.Python, etc.

func WithMaxSize

func WithMaxSize(tokens int) Option

WithMaxSize sets the maximum chunk size in tokens.

func WithOverlap

func WithOverlap(percent float64) Option

WithOverlap sets the overlap percentage (0-MaxOverlap).

func WithTokenCounter

func WithTokenCounter(counter TokenCounter) Option

WithTokenCounter sets a custom token counter.

type ParseResult

type ParseResult struct {
	Tree     *sitter.Tree
	Language languages.LanguageName
	Source   []byte
}

ParseResult contains the parsed AST and metadata.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser provides language-agnostic parsing capabilities using tree-sitter.

func NewParser

func NewParser() *Parser

NewParser creates a new parser instance.

func (*Parser) Parse

func (p *Parser) Parse(code string, language languages.LanguageName) (*ParseResult, error)

Parse parses the given code using the specified language.

func (*Parser) ParseFile

func (p *Parser) ParseFile(filepath string, code string) (*ParseResult, error)

ParseFile parses code from a file, auto-detecting the language.

type SimpleTokenCounter

type SimpleTokenCounter struct{}

SimpleTokenCounter provides a basic whitespace-based token counting implementation.

func (*SimpleTokenCounter) CountTokens

func (s *SimpleTokenCounter) CountTokens(text string) (int, error)

CountTokens returns the number of whitespace-separated words in the text.

type TokenCounter

type TokenCounter interface {
	CountTokens(text string) (int, error)
}

TokenCounter defines the interface for counting tokens in text.
