chunk

package
v2.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 18, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package chunk splits text bodies into heading-aware sections for indexing.

Constants

const NoParent = -1

NoParent is the sentinel used by SectionWithLineage.ParentIndex to mark a chunk as a root (a flat top-level chunk or the top of a hierarchy). Any non-negative value is interpreted as the slice index of another SectionWithLineage in the same Policy result.

Variables

var ErrTooManySections = errors.New("chunk: section count exceeds configured limit")

ErrTooManySections is returned by MarkdownWithOptions when the input body produces more sections than Options.MaxSections allows. Enclosing errors wrap this sentinel via fmt.Errorf with %w so callers can use errors.Is to distinguish a pathological-input rejection from other failures (e.g., to skip the record instead of aborting the whole rebuild).
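The skip-versus-abort decision described above can be sketched with errors.Is. This is a minimal illustration, not the package's code: errTooManySections is a local stand-in for chunk.ErrTooManySections, and chunkBody and shouldSkip are hypothetical helpers that only mimic the documented %w wrapping.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManySections is a local stand-in for chunk.ErrTooManySections so the
// sketch compiles without importing the real package.
var errTooManySections = errors.New("chunk: section count exceeds configured limit")

// chunkBody is a hypothetical helper that mimics the documented wrapping: it
// returns the sentinel wrapped with %w once the section count passes the limit.
func chunkBody(sections, limit int) error {
	if limit > 0 && sections > limit {
		return fmt.Errorf("body produced %d sections (limit %d): %w",
			sections, limit, errTooManySections)
	}
	return nil
}

// shouldSkip reports whether err is a pathological-input rejection that the
// caller can skip instead of aborting the whole rebuild.
func shouldSkip(err error) bool {
	return errors.Is(err, errTooManySections)
}

func main() {
	for _, n := range []int{3, 12} {
		err := chunkBody(n, 10)
		switch {
		case err == nil:
			fmt.Println("indexed record with", n, "sections")
		case shouldSkip(err):
			fmt.Println("skipped record:", err)
		default:
			fmt.Println("aborting rebuild:", err)
		}
	}
}
```

Because errors.Is walks the wrap chain, the check keeps working even if an outer layer adds more context with another %w.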

Functions

This section is empty.

Types

type KindRouterPolicy

type KindRouterPolicy struct {
	Default Policy
	ByKind  map[string]Policy
}

KindRouterPolicy dispatches chunking to a per-Kind Policy, falling back to Default when no specific policy is registered for the record's Kind. Use it when one corpus mixes record kinds with different optimal chunking shapes (e.g., reference docs with LateChunkPolicy, raw notes with MarkdownPolicy).

A nil Default with a Kind miss returns an error rather than silently producing zero sections; the caller almost always wants to know.

func (KindRouterPolicy) Chunk

Chunk implements Policy.

type LateChunkPolicy

type LateChunkPolicy struct {
	ParentMaxTokens    int
	ChildMaxTokens     int
	ChildOverlapTokens int
	MaxSections        int
}

LateChunkPolicy is a reference implementation of the late-chunking pattern from the research signal in #16 (Late Chunking, hierarchical text segmentation). It produces one parent span per heading-aware section, then splits that span by token budget into children that point back at the parent.

Concretely, for each heading-aware section the policy emits:

  • one parent SectionWithLineage holding the full section body, with ParentIndex == NoParent. Parents are not embedded by the index layer (Q decision in the design spec); they live as storage-only context that ExpandContext can surface on IncludeParent.
  • N child SectionWithLineage rows, produced by SplitSection on the parent body with (ChildMaxTokens, ChildOverlapTokens), each carrying ParentIndex pointing at the parent's slice index. Children are embedded and participate in retrieval as usual.

ParentMaxTokens caps the parent span itself: if a heading-aware section exceeds it, that section is split into multiple parents (each with its own children). Zero means "one parent per heading-aware section, regardless of size."

MaxSections caps the total chunk count (parents + leaves combined) per record, mirroring chunk.Options.MaxSections semantics. Zero means no cap.
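To make the parent/child layout concrete, here is a small sketch of the slice a late-chunking policy could return. It is an illustration under assumptions, not the shipped implementation: the types are local stand-ins, groupWords is a hypothetical non-overlapping substitute for SplitSection, and ParentMaxTokens and MaxSections handling is left out.

```go
package main

import (
	"fmt"
	"strings"
)

const noParent = -1 // stand-in for chunk.NoParent

type section struct{ Heading, Body string }

type sectionWithLineage struct {
	section
	ParentIndex int
}

// groupWords is a hypothetical stand-in for SplitSection: it groups the body
// into runs of at most n words, with no overlap, and rebuilds each run with
// single spaces.
func groupWords(body string, n int) []string {
	words := strings.Fields(body)
	if n <= 0 || len(words) <= n {
		return []string{body}
	}
	var out []string
	for i := 0; i < len(words); i += n {
		end := i + n
		if end > len(words) {
			end = len(words)
		}
		out = append(out, strings.Join(words[i:end], " "))
	}
	return out
}

// lateChunk emits each parent first, then its children, so every child's
// ParentIndex refers to an earlier slice position.
func lateChunk(sections []section, childMaxTokens int) []sectionWithLineage {
	var out []sectionWithLineage
	for _, s := range sections {
		parentIdx := len(out)
		out = append(out, sectionWithLineage{section: s, ParentIndex: noParent})
		for _, body := range groupWords(s.Body, childMaxTokens) {
			child := section{Heading: s.Heading, Body: body}
			out = append(out, sectionWithLineage{section: child, ParentIndex: parentIdx})
		}
	}
	return out
}

func main() {
	chunks := lateChunk([]section{
		{Heading: "A", Body: "one two three four five"},
		{Heading: "B", Body: "six seven"},
	}, 2)
	for i, c := range chunks {
		fmt.Printf("%d parent=%d heading=%s body=%q\n", i, c.ParentIndex, c.Heading, c.Body)
	}
}
```

Emitting the parent before its children keeps every ParentIndex a backward reference, which matches the ordering the index session is described as validating.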

func (LateChunkPolicy) Chunk

Chunk implements Policy.

type MarkdownPolicy

type MarkdownPolicy struct {
	Options Options
}

MarkdownPolicy is the default Policy. It reproduces the pre-1.0 chunking pipeline exactly: heading-aware Markdown sectioning followed by optional token-budget splitting. Every returned section has ParentIndex == NoParent — the policy emits flat chunks only — so a snapshot built with a MarkdownPolicy is byte-equivalent to today's output, with parent_chunk_id NULL on every row.

Plaintext records (BodyFormat set to corpus.FormatPlaintext) bypass Markdown parsing and become a single section keyed on the record title; Options.MaxTokens still drives token-budget splitting if positive.

func (MarkdownPolicy) Chunk

Chunk implements Policy.

type Options

type Options struct {
	// MaxTokens is the approximate maximum number of tokens (words) per
	// section. Sections exceeding this limit are split into sub-sections
	// unless OverlapTokens is an invalid value that disables splitting.
	// Zero disables token-budget splitting.
	MaxTokens int

	// OverlapTokens is the approximate number of tokens to overlap between
	// adjacent sub-sections when a section is split. Zero disables overlap.
	// Values greater than or equal to MaxTokens are treated as invalid and
	// leave oversized sections unsplit.
	OverlapTokens int

	// MaxSections caps the number of heading-aware sections
	// MarkdownWithOptions will emit for a single body. Zero means no
	// cap (backward-compatible default for direct callers); any positive
	// value causes MarkdownWithOptions to return ErrTooManySections when
	// the body would produce more sections than the cap. Stroma's index
	// layer applies a conservative default (see index.DefaultMaxChunkSections)
	// so a pathological body can't DoS the embedder or balloon the snapshot.
	MaxSections int
}

Options controls how Markdown sections are split and, through MaxSections, how many may be emitted for a single body.

type Policy

type Policy interface {
	Chunk(ctx context.Context, record corpus.Record) ([]SectionWithLineage, error)
}

Policy decides how a record's body becomes chunks. The default MarkdownPolicy reproduces the pre-1.0 chunking pipeline exactly (heading-aware Markdown sectioning followed by optional token-budget splitting), so callers that do not opt in see no behavior change.

A Policy may emit hierarchical chunks (parent + leaves with ParentIndex set) so consumers like Hippocampus and Pituitary can retrieve a small leaf and walk back to a broader parent span via Snapshot.ExpandContext (#16). Substrate-neutrality is preserved by keeping the contract narrow: in goes a corpus.Record, out comes a flat slice of (Section, ParentIndex) pairs. Policies have no awareness of indexing, embeddings, or storage — those concerns remain in the index package.

Implementations must be safe for concurrent use: the index session invokes Chunk on one record at a time today, but library callers that drive chunking themselves (tests, offline pipelines) should be able to fan out across goroutines without hidden shared state. The shipped Policy types (MarkdownPolicy, KindRouterPolicy, LateChunkPolicy) are immutable post-construction and derive all mutable state from the per-call record.

type Section

type Section struct {
	Heading string
	Body    string
}

Section is one heading-aware Markdown chunk.

func Markdown

func Markdown(title, body string) []Section

Markdown splits Markdown into heading-aware sections. It applies no cap on section count; direct callers who need DoS protection should use MarkdownWithOptions with a positive MaxSections.

func MarkdownWithOptions

func MarkdownWithOptions(title, body string, opts Options) ([]Section, error)

MarkdownWithOptions splits Markdown into heading-aware sections and then applies token-budget splitting when sections exceed opts.MaxTokens, unless opts.OverlapTokens is invalid for splitting. Zero-value options produce the same output as Markdown with a nil error.

If opts.MaxSections is positive and either the heading-aware parse or the post-split pass would exceed that many sections, MarkdownWithOptions returns ErrTooManySections (wrapped with the observed count) and a nil slice. MaxSections is enforced inside the parser (see markdownBounded), so a pathological 10^6-heading body aborts after allocating one section past the cap rather than after materializing the full list. The token-budget pass then re-checks after each SplitSection call, so one huge section split into many sub-sections cannot amplify past the cap either; the residual risk is bounded by the fan-out of a single SplitSection invocation (~body_words / step).

func SplitSection

func SplitSection(s Section, maxTokens, overlapTokens int) []Section

SplitSection splits a single section into sub-sections that each contain at most maxTokens words, with overlapTokens words of overlap between them. The original text is preserved by slicing at word boundaries rather than reconstructing from tokenized words.

If maxTokens <= 0 or overlapTokens >= maxTokens, the section is returned unchanged.
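The windowing described above can be sketched by recording each word's byte offsets and slicing the original string, so whitespace inside a window survives untouched. This is an approximation under assumptions, not the package's code: wordSpans and splitBody are hypothetical names, the step is assumed to be maxTokens minus overlapTokens, and returning the body unchanged for a negative overlap or for a body that already fits is a choice made for the sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordSpans returns the [start, end) byte offsets of every
// whitespace-delimited word in s.
func wordSpans(s string) [][2]int {
	var spans [][2]int
	start := -1
	for i, r := range s {
		if unicode.IsSpace(r) {
			if start >= 0 {
				spans = append(spans, [2]int{start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		spans = append(spans, [2]int{start, len(s)})
	}
	return spans
}

// splitBody cuts body into windows of at most maxTokens words that advance by
// maxTokens-overlapTokens words. Each window is a slice of the original text.
func splitBody(body string, maxTokens, overlapTokens int) []string {
	spans := wordSpans(body)
	// Documented guard: maxTokens <= 0 or overlapTokens >= maxTokens.
	// Sketch-only guards: negative overlap, or a body that already fits.
	if maxTokens <= 0 || overlapTokens < 0 || overlapTokens >= maxTokens || len(spans) <= maxTokens {
		return []string{body}
	}
	step := maxTokens - overlapTokens
	var out []string
	for first := 0; first < len(spans); first += step {
		last := first + maxTokens
		if last > len(spans) {
			last = len(spans)
		}
		out = append(out, body[spans[first][0]:spans[last-1][1]])
		if last == len(spans) {
			break
		}
	}
	return out
}

func main() {
	for _, part := range splitBody("alpha  beta\ngamma delta epsilon", 3, 1) {
		fmt.Printf("%q\n", part)
	}
}
```

With these assumptions the number of windows grows roughly as the word count divided by the step, which is the fan-out bound mentioned for MarkdownWithOptions.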

type SectionWithLineage

type SectionWithLineage struct {
	Section
	ParentIndex int
}

SectionWithLineage decorates a Section with optional parent linkage inside a Policy's returned slice. ParentIndex is either NoParent (root) or the slice index of another SectionWithLineage. Forward references — a chunk pointing at an index later in the slice — are rejected when the index session validates topology, so the resulting FK chain is always acyclic and parents always precede their leaves at insert time.
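A topology check of the kind described could be as small as the following. It is a sketch of the rule only, since the index session's actual validator is not shown here: validateLineage is a hypothetical name, it takes just the ParentIndex values, and rejecting self-references and out-of-range indexes is an assumption that follows from the acyclicity requirement.

```go
package main

import "fmt"

const noParent = -1 // stand-in for chunk.NoParent

// validateLineage accepts a ParentIndex only if it is noParent or the index
// of an earlier element. That rules out forward references, self-references,
// and out-of-range values, so following parents always terminates.
func validateLineage(parents []int) error {
	for i, p := range parents {
		if p == noParent {
			continue
		}
		if p < 0 || p >= i {
			return fmt.Errorf("chunk %d: parent index %d must be %d or an earlier slice index",
				i, p, noParent)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateLineage([]int{noParent, 0, 0, noParent, 3}))
	fmt.Println(validateLineage([]int{1, noParent}))
}
```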
